August 02, 2006
char-type renaming?
Derek Parnell wrote:
[snip]
>   char  ==> An unsigned 8-bit byte. An alias for ubyte.
>   schar ==> A UTF-8 code unit.
>   wchar ==> A UTF-16 code unit.
>   dchar ==> A UTF-32 code unit.
> 
>   char[] ==> A 'C' string 
>   schar[] ==> A UTF-8 string
>   wchar[] ==> A UTF-16 string
>   dchar[] ==> A UTF-32 string

Sure, although char, utf8, utf16, utf32 are much better choices, IMHO :)

I'd be game to have them changed at this stage. It's not much more than 
some (extensive) global replacements. Don't think there's much need to 
check each instance. There's a nice shareware tool called "Active Search 
& Replace" which I've recently found to be very helpful in this regard.
August 02, 2006
Re: To Walter, about char[] initialization by FF
On Tue, 1 Aug 2006 21:04:10 -0700, Andrew Fedoniouk  
<news@terrainformatica.com> wrote:
> "Derek Parnell" <derek@nomail.afraid.org> wrote in message
> news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg@40tude.net...
>> On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
>>
>>> (Hope this long dialog will help all of us to better understand what
>>> UNICODE
>>> is)
>>>
>>> "Walter Bright" <newshound@digitalmars.com> wrote in message
>>> news:eao5st$2r1f$1@digitaldaemon.com...
>>>> Andrew Fedoniouk wrote:
>>>>> Compiler accepts input stream as either BMP codes or full unicode set
>>>>> encoded using UTF-16.
>>>>
>>>> BMP is a subset of UTF-16.
>>>
>>> Walter with deepest respect but it is not. Two different things.
>>>
>>> UTF-16 is a variable-length encoding - byte stream.
>>> Unicode BMP is a range of numbers strictly speaking.
>>
>> Andrew is correct. In UTF-16, characters are variable length, from 2 to 4
>> bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to
>> be up to 6 but that has changed). UCS-2 is a subset of Unicode characters
>> that are all represented by 2-byte integers. Windows NT had implemented
>> UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.
>>
>> ...
>>
>>>> ubyte[] will enable you to use any encoding you wish - and that's what
>>>> it's there for.
>>>
>>> Thus the whole set of Windows API headers (and std.c.string for example)
>>> seen in D has to be rewritten to accept ubyte[]. As char in D is not char
>>> in C.
>>> Is this the idea?
>>
>> Yes. I believe this is how it now should be done. The Phobos library is not
>> correctly using char, char[], and ubyte[] when interfacing with Windows and
>> C functions.
>>
>> My guess is that Walter originally used 'char' to make things easier for C
>> coders to move over to D, but in doing so, now with UTF support built-in,
>> has caused more problems than the idea was supposed to solve. The move to
>> UTF support is good, but the choice of 'char' for the name of a UTF-8
>> code-unit was, and still is, a big mistake. I would have liked something
>> more like ...
>>
>>  char  ==> An unsigned 8-bit byte. An alias for ubyte.
>>  schar ==> A UTF-8 code unit.
>>  wchar ==> A UTF-16 code unit.
>>  dchar ==> A UTF-32 code unit.
>>
>>  char[] ==> A 'C' string
>>  schar[] ==> A UTF-8 string
>>  wchar[] ==> A UTF-16 string
>>  dchar[] ==> A UTF-32 string
>>
>> And then have built-in conversions between the UTF encodings. So if people
>> want to continue to use code from C/C++ that uses code-pages or similar
>> they can stick with char[].
>>
>>
>
> Yes, Derek, this will be probably near the ideal.

Yet, I don't find it at all difficult to think of them like so:

  ubyte ==> An unsigned 8-bit byte.
  char  ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  ubyte[] ==> A 'C' string
  char[]  ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

If you want to program in D you _will_ have to readjust your thinking in
some areas; this is one of them.
All you have to realise is that 'char' in D is not the same as 'char' in C.
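
For example, here is a minimal sketch of that mapping (the variable names
are made up purely for illustration):

  void main()
  {
      char[]  s8  = "héllo";          // UTF-8 code units: 'é' takes two, so length is 6
      wchar[] s16 = "héllo"w;         // UTF-16 code units: length is 5
      dchar[] s32 = "héllo"d;         // UTF-32 code units: length is 5
      ubyte[] raw = cast(ubyte[])s8;  // the same storage viewed as a raw 'C'-style byte buffer

      assert(s8.length == 6 && s16.length == 5 && s32.length == 5);
  }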

In quick and dirty ASCII only applications I can adjust my thinking  
further:

  char   ==> An ASCII character
  char[] ==> An ASCII string

I do however agree that C functions used in D should be declared like:
  int strlen(ubyte* s);

and not like (as they currently are):
  int strlen(char* s);

The problem with this is that the code:
  char[] s = "test";
  strlen(s)

would produce a compile error, and require a cast or a conversion function  
(toMBSz perhaps, which in many cases will not need to do anything).

Of course the purists would say "That's perfectly correct, strlen cannot
tell you the length of a UTF-8 string, only its byte count", but at the
same time it would be nice (for quick and dirty ASCII only programs) if it
worked.
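
As a minimal sketch of what the call site would look like under that
declaration (keeping the int return type written above):

  extern (C) int strlen(ubyte* s);

  void main()
  {
      char[] s = "test";                  // D string literals carry a trailing '\0' in memory
      // strlen(s.ptr);                   // error: char* does not implicitly convert to ubyte*
      int n = strlen(cast(ubyte*)s.ptr);  // an explicit cast (or a conversion call) is needed
  }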

Is it possible to declare them like this?
  int strlen(void* s);

and for char[] to be implicitly 'paintable' as void* as char[] is already  
implicitly 'paintable' as void[]?

It seems like it would nicely solve the problem of people seeing:
  int strlen(char* s);

and thinking D's char is the same as C's char without introducing a  
painful need for cast or conversion in simple ASCII only situations.

Regan
August 02, 2006
Re: To Walter, about char[] initialization by FF
Andrew Fedoniouk wrote:
> "Walter Bright" <newshound@digitalmars.com> wrote in message 
>> BMP is a subset of UTF-16.
> 
> Walter with deepest respect but it is not. Two different things.
> 
> UTF-16 is a variable-length encoding - byte stream.
> Unicode BMP is a range of numbers strictly speaking.
> 
> If you will treat utf-16 sequence as a sequence of UCS-2 (BMP) codes you
> are in trouble. See:
> 
> Sequence of two words D834 DD1E as UTF-16 will give you
> one unicode character with code 0x1D11E  ( musical G clef ).
> And the same sequence interpreted as UCS-2 sequence will
> give you two (invalid, non-printable but still) character codes.
> You will get different length of the string at least.

The only thing that UTF-16 adds is semantics for characters that are 
invalid for BMP. That makes UTF-16 a superset. It doesn't matter if 
you're strictly speaking, or if the jargon is different. UTF-16 is a 
superset of BMP, once you cut past the jargon and look at the underlying 
reality.
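
A minimal sketch makes the point concrete (assuming the std.utf module of
the time):

  import std.utf;

  void main()
  {
      wchar[] s = "\U0001D11E"w;   // musical G clef, stored as the surrogate pair D834 DD1E
      assert(s.length == 2);       // two UTF-16 code units...
      dchar[] d = toUTF32(s);
      assert(d.length == 1);       // ...but a single code point once decoded
      assert(d[0] == 0x1D11E);
  }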


>>> Ok. And how do you call A functions?
>> Take a look at std.file for an example.
> 
> You mean here?:
> 
> char* namez = toMBSz(name);
>  h = CreateFileA(namez,GENERIC_WRITE,0,null,CREATE_ALWAYS,
>      FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,cast(HANDLE)null);
> char* here is far from UTF-8 sequence.

You could argue that for clarity namez should have been written as a 
ubyte*, but in the above code it would make no difference.

>>>> Windows, Java, and Javascript have all had to go back and redo to deal 
>>>> with surrogate pairs.
>>> Why? JavaScript for example has no such thing as char.
>>> String.charAt() returns guess what? Correct - String object.
>>> No char - no problem :D
>> See String.fromCharCode() and String.charCodeAt()
> 
> ECMA-262
> 
> String.prototype.charCodeAt (pos)
> Returns a number (a nonnegative integer less than 2^16) representing the 
> code point value of the
> character at position pos in the string....
> 
> As you may see it is returning (unicode) *code point* from BMP set
> but it is far from UTF-16 code unit you've declared above.

There is no difference.

> Relaxing "a nonnegative integer less than 2^16" to
> "a nonnegative integer less than 2^21" will not harm anybody.
> Or at least such probability is vanishingly small.

It'll break any code trying to deal with surrogate pairs.


>>> Again - let people decide of what char is and how to interpret it And 
>>> that will be it.
>> I've already explained the problems C/C++ have with that. They're real 
>> problems, bad and unfixable enough that there are official proposals to 
>> add new UTF basic types to C++.
> 
> Basic types of what?

Basic types for utf-8 and utf-16. Ironically, they wind up being very 
much like D's char and wchar types.

>>> Phobos can work with utf-8/16 and satisfy you and other UTF-masochists 
>>> (no offence implied).
>> C++'s experience with this demonstrates that char* does not work very well 
>> with UTF-8. It's not just my experience, it's why new types were proposed 
>> for C++ (and not by me).
> Because char in C is not supposed to hold multi-byte encodings.

Standard functions to deal with multibyte encodings have been in the C 
standard library since 1989. Compiler extensions to deal with Shift-JIS 
and other multibyte encodings have been there since the mid-80s. They 
don't work very well, but nevertheless they are there and supported.

> At least std::string is strictly single byte thing by definition. And this
> is perfectly fine.

As long as you're dealing with ASCII only <g>. That world has been left 
behind, though.

> There is wchar_t for holding OS supported range in full.
> On Win32 - wchar_t is 16bit (UCS-2 legacy) and in GCC/*nix it is 32bit.

That's just the trouble with wchar_t. It's implementation defined, which 
means its use is non-portable. The Win32 version cannot handle surrogate 
pairs as a single character. Linux has the opposite problem - you can't 
have UTF-16 strings in any non-kludgy way. Trying to write 
internationalized code with wchar_t that works correctly on both Win32 
and Linux is an exercise in frustration. What you wind up doing is 
abstracting away the char type - giving up on help from the standard 
libraries and writing your own text processing code from scratch.

I've been through this with real projects. It doesn't work just fine, 
and it is a lot of extra work. Translating the code to D is nice; you 
essentially give that whole mess the heave-ho.

BTW, you talked earlier a lot about memory efficiency. Linux's 32 bit 
wchar_t eats memory like nothing else.

> Thus the whole set of Windows API headers (and std.c.string for example)
> seen in D has to be rewritten to accept ubyte[]. As char in D is not char in 
> C

You're right that a C char isn't a D char. All that means is that one must 
be careful, when calling C functions that take char*'s, to pass them data 
in the form the particular C function expects. This is true for all of C's 
data types - even int.

> Is this the idea?

The vast majority (perhaps even all) of C standard string handling 
functions that accept char* will work with UTF-8 without modification. 
No rewrite required.
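
A minimal sketch of why, assuming the std.c.string and std.string modules
of the era: strlen just counts bytes up to the NUL, and UTF-8 never puts a
0 byte inside a multibyte character.

  import std.c.string;   // strlen
  import std.string;     // toStringz

  void main()
  {
      char[] s = "naïve";                // 5 characters, 6 UTF-8 code units
      size_t n = strlen(toStringz(s));   // counts bytes up to the terminating NUL
      assert(n == 6);                    // a byte count, not a character count - and no rewrite needed
  }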

You've implied all this doesn't work, by saying things must be 
rewritten, that it's extremely difficult to deal with UTF-8, that BMP is 
 not a subset of UTF-16, etc. This is not my experience at all. If 
you've got some persuasive code examples, I'd like to see them.
August 02, 2006
Learn about Unicode (was To Walter, about char[] initialization by FF)
Andrew Fedoniouk wrote:
> (Hope this long dialog will help all of us to better understand what UNICODE 
> is)

Actually, it doesn't help at all, Andrew ~ some of it is thoroughly 
misguided, and some is "cleverly" slanted purely for the benefit of the 
author. In truth, this thread would be the last place one would look to 
learn from an entirely unbiased opinion; one with only the reader's 
education in mind.

There are infinitely more useful places to go for that sort of thing. 
For those who have an interest, this tiny selection may help:

http://icu.sourceforge.net/docs/papers/forms_of_unicode/
http://www.hackcraft.net/xmlUnicode/
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.unicode.org/unicode/faq/utf_bom.html
http://en.wikipedia.org/wiki/UTF-8
http://www.joelonsoftware.com/articles/Unicode.html
August 02, 2006
Re: To Walter, about char[] initialization by FF
"Regan Heath" <regan@netwin.co.nz> wrote in message 
news:optdm2gghi23k2f5@nrage...
> [snip]
>
> It seems like it would nicely solve the problem of people seeing:
>   int strlen(char* s);
>
> and thinking D's char is the same as C's char without introducing a 
> painful need for cast or conversion in simple ASCII only situations.
>
> Regan

Another option will be to change char.init to 0 and forget about the problem,
or leave it as it is now. Some good string implementation will
contain an encoding field in the string instance if needed.

Andrew.
August 02, 2006
Re: To Walter, about char[] initialization by FF
On Wed, 02 Aug 2006 16:22:54 +1200, Regan Heath wrote:

>>>  char  ==> An unsigned 8-bit byte. An alias for ubyte.
>>>  schar ==> A UTF-8 code unit.
>>>  wchar ==> A UTF-16 code unit.
>>>  dchar ==> A UTF-32 code unit.
>>>
>>>  char[] ==> A 'C' string
>>>  schar[] ==> A UTF-8 string
>>>  wchar[] ==> A UTF-16 string
>>>  dchar[] ==> A UTF-32 string
>>>
>>> And then have built-in conversions between the UTF encodings. So if  
>>> people
>>> want to continue to use code from C/C++ that uses code-pages or similar
>>> they can stick with char[].
>>>
>>>
>>
>> Yes, Derek, this will be probably near the ideal.
> 
> Yet, I don't find it at all difficult to think of them like so:
> 
>    ubyte ==> An unsigned 8-bit byte.
>    char  ==> A UTF-8 code unit.
>    wchar ==> A UTF-16 code unit.
>    dchar ==> A UTF-32 code unit.
> 
>    ubyte[] ==> A 'C' string
>    char[]  ==> A UTF-8 string
>    wchar[] ==> A UTF-16 string
>    dchar[] ==> A UTF-32 string

Me too, but that's probably because I've not been immersed in C/C++ for the
last 20 odd years ;-) 

I "think in D" now and char[] is a UTF-8 string in my mind. 

> If you want to program in D you _will_ have to readjust your thinking in  
> some areas, this is one of them.
> All you have to realise is that 'char' in D is not the same as 'char' in C.

True, but Walter seems hell-bent on easing the transition to D for C/C++
refugees.

> In quick and dirty ASCII only applications I can adjust my thinking  
> further:
> 
>    char   ==> An ASCII character
>    char[] ==> An ASCII string
> 
> I do however agree that C functions used in D should be declared like:
>    int strlen(ubyte* s);
> 
> and not like (as they currently are):
>    int strlen(char* s);
> 
> The problem with this is that the code:
>    char[] s = "test";
>    strlen(s)
> 
> would produce a compile error, and require a cast or a conversion function  
> (toMBSz perhaps, which in many cases will not need to do anything).
> 
> Of course the purists would say "That's perfectly correct, strlen cannot  
>> tell you the length of a UTF-8 string, only its byte count", but at the  
> same time it would be nice (for quick and dirty ASCII only programs) if it  
> worked.

And I'm a wannabe purist <G>

> Is it possible to declare them like this?
>    int strlen(void* s);
> 
> and for char[] to be implicitly 'paintable' as void* as char[] is already  
> implicitly 'paintable' as void[]?
> 
> It seems like it would nicely solve the problem of people seeing:
>    int strlen(char* s);
> 
> and thinking D's char is the same as C's char without introducing a  
> painful need for cast or conversion in simple ASCII only situations.

It's the zero-terminator for C strings that will get in the way. We need a
nice way of getting the compiler to ensure C strings are always terminated
correctly.
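
Short of compiler support, a minimal sketch of the kind of helper that would
do it (essentially what std.string.toStringz already does, if memory serves):

  // Ensure a D char[] is NUL-terminated before handing it to a C function.
  char* toCz(char[] s)
  {
      if (s.length > 0 && s[s.length - 1] == '\0')
          return s.ptr;       // already terminated
      return (s ~ '\0').ptr;  // otherwise append a terminator (this copies the data)
  }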

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 2:48:43 PM
August 02, 2006
Re: To Walter, about char[] initialization by FF
>> As you may see it is returning (unicode) *code point* from BMP set
>> but it is far from UTF-16 code unit you've declared above.
>
> There is no difference.
>
>> Relaxing "a nonnegative integer less than 2^16" to
>> "a nonnegative integer less than 2^21" will not harm anybody.
>> Or at least such probability is vanishingly small.
>
> It'll break any code trying to deal with surrogate pairs.
>

There is no such thing as a surrogate pair in UCS-2.
A JS string is not holding UTF-16 code units - only full code points.
See the spec.


>>>>> Phobos can work with utf-8/16 and satisfy you and other UTF-masochists
>>>>> (no offence implied).
>>> C++'s experience with this demonstrates that char* does not work very 
>>> well with UTF-8. It's not just my experience, it's why new types were 
>>> proposed for C++ (and not by me).
>> Because char in C is not supposed to hold multi-byte encodings.
>
> Standard functions in the C standard library to deal with multibyte 
> encodings have been there since 1989. Compiler extensions to deal with 
> shift-JIS and other multibyte encodings have been there since the mid 
> 80's. They don't work very well, but nevertheless, are there and 
> supported.
>
>> At least std::string is strictly single byte thing by definition. And 
>> this
>> is perfectly fine.
>
> As long as you're dealing with ASCII only <g>. That world has been left 
> behind, though.


C string functions can be used with multibyte encodings for the
sole reason that all byte encodings have the character with code 0 defined
as the NUL character. No encoding in practical use has a byte
with code 0 appearing in the middle of a sequence.
They are all built with C string processing in mind.


>
>> There is wchar_t for holding OS supported range in full.
>> On Win32 - wchar_t is 16bit (UCS-2 legacy) and in GCC/*nix it is 32bit.
>
> That's just the trouble with wchar_t. It's implementation defined, which 
> means its use is non-portable. The Win32 version cannot handle surrogate 
> pairs as a single character. Linux has the opposite problem - you can't 
> have UTF-16 strings in any non-kludgy way. Trying to write 
> internationalized code with wchar_t that works correctly on both Win32 and 
> Linux is an exercise in frustration. What you wind up doing is abstracting 
> away the char type - giving up on help from the standard libraries and 
> writing your own text processing code from scratch.
>
> I've been through this with real projects. It doesn't work just fine, and 
> is a lot of extra work. Translating the code to D is nice, you essentially 
> give that whole mess a heave-ho.
>
> BTW, you talked earlier a lot about memory efficiency. Linux's 32 bit 
> wchar_t eats memory like nothing else.

Agree. As I said - if you need efficiency, use byte/word encodings + 
mapping.

dchar is no better than wchar_t on Linux.
Please don't say that I shall use utf-8 for that - it simply does not work 
in my cases - too expensive.

>
>> Thus the whole set of Windows API headers (and std.c.string for example)
>> seen in D has to be rewritten to accept ubyte[]. As char in D is not char 
>> in C
>
> You're right that a C char isn't a D char. All that means is one must be 
> careful when calling C functions that take char*'s to pass it data in the 
> form that particular C function expects. This is true for all C's data 
> types - even int.
>
>> Is this the idea?
>
> The vast majority (perhaps even all) of C standard string handling 
> functions that accept char* will work with UTF-8 without modification. No 
> rewrite required.

Correct. As I said, because 0 is NUL in UTF-8 too. Not
0xFF or anything else exotic.

>
> You've implied all this doesn't work, by saying things must be rewritten, 
> that it's extremely difficult to deal with UTF-8, that BMP is not a subset 
> of UTF-16, etc. This is not my experience at all. If you've got some 
> persuasive code examples, I'd like to see them.

I am not saying that it "must be rewritten". Sorry, but it is you who proposes
to rewrite all the string processing functions of the standard library 
mankind has today.

Or I don't quite understand your idea with UTFs.

Java did change the string world by introducing just char (a single UCS-2 
code point) and no variations. Is that good or bad? From a uniformity point 
of view - good. For efficiency - bad. I've seen a lot of reinvented 
char-as-byte wheels in professional packages.

Andrew.
August 02, 2006
Re: To Walter, about char[] initialization by FF
I'm trying to understand why this 0 thing is such an issue.  If your 
second statement is valid, it makes the first moot - 0 or no 0.  Why 
does it matter, then?

-[Unknown]


> Another option will be to change char.init to 0 and forget about the problem,
> or leave it as it is now. Some good string implementation will
> contain an encoding field in the string instance if needed.
> 
> Andrew.
> 
> 
>
August 02, 2006
Re: To Walter, about char[] initialization by FF
Andrew, I think there's a misunderstanding here.  Perhaps it's a 
language thing.

Let me define two things for you, in English, by my understanding of 
them.  I was born in Utah and raised in Los Angeles as a native speaker, 
so hopefully these definitions aren't far from the standard understanding.

Default: a setting, value, or situation which persists unless action is 
taken otherwise; such a thing that happens unless overridden or canceled.

Null: something which has no current setting, value, or situation (but 
could have one); the absence of a setting, value, or situation.

Therefore, I should conclude that "default" and "null" are very 
different concepts.

The fact that C strings are null terminated, and that encodings provide 
for a "null" character (or code point or muffin or whatever they care to 
call them) does not logically necessitate that this provides for a 
default, or logically default, value.

It is true that, per the above definitions, it would not be wrong for the 
default to be null.  That would fit the definitions above perfectly. 
However, so would a value of ' ' (which might be the default in some 
language out there.)

It would seem logical that 0 could be used as the default, but then as 
Walter pointed out... this can (and tends to) hide bugs which will bite 
you eventually.

Let us suppose you were to have a string displayed in a place.  It is 
possible, were it blank, that you might not notice it.  Next let us 
suppose this space were filled with "?", "`", "ﮘ", or "ß" characters.

Do you think you would be more, or less likely to notice it?

Next, let us suppose that this character could be (in cases) detectable 
as invalid.  Again note that 0 is not invalid, and may appear in 
strings.  This sounds even better.
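
That is exactly what D's 0xFF initializer gives you. A minimal sketch,
assuming the std.utf module of the time:

  import std.utf;

  void main()
  {
      char[] s = new char[4];   // elements start out as char.init, i.e. 0xFF
      assert(s[0] == 0xFF);     // 0xFF can never occur in well-formed UTF-8
      try { validate(s); }      // so validation rejects the uninitialized data
      catch (Object e) {}       // the bug is flagged instead of silently hidden
  }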

So a default value of 0 does not, from an implementation or practical 
point of view, seem to make much sense to me.  In fact, I think a 
default value of "42" for int makes sense (surely it reminds you of what 
six by nine is.)

But maybe that's because I never leave things at their defaults.  It's 
like writing a story where you expect the reader to think everyone has 
brown eyes unless you say otherwise.

-[Unknown]


> Correct. As I said, because 0 is NUL in UTF-8 too. Not
> 0xFF or anything else exotic.
August 02, 2006
Re: To Walter, about char[] initialization by FF
On Wed, 2 Aug 2006 14:55:11 +1000, Derek Parnell <derek@nomail.afraid.org>  
wrote:
> It's the zero-terminator for C strings that will get in the way. We need a
> nice way of getting the compiler to ensure C strings are always terminated
> correctly.

Good point. I neglected to mention that.

Regan