August 02, 2006
char-type renaming?
Derek Parnell wrote:
[snip]
>   char  ==> An unsigned 8-bit byte. An alias for ubyte.
>   schar ==> A UTF-8 code unit.
>   wchar ==> A UTF-16 code unit.
>   dchar ==> A UTF-32 code unit.
> 
>   char[] ==> A 'C' string 
>   schar[] ==> A UTF-8 string
>   wchar[] ==> A UTF-16 string
>   dchar[] ==> A UTF-32 string

Sure, although char, utf8, utf16, utf32 are much better choices, IMHO :)

I'd be game to have them changed at this stage. It's not much more than 
some (extensive) global replacements. Don't think there's much need to 
check each instance. There's a nice shareware tool called "Active Search 
& Replace" which I've recently found to be very helpful in this regard.
August 02, 2006
Re: To Walter, about char[] initialization by FF
On Tue, 1 Aug 2006 21:04:10 -0700, Andrew Fedoniouk  
<news@terrainformatica.com> wrote:
> "Derek Parnell" <derek@nomail.afraid.org> wrote in message
> news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg@40tude.net...
>> On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
>>
>>> (Hope this long dialog will help all of us to better understand what
>>> UNICODE
>>> is)
>>>
>>> "Walter Bright" <newshound@digitalmars.com> wrote in message
>>> news:eao5st$2r1f$1@digitaldaemon.com...
>>>> Andrew Fedoniouk wrote:
>>>>> Compiler accepts input stream as either BMP codes or full unicode set
>>>>> encoded using UTF-16.
>>>>
>>>> BMP is a subset of UTF-16.
>>>
>>> Walter with deepest respect but it is not. Two different things.
>>>
>>> UTF-16 is a variable-length encoding - byte stream.
>>> Unicode BMP is a range of numbers strictly speaking.
>>
>> Andrew is correct. In UTF-16, characters are variable length, from 2 to 4
>> bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to
>> be up to 6 but that has changed). UCS-2 is a subset of Unicode characters
>> that are all represented by 2-byte integers. Windows NT had implemented
>> UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.
>>
>> ...
>>
>>>> ubyte[] will enable you to use any encoding you wish - and that's what
>>>> it's there for.
>>>
>>> Thus the whole set of Windows API headers (and std.c.string for example)
>>> seen in D has to be rewritten to accept ubyte[]. As char in D is not char
>>> in C.
>>> Is this the idea?
>>
>> Yes. I believe this is how it now should be done. The Phobos library is not
>> correctly using char, char[], and ubyte[] when interfacing with Windows and
>> C functions.
>>
>> My guess is that Walter originally used 'char' to make things easier for C
>> coders to move over to D, but in doing so, now with UTF support built-in,
>> has caused more problems than the idea was supposed to solve. The move to
>> UTF support is good, but the choice of 'char' for the name of a UTF-8
>> code-unit was, and still is, a big mistake. I would have liked something
>> more like ...
>>
>>  char  ==> An unsigned 8-bit byte. An alias for ubyte.
>>  schar ==> A UTF-8 code unit.
>>  wchar ==> A UTF-16 code unit.
>>  dchar ==> A UTF-32 code unit.
>>
>>  char[] ==> A 'C' string
>>  schar[] ==> A UTF-8 string
>>  wchar[] ==> A UTF-16 string
>>  dchar[] ==> A UTF-32 string
>>
>> And then have built-in conversions between the UTF encodings. So if people
>> want to continue to use code from C/C++ that uses code-pages or similar
>> they can stick with char[].
>>
>>
>
> Yes, Derek, this will be probably near the ideal.

Yet, I don't find it at all difficult to think of them like so:

  ubyte ==> An unsigned 8-bit byte.
  char  ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  ubyte[] ==> A 'C' string
  char[]  ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

If you want to program in D you _will_ have to readjust your thinking in
some areas; this is one of them.
All you have to realise is that 'char' in D is not the same as 'char' in C.
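
For example, here is a minimal sketch of that mapping (the variable names
are made up purely for illustration):

  void main()
  {
      char[]  s8  = "héllo";          // UTF-8 code units: 'é' takes two, so length is 6
      wchar[] s16 = "héllo"w;         // UTF-16 code units: length is 5
      dchar[] s32 = "héllo"d;         // UTF-32 code units: length is 5
      ubyte[] raw = cast(ubyte[])s8;  // the same storage viewed as a raw 'C'-style byte buffer

      assert(s8.length == 6 && s16.length == 5 && s32.length == 5);
  }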

In quick and dirty ASCII only applications I can adjust my thinking  
further:

  char   ==> An ASCII character
  char[] ==> An ASCII string

I do however agree that C functions used in D should be declared like:
  int strlen(ubyte* s);

and not like (as they currently are):
  int strlen(char* s);

The problem with this is that the code:
  char[] s = "test";
  strlen(s)

would produce a compile error, and require a cast or a conversion function  
(toMBSz perhaps, which in many cases will not need to do anything).

Of course the purists would say "That's perfectly correct, strlen cannot
tell you the length of a UTF-8 string, only its byte count", but at the
same time it would be nice (for quick and dirty ASCII only programs) if it
worked.
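
As a minimal sketch of what the call site would look like under that
declaration (keeping the int return type written above):

  extern (C) int strlen(ubyte* s);

  void main()
  {
      char[] s = "test";                  // D string literals carry a trailing '\0' in memory
      // strlen(s.ptr);                   // error: char* does not implicitly convert to ubyte*
      int n = strlen(cast(ubyte*)s.ptr);  // an explicit cast (or a conversion call) is needed
  }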

Is it possible to declare them like this?
  int strlen(void* s);

and for char[] to be implicitly 'paintable' as void* as char[] is already  
implicitly 'paintable' as void[]?

It seems like it would nicely solve the problem of people seeing:
  int strlen(char* s);

and thinking D's char is the same as C's char without introducing a  
painful need for cast or conversion in simple ASCII only situations.

Regan
August 02, 2006
Re: To Walter, about char[] initialization by FF
Andrew Fedoniouk wrote:
> "Walter Bright" <newshound@digitalmars.com> wrote in message 
>> BMP is a subset of UTF-16.
> 
> Walter with deepest respect but it is not. Two different things.
> 
> UTF-16 is a variable-length encoding - byte stream.
> Unicode BMP is a range of numbers strictly speaking.
> 
> If you will treat utf-16 sequence as a sequence of UCS-2 (BMP) codes you
> are in trouble. See:
> 
> Sequence of two words D834 DD1E as UTF-16 will give you
> one unicode character with code 0x1D11E  ( musical G clef ).
> And the same sequence interpreted as UCS-2 sequence will
> give you two (invalid, non-printable but still) character codes.
> You will get different length of the string at least.

The only thing that UTF-16 adds is semantics for characters that are 
invalid for BMP. That makes UTF-16 a superset. It doesn't matter if 
you're strictly speaking, or if the jargon is different. UTF-16 is a 
superset of BMP, once you cut past the jargon and look at the underlying 
reality.
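
A minimal sketch makes the point concrete (assuming the std.utf module of
the time):

  import std.utf;

  void main()
  {
      wchar[] s = "\U0001D11E"w;   // musical G clef, stored as the surrogate pair D834 DD1E
      assert(s.length == 2);       // two UTF-16 code units...
      dchar[] d = toUTF32(s);
      assert(d.length == 1);       // ...but a single code point once decoded
      assert(d[0] == 0x1D11E);
  }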


>>> Ok. And how do you call A functions?
>> Take a look at std.file for an example.
> 
> You mean here?:
> 
> char* namez = toMBSz(name);
>  h = CreateFileA(namez,GENERIC_WRITE,0,null,CREATE_ALWAYS,
>      FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,cast(HANDLE)null);
> char* here is far from UTF-8 sequence.

You could argue that for clarity namez should have been written as a 
ubyte*, but in the above code it would make no difference.

>>>> Windows, Java, and Javascript have all had to go back and redo to deal 
>>>> with surrogate pairs.
>>> Why? JavaScript for example has no such thing as char.
>>> String.charAt() returns guess what? Correct - String object.
>>> No char - no problem :D
>> See String.fromCharCode() and String.charCodeAt()
> 
> ECMA-262
> 
> String.prototype.charCodeAt (pos)
> Returns a number (a nonnegative integer less than 2^16) representing the 
> code point value of the
> character at position pos in the string....
> 
> As you may see it is returning (unicode) *code point* from BMP set
> but it is far from UTF-16 code unit you've declared above.

There is no difference.

> Relaxing "a nonnegative integer less than 2^16" to
> "a nonnegative integer less than 2^21" will not harm anybody.
> Or at least such probability is vanishingly small.

It'll break any code trying to deal with surrogate pairs.


>>> Again - let people decide of what char is and how to interpret it And 
>>> that will be it.
>> I've already explained the problems C/C++ have with that. They're real 
>> problems, bad and unfixable enough that there are official proposals to 
>> add new UTF basic types to C++.
> 
> Basic types of what?

Basic types for utf-8 and utf-16. Ironically, they wind up being very 
much like D's char and wchar types.

>>> Phobos can work with utf-8/16 and satisfy you and other UTF-masochists 
>>> (no offence implied).
>> C++'s experience with this demonstrates that char* does not work very well 
>> with UTF-8. It's not just my experience, it's why new types were proposed 
>> for C++ (and not by me).
> Because char in C is not supposed to hold multi-byte encodings.

Standard functions to deal with multibyte encodings have been in the C 
standard library since 1989. Compiler extensions to deal with Shift-JIS 
and other multibyte encodings have been there since the mid-80s. They 
don't work very well, but nevertheless they are there and supported.

> At least std::string is strictly single byte thing by definition. And this
> is perfectly fine.

As long as you're dealing with ASCII only <g>. That world has been left 
behind, though.

> There is wchar_t for holding OS supported range in full.
> On Win32 - wchar_t is 16bit (UCS-2 legacy) and in GCC/*nix it is 32bit.

That's just the trouble with wchar_t. It's implementation defined, which 
means its use is non-portable. The Win32 version cannot handle surrogate 
pairs as a single character. Linux has the opposite problem - you can't 
have UTF-16 strings in any non-kludgy way. Trying to write 
internationalized code with wchar_t that works correctly on both Win32 
and Linux is an exercise in frustration. What you wind up doing is 
abstracting away the char type - giving up on help from the standard 
libraries and writing your own text processing code from scratch.

I've been through this with real projects. It doesn't work just fine, 
and it is a lot of extra work. Translating the code to D is nice; you 
essentially give that whole mess the heave-ho.

BTW, you talked earlier a lot about memory efficiency. Linux's 32 bit 
wchar_t eats memory like nothing else.

> Thus the whole set of Windows API headers (and std.c.string for example)
> seen in D has to be rewritten to accept ubyte[]. As char in D is not char in 
> C

You're right that a C char isn't a D char. All that means is that one must 
be careful, when calling C functions that take char*'s, to pass them data 
in the form the particular C function expects. This is true for all of C's 
data types - even int.

> Is this the idea?

The vast majority (perhaps even all) of C standard string handling 
functions that accept char* will work with UTF-8 without modification. 
No rewrite required.
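
A minimal sketch of why, assuming the std.c.string and std.string modules
of the era: strlen just counts bytes up to the NUL, and UTF-8 never puts a
0 byte inside a multibyte character.

  import std.c.string;   // strlen
  import std.string;     // toStringz

  void main()
  {
      char[] s = "naïve";                // 5 characters, 6 UTF-8 code units
      size_t n = strlen(toStringz(s));   // counts bytes up to the terminating NUL
      assert(n == 6);                    // a byte count, not a character count - and no rewrite needed
  }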

You've implied all this doesn't work, by saying things must be 
rewritten, that it's extremely difficult to deal with UTF-8, that BMP is 
 not a subset of UTF-16, etc. This is not my experience at all. If 
you've got some persuasive code examples, I'd like to see them.
August 02, 2006
Learn about Unicode (was To Walter, about char[] initialization by FF)
Andrew Fedoniouk wrote:
> (Hope this long dialog will help all of us to better understand what UNICODE 
> is)

Actually, it doesn't help at all, Andrew ~ some of it is thoroughly 
misguided, and some is "cleverly" slanted purely for the benefit of the 
author. In truth, this thread would be the last place one would look to 
learn from an entirely unbiased opinion; one with only the reader's 
education in mind.

There are infinitely more useful places to go for that sort of thing. 
For those who have an interest, this tiny selection may help:

http://icu.sourceforge.net/docs/papers/forms_of_unicode/
http://www.hackcraft.net/xmlUnicode/
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.unicode.org/unicode/faq/utf_bom.html
http://en.wikipedia.org/wiki/UTF-8
http://www.joelonsoftware.com/articles/Unicode.html
August 02, 2006
Re: To Walter, about char[] initialization by FF
"Regan Heath" <regan@netwin.co.nz> wrote in message 
news:optdm2gghi23k2f5@nrage...
> [snip]
>
> It seems like it would nicely solve the problem of people seeing:
>   int strlen(char* s);
>
> and thinking D's char is the same as C's char without introducing a 
> painful need for cast or conversion in simple ASCII only situations.
>
> Regan

Another option will be to change char.init to 0 and forget about the problem,
or leave it as it is now. Some good string implementation will
contain an encoding field in the string instance if needed.

Andrew.
August 02, 2006
Re: To Walter, about char[] initialization by FF
On Wed, 02 Aug 2006 16:22:54 +1200, Regan Heath wrote:

>>>  char  ==> An unsigned 8-bit byte. An alias for ubyte.
>>>  schar ==> A UTF-8 code unit.
>>>  wchar ==> A UTF-16 code unit.
>>>  dchar ==> A UTF-32 code unit.
>>>
>>>  char[] ==> A 'C' string
>>>  schar[] ==> A UTF-8 string
>>>  wchar[] ==> A UTF-16 string
>>>  dchar[] ==> A UTF-32 string
>>>
>>> And then have built-in conversions between the UTF encodings. So if  
>>> people
>>> want to continue to use code from C/C++ that uses code-pages or similar
>>> they can stick with char[].
>>>
>>>
>>
>> Yes, Derek, this will be probably near the ideal.
> 
> Yet, I don't find it at all difficult to think of them like so:
> 
>    ubyte ==> An unsigned 8-bit byte.
>    char  ==> A UTF-8 code unit.
>    wchar ==> A UTF-16 code unit.
>    dchar ==> A UTF-32 code unit.
> 
>    ubyte[] ==> A 'C' string
>    char[]  ==> A UTF-8 string
>    wchar[] ==> A UTF-16 string
>    dchar[] ==> A UTF-32 string

Me too, but that's probably because I've not been immersed in C/C++ for the
last 20 odd years ;-) 

I "think in D" now and char[] is a UTF-8 string in my mind. 

> If you want to program in D you _will_ have to readjust your thinking in  
> some areas, this is one of them.
> All you have to realise is that 'char' in D is not the same as 'char' in C.

True, but Walter seems hell-bent on easing the transition to D for C/C++
refugees.

> In quick and dirty ASCII only applications I can adjust my thinking  
> further:
> 
>    char   ==> An ASCII character
>    char[] ==> An ASCII string
> 
> I do however agree that C functions used in D should be declared like:
>    int strlen(ubyte* s);
> 
> and not like (as they currently are):
>    int strlen(char* s);
> 
> The problem with this is that the code:
>    char[] s = "test";
>    strlen(s)
> 
> would produce a compile error, and require a cast or a conversion function  
> (toMBSz perhaps, which in many cases will not need to do anything).
> 
> Of course the purists would say "That's perfectly correct, strlen cannot  
>> tell you the length of a UTF-8 string, only its byte count", but at the  
> same time it would be nice (for quick and dirty ASCII only programs) if it  
> worked.

And I'm a wannabe purist <G>

> Is it possible to declare them like this?
>    int strlen(void* s);
> 
> and for char[] to be implicitly 'paintable' as void* as char[] is already  
> implicitly 'paintable' as void[]?
> 
> It seems like it would nicely solve the problem of people seeing:
>    int strlen(char* s);
> 
> and thinking D's char is the same as C's char without introducing a  
> painful need for cast or conversion in simple ASCII only situations.

It's the zero-terminator for C strings that will get in the way. We need a
nice way of getting the compiler to ensure C strings are always terminated
correctly.
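
Short of compiler support, a minimal sketch of the kind of helper that would
do it (essentially what std.string.toStringz already does, if memory serves):

  // Ensure a D char[] is NUL-terminated before handing it to a C function.
  char* toCz(char[] s)
  {
      if (s.length > 0 && s[s.length - 1] == '\0')
          return s.ptr;       // already terminated
      return (s ~ '\0').ptr;  // otherwise append a terminator (this copies the data)
  }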

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 2:48:43 PM
August 02, 2006
Re: To Walter, about char[] initialization by FF
>> As you may see it is returning (unicode) *code point* from BMP set
>> but it is far from UTF-16 code unit you've declared above.
>
> There is no difference.
>
>> Relaxing "a nonnegative integer less than 2^16" to
>> "a nonnegative integer less than 2^21" will not harm anybody.
>> Or at least such probability is vanishingly small.
>
> It'll break any code trying to deal with surrogate pairs.
>

There is no such thing as a surrogate pair in UCS-2.
A JS string is not holding UTF-16 code units - only full code points.
See the spec.


>>>>> Phobos can work with utf-8/16 and satisfy you and other UTF-masochists
>>>>> (no offence implied).
>>> C++'s experience with this demonstrates that char* does not work very 
>>> well with UTF-8. It's not just my experience, it's why new types were 
>>> proposed for C++ (and not by me).
>> Because char in C is not supposed to hold multi-byte encodings.
>
> Standard functions in the C standard library to deal with multibyte 
> encodings have been there since 1989. Compiler extensions to deal with 
> shift-JIS and other multibyte encodings have been there since the mid 
> 80's. They don't work very well, but nevertheless, are there and 
> supported.
>
>> At least std::string is strictly single byte thing by definition. And 
>> this
>> is perfectly fine.
>
> As long as you're dealing with ASCII only <g>. That world has been left 
> behind, though.


C string functions can be used with multibyte encodings for the
sole reason that all byte encodings have the character with code 0 defined
as the NUL character. No encoding in practical use has a byte
with code 0 appearing in the middle of a sequence.
They are all built with C string processing in mind.


>
>> There is wchar_t for holding OS supported range in full.
>> On Win32 - wchar_t is 16bit (UCS-2 legacy) and in GCC/*nix it is 32bit.
>
> That's just the trouble with wchar_t. It's implementation defined, which 
> means its use is non-portable. The Win32 version cannot handle surrogate 
> pairs as a single character. Linux has the opposite problem - you can't 
> have UTF-16 strings in any non-kludgy way. Trying to write 
> internationalized code with wchar_t that works correctly on both Win32 and 
> Linux is an exercise in frustration. What you wind up doing is abstracting 
> away the char type - giving up on help from the standard libraries and 
> writing your own text processing code from scratch.
>
> I've been through this with real projects. It doesn't work just fine, and 
> is a lot of extra work. Translating the code to D is nice, you essentially 
> give that whole mess a heave-ho.
>
> BTW, you talked earlier a lot about memory efficiency. Linux's 32 bit 
> wchar_t eats memory like nothing else.

Agree. As I said - if you need efficiency, use byte/word encodings + 
mapping.

dchar is no better than wchar_t on Linux.
Please don't say that I shall use utf-8 for that - it simply does not work 
in my cases - too expensive.

>
>> Thus the whole set of Windows API headers (and std.c.string for example)
>> seen in D has to be rewritten to accept ubyte[]. As char in D is not char 
>> in C
>
> You're right that a C char isn't a D char. All that means is one must be 
> careful when calling C functions that take char*'s to pass it data in the 
> form that particular C function expects. This is true for all C's data 
> types - even int.
>
>> Is this the idea?
>
> The vast majority (perhaps even all) of C standard string handling 
> functions that accept char* will work with UTF-8 without modification. No 
> rewrite required.

Correct. As I said, because 0 is NUL in UTF-8 too. Not
0xFF or anything else exotic.

>
> You've implied all this doesn't work, by saying things must be rewritten, 
> that it's extremely difficult to deal with UTF-8, that BMP is not a subset 
> of UTF-16, etc. This is not my experience at all. If you've got some 
> persuasive code examples, I'd like to see them.

I am not saying that it "must be rewritten". Sorry, but it is you who proposes
to rewrite all the string processing functions of the standard library 
mankind has today.

Or I don't quite understand your idea with UTFs.

Java did change the string world by introducing just char (a single UCS-2 
code point) and no variations. Is that good or bad? From a uniformity point 
of view - good. For efficiency - bad. I've seen a lot of reinvented 
char-as-byte wheels in professional packages.

Andrew.
August 02, 2006
Re: To Walter, about char[] initialization by FF
I'm trying to understand why this 0 thing is such an issue.  If your 
second statement is valid, it makes the first moot - 0 or no 0.  Why 
does it matter, then?

-[Unknown]


> Another option will be to change char.init to 0 and forget about the problem,
> or leave it as it is now. Some good string implementation will
> contain an encoding field in the string instance if needed.
> 
> Andrew.
> 
> 
>
August 02, 2006
Re: To Walter, about char[] initialization by FF
Andrew, I think there's a misunderstanding here.  Perhaps it's a 
language thing.

Let me define two things for you, in English, by my understanding of 
them.  I was born in Utah and raised in Los Angeles as a native speaker, 
so hopefully these definitions aren't far from the standard understanding.

Default: a setting, value, or situation which persists unless action is 
taken otherwise; such a thing that happens unless overridden or canceled.

Null: something which has no current setting, value, or situation (but 
could have one); the absence of a setting, value, or situation.

Therefore, I should conclude that "default" and "null" are very 
different concepts.

The fact that C strings are null terminated, and that encodings provide 
for a "null" character (or code point or muffin or whatever they care to 
call them) does not logically necessitate that this provides for a 
default, or logically default, value.

It is true that, per the above definitions, it would not be wrong for the 
default to be null.  That would fit the definitions above perfectly. 
However, so would a value of ' ' (which might be the default in some 
language out there.)

It would seem logical that 0 could be used as the default, but then as 
Walter pointed out... this can (and tends to) hide bugs which will bite 
you eventually.

Let us suppose you were to have a string displayed in a place.  It is 
possible, were it blank, that you might not notice it.  Next let us 
suppose this space were filled with "?", "`", "ﮘ", or "ß" characters.

Do you think you would be more, or less likely to notice it?

Next, let us suppose that this character could be (in cases) detectable 
as invalid.  Again note that 0 is not invalid, and may appear in 
strings.  This sounds even better.
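
That is exactly what D's 0xFF initializer gives you. A minimal sketch,
assuming the std.utf module of the time:

  import std.utf;

  void main()
  {
      char[] s = new char[4];   // elements start out as char.init, i.e. 0xFF
      assert(s[0] == 0xFF);     // 0xFF can never occur in well-formed UTF-8
      try { validate(s); }      // so validation rejects the uninitialized data
      catch (Object e) {}       // the bug is flagged instead of silently hidden
  }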

So a default value of 0 does not, from an implementation or practical 
point of view, seem to make much sense to me.  In fact, I think a 
default value of "42" for int makes sense (surely it reminds you of what 
six by nine is.)

But maybe that's because I never leave things at their defaults.  It's 
like writing a story where you expect the reader to think everyone has 
brown eyes unless you say otherwise.

-[Unknown]


> Correct. As I said, because 0 is NUL in UTF-8 too. Not
> 0xFF or anything else exotic.
August 02, 2006
Re: To Walter, about char[] initialization by FF
On Wed, 2 Aug 2006 14:55:11 +1000, Derek Parnell <derek@nomail.afraid.org>  
wrote:
> It's the zero-terminator for C strings that will get in the way. We need a
> nice way of getting the compiler to ensure C strings are always terminated
> correctly.

Good point. I neglected to mention that.

Regan