August 02, 2006
"Unknown W. Brackets" <unknown@simplemachines.org> wrote in message news:eapdsg$qeo$1@digitaldaemon.com...
> I'm trying to understand why this 0 thing is such an issue.  If your second statement is valid, it makes the first moot - 0 or no 0.  Why does it matter, then?

Declaring char.init == 0 pretty much means that
D does not strictly require char[] to contain only UTF-8
encoded sequences; it may hold any other encoding suitable for
the application.

char.init == 0 would resolve the situation we see in Phobos now: char[] is de facto used for encodings other than UTF-8.

char.init == 0 tells everybody that char can also be used
for representing Unicode *code points*, with the assumption
that the offset value (the mapping onto the full Unicode set, aka codepage) is stored
somewhere in the application or is well known to it.

char.init == 0 also highlights the fact that it is safe to
use char[] with C string processing functions and to pass it to non-D
modules and libraries.
Whether it is UTF-8 encoded or not does not matter - the type is universal enough.

Andrew.

>
> -[Unknown]
>
>
>> Another option would be to change char.init to 0 and forget about the
>> problem, leaving it as it is now.  Some good string implementation will
>> contain an encoding field in the string instance if needed.
>>
>> Andrew.
>>
>> 

August 02, 2006
On Tue, 01 Aug 2006 22:40:56 -0700, Unknown W. Brackets wrote:

> I'm trying to understand why this 0 thing is such an issue.  If your second statement is valid, it makes the first moot - 0 or no 0.  Why does it matter, then?

I think the issue is more that Andrew wants to have hex-FF as a legitimate byte value anywhere in a char[] variable. He misses the point that the purpose of not allowing it is so that we can detect uninitialized UTF-8 strings at run-time.

Andrew, just use ubyte[] variables and you won't have a problem, apart from conversions between code-pages and Unicode <G>.

In D, ubyte[] is the data structure designed to hold variable length arrays of unsigned bytes, which is exactly what you need to implement the type of strings you have in KOI-8 encoding.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 4:24:27 PM
August 02, 2006
> But maybe that's because I never leave things at their defaults.  It's like writing a story where you expect the reader to think everyone has brown eyes unless you say otherwise.
>

Consider this:

char[6] buf;
strncpy(buf, "1234567", 5);

What will be the content of your buffer?

The answer is: 12345\xff . A surprise? It is.

In modern D a reliable implementation of this would be:

char[6] buf; // memset(buf,0xFF,6); under the hood.
uint n = 5;
strncpy(buf, "1234567", n); // strncpy returns the destination pointer, not a count
buf[n] = 0;

if you are going to use this with non-D modules.

Needless to say, this is a bit redundant.

If D initializes that memory in any case, why do you need
this uint n and buf[n] = 0; ?

Please don't tell me that this is because you spent
your childhood in boy scout camps and got some high principles.
Let's put such matters aside - this is a purely technical discussion.

Andrew.
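
The behaviour described in this post can be reproduced in plain C. The sketch below is illustrative only: memset stands in for D's default initialization of char[6] to 0xFF, and the helper names fill_and_copy and has_terminator are hypothetical, not part of D or the C library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: mimic a D `char[6] buf;` whose bytes all start as
   0xFF, then run the strncpy call from the post and hand back the bytes. */
void fill_and_copy(unsigned char out[6])
{
    char buf[6];
    memset(buf, 0xFF, sizeof buf);  /* stands in for D's default initialization */
    strncpy(buf, "1234567", 5);     /* copies '1'..'5'; no terminator is written */
    memcpy(out, buf, sizeof buf);
}

/* Returns 1 if a zero byte occurs within the first n bytes, else 0. */
int has_terminator(const unsigned char *p, size_t n)
{
    assert(p != NULL);
    return memchr(p, 0, n) != NULL;
}
```

After fill_and_copy, the first five bytes are "12345" and the sixth is still 0xFF, so has_terminator reports that the buffer is not a C string yet.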

August 02, 2006
Andrew Fedoniouk wrote:
> "Unknown W. Brackets" <unknown@simplemachines.org> wrote in message news:eapdsg$qeo$1@digitaldaemon.com...
>> I'm trying to understand why this 0 thing is such an issue.  If your second statement is valid, it makes the first moot - 0 or no 0.  Why does it matter, then?
> 
> Declaring char.init == 0 pretty much means that
> D does not strictly require char[] to contain only UTF-8
> encoded sequences; it may hold any other encoding suitable for
> the application.

Why is this good?

> char.init == 0 would resolve the situation we see in Phobos now:
> char[] is de facto used for encodings other than UTF-8.

You mean data with other encodings that still want to use the std.string functions? I have written template versions that replace (almost) all std.string functions that do not rely on encoding.

> char.init == 0 tells everybody that char can also be used
> for representing Unicode *code points*, with the assumption
> that the offset value (the mapping onto the full Unicode set, aka codepage) is stored
> somewhere in the application or is well known to it.

Maybe it would tell people that. A good thing it isn't so, then. Again, why do you want to store non-UTF-8 data in a char[]? What is wrong with ubyte[] or a suitable typedef?

> char.init == 0 also highlights the fact that it is safe to
> use char[] with C string processing functions and to pass it to non-D modules and libraries.
> Whether it is UTF-8 encoded or not does not matter - the type is universal enough.

I can't see how that would make it considerably safer.

/Oskar
August 02, 2006
> I think the issue is more that Andrew wants to have hex-FF as a legitimate byte value anywhere in a char[] variable. He misses the point that the purpose of not allowing it is so that we can detect uninitialized UTF-8 strings at run-time.
>

What does "uninitialized" mean here? They *are* initialized. This is the main point. For any type you can declare an initial value. I bet that for, say, enums you choose not non-existent values but some really meaningful default values.

Having strings filled with FFs means that you will get problems of a different kind - partially initialized strings.

Could you tell me whether you have ever had a situation where
ffffff strings helped you find a problem?
And if yes, how is it in principle different from
catching strings with 00000?

Can anyone here say that these fffffffs helped to find a problem?

Andrew.

August 02, 2006
Derek Parnell wrote:
> On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
> 
>> (Hope this long dialog will help all of us to better understand what UNICODE is)
>>
>> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eao5st$2r1f$1@digitaldaemon.com...
>>> Andrew Fedoniouk wrote:
>>>> Compiler accepts input stream as either BMP codes or full unicode set
>>>> encoded using UTF-16.
>>>
>>> BMP is a subset of UTF-16.
>> Walter with deepest respect but it is not. Two different things.
>>
>> UTF-16 is a variable-length encoding - byte stream.
>> Unicode BMP is a range of numbers strictly speaking.
> 
> Andrew is correct. In UTF-16, characters are variable length, either 2 or 4
> bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to
> be up to 6 but that has changed). UCS-2 is a subset of Unicode characters
> that are all represented by 2-byte integers. Windows NT had implemented
> UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.

If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?
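
One way to make the question concrete is a well-formedness check for UTF-16 code units, sketched here in C. The function name utf16_is_well_formed is hypothetical. Every 16-bit value outside 0xD800..0xDFFF is accepted as is, so the only sequences such a check rejects are those containing unpaired surrogates, which do not stand for characters on their own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the n code units form well-formed
   UTF-16, else 0. */
int utf16_is_well_formed(const uint16_t *u, size_t n)
{
    assert(u != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (u[i] >= 0xD800 && u[i] <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i + 1 >= n || u[i + 1] < 0xDC00 || u[i + 1] > 0xDFFF)
                return 0;
            i++;  /* skip the second half of the pair */
        } else if (u[i] >= 0xDC00 && u[i] <= 0xDFFF) {
            return 0;  /* low surrogate with no preceding high surrogate */
        }
        /* Any other value is a BMP code point stored as a single unit. */
    }
    return 1;
}
```

A run of ordinary BMP values such as 0x0041, 0x0416, 0x4E2D passes unchanged, a surrogate pair such as 0xD834 0xDD1E passes as one character, and a lone 0xD800 fails.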

August 02, 2006
Andrew Fedoniouk wrote:
>> But maybe that's because I never leave things at their defaults.  It's
>> like writing a story where you expect the reader to think everyone has brown eyes unless you say otherwise.
>>
> 
> Consider this:
> 
> char[6] buf;
> strncpy(buf, "1234567", 5);
> 
> What will be the content of your buffer?
> 
> The answer is: 12345\xff . A surprise? It is.

Not really surprising. Had you compiled this in a C program (you are using C functions after all), you would have gotten:

12345\x?? <- some garbage. Not a zero terminated string.

My manual for strncpy explicitly states:

" if there is no null byte among the first n
       bytes of src, the result will not be null-terminated."

/Oskar
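
Since strncpy does not terminate the destination when the source is too long, one common C approach is a small wrapper that always writes the terminator itself. The following is a minimal sketch under that assumption; copy_terminated is a hypothetical name, not a standard function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most cap-1 bytes of src into dst and always
   write a terminating zero. Returns the number of bytes copied, not
   counting the terminator. */
size_t copy_terminated(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL && cap > 0);
    size_t n = strlen(src);
    if (n > cap - 1)
        n = cap - 1;      /* truncate so the terminator still fits */
    memcpy(dst, src, n);
    dst[n] = '\0';        /* explicit terminator, whatever the buffer held before */
    return n;
}
```

With a 6-byte buffer and the source "1234567" this copies "12345" and terminates it, regardless of whether the buffer previously held 0xFF, zeros, or garbage.
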
August 02, 2006
On Tue, 1 Aug 2006 23:45:26 -0700, Andrew Fedoniouk wrote:

>> But maybe that's because I never leave things at their defaults.  It's like writing a story where you expect the reader to think everyone has brown eyes unless you say otherwise.
>>
> 
> Consider this:
> 
> char[6] buf;
> strncpy(buf, "1234567", 5);
> 
> What will be the content of your buffer?
> 
> The answer is: 12345\xff . A surprise? It is.

No, not surprised, just wondering why you didn't code it correctly though.

If you insist on using C functions then it should be coded ...

  extern(C) ubyte* strncpy(ubyte *, ubyte *, size_t );
  ubyte[6] buf;
  strncpy(buf.ptr, cast(ubyte*)"1234567", 5);


> In modern D a reliable implementation of this would be:
> 
> char[6] buf; // memset(buf,0xFF,6); under the hood.
> uint n = 5;
> strncpy(buf, "1234567", n); // strncpy returns the destination pointer, not a count
> buf[n] = 0;

Well that is debatable. I'd do it more like ...

 char[6] buf;  // An array of UTF-8 code units.
 uint n = 5;
 strncpy(buf, "1234567", n); // Replace the first 5 code-units.
 buf[n..$] = 0; // Set remaining code-units to zero.

> if you are going to use this with non-D modules.
> 
> Needless to say, this is a bit redundant.
> 
> If D initializes that memory in any case, why do you need
> this uint n and buf[n] = 0; ?
> 
> Please don't tell me that this is because you spent
> your childhood in boy scout camps and got some high principles.
> Let's put such matters aside - this is a purely technical discussion.

Exactly. And technically you should be using ubyte[] and not char[].

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 4:57:15 PM
August 02, 2006
Andrew Fedoniouk wrote:
> Can anyone here say that these fffffffs helped to find
> a problem?

Yes, I found two bugs in my own code with it that would have been hidden with the 0 initialization.
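
A rough C sketch of why an 0xFF default can expose such bugs while a 0 default cannot: the bytes 0xC0, 0xC1 and 0xF5..0xFF never occur in well-formed UTF-8, so a scan can flag a byte that was left at its default, whereas 0 is a legal code unit (U+0000) and passes silently. The function name first_impossible_utf8_byte is hypothetical, and this is not a full UTF-8 validator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the index of the first byte that can never
   appear in well-formed UTF-8 (0xC0, 0xC1, 0xF5..0xFF), or n if none. */
size_t first_impossible_utf8_byte(const unsigned char *p, size_t n)
{
    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (p[i] == 0xC0 || p[i] == 0xC1 || p[i] >= 0xF5)
            return i;
    }
    return n;
}
```

For the buffer '1','2','3','4','5',0xFF the scan stops at index 5. For the same buffer ending in 0 it finds nothing, so a partially filled buffer would go unnoticed by this kind of check.
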
August 02, 2006
I fail to understand why I want another ambiguous type in my programming.  I am glad that when I type "int", I know I have a number and not a pointer.

I am glad that when I type char, I again know what I have.  No guesswork.  Your proposals sound like shooting myself in the foot.

No fun.  I'll take that helmet you offered first.

-[Unknown]


> "Unknown W. Brackets" <unknown@simplemachines.org> wrote in message news:eapdsg$qeo$1@digitaldaemon.com...
>> I'm trying to understand why this 0 thing is such an issue.  If your second statement is valid, it makes the first moot - 0 or no 0.  Why does it matter, then?
> 
> Declaring char.init == 0 pretty much means that
> D does not strictly require char[] to contain only UTF-8
> encoded sequences; it may hold any other encoding suitable for
> the application.
> 
> char.init == 0 would resolve the situation we see in Phobos now:
> char[] is de facto used for encodings other than UTF-8.
> 
> char.init == 0 tells everybody that char can also be used
> for representing Unicode *code points*, with the assumption
> that the offset value (the mapping onto the full Unicode set, aka codepage) is stored
> somewhere in the application or is well known to it.
> 
> char.init == 0 also highlights the fact that it is safe to
> use char[] with C string processing functions and to pass it to non-D modules and libraries.
> Whether it is UTF-8 encoded or not does not matter - the type is universal enough.
> 
> Andrew.
> 
>> -[Unknown]
>>
>>
>>> Another option would be to change char.init to 0 and forget about the problem,
>>> leaving it as it is now.  Some good string implementation will
>>> contain an encoding field in the string instance if needed.
>>>
>>> Andrew.
>>>
>>>
>