August 02, 2006
Why would I ever use strncat() in a D program?

Consider this: if you do not wear a helmet while riding a motorcycle (read: I don't like helmets) you could break your head and die.  Guess what?  I don't ride motorcycles.  Problem solved.

I don't like null-terminated strings.  I think they are the root of much evil.  Describing why having 0 as a default benefits null-terminated strings is, to me, like describing how having fewer police helps burglars. Obviously I'm being over-dramatic, but I remain unconvinced...

Also I did spend (some of) my childhood in Boy Scout camps and I did learn many principles (none of which related to programming in the slightest.)  I mean that literally.  But you're right, that's beside the point.

-[Unknown]


>> But maybe that's because I never leave things at their defaults.  It's
>> like writing a story where you expect the reader to think everyone has brown eyes unless you say otherwise.
>>
> 
> Consider this:
> 
> char[6] buf;
> strncpy(buf, "1234567", 5);
> 
> What will be the content of your buffer?
> 
> Answer is: 12345\xff . Surprise? It is.
> 
> In modern D, a reliable implementation of this would be:
> 
> char[6] buf; // memset(buf,0xFF,6); under the hood.
> uint n = strncpy(buf, "1234567", 5);
> buf[n] = 0;
> 
> if you are going to use this with non-D modules.
> 
> Needless to say, this is a bit redundant.
> 
> If D initializes that memory in any case, why do you need
> this uint n and buf[n] = 0;?
> 
> Please don't tell me that this is because you spent
> your childhood in Boy Scout camps and got some high principles.
> Let's put those matters aside - this is a purely technical discussion.
> 
> Andrew.
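
For anyone following along, here is a minimal sketch of the behaviour under discussion. It assumes a present-day D compiler, where the C string functions live in core.stdc.string (std.c.string in 2006) and strncpy returns a pointer rather than a count. copyTerminated is a hypothetical helper, not a library function; it only illustrates the "copy, then terminate" pattern from the quoted post.

```d
import core.stdc.string : strncpy;

/// Hypothetical helper: copies at most dst.length - 1 bytes of src into dst,
/// always writes a terminating zero, and returns the number of bytes copied.
size_t copyTerminated(char[] dst, const(char)[] src)
{
    assert(dst.length > 0);
    immutable n = src.length < dst.length - 1 ? src.length : dst.length - 1;
    dst[0 .. n] = src[0 .. n];
    dst[n] = '\0';
    return n;
}

void main()
{
    char[6] buf;                         // every element starts as char.init (0xFF)
    assert(buf[5] == 0xFF);

    strncpy(buf.ptr, "1234567".ptr, 5);  // copies 5 bytes, writes no terminator
    assert(buf[0 .. 5] == "12345");
    assert(buf[5] == 0xFF);              // still the default value, not '\0'

    immutable copied = copyTerminated(buf[], "1234567");
    assert(copied == 5 && buf[5] == '\0');
}
```

Whichever default the language picks, the terminator still has to be written explicitly once the source is longer than the count passed to strncpy.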
August 02, 2006
Correction: strncpy().  They're all evil.

-[Unknown]


August 02, 2006
On Wed, 02 Aug 2006 00:11:26 -0700, Walter Bright wrote:

> Derek Parnell wrote:
>> On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
>> 
>>> (Hope this long dialog will help all of us to better understand what UNICODE is)
>>>
>>> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eao5st$2r1f$1@digitaldaemon.com...
>>>> Andrew Fedoniouk wrote:
>>>>> Compiler accepts input stream as either BMP codes or full unicode set encoded using UTF-16.
>>>>
>>>> BMP is a subset of UTF-16.
>>> Walter with deepest respect but it is not. Two different things.
>>>
>>> UTF-16 is a variable-length encoding - byte stream. Unicode BMP is a range of numbers strictly speaking.
>> 
>> Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.
> 
> If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?

Huh??? I said "UCS-2 is a subset of Unicode characters" Did you miss that? UTF-16 is not a subset as it can be used to encode every Unicode code point. UCS-2 is a subset as it can *not* encode code points that are outside of the "basic multilingual plane" (aka BMP).
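
A small sketch of the distinction, assuming present-day Phobos (std.utf.encode with a wchar[2] buffer). utf16Units is a hypothetical name used only for illustration: a code point inside the BMP takes one 16-bit unit, the same unit UCS-2 would use, while a code point outside the BMP takes a surrogate pair that UCS-2 cannot express.

```d
import std.utf : encode;

/// Hypothetical helper: number of UTF-16 code units needed for one code point.
size_t utf16Units(dchar c)
{
    wchar[2] buf;
    return encode(buf, c);
}

void main()
{
    assert(utf16Units('A') == 1);          // U+0041, inside the BMP
    assert(utf16Units('\u20AC') == 1);     // U+20AC EURO SIGN, inside the BMP
    assert(utf16Units('\U0001D11E') == 2); // U+1D11E, outside the BMP

    wstring clef = "\U0001D11E"w;
    assert(clef.length == 2);              // two code units, one code point
    assert(clef[0] == 0xD834 && clef[1] == 0xDD1E); // the surrogate pair
}
```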

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 5:43:18 PM
August 02, 2006
On Wed, 2 Aug 2006 00:08:42 -0700, Andrew Fedoniouk wrote:

>> I think the issue is more that Andrew wants to have hex-FF as a legitimate byte value anywhere in a char[] variable. He misses the point that the purpose of not allowing it is so we can detect uninitialized UTF-8 strings at run-time.
>>
> 
> What does "uninitialized" mean? They *are* initialized.

Andrew, I will assume you are not trying to be difficult but that maybe your English is a bit too literal.

Of course in the clinical sense they are initialized because data is moved into them before your code has a chance to do anything. However, when I say "detected uninitialized UTF-8 strings" I mean "detect UTF-8 strings that have not been initialized by your own code". Is that better?

> This is the main point. For any type you can declare an initial value. I bet that for, say, enums you are not choosing non-existent values but some really meaningful default values.

Huh??? Now you are being difficult. The purpose of enums is to have them initialized to values that make sense in their context. But the default values for enum generally work for me as the exact value doesn't really matter in most cases.

  enum AccountType
  {
     Savings,
     Investment,
     FixedLoan,
     Club,
     LineOfCredit
  }

I really don't care what values the compiler assigns to these enums. Sure I
could choose specific values but it doesn't really matter.
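
As a quick sketch of what the compiler does with that declaration (the enum is repeated here only so the snippet stands alone): the members are numbered from 0, and a default-initialized variable gets the first member.

```d
enum AccountType
{
    Savings,
    Investment,
    FixedLoan,
    Club,
    LineOfCredit
}

void main()
{
    AccountType a;                                   // default-initialized
    assert(a == AccountType.Savings);                // .init is the first member
    assert(AccountType.init == AccountType.Savings);
    assert(cast(int) AccountType.Savings == 0);      // numbering starts at 0
    assert(cast(int) AccountType.LineOfCredit == 4);
}
```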

> Having strings filled with FFs means that you will get problems of a different kind - partially initialized strings.

Huh???? Why would I always get partially initialized strings, as you imply? And even if I did, then having 0xFF in them is going to help me track down some stupid code that I wrote.

> Could you tell me, have you ever had a situation where
> ffffff strings helped you to find a problem?

No. I haven't made that kind of mistake yet with my code.

> And if yes, how is it different in principle from
> catching strings with 00000?

Because if I found a 0x00 in a string, I wouldn't know if it's legitimate or
not.

> Can anyone here say that these fffffffs helped to find a problem?

But if I found 0xFF I would know straight away that I've made a mistake somewhere. Actually, come to think about it, I did make a mistake once when my code was incorrectly interpreting a BOM in a text file. I loaded the file as if it was UTF-8 but it should have been UTF-16. DMD correctly told me I had a bad UTF string when I tried to write it out.
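
Here is a rough sketch of that difference, assuming present-day Phobos (std.utf.validate, which throws UTFException on malformed input). isWellFormed is a hypothetical wrapper, not a library function. A zero byte is a valid UTF-8 encoding of U+0000, so it passes; 0xFF can never appear in well-formed UTF-8, so a buffer that was never filled, or only partly filled, is rejected.

```d
import std.utf : validate, UTFException;

/// Hypothetical wrapper: true if s is well-formed UTF-8.
bool isWellFormed(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    char[4] untouched;                 // all 0xFF: never written by user code
    assert(!isWellFormed(untouched[]));

    char[4] partial;                   // only the first two bytes are written
    partial[0 .. 2] = "ab";
    assert(!isWellFormed(partial[]));  // the 0xFF tail gives it away

    assert(isWellFormed("ab\0cd"));    // an embedded zero is legitimate UTF-8
}
```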

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 5:49:46 PM
August 02, 2006
Derek Parnell wrote:
> On Wed, 02 Aug 2006 00:11:26 -0700, Walter Bright wrote:
>> Derek Parnell wrote:
>>> On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
>>>> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eao5st$2r1f$1@digitaldaemon.com...
>>>>> Andrew Fedoniouk wrote:
>>>>>> Compiler accepts input stream as either BMP codes or full unicode set encoded using UTF-16.
>>>>>
>>>>> BMP is a subset of UTF-16.
>>>> Walter with deepest respect but it is not. Two different things.
>>>>
>>>> UTF-16 is a variable-length encoding - byte stream.
>>>> Unicode BMP is a range of numbers strictly speaking.
>>> Andrew is correct. In UTF-16, characters are variable length, from 2 to 4
>>> bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to
>>> be up to 6 but that has changed). UCS-2 is a subset of Unicode characters
>>> that are all represented by 2-byte integers. Windows NT had implemented
>>> UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.
>> If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?
> 
> Huh??? I said "UCS-2 is a subset of Unicode characters" Did you miss that?

I saw it, but that statement is not the same as "UCS-2 is a subset of UTF-16". The issue I was talking about is "BMP [UCS-2] is a subset of UTF-16", which Andrew keeps replying "it is not". You said "Andrew is correct", so I inferred you were agreeing that UCS-2 is not a subset of UTF-16.

> UTF-16 is not a subset as it can be used to encode every Unicode code
> point. UCS-2 is a subset as it can *not* encode code points that are
> outside of the "basic multilingual plane" (aka BMP). 

I think you and I are in agreement.
August 02, 2006
Andrew Fedoniouk wrote on 2006-07-31:
>
> "Thomas Kuehne" <thomas-dloop@kuehne.cn> wrote in message news:ls52q3-3o8.ln1@birke.kuehne.cn...
>>
>> Oskar Linde wrote on 2006-07-31:
>>> Serg Kovrov wrote:
>>
>>>> For example,
>>>> char[] str = "тест";
>>>> the word "test" in Russian - 4 Cyrillic characters - would give you
>>>> str.length 8, which makes this length property useless if you are not sure
>>>> that the string contains Latin characters only.
>>>
>>> It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters.
>>>
>>> It is easy to implement your own character count though:
>>>
>>> size_t count(char[] arr) {
>>>     size_t n = 0;
>>>     foreach (dchar c; arr) // decodes the UTF-8 code units into code points
>>>         n++;
>>>     return n;
>>> }
>>>
>>> assert("тест".count() == 4);
>>>
>>> Also note that:
>>>
>>> assert("тест"d.length == 4);
>>
>> I hate to be pedantic but dchar[] can only be used to count the code points - not the characters. A "character" can be composed of more than one code point/dchar. This feature is frequently used for accents, marks and some Asian scripts.
>>
>> -> http://www.unicode.org
>>
>
>
> Right, Thomas,
>
> umlaut as a separate code point can exist
> so A with umlaut can be represented by two code points.
> But as far as I remember the intention was and is
> to have in Unicode also all full forms like "A-with-umlaut"

http://www.unicode.org/faq/char_combmark.html#13

I won't argue about the intention here.
Post this statement on
<unicode@unicode.org> (http://www.unicode.org/consortium/distlist.html)
and let's see the various responses ;)


> So you can always "compress" multi code point forms into single point counterparts.

Not always. For a common use case see http://www.unicode.org/faq/han_cjk.html#7 http://www.unicode.org/faq/han_cjk.html#9
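
To make the three different "lengths" concrete, here is a sketch that assumes present-day Phobos (std.uni.byGrapheme and std.uni.normalize, neither of which existed in 2006). codePoints and graphemes are hypothetical helper names. The sketch uses a Latin example rather than the CJK cases from the FAQ: A plus a combining diaeresis has a precomposed form, g plus a combining diaeresis does not.

```d
import std.range : walkLength;
import std.uni;

/// Hypothetical helpers: count code points and user-perceived characters.
size_t codePoints(string s) { return s.walkLength; }
size_t graphemes(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    string decomposed = "A\u0308";       // A + COMBINING DIAERESIS
    assert(decomposed.length == 3);      // UTF-8 code units
    assert(codePoints(decomposed) == 2); // what a dchar[] would count
    assert(graphemes(decomposed) == 1);  // what a reader sees

    // A precomposed form exists, so NFC "compresses" it to one code point.
    assert(normalize!NFC(decomposed) == "\u00C4");

    // No precomposed form exists, so NFC leaves two code points.
    string noPrecomposed = "g\u0308";
    assert(codePoints(normalize!NFC(noPrecomposed)) == 2);
    assert(graphemes(noPrecomposed) == 1);
}
```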

Thomas


August 03, 2006
Andrew Fedoniouk wrote:
> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eao5st$2r1f$1@digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> Compiler accepts input stream as either BMP codes or full unicode set encoded using UTF-16.
>>
>> BMP is a subset of UTF-16.
> 
> Walter with deepest respect but it is not. Two different things.
> 
> UTF-16 is a variable-length encoding - byte stream.
> Unicode BMP is a range of numbers strictly speaking.
> 
> If you will treat utf-16 sequence as a sequence of UCS-2 (BMP) codes you
> are in trouble. See:
> 

Uh, the statement "BMP is a subset of UTF-16" means that you can read a BMP sequence as a UTF-16 sequence, not the opposite as you said: "If you will treat utf-16 sequence as a sequence of UCS-2 (BMP)".
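
A rough sketch of why the direction matters, assuming present-day Phobos (std.utf.isValidDchar). readableAsUcs2 is a hypothetical name: it reports whether a UTF-16 sequence could also be read one unit at a time. Any surrogate-free sequence passes, which is the "subset" direction; a sequence holding a surrogate pair fails, which is the opposite direction.

```d
import std.utf : isValidDchar;

/// Hypothetical check: true if every 16-bit unit stands alone as a code
/// point, i.e. the sequence contains no surrogate halves.
bool readableAsUcs2(const(wchar)[] units)
{
    foreach (wchar u; units)   // iterate raw code units, no decoding
        if (!isValidDchar(u))
            return false;
    return true;
}

void main()
{
    assert(readableAsUcs2("Gr\u00FC\u00DFe"w));  // BMP only: same units either way
    assert(!readableAsUcs2("\U0001D11E"w));      // surrogate pair: not UCS-2

    wstring mixed = "a\U0001D11Eb"w;
    size_t points;
    foreach (dchar c; mixed)   // decode as UTF-16
        points++;
    assert(mixed.length == 4); // four code units
    assert(points == 3);       // three code points
}
```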


>>> Ordinary people will do their own strings anyway. Just give them opAssign and dtor in structs and you will see explosion of perfect strings. That char#[] (read-only arrays) will also benefit here. oh.....
>>>
>>> Changing char init value to 0 will not harm anybody but will allow char to be used for purposes other than UTF-8 - it is only one of some 40 encodings in active use anyway.
>>>
>>> For persistence purposes (in compiled EXE) utf is the best choice probably. But in runtime - please not on language level.
>> ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
> 
> Thus the whole set of Windows API headers (and std.c.string for example)
> seen in D has to be rewritten to accept ubyte[], as char in D is not char in C.
> Is this the idea?
> 
> Andrew.
> 
> 

Just a note, not to ubyte[] but to ubyte* .
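
For illustration only, a sketch of what that looks like at a call site, assuming present-day druntime (core.stdc.string.strlen is declared with a char pointer, so the ubyte pointer still needs a cast). cLength is a hypothetical wrapper, and the bytes are "café" in Latin-1, which is not valid UTF-8 and so does not belong in a D char[].

```d
import core.stdc.string : strlen;

/// Hypothetical wrapper: length of a zero-terminated byte string in any
/// single-byte encoding, measured by the C runtime.
size_t cLength(const(ubyte)* s)
{
    return strlen(cast(const(char)*) s);
}

void main()
{
    // 'c' 'a' 'f' 0xE9 (e-acute in Latin-1), zero-terminated
    immutable ubyte[5] latin1 = [0x63, 0x61, 0x66, 0xE9, 0x00];
    assert(cLength(latin1.ptr) == 4);
}
```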


-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D