July 30, 2006
Unknown W. Brackets wrote:
> Walter mentioned that "0 is a valid UTF-8 character."  Andrew asked what this meant, so I explained that in this case (as you also clarified) it doesn't make any difference.  Regardless, it's a valid [whatever it is] and that meaning is not unclear.

I confess I often misuse the terminology.
July 30, 2006
On Sat, 29 Jul 2006 20:37:56 -0400, Walter Bright <newshound@digitalmars.com> wrote:

> In D, char[] is a UTF-8 sequence. It's well defined, and therefore portable. It supports every human language.

Even body language? :)
July 30, 2006
On Sun, 30 Jul 2006 03:25:22 -0700, Paolo Invernizzi <arathorn@NOSPAM_fastwebnet.it> wrote:

> LOL!!!
>
> ---
> Paolo
>
> Walter Bright wrote:
>
>> "One Encoding to rule them all, One Encoding to replace them,
>> One Encoding to handle them all and in the darkness bind them"
>> -- UTF Tolkien


Okay, that clears things up. Now we know that UTF is a conspiracy for world domination. ;)

-JJR
July 30, 2006
John Reimer wrote:
> On Sun, 30 Jul 2006 03:25:22 -0700, Paolo Invernizzi  <arathorn@NOSPAM_fastwebnet.it> wrote:
> 
>> LOL!!!
>>
>> ---
>> Paolo
>>
>> Walter Bright wrote:
>>
>>> "One Encoding to rule them all, One Encoding to replace them,
>>> One Encoding to handle them all and in the darkness bind them"
>>> -- UTF Tolkien
> 
> 
> 
> Okay, that clears things up. Now we know that UTF is a conspiracy for  world domination. ;)
> 
> -JJR


And created on the back of a napkin in a New Jersey diner ... way to go, Ken
July 31, 2006
Maybe I missed the point here, correct me if I misunderstood.

This is how I see the problem with char[] as a UTF-8 *string*: the length of a char array is not always the number of characters, but rather the size of the array in bytes, which makes no sense to me. I would like to see separate properties for the two.

For example,
char[] str = "тест";
The word "test" in Russian is 4 Cyrillic characters, yet str.length gives 8, which makes the length property useless unless you are sure the string contains only Latin characters.
July 31, 2006
Serg Kovrov wrote:
> Maybe I missed the point here, correct me if I misunderstood.

You have understood correctly.

> This is how I see the problem with char[] as a UTF-8 *string*: the length of a char array is not always the number of characters, but rather the size of the array in bytes, which makes no sense to me. I would like to see separate properties for the two.

Having char[].length return something other than the actual number of char-units would break its array semantics.

> For example,
> char[] str = "тест";
> The word "test" in Russian is 4 Cyrillic characters, yet str.length gives 8, which makes the length property useless unless you are sure the string contains only Latin characters.

It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters.

It is easy to implement your own character count though:

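// Counts characters (code points) rather than UTF-8 code units.
size_t count(char[] arr) {
	size_t n = 0;
	// Iterating with a dchar loop variable decodes one code point per step.
	foreach(dchar c; arr)
		n++;
	return n;
}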

assert("тест".count() == 4);

Also note that:

assert("тест"d.length == 4);
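To make the code-unit point concrete, here is a small sketch. It assumes a present-day D2 compiler with Phobos, where string literals are immutable `string` values rather than the char[] used above; the variable name is only illustrative.

```d
import std.string : indexOf;

void main()
{
    string s = "тест";   // four Cyrillic letters, two UTF-8 code units each

    // .length counts code units, so it depends on the encoding chosen.
    assert(s.length == 8);          // UTF-8
    assert("тест"w.length == 4);    // UTF-16
    assert("тест"d.length == 4);    // UTF-32

    // Slicing uses code-unit offsets: each letter occupies two of them.
    assert(s[0 .. 2] == "т");
    assert(s[4 .. 8] == "ст");

    // Searching also reports a code-unit offset (4), not a character index (2).
    assert(indexOf(s, "ст") == 4);
}
```

Because the search result is already a code-unit offset, it can be fed straight back into a slice without counting characters.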

/Oskar


July 31, 2006
* Oskar Linde:
> Having char[].length return something other than the actual number
> of char-units would break its array semantics.

Yes, I see. That's why I don't much like char[] as a substitute for a string
type.

> It is actually not very often that you need to count the number
> of characters as opposed to the number of (UTF-8) code units.

Why not use separate properties for that?

> Counting the number of characters is also a rather expensive
> operation. 

Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.

> All the ordinary operations (searching, slicing, concatenation, sub-string  search, etc) operate on code units rather than
> characters.

Yes, that's a tough one. If you want to slice an array, use the count of array units. But if you want to slice a *string* (substring, search, etc.), use the count of characters.

Maybe there should be interchangeable types, string and char[], with different behaviors for length, slice, find, etc.? I mean it could be the same underlying type, but with different contexts for the properties.

And besides, string, as opposed to char[], is more pleasant to my eyes =)
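A rough sketch of what such a string-like view could look like, assuming a present-day D2 compiler. `Utf8View`, `charLength`, `charSlice` and `offsetOf` are hypothetical names made up for this illustration, not anything the language or library provides, and "character" here means a Unicode code point.

```d
// Hypothetical wrapper: same UTF-8 data, but length and slicing by code point.
struct Utf8View
{
    string data;   // the underlying UTF-8 code units

    // Number of code points; walks the whole string, so it is O(n).
    size_t charLength()
    {
        size_t n = 0;
        foreach (dchar c; data)
            ++n;
        return n;
    }

    // Converts a code point index into a code-unit offset, also O(n).
    private size_t offsetOf(size_t charIndex)
    {
        size_t seen = 0;
        foreach (size_t i, dchar c; data)
        {
            if (seen == charIndex)
                return i;   // i is the offset where this code point starts
            ++seen;
        }
        assert(seen == charIndex, "character index out of range");
        return data.length;
    }

    // Slices by code point index; the result shares storage with data.
    string charSlice(size_t from, size_t to)
    {
        return data[offsetOf(from) .. offsetOf(to)];
    }
}

void main()
{
    auto v = Utf8View("тест");
    assert(v.data.length == 8);        // array view: code units
    assert(v.charLength() == 4);       // string view: code points
    assert(v.charSlice(1, 3) == "ес");
}
```

Every character-indexed operation in this sketch has to scan from the start of the data, which is the cost being discussed in this thread.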
July 31, 2006
Serg Kovrov wrote:
> * Oskar Linde:
>> Counting the number of characters is also a rather expensive
>> operation. 
> 
> Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.

Store where? You can't put it in the array data itself without breaking slicing, and putting it in the reference introduces problems with it getting out of date if the array is modified through another reference (without enforcing COW, that is).
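A minimal sketch of the out-of-date problem, assuming a present-day D2 compiler. `CountedRef` and `countChars` are hypothetical names invented for the illustration; the struct stands in for a reference that carries a cached character count.

```d
// Hypothetical "reference with a cached character count".
struct CountedRef
{
    char[] data;
    size_t cachedCount;
}

// Counts code points by decoding the UTF-8 data.
size_t countChars(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    char[] buf = "тест".dup;                    // 8 code units, 4 characters
    auto a = CountedRef(buf, countChars(buf));
    assert(a.cachedCount == 4);

    // A second, plain reference to the same data.
    char[] other = buf;

    // Overwrite the two code units of the first letter with two ASCII letters.
    other[0] = 'a';
    other[1] = 'b';

    // The data now holds 5 characters, but the cached value was never told.
    assert(countChars(a.data) == 5);
    assert(a.cachedCount == 4);                 // stale
}
```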
July 31, 2006
Serg Kovrov wrote:
> * Oskar Linde:
> 
>> Having char[].length return something other than the actual number
>> of char-units would break its array semantics.
> 
> 
> Yes, I see. That's why I don't much like char[] as a substitute for a string
> type.
> 
>> It is actually not very often that you need to count the number
>> of characters as opposed to the number of (UTF-8) code units.
> 
> 
> Why not use separate properties for that?
> 
>> Counting the number of characters is also a rather expensive
>> operation. 
> 
> 
> Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.
> 
>> All the ordinary operations (searching, slicing, concatenation, sub-string  search, etc) operate on code units rather than
>> characters.
> 
> 
> Yes, that's a tough one. If you want to slice an array, use the count of array units. But if you want to slice a *string* (substring, search, etc.), use the count of characters.
> 
> Maybe there should be interchangeable types, string and char[], with different behaviors for length, slice, find, etc.? I mean it could be the same underlying type, but with different contexts for the properties.
> 
> And besides, string, as opposed to char[], is more pleasant to my eyes =)


I say this calls for a proper *standard* String class ... <g>
July 31, 2006
* Frits van Bommel:
> Serg Kovrov wrote:
>> * Oskar Linde:
>>> Counting the number of characters is also a rather expensive
>>> operation. 
>>
>> Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.
> 
> Store where? You can't put it in the array data itself without breaking slicing, and putting it in the reference introduces problems with it getting out of date if the array is modified through another reference (without enforcing COW, that is).

I have to say that I have no idea where to store it, nor where the current length property is stored. I'm really glad that the compiler does it for me.

As a language user, I just want to be confident that the compiler does it wisely, so I can focus on my domain problems.
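For reference, the existing length lives in the array reference itself: a D dynamic array is a (length, pointer) pair, so every slice carries its own length. A minimal sketch, assuming a present-day D2 compiler; the variable names are only illustrative.

```d
void main()
{
    // An array reference is two machine words: a length and a pointer.
    static assert((char[]).sizeof == 2 * size_t.sizeof);

    char[] s = "тест".dup;   // 8 UTF-8 code units
    char[] t = s[2 .. 6];    // the middle two letters, "ес"

    // The slice has its own length but points into the same storage.
    assert(t.length == 4);
    assert(t.ptr == s.ptr + 2);

    // Taking the slice did not touch the original reference.
    assert(s.length == 8);
}
```

A character count kept the same way would need a third field in every reference, and it would have to be recomputed for each slice.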