December 19, 2003
"Sean L. Palmer" <palmer.sean@verizon.net> wrote in message news:brssrg$135p$1@digitaldaemon.com...
> So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and dchar[] means UTF-32?

Yes. Exactly.

> Unfortunately then a char won't hold a single Unicode character,

Correct. But a dchar will.

> you have to mix char and dchar.
>
> It would be nice to have a library function to pull the first character out
> of a UTF-8 string and increment the iterator pointer past it.
> dchar extractFirstChar(inout char* utf8string);

Check out the functions in std.utf.
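For example, using today's std.utf (a sketch; the 2003 module's exact names may have differed), decode does exactly this job - it returns one whole character and advances an index past however many bytes it occupied:

```d
import std.utf : decode;

void main()
{
    // "héllo" is 5 characters but 6 bytes of UTF-8
    string s = "h\u00e9llo";
    size_t i = 0;

    // decode returns the code point starting at s[i] and advances i
    // past the whole sequence, however many bytes long it is
    dchar first = decode(s, i);
    assert(first == 'h' && i == 1);

    dchar second = decode(s, i);
    assert(second == '\u00e9' && i == 3); // 'é' occupies 2 bytes
}
```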

> That seems like an insanely useful text processing function.  Maybe the
> reverse as well:
> void appendChar(char[] utf8string, dchar c);

Actually, a wrapper class around the string, overloading opApply, [], etc., will do the job nicely.
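A minimal sketch of such a wrapper (the UtfString name is invented for illustration): opApply walks the UTF-8 bytes one code point at a time, so foreach sees whole characters rather than bytes.

```d
import std.utf : decode;

// Hypothetical wrapper: presents a char[] (UTF-8) as a sequence of dchars.
struct UtfString
{
    const(char)[] data;

    // opApply lets foreach iterate the string one whole character at a time
    int opApply(int delegate(dchar) dg) const
    {
        size_t i = 0;
        while (i < data.length)
        {
            dchar c = decode(data, i); // advances i past the sequence
            if (auto r = dg(c))
                return r;
        }
        return 0;
    }
}

void main()
{
    dchar[] seen;
    foreach (c; UtfString("na\u00efve")) // 5 characters, 6 UTF-8 bytes
        seen ~= c;
    assert(seen.length == 5);
    assert(seen[2] == '\u00ef');
}
```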


December 19, 2003
"Roald Ribe" <rr.no@spam.teikom.no> wrote in message news:brsfkq$dpl$1@digitaldaemon.com...
> Yes, but that statement does not stop clueless/lazy programmers from using chars in libraries/programs where UTF-32 should have been used.

I can't really stop clueless/lazy programmers from writing bad code <g>.


> I think the profiling might have shown very different numbers if the native language of the profiling crew/test files were traditional Chinese texts, mixed with a lot of different languages.

If an app is going to process primarily Chinese text, it will probably be more efficient using dchar[]. If an app is going to process primarily English, then char[] is the right choice. The server app I wrote was used primarily by American and European companies. It had to handle Chinese, but far and away the bulk of the data it needed to process was plain old ASCII.

D doesn't force such a choice on the app programmer - he can pick char[], wchar[] or dchar[] to match the kind of text the app will mostly be dealing with.
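Concretely, the same text can be held in any of the three encodings, and the memory tradeoff follows directly from the element size (a sketch in current D syntax; the 2003 compiler may have differed in details):

```d
void main()
{
    // One literal, three encodings - the programmer picks per app.
    char[]  u8  = "hello".dup;   // UTF-8:  1 byte per ASCII character
    wchar[] u16 = "hello"w.dup;  // UTF-16: 2 bytes per code unit
    dchar[] u32 = "hello"d.dup;  // UTF-32: 4 bytes per character

    assert(u8.length  * char.sizeof  == 5);
    assert(u16.length * wchar.sizeof == 10);
    assert(u32.length * dchar.sizeof == 20);
}
```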

> I agree with you, speed is important. But if what you are serving is 8-bit .html files (Latin languages), why not treat the data as unsigned bytes? You are describing the "special case" as the explanation of why UTF-32 should not be the general case.

For overloading reasons. I never liked the C way of conflating chars with bytes. Having a utf type separate from a byte type enables more reasonable ways of handling things like string literals.

> > You have a valid point, but things are always a tradeoff. D offers the
> > flexibility of allowing the programmer to choose whether he wants to build
> > his app around char, wchar, or dchar's.
> With all due respect, I believe you are trading off in the wrong direction.
> Because you have a personal interest in good performance (which is good) you seem to not want to consider the more general cases as being the general ones. I propose (as an experiment) that you try to think "what would I do if I were Chinese?" each time you want to make a tradeoff on string handling. This is what good design is all about.

I assume that a Chinese programmer writing Chinese apps would prefer to use dchar[]. And that is fully supported by D, so I must be misunderstanding what our disagreement is about.

> In the performance trail of thought: do we all agree that general string _manipulation_ in all programs will perform much better with UTF-32 than with UTF-8, when considering that the natural language data of the program would be traditional Chinese?

Sure. But if the data the program will see is not Chinese, then performance will suffer. As a language designer, I cannot determine what data the programmer will see, so D provides char[], wchar[] and dchar[], and the programmer can make the choice based on the data for his app.


> Another one: if UTF-32 were the base type of String, would it make sense to have a "Compressed" attribute on each String? That way it could have as small as possible i/o, storage and memcpy size most of the time, and could be uncompressed for manipulation. This should take care of most of the "data size"/thrashing related arguments...

An intriguing idea, but I am not convinced it would be superior to UTF-8. Data compression is relatively slow.

> > D programmers can use dchars if they want to.
> The option to do so is the problem. Because programmers from countries using Latin letters will most likely choose chars, because that is what they are used to, and because it will perform better (on systems with too little RAM). And that will be a loss for the international applicability of D.

D is not going to force one to write internationalized apps, just make it easy to write them if the programmer cares about it. As opposed to C where it is rather difficult to write internationalized apps, so few bother.

> Thanks to all who took the time to read my take on these issues.

It's a fun discussion!


December 19, 2003
Walter wrote:

> "Roald Ribe" <rr.no@spam.teikom.no> wrote in message
> news:brsfkq$dpl$1@digitaldaemon.com...
> 
>>Yes, but that statement does not stop clueless/lazy programmers from
>>using chars in libraries/programs where UTF-32 should have been used.
> 
> I can't really stop clueless/lazy programmers from writing bad code <g>.

But it is possible to make it harder to do so. I believe that is what this discussion is all about.

>>I think the profiling might have shown very different numbers if the
>>native language of the profiling crew/test files were traditional
>>chinese texts, mixed with a lot of different languages.
> 
> If an app is going to process primarilly chinese, it will probably be more
> efficient using dchar[]. If an app is going to process primarilly english,
> then char[] is the right choice. The server app I wrote was for use
> primarilly by american and european companies. It had to handle chinese, but
> far and away the bulk of the data it needed to process was plain old ascii.

I don't think most programmers (at the time of writing the code) are aware that their application is going to be used outside the local region.

An example is the current project I'm working on: the old application that our new one is designed to replace is already exported throughout the world. Even so, when I came into the project there was absolutely zero understanding that we needed to support anything other than ISO-8859-1. As a result, we have lost a lot of time rewriting parts of the system.

Now, I agree that the current D way would have made it a lot easier, but it could be even easier.

> D doesn't force such a choice on the app programmer - he can pick char[],
> wchar[] or dchar[] to match the probability of the bulk of the text it will
> be dealing with.

In the end, I think most people (including me) would be a lot happier if all that was done was renaming dchar to char. No functionality change at all, just a rename of the types. I think most people can see the advantage of D supporting UTF-8 natively; it just feels wrong to have an array of "char" that isn't really an array of characters.

> For overloading reasons. I never liked the C way of conflating chars with
> bytes. Having a utf type separate from a byte type enables more reasonable
> ways of handling things like string literals.

Right, I can see your reasoning, but does the type _really_ have to be named "char"?

> I assume that a chinese programmer writing chinese apps would prefer to use
> dchar[]. And that is fully supported by D, so I am misunderstanding what our
> disagreement is about.

Possibly, but in today's world it's not unusual for an application to be developed in Europe but used in China, or developed in India but used in New Zealand.

Regards

Elias Mårtenson
December 19, 2003
Walter wrote:
>>I don't see how the design of the UTF-8 encoding adds any advantage over
>>other multibyte encodings that might cause people to use it properly.
> 
> 
> UTF-8 has some nice advantages over other multibyte encodings in that it is
> possible to find the start of a sequence without backing up to the
> beginning, none of the multibyte encodings have bit 7 clear (so they never
> conflict with ascii), and no additional information like code pages are
> necessary to decode them.

The only situation I can think of where this might be useful is if you want to jump directly into the middle of a string. And that isn't really useful for UTF-8 because you do not know how many characters were before that - so you have no idea where you've "landed".
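The property Walter describes - every continuation byte has the form 10xxxxxx - does at least make resynchronizing cheap, even though, as noted above, it doesn't tell you the character index. A sketch (the helper name is invented):

```d
// Find the start of the UTF-8 sequence containing index i:
// continuation bytes all look like 0b10xxxxxx, so just back up past them.
size_t syncBackward(const(ubyte)[] s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        i--;
    return i;
}

void main()
{
    // "é" (0xC3 0xA9) followed by 'x' (0x78)
    ubyte[] s = [0xC3, 0xA9, 0x78];
    assert(syncBackward(s, 1) == 0); // index 1 is mid-sequence; back up to 0
    assert(syncBackward(s, 2) == 2); // 'x' is already a sequence start
}
```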

>>And about computing complexity: if you ignore the overhead introduced by
>>having to move more (or sometimes less) memory then manipulating UTF-32
>>strings is a LOT faster than UTF-8. Simply because random access is
>>possible and you do not have to perform an expensive decode operation on
>>each character.
> 
> 
> Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and
> away most operations on strings were copying them, storing them, hashing
> them, etc.

Hmmm. That IS interesting. Now that you mention it, I think this would also apply to most of my own code. Though it might depend on the kind of application.

>>So assuming that your application uses 100,000
>>lines of text (which is a lot more than anything I've ever seen in a
>>program), each 100 characters long and everything held in memory at
>>once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32.
>>These are hardly numbers that will bring a modern OS to its knees
>>anymore. In a few years this might even fit completely into the CPU's
>>cache!
> 
> Server applications usually get maxed out on memory, and they deal
> primarily with text. The bottom line is D will not be competitive with C++
> if it does chars as 32 bits each. I doubt many realize this, but Java and C#
> pay a heavy price for using 2 bytes for a char. (Most benchmarks I've seen
> do not measure char processing speed or memory consumption.)

I hadn't thought of applications that do nothing but serve data/text to others. That's a good counter-example against some of my arguments. Having the server run at 1/2 capacity because of string encoding seems to be too much.

So I think you're right in having multiple "native" encodings. That still leaves the problem of providing easy ways to work with strings, though, to ensure that newbies will "automatically" write Unicode-capable applications. That's the only way I see to avoid the situation we see in C/C++ code right now.

What's bad about multiple encodings is that all libraries would have to support 3 kinds of strings for everything. That's not really feasible in the real world - I certainly don't want to write every function 3 times. I can think of only three ways around that:

1) some sort of automatic conversion when the function is called. This might cause quite a bit of overhead.

2) using some sort of template and let the compiler generate the 3 special cases. I don't think normal templates will work here, because we also need to support string functions in interfaces. Maybe we need some kind of universal string argument type? So that the compiler can automatically generate 3 functions if that type is used in the parameter list? Seems a bit of a hack....

3) making the string type abstract so that string objects are compatible, no matter what their encoding is. This has the added benefit (as I have mentioned a few times before ;)) that users could have strings in their own encoding, which comes in handy when you're dealing with legacy code that does not use US-ASCII.

I think 3 would be the most feasible. You decide about the encoding when you create the string object and everything else is completely transparent.
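A rough sketch of option 3, with invented names (UString, Utf8String) and written against current std.utf - charAt makes visible the price of giving up random access in a variable-width encoding:

```d
// Sketch of option 3: one abstract interface, encoding chosen per object.
interface UString
{
    size_t charLength();    // length in characters, not code units
    dchar charAt(size_t i); // decode on demand
}

class Utf8String : UString
{
    private const(char)[] data;
    this(const(char)[] d) { data = d; }

    size_t charLength()
    {
        import std.utf : count;
        return count(data); // counts code points, not bytes
    }

    dchar charAt(size_t i)
    {
        import std.utf : decode;
        size_t pos = 0;
        dchar c;
        foreach (_; 0 .. i + 1)
            c = decode(data, pos); // O(i): no random access in UTF-8
        return c;
    }
}

void main()
{
    UString s = new Utf8String("a\u00e9z");
    assert(s.charLength() == 3);
    assert(s.charAt(1) == '\u00e9');
}
```

A UTF-32 implementation of the same interface could make both methods O(1); callers would never know the difference, which is exactly the transparency being argued for.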



Hauke
December 19, 2003
"Elias Martenson" <elias-m@algonet.se> wrote in message news:bruen1$f05$1@digitaldaemon.com...
> I don't think most programmers (at the time of writing the code) are aware that their application is going to be used outside the local region.

Probably true.

> > D doesn't force such a choice on the app programmer - he can pick char[],
> > wchar[] or dchar[] to match the probability of the bulk of the text it
> > will be dealing with.
>
> In the end, I think most people (including me) would be a lot happier if all that was done was renaming dchar into char. No functionality change at all, just a rename of the types. I think most people can see the advantage of D supporting UTF-8 natively, it just feels wrong with an array of "char" which isn't really an array of characters.


Even despite the fact that in C and C++, char is byte-sized, it would probably be preferable to just rename "char" to "bchar" and "dchar" to "char".  This corresponds to byte and int, but then wchar seems out of place, since in D there is short and ushort but no word type.  "schar" sounds like "signed char" and I believe we should stay away from that.  What to do, what to do?

> > For overloading reasons. I never liked the C way of conflating chars with bytes. Having a utf type separate from a byte type enables more reasonable ways of handling things like string literals.
>
> Right, I can see your reasoning, but does the type _really_ have to be named "char"?

Good point.  But there is the backward compatibility thing, which kind of sucks.  It would subtly break any C app ported to D that allocated memory using malloc(N) and then stored an N-character string into it.

> > I assume that a Chinese programmer writing Chinese apps would prefer to use dchar[]. And that is fully supported by D, so I must be misunderstanding what our disagreement is about.
>
> Possibly, but in today's world it's not unusual for an application to be developed in Europe but used in China, or developed in India but used in New Zealand.

It will still work, but won't be as efficient as it could be.

Sean


December 19, 2003
There has been a lot of talk about doing things, but very little has actually happened. Consequently, I have made a string interface and two rough and ready string classes for UTF-8 and UTF-32, which are attached to this message.

Currently they only do a few things, one of which is to provide a consistent interface for character manipulation. The UTF-8 class also provides direct access to the bytes for when the user can do things more efficiently with these. They can also be appended to each other. In addition, each provides a constructor taking the other one as a parameter.

Please bear in mind that I am only an amateur programmer, who knows very little about Unicode and has no experience of programming in the real world. Nevertheless, I can appreciate some of the issues here and I hope that these classes can be the foundation of something more useful.

From,

Rupert





December 19, 2003
Cool beans!  Thanks, Rupert!

This brings up a point.  The main reason that I do not like opAssign/opAdd syntax for operator overloading is that it is not self-documenting that opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that opCatAssign corresponds to a ~= b.  This information either has to be present in a comment or you have to go look it up.  Yeah, D gurus will have it memorized, but I'd rather there be just one "name" for the function, and it should be the same both in the definition and at the point of call.

Sean

"Rupert Millard" <rupertamillard@hotmail.DELETE.THIS.com> wrote in message news:brvghd$21n8$2@digitaldaemon.com...
> There has been a lot of talk about doing things, but very little has actually happened. Consequently, I have made a string interface and two rough and ready string classes for UTF-8 and UTF-32, which are attached to this message.
>
> Currently they only do a few things, one of which is to provide a consistent interface for character manipulation. The UTF-8 class also provides direct access to the bytes for when the user can do things more efficiently with these. They can also be appended to each other. In addition, each provides a constructor taking the other one as a parameter.
>
> Please bear in mind that I am only an amateur programmer, who knows very little about Unicode and has no experience of programming in the real world.
> Nevertheless, I can appreciate some of the issues here and I hope that these classes can be the foundation of something more useful.
>
> From,
>
> Rupert


December 19, 2003
I agree with you, but we just have to grin and bear it, unless / until Walter changes his mind. I suppose I could have commented my code better though. Hopefully as I become more experienced, I will be a better judge of these things.

"Sean L. Palmer" <palmer.sean@verizon.net> wrote in message news:brvlj9$29qh$1@digitaldaemon.com...
> Cool beans!  Thanks, Rupert!
>
> This brings up a point.  The main reason that I do not like opAssign/opAdd syntax for operator overloading is that it is not self-documenting that opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that opCatAssign corresponds to a ~= b.  This information either has to be present in a comment or you have to go look it up.  Yeah, D gurus will have it memorized, but I'd rather there be just one "name" for the function, and it should be the same both in the definition and at the point of call.
>
> Sean
>
> "Rupert Millard" <rupertamillard@hotmail.DELETE.THIS.com> wrote in message news:brvghd$21n8$2@digitaldaemon.com...
> > There has been a lot of talk about doing things, but very little has actually happened. Consequently, I have made a string interface and two rough and ready string classes for UTF-8 and UTF-32, which are attached to this message.
> >
> > Currently they only do a few things, one of which is to provide a consistent interface for character manipulation. The UTF-8 class also provides direct access to the bytes for when the user can do things more efficiently with these. They can also be appended to each other. In addition, each provides a constructor taking the other one as a parameter.
> >
> > Please bear in mind that I am only an amateur programmer, who knows very little about Unicode and has no experience of programming in the real world. Nevertheless, I can appreciate some of the issues here and I hope that these classes can be the foundation of something more useful.
> >
> > From,
> >
> > Rupert
>
>


December 19, 2003
"Hauke Duden" <H.NS.Duden@gmx.net> wrote in message news:bruief$kav$1@digitaldaemon.com...
> What's bad about multiple encodings is that all libraries would have to support 3 kinds of strings for everything. That's not really feasible in the real world - I certainly don't want to write every function 3 times.

I had the same thoughts!

> I can think of only three ways around that:
>
> 1) some sort of automatic conversion when the function is called. This might cause quite a bit of overhead.
>
> 2) using some sort of template and let the compiler generate the 3 special cases. I don't think normal templates will work here, because we also need to support string functions in interfaces. Maybe we need some kind of universal string argument type? So that the compiler can automatically generate 3 functions if that type is used in the parameter list? Seems a bit of a hack....

My first thought was to template all functions taking a string. It just got too complicated.

> 3) making the string type abstract so that string objects are compatible, no matter what their encoding is. This has the added benefit (as I have mentioned a few times before ;)) that users could have strings in their own encoding, which comes in handy when you're dealing with legacy code that does not use US-ASCII.
>
> I think 3 would be the most feasible. You decide about the encoding when you create the string object and everything else is completely transparent.

I think 3 is the same as 1!


December 20, 2003
Walter wrote:
>>I can think of only three ways around that:
>>
>>1) some sort of automatic conversion when the function is called. This
>>might cause quite a bit of overhead.
<snip>
>>3) making the string type abstract so that string objects are
>>compatible, no matter what their encoding is. This has the added benefit
>>(as I have mentioned a few times before ;)) that users could have
>>strings in their own encoding, which comes in handy when you're dealing
>>with legacy code that does not use US-ASCII.
>>
>>I think 3 would be the most feasible. You decide about the encoding when
>>you create the string object and everything else is completely transparent.
> 
> I think 3 is the same as 1!

Not really ;).

With 1 I meant having unrelated string classes (maybe source code compatible, but not derived from a common base class). That would mean that a temporary object would have to be created if a function takes, say, a UTF-8 string as an argument but you pass it a UTF-32 string.

Pros: the compiler can do more inlining, since it knows the object type.
Cons: the performance gain of the inlining is probably lost with all the conversions that will be going on if you use different libs. It is also not possible to easily add new string types without adding the corresponding copy constructors and assignment operators to the existing ones.

With 3 there would not be such a problem. All functions would have to use the common string interface for their arguments, so any kind of string object that implements this interface could be passed without a conversion.

Pros: adding new string encodings is no problem, passing string objects never causes new objects to be created or data to be converted.
Cons: most calls can probably not be inlined, since the functions will never know the actual class of the strings they work with. Also, if you want to pass a string constant to a function you'll have to explicitly wrap it in an object, since the compiler doesn't know what kind of object to create to convert a char[] to a string interface reference.

The last point would go away if string constants were also string objects. I think that would be a good idea anyway, since that'd make the string interface the default way to deal with strings.

Another solution would be if there was some way to write global conversion functions that are called to do implicit conversions between different types. Such functions could also be useful in many other circumstances, so that might be an idea to think about.

Hauke