December 17, 2003 Re: Unicode discussion

Posted in reply to Elias Martenson

I think we're mostly in agreement!
December 17, 2003 Re: Unicode discussion

Posted in reply to Sean L. Palmer

Sorry, "sign" of char.

"Sean L. Palmer" <palmer.sean@verizon.net> wrote in message news:brp52o$1gc6$1@digitaldaemon.com...
> It's stupid to not agree on a standard size for char, since it's easy to
> "fix" the sign of a char register by biasing it by 128 (xor 0x80 works
> too),
December 17, 2003 Re: Unicode discussion

Posted in reply to Matthias Becker

"Matthias Becker" <Matthias_member@pathlink.com> wrote in message news:brpr00$2grc$1@digitaldaemon.com...
> > In a higher level language, yes. But in doing systems work, one always
> > seems to be looking at the lower level elements anyway. I wrestled with
> > this for a while, and eventually decided that char[], wchar[], and
> > dchar[] would be low level representations. One could design a wrapper
> > class for them that overloads [] to provide automatic decoding if
> > desired.
>
> Shouldn't this wrapper be part of Phobos?

Eventually, yes. First things first, though, and the first step was making the innards of the D language and compiler fully Unicode enabled.
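The wrapper idea described above — a class over a raw UTF-8 array whose indexing operator decodes on the fly — can be sketched in a few lines. The following is an illustrative C++ sketch (not Phobos code; the class name and design are my assumptions, and the input is assumed to be valid UTF-8):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Illustrative wrapper over a raw UTF-8 byte array. Indexing is by
// character position, not byte position: operator[] decodes on the fly.
class Utf8String {
    std::string bytes; // raw UTF-8 data, assumed valid
public:
    explicit Utf8String(std::string b) : bytes(b) {}

    // Decode the n-th code point, scanning from the start each time.
    uint32_t operator[](size_t n) const {
        size_t i = 0;
        for (;;) {
            if (i >= bytes.size()) throw std::out_of_range("index");
            unsigned char c = bytes[i];
            // Sequence length is determined by the lead byte.
            size_t len = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
            if (n == 0) {
                // Mask off the length-marker bits of the lead byte...
                uint32_t cp = c & (len == 1 ? 0x7F : 0x3F >> (len - 1));
                // ...then fold in 6 payload bits per continuation byte.
                for (size_t k = 1; k < len; ++k)
                    cp = (cp << 6) | (bytes[i + k] & 0x3F);
                return cp;
            }
            i += len;
            --n;
        }
    }
};
```

Note the cost hiding in such a wrapper: because characters vary in width, indexing must scan from the start of the array, so random access is O(n) per lookup rather than O(1) — which is exactly the tradeoff debated later in this thread.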
December 18, 2003 Re: Unicode discussion

Posted in reply to Hauke Duden

"Hauke Duden" <H.NS.Duden@gmx.net> wrote in message news:brpvmn$2o0t$1@digitaldaemon.com...
> I don't see how the design of the UTF-8 encoding adds any advantage over
> other multibyte encodings that might cause people to use it properly.

UTF-8 has some nice advantages over other multibyte encodings in that it is possible to find the start of a sequence without backing up to the beginning, none of the multibyte encodings have bit 7 clear (so they never conflict with ascii), and no additional information like code pages is necessary to decode them.

> Actually, depending on your language, UTF-32 can also be better than
> UTF-8. If you use a language that uses the upper Unicode characters then
> UTF-8 will use 3-5 bytes per character. So you may end up using even more
> memory with UTF-8.

That's correct. And D supports UTF-32 programming if that works better for the particular application.

> And about computing complexity: if you ignore the overhead introduced by
> having to move more (or sometimes less) memory then manipulating UTF-32
> strings is a LOT faster than UTF-8. Simply because random access is
> possible and you do not have to perform an expensive decode operation on
> each character.

Interestingly, it was rarely necessary to decode the UTF-8 strings. Far and away most operations on strings were copying them, storing them, hashing them, etc.

> Also, how much text did your "bad experience" application use?

Maybe 100 megs. Extensive profiling and analysis showed that it would have run much faster if it was UTF-8 rather than UTF-32, not the least of which was that it would have hit the 'wall' of thrashing the virtual memory much later.

> It seems to me that even if you assume best-case for UTF-8 (e.g. one byte
> per character) then the memory overhead should not be much of an issue.
> It's only factor 4, after all.

It's a huge (!) issue. When you're pushing a web server to the max, using 4x memory means it runs 4x slower. (Actually about 2x slower because of other factors.)

> So assuming that your application uses 100.000 lines of text (which is a
> lot more than anything I've ever seen in a program), each 100 characters
> long and everything held in memory at once, then you'd end up requiring
> 10 MB for UTF-8 and 40 MB for UTF-32. These are hardly numbers that will
> bring a modern OS to its knees anymore. In a few years this might even
> fit completely into the CPU's cache!

Server applications usually get maxed out on memory, and they deal primarily with text. The bottom line is D will not be competitive with C++ if it does chars as 32 bits each. I doubt many realize this, but Java and C# pay a heavy price for using 2 bytes for a char. (Most benchmarks I've seen do not measure char processing speed or memory consumption.)

> I think it's more important to have proper localization ability and
> programming ease than trying to conserve a few bytes for a limited group
> of people (i.e. english speakers). Being greedy with memory consumption
> when making long-term design decisions has always caused problems. For
> instance, it caused that major Y2K panic in the industry a few years ago!

You have a valid point, but things are always a tradeoff. D offers the flexibility of allowing the programmer to choose whether he wants to build his app around char, wchar, or dchars. (None of my programs dating back to the 70's had any Y2K bugs in them <g>)

> Please also keep in mind that a factor 4 will be compensated by memory
> enhancements in only 1-2 years time.

I don't agree that memory is improving that fast. Even if it is, people just load them up with more data to fill the memory up. I will agree that program code size is no longer that relevant, but data size is still pretty relevant. Stuff we were forced to do back in the bad old DOS 640k days seems pretty quaint now <g>.

> Most people already have several hundred megabytes of RAM and it will
> soon be gigabytes. Isn't it a bit shortsighted to make the lives of D
> programmers harder forever, just to save a few megabytes of memory that
> people will laugh about in 5 years (or already laugh about right now)?

D programmers can use dchars if they want to.
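The self-synchronization property mentioned above can be shown concretely: UTF-8 continuation bytes always match the bit pattern 10xxxxxx, so from any byte position the start of the enclosing sequence is at most three bytes back. A minimal sketch (C++ for illustration; the function names are mine, and valid UTF-8 is assumed):

```cpp
#include <cstddef>

// A UTF-8 continuation byte always looks like 10xxxxxx.
bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Resynchronize: given an index anywhere inside a string of valid UTF-8,
// step back (at most 3 bytes) to the lead byte of the enclosing sequence.
// No need to rescan from the beginning of the string.
size_t sequence_start(const unsigned char* s, size_t i) {
    while (i > 0 && is_continuation(s[i]))
        --i;
    return i;
}
```

Legacy multibyte encodings such as Shift-JIS lack this property: a trail byte there can also be a valid lead byte, so finding a character boundary can require scanning from the start of the string.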
December 18, 2003 Re: Unicode discussion

Posted in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:brqr8e$vmh$1@digitaldaemon.com...
> "Hauke Duden" <H.NS.Duden@gmx.net> wrote in message news:brpvmn$2o0t$1@digitaldaemon.com...
> > I don't see how the design of the UTF-8 encoding adds any advantage
> > over other multibyte encodings that might cause people to use it
> > properly.
>
> UTF-8 has some nice advantages over other multibyte encodings in that it
> is possible to find the start of a sequence without backing up to the
> beginning, none of the multibyte encodings have bit 7 clear (so they
> never conflict with ascii), and no additional information like code
> pages is necessary to decode them.

But with UTF-32, this is not an issue at all.

> > Actually, depending on your language, UTF-32 can also be better than
> > UTF-8. If you use a language that uses the upper Unicode characters
> > then UTF-8 will use 3-5 bytes per character. So you may end up using
> > even more memory with UTF-8.
>
> That's correct. And D supports UTF-32 programming if that works better
> for the particular application.

Yes, but that statement does not stop clueless/lazy programmers from using chars in libraries/programs where UTF-32 should have been used.

> > And about computing complexity: if you ignore the overhead introduced
> > by having to move more (or sometimes less) memory then manipulating
> > UTF-32 strings is a LOT faster than UTF-8. Simply because random
> > access is possible and you do not have to perform an expensive decode
> > operation on each character.
>
> Interestingly, it was rarely necessary to decode the UTF-8 strings. Far
> and away most operations on strings were copying them, storing them,
> hashing them, etc.

If that is correct, it might be just as correct, and even faster, to treat it as binary data in most cases. There is no need to have that data represented as String at all times.

> > Also, how much text did your "bad experience" application use?
>
> Maybe 100 megs. Extensive profiling and analysis showed that it would
> have run much faster if it was UTF-8 rather than UTF-32, not the least
> of which was that it would have hit the 'wall' of thrashing the virtual
> memory much later.

I think the profiling might have shown very different numbers if the native language of the profiling crew/test files were traditional Chinese texts, mixed with a lot of different languages.

> > It seems to me that even if you assume best-case for UTF-8 (e.g. one
> > byte per character) then the memory overhead should not be much of an
> > issue. It's only factor 4, after all.
>
> It's a huge (!) issue. When you're pushing a web server to the max,
> using 4x memory means it runs 4x slower. (Actually about 2x slower
> because of other factors.)

I agree with you, speed is important. But if what you are serving is 8-bit .html files (latin language), why not treat the data as unsigned bytes? You are describing the "special case" as the explanation of why UTF-32 should not be the general case. The definition of the language is what people are interested in at this point. What "dirty" tricks you use in the implementation to make it faster (right now, in some special cases, with a limited set of language data) is less interesting.

> > So assuming that your application uses 100.000 lines of text (which is
> > a lot more than anything I've ever seen in a program), each 100
> > characters long and everything held in memory at once, then you'd end
> > up requiring 10 MB for UTF-8 and 40 MB for UTF-32. These are hardly
> > numbers that will bring a modern OS to its knees anymore. In a few
> > years this might even fit completely into the CPU's cache!
>
> Server applications usually get maxed out on memory, and they deal
> primarily with text. The bottom line is D will not be competitive with
> C++ if it does chars as 32 bits each. I doubt many realize this, but
> Java and C# pay a heavy price for using 2 bytes for a char. (Most
> benchmarks I've seen do not measure char processing speed or memory
> consumption.)

I think this is a brilliant observation. I had not thought much about this. But I think my thought from above is still correct: why should the data for this special case be String at all? A good server software writer could obtain the ultimate speed by using unsigned bytes. That would give ultimate speed when necessary, and generally applicable String handling for all spoken languages would be enforced for String at the same time.

> > I think it's more important to have proper localization ability and
> > programming ease than trying to conserve a few bytes for a limited
> > group of people (i.e. english speakers). Being greedy with memory
> > consumption when making long-term design decisions has always caused
> > problems. For instance, it caused that major Y2K panic in the industry
> > a few years ago!
>
> You have a valid point, but things are always a tradeoff. D offers the
> flexibility of allowing the programmer to choose whether he wants to
> build his app around char, wchar, or dchars.

With all due respect, I believe you are trading off in the wrong direction. Because you have a personal interest in good performance (which is good), you seem not to want to consider the more general cases as being the general ones. I propose (as an experiment) that you try to think "what would I do if I were Chinese?" each time you want to make a tradeoff on string handling. This is what good design is all about.

In the performance train of thought: do we all agree that general String _manipulation_ in all programs will perform much better with UTF-32 than with UTF-8, when considering that the natural language data of the program would be traditional Chinese?

Another one: if UTF-32 were the base type of String, would it be applicable to have a "Compressed" attribute on each String? That way it could have as small as possible i/o, storage and memcpy size most of the time, and could be uncompressed for manipulation. This should take care of most of the "data size"/thrashing related arguments...

> (None of my programs dating back to the 70's had any Y2K bugs in them
> <g>)
>
> > Please also keep in mind that a factor 4 will be compensated by memory
> > enhancements in only 1-2 years time.
>
> I don't agree that memory is improving that fast. Even if it is, people
> just load them up with more data to fill the memory up. I will agree
> that program code size is no longer that relevant, but data size is
> still pretty relevant. Stuff we were forced to do back in the bad old
> DOS 640k days seems pretty quaint now <g>.
>
> > Most people already have several hundred megabytes of RAM and it will
> > soon be gigabytes. Isn't it a bit shortsighted to make the lives of D
> > programmers harder forever, just to save a few megabytes of memory
> > that people will laugh about in 5 years (or already laugh about right
> > now)?
>
> D programmers can use dchars if they want to.

The option to do so is the problem. Because the programmers from a latin-letter-using country will most likely choose chars, because that is what they are used to, and because it will perform better (on systems with too little RAM). And that will be a loss for the international applicability of D.

Thanks to all who took the time to read my take on these issues.

Regards,
Roald
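The memory argument both sides are making can be put in concrete numbers. An illustrative C++ sketch (not code from the thread; function names are mine) of per-code-point encoded sizes shows that the "4x" savings holds for ASCII text, while for Chinese text (3 bytes per character in UTF-8) the savings shrink to 25%:

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed to encode one code point in UTF-8
// (assumes a valid Unicode scalar value).
size_t utf8_len(uint32_t cp) {
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Total bytes for n copies of a code point under each encoding.
size_t utf8_total(uint32_t cp, size_t n)  { return n * utf8_len(cp); }
size_t utf32_total(uint32_t cp, size_t n) { (void)cp; return n * 4; }
```

For 100 ASCII characters, UTF-8 takes 100 bytes against UTF-32's 400; for 100 CJK ideographs, 300 bytes against 400. Both camps in this thread are arithmetically right; they differ on which workload is "the general case".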
December 18, 2003 Re: Unicode discussion

Posted in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:brqr8e$vmh$1@digitaldaemon.com...
> Interestingly, it was rarely necessary to decode the UTF-8 strings. Far
> and away most operations on strings were copying them, storing them,
> hashing them, etc.

That is my experience as well. Either that or it's parsing them more or less linearly.

> > Please also keep in mind that a factor 4 will be compensated by memory
> > enhancements in only 1-2 years time.
>
> I don't agree that memory is improving that fast. Even if it is, people
> just load them up with more data to fill the memory up. I will agree
> that program code size is no longer that relevant, but data size is
> still pretty relevant. Stuff we were forced to do back in the bad old
> DOS 640k days seems pretty quaint now <g>.

Code size is actually still important in embedded apps (console video games) where the machine has a small code cache (8K or less). On the PS2, optimizing for size produces faster code in most cases than optimizing for speed.

> > Most people already have several hundred megabytes of RAM and it will
> > soon be gigabytes. Isn't it a bit shortsighted to make the lives of D
> > programmers harder forever, just to save a few megabytes of memory
> > that people will laugh about in 5 years (or already laugh about right
> > now)?
>
> D programmers can use dchars if they want to.

So you're saying that char[] means UTF-8, wchar[] means UTF-16, and dchar[] means UTF-32?

Unfortunately then a char won't hold a single Unicode character; you have to mix char and dchar.

It would be nice to have a library function to pull the first character out of a UTF-8 string and increment the iterator pointer past it.

dchar extractFirstChar(inout char* utf8string);

That seems like an insanely useful text processing function. Maybe the reverse as well:

void appendChar(char[] utf8string, dchar c);

Sean
December 18, 2003 Re: Unicode discussion

Posted in reply to Roald Ribe

"Roald Ribe" <rr.no@spam.teikom.no> wrote in message news:brsfkq$dpl$1@digitaldaemon.com...
> > D programmers can use dchars if they want to.
>
> The option to do so is the problem. Because the programmers from a
> latin-letter-using country will most likely choose chars, because that
> is what they are used to, and because it will perform better (on systems
> with too little RAM). And that will be a loss for the international
> applicability of D.
>
> Thanks to all who took the time to read my take on these issues.

You raise some good points. This issue should not be treated too lightly.

It should be possible to work with text as bytes (for performance when interfacing with legacy non-Unicode strings), but that should definitely not be the preferred way. I think that there should be no char or wchar, and that dchar should be renamed char. That way, if you see byte[] in the code you won't be tempted to think of it as a string, but more like raw data. UTF-8 can be well represented by byte[], and if you want to work directly with UTF-8, you can use a wrapper class from the D standard library.

Sean
December 18, 2003 Re: Unicode discussion

Posted in reply to Sean L. Palmer

> I think that there should be no char or wchar, and that dchar should be
> renamed char.

Sorry if I'm stating something I lack knowledge in, but if there were no wchar, what would you use to call the Windows wide API?

regards
December 18, 2003 Re: Unicode discussion

Posted in reply to Lewis

On Thu, 18 Dec 2003 15:27:02 -0500, Lewis wrote:

> > I think that there should be no char or wchar, and that dchar should
> > be renamed char.
>
> Sorry if I'm stating something I lack knowledge in, but if there were no
> wchar, what would you use to call the Windows wide API?

Most likely ushort[].

Regards

Elias Mårtenson
December 18, 2003 Re: Unicode discussion

Posted in reply to Sean L. Palmer

On Thu, 18 Dec 2003 10:49:31 -0800, Sean L. Palmer wrote:

> So you're saying that char[] means UTF-8, wchar[] means UTF-16, and
> dchar[] means UTF-32?
>
> Unfortunately then a char won't hold a single Unicode character; you
> have to mix char and dchar.

This is why I have advocated a rename of dchar to char, and the current char to something else (my first suggestion was utf8byte, but I can see why it was rejected off hand. :-) ).

> It would be nice to have a library function to pull the first character
> out of a UTF-8 string and increment the iterator pointer past it.
>
> dchar extractFirstChar(inout char* utf8string);
>
> That seems like an insanely useful text processing function. Maybe the
> reverse as well:
>
> void appendChar(char[] utf8string, dchar c);

At least my intention when starting this second round of discussion was to iron out what the "D way" of handling strings is, so we can get to work on these library functions that you request.

Regards

Elias Mårtenson
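The appendChar half of the request — encoding one code point and appending its UTF-8 bytes — is equally short. A hedged C++ sketch of the idea (my illustration, not library code; it assumes c is a valid Unicode scalar value, and the real D signature would take a char[] rather than a std::string):

```cpp
#include <cstdint>
#include <string>

// C++ analogue of the proposed D function
//   void appendChar(char[] utf8string, dchar c);
// Encodes c as UTF-8 and appends the bytes to s.
void appendChar(std::string& s, uint32_t c) {
    if (c < 0x80) {                       // 1 byte: 0xxxxxxx
        s += static_cast<char>(c);
    } else if (c < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        s += static_cast<char>(0xC0 | (c >> 6));
        s += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        s += static_cast<char>(0xE0 | (c >> 12));
        s += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (c & 0x3F));
    } else {                              // 4 bytes: 11110xxx and 3 trailers
        s += static_cast<char>(0xF0 | (c >> 18));
        s += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (c & 0x3F));
    }
}
```

Together with an extractFirstChar-style decoder, this pair covers the bulk of linear UTF-8 text processing without ever materializing a UTF-32 copy of the string.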
Copyright © 1999-2021 by the D Language Foundation