December 16, 2003 Re: Unicode discussion
Posted in reply to Carlos Santander B.

Carlos Santander B. wrote:
> "Elias Martenson" <elias-m@algonet.se> wrote in message
> news:brml3p$7hp$1@digitaldaemon.com...
> | for example). Why not use the same names as are used in C? mbstowcs()
> | and wcstombs()?
> |
>
> Sorry to ask, but what do those do? What do they stand for?
Ironically enough, your question answers Elias's question quite succinctly. ;)
-- andy
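For reference, the two C functions named above stand for "multibyte string to wide-character string" and the reverse; both convert between the current locale's multibyte encoding and wchar_t. Below is a minimal sketch of calling them from D through druntime's C bindings; the locale setup and buffer sizes are illustrative assumptions, not anything proposed in the thread.

```d
import core.stdc.locale : LC_ALL, setlocale;
import core.stdc.stddef : wchar_t;
import core.stdc.stdlib : mbstowcs, wcstombs;

void main()
{
    setlocale(LC_ALL, "");          // use the encoding from the environment

    const(char)* narrow = "hello";  // locale-encoded input (here plain ASCII)
    wchar_t[16] wide;
    auto widened = mbstowcs(wide.ptr, narrow, wide.length);    // narrow -> wide
    assert(widened == 5);

    char[64] back;
    auto narrowed = wcstombs(back.ptr, wide.ptr, back.length); // wide -> narrow
    assert(narrowed == 5);
}
```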

December 17, 2003 Re: Unicode discussion
Posted in reply to Walter

Walter wrote:
> "Sean L. Palmer" <palmer.sean@verizon.net> wrote in message
> news:brmeos$2v9c$1@digitaldaemon.com...
>
>>"Walter" <walter@digitalmars.com> wrote in message
>>
>>>One could design a wrapper class for them that
>>>overloads [] to provide automatic decoding if desired.
>>
>>The problem is that [] would be a horribly inefficient way to index UTF-8
>>characters. foreach would be ok.
>
> You're right.
Agreed. Some kind of iterator for strings is desperately needed.
May I ask that they be designed in such a way that they are compatible/consistent with other iterators, such as the collections and things like the break iterator (also for strings).
Regards
Elias Mårtenson
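A sketch of what code-point iteration over a string looks like in present-day D (std.utf and std.range did not exist in this form at the time of this thread): a foreach with a dchar loop variable decodes UTF-8 on the fly, while indexing and .length still work in raw code units.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.utf : byDchar;

void main()
{
    string s = "naïve";                // 5 code points, 6 UTF-8 code units

    foreach (dchar c; s)               // decodes one code point per step
        writeln(c);

    writeln(s.length);                 // 6: raw code units
    writeln(s.byDchar.walkLength);     // 5: decoded code points
}
```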

December 17, 2003 Re: Unicode discussion
Posted in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:brnurb$2bc5$1@digitaldaemon.com...
> > Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually
> > stems from the fact that according to the C standard wchar_t is not
> > Unicode. It's simply a "wide character".
>
> Frankly, I think the C standard is out to lunch on this. wchar_t should be
> unicode, and there really isn't a problem with using it as unicode. The C
> standard is also not helpful in the undefined size of wchar_t, or the sign
> of 'char'.

It's stupid to not agree on a standard size for char, since it's easy to "fix" the sign of a char register by biasing it by 128 (xor 0x80 works too), doing the operation, then biasing it again (un-biasing it). If all else fails, you can promote it. How often is this important anyway? If it's crucial, it's worth the time to emulate the sign if you have to. It is no good to run fast if the wrong results are generated. It's just a portability landmine, waiting for the unwary programmer, and shame on whoever let it get into a so-called "standard".

> > The Unix standard goes a step further and defines wchar_t to be a unicode
> > character. Obviously D goes the Unix route here (for dchar), and that is
> > very good.
> >
> > However, Windows defined wchar_t to be a 16-bit Unicode character back in
> > the days Unicode fit inside 16 bits. This is the same mistake Java did,
> > and we have now ended up with having UTF-16 strings internally.
>
> Windows made the right decision given what was known at the time, it was the
> unicode folks who goofed by not defining unicode right in the first place.

I still don't understand why they couldn't have packed all the languages that actually get used into the lowest 16 bits, and put all the crud like box-drawing characters and visible control codes and byzantine musical notes and runes and Aleutian indian that won't fit into the next 16 pages. There's lots of gaps in the first 65536 anyway. And probably plenty of overlap, duplicated symbols (lots of languages have the same characters, especially latin-based ones). Hell they should probably have done away with accented characters being distinct characters and enforced a combining rule from the start. But the Unicode standards body wanted to please the typesetters, as opposed to giving the world a computer encoding that would actually be usable as a common text-storage and processing medium. This thread shows just how convoluted Unicode really is.

I think someone can (and probably will) do better. Unfortunately I also believe that such an effort is doomed to failure.

Sean
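A small illustration of the "bias by 128 / xor 0x80" trick Sean mentions, written as a D sketch with byte/ubyte standing in for the two possible signednesses of C's char; the function names are made up for the example.

```d
// Flipping the top bit maps signed byte ordering onto unsigned byte
// ordering and back, so code can emulate whichever signedness the local
// 'char' type happens to lack.
bool lessAsUnsigned(byte a, byte b)
{
    // Compare two signed bytes as if they held raw 0..255 values.
    return cast(byte)(a ^ 0x80) < cast(byte)(b ^ 0x80);
}

bool lessAsSigned(ubyte a, ubyte b)
{
    // Compare two unsigned bytes as if they held two's-complement -128..127 values.
    return (a ^ 0x80) < (b ^ 0x80);
}

unittest
{
    assert(!lessAsUnsigned(cast(byte) 0xFF, 0)); // 255 is not < 0
    assert( lessAsUnsigned(1, cast(byte) 0x80)); // 1 < 128
    assert( lessAsSigned(0x80, 0x01));           // -128 < 1
    assert(!lessAsSigned(0x01, 0xFF));           // 1 is not < -1
}
```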

December 17, 2003 Re: Unicode discussion
Posted in reply to Walter

Walter wrote:
> "Elias Martenson" <elias-m@algonet.se> wrote in message
> news:brml3p$7hp$1@digitaldaemon.com...
>
>> As for the functions that handle individual characters, the first thing
>> that absolutely has to be done is to change them to accept dchar instead
>> of char.
>
> Yes.

Good, I like it. :-)

>> Which one of these should I use? Should I use all of them? Today, people
>> seem to use the first option, but UTF-8 is horribly inefficient
>> performance-wise.
>
> Do it as char[]. Have the internal implementation convert it to whatever
> format the underlying operating system API uses. I don't agree that UTF-8 is
> horribly inefficient (this is from experience, UTF-32 is much, much worse).

Memory-wise, perhaps. But for everything else UTF-8 is always slower. Consider what happens when the program is used with Russian: every single character will need special decoding, except punctuation of course. Now think about Chinese and Japanese. These are even worse.

>> Also, in the case of char and wchar strings, how do I access an
>> individual character? Unless I missed something, the only way today is
>> to use decode(). This is a fairly common operation which needs a better
>> syntax, or people will keep accessing individual elements using the
>> array notation (str[n]).
>
> It's fairly easy to write a wrapper class for it that decodes it
> automatically with foreach and [] overloads.

Indeed. But they will be slow. Now, personally I can accept the slowness. Again, it's your call. What we do need to make sure is that the string/character handling package that we build is comprehensive in terms of Unicode support, and also that every single string handling function handles UTF-32 as well as UTF-8. This way a developer who is having performance problems with the default UTF-8 strings can easily change his hotspots to work with UTF-32 instead.

>> I.e. the "string" data type would be a wrapper or supertype for the
>> three different string types.
>
> The best thing is to stick with one scheme for a program.

Unless the developer is bitten by the poor performance of UTF-8, that is. A package with Perl-like functionality would be horribly slow if it used UTF-8 rather than UTF-32. If we are to stick with UTF-8 as the default internal string format, UTF-32 must be available as an option, and it must be easy to use.

> For char types, yes. But not for UTF-16, and win32 internally is all UTF-16.
> There are no locale-specific encodings in UTF-16.

True. But I can't see any use for UTF-16 outside communicating with external Windows libraries. UTF-16 really is the worst of both worlds compared to UTF-8 and UTF-32. UTF-16 should really be considered the "native encoding" and left at that, just like [the content of LC_CTYPE] is the native encoding when run in Unix. The developer should be shielded from the native encoding in that he should be able to say: "convert my string to the encoding my operating system wants (i.e. the native encoding)". As it happens, this is what wcstombs() does.

>> In Unix the platform specific encoding is determined by the environment
>> variable LC_CTYPE, although the trend is to be moving towards UTF-8 for
>> all locales. We're not quite there yet though. Check out
>> http://www.utf-8.org/ for some information about this.
>
> Since we're moving to UTF-8 for all locales, D will be there with UTF-8 <g>.
> Let's look forward instead of those backward locale dependent encodings.

Agreed. I am heavily lobbying for proper Unicode support everywhere. I've been bitten by too many broken applications.

However, Windows has decided on UTF-16 and Unix has decided on UTF-8. We need a way of transparently inputting and outputting strings so that they are converted to whatever encoding the host operating system uses. If we don't do this we are going to end up with a lot of conditional code that checks which OS (and encoding) is being used.

> No, I think D will provide an optional filter for I/O which will translate
> to/from locale dependent encodings. Wherever possible, the UTF-16 API's will
> be used to avoid any need for locale dependent encodings.

Why UTF-16? There is no need to involve platform specifics at this level. Remember that UTF-16 can be considered platform specific for Windows.

> 'cuz I can never remember how they're spelled <g>.

All right... So how about adding to the utf8 package some functions called... hmm... nativeToUTF8() and nativeToUTF32(), and then an overloaded function utfToNative() (which accepts char[], wchar[] and dchar[])? "Native" in this case would be a byte[] or ubyte[], to point out that this form is not supposed to be used inside the program.

>> Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually
>> stems from the fact that according to the C standard wchar_t is not
>> Unicode. It's simply a "wide character".
>
> Frankly, I think the C standard is out to lunch on this. wchar_t should be
> unicode, and there really isn't a problem with using it as unicode. The C
> standard is also not helpful in the undefined size of wchar_t, or the sign
> of 'char'.

Indeed. That's why the Unix standard went a bit further and specified wchar_t to be a Unicode character. The problem is with Windows, where wchar_t is 16-bit and thus cannot hold a Unicode character. And so we end up with the current situation where using wchar_t on Windows really doesn't buy you anything, because you have the same problems as you would with UTF-8: you still cannot assume that a wchar_t can hold a single character, and you still need all the funky iterators and decoding stuff to be able to extract individual characters. This is why I'm saying that the UTF-16 in Windows is horrible, and that UTF-16 is the worst of both worlds.

> Windows made the right decision given what was known at the time, it was the
> unicode folks who goofed by not defining unicode right in the first place.

I agree 100%. Java is in the same boat. How many people know that from JDK 1.5 onwards it's a bad idea to use String.charAt()? (In JDK 1.5 the internal representation of String will change from UCS-2 to UTF-16.) In other words, the exact same problem Windows faced. The Unicode people argue that they never guaranteed that it was a 16-bit character set, and while this is technically true, they are really trying to cover up their mess.

> It already does that for string literals. I've thought about implicit
> conversions for runtime strings, but sometimes trouble results from too many
> implicit conversions, so I'm hanging back a bit on this to see how things
> evolve.

True. We suffer from this in C++ (costly implicit conversions) and it would be nice to be able to avoid this.

Regards

Elias Mårtenson
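The "convert your hotspots to UTF-32" strategy Elias describes can be sketched with present-day std.utf conversions (not available in this form in 2003); reverseCodePoints is a made-up example of an operation that is awkward on raw UTF-8 but trivial on dchar[].

```d
import std.algorithm : reverse;
import std.utf : toUTF32, toUTF8;

// Reverses a string code point by code point. On char[] this would mean
// stepping through variable-length UTF-8 sequences; on dchar[] it is a
// plain array operation. (Combining marks are deliberately ignored here.)
string reverseCodePoints(string s)
{
    dchar[] wide = toUTF32(s).dup;   // one-off conversion at the boundary
    reverse(wide);
    return toUTF8(wide);             // back to UTF-8 for storage and I/O
}

unittest
{
    assert(reverseCodePoints("abĉ") == "ĉba");
}
```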

December 17, 2003 Re: Unicode discussion
Posted in reply to Ben Hinkle

Ben Hinkle wrote:
>> char c = str[8999];
>> // now play happily(?) with the char "c" that probably isn't the
>> // 9000'th character and maybe was a part of a UTF-8 multi byte
>> // character
>
> which was why I suggested doing away with the generic "char" type entirely.
> If str was declared as an ascii array then it would be
>     ascii c = str[8999];
> which is completely safe and reasonable.

No, it would certainly NOT be safe. You must remember that ASCII doesn't exist anymore. It's a legacy character set. It's dead. Gone. Bye bye. And yes, sometimes it's needed for backwards compatibility, but in those cases it should be made explicit that you are throwing away information when converting.

> If it was declared as utf8[] then when the user writes
>     ubyte c = str[8999]
> and they don't have any a-priori knowledge about str they should feel very
> nervous, since I completely agree indexing into an arbitrary utf-8 encoded
> array is pretty meaningless. Plus in my experience using individual
> characters isn't that common - I'd say easily 90% of the time a variable is
> declared as char* or char[] rather than just char.

You are right. Actually it's probably more than 90%, especially when dealing with Unicode. Very often it's not allowed to split a Unicode string because of composite characters. However, you still need to be able to do individual character classification, such as isspace().

> By the way, I also think any utf8, utf16 and utf32 types should be aliased
> to ubyte, ushort, and uint. Should ascii be aliased to ubyte as well? I
> dunno.

ASCII has no business in a modern programming language.

> About Java and D: when I program in Java I never worry about the size of a
> char because Java is very different than C and you have to jump through
> hoops to call C. But when I program in D I feel like it is an extension of C
> like C++. Imagine if C++ decided that char should be 32 bits. That would
> have been very painful.

All I was suggesting was a renaming of the types, so that it's made explicit what type you have to use in order to be able to hold a single character. In D, this type is called "dchar"; char doesn't cut it. In C on Unix, it's called wchar_t. In C on Windows the type to use is called "int" or "long". And finally, in Java you have to use "int". In all of these languages, "char" is insufficient to hold a character. Don't you think it's logical that the data type that can hold a character is called "char"?

Regards

Elias Mårtenson
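Character classification of the kind Elias mentions (isspace() and friends) has to be applied to decoded code points, not to individual char array elements. Here is a sketch using present-day std.uni, which postdates this thread; countWords is an illustrative helper, not an existing API.

```d
import std.uni : isWhite;

// Counts whitespace-separated words. The foreach decodes the UTF-8
// input code point by code point, so classification never sees half of
// a multibyte sequence.
size_t countWords(string s)
{
    size_t words;
    bool inWord;
    foreach (dchar c; s)
    {
        if (isWhite(c))
            inWord = false;
        else if (!inWord)
        {
            ++words;
            inWord = true;
        }
    }
    return words;
}

unittest
{
    assert(countWords("räksmörgås och kaffe") == 3);
}
```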

December 17, 2003 Re: Unicode discussion
Posted in reply to Andy Friesen

Andy Friesen wrote:
> Carlos Santander B. wrote:
>
>> "Elias Martenson" <elias-m@algonet.se> wrote in message
>> news:brml3p$7hp$1@digitaldaemon.com...
>> | for example). Why not use the same names as are used in C? mbstowcs()
>> | and wcstombs()?
>> |
>>
>> Sorry to ask, but what do those do? What do they stand for?
>
> Ironically enough, your question answers Elias's question quite succinctly. ;)
Dang! How do you Americans say it? Three strikes, I'm out. :-)
Regards
Elias Mårtenson

December 17, 2003 Re: Unicode discussion
Posted in reply to Sean L. Palmer

Sean L. Palmer wrote:
> It's stupid to not agree on a standard size for char, since it's easy to
> "fix" the sign of a char register by biasing it by 128 (xor 0x80 works too),
> doing the operation, then biasing it again (un-biasing it). If all else
> fails, you can promote it. How often is this important anyway? If it's
> crucial, it's worth the time to emulate the sign if you have to. It is no
> good to run fast if the wrong results are generated. It's just a
> portability landmine, waiting for the unwary programmer, and shame on
> whoever let it get into a so-called "standard".

C doesn't define any standard sizes at all (well, you do have stdint.h these days). This is both a curse and a blessing. More often than not, it's a curse though.

> I still don't understand why they couldn't have packed all the languages
> that actually get used into the lowest 16 bits, and put all the crud like
> box-drawing characters and visible control codes and byzantine musical notes
> and runes and Aleutian indian that won't fit into the next 16 pages.
> There's lots of gaps in the first 65536 anyway. And probably plenty of
> overlap, duplicated symbols (lots of languages have the same characters,
> especially latin-based ones). Hell they should probably have done away with
> accented characters being distinct characters and enforced a combining rule
> from the start. But the Unicode standards body wanted to please the
> typesetters, as opposed to giving the world a computer encoding that would
> actually be usable as a common text-storage and processing medium. This
> thread shows just how convoluted Unicode really is.
>
> I think someone can (and probably will) do better. Unfortunately I also
> believe that such an effort is doomed to failure.

Agreed. Unicode has a lot of cruft. One of my favourite pet peeves is this pair of characters:

    00C5 Å: LATIN CAPITAL LETTER A WITH RING ABOVE
    212B Å: ANGSTROM SIGN

The comment even says that the preferred representation is the Latin Å.

But, like you say, trying to do it once again will not succeed. It has taken us 10 or so years to get where we are. I'd say we accept Unicode for what it is. It's a hell of a lot better than the previous mess.

Regards

Elias Mårtenson
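The duplication Elias points out is exactly what canonical normalization papers over: NFC maps the ANGSTROM SIGN onto the Latin letter it duplicates. A sketch using present-day std.uni (nothing like it existed in Phobos in 2003):

```d
import std.uni : normalize, NFC, NFD;

unittest
{
    string angstrom = "\u212B";                 // ANGSTROM SIGN
    string aRing    = "\u00C5";                 // LATIN CAPITAL LETTER A WITH RING ABOVE

    assert(normalize!NFC(angstrom) == aRing);   // canonical equivalence
    assert(normalize!NFD(aRing) == "A\u030A");  // A + COMBINING RING ABOVE
}
```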

December 17, 2003 Re: Unicode discussion
Posted in reply to Walter

> In a higher level language, yes. But in doing systems work, one always
> seems to be looking at the lower level elements anyway. I wrestled with
> this for a while, and eventually decided that char[], wchar[], and dchar[]
> would be low level representations. One could design a wrapper class for
> them that overloads [] to provide automatic decoding if desired.
Shouldn't this wrapper be part of Phobos?
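Such a wrapper never made it into Phobos in this exact form; a minimal sketch of what Walter describes might look like the following, with DecodedString a made-up name and present-day std.utf assumed. Note that [] stays O(n) by nature, which is Sean's efficiency objection, while foreach decodes sequentially and cheaply.

```d
import std.utf : decode, stride;

// A read-only view over UTF-8 data whose [] and foreach yield decoded
// dchars instead of raw code units.
struct DecodedString
{
    const(char)[] data;

    // n-th code point. Linear time: UTF-8 has no random access.
    dchar opIndex(size_t n) const
    {
        size_t i = 0;
        foreach (_; 0 .. n)
            i += stride(data, i);   // skip n whole code points
        return decode(data, i);
    }

    // Sequential decoding for foreach; this is the cheap operation.
    int opApply(scope int delegate(dchar) dg) const
    {
        size_t i = 0;
        while (i < data.length)
            if (auto r = dg(decode(data, i)))
                return r;
        return 0;
    }
}

unittest
{
    auto s = DecodedString("dé");
    assert(s[1] == 'é');            // second code point (spans two code units)
    size_t count;
    foreach (dchar c; s)            // uses opApply
        ++count;
    assert(count == 2);
}
```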

December 17, 2003 Re: Unicode discussion
Posted in reply to Elias Martenson

Elias Martenson wrote:
>> accepting an ascii* format string, not a char* as it is currently declared
>> (same for fopen etc etc).
>
>
> But printf() works very well with UTF-8 in most cases.
None of these alternatives is correct. printf will only work with UTF-8 if the string data is plain ASCII, or if UTF-8 happens to be the current system code page. And ASCII alone will only work on English systems, which is even worse.
As I said before, the C functions should be passed strings encoded in the current system code page. That way all strings that are written in the system language will be printed perfectly. Also, characters that are not in the code page can be replaced with ? during the conversion, which is better than having printf output garbage.
Hauke
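A sketch of the conversion Hauke argues for: re-encode a D (UTF-8) string into the current C locale's multibyte encoding before handing it to printf and friends. It goes through C's wcstombs via druntime; the helper name and the buffer policy are illustrative assumptions, and on Windows (16-bit wchar_t) the widening step is only an approximation.

```d
import core.stdc.locale : LC_ALL, setlocale;
import core.stdc.stddef : wchar_t;
import core.stdc.stdio  : printf;
import core.stdc.stdlib : wcstombs;
import std.conv : to;

const(char)[] toLocaleEncoding(string s)
{
    // Widen to the platform's wchar_t (UTF-32 on most Unix systems),
    // then let the C library narrow it to the locale's encoding.
    // Unrepresentable characters make wcstombs report failure.
    auto wide = to!(immutable(wchar_t)[])(s) ~ cast(wchar_t) 0;
    auto buf  = new char[](wide.length * 4);   // generous worst-case bound
    auto n    = wcstombs(buf.ptr, wide.ptr, buf.length);
    return n == cast(size_t) -1 ? null : buf[0 .. n];
}

void main()
{
    setlocale(LC_ALL, "");                     // pick up the user's locale
    auto msg = toLocaleEncoding("Grüße, Welt\n") ~ '\0';
    printf("%s", msg.ptr);
}
```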

December 17, 2003 Re: Unicode discussion
Posted in reply to Walter

Walter wrote:
>> This is simply not true, Walter. The world has not gotten used to
>> multibyte chars in C at all.
>
> Multibyte char programming in C has been common on the IBM PC for 20 years
> now (my C compiler has supported it for that long, since it was distributed
> to an international community), and it was standardized into C in 1989. I
> agree that many ignore it, but that's because it's badly designed. Dealing
> with locale-dependent encodings is a real chore in C.

Right, it has been around for decades. And people still don't use it properly. Don't make that same mistake again! I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.

>> Also, pure D code will
>> automatically be UTF-32, which is exactly what you need if you want to
>> make the lives of newbies easier. Otherwise people WILL end up using
>> ASCII strings when they start out.
>
> Over the last 10 years, I wrote two major internationalized apps. One used
> UTF-8 internally, and converted other encodings to/from it on input/output.
> The other used wchar_t throughout, and was ported to win32 and linux which
> mapped wchar_t to UTF-16 and UTF-32, respectively.
>
> The former project ran much faster, consumed far less memory, and (aside
> from the lack of support from C for UTF-8) simply had far fewer problems.
> The latter was big and slow. Especially on linux, with the wchar_t's being
> UTF-32, it really hogged the memory.

Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character, so you may end up using even more memory with UTF-8.

And about computing complexity: if you ignore the overhead introduced by having to move more (or sometimes less) memory, then manipulating UTF-32 strings is a LOT faster than UTF-8, simply because random access is possible and you do not have to perform an expensive decode operation on each character.

Also, how much text did your "bad experience" application use? It seems to me that even if you assume the best case for UTF-8 (i.e. one byte per character), the memory overhead should not be much of an issue. It's only a factor of 4, after all. So assuming that your application uses 100,000 lines of text (which is a lot more than anything I've ever seen in a program), each 100 characters long, with everything held in memory at once, then you'd end up requiring 10 MB for UTF-8 and 40 MB for UTF-32. These are hardly numbers that will bring a modern OS to its knees anymore. In a few years this might even fit completely into the CPU's cache!

I think it's more important to have proper localization ability and programming ease than to try to conserve a few bytes for a limited group of people (i.e. English speakers). Being greedy with memory consumption when making long-term design decisions has always caused problems. For instance, it caused that major Y2K panic in the industry a few years ago! Please also keep in mind that a factor of 4 will be compensated for by memory improvements in only 1-2 years' time. Most people already have several hundred megabytes of RAM, and it will soon be gigabytes.

Isn't it a bit shortsighted to make the lives of D programmers harder forever, just to save a few megabytes of memory that people will laugh about in 5 years (or already laugh about right now)?

Hauke
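A rough illustration of the storage trade-off being debated here, using present-day std.utf conversions; the byte counts in the comments assume the Cyrillic sample shown.

```d
import std.stdio : writefln;
import std.utf : toUTF16, toUTF32;

void main()
{
    string ru = "Привет, мир";                             // 11 code points
    writefln("UTF-8 : %s bytes", ru.length);               // 20 (2 per Cyrillic letter)
    writefln("UTF-16: %s bytes", toUTF16(ru).length * 2);  // 22
    writefln("UTF-32: %s bytes", toUTF32(ru).length * 4);  // 44
}
```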