September 28, 2004
Arcane Jill wrote:
> *) If you just want basic Unicode support which works in all but exceptional
> circumstances, you can make do with UTF-16, and the pretence that characters are
> 16-bits wide.
I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars?
*more clueless*


-ben
September 28, 2004
In article <cjc38q$2jna$2@digitaldaemon.com>, Benjamin Herr says...
>
>Arcane Jill wrote:
>> *) If you just want basic Unicode support which works in all but exceptional circumstances, you can make do with UTF-16, and the pretence that characters are 16-bits wide.
>I guess I really do not get it. I thought I was just told that
>codepoints might be only 16-bits wide but that I always have to account
>for multi-codepointy chars?
>*more clueless*

I think what Jill was saying is that in most cases, UTF-16 will represent any character you care about with a single wchar (ie. in 16 bits).  So if you code an application to use wchars you can generally pretend as if there is a 1 to 1 correspondence between wchars and characters.  It's *possible* that some users (Chinese perhaps) could break your application, but if this isn't your target market then it may not be a concern.  I think the point is that if you're worried that dchars will use up too much memory, you can usually get away with pretending UTF-16 is not a multi-char encoding scheme.
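
A quick D sketch of that pretence (written with today's D/Phobos syntax; the sample text is only illustrative, and counting via a foreach over dchar does the proper decoding):

import std.stdio;

void main()
{
    // Every character here fits in a single wchar, so the code-unit count
    // and the character count agree.
    wstring bmp = "Grüße, κόσμε";
    size_t count;
    foreach (dchar c; bmp) ++count;   // decodes UTF-16, counts real characters
    writefln("%s wchars, %s characters", bmp.length, count);

    // U+1D11E (musical symbol G clef) needs a surrogate pair: two wchars
    // for one character, and the pretence breaks down.
    wstring clef = "\U0001D11E";
    writefln("%s wchars for one character", clef.length);   // prints 2
}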


Sean


September 28, 2004
Benjamin Herr Schrieb:
>> *) If you just want basic Unicode support which works in all but exceptional circumstances, you can make do with UTF-16, and the pretence that characters are 16-bits wide.
> I guess I really do not get it. I thought I was just told that
> codepoints might be only 16-bits wide but that I always have to account
> for multi-codepointy chars?
> *more clueless*

Potentially codepoints are 64 bit. The highest currently assigned codepoint fits in 32 bits. For the majority of living languages the codepoints fit in 16 bits.
The bit-size of a codepoint has nothing to do with multi-codepoint "chars". Again, if you ensure that neither Korean/Hebrew/Arabic, (zero-width) joiners nor combining accents are used, you can treat a 16-bit char as a "character" in most cases. Exceptions: sorting, display and advanced text analysis.
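
A small D sketch of the combining-accent case (the character choice is just an illustration; today's D syntax):

import std.stdio;

void main()
{
    // One visible "character", two codepoints: 'a' plus U+0301, the
    // combining acute accent.
    dstring s = "a\u0301";
    writefln("%s codepoints", s.length);   // prints 2, even in UTF-32

    // Slicing per codepoint can split the accent from its base letter.
    dstring base = s[0 .. 1];              // just "a"; the accent is lost
    writeln(base);
}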

Thomas
September 29, 2004
In article <cjal85$1oia$1@digitaldaemon.com>, Jaap Geurts says...
>
>David,
>
>I've examined your wstring library, and noticed that the
>case(islower,isupper) family functions cannot do other languages than plain
>latin ascii. Am I right in this?
>What is needed I guess is for the user to supply a conversion table (are the
>functions in phobos suitable?). I don't know enough about locale support in
>OS's but if it is not available there we'd have to code it into the lib.
>
>I'll do some probing about how to code it first and if you wish I can provide you the one for Vietnamese.
>
>Regards, Jaap
>

Jaap: Currently, for anything Unicode-based I've been waiting on the work that Arcane Jill is doing. StringW.d was mainly created to make it easier to work with 16-bit characters (string.d made it a real pain...you nearly have to cast everything), and hopefully in turn it will work with Windows' 16-bit wide-character API functions. But at this point I haven't tested it, plus I don't understand enough to know the real difference between the 16-bit characters and unicode characters (some real example data and code would be helpful in this area...Jill?, Ben?, and/or anyone?).

Anywayz, needless to say, I've mirrored string.d functions like tolower() and toupper(), plus my very own asciiProperCase() function, to still work on ascii characters only. In my last reply I mainly wanted to point you to where stringw.d could be found in case you found it useful, and to let you know that if you needed anything string.d has that's missing in it, I would add it. I hope I didn't give the impression that it did unicode. Also, I'm afraid I don't know much about "locale support" either. But if you do something in that area I wouldn't mind taking a look at it. :))

Good Luck in your project,
David L.

-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
September 29, 2004
In article <cjc38q$2jna$2@digitaldaemon.com>, Benjamin Herr says...
>
>Arcane Jill wrote:
>> *) If you just want basic Unicode support which works in all but exceptional circumstances, you can make do with UTF-16, and the pretence that characters are 16-bits wide.
>I guess I really do not get it. I thought I was just told that
>codepoints might be only 16-bits wide but that I always have to account
>for multi-codepointy chars?
>*more clueless*

Head out to www.unicode.org and check out their various FAQs. They do a much better job at explaining things than I.


For what it's worth, here's my potted summary:

"code unit" = the technical name for a single primitive fragment of either UTF-8, UTF-16 or UTF-32 (that is, the value held in a single char, wchar or dchar). I tend to use the phrases UTF-8 fragment, UTF-16 fragment and UTF-32 fragment to express this concept.

"code point" = the technical name for the numerical value associated with a character. In Unicode, valid codepoints go from 0 to 0x10FFFF inclusive. In D, a codepoint can only be stored in a dchar.

"character" = officially, the smallest unit of textual information with semantic meaning. Practically speaking, this means either (a) a control code; (b) something printable; or (c) a combiner, such as an accent you can place over another character. Every character has a unique codepoint. Conversely, every codepoint in the range 0 to 0x10FFFF corresponds to a unique Unicode character. Unicode characters are often written in the form U+#### (for example, U+20AC, which is the character corresponding to codepoint 0x20AC).

As an observation, over 99% of all the characters you are likely to use, and which are involved in text processing, will occur in the range U+0000 to U+FFFF. Therefore an array of sixteen-bit values interpreted as characters will likely be sufficient for most purposes. (A UTF-16 string may be interpreted in this way). If you want that extra 1%, as some apps will, you'll need to go the whole hog and recognise characters all the way up to U+10FFFF.
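
As a quick illustration of code units versus codepoints, here is a minimal D sketch (today's D syntax; U+20AC, the euro sign mentioned above, is the example character):

import std.stdio;

void main()
{
    // One codepoint, three different code-unit counts.
    string  a = "\u20AC";   // UTF-8:  3 chars  (E2 82 AC)
    wstring b = "\u20AC";   // UTF-16: 1 wchar  (20AC)
    dstring c = "\u20AC";   // UTF-32: 1 dchar  (000020AC)

    writefln("UTF-8: %s, UTF-16: %s, UTF-32: %s code units",
             a.length, b.length, c.length);   // prints 3, 1, 1
}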

"grapheme" = a printable base character which may have been modified by zero or more combining characters (for example 'a' followed by combining-acute-accent).

"glyph" = one or more graphemes glued together to form a single printable symbol. The Unicode character zero-width-joiner usually acts as the glue.

For more detailed information, as I suggested above, please feel free to go to the Unicode website, and get all the details from the people who organize the whole thing.

Arcane Jill


September 29, 2004
In article <cjc8ag$2nb2$1@digitaldaemon.com>, Thomas Kuehne says...

>Potentially codepoints are 64 bit.

First I've heard of it. Do you have a source for this information?

So far as I am aware, the UC are /adamant/ that they will never go beyond 21 bits. Programming languages tend to use 32 bits because (a) 32 bits is a more natural length for computers, and (b) they're not taking chances - once upon a time the UC thought that 16 bits would be sufficient. But I have never heard /anyone/ claim that codepoints are potentially 64 bits before. Whence does this originate?

Arcane Jill


September 29, 2004
In article <cjc7pb$2n3d$1@digitaldaemon.com>, Sean Kelly says...
>

>I think what Jill was saying is that in most cases, UTF-16 will represent any character you care about with a single wchar (ie. in 16 bits).  So if you code an application to use wchars you can generally pretend as if there is a 1 to 1 correspondence between wchars and characters.  It's *possible* that some users (Chinese perhaps) could break your application, but if this isn't your target market then it may not be a concern.  I think the point is that if you're worried that dchars will use up too much memory, you can usually get away with pretending UTF-16 is not a multi-char encoding scheme.
>
>Sean

Yes, exactly. And to some extent, the same is also true of UTF-8 if your application only cares about ASCII. /Many/ algorithms will work just fine if you pretend that /UTF-8/ is a character set, and that a char[] is an actual string of 8-bit-wide "characters". For example, concatenation (strcat, ~); finding a character or a substring (strchr, strstr, find); splitting on boundaries determined by strchr/strstr/find; tokenizing using ASCII separators such as space or tab; identification of C/C++/D comments; parsing XML; ... the list is endless. So long as you don't try to interpret or manipulate the characters you don't "understand", these encodings are robust enough to withstand most other manipulations.
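
Here is a rough D sketch of that kind of code-unit-level processing (the name indexOf is taken from present-day Phobos std.string and may be spelt differently in other releases; the data is made up):

import std.stdio;
import std.string : indexOf;

void main()
{
    // UTF-8 text containing multi-byte sequences.
    string s = "price: 42\u20AC\tname: Grüße";

    // Splitting on an ASCII tab is safe at the code-unit level: no byte of a
    // multi-byte UTF-8 sequence is ever below 0x80, so nothing is cut in half.
    auto tab = s.indexOf('\t');
    writeln(s[0 .. tab]);        // "price: 42€"
    writeln(s[tab + 1 .. $]);    // "name: Grüße"

    // Concatenation likewise needs no decoding.
    writeln(s[0 .. tab] ~ " (EUR)");
}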

The major reason for preferring UTF-16 over UTF-8, however, is that single UTF-16 code units are likely to cover over 99% of all the characters in which you are likely to be interested. The same cannot be said of UTF-8, whose single code units cover only the ASCII characters.

The major reason for preferring UTF-16 over UTF-32 is that you get a lot of wasted space with UTF-32. As noted above, >99% of your characters will only need two bytes, so that's two bytes of zeroes for each such character. Even the characters beyond U+FFFF are still guaranteed to have /over one third/ of their bits unused. UTF-32 text files (and strings), therefore, /will/ have between a third and a half (and maybe even more if the text is mostly ASCII) of all of their bits wasted.

So it's just a space/speed compromise, that's all. But a pretty good one in most cases.

Jill


September 29, 2004
Everyone: Oops!!! Sorry about the repost everyone. I had a bad storm in my area last night and my connection to the internet wasn't working right, so I didn't think my message had gotten posted. Again sorry.

David L.

-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
September 29, 2004
In article <cje51f$q8t$1@digitaldaemon.com>, David L. Davis says...

>plus I don't
>understand enough to know the real difference between the 16-bit characters and
>unicode characters (some real example data and code would be helpful in this
>area...Jill?, Ben?, and/or anyone?).

Unlike UTF-8, UTF-16 is very cunning - and this is basically because Unicode and UTF-16 were designed together, to work with each other. Here's how it works - there are two different perspectives: the 16-bit perspective, and the 21-bit perspective.

In the 21-bit perspective, characters run from U+0000 to U+10FFFF - /but/, the range U+D800 to U+DFFF is illegal and invalid. There are /no/ Unicode characters in this range. Any application built to view the Unicode world from this point of view should be prepared to correctly handle and display all valid characters (which excludes U+D800 to U+DFFF).

In the 16-bit perspective, characters run from U+0000 to U+FFFF - and, in this world, the characters U+D800 to U+DFFF are just hunky dory. In this perspective, they are called "surrogate characters". They always occur in pairs, with a high surrogate (a character in the range U+D800 to U+DBFF) always immediately followed by a low surrogate (a character in the range U+DC00 to U+DFFF). There are plenty of applications built to view the Unicode world from this point of view (in particular, legacy applications written before Unicode 3.0, when all Unicode characters actually /were/ 16 bits wide).

Let's take an example: the Unicode character U+1D11E (musical symbol G clef). When viewed by an application which sees 21-bit wide characters, what you see is U+1D11E, which you interpret as a single character, and display as ... well ... musical symbol G clef.

A legacy 16-bit-Unicode application looking at the same text file (assuming it to have been saved in UTF-16) will see two "characters": U+D834 followed by U+DD1E. (These are the UTF-16 fragments which together represent U+1D11E). Such an application may safely interpret these wchars as "unknown character" followed by "unknown character", and nothing will break. A slightly more sophisticated application might even interpret them as "high surrogate" followed by "low surrogate", and still, nothing would break. These pseudo-characters would likely both display as "unknown character" glyphs, but some fonts may give high surrogates a different glyph from low surrogates. (And, indeed, the Mac's "last chance" fallback font will actually display each pseudo-character as a tiny little hex representation of its codepoint!)

Of course, all of this will fail completely if UTF-8 is used instead of UTF-16. In UTF-8, the representation of U+1D11E is: F0 9D 84 9E. Every UTF-8-aware application will decode this as 0x1D11E, and an application which is unaware of characters beyond U+FFFF would fall over badly here. (It might even truncate it to U+D11E: Hangul syllable TYAELM). But of course, you can still transcode into UTF-16 and deal with it that way - which is another reason why UTF-16 is very good for the internal workings of an application.
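
A short D sketch makes both encodings visible (using std.utf's conversion routines as found in today's Phobos; the expected output is in the comments):

import std.stdio;
import std.utf : toUTF8, toUTF16;

void main()
{
    // U+1D11E, musical symbol G clef: a single codepoint above U+FFFF.
    dstring s = "\U0001D11E";

    // UTF-16: a surrogate pair, two code units.
    foreach (wchar c; toUTF16(s)) writef("%04X ", cast(uint) c);
    writeln();   // prints: D834 DD1E

    // UTF-8: four code units.
    foreach (char c; toUTF8(s)) writef("%02X ", cast(uint) c);
    writeln();   // prints: F0 9D 84 9E
}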

Arcane Jill

PS. It is worth noting that the vast majority of fonts available today which are either free or come bundled with an OS do not render characters beyond U+FFFF at all. In fact, I have yet to find /even one/ free font which contains U+1D11E (musical symbol G clef). [I would be very happy to be shown to be wrong on this point - anyone know of one?]. This means that if you stick such characters in a web page, nobody will be able to see them - so you'll have to use a gif after all. :( Unicode may be the future, but sadly it is not the present.