January 18, 2003 | Re: Unicode in D
Posted in reply to Theodore Reed

> UTF-8 can encode everything in the Unicode 32-bit range. (Although it takes like 8 bytes towards the end.)

0x00..0x7F        --> 1 byte  - ASCII
0x80..0x7FF       --> 2 bytes - Latin extended, Greek, Cyrillic, Hebrew, Arabic, etc...
0x800..0xFFFF     --> 3 bytes - most of the scripts in use
0x10000..0x10FFFF --> 4 bytes - rare/dead/... scripts
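For reference, the byte counts above translate directly into code. A minimal D sketch; the function name `utf8Length` is an illustrative choice here, not an existing library routine:

```d
// Number of UTF-8 code units (bytes) needed for a Unicode code point,
// following the ranges listed above.
uint utf8Length(dchar c)
{
    if (c <= 0x7F)     return 1; // ASCII
    if (c <= 0x7FF)    return 2; // Latin extended, Greek, Cyrillic, ...
    if (c <= 0xFFFF)   return 3; // most scripts in use
    if (c <= 0x10FFFF) return 4; // supplementary planes
    assert(0, "not a Unicode code point");
}

unittest
{
    assert(utf8Length('\u00E9') == 1 + 1);      // é
    assert(utf8Length('\u20AC') == 3);          // €
    assert(utf8Length('\U0001F600') == 4);      // an emoji, outside the BMP
}
```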
January 18, 2003 | Re: Unicode in D
Posted in reply to Walter

> 3) Is Win32's "wide char" really UTF-16, including the multi word encodings?

WinXP, WinCE : UTF-16
Win2K : was UCS-2, but some service pack made it UTF-16
WinNT4 : UCS-2
Win9x : must die.

> 5) 16 bit accesses on Intel CPUs can be pretty slow compared to byte or dword accesses (varies by CPU type).

The 16-bit prefix can slow down instruction decoding (mostly for Intel CPUs, but the P4 uses pre-decoded instructions anyhow), while instruction processing is more cache- and branch-sensitive.

> 6) Sure, UTF-16 reduces the frequency of multi character encodings, but the code to deal with it must still be there and must still execute.

Just an idea: a string class may keep 2 values for the string length:
1 - the number of "units" (8-bit for UTF-8, 16-bit for UTF-16)
2 - the number of characters.
If these numbers are equal, the string processing library may use simplified and faster functions.

> 7) I've converted some large Java text processing apps to C++, and converted the Java 16 bit char's to using UTF-8. That change resulted in *substantial* performance improvements.
>
> 8) I suspect that 99% of the text processed in computers is ascii. UTF-8 is a big win in memory and speed for processing english text.

You think that 99% of computer users are English-speaking? Think again...

btw, something about UTF-8 & UTF-16 efficiency:
http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results
For Latin-script based languages, UTF-8 takes ~51% of the space of UTF-16.
For Greek (expect the same for Cyrillic), ~88% - not that much better than UTF-16.
For Japanese, Chinese, Korean, Hindi: 115%..140% - UTF-16 is more space efficient.
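The two-length idea in point 6 can be sketched as a thin wrapper that stores both counts and takes a fast path when they agree. The type below is invented for illustration, and `std.utf.count`/`decode` are present-day Phobos routines, not anything that was available when this was posted:

```d
// Sketch of a string wrapper that tracks both lengths, as suggested above.
// When chars == data.length the text is pure ASCII and can be indexed directly.
struct CountedString
{
    char[] data;     // UTF-8 code units
    size_t chars;    // decoded character (code point) count

    this(char[] s)
    {
        import std.utf : count;   // counts code points in a UTF string
        data  = s;
        chars = count(s);
    }

    dchar charAt(size_t i)
    {
        if (chars == data.length)          // fast path: 1 unit == 1 char
            return data[i];
        import std.utf : decode;           // general path: walk the UTF-8
        size_t idx = 0;
        dchar c;
        foreach (_; 0 .. i + 1)
            c = decode(data, idx);
        return c;
    }
}

unittest
{
    auto ascii = CountedString("hello".dup);
    assert(ascii.chars == ascii.data.length);   // fast path applies
    assert(ascii.charAt(1) == 'e');

    auto mixed = CountedString("héllo".dup);
    assert(mixed.chars == 5 && mixed.data.length == 6);
    assert(mixed.charAt(1) == '\u00E9');
}
```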
January 18, 2003 | Re: Unicode in D
Posted in reply to Mike Wynn

UTF-8 does lead to the problem of what is meant by:

    char[] c;
    c[5]

Is it the 5th byte of c[], or the 5th decoded 32 bit character? Saying it's the 5th decoded character has all kinds of implications for slicing and .length.

8 bit ascii isn't a problem, just cast it to a byte[], as in:

    byte[] b = cast(byte[])c;

I'm not sure about the Java 00 issue, I didn't think Java supported UTF-8. D does not have the "what to do about embedded 0" problem, as the length is carried along separately.

"Mike Wynn" <mike.wynn@l8night.co.uk> wrote in message news:b0a8eg$ivc$1@digitaldaemon.com...
> "Walter" <walter@digitalmars.com> wrote in message news:b0a7ft$iei$1@digitaldaemon.com...
> > "Burton Radons" <loth@users.sourceforge.net> wrote in message news:b0a6rl$i4m$1@digitaldaemon.com...
> > > Walter wrote:
> > > > 10) Interestingly, making char[] in D to be UTF-8 does not seem to step on or prevent dealing with wchar_t[] arrays being UTF-16.
> > > You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.
> > I think making char[] a UTF-8 is the right way.
>
> I would be more in favor of a String class that was utf8 internally.
> The problem with utf8 is that the number of bytes and the number of chars are dependent on the data.
> char[] to me implies an array of char's, so
>     char [] foo ="aa"\0x0555;
> is 4 bytes, but only 3 chars.
> So what is foo[2]? And what if I set foo[1] = \0x467?
> And what about wanting 8 bit ascii strings?
>
> If you are going UTF8 then think about the minor extension Java added to the encoding by allowing a two byte 0, which allows embedded 0 in strings without messing up the C strlen (which returns the byte length).
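The gap between "the 5th byte" and "the 5th decoded character" is easy to demonstrate; a small sketch using present-day `std.utf`, assumed here purely for illustration:

```d
import std.utf : decode;

void main()
{
    char[] c = "aé€z".dup;        // mixed 1-, 2- and 3-byte sequences

    // Indexing gives the raw code unit (a byte), exactly as described above:
    assert(c[1] == 0xC3);         // first byte of the two-byte 'é', not 'é'

    // Getting the nth decoded character requires walking the string:
    size_t idx = 0;
    dchar nth;
    foreach (_; 0 .. 3)           // decode the first three code points
        nth = decode(c, idx);
    assert(nth == '\u20AC');      // '€'

    // Treating the data as plain bytes is just a cast, as in the post:
    byte[] raw = cast(byte[]) c;
    assert(raw.length == c.length);
}
```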
January 18, 2003 | Re: Unicode in D
Posted in reply to Serge K

"Serge K" <skarebo@programmer.net> wrote in message news:b0anmt$r7g$1@digitaldaemon.com...
> > 3) Is Win32's "wide char" really UTF-16, including the multi word encodings?
> WinXP, WinCE : UTF-16
> Win2K : was UCS-2, but some service pack made it UTF-16
> WinNT4 : UCS-2
> Win9x : must die.

LOL! Looking forward, then, one can treat it as UTF-16.

> > 8) I suspect that 99% of the text processed in computers is ascii. UTF-8 is a big win in memory and speed for processing english text.
> You think that 99% of computer users are English-speaking?

Not at all. But the text processed - yes. But I imagine it would be pretty tough to come by figures for that that are better than speculation.

> something about UTF-8 & UTF-16 efficiency:
> http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results
> For Latin-script based languages, UTF-8 takes ~51% of the space of UTF-16.
> For Greek (expect the same for Cyrillic), ~88% - not that much better than UTF-16.
> For Japanese, Chinese, Korean, Hindi: 115%..140% - UTF-16 is more space efficient.

Thanks for the info. That's about what I would have guessed. Another valuable statistic would be how well UTF-8 compresses with LZW as opposed to the same thing in UTF-16.
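The size ratios quoted above can be spot-checked for any given piece of text. A rough sketch using today's `std.utf` conversions (again an assumption; nothing like this existed in the 2003 library):

```d
import std.utf : toUTF8, toUTF16;
import std.stdio : writefln;

void main()
{
    // Latin-script text: UTF-8 is roughly half the size of UTF-16.
    // Greek/Cyrillic text: mostly 2-byte UTF-8 sequences, so the gap shrinks.
    // CJK text: mostly 3-byte UTF-8 sequences, so UTF-16 wins.
    string[] samples = ["hello world", "γεια σου κόσμε", "こんにちは世界"];

    foreach (s; samples)
    {
        auto u8  = toUTF8(s).length;          // bytes
        auto u16 = toUTF16(s).length * 2;     // code units * 2 bytes each
        writefln("%s: UTF-8 %s bytes, UTF-16 %s bytes", s, u8, u16);
    }
}
```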
January 18, 2003 | Re: Unicode in D
Posted in reply to Daniel Yokomiso

I once wrote a large project that dealt with mixed ascii and unicode. There was bug after bug when the two collided. Finally, I threw in the towel and made the entire program unicode - every string in it.

The trouble in D is that in the current scheme, everything dealing with text has to be written twice, once for char[] and again for wchar_t[]. In C, there's that wretched tchar.h to swap back and forth. It may just be easier in the long run to just make UTF-8 the native type, and then at least try and make sure the standard D library is correct.

-Walter

"Daniel Yokomiso" <daniel_yokomiso@yahoo.com.br> wrote in message news:b0agdq$ni9$1@digitaldaemon.com...
> "globalization guy" <globalization_member@pathlink.com> wrote in message news:b05pdd$13bv$1@digitaldaemon.com...
> > I think you'll be making a big mistake if you adopt C's obsolete char == byte concept of strings. Savvy language designers these days realize that, like int's and float's, char's should be a fundamental data type at a higher level of abstraction than raw bytes. The model that most modern language designers are turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.
> >
> > If you do so, you make it possible for strings in your language to have a single, canonical form that all APIs use. Instead of the nightmare that C/C++ programmers face when passing string parameters ("now, let's see, is this a char* or a const char* or an ISO C++ string or an ISO wstring or a wchar_t* or a char[] or a wchar_t[] or an instance of one of countless string classes...?"). The fact that not just every library but practically every project feels the need to reinvent its own string type is proof of the need for a good, solid, canonical form built right into the language.
> >
> > Most language designers these days either get this from the start or they later figure it out and have to screw up their language with multiple string types.
> >
> > Having canonical UTF-16 chars and strings internally does not mean that you can't deal with other character encodings externally. You can convert to canonical form on import and convert back to some legacy encoding on export. When you create the strings yourself, or when they are created in Java or C# or Javascript or default XML or most new text protocols, no conversion will be necessary. It will only be needed for legacy data (or a very lightweight switch between UTF-8 and UTF-16). And for those cases where you have to work with legacy data and yet don't want to incur the overhead of encoding conversion in and out, you can still treat the external strings as byte arrays instead of strings, assuming you have a "byte" data type, and do direct byte manipulation on them. That's essentially what you would have been doing anyway if you had used the old char == byte model I see in your docs. You just call it "byte" instead of "char" so it doesn't end up being your default string type.
> >
> > Having a modern UTF-16 char type, separate from arrays of "byte", gives you a consistency that allows for the creation of great libraries (since text is such a fundamental type). Java and C# designers figured this out from the start, and their libraries universally use a single string type. Perl figured it out pretty late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's never clear which CPAN modules will work and which ones will fail, so you have to use pragmas ("use utf-8" vs. "use bytes") and do lots of testing.
> >
> > I hope you'll consider making this change to your design. Have an 8-bit unsigned "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this "8-bit char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff or I'm quite sure you'll later regret it. C/C++ are in that sorry state for legacy reasons only, not because their designers were foolish, but any new language that intentionally copies that "design" is likely to regret that decision.
>
> Hi,
>
> There was a thread a year ago in the smalleiffel mailing list (starting at http://groups.yahoo.com/group/smalleiffel/message/4075 ) about unicode strings in Eiffel. It's a quite interesting read about the problems of adding string-like Unicode classes. The main point is that true Unicode support is very difficult to achieve; only some libraries provide good, correct and complete unicode encoders/decoders/renderers/etc.
>
> While I agree that some Unicode support is a necessity today (my mother tongue is brazilian portuguese, so I use non-ascii characters every day), we can't just add some base types and pretend everything is allright. We won't correct incorrectly written code with a primitive unicode string. Most programmers don't think about unicode when they develop their software, so almost every line of code dealing with text contains some assumptions about the character sets being used. Java has a primitive 16 bit char, but basic library functions (because they need good performance) use incorrect code for string handling stuff (the correct classes are in java.text, providing means to correctly collate strings). Sometimes we are just using plain old ASCII but we're bitten by the encoding issues. And when we need to deal with true unicode support the libraries trick us into believing everything is ok.
>
> IMO D should support a simple char array to deal with ASCII (as it does today) and some kind of standard library module to deal with unicode glyphs and text. This could be included in phobos or even in deimos. Any volunteers? With this we could force the programmer to deal with another set of tools (albeit similar) when dealing with each kind of string: ASCII or unicode. This module should allow creation of variable sized strings and glyphs through an opaque ADT. Each kind of usage has different semantics and optimization strategies (e.g. Boyer-Moore is good for ASCII but with unicode the space and time usage are worse).
>
> Best regards,
> Daniel Yokomiso.
>
> P.S.: I had to write some libraries and components (EJBs) in several Java projects to deal with data transfer in plain ASCII (communication with IBM mainframes). Each day I dreamed of using a language with simple one byte character strings, without problems with encoding and endianness (Solaris vs. Linux vs. Windows NT have some nice "features" in their JVMs if you aren't careful when writing Java code that uses "ASCII" Strings). But Java has a 16 bit character type and a SIGNED byte type, both awkward for this usage. A language shouldn't get in the way of simple code.
>
> "Never argue with an idiot. They drag you down to their level then beat you with experience."
>
> ---
> Outgoing mail is certified Virus Free.
> Checked by AVG anti-virus system (http://www.grisoft.com).
> Version: 6.0.443 / Virus Database: 248 - Release Date: 11/1/2003
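To make the "written twice" complaint concrete: without some form of genericity, every text routine needs one body for char[] and another for wchar[]. The sketch below shows the duplication being avoided with a function template over the character type, a later D idiom used here only to illustrate the design space, not something proposed in this thread:

```d
// A sketch: one templated function covers char[], wchar[] and dchar[],
// instead of maintaining two or three hand-written copies.
size_t countSpaces(Char)(const(Char)[] text)
{
    size_t n = 0;
    foreach (c; text)
        if (c == ' ')          // ' ' encodes identically in all three widths
            ++n;
    return n;
}

unittest
{
    assert(countSpaces("one two three")  == 2);   // char[]
    assert(countSpaces("one two three"w) == 2);   // wchar[]
    assert(countSpaces("one two three"d) == 2);   // dchar[]
}
```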
January 18, 2003 | Re: Unicode in D
Posted in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:b0b0up$vk7$1@digitaldaemon.com...
> I once wrote a large project that dealt with mixed ascii and unicode. There was bug after bug when the two collided. Finally, I threw in the towel and made the entire program unicode - every string in it.
>
> The trouble in D is that in the current scheme, everything dealing with text has to be written twice, once for char[] and again for wchar_t[]. In C, there's that wretched tchar.h to swap back and forth. It may just be easier in the long run to just make UTF-8 the native type, and then at least try and make sure the standard D library is correct.
>
> -Walter
[snip]

Hi,

Current D uses char[] as the string type. If we declare each char to be UTF-8 we'll have all the problems with what "myString[13] = someChar;" means. I think an opaque string datatype may be better in this case. We could have a glyph datatype that represents one unicode glyph in UTF-8 encoding, and use it together with a string class.

Also I don't think a mutable string type is a good idea. Python and Java use immutable strings, and this leads to better programs (you don't need to worry about copying your strings when you get or give them). Some nice tricks, like caching hashCode results for strings, are possible because the values won't change. We could also provide a mutable string class.

If this is the way to go we need lots of test cases, especially from people with experience writing unicode libraries. The Unicode spec has lots of particularities, like correct regular expression support, that may lead to subtle bugs.

Best regards,
Daniel Yokomiso.

"Before you criticize someone, walk a mile in their shoes. That way you're a mile away and you have their shoes, too."

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.443 / Virus Database: 248 - Release Date: 10/1/2003
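The hash-caching trick Daniel mentions relies entirely on immutability: the cached value can never go stale. A minimal sketch; the type and the hash function are invented for illustration:

```d
// Sketch of an immutable string wrapper that computes its hash lazily, once.
// Caching is only safe because the wrapped data can never change.
struct CachedString
{
    immutable(char)[] data;
    private size_t cachedHash;
    private bool hashed;

    size_t hash()
    {
        if (!hashed)
        {
            size_t h = 2166136261u;          // FNV-1a style, illustrative only
            foreach (c; data)
                h = (h ^ c) * 16777619u;
            cachedHash = h;
            hashed = true;
        }
        return cachedHash;
    }
}

unittest
{
    auto s = CachedString("immutable strings can cache their hash");
    assert(s.hash() == s.hash());   // second call reuses the cached value
}
```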
January 18, 2003 | Re: Unicode in D
Posted in reply to Serge K

> First, UTF-16 is just one of the many standard encodings for Unicode. UTF-16 allows more than 16-bit characters - with surrogates it can represent all >1M codes.
> (Unicode v2 used UCS-2, which is a 16bit-only encoding)

Right, me getting confused. Too many TLAs, too many standards (as ever).

> > I was under the impression UTF-16 was glyph based
>
> from The Unicode Standard, ch2 General Structure
> http://www.unicode.org/uni2book/ch02.pdf
> "Characters, not glyphs - The Unicode Standard encodes characters, not glyphs. The Unicode Standard draws a distinction between characters, which are the smallest components of written language that have semantic value, and glyphs, which represent the shapes that characters can have when they are rendered or displayed. Various relationships may exist between characters and glyphs: a single glyph may correspond to a single character, or to a number of characters, or multiple glyphs may result from a single character."
>
> btw, there are many precomposed characters in the Unicode which can be represented with combining characters as well. ( [â] and [a,(combining ^)] - equally valid representations for [a with circumflex] ).

So if I read this right... (I've been using UTF8 for ages and ignored what it represents, keeps me sane(er); I can't understand arabic file names anyway :)

A string (no matter how it's encoded) contains 3 lengths: the byte length, the number of unicode entities (16 bit UCS-2), and the number of "characters".

So "cât" as UTF8 is 4 bytes, as UTF-16 is 6 bytes; it's 3 UCS-2 entities and 3 "characters". But if the â was [a,(combining ^)] and not the single â UCS-2 value, then "cât" as UTF8 is 5 bytes, as UTF-16 is 8 bytes; it's 4 UCS-2 entities, but still 3 "characters".

Which is why I think String should be a class, not a thing[]. You should be able to get a utf8 encoded byte[], a utf-16 short[], a UCS-2 short[] (for win32/api), a (32 bit unicode) int[] (for linux), and ideally a Character[] from the string. How a String is stored - utf8, utf16 or 32bit/64bit values - is only relevant for performance, and different people will want different internal representations; but semantically they should all be the same.

This is all another reason why I also think that arrays should be templated classes that have an index method (operator []), so the Character[] from the string can modify the String it represents.

Mike.
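The three lengths Mike lists (code units, UTF-16/UCS-2 entities, and "characters") can be checked mechanically. The sketch below uses present-day `std.utf`, `std.uni.byGrapheme` and `walkLength` as stand-ins for those three counts; none of these existed at the time of this thread:

```d
import std.utf   : toUTF16;
import std.uni   : byGrapheme;
import std.range : walkLength;

void main()
{
    string precomposed = "c\u00E2t";         // 'â' as a single code point
    string combining   = "ca\u0302t";        // 'a' + combining circumflex

    // UTF-8 bytes, UTF-16 code units, code points, and grapheme "characters"
    assert(precomposed.length == 4);                   // bytes
    assert(toUTF16(precomposed).length == 3);          // UTF-16 units
    assert(precomposed.walkLength == 3);               // code points
    assert(precomposed.byGrapheme.walkLength == 3);    // "characters"

    assert(combining.length == 5);                     // bytes
    assert(toUTF16(combining).length == 4);            // UTF-16 units
    assert(combining.walkLength == 4);                 // code points
    assert(combining.byGrapheme.walkLength == 3);      // still 3 "characters"
}
```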
January 18, 2003 | Re: Unicode in D
Posted in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:b0b0up$vk7$1@digitaldaemon.com...
> I once wrote a large project that dealt with mixed ascii and unicode. There was bug after bug when the two collided. Finally, I threw in the towel and made the entire program unicode - every string in it.
>
> The trouble in D is that in the current scheme, everything dealing with text has to be written twice, once for char[] and again for wchar_t[]. In C, there's that wretched tchar.h to swap back and forth. It may just be easier in the long run to just make UTF-8 the native type, and then at least try and make sure the standard D library is correct.

I've gotten a little confused reading this thread. Here are some questions swimming in my head:

1) What does it mean to make UTF-8 the native type?
2) What is char.size?
3) Does char[] differ from byte[] or is it a typedef?
4) How does one get a UTF-16 encoding of a char[], or get the length, or get the 5th character, or set the 5th character to a given unicode character (expressed in UTF-16, say)?

Here are my guesses at the answers:

1) string literals are encoded in UTF-8
2) char.size = 8
3) it's a typedef
4) through the library, or directly if you know enough about the char[] you are manipulating.

Is this correct?

thanks,
-Ben

[snip]
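For question 4, here is one way the pieces fit together, sketched with present-day `std.utf` (`decode`, `encode`, `toUTF16` are today's Phobos names, assumed here for illustration); the splice at the end shows why "set the 5th character" is the awkward operation:

```d
import std.utf : decode, encode, toUTF16;

void main()
{
    char[] s = "übercool ünïcödé".dup;

    // Get a UTF-16 encoding of a char[]:
    wchar[] w = toUTF16(s).dup;

    // "The length" depends on which length is meant:
    size_t units = s.length;                 // UTF-8 code units (bytes)
    size_t chars = 0;                        // decoded characters
    for (size_t i = 0; i < s.length; ++chars)
        decode(s, i);                        // decode() advances i
    assert(units == 21 && chars == 16);

    // Get the 5th character (counting from 1) by decoding up to it:
    size_t begin = 0, end = 0;
    dchar fifth;
    foreach (_; 0 .. 5)
    {
        begin = end;
        fifth = decode(s, end);              // end now points past it
    }
    assert(fifth == 'c');

    // Set the 5th character: the replacement may occupy a different number
    // of bytes, so the array has to be spliced, not assigned into.
    char[4] buf;
    size_t n = encode(buf, 'Я');             // encode the new character
    s = s[0 .. begin] ~ buf[0 .. n] ~ s[end .. $];
}
```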
January 18, 2003 | Re: Unicode in D
Posted in reply to Daniel Yokomiso

On Sat, 18 Jan 2003 12:51:42 -0300 "Daniel Yokomiso" <daniel_yokomiso@yahoo.com.br> wrote:

> Current D uses char[] as the string type. If we declare each char to be UTF-8 we'll have all the problems with what "myString[13] = someChar;" means. I think an opaque string datatype may be better in this case. We could have a glyph datatype that represents one unicode glyph in UTF-8 encoding, and use it together with a string class. Also

So what does "myString[13] = someGlyph" mean? char doesn't have to be a byte, we can have another data type for that.

--
Theodore Reed (rizen/bancus) -==- http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"Yesterday no longer exists
Tomorrow's forever a day away
And we are cell-mates, held together
in the shoreless stream that is today."
January 18, 2003 | Re: Unicode in D
Posted in reply to Daniel Yokomiso

"Daniel Yokomiso" <daniel_yokomiso@yahoo.com.br> wrote in message news:b0bpq9$1d3d$1@digitaldaemon.com...
> Current D uses char[] as the string type. If we declare each char to be UTF-8 we'll have all the problems with what "myString[13] = someChar;" means. I think an opaque string datatype may be better in this case. We could have a glyph datatype that represents one unicode glyph in UTF-8 encoding, and use it together with a string class.

I'm thinking that myString[13] should simply set the byte at myString[13]. Trying to fiddle with the multibyte stuff with simple array access semantics just looks to be too confusing and error prone. Access to the unicode characters would be via a function or property.

> Also I don't think a mutable string type is a good idea. Python and Java use immutable strings, and this leads to better programs (you don't need to worry about copying your strings when you get or give them). Some nice tricks, like caching hashCode results for strings, are possible because the values won't change. We could also provide a mutable string class.

I think the copy-on-write approach to strings is the right idea. Unfortunately, if done by the language semantics, it can have severe adverse performance results (think of a toupper() function copying the string again each time a character is converted). Used instead as a coding style, which is currently how it's done in Phobos, it seems to work well.

My javascript implementation (DMDScript) does cache the hash for each string, and that works well for the semantics of javascript. But I don't think it is appropriate for a lower level language like D to do as much for strings.

> If this is the way to go we need lots of test cases, especially from people with experience writing unicode libraries. The Unicode spec has lots of particularities, like correct regular expression support, that may lead to subtle bugs.

Regular expression implementations naturally lend themselves to subtle bugs :-(. Having a good test suite is a lifesaver.
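Copy-on-write as a coding style, the approach Walter describes for Phobos, means the input is only duplicated the first time a change is actually needed. A minimal ASCII-only sketch of the idea (real Unicode case mapping is far more involved, and the function name is invented here):

```d
// Copy-on-write as a coding convention: the caller's string is shared
// untouched unless an uppercase conversion actually changes something.
char[] toUpperCow(char[] s)
{
    char[] r = s;                        // share the input by default
    foreach (i, c; s)
    {
        if (c >= 'a' && c <= 'z')
        {
            if (r is s)
                r = s.dup;               // copy once, on the first change
            r[i] = cast(char)(c - ('a' - 'A'));
        }
    }
    return r;
}

unittest
{
    char[] a = "ALREADY UPPER 123".dup;
    assert(toUpperCow(a) is a);          // no change, no copy

    char[] b = "needs work".dup;
    auto u = toUpperCow(b);
    assert(u !is b && u == "NEEDS WORK");
}
```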