January 17, 2003 Re: Unicode in D
Posted in reply to globalization guy

In article <b07jht$22v4$1@digitaldaemon.com>, globalization guy says...
> That's the kind of advantage modern developers get from Java that they don't get from good ol' C.

Provided Solaris/JVM is configured correctly by the (friggin') service provider.

January 17, 2003 Re: Unicode in D
Posted in reply to globalization guy

You make some great points. I have to ask, though, why UTF-16 as opposed to UTF-8?

January 17, 2003 Re: Unicode in D
Posted in reply to Walter

In article <b08cdr$2fld$1@digitaldaemon.com>, Walter says...
> You make some great points. I have to ask, though, why UTF-16 as opposed to UTF-8?

Good question, and actually it's not an open-and-shut case. UTF-8 would not be a big mistake, but it might not be quite as good as UTF-16.

The biggest reason I think UTF-16 has the edge is that you'll probably want to treat your strings as arrays of characters on many occasions, and that's *almost* as easy to do with UTF-16 as with ASCII. It's really not very practical with UTF-8, though.

UTF-16 characters are almost always a single 16-bit code unit. Once in a billion characters or so, you get a character that is composed of two "surrogates", sort of like half characters. Your code does have to keep this exceptional case in mind and handle it when necessary, though that is usually the type of problem you delegate to the standard library. In most cases, a function can just treat each surrogate as a character and not worry that it might be just half of the representation of a character, as long as the two don't get separated. In almost all cases, you can think of a character as a single 16-bit entity, which is almost as simple as thinking of it as a single 8-bit entity. You can do bit operations on them and other C-like things, and it should be very efficient.

Unlike UTF-16's two cases, one of which is very rare, UTF-8 has four cases, three of which are very common. All of your code needs to do a good job with those three cases. Only the fourth can be considered exceptional. (Of course it has to be handled, too, but it is like the exceptional UTF-16 case, where you don't have to optimize for it because it rarely occurs.) Most strings will tend to have mixed-width characters, so a model of an array of elements isn't a very good one. You can still implement your language with accessors that reach into a UTF-8 string and parse out the right character when you say "str[5]", but it will be further removed from the physical implementation than if you use UTF-16. For a somewhat lower-level language like "D", this probably isn't a very good fit.

The main benefit of UTF-8 is when exchanging text data with arbitrary external parties. UTF-8 has no endianness problem, so you don't have to worry about the *internal* memory model of the recipient. It has some other features that make it easier to digest by legacy systems that can only handle ASCII. They won't work right outside ASCII, but they'll often work for ASCII, and they'll fail more gracefully than UTF-16 would (which is likely to contain embedded \0 bytes).

None of these issues are relevant to your own program's *internal* text model. Internally, you're not worried about endianness. (You don't worry about the endianness of your int variables, do you?) You don't have to worry about losing a byte in RAM, etc.

When talking to external APIs, you'll still have to output in a form that the API can handle. Win32 APIs want UTF-16. Mac APIs want UTF-16. Java APIs want UTF-16, as do .NET APIs. Unix APIs are problematic, since there are so many and they aren't coordinated by a single body. Some will only be able to handle ASCII; others will be upgraded to UTF-8. I don't think the Unix system APIs will become UTF-16 because legacy is such a ball and chain in the Unix world, but the process is underway to upgrade the standard system encoding for all major Linux distributions to UTF-8.

If Linux APIs (and probably most Unix APIs eventually) are of primary importance, UTF-8 is still a possibility. I'm not totally ruling it out. It wouldn't hurt you much to use UTF-8 internally, but accessing strings as arrays of characters would require a sort of virtual string model that doesn't match the physical model quite as closely as you could get with UTF-16. The additional abstraction might have more overhead than you would prefer internally. If it's a choice between internal inefficiency and inefficiency when calling external APIs, I would usually go for the latter.

Most language designers who understand internationalization have decided to go with UTF-16 for languages that have their own rich set of internal libraries, and they have mechanisms for calling external APIs that convert the string encodings.

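A minimal sketch of the trade-off described above, written in present-day D and assuming Phobos' std.utf.codeLength (nothing like this existed in the 2003 compiler). It prints how many code units a few sample characters need in each encoding:

```d
import std.stdio : writefln;
import std.utf : codeLength;   // code units needed to encode one code point

void main()
{
    // 'A' (ASCII), U+00E9 'é', U+4E2D '中', U+1F600 (an emoji outside the BMP)
    foreach (dchar c; "A\u00E9\u4E2D\U0001F600")
    {
        // codeLength!char  -> UTF-8 bytes; codeLength!wchar -> UTF-16 units
        writefln("U+%06X  UTF-8: %s byte(s)  UTF-16: %s unit(s)",
                 cast(uint) c, codeLength!char(c), codeLength!wchar(c));
    }
}
```

Only the emoji needs two UTF-16 code units (a surrogate pair), while every non-ASCII character already needs two or more UTF-8 bytes, which is the "two cases versus four cases" point above.
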
January 17, 2003 Re: Unicode in D
Posted in reply to globalization guy

I read your post with great interest. However, I'm leaning towards UTF-8 for the following reasons (some of which you've covered):

1) In googling around and reading various articles, it seems that UTF-8 is gaining momentum as the encoding of choice, including for HTML.

2) Linux is moving towards UTF-8 permeating the OS. Doing UTF-8 in D means that D will mesh naturally with Linux system APIs.

3) Is Win32's "wide char" really UTF-16, including the multi-word encodings?

4) I like the fact that there are no endianness issues, which is important when writing files and transmitting text; it's a much more important issue than the endianness of ints.

5) 16-bit accesses on Intel CPUs can be pretty slow compared to byte or dword accesses (varies by CPU type).

6) Sure, UTF-16 reduces the frequency of multi-character encodings, but the code to deal with them must still be there and must still execute.

7) I've converted some large Java text processing apps to C++, and converted the Java 16-bit chars to using UTF-8. That change resulted in *substantial* performance improvements.

8) I suspect that 99% of the text processed in computers is ASCII. UTF-8 is a big win in memory and speed for processing English text.

9) A lot of diverse systems and lightweight embedded systems need to work with 8-bit chars. Going to UTF-16 would, I think, reduce the scope of applications and systems that D would be useful for. Going to UTF-8 would make it as broad as possible.

10) Interestingly, making char[] in D be UTF-8 does not seem to step on or prevent dealing with wchar_t[] arrays being UTF-16.

11) I'm not convinced the char[i] indexing problem will be a big one. Most operations done on ASCII strings remain unchanged for UTF-8, including things like sorting & searching (a short sketch follows this post).

See http://www.cl.cam.ac.uk/~mgk25/unicode.html

"globalization guy" <globalization_member@pathlink.com> wrote in message news:b09qpe$aff$1@digitaldaemon.com...
> [...]

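A small sketch of point 11 in present-day D syntax (findBytes is a hypothetical helper written here for illustration; only array indexing and slicing are used): a naive byte-wise search works unchanged on UTF-8 data, because no byte of a multi-byte sequence ever falls in the ASCII range.

```d
import std.stdio : writeln;

// naive byte-wise substring search; nothing here knows about UTF-8
ptrdiff_t findBytes(const(char)[] haystack, const(char)[] needle)
{
    if (needle.length == 0 || needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    return -1;
}

void main()
{
    string s = "prix: 10\u20AC, taxe incluse";   // '€' is 3 bytes in UTF-8
    writeln(findBytes(s, "taxe"));                // byte offset, found correctly
    writeln(findBytes(s, "\u20AC"));              // multi-byte needles also work
}
```

UTF-8 also preserves code-point order under plain byte-wise comparison, which is why sorting routines written for ASCII keep working.
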
January 18, 2003 Re: Unicode in D
Posted in reply to Walter

Walter wrote:
> 10) Interestingly, making char[] in D be UTF-8 does not seem to step on or prevent dealing with wchar_t[] arrays being UTF-16.

You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.

I think we should kill off wchar if we go in this direction. The char/wchar conflict is probably the worst part of D's design right now, as it doesn't fit well with the rest of the language (limited and ambiguous overloading), and it would provide absolutely nothing that char doesn't already encapsulate. If you need different encodings, use a library.

> 11) I'm not convinced the char[i] indexing problem will be a big one. Most operations done on ASCII strings remain unchanged for UTF-8, including things like sorting & searching.

It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.

12) UTF-8 never embeds ANY control characters inside its multi-byte sequences (in particular, no stray NUL bytes), so it can interface with unintelligent C libraries natively. That's not a minor advantage when you're trying to get people to switch to it!

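A sketch of point 12, assuming present-day D and Phobos (std.string.toStringz and core.stdc.stdio.puts): because UTF-8 never produces a zero byte except for U+0000 itself, a NUL-terminated copy of a D string can be handed straight to an ordinary C function.

```d
import core.stdc.stdio : puts;    // plain C library function
import std.string : toStringz;    // appends the terminating NUL

void main()
{
    string s = "h\u00E9llo, w\u00F6rld";   // UTF-8 internally, no embedded NULs
    puts(toStringz(s));                    // the C side just sees a byte string
}
```
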
January 18, 2003 Re: Unicode in D
Posted in reply to Burton Radons

"Burton Radons" <loth@users.sourceforge.net> wrote in message news:b0a6rl$i4m$1@digitaldaemon.com...
> Walter wrote:
> > 10) Interestingly, making char[] in D be UTF-8 does not seem to step on or prevent dealing with wchar_t[] arrays being UTF-16.
> You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.

I think making char[] UTF-8 is the right way.

> I think we should kill off wchar if we go in this direction. The char/wchar conflict is probably the worst part of D's design right now, as it doesn't fit well with the rest of the language (limited and ambiguous overloading), and it would provide absolutely nothing that char doesn't already encapsulate. If you need different encodings, use a library.

I agree that the char/wchar conflict is a screwup in D's design, and one I've not been happy with. UTF-8 offers a way out. wchar_t should still be retained, though, for interfacing with the Win32 API.

> > 11) I'm not convinced the char[i] indexing problem will be a big one. Most operations done on ASCII strings remain unchanged for UTF-8, including things like sorting & searching.
> It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.

Interestingly, if foreach is done right, iterating through char[] will work right, UTF-8 or not.

> 12) UTF-8 never embeds ANY control characters inside its multi-byte sequences (in particular, no stray NUL bytes), so it can interface with unintelligent C libraries natively. That's not a minor advantage when you're trying to get people to switch to it!

You're right.

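This is roughly how it later worked out in D. A sketch of the present-day behaviour (not the 2003 compiler): declaring the foreach variable as dchar makes the loop decode UTF-8 on the fly, while declaring it as char walks the raw bytes.

```d
import std.stdio : writefln;

void main()
{
    string s = "a\u00E9\u4E2D";   // 1 + 2 + 3 = 6 UTF-8 bytes, 3 code points

    foreach (char b; s)           // code units: runs 6 times
        writefln("byte  0x%02X", cast(ubyte) b);

    foreach (dchar c; s)          // code points: runs 3 times, decoded on the fly
        writefln("char  U+%04X", cast(uint) c);
}
```
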
January 18, 2003 Re: Unicode in D
Posted in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:b0a7ft$iei$1@digitaldaemon.com...
> > You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.
>
> I think making char[] UTF-8 is the right way.

I would be more in favor of a String class that was UTF-8 internally. The problem with UTF-8 is that the number of bytes and the number of chars are dependent on the data. char[] to me implies an array of chars, so

    char[] foo = "aa\u0555";

is 4 bytes, but only 3 chars. So what is foo[2]? And what if I set foo[1] = '\u0467'? And what about wanting 8-bit ASCII strings?

If you are going UTF-8, then think about the minor extension Java added to the encoding by allowing a two-byte encoding of 0, which allows embedded 0 in strings without messing up the C strlen (which returns the byte length).

> [...]

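A sketch of the concern raised here, in present-day D syntax (std.range.walkLength is assumed for the code-point count): with char[] defined as UTF-8, .length and indexing work in bytes, not characters.

```d
import std.stdio : writeln;
import std.range : walkLength;   // walks the string as decoded code points

void main()
{
    string foo = "aa\u0555";          // U+0555 needs 2 bytes in UTF-8

    writeln(foo.length);              // 4: bytes (code units)
    writeln(foo.walkLength);          // 3: characters (code points)
    writeln(cast(ubyte) foo[2]);      // 213 (0xD5), the lead byte of U+0555,
                                      // not a character in its own right
}
```

Overwriting foo[1] with a code point that needs a different number of bytes cannot be done in place, which is exactly why glyph-level mutation calls for a higher-level string type or slice-based editing.
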
January 18, 2003 Re: Unicode in D
Posted in reply to Walter

> 6) Sure, UTF-16 reduces the frequency of multi-character encodings, but the code to deal with them must still be there and must still execute.

I was under the impression UTF-16 was glyph based, so each char (16 bits) was a glyph of some form. Not all glyphs cause the graphics to move to the next char, so accents can be encoded as a postfix to the char they are over/under, and charsets like Chinese have sequences that generate the correct visual representation.

UTF-8 is just a way to encode UTF-16 so that it is compatible with ASCII: 0..127 map to 0..127, and 128..255 are used as special values identifying multi-byte sequences. The string can be processed as 8-bit ASCII by software without problems; only the visual representation changes (128..255 on DOS are the box-drawing and international chars).

However, a 3-unit UTF-16 sequence will encode to 3 UTF-8 encoded sequences, and if they are all >127 then that would be 6 or more bytes. So if you consider the 3 UTF-16 values to be one "char", then UTF-8 should also consider the 6-or-more-byte sequence as one "char" rather than 3 "chars".

January 18, 2003 Re: Unicode in D
Posted in reply to globalization guy

"globalization guy" <globalization_member@pathlink.com> escreveu na mensagem news:b05pdd$13bv$1@digitaldaemon.com...
> I think you'll be making a big mistake if you adopt C's obsolete char == byte concept of strings. Savvy language designers these days realize that, like ints and floats, chars should be a fundamental data type at a higher level of abstraction than raw bytes. The model that most modern language designers are turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.
>
> If you do so, you make it possible for strings in your language to have a single, canonical form that all APIs use. Instead of the nightmare that C/C++ programmers face when passing string parameters ("now, let's see, is this a char* or a const char* or an ISO C++ string or an ISO wstring or a wchar_t* or a char[] or a wchar_t[] or an instance of one of countless string classes...?"). The fact that not just every library but practically every project feels the need to reinvent its own string type is proof of the need for a good, solid, canonical form built right into the language.
>
> Most language designers these days either get this from the start or they later figure it out and have to screw up their language with multiple string types.
>
> Having canonical UTF-16 chars and strings internally does not mean that you can't deal with other character encodings externally. You can convert to canonical form on import and convert back to some legacy encoding on export. When you create the strings yourself, or when they are created in Java or C# or JavaScript or default XML or most new text protocols, no conversion will be necessary. It will only be needed for legacy data (or a very lightweight switch between UTF-8 and UTF-16). And for those cases where you have to work with legacy data and yet don't want to incur the overhead of encoding conversion in and out, you can still treat the external strings as byte arrays instead of strings, assuming you have a "byte" data type, and do direct byte manipulation on them. That's essentially what you would have been doing anyway if you had used the old char == byte model I see in your docs. You just call it "byte" instead of "char" so it doesn't end up being your default string type.
>
> Having a modern UTF-16 char type, separate from arrays of "byte", gives you a consistency that allows for the creation of great libraries (since text is such a fundamental type). Java and C# designers figured this out from the start, and their libraries universally use a single string type. Perl figured it out pretty late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's never clear which CPAN modules will work and which ones will fail, so you have to use pragmas ("use utf-8" vs. "use bytes") and do lots of testing.
>
> I hope you'll consider making this change to your design. Have an 8-bit unsigned "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this "8-bit char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff, or I'm quite sure you'll later regret it. C/C++ are in that sorry state for legacy reasons only, not because their designers were foolish, but any new language that intentionally copies that "design" is likely to regret that decision.

Hi,

There was a thread a year ago in the SmallEiffel mailing list (starting at http://groups.yahoo.com/group/smalleiffel/message/4075 ) about Unicode strings in Eiffel. It's a quite interesting read about the problems of adding string-like Unicode classes. The main point is that true Unicode support is very difficult to achieve; only a few libraries provide good, correct and complete Unicode encoders/decoders/renderers/etc.

While I agree that some Unicode support is a necessity today (my mother tongue is Brazilian Portuguese, so I use non-ASCII characters every day), we can't just add some base types and pretend everything is all right. We won't correct incorrectly written code with a primitive Unicode string. Most programmers don't think about Unicode when they develop their software, so almost every line of code dealing with text contains some assumptions about the character sets being used. Java has a primitive 16-bit char, but basic library functions (because they need good performance) use incorrect code for string handling (the correct classes are in java.text, providing means to correctly collate strings). Sometimes we are just using plain old ASCII but we're bitten by the encoding issues, and when we need true Unicode support the libraries trick us into believing everything is OK.

IMO D should support a simple char array to deal with ASCII (as it does today) and some kind of standard library module to deal with Unicode glyphs and text. This could be included in Phobos or even in Deimos. Any volunteers? With this we could force the programmer to deal with another set of tools (albeit similar) when dealing with each kind of string: ASCII or Unicode. This module should allow creation of variable-sized strings and glyphs through an opaque ADT. Each kind of usage has different semantics and optimization strategies (e.g. Boyer-Moore is good for ASCII, but with Unicode the space and time usage are worse).

Best regards,
Daniel Yokomiso.

P.S.: I had to write some libraries and components (EJBs) in several Java projects to deal with data transfer in plain ASCII (communication with IBM mainframes). Each day I dreamed of using a language with simple one-byte character strings, without problems with encoding and endianness (Solaris vs. Linux vs. Windows NT have some nice "features" in their JVMs if you aren't careful when writing Java code that uses "ASCII" Strings). But Java has a 16-bit character type and a SIGNED byte type, both awkward for this usage. A language shouldn't get in the way of simple code.

"Never argue with an idiot. They drag you down to their level then beat you with experience."

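A purely hypothetical sketch of the kind of opaque Unicode text module proposed above; none of these names (UniText, glyphCount, glyphAt, collate) exist in Phobos or Deimos, they only illustrate the shape such an ADT might take.

```d
// Hypothetical interface only: illustrates the proposal, not a real library.
interface UniText
{
    size_t glyphCount();                   // user-perceived characters, not bytes
    dchar[] glyphAt(size_t index);         // one glyph may be several code points
    UniText slice(size_t from, size_t to); // glyph-based slicing
    int collate(UniText other);            // locale-aware comparison
}

// Plain ASCII work keeps using char[] directly; only code that really needs
// glyph-level semantics pays for an implementation of this ADT.
```
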
January 18, 2003 Re: Unicode in D
Posted in reply to Mike Wynn

> I was under the impression UTF-16 was glyph based, so each char (16 bits) was a glyph of some form. Not all glyphs cause the graphics to move to the next char, so accents can be encoded as a postfix to the char they are over/under, and charsets like Chinese have sequences that generate the correct visual representation.

First, UTF-16 is just one of the many standard encodings for Unicode. UTF-16 allows more than 16-bit characters: with surrogates it can represent all of the more than one million code points. (Unicode v2 used UCS-2, which is a 16-bit-only encoding.)

> I was under the impression UTF-16 was glyph based

From The Unicode Standard, ch. 2 "General Structure" (http://www.unicode.org/uni2book/ch02.pdf):

"Characters, not glyphs - The Unicode Standard encodes characters, not glyphs. The Unicode Standard draws a distinction between characters, which are the smallest components of written language that have semantic value, and glyphs, which represent the shapes that characters can have when they are rendered or displayed. Various relationships may exist between characters and glyphs: a single glyph may correspond to a single character, or to a number of characters, or multiple glyphs may result from a single character."

Btw, there are many precomposed characters in Unicode which can be represented with combining characters as well: [â] and [a, combining ^] are equally valid representations of [a with circumflex].

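A sketch of that last point in present-day D (Phobos' std.uni.normalize is assumed): the precomposed form and the combining-character form are different code-point sequences that compare unequal byte-for-byte, but compare equal after canonical normalization.

```d
import std.stdio : writeln;
import std.uni : normalize, NFC;   // canonical composition

void main()
{
    string precomposed = "\u00E2";    // 'â' as a single code point
    string combining   = "a\u0302";   // 'a' followed by combining circumflex

    writeln(precomposed == combining);                                // false
    writeln(normalize!NFC(precomposed) == normalize!NFC(combining));  // true
}
```
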