December 21, 2003 Re: Unicode discussion
Posted in reply to Rupert Millard

In article <bs4ea9$jo2$1@digitaldaemon.com>, Rupert Millard says...

>>> I would think that the datatype char would be a UTF-8 character, with no indication of the amount of storage it used. The compiler would be free to represent it internally however it chose. Indexing should work (perhaps inefficiently).
>>
>> That would be a higher level view of it, and I suggest a wrapper class around it can provide this.
>
> On Friday 19th, I posted a class that provides this functionality to this thread.

I'm sorry to interrupt (I'm one of the clueless here; in fact I call this the unicorn discussion), but isn't Vathix's String class supposed to cover that?

http://www.digitalmars.com/drn-bin/wwwnews?D/19525

It's bigger so it must be better ;)

Ant
---

December 21, 2003 Re: Unicode discussion
Posted in reply to Walter

I think this discussion of the "language being wrong" is itself wrong. It is clear that char[], char, and the other associated types don't have sensible higher-level semantics; the examples are many. But I find it quite right for the language not to constrain programmers to high-level types. That is a job for the library.

Now, everyone: Walter has quite enough to do of what he does better than all of us. Improving the standard library is a job he delegates to us.

A library class or struct String should be indexed by real character scanning, not by address, even if that means more overhead. The result of such indexing, as well as any single-character access, would be a dchar. The internal representation should still be accessible, for the case where someone finds the high-level semantics a bottleneck within his application.

Besides, Mark and I proposed a number of solutions a while ago which would give strings non-standard storage, but would allow the high-level representation to be significantly faster, at the cost of ease of operating on a lower-level representation.

-eye
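[A minimal sketch of what such a String could look like, assuming the std.utf routines of the Phobos of the day; the names and details are illustrative, not a concrete proposal:]

```d
import std.utf;

// Sketch: opIndex scans character by character instead of indexing
// by address, and yields a dchar. The raw UTF-8 data stays
// accessible for anyone who needs the low-level representation.
class String
{
    char[] data;

    this(char[] s) { data = s; }

    // n-th character, not n-th byte; O(n) by design
    dchar opIndex(size_t n)
    {
        uint i = 0;
        while (n-- != 0)
            i += std.utf.stride(data, i); // skip one whole character
        return std.utf.decode(data, i);   // decode the one we landed on
    }
}
```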
---

December 21, 2003 Re: Unicode discussion
Posted in reply to Ant

Ant <Ant_member@pathlink.com> wrote in news:bs4gc8$n2c$1@digitaldaemon.com:

> In article <bs4ea9$jo2$1@digitaldaemon.com>, Rupert Millard says...
>
>>>> I would think that the datatype char would be a UTF-8 character, with no indication of the amount of storage it used. The compiler would be free to represent it internally however it chose. Indexing should work (perhaps inefficiently).
>>>
>>> That would be a higher level view of it, and I suggest a wrapper class around it can provide this.
>>
>> On Friday 19th, I posted a class that provides this functionality to this thread.
>
> I'm sorry to interrupt (I'm one of the clueless here; in fact I call this the unicorn discussion), but isn't Vathix's String class supposed to cover that?
> http://www.digitalmars.com/drn-bin/wwwnews?D/19525
>
> It's bigger so it must be better ;)
>
> Ant

You had me worried here, because I missed that post! However, they do slightly different things, I think. Mine indexes characters rather than bytes in UTF-8 strings; Vathix's does many other string handling things (e.g. changing case). My code needs to be integrated into his, if it can be - I'm not sure what implications his use of templates has.

You're quite correct that, as they currently stand, his is vastly more useful - I can't think of many situations where you need to index whole characters rather than bytes. My main reason for writing it was that I enjoy writing code.

Rupert
---

December 22, 2003 Re: Unicode discussion
Posted in reply to Roald Ribe

"Roald Ribe" <rr.no@spam.teikom.no> wrote in message news:bs4ddt$ig4$1@digitaldaemon.com...

>>> Can't a single UTF-8 character require multiple bytes for representation?
>>
>> No.
>
> ???
> A Unicode character can result in up to 6 bytes used when encoded with UTF-8. Which is what the poster meant to ask, I think.

Sure, perhaps I misunderstood him.
---

December 31, 2003 Re: Unicode discussion
Posted in reply to Roald Ribe

> ???
> A Unicode character can result in up to 6 bytes used when encoded
> with UTF-8.
UTF-8 can represent all Unicode characters with no more than 4 bytes. ISO/IEC 10646 (UCS-4) may require up to 6 bytes in UTF-8, but that is a superset of Unicode.
---

December 31, 2003 Re: Unicode discussion
Posted in reply to Hauke Duden

> I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.

Well, at least one can convert any Unicode string to UTF-8 without risk of losing information.

> Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character. So you may end up using even more memory with UTF-8.

UTF-32 never takes less memory than UTF-8. Period. Any Unicode character takes no more than 4 bytes in UTF-8:

1 byte - ASCII
2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc.
3 bytes - most other scripts in use
4 bytes - rare/dead/special scripts

UTF-8 does mean a multibyte encoding for most languages (except English and maybe some others).

Most European and Asian languages need just one UTF-16 unit per character. For CJK languages the occurrence of UTF-16 surrogates in real texts is estimated at less than 1%. The other scripts encoded in the "higher planes" cover very rare or dead languages and some special symbols.

So in most cases a UTF-16 string can be treated as a simple array of UCS-2 characters; you just need to know whether it contains surrogates, i.e. whether number_of_characters < number_of_16bit_units.
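[A small D sketch of the byte counts claimed above, together with the surrogate test from the last line; the function names are mine, not library routines:]

```d
// Bytes a Unicode scalar value occupies in UTF-8 (never more than 4).
int utf8Bytes(dchar c)
{
    if (c <= 0x7F)   return 1; // ASCII
    if (c <= 0x7FF)  return 2; // Latin extended, Cyrillic, Greek, Hebrew, Arabic, ...
    if (c <= 0xFFFF) return 3; // most other scripts in use (rest of the BMP)
    return 4;                  // rare/dead/special scripts (higher planes)
}

// A UTF-16 string contains surrogate pairs exactly when it encodes
// fewer characters than it has 16-bit units.
bool hasSurrogates(wchar[] s)
{
    foreach (wchar u; s)
        if (u >= 0xD800 && u <= 0xDBFF) // high (leading) surrogate
            return true;
    return false;
}
```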
---

December 31, 2003 Re: Unicode discussion
Posted in reply to Serge K

"Serge K" <skarebo@programmer.net> wrote in message news:bst8q3$218i$1@digitaldaemon.com...

>> I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.
>
> Well, at least one can convert any Unicode string to UTF-8 without risk of losing information.

This is a good point. But I stand my ground: it may result in up to 6 bytes used for each character (worst case).

>> Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character. So you may end up using even more memory with UTF-8.
>
> UTF-32 never takes less memory than UTF-8. Period.
> Any Unicode character takes no more than 4 bytes in UTF-8:
> 1 byte - ASCII
> 2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc.
> 3 bytes - most other scripts in use
> 4 bytes - rare/dead/special scripts

This is wrong. Read up on UTF-8 encoding.

> UTF-8 does mean a multibyte encoding for most languages (except English and maybe some others).

Right.

> Most European and Asian languages need just one UTF-16 unit per character.

Yes, most, but not all.

> For CJK languages the occurrence of UTF-16 surrogates in real texts is estimated at less than 1%.

The code to handle it still has to be present...

> The other scripts encoded in the "higher planes" cover very rare or dead languages and some special symbols.
>
> So in most cases a UTF-16 string can be treated as a simple array of UCS-2 characters.

Yes, but "most cases" is not a good argument when the original discussion was initiated to handle ALL languages, in a way that the developer would find "natural", easy, and integrated in the D language.

> You just need to know whether it contains surrogates, i.e. whether number_of_characters < number_of_16bit_units.

There is no such thing as "just" with these issues (IMHO) ;-)

Roald
---

December 31, 2003 Re: Unicode discussion
Posted in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:brll85$1oko$1@digitaldaemon.com...

> "Elias Martenson" <no@spam.spam> wrote in message news:pan.2003.12.15.23.07.24.569047@spam.spam...
>
>> Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.
>
> In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.
>
>> The overloading issue is interesting, but may I suggest that char and wchar are at least renamed to something more appropriate? Maybe utf8byte and utf16byte? I feel it's important to point out that they aren't characters.
>
> I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, so I don't see much of an issue here.
>
>> And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such functions doesn't help either.
>
> I think the library functions should be improved to handle Unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.
>
>> I'd love to help out and do these things. But two things are needed first:
>> - At least one other person needs to volunteer. I've had bad experiences when one person does this by himself.
>
> You're not by yourself. There's a whole D community here!
>
>> - The core concepts need to be decided upon. Things seem to be somewhat in flux right now, with three different string types and all. At the very least it needs to be decided what a "string" really is: is it a UTF-8 byte sequence or a UTF-32 character sequence? I haven't hid the fact that I would prefer the latter.
>
> A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.
>
>>> That's correct as well. The library's support for Unicode is inadequate. But there also is a nice package (std.utf) which will convert between char[], wchar[], and dchar[]. This can be used to convert the text strings into whatever Unicode stream type the underlying operating system API supports. (For win32 this would be UTF-16; I am unsure what linux supports.)
>>
>> Yes. But this would then assume that char[] is always in native encoding, which doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as saying that the stream actually performs native decoding to UTF-8 when reading into a char[] array.
>
> char[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.
>
>> Unless fundamental encoding/decoding is embedded in the streams library, it would be best to simply read text data into a byte array and then perform native decoding manually afterwards using functions similar to the C mbstowcs() and wcstombs(). The drawback to this is that you cannot read text data in platform encoding without copying through a separate buffer, even in cases when this is not needed.
>
> If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. If you are reading code page or locale dependent strings, putting them into a char[] will require running them through a conversion.
>
>>> D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name APIs to UTF-16 via std.utf.toUTF16z(), for use calling the win32 APIs.
>>
>> This can be done in a much better, platform independent way, by using the native<->unicode conversion routines.
>
> The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue. I feel that by designing D around UTF-8, UTF-16 and UTF-32, the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.
>
>> In C, as already mentioned, these are called mbstowcs() and wcstombs(). For Windows, these would convert to and from UTF-16. For Unix, these would convert to and from whatever encoding the application is running under (dictated by the LC_CTYPE environment variable). There really is no need to make the APIs platform dependent in any way here.
>
> After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF. This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!)
>
> UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ASCII code, you can write portable code that is fully capable of handling the complete Unicode character set.

Following this discussion, I have read some more on the subject. In addition to the speed issues that were mentioned, I have had some insights on the issues of endianness, serialization, BOM (Byte Order Mark), and more. Most of it can be found in a reasonably short PDF document: http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

There is even more to this than I first believed... Based on the new knowledge, I become more and more convinced that the choice of UTF-8 encoding as the basic "correct thing to do" for general use in a programming language is well founded. But when text _processing_ comes into play, other rules apply.

But: I still find it objectionable to call one byte in a UTF-8/Unicode based language a char! ;-) The naming will of course make it easier to do a straight port from C to D, but such a port will in most cases be of no use on the "international scene". Oh well, this can be argued well both ways, I guess...

IMHO there should be no char type at all. Only byte. Or maybe, to take more sizes into consideration: bin8, bin16, bin32, bin64... I think porting from C to D should involve renaming chars to bin8's.

Hmmm... It is sad when learning more makes you want to change less ;-) Anyway, there is more to be learned...

Roald
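[As a concrete illustration of the std.utf round-trips Walter describes above; a sketch, assuming the Phobos signatures of the time, with the win32 call only hinted at:]

```d
import std.utf;

void example()
{
    char[] u8 = "héllo";               // UTF-8, as all char[] strings are

    wchar[] u16 = std.utf.toUTF16(u8); // UTF-8  -> UTF-16
    dchar[] u32 = std.utf.toUTF32(u8); // UTF-8  -> UTF-32
    char[] back = std.utf.toUTF8(u16); // UTF-16 -> UTF-8, losslessly

    version (Win32)
    {
        // zero-terminated UTF-16, ready for a win32 "W" API such as CreateFileW
        wchar* z = std.utf.toUTF16z(u8);
    }
}
```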
---

January 03, 2004 Re: Unicode discussion
Posted in reply to Roald Ribe

>>> Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character. So you may end up using even more memory with UTF-8.
>>
>> UTF-32 never takes less memory than UTF-8. Period.
>> Any Unicode character takes no more than 4 bytes in UTF-8:
>> 1 byte - ASCII
>> 2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc.
>> 3 bytes - most other scripts in use
>> 4 bytes - rare/dead/special scripts
>
> This is wrong. Read up on UTF-8 encoding.

RTFM. [The Unicode Standard, Version 4.0]

The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.

D36. UTF-8 encoding form: The Unicode encoding form which assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-5.

Table 3-5. UTF-8 Bit Distribution

| Scalar Value | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|---|---|---|---|---|
| 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |

[Appendix C: Relationship to ISO/IEC 10646]

C.3 UCS Transformation Formats: UTF-8

The term UTF-8 stands for UCS Transformation Format, 8-bit form. UTF-8 is an alternative coded representation form for all of the characters of ISO/IEC 10646. The ISO/IEC definition is identical in format to UTF-8 as described under definition D36 in Section 3.9, Unicode Encoding Forms. ... The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as an encoding form of Unicode characters.
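[Table 3-5 translates directly into code. A hand-rolled D sketch of the bit distribution, assuming the input is a valid Unicode scalar value (no surrogates, at most U+10FFFF); Phobos' std.utf.encode does this for real:]

```d
// Encode one Unicode scalar value into 1..4 UTF-8 code units,
// following Table 3-5. Returns the number of bytes written to buf.
int encodeUTF8(dchar c, ubyte[] buf)
{
    if (c <= 0x7F)
    {
        buf[0] = cast(ubyte)c;                          // 0xxxxxxx
        return 1;
    }
    if (c <= 0x7FF)
    {
        buf[0] = cast(ubyte)(0xC0 | (c >> 6));          // 110yyyyy
        buf[1] = cast(ubyte)(0x80 | (c & 0x3F));        // 10xxxxxx
        return 2;
    }
    if (c <= 0xFFFF)
    {
        buf[0] = cast(ubyte)(0xE0 | (c >> 12));         // 1110zzzz
        buf[1] = cast(ubyte)(0x80 | ((c >> 6) & 0x3F)); // 10yyyyyy
        buf[2] = cast(ubyte)(0x80 | (c & 0x3F));        // 10xxxxxx
        return 3;
    }
    buf[0] = cast(ubyte)(0xF0 | (c >> 18));             // 11110uuu
    buf[1] = cast(ubyte)(0x80 | ((c >> 12) & 0x3F));    // 10uuzzzz
    buf[2] = cast(ubyte)(0x80 | ((c >> 6) & 0x3F));     // 10yyyyyy
    buf[3] = cast(ubyte)(0x80 | (c & 0x3F));            // 10xxxxxx
    return 4;
}
```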
---

January 07, 2004 Re: Unicode discussion
Posted in reply to Elias Martenson

First: I'm new to D and my English is bad.

I really like UTF-8, but the truth is that it is not efficient all the time (local character access...). In a small number of C/C++ programs I needed to use internal UTF-32 instead of UTF-8. Later, though, I introduced a hack: I indexed the UTF-8 character number/position and kept a standard UTF-8 vector. The memory needed is lower than with UTF-32 in my most frequent cases, and in my experience the memory efficiency is better than UTF-32. This works very well in Latin and CJK languages (the two I normally use), but for Cyrillic, Arabic, ... the memory use can be bigger than with UTF-32. Still, with an efficient indexing system we can equal the memory needed by UTF-32, and while the performance penalty is about 8 times slower than a UTF-32 implementation, compared to the penalty of standard UTF-8 indexing it is very fast.

I recommend adding:

stringi -> indexed string for UTF-8

and the possibility to mark the internal representation of the UTF type, like:

string utf8-32 -> this marks a UTF-8 string, but it works internally as UTF-32
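[A rough sketch of what such an indexed UTF-8 string could look like; "stringi" is the poster's hypothetical name and the details here are mine. Recording the byte offset of every STEP-th character keeps the index far smaller than a full UTF-32 copy, and random access scans at most STEP-1 characters from the nearest checkpoint. Assumes the std.utf routines of the Phobos of the day.]

```d
import std.utf;

const uint STEP = 16; // one checkpoint per 16 characters

struct IndexedString
{
    char[] data;  // plain UTF-8 storage
    uint[] marks; // marks[k] = byte offset of character k*STEP

    void build(char[] s)
    {
        data = s;
        marks.length = 0;
        uint n = 0;
        for (uint i = 0; i < s.length; i += std.utf.stride(s, i), n++)
            if (n % STEP == 0)
                marks ~= i; // remember where this character starts
    }

    // n-th character: jump to the nearest checkpoint, then scan forward
    dchar charAt(uint n)
    {
        uint i = marks[n / STEP];
        for (uint k = n % STEP; k != 0; k--)
            i += std.utf.stride(data, i);
        return std.utf.decode(data, i);
    }
}
```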