December 21, 2003 Re: Unicode discussion
Posted in reply to Rupert Millard

In article <bs4ea9$jo2$1@digitaldaemon.com>, Rupert Millard says...

>>> I would think that the datatype char would be a UTF-8 character, with no indication of the amount of storage it used. The compiler would be free to represent it internally however it chose. Indexing should work (perhaps inefficiently).
>>
>> That would be a higher level view of it, and I suggest a wrapper class around it can provide this.
>
> On Friday 19th, I posted a class that provides this functionality to this thread.

I'm sorry to interrupt (I'm one of the clueless here; in fact I call this the unicorn discussion), but isn't Vathix's String class supposed to cover that?

http://www.digitalmars.com/drn-bin/wwwnews?D/19525

It's bigger so it must be better ;)

Ant
---

December 21, 2003 Re: Unicode discussion
Posted in reply to Walter

I think this discussion of the "language being wrong" is itself wrong. It is clear that char[], char, and the other associated types don't have sensible higher-level semantics; the examples are many. But I find it quite right for the language not to constrain programmers to high-level types. That is a job for the library.

Now, everyone: Walter has quite enough to do of what he does better than all of us. Improving the standard library is a job he delegates to us.

A library class or struct String should be indexed by real character scanning, not by address, even if that means more overhead. The result of such indexing, as well as any single-character access, would be a dchar. The internal representation should still be accessible, for the case where someone finds the high-level semantics a bottleneck within his application.

Besides, Mark and I proposed a number of solutions a while ago which would give strings non-standard storage, but would allow the high-level representation to be significantly faster, at the cost of ease of operating on a lower-level representation.

-eye
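[A minimal sketch of what such a String could look like, assuming the std.utf routines of the Phobos of the day; the names and details are illustrative, not a concrete proposal:]

```d
import std.utf;

// Sketch: opIndex scans character by character instead of indexing
// by address, and yields a dchar. The raw UTF-8 data stays
// accessible for anyone who needs the low-level representation.
class String
{
    char[] data;

    this(char[] s) { data = s; }

    // n-th character, not n-th byte; O(n) by design
    dchar opIndex(size_t n)
    {
        uint i = 0;
        while (n-- != 0)
            i += std.utf.stride(data, i); // skip one whole character
        return std.utf.decode(data, i);   // decode the one we landed on
    }
}
```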
---

December 21, 2003 Re: Unicode discussion
Posted in reply to Ant

Ant <Ant_member@pathlink.com> wrote in news:bs4gc8$n2c$1@digitaldaemon.com:

> In article <bs4ea9$jo2$1@digitaldaemon.com>, Rupert Millard says...
>
>>>> I would think that the datatype char would be a UTF-8 character, with no indication of the amount of storage it used. The compiler would be free to represent it internally however it chose. Indexing should work (perhaps inefficiently).
>>>
>>> That would be a higher level view of it, and I suggest a wrapper class around it can provide this.
>>
>> On Friday 19th, I posted a class that provides this functionality to this thread.
>
> I'm sorry to interrupt (I'm one of the clueless here; in fact I call this the unicorn discussion), but isn't Vathix's String class supposed to cover that?
> http://www.digitalmars.com/drn-bin/wwwnews?D/19525
>
> It's bigger so it must be better ;)
>
> Ant

You had me worried here, because I missed that post! However, they do slightly different things, I think. Mine indexes characters rather than bytes in UTF-8 strings; Vathix's does many other string handling things (e.g. changing case). My code needs to be integrated into his, if it can be - I'm not sure what implications his use of templates has.

You're quite correct that, as they currently stand, his is vastly more useful - I can't think of many situations where you need to index whole characters rather than bytes. My main reason for writing it was that I enjoy writing code.

Rupert
---

December 22, 2003 Re: Unicode discussion
Posted in reply to Roald Ribe

"Roald Ribe" <rr.no@spam.teikom.no> wrote in message news:bs4ddt$ig4$1@digitaldaemon.com...

>>> Can't a single UTF-8 character require multiple bytes for representation?
>>
>> No.
>
> ???
> A Unicode character can result in up to 6 bytes used when encoded with UTF-8. Which is what the poster meant to ask, I think.

Sure, perhaps I misunderstood him.
---

December 31, 2003 Re: Unicode discussion
Posted in reply to Roald Ribe

> ???
> A Unicode character can result in up to 6 bytes used when encoded
> with UTF-8.
UTF-8 can represent all Unicode characters with no more than 4 bytes. ISO/IEC 10646 (UCS-4) may require up to 6 bytes in UTF-8, but that is a superset of Unicode.
---

December 31, 2003 Re: Unicode discussion
Posted in reply to Hauke Duden

> I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.

Well, at least one can convert any Unicode string to UTF-8 without risk of losing information.

> Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character. So you may end up using even more memory with UTF-8.

UTF-32 never takes less memory than UTF-8. Period. Any Unicode character takes no more than 4 bytes in UTF-8:

1 byte - ASCII
2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc.
3 bytes - most other scripts in use
4 bytes - rare/dead/special scripts

UTF-8 does mean a multibyte encoding for most languages (except English and maybe some others).

Most European and Asian languages need just one UTF-16 unit per character. For CJK languages the occurrence of UTF-16 surrogates in real texts is estimated at less than 1%. The other scripts encoded in the "higher planes" cover very rare or dead languages and some special symbols.

So in most cases a UTF-16 string can be treated as a simple array of UCS-2 characters; you just need to know whether it contains surrogates, i.e. whether number_of_characters < number_of_16bit_units.
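[A small D sketch of the byte counts claimed above, together with the surrogate test from the last line; the function names are mine, not library routines:]

```d
// Bytes a Unicode scalar value occupies in UTF-8 (never more than 4).
int utf8Bytes(dchar c)
{
    if (c <= 0x7F)   return 1; // ASCII
    if (c <= 0x7FF)  return 2; // Latin extended, Cyrillic, Greek, Hebrew, Arabic, ...
    if (c <= 0xFFFF) return 3; // most other scripts in use (rest of the BMP)
    return 4;                  // rare/dead/special scripts (higher planes)
}

// A UTF-16 string contains surrogate pairs exactly when it encodes
// fewer characters than it has 16-bit units.
bool hasSurrogates(wchar[] s)
{
    foreach (wchar u; s)
        if (u >= 0xD800 && u <= 0xDBFF) // high (leading) surrogate
            return true;
    return false;
}
```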
---

December 31, 2003 Re: Unicode discussion
Posted in reply to Serge K

"Serge K" <skarebo@programmer.net> wrote in message news:bst8q3$218i$1@digitaldaemon.com...

>> I don't see how the design of the UTF-8 encoding adds any advantage over other multibyte encodings that might cause people to use it properly.
>
> Well, at least one can convert any Unicode string to UTF-8 without risk of losing information.

This is a good point. But I stand my ground: it may result in up to 6 bytes used for each character (worst case).

>> Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character. So you may end up using even more memory with UTF-8.
>
> UTF-32 never takes less memory than UTF-8. Period.
> Any Unicode character takes no more than 4 bytes in UTF-8:
> 1 byte - ASCII
> 2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc.
> 3 bytes - most other scripts in use
> 4 bytes - rare/dead/special scripts

This is wrong. Read up on UTF-8 encoding.

> UTF-8 does mean a multibyte encoding for most languages (except English and maybe some others).

Right.

> Most European and Asian languages need just one UTF-16 unit per character.

Yes, most, but not all.

> For CJK languages the occurrence of UTF-16 surrogates in real texts is estimated at less than 1%.

The code to handle it still has to be present...

> The other scripts encoded in the "higher planes" cover very rare or dead languages and some special symbols.
>
> So in most cases a UTF-16 string can be treated as a simple array of UCS-2 characters.

Yes, but "most cases" is not a good argument when the original discussion was initiated to handle ALL languages, in a way that the developer would find "natural", easy, and integrated in the D language.

> You just need to know whether it contains surrogates, i.e. whether number_of_characters < number_of_16bit_units.

There is no such thing as "just" with these issues (IMHO) ;-)

Roald
---

December 31, 2003 Re: Unicode discussion
Posted in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:brll85$1oko$1@digitaldaemon.com...

> "Elias Martenson" <no@spam.spam> wrote in message news:pan.2003.12.15.23.07.24.569047@spam.spam...
>
>> Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.
>
> In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.
>
>> The overloading issue is interesting, but may I suggest that char and wchar are at least renamed to something more appropriate? Maybe utf8byte and utf16byte? I feel it's important to point out that they aren't characters.
>
> I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, so I don't see much of an issue here.
>
>> And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such functions doesn't help either.
>
> I think the library functions should be improved to handle Unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.
>
>> I'd love to help out and do these things. But two things are needed first:
>> - At least one other person needs to volunteer. I've had bad experiences when one person does this by himself.
>
> You're not by yourself. There's a whole D community here!
>
>> - The core concepts need to be decided upon. Things seem to be somewhat in flux right now, with three different string types and all. At the very least it needs to be decided what a "string" really is: is it a UTF-8 byte sequence or a UTF-32 character sequence? I haven't hid the fact that I would prefer the latter.
>
> A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.
>
>>> That's correct as well. The library's support for Unicode is inadequate. But there also is a nice package (std.utf) which will convert between char[], wchar[], and dchar[]. This can be used to convert the text strings into whatever Unicode stream type the underlying operating system API supports. (For win32 this would be UTF-16; I am unsure what linux supports.)
>>
>> Yes. But this would then assume that char[] is always in native encoding, which doesn't rhyme very well with the assertion that char[] is a UTF-8 byte sequence. Or, the specification could be read as saying that the stream actually performs native decoding to UTF-8 when reading into a char[] array.
>
> char[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.
>
>> Unless fundamental encoding/decoding is embedded in the streams library, it would be best to simply read text data into a byte array and then perform native decoding manually afterwards using functions similar to the C mbstowcs() and wcstombs(). The drawback to this is that you cannot read text data in platform encoding without copying through a separate buffer, even in cases when this is not needed.
>
> If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. If you are reading code page or locale dependent strings, putting them into a char[] will require running them through a conversion.
>
>>> D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name APIs to UTF-16 via std.utf.toUTF16z(), for use calling the win32 APIs.
>>
>> This can be done in a much better, platform independent way, by using the native<->unicode conversion routines.
>
> The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue. I feel that by designing D around UTF-8, UTF-16 and UTF-32, the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.
>
>> In C, as already mentioned, these are called mbstowcs() and wcstombs(). For Windows, these would convert to and from UTF-16. For Unix, these would convert to and from whatever encoding the application is running under (dictated by the LC_CTYPE environment variable). There really is no need to make the APIs platform dependent in any way here.
>
> After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF. This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!)
>
> UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ASCII code, you can write portable code that is fully capable of handling the complete Unicode character set.

Following this discussion, I have read some more on the subject. In addition to the speed issues that were mentioned, I have had some insights on the issues of endianness, serialization, BOM (Byte Order Mark), and more. Most of it can be found in a reasonably short PDF document: http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

There is even more to this than I first believed... Based on the new knowledge, I become more and more convinced that the choice of UTF-8 encoding as the basic "correct thing to do" for general use in a programming language is well founded. But when text _processing_ comes into play, other rules apply.

But: I still find it objectionable to call one byte in a UTF-8/Unicode based language a char! ;-) The naming will of course make it easier to do a straight port from C to D, but such a port will in most cases be of no use on the "international scene". Oh well, this can be argued well both ways, I guess...

IMHO there should be no char type at all. Only byte. Or maybe, to take more sizes into consideration: bin8, bin16, bin32, bin64... I think porting from C to D should involve renaming chars to bin8's.

Hmmm... It is sad when learning more makes you want to change less ;-) Anyway, there is more to be learned...

Roald
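[As a concrete illustration of the std.utf round-trips Walter describes above; a sketch, assuming the Phobos signatures of the time, with the win32 call only hinted at:]

```d
import std.utf;

void example()
{
    char[] u8 = "héllo";               // UTF-8, as all char[] strings are

    wchar[] u16 = std.utf.toUTF16(u8); // UTF-8  -> UTF-16
    dchar[] u32 = std.utf.toUTF32(u8); // UTF-8  -> UTF-32
    char[] back = std.utf.toUTF8(u16); // UTF-16 -> UTF-8, losslessly

    version (Win32)
    {
        // zero-terminated UTF-16, ready for a win32 "W" API such as CreateFileW
        wchar* z = std.utf.toUTF16z(u8);
    }
}
```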
---

January 03, 2004 Re: Unicode discussion
Posted in reply to Roald Ribe

>>> Actually, depending on your language, UTF-32 can also be better than UTF-8. If you use a language that uses the upper Unicode characters then UTF-8 will use 3-5 bytes per character. So you may end up using even more memory with UTF-8.
>>
>> UTF-32 never takes less memory than UTF-8. Period.
>> Any Unicode character takes no more than 4 bytes in UTF-8:
>> 1 byte - ASCII
>> 2 bytes - Latin extended, Cyrillic, Greek, Hebrew, Arabic, etc.
>> 3 bytes - most other scripts in use
>> 4 bytes - rare/dead/special scripts
>
> This is wrong. Read up on UTF-8 encoding.

RTFM. [The Unicode Standard, Version 4.0]

The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.

D36. UTF-8 encoding form: The Unicode encoding form which assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length, as specified in Table 3-5.

Table 3-5. UTF-8 Bit Distribution

| Scalar Value | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|---|---|---|---|---|
| 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |

[Appendix C: Relationship to ISO/IEC 10646]

C.3 UCS Transformation Formats: UTF-8

The term UTF-8 stands for UCS Transformation Format, 8-bit form. UTF-8 is an alternative coded representation form for all of the characters of ISO/IEC 10646. The ISO/IEC definition is identical in format to UTF-8 as described under definition D36 in Section 3.9, Unicode Encoding Forms. ... The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as an encoding form of Unicode characters.
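[Table 3-5 translates directly into code. A hand-rolled D sketch of the bit distribution, assuming the input is a valid Unicode scalar value (no surrogates, at most U+10FFFF); Phobos' std.utf.encode does this for real:]

```d
// Encode one Unicode scalar value into 1..4 UTF-8 code units,
// following Table 3-5. Returns the number of bytes written to buf.
int encodeUTF8(dchar c, ubyte[] buf)
{
    if (c <= 0x7F)
    {
        buf[0] = cast(ubyte)c;                          // 0xxxxxxx
        return 1;
    }
    if (c <= 0x7FF)
    {
        buf[0] = cast(ubyte)(0xC0 | (c >> 6));          // 110yyyyy
        buf[1] = cast(ubyte)(0x80 | (c & 0x3F));        // 10xxxxxx
        return 2;
    }
    if (c <= 0xFFFF)
    {
        buf[0] = cast(ubyte)(0xE0 | (c >> 12));         // 1110zzzz
        buf[1] = cast(ubyte)(0x80 | ((c >> 6) & 0x3F)); // 10yyyyyy
        buf[2] = cast(ubyte)(0x80 | (c & 0x3F));        // 10xxxxxx
        return 3;
    }
    buf[0] = cast(ubyte)(0xF0 | (c >> 18));             // 11110uuu
    buf[1] = cast(ubyte)(0x80 | ((c >> 12) & 0x3F));    // 10uuzzzz
    buf[2] = cast(ubyte)(0x80 | ((c >> 6) & 0x3F));     // 10yyyyyy
    buf[3] = cast(ubyte)(0x80 | (c & 0x3F));            // 10xxxxxx
    return 4;
}
```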
---

January 07, 2004 Re: Unicode discussion
Posted in reply to Elias Martenson

First: I'm new to D and my English is bad.

I really like UTF-8, but the truth is that it is not efficient all the time (local character access...). In a small number of C/C++ programs I needed to use internal UTF-32 instead of UTF-8. Later, though, I introduced a hack: I indexed the UTF-8 character number/position and kept a standard UTF-8 vector. The memory needed is lower than with UTF-32 in my most frequent cases, and in my experience the memory efficiency is better than UTF-32. This works very well in Latin and CJK languages (the two I normally use), but for Cyrillic, Arabic, ... the memory use can be bigger than with UTF-32. Still, with an efficient indexing system we can equal the memory needed by UTF-32, and while the performance penalty is about 8 times slower than a UTF-32 implementation, compared to the penalty of standard UTF-8 indexing it is very fast.

I recommend adding:

stringi -> indexed string for UTF-8

and the possibility to mark the internal representation of the UTF type, like:

string utf8-32 -> this marks a UTF-8 string, but it works internally as UTF-32
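[A rough sketch of what such an indexed UTF-8 string could look like; "stringi" is the poster's hypothetical name and the details here are mine. Recording the byte offset of every STEP-th character keeps the index far smaller than a full UTF-32 copy, and random access scans at most STEP-1 characters from the nearest checkpoint. Assumes the std.utf routines of the Phobos of the day.]

```d
import std.utf;

const uint STEP = 16; // one checkpoint per 16 characters

struct IndexedString
{
    char[] data;  // plain UTF-8 storage
    uint[] marks; // marks[k] = byte offset of character k*STEP

    void build(char[] s)
    {
        data = s;
        marks.length = 0;
        uint n = 0;
        for (uint i = 0; i < s.length; i += std.utf.stride(s, i), n++)
            if (n % STEP == 0)
                marks ~= i; // remember where this character starts
    }

    // n-th character: jump to the nearest checkpoint, then scan forward
    dchar charAt(uint n)
    {
        uint i = marks[n / STEP];
        for (uint k = n % STEP; k != 0; k--)
            i += std.utf.stride(data, i);
        return std.utf.decode(data, i);
    }
}
```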