January 24, 2003 Re: Unicode in D
Posted in reply to Theodore Reed

> In UTF-8 a glyph can be 1-4 bytes.

Only if you live within the same dynamic range as UTF-16. To get the full effective UTF-8 dynamic range of 32 bits, UTF-8 employs up to six bytes. With 4 bytes it has the same range as UTF-16.

"The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set." http://www.unicode.org/reports/tr27

> The issue isn't really the space. It's the difficulty in dealing with an encoding where you don't know how long the next glyph will be without reading it.

Exactly. UTF-16 can have at most one extra code (in roughly 1% of cases). So you have either one 16-bit word, or two. UTF-8 is the absolute worst encoding in this regard. UTF-32 is the best (constant size).

The main selling point for D is that UTF-16 is the standard for Windows. Windows is built on it. Knowing Microsoft, they probably use a "slightly modified Microsoft version" of UTF-16...that would not surprise me at all.

Mark
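
[Editor's note: a minimal sketch, not from the original thread, illustrating the variable sequence lengths discussed above. It is written in present-day D, and the helper function is hypothetical, not part of any library. Lengths 5 and 6 correspond to the old ISO/IEC 10646 forms, which are illegal in Unicode UTF-8.]

```d
import std.stdio;

// Length of a UTF-8 sequence, determined from its lead byte alone.
uint utf8SequenceLength(ubyte lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: up to U+10FFFF
    if ((lead & 0xFC) == 0xF8) return 5; // 111110xx: outside Unicode
    if ((lead & 0xFE) == 0xFC) return 6; // 1111110x: outside Unicode
    return 0;                            // continuation byte or invalid lead
}

void main()
{
    auto bytes = cast(immutable(ubyte)[]) "Aé€𐍈"; // 1-, 2-, 3- and 4-byte sequences
    foreach (ubyte b; bytes)
    {
        auto n = utf8SequenceLength(b);
        if (n) writefln("lead byte %02X starts a %d-byte sequence", b, n);
    }
}
```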

January 29, 2003 Re: Unicode in D
Posted in reply to Mark Evans

"Mark Evans" <Mark_member@pathlink.com> wrote in message news:b0qp5g$n73$1@digitaldaemon.com...
> > In UTF-8 a glyph can be 1-4 bytes.
>
> Only if you live within the same dynamic range as UTF-16. To get the full effective UTF-8 dynamic range of 32 bits, UTF-8 employs up to six bytes. With 4 bytes it has the same range as UTF-16.

Actually, UTF-8, UTF-16 and UTF-32 all have the same range: [0..10FFFFh]. The UTF-8 encoding method can be extended up to six bytes max to encode the UCS-4 character set, but that is way beyond Unicode.

> "The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set." http://www.unicode.org/reports/tr27

Please do not post truncated citations. "The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as a transformation of Unicode characters."

> > The issue isn't really the space. It's the difficulty in dealing with an encoding where you don't know how long the next glyph will be without reading it.
>
> Exactly. UTF-16 can have at most one extra code (in roughly 1% of cases). So you have either one 16-bit word, or two. UTF-8 is the absolute worst encoding in this regard. UTF-32 is the best (constant size).

For real-world applications, UTF-16 strings have to use those surrogates only to access the CJK Ideograph extensions (~43000 characters). In most cases a UTF-16 string can be treated as an array of UCS-2 characters. A String object can include its length both in 16-bit units and in characters: if these numbers are equal, it's a UCS-2 string with no surrogates inside.

> The main selling point for D is that UTF-16 is the standard for Windows. Windows is built on it. Knowing Microsoft, they probably use a "slightly modified Microsoft version" of UTF-16...that would not surprise me at all.

Surprise... It's regular UTF-16. >8-P (Starting with Win2K + SP.) WinNT 3.x & 4 support UCS-2 only, since that was the Unicode 2.0 encoding. Any efficient programming language must use UTF-16 for its Windows implementation - otherwise it has to convert strings for every API function requiring string parameters...
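
[Editor's note: a minimal sketch, not from the original thread, of the check Serge describes, written in present-day D. A UTF-16 string without surrogate pairs has exactly one 16-bit unit per character; any unit in the range 0xD800..0xDFFF marks a surrogate. The function name is hypothetical.]

```d
import std.stdio;

bool hasSurrogates(const(wchar)[] s)
{
    foreach (wchar u; s)               // iterate 16-bit code units, no decoding
        if (u >= 0xD800 && u <= 0xDFFF)
            return true;
    return false;
}

void main()
{
    wstring plain    = "Здравствуй"w;  // Cyrillic: BMP only, one unit per character
    wstring withPair = "\U00020000"w;  // CJK Extension B character, stored as a surrogate pair
    writeln(hasSurrogates(plain));     // false - safe to treat as UCS-2
    writeln(hasSurrogates(withPair));  // true
}
```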

February 04, 2003 Re: Unicode in D
Posted in reply to Serge K

"Serge K" <skarebo@programmer.net> wrote in message news:b17cd6$2n1l$1@digitaldaemon.com...
> Any efficient prog. language must use UTF-16 for Windows implementation - otherwise it have to convert strings for any API function requiring string parameters...

Not necessarily. While Win32 is now fully UTF-16 internally, and apparently converts the strings in "A" api functions to UTF-16, because UTF-16 uses double the memory it can still be far more efficient for an app to do all its computation with UTF-8, and then convert when calling the windows api.
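
[Editor's note: a minimal sketch, not from the original thread, of the pattern Walter describes - keep strings as UTF-8 internally and convert only at the Windows API boundary. It uses std.utf.toUTF16z from present-day Phobos, which did not exist in this form in 2003, and it only builds on Windows.]

```d
version (Windows)
{
    import core.sys.windows.windows; // MessageBoxW, MB_OK
    import std.utf : toUTF16z;       // UTF-8 string -> null-terminated UTF-16

    void show(string text, string caption)
    {
        // The conversion happens once, right at the call site;
        // everything upstream stays char[] (UTF-8).
        MessageBoxW(null, text.toUTF16z, caption.toUTF16z, MB_OK);
    }

    void main()
    {
        show("Привет, D!", "Unicode in D");
    }
}
```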

February 04, 2003 Re: Unicode in D
Posted in reply to Walter

On Mon, 3 Feb 2003 15:37:37 -0800 "Walter" <walter@digitalmars.com> wrote:
> "Serge K" <skarebo@programmer.net> wrote in message news:b17cd6$2n1l$1@digitaldaemon.com...
> > Any efficient prog. language must use UTF-16 for Windows implementation - otherwise it have to convert strings for any API function requiring string parameters...
>
> Not necessarilly. While Win32 is now fully UTF-16 internally, and apparently converts the strings in "A" api functions to UTF-16, because UTF-16 uses double the memory it can still be far more efficient for an app to do all its computation with UTF-8, and then convert when calling the windows api.

Plus, UTF-8 is pretty standard for Unicode on Linux. I believe BeOS used it, too, although I could be wrong. I don't know what OSX uses, nor other unices. My point is that choosing a standard by what the underlying platform uses is a bad idea.

--
Theodore Reed (rizen/bancus) -==- http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"[...] for plainly, although every work of art is an expression, not every expression is a work of art." -- DeWitt H. Parker, "The Principles of Aesthetics"

February 05, 2003 Arrays as aggregate elements vs arrays as bytes (was: Unicode in D)
Posted in reply to Ben Hinkle

I've read through what I could find on the thread about char[] and I find myself disagreeing with the idea that char[n] should return the n'th byte, regardless of the width of a character.

My reasons are simple. When I have an array of, say, ints, I don't expect that int[n] will give me the n'th byte of the array of numbers. I fully expect that the n'th integer will be what I get.

I see no reason why this should not hold for arrays of characters.

I do expect that there are times when it would be useful to access an array of TYPE (where TYPE is int, char, etc) at the byte level, but it strikes me that some interface between an array of TYPE elements and that array as an array of BYTE's (i.e. using the byte type) would be VERY USEFUL, and would address concerns in wanting to access characters in their raw byte form. Indexing of the equivalent of a byte pointer to a TYPE array, perhaps formulated in syntactic sugar, would achieve this. I would personally prefer a language-specific way to byte access an aggregate rather than use pointers to achieve what the language should provide anyway.

Please note that the above statements stand REGARDLESS of the encoding chosen, be it UTF-8 or 16 or whatever.
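
[Editor's note: a minimal sketch, not from the original thread, showing that present-day D exposes the kind of byte-level view asked for above without pointer arithmetic: an array slice can be reinterpreted as a ubyte[] over the same storage.]

```d
import std.stdio;

void main()
{
    int[] numbers = [1, 2, 3];
    assert(numbers[1] == 2);             // indexing yields the 2nd int, never the 2nd byte

    // The same storage viewed byte-by-byte:
    ubyte[] raw = cast(ubyte[]) numbers;
    assert(raw.length == numbers.length * int.sizeof);
    writefln("%(%02X %)", raw);          // 01 00 00 00 02 00 00 00 03 00 00 00 (little-endian)

    // Works for strings too: char[] reinterpreted as its UTF-8 bytes.
    string s = "héllo";
    auto bytes = cast(immutable(ubyte)[]) s;
    writeln(bytes.length, " UTF-8 bytes"); // 6 - 'é' occupies two bytes
}
```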

February 05, 2003 Re: Arrays as aggregate elements vs arrays as bytes (was: Unicode in D)
Posted in reply to Shannon Mann

The solution here is to use a char *iterator* instead of using char *indexing*. char indexing will be very slow. char iteration will be very fast.

D needs a good iterator concept. It has a good array concept already, but arrays are not the solution to everything. For instance, serial input or output can't easily be indexed. You don't do: serial_port[47] = character; you do: serial_port.write(character). Those are like iterators (ok well at least in STL, input iterators and output iterators were part of the iterator family).

Sean

"Shannon Mann" <Shannon_member@pathlink.com> wrote in message news:b1rb8q$5i7$1@digitaldaemon.com...
> I've read through what I could find on the thread about char[] and I find myself disagreeing with the idea that char[n] should return the n'th byte, regardless of the width of a character.
>
> My reasons are simple. When I have an array of, say, ints, I don't expect that int[n] will give me the n'th byte of the array of numbers. I fully expect that the n'th integer will be what I get.
>
> I see no reason why this should not hold for arrays of characters.
>
> I do expect that there are times when it would be useful to access an array of TYPE (where TYPE is int, char, etc) at the byte level, but it strikes me that some interface between an array of TYPE elements and that array as an array of BYTE's (i.e. using the byte type) would be VERY USEFUL, and would address concerns in wanting to access characters in their raw byte form. Indexing of the equivalent of a byte pointer to a TYPE array, perhaps formulated in syntactic sugar, would achieve this. I would personally prefer a language-specific way to byte access an aggregate rather than use pointers to achieve what the language should provide anyway.
>
> Please note that the above statements stand REGARDLESS of the encoding chosen, be it UTF-8 or 16 or whatever.
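
[Editor's note: a minimal sketch, not from the original thread, contrasting the two access styles in present-day D: foreach with a dchar loop variable decodes UTF-8 sequentially, which is the iterator-style access argued for above, while s[i] remains a cheap byte index.]

```d
import std.stdio;

void main()
{
    string s = "naïve";                   // 5 characters, 6 UTF-8 code units

    // Indexing: constant time, but it addresses code units (bytes), not characters.
    writeln(s.length);                    // 6
    writefln("0x%02X", cast(ubyte) s[2]); // 0xC3 - first byte of 'ï', not a character

    // Iteration: each step decodes exactly one character, no random access.
    foreach (dchar c; s)
        write(c, ' ');                    // n a ï v e
    writeln();
}
```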

February 14, 2003 Re: Unicode in D
Posted in reply to Theodore Reed

> My point is that choosing a standard by what the underlying platform uses is a bad idea.

I agree with this remark, but think there are plenty of platform-independent reasons for UTF-16. The fact that Windows uses it just cements the case.
Mark

February 14, 2003 Re: Unicode in D
Posted in reply to Walter

Walter says...
>
>> Any efficient prog. language must use UTF-16 for Windows implementation - otherwise it have to convert strings for any API function requiring string parameters...
>
>Not necessarilly. While Win32 is now fully UTF-16 internally, and apparently converts the strings in "A" api functions to UTF-16, because UTF-16 uses double the memory it can still be far more efficient for an app to do all its computation with UTF-8, and then convert when calling the windows api.
>
Memory is cheap and getting cheaper, but processor time never loses value.
The supposition that UTF-8 needs less space is flawed anyway. For some languages, yes -- but not all. My earlier citations indicate that long-term, averaging over all languages, UTF-8 and UTF-16 will require equivalent memory storage.
UTF-8 code is also harder to write because UTF-8 is just more complicated than UTF-16. The only reason for its popularity is that it's a fig leaf for people who really want to use ASCII. They can use ASCII and call it UTF-8. Not very forward-thinking.
Microsoft had good reasons for selecting UTF-16 and D should follow suit. Other languages are struggling with Unicode support, and it would be nice to have one language out up front in this area.
Mark

February 17, 2003 Re: Unicode in D
Posted in reply to Mark Evans

> The supposition that UTF-8 needs less space is flawed anyway. For some languages, yes -- but not all. My earlier citations indicate that long-term, averaging over all languages, UTF-8 and UTF-16 will require equivalent memory storage.
>
> UTF-8 code is also harder to write because UTF-8 is just more complicated than UTF-16. The only reason for its popularity is that it's a fig leaf for people who really want to use ASCII. They can use ASCII and call it UTF-8. Not very forward-thinking.
>
> Microsoft had good reasons for selecting UTF-16 and D should follow suit. Other languages are struggling with Unicode support, and it would be nice to have one language out up front in this area.
>
> Mark

http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/index.html?dwzone=unicode
["Forms of Unicode", Mark Davis, IBM developer and President of the Unicode Consortium, IBM]

"Storage vs. performance

Both UTF-8 and UTF-16 are substantially more compact than UTF-32, when averaging over the world's text in computers. UTF-8 is currently more compact than UTF-16 on average, although it is not particularly suited for East-Asian text because it occupies about 3 bytes of storage per code point. UTF-8 will probably end up as about the same as UTF-16 over time, and may end up being less compact on average as computers continue to make inroads into East and South Asia. Both UTF-8 and UTF-16 offer substantial advantages over UTF-32 in terms of storage requirements."

{ btw, about storage: I've converted a 300KB text file (a Russian book) into UTF-8 - it took about ~1.85 bytes per character. The little compression compared to UTF-16 comes mostly from "spaces" and punctuation marks, and it's hardly worth the processing complexity. }

"Code-point boundaries, iteration, and indexing are very fast with UTF-32. Code-point boundaries, accessing code points at a given offset, and iteration involve a few extra machine instructions for UTF-16; UTF-8 is a bit more cumbersome."

{ Occurrence of UTF-16 surrogates in real texts is estimated at <1% for CJK languages. Other scripts encoded in the "higher planes" cover very rare or dead languages and some special symbols (like modern & old music symbols). So, if a String object can identify the absence of surrogates, faster functions can be used in most cases. The same optimization works for UTF-8, but only in the US-niverse (even the British pound takes 2 bytes.. 8-) }

"Ultimately, the choice of which encoding format to use will depend heavily on the programming environment. For systems that only offer 8-bit strings currently, but are multi-byte enabled, UTF-8 may be the best choice. For systems that do not care about storage requirements, UTF-32 may be best. For systems such as Windows, Java, or ICU that use UTF-16 strings already, UTF-16 is the obvious choice. Even if they have not yet upgraded to fully support surrogates, they will be before long. If the programming environment is not an issue, UTF-16 is recommended as a good compromise between elegance, performance, and storage."
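
[Editor's note: a minimal sketch, not from the original thread, measuring the storage trade-off quoted above. In present-day D, string/wstring are UTF-8/UTF-16, so converting a sample and comparing byte counts reproduces the kind of numbers Serge reports for his Russian text. The report helper is hypothetical.]

```d
import std.conv : to;
import std.range : walkLength;
import std.stdio;

void report(string label, string s)
{
    auto chars = s.walkLength;             // number of code points
    auto utf8  = s.length;                 // bytes as UTF-8
    auto utf16 = s.to!wstring.length * 2;  // bytes as UTF-16
    writefln("%-8s %2d chars: UTF-8 = %2d bytes, UTF-16 = %2d bytes",
             label, chars, utf8, utf16);
}

void main()
{
    report("English", "hello world"); // UTF-8 wins: 1 byte/char vs 2
    report("Russian", "привет мир");  // roughly even: ~2 bytes/char either way
    report("Chinese", "你好，世界");   // UTF-16 wins: 2 bytes/char vs 3
}
```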