January 20, 2003
Sean L. Palmer wrote:
> "Walter" <walter@digitalmars.com> wrote in message
> news:b0c66n$1mq6$1@digitaldaemon.com...
> 
>>I think the copy-on-write approach to strings is the right idea.
>>Unfortunately, if done by the language semantics, it can have severe adverse
>>performance results (think of a toupper() function, copying the string again
>>each time a character is converted). Using it instead as a coding style,
> 
> 
> Copy-on-write usually doesn't copy unless there's more than one live
> reference to the string.  If you're actively modifying it, it'll only make
> one copy until you distribute the new reference.  Of course that means
> reference counting.  Perhaps the GC could store info about string use.
> 

That's not gonna work, because there's no reliable way you can get this data from the GC outside a mark phase.

The Delphi string implementation is ref-counted, and is said to be extremely slow. So it's better to copy and forget the rest than to count at every assignment. You'll just have one more reason to optimise the GC then. :)

IMO, the amount of copying should be limited by merging the operations together.
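
A minimal sketch of the copy-on-write behaviour Sean describes, written in Python rather than D. The `CowString` class, its `share()` method and the `copies` counter are all made up for illustration, and the explicit share count only stands in for whatever reference tracking a real runtime would use. It suggests that even a character-by-character toupper() copies the buffer at most once, because after the first write the buffer is no longer shared.

```python
class CowString:
    """Toy copy-on-write string (hypothetical, ASCII only)."""

    def __init__(self, text):
        self._buf = bytearray(text, "ascii")
        self._shares = [1]   # one-element list, so all sharers see one count
        self.copies = 0      # how often this handle had to copy its buffer

    def share(self):
        """Hand out another reference to the same buffer without copying."""
        other = CowString.__new__(CowString)
        other._buf = self._buf
        other._shares = self._shares
        other.copies = 0
        self._shares[0] += 1
        return other

    def _make_unique(self):
        # Copy only while somebody else still references the buffer.
        if self._shares[0] > 1:
            self._shares[0] -= 1
            self._buf = bytearray(self._buf)
            self._shares = [1]
            self.copies += 1

    def set_char(self, i, byte):
        self._make_unique()
        self._buf[i] = byte

    def toupper(self):
        # Writes one character at a time, as in Walter's worst case.
        for i in range(len(self._buf)):
            b = self._buf[i]
            if 0x61 <= b <= 0x7A:
                self.set_char(i, b - 0x20)

    def __str__(self):
        return self._buf.decode("ascii")


a = CowString("hello world")
b = a.share()
b.toupper()
print(a, "|", b, "| copies made by b:", b.copies)
```

The cost that remains is the counting itself on every share and write, which is the objection raised against the Delphi approach above.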


January 21, 2003
Walter asked,
>As for why UTF-16 instead of UTF-8, why do you find it preferable?

If one wants to do serious internationalized applications it is mandatory. China, Japan, India for example.  China and India by themselves encompass hundreds of languages and dialects that use non-Western glyphs.

My contacts at the SIL linguistics center in Dallas (heavy-duty Unicode and SGML folks) complain that in their language work, not even UTF-16 is good enough. They push for 32 bits!

I would not go that far, but UTF-16 is a very sensible, capable format for the majority of languages.

Mark


January 21, 2003
"Mark Evans" <Mark_member@pathlink.com> wrote in message news:b0itlo$2a46$1@digitaldaemon.com...
> If one wants to do serious internationalized applications it is mandatory. China, Japan, India for example.  China and India by themselves encompass hundreds of languages and dialects that use non-Western glyphs.

UTF-8 can handle that.

> My contacts at the SIL linguistics center in Dallas (heavy-duty Unicode and SGML
> folks) complain that in their language work, not even UTF-16 is good enough.
> They push for 32 bits!

UTF-16 has 2^20 characters in it. UTF-8 has 2^31 characters.

> I would not go that far, but UTF-16 is a very sensible, capable format for the
> majority of languages.

The only advantage it has over UTF-8 is it is more compact for some languages. UTF-8 is more compact for the rest.
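
The compactness trade-off can be checked directly. Below is a small Python sketch that counts bytes for three short sample strings; the samples and the helper name `encoded_sizes` are chosen here purely for illustration, and such tiny strings say nothing about averages over real text.

```python
samples = {
    "English": "Hello, world",
    "Russian": "Привет, мир",
    "Chinese": "你好，世界",
}


def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes without a BOM)."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
    )


for name, text in samples.items():
    points, utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {points} code points, "
          f"{utf8} bytes as UTF-8, {utf16} bytes as UTF-16")
```

For these samples UTF-8 is smaller for the Latin and Cyrillic text and UTF-16 is smaller for the Chinese text.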


January 21, 2003
Well OK I should have been clearer.  You are right about sheer numerical quantity, but read the FAQ at Unicode.org (excerpted below).  Numerical quantity at the price of variable-width codes is a headache.  UTF-16 has variable width, but not as variable as UTF-8, and nowhere near as frequently.

UTF-16 is the Windows standard.  It's a sweet spot for Unicode, which was originally a pure 16-bit design.  The Unicode leaders advocate UTF-16 and I accept their wisdom.

The "real deal" with UTF-8 is that it's a retrofit to accommodate legacy ASCII that we all know and love.  So again I would argue that UTF-8 qualifies in a certain sense as "legacy support," and should therefore go in the runtime, not the core code.

I'd go even further and not use 'char' with any meaning other than UTF-16.  I never liked the Windows char/wchar goofiness.  A language should only have one type of char and the runtimes can support conversions of language-standard chars to other formats.  Trying to shimmy 'alternative characters' into C was a bad idea.  The wonderful thing about designing a new language is that you can do it right.  (Implementation details at http://www.unicode.org/reports/tr27/ )

Mark

http://www.unicode.org/faq/utf_bom.html
-----------------------------------------------
"Most Unicode APIs are using UTF-16."
-----------------------------------------------
"UTF-8 will be most common on the web. UTF16, UTF16LE, UTF16BE are used by Java
and Windows."
[BE and LE mean Big Endian and Little Endian.]
-----------------------------------------------
"Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts."
-----------------------------------------------
[UTF-8 uses anywhere from 1 to 4 code units (bytes) per character, so it's highly variable.
UTF-16 almost always uses one code unit and, in rare cases under 1%, two; but no more.
This is important in the Asian context:]
"East Asians (Chinese, Japanese, and Koreans) ... are well acquainted with
the problems that variable-width codes ... have caused.... With UTF-16,
relatively few characters require 2 units. The vast majority of characters in
common use are single code units.  Even in East Asian text, the incidence of
surrogate pairs should be well less than 1% of all text storage on average."
-----------------------------------------------
"Furthermore, both Unicode and ISO 10646 have policies in place that formally limit even the UTF-32 encoding form to the integer range that can be expressed with UTF-16 (or 21 significant bits)."
-----------------------------------------------
"We don't anticipate a general switch to UTF-32 storage for a long time (if ever)....The chief selling point for Unicode was providing a representation for all the world's characters.... These features were enough to swing industry to the side of using Unicode (UTF-16)."
-----------------------------------------------
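
To make the "one unit, rarely two" point concrete, here is a short Python sketch of how a code point maps to UTF-16 code units. The function name `utf16_units` is invented for this example; the arithmetic is the standard surrogate-pair scheme.

```python
def utf16_units(code_point):
    """Encode one Unicode code point as a list of 16-bit UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]              # Basic Multilingual Plane: one unit
    offset = code_point - 0x10000        # 20 bits are left to encode
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]


for cp in (0x41, 0x4E2D, 0x1D11E):
    units = ", ".join(f"0x{u:04X}" for u in utf16_units(cp))
    print(f"U+{cp:04X} -> {units}")
```

Everything up to U+FFFF, which covers the commonly used East Asian characters, stays at a single unit; only the supplementary planes need the pair.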



January 21, 2003
Quick follow-up.  Even the extra space in UTF-8 will probably not be used in the future, and UTF-8 vs. UTF-16 are going to be neck-and-neck in terms of storage/performance over time.  So I see no compelling reason for UTF-8 except its legacy ties to 7-bit ASCII.  I think of UTF-8 as "ASCII with Unicode paint."

Mark

http://www-106.ibm.com/developerworks/library/utfencodingforms/
"Storage vs. performance
Both UTF-8 and UTF-16 are substantially more compact than UTF-32, when averaging
over the world's text in computers. UTF-8 is currently more compact than UTF-16
on average, although it is not particularly suited for East-Asian text because
it occupies about 3 bytes of storage per code point. UTF-8 will probably end up
as about the same as UTF-16 over time, and may end up being less compact on
average as computers continue to make inroads into East and South Asia. Both
UTF-8 and UTF-16 offer substantial advantages over UTF-32 in terms of storage
requirements."

http://czyborra.com/utf/
"Actually, UTF-8 continues to represent up to 31 bits with up to 6 bytes, but it
is generally expected that the one million code points of the 20 bits offered by
UTF-16 and 4-byte UTF-8 will suffice to cover all characters and that we will
never get to see any Unicode character definitions beyond that."


January 22, 2003
Mark Evans wrote:
> Walter asked,
> 
>>As for why UTF-16 instead of UTF-8, why do you find it preferable?
> 
> 
> If one wants to do serious internationalized applications it is mandatory.
> China, Japan, India for example.  China and India by themselves encompass
> hundreds of languages and dialects that use non-Western glyphs.
> 
> My contacts at the SIL linguistics center in Dallas (heavy-duty Unicode and SGML
> folks) complain that in their language work, not even UTF-16 is good enough.
> They push for 32 bits!

Could someone explain to me *what's the difference*? I thought there was one Unicode set, which encodes *everything*. Then, there are different "wrappings" of it, like UTF8, 16 and so on. They do the same by assigning blocks, where multiple "characters" of 8, 16, or some other number of bits compose a final character value. And a lot of optimisation can be done, because it is not likely that each next symbol will be from a different language, since natural language usually consists of words, sentences, and so on. In UTF8 there are sequences, consisting of header-data, where the header encodes the language/code and the length of the text, so that some data is generalized and need not be transferred with every symbol, and so that a character in a certain encoding can take as many target system characters as it needs.

As far as I understood, UTF7 is the shortest encoding for Latin text, but it would be less optimal for some multi-hundred-character sets than a generally wider encoding.

Please, someone correct me if I'm wrong. But if I'm right, Russian, Arabic, and other "tiny" alphabets would only experience a minor "fat-ratio" with UTF8, since they require not many more symbols than Latin. That is, only headers and no further overhead.

Can anyone tell me: take the same newspaper article in Chinese, Japanese, or some other "wide" language, encoded in UTF7, 8, 16, 32 and so on: how much space would it take? Which languages suffer more and which less from "small" UTF encodings?

-i.

> 
> I would not go that far, but UTF-16 is a very sensible, capable format for the
> majority of languages.
> 
> Mark
> 
> 

January 22, 2003
Ilya Minkov wrote:
> Could someone explain to me *what's the difference*? ...

I see my understanding confirmed.



> Can anyone tell me: take the same newspaper article in Chinese, Japanese, or some other "wide" language, encoded in UTF7, 8, 16, 32 and so on: how much space would it take? Which languages suffer more and which less from "small" UTF encodings?

This question remains.


-i.

January 22, 2003
Ilya Minkov says...
>Could someone explain to me *what's the difference*?

Take the trouble to read through the links supplied in the previous posts before asking redundant questions like this.

Mark


January 22, 2003
On Wed, 22 Jan 2003 15:26:56 +0100
Ilya Minkov <midiclub@tiscali.de> wrote:

> sentences, and so on. In UTF8 there are sequences, consisting of header-data, where the header encodes the language/code and the length of the text, so that some data is generalized and need not be transferred with every symbol, and so that a character in a certain encoding can take as many target system characters as it needs.

That's not how UTF-8 works (although I've thought an RLE scheme like the one you describe would be pretty good). In UTF-8 a glyph can be 1-4 bytes. If the Unicode value is below 0x80, it takes one byte. If it's between 0x80 and 0x7FF (inclusive), it takes two, etc.
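
The byte-count rule can be written out as a short Python sketch. The function name `utf8_length` is made up here; the thresholds are the usual UTF-8 ranges, with the upper limit of U+10FFFF that Unicode now imposes.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for one Unicode code point."""
    if code_point < 0:
        raise ValueError("negative code point")
    if code_point < 0x80:
        return 1        # ASCII
    if code_point < 0x800:
        return 2        # e.g. accented Latin, Greek, Cyrillic, Hebrew, Arabic
    if code_point < 0x10000:
        return 3        # rest of the Basic Multilingual Plane, incl. most CJK
    if code_point <= 0x10FFFF:
        return 4        # supplementary planes
    raise ValueError("beyond the Unicode range")


for cp in (0x41, 0x416, 0x627, 0x4E2D, 0x1D11E):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```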

> As far as I understood, UTF7 is the shortest encoding for Latin text, but it would be less optimal for some multi-hundred-character sets than a generally wider encoding.

Quite less than optimal.

> Please, someone correct me if I'm wrong. But if I'm right, Russian, Arabic, and other "tiny" alphabets would only experience a minor "fat-ratio" with UTF8, since they require not many more symbols than Latin. That is, only headers and no further overhead.

Most Western alphabets would take 1-2 bytes per char. Arabic would also take 2, since its code points fall below 0x800.

> Can anyone tell me: take the same newspaper article in Chinese, Japanese, or some other "wide" language, encoded in UTF7, 8, 16, 32 and so on: how much space would it take? Which languages suffer more and which less from "small" UTF encodings?

UTF-8 takes less space for ASCII-heavy text, though not for East Asian text, where it needs about 3 bytes per glyph against 2 for UTF-16. At most, it takes 4 bytes per glyph, and for many it takes fewer. The issue isn't really the space. It's the difficulty in dealing with an encoding where you don't know how long the next glyph will be without reading it. (Which also means that in order to access the glyph in the middle, you have to start scanning from the front.)
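
A rough Python sketch of what "scanning from the front" means in practice. The helper name `byte_offset_of` is invented for this example, and it assumes the input is already valid UTF-8, so it trusts each lead byte without validation.

```python
def byte_offset_of(data, index):
    """Byte offset of the index-th code point in valid UTF-8 bytes.

    Walks from the start, because a character's width is only known
    once its lead byte has been read.
    """
    offset = 0
    for _ in range(index):
        if offset >= len(data):
            raise IndexError("code point index out of range")
        lead = data[offset]
        if lead < 0x80:
            offset += 1      # 0xxxxxxx
        elif lead < 0xE0:
            offset += 2      # 110xxxxx
        elif lead < 0xF0:
            offset += 3      # 1110xxxx
        else:
            offset += 4      # 11110xxx
    if offset >= len(data):
        raise IndexError("code point index out of range")
    return offset


data = "a\u00e9\u4e2d\U0001d11ez".encode("utf-8")
print([byte_offset_of(data, i) for i in range(5)])
```

The cost grows with the index, whereas a fixed-width encoding would compute the offset with one multiplication.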

-- 
Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"I hold it to be the inalienable right of anybody to go to hell in his own way." -- Robert Frost


January 22, 2003
Then considering UTF-16 might make sense...


I think there is a way to optimise UTF8 though: pre-scan the string and record character width changes in an array.
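
One possible reading of that idea, sketched in Python. The names `build_width_runs` and `indexed_byte_offset` are invented, and the input is assumed to be valid UTF-8. The string is scanned once, a record is stored only where the character width changes, and later lookups use a binary search over those records.

```python
import bisect


def build_width_runs(data):
    """Scan valid UTF-8 once and record where the character width changes.

    Returns (starts, runs): starts[i] is the character index where run i
    begins, runs[i] is (byte offset of the run, width of each character).
    """
    starts, runs = [], []
    offset = index = 0
    while offset < len(data):
        lead = data[offset]
        width = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
        if not runs or runs[-1][1] != width:
            starts.append(index)
            runs.append((offset, width))
        offset += width
        index += 1
    return starts, runs


def indexed_byte_offset(starts, runs, index):
    """Byte offset of a character, found through the width-change table."""
    if index < 0 or not starts:
        raise IndexError("code point index out of range")
    run = bisect.bisect_right(starts, index) - 1
    run_offset, width = runs[run]
    return run_offset + (index - starts[run]) * width


text = "abc\u4e2d\u6587de"
starts, runs = build_width_runs(text.encode("utf-8"))
print(starts, runs, indexed_byte_offset(starts, runs, 4))
```

The table stays small when text keeps to one script for long stretches, and degrades toward one entry per character when widths alternate. This sketch does not check the index against the end of the string.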