May 25, 2013
On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:
> I think you stand alone in your desire to return to code pages.
Nobody is talking about going back to code pages.  I'm talking about going to single-byte encodings, which do not imply the problems that you had with code pages way back when.

> I have years of experience with code pages and the unfixable misery they produce. This has disappeared with Unicode. I find your arguments unpersuasive when stacked against my experience. And yes, I have made a living writing high performance code that deals with characters, and you are quite off base with claims that UTF-8 has inevitable bad performance - though there is inefficient code in Phobos for it, to be sure.
How can a variable-width encoding possibly compete with a constant-width encoding?  You have not articulated a reason for this.  Do you believe there is a performance loss with variable-width, but that it is not significant and therefore worth it?  Or do you believe it can be implemented with no loss?  That is what I asked above, but you did not answer.

> My grandfather wrote a book that consists of mixed German, French, and Latin words, using special characters unique to those languages. Another failing of code pages is it fails miserably at any such mixed language text. Unicode handles it with aplomb.
I see no reason why single-byte encodings wouldn't do a better job at such mixed-language text.  You'd just have to have a larger, more complex header or keep all your strings in a single language, with a different format to compose them together for your book.  This would be so much easier than UTF-8 that I cannot see how anyone could argue for a variable-length encoding instead.

> I can't even write an email to Rainer Schütze in English under your scheme.
Why not?  You seem to think that my scheme doesn't implement multi-language text at all, whereas I pointed out, from the beginning, that it could be trivially done also.

> Code pages simply are no longer practical nor acceptable for a global community. D is never going to convert to a code page system, and even if it did, there's no way D will ever convince the world to abandon Unicode, and so D would be as useless as EBCDIC.
I'm afraid you and others here seem to mentally translate "single-byte encodings" to "code pages" in your head, then recoil in horror as you remember all your problems with broken implementations of code pages, even though those problems are not intrinsic to single-byte encodings.

I'm not asking you to consider this for D.  I just wanted to discuss why UTF-8 is used at all.  I had hoped for some technical evaluations of its merits, but I seem to simply be dredging up a bunch of repressed memories about code pages instead. ;)

The world may not "abandon Unicode," but it will abandon UTF-8, because it's a dumb idea.  Unfortunately, such dumb ideas (XML, anyone?) often proliferate until someone comes up with something better to show how dumb they are.  Perhaps it won't be the D programming language that does that, but it would be easy to implement my idea in D, so maybe it will be a D-based library someday. :)

> I'm afraid your quest is quixotic.
I'd argue the opposite, considering most programmers still can't wrap their head around UTF-8.  If someone can just get a single-byte encoding implemented and in front of them, I suspect it will be UTF-8 that will be considered quixotic. :D
May 25, 2013
On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev wrote:
> Another thing I noticed: sometimes when you think you really need to operate on individual characters (and that your code will not be correct unless you do that), the assumption will be incorrect due to the existence of combining characters in Unicode. Two of the often-quoted use cases of working on individual code points is calculating the string width (assuming a fixed-width font), and slicing the string - both of these will break with combining characters if those are not accounted for. I believe the proper way to approach such tasks is to implement the respective Unicode algorithms for it, which I believe are non-trivial and for which the relative impact for the overhead of working with a variable-width encoding is acceptable.
Combining characters are examples of complexity baked into the various languages, so there's no way around that.  I'm arguing against layering more complexity on top, through UTF-8.

> Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8?
Let's take one you listed above, slicing a string.  You have to either translate your entire string into UTF-32 so it's constant-width, which is apparently what Phobos does, or decode every single UTF-8 character along the way, every single time.  A constant-width, single-byte encoding would be much easier to slice, while still using at most half the space.
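
To make that concrete, here is a rough sketch of the difference (my own illustration, not Phobos code; utf8Offset is a made-up helper):

import std.stdio;

// A sketch, not Phobos code: find the byte offset of the n-th code point in a
// UTF-8 string by skipping continuation bytes (those matching 10xxxxxx).
size_t utf8Offset(const(char)[] s, size_t n)
{
    size_t i = 0;
    while (i < s.length)
    {
        if ((s[i] & 0xC0) != 0x80)   // a non-continuation byte starts a code point
        {
            if (n == 0)
                return i;
            --n;
        }
        ++i;
    }
    return i;
}

void main()
{
    auto s = "aüb";                      // 'ü' takes two bytes in UTF-8
    writeln(s[utf8Offset(s, 2) .. $]);   // "b": a linear scan was needed to find it
    // With a fixed-width, one-byte-per-character encoding the same slice is just s[2 .. $].
}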

> Also, I don't think this has been posted in this thread. Not sure if it answers your points, though:
>
> http://www.utf8everywhere.org/
That seems to be a call to use UTF-8 on Windows, with a lot of info on how best to do so but little justification for why you'd want to in the first place. For example,

"Q: But what about performance of text processing algorithms, byte alignment, etc?

A: Is it really better with UTF-16? Maybe so."

Not exactly a considered analysis of the two. ;)

> And here's a simple and correct UTF-8 decoder:
>
> http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
You cannot honestly look at those multiple state diagrams and tell me it's "simple."  That said, the difficulty of _using_ UTF-8 is a much bigger problem than implementing a decoder in a library.
May 25, 2013
This is dumb. You are dumb. Go away.
May 25, 2013
On Saturday, 25 May 2013 at 09:40:36 UTC, Joakim wrote:
>> Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8?
> Let's take one you listed above, slicing a string.  You have to either translate your entire string into UTF-32 so it's constant-width, which is apparently what Phobos does, or decode every single UTF-8 character along the way, every single time.  A constant-width, single-byte encoding would be much easier to slice, while still using at most half the space.

You don't need to do that to slice a string. I think you mean to say that you need to decode each character if you want to slice the string at the N-th code point? But this is exactly what I'm trying to point out: how would you find this N? How would you know if it makes sense, taking into account combining characters, and all the other complexities of Unicode?
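
Here is a small illustration of how a code-point boundary can fall inside what the user sees as a single character (just a sketch; walkLength simply counts the elements of a range):

import std.stdio, std.range;

void main()
{
    string s = "e\u0301";      // 'e' followed by U+0301, the combining acute accent
    writeln(s.length);         // 3: UTF-8 bytes
    writeln(s.walkLength);     // 2: code points
    // Slicing "after the first code point" splits the user-perceived "é":
    writeln(s[0 .. 1]);        // "e"; the accent is gone
}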

If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, regardless of the variable-width encoding.

> You cannot honestly look at those multiple state diagrams and tell me it's "simple."

I meant that it's simple to implement (and adapt/port to other languages). I would say that UTF-8 is quite cleverly designed, so I wouldn't say it's simple by itself.
May 25, 2013
On Saturday, 25 May 2013 at 10:33:12 UTC, Vladimir Panteleev wrote:
> You don't need to do that to slice a string. I think you mean to say that you need to decode each character if you want to slice the string at the N-th code point? But this is exactly what I'm trying to point out: how would you find this N? How would you know if it makes sense, taking into account combining characters, and all the other complexities of Unicode?
Slicing a string implies finding the N-th code point; what other way would you slice it and have the result make any sense?  Finding the N-th code point is much simpler with a constant-width encoding.

I was leaving aside combining characters and the other intrinsic language complexities baked into Unicode in my previous analysis, but if you want to bring those in, that's actually an argument in favor of my encoding.  With my encoding, you know up front if you're using languages that have such complexity (just check the header), whereas with a chunk of random UTF-8 text, you cannot ever know that unless you decode the entire string once and extract knowledge of all the languages that are embedded.

For another similar example, let's say you want to run toUpper on a multi-language string, which contains English in the first half and some Asian script that doesn't define uppercase in the second half.  With my format, toUpper can check the header, then process the English half and skip the Asian half (I'm assuming that the substring indices for each language would be stored in this more complex header).  With UTF-8, you have to process the entire string, because you never know what random languages might be packed in there.
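
To make the idea concrete, here is a purely hypothetical sketch of such a header and of how toUpper could skip caseless runs (none of this exists in any library or spec; the names and layout are invented for illustration):

enum Script : ubyte { latin, hiragana }   // one id per single-byte table; illustrative only

struct LangRun
{
    Script script;    // which single-byte table this run uses
    size_t start;     // byte offset of the run within the payload
    size_t length;    // byte length of the run
}

struct TaggedString
{
    LangRun[] header;    // one entry per same-script run
    ubyte[]   payload;   // one byte per character
}

// toUpper only needs to touch runs whose script defines case at all.
void toUpperInPlace(ref TaggedString s)
{
    foreach (run; s.header)
    {
        if (run.script != Script.latin)
            continue;                               // e.g. hiragana: no case, skip the whole run
        foreach (ref b; s.payload[run.start .. run.start + run.length])
            if (b >= 'a' && b <= 'z')
                b = cast(ubyte)(b - ('a' - 'A'));   // ASCII-only casing, enough for the sketch
    }
}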

UTF-8 is riddled with such performance bottlenecks, all to make it self-synchronizing.  But is anybody really using its less compact encoding to do some "self-synchronized" integrity checking?  I suspect almost nobody is.

> If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, regardless of the variable-width encoding.
Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.

>> You cannot honestly look at those multiple state diagrams and tell me it's "simple."
>
> I meant that it's simple to implement (and adapt/port to other languages). I would say that UTF-8 is quite cleverly designed, so I wouldn't say it's simple by itself.
Perhaps decoding is not so bad for the type of people who write the fundamental UTF-8 libraries.  But implementation does not refer merely to the UTF-8 libraries; it also covers all the code that tries to build on them for internationalized apps.  And with all the unnecessary additional complexity added by UTF-8, wrapping the average programmer's head around this mess likely leads to as many problems as broken code page implementations did back in the day. ;)
May 25, 2013
On 05/25/2013 05:56 AM, H. S. Teoh wrote:
> On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote:
>> On 5/24/2013 7:16 PM, Manu wrote:
>>> So when we define operators for u × v and a · b, or maybe n²? ;)
>>
>> Oh, how I want to do that. But I still think the world hasn't
>> completely caught up with Unicode yet.
>
> That would be most awesome!
>
> Though it does raise the issue of how parsing would work, 'cos you
> either have to assign a fixed precedence to each of these operators (and
> there are a LOT of them in Unicode!),

I think this is what e.g. Fortress is doing.

> or allow user-defined operators
> with custom precedence and associativity,

This is what e.g. Haskell and Coq are doing.
(Though Coq has the advantage of not allowing forward references, and hence inline parser customization is straightforward in Coq.)

> which means nightmare for the
> parser (it has to adapt itself to new operators as the code is
> parsed/analysed,

It would be easier on the parsing side, since the parser would not fully parse expressions. Semantic analysis would resolve precedences. This is quite simple, and the current way the parser resolves operator precedences is less efficient anyway.
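
Roughly, the parser could record each expression as a flat operand/operator sequence, and precedence could be applied afterwards by precedence climbing. A sketch (invented names; the hard-coded table stands in for whatever semantic analysis would compute once user-defined operators are known):

import std.stdio;

// Stand-in precedence table.
int prec(string op)
{
    switch (op)
    {
        case "+": case "-": return 1;
        case "*": case "/": return 2;
        default:  return 0;
    }
}

// Precedence climbing over a flat "operand op operand op ..." sequence,
// assuming left-associative operators and operands.length == ops.length + 1.
string resolve(string[] operands, string[] ops, ref size_t pos, int minPrec)
{
    string lhs = operands[pos];
    while (pos < ops.length && prec(ops[pos]) >= minPrec)
    {
        immutable op = ops[pos];
        ++pos;                                                // step to the operand after `op`
        auto rhs = resolve(operands, ops, pos, prec(op) + 1); // bind tighter operators first
        lhs = "(" ~ lhs ~ " " ~ op ~ " " ~ rhs ~ ")";
    }
    return lhs;
}

void main()
{
    size_t pos = 0;
    // "a + b * c - d" parsed flat, precedences applied afterwards:
    writeln(resolve(["a", "b", "c", "d"], ["+", "*", "-"], pos, 0));   // ((a + (b * c)) - d)
}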

> which then leads to issues with what happens if two
> different modules define the same operator with conflicting precedence /
> associativity).
>

This would probably be an error without explicit disambiguation, or follow the usual disambiguation rules. (Trying all possibilities appears to be exponential in the number of conflicting operators in an expression in the worst case, though.)
May 25, 2013
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
>> If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, regardless of the variable-width encoding.
> Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.

No. Are you sure you understand UTF-8 properly?
May 25, 2013
On 5/25/13 3:33 AM, Joakim wrote:
> On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
>> This is more a problem with the algorithms taking the easy way than a
>> problem with UTF-8. You can do all the string algorithms, including
>> regex, by working with the UTF-8 directly rather than converting to
>> UTF-32. Then the algorithms work at full speed.
> I call BS on this. There's no way working on a variable-width encoding
> can be as "full speed" as a constant-width encoding. Perhaps you mean
> that the slowdown is minimal, but I doubt that also.

You mentioned this a couple of times, and I wonder what makes you so sure. On contemporary architectures small is fast and large is slow; betting on replacing larger data with more computation is quite often a win.


Andrei
May 25, 2013
On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote:
> On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
>>> If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, regardless of the variable-width encoding.
>> Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.
>
> No. Are you sure you understand UTF-8 properly?
Are you sure _you_ understand it properly?  Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above.  Obviously the constant-width encoding will be faster.  Did I really need to explain this?

On Saturday, 25 May 2013 at 12:43:21 UTC, Andrei Alexandrescu wrote:
> On 5/25/13 3:33 AM, Joakim wrote:
>> On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
>>> This is more a problem with the algorithms taking the easy way than a
>>> problem with UTF-8. You can do all the string algorithms, including
>>> regex, by working with the UTF-8 directly rather than converting to
>>> UTF-32. Then the algorithms work at full speed.
>> I call BS on this. There's no way working on a variable-width encoding
>> can be as "full speed" as a constant-width encoding. Perhaps you mean
>> that the slowdown is minimal, but I doubt that also.
>
> You mentioned this a couple of times, and I wonder what makes you so sure. On contemporary architectures small is fast and large is slow; betting on replacing larger data with more computation is quite often a win.
When has small ever been slow and large fast? ;) I'm talking about replacing larger data _and_ more computation, i.e. UTF-8, with smaller data and less computation, i.e. single-byte encodings, so it is an unmitigated win in that regard. :)
May 25, 2013
On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote:
>> On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
>>>> If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, regardless of the variable-width encoding.
>>> Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.
>>
>> No. Are you sure you understand UTF-8 properly?
> Are you sure _you_ understand it properly?  Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above.  Obviously the constant-width encoding will be faster.  Did I really need to explain this?

I suggest you read up on UTF-8. You really don't understand it. There is no need to decode; you just treat the UTF-8 string as if it were an ASCII string.

This code will count all spaces in a string whether it is encoded as ASCII or UTF-8:

int countSpaces(const(char)* c)
{
    int n = 0;
    for (; *c; ++c)      // walk the string byte by byte; no decoding needed
        if (*c == ' ')   // the space byte (0x20) never occurs inside a multi-byte sequence
            ++n;
    return n;
}

I repeat: there is no need to decode. Please read up on UTF-8. You do not understand it. The reason you don't need to decode is that UTF-8 is self-synchronising.

The code above tests for spaces only, but it works the same when searching for any substring or single character. It is no slower than a fixed-width encoding for these operations.
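
For instance, here is a rough sketch (my own illustration, not library code) of a substring search that just compares raw bytes and still only finds real matches:

import std.stdio;

// Byte-level substring search; works on UTF-8 without any decoding.
ptrdiff_t findBytes(const(char)[] haystack, const(char)[] needle)
{
    if (needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;   // byte offset of the match
    return -1;
}

void main()
{
    string text = "Grüße aus Zürich";
    writeln(findBytes(text, "ü"));   // 2: the two-byte sequence 0xC3 0xBC
    writeln(findBytes(text, "z"));   // -1: an ASCII byte never occurs inside a multi-byte sequence
}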

Again, I urge you, please read up on UTF-8. It is very well designed.