May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Manu

On 5/24/2013 7:16 PM, Manu wrote:
> So when we define operators for u × v and a · b, or maybe n²? ;)
Oh, how I want to do that. But I still think the world hasn't completely caught up with Unicode yet.
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Walter Bright

On Fri, May 24, 2013 at 08:45:56PM -0700, Walter Bright wrote:
> On 5/24/2013 7:16 PM, Manu wrote:
> > So when we define operators for u × v and a · b, or maybe n²? ;)
>
> Oh, how I want to do that. But I still think the world hasn't completely caught up with Unicode yet.

That would be most awesome! Though it does raise the issue of how parsing would work, 'cos you either have to assign a fixed precedence to each of these operators (and there are a LOT of them in Unicode!), or allow user-defined operators with custom precedence and associativity, which means a nightmare for the parser (it has to adapt itself to new operators as the code is parsed/analysed, which then leads to issues with what happens if two different modules define the same operator with conflicting precedence/associativity).

T

--
Spaghetti code may be tangly, but lasagna code is just cheesy.
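The parsing problem described above can be made concrete with a small sketch (all names invented for illustration): a precedence-climbing parser driven by a per-operator precedence table. If each module could install its own precedences, the same expression would parse into different trees depending on which table is in effect.

```python
# operator -> (precedence, is_right_associative)
OPS = {'+': (1, False), '×': (2, False), '·': (2, False)}

def parse(tokens, min_prec=0):
    """Parse a token list into a nested (op, lhs, rhs) tuple tree."""
    node = tokens.pop(0)                      # a single-character operand
    while tokens and tokens[0] in OPS:
        prec, right_assoc = OPS[tokens[0]]
        if prec < min_prec:
            break
        op = tokens.pop(0)
        # a left-associative operator must bind tighter on the recursion
        rhs = parse(tokens, prec if right_assoc else prec + 1)
        node = (op, node, rhs)
    return node

print(parse(list('a×b+c')))   # ('+', ('×', 'a', 'b'), 'c')

# A second module redefining '×' with a lower precedence changes the
# meaning of the same source text - the conflict described above:
OPS['×'] = (0, False)
print(parse(list('a×b+c')))   # ('×', 'a', ('+', 'b', 'c'))
```

The table is consulted during parsing, which is why the parser must "adapt itself" as new operators appear.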
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to H. S. Teoh

On 25-May-2013 02:42, H. S. Teoh wrote:
> On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
>> On 24-May-2013 21:05, Joakim wrote:
> [...]
> As far as Phobos is concerned, Dmitry's new std.uni module has powerful code-generation templates that let you write code that operates directly on UTF-8 without needing to convert to UTF-32 first.

As is, there are no UTF-8-specific tables (yet), but there are tools to create the required abstraction by hand. I plan to grow one for std.regex that will thus be field-tested and then get into the public interface. In fact, the needs of std.regex prompted me to provide more Unicode stuff in the std.

> Well, OK, maybe we're not quite there yet, but the foundations are in place, and I'm looking forward to the day when string functions will no longer have implicit conversion to UTF-32, but will directly manipulate UTF-8 using optimized state tables generated by std.uni.

Yup, but let's get the correctness part first, then performance ;)

>> Want small - use compression schemes which are perfectly fine and get to the precious 1 byte per codepoint with exceptional speed.
>> http://www.unicode.org/reports/tr6/
>
> +1. Using your own encoding is perfectly fine. Just don't do that for data interchange. Unicode was created because we *want* a single standard to communicate with each other without stupid broken encoding issues that used to be rampant on the web before Unicode came along.

BTW the document linked discusses _standard_ compression, so that anybody can decode that stuff. How you compress would largely affect the compression ratio but not much beyond it.

> In the bad ole days, HTML could be served in any random number of encodings, often out-of-sync with what the server claims the encoding is, and browsers would assume arbitrary default encodings that for the most part *appeared* to work but are actually fundamentally b0rken. Sometimes webpages would show up mostly-intact, but with a few characters mangled, because of deviations / variations on codepage interpretation, or non-standard characters being used in a particular encoding. It was a total, utter mess, that wasted who knows how many man-hours of programming time to work around. For data interchange on the internet, we NEED a universal standard that everyone can agree on.

+1 on these and others :)

--
Dmitry Olshansky
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Dmitry Olshansky

On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
> You seem to think that not only UTF-8 is bad encoding but also one unified encoding (code-space) is bad(?).

Yes on the encoding, if it's a variable-length encoding like UTF-8; no on the code space. I was originally going to title my post "Why Unicode?" but I have no real problem with UCS, which merely standardized a bunch of pre-existing code pages. Perhaps there are a lot of problems with UCS also; I just haven't delved into it enough to know. My problem is with these dumb variable-length encodings, so I was precise in the title.

> Separate code spaces were the case before Unicode (and utf-8). The problem is not only that without a header the text is meaningless (no easy slicing) but the fact that encoding of data after the header strongly depends on a variety of factors - a list of encodings actually. Now everybody has to keep a (code) page per language to at least know if it's 2 bytes per char or 1 byte per char or whatever. And you still work on a basis that there are no combining marks and regional specific stuff :)

Everybody is still keeping code pages; UTF-8 hasn't changed that. Does UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per char or whatever?" It has to do that also. Everyone keeps talking about "easy slicing" as though UTF-8 provides it, but it doesn't. Phobos turns UTF-8 into UTF-32 internally for all that ease of use, at least doubling your string size in the process. Correct me if I'm wrong; that was what I read on the newsgroup some time back.

> In fact it was even "better": nobody ever talked about the header, they just assumed a codepage with some global setting. Imagine yourself creating a font rendering system these days - a hell of an exercise in frustration (okay, how do I render 0x88? hmm, if that is in codepage XYZ then ...).

I understand that people were frustrated with all the code pages out there before UCS standardized them, but that is a completely different argument than my problem with UTF-8 and variable-length encodings. My proposed simple, header-based, constant-width encoding could be implemented with UCS, and there go all your arguments about random code pages.

> This just shows you don't care for multilingual stuff at all. Imagine any language tutor/translator/dictionary on the Web. For instance most languages need to intersperse ASCII (also keep in mind e.g. HTML markup). Books often feature citations in a native language (or e.g. Latin) along with translations.

This is a small segment of use and it would be handled fine by an alternate encoding.

> Now also take into account math symbols, currency symbols and beyond. Also these days cultures are mixing in wild combinations so you might need to see the text even if you can't read it. Unicode is not only "encode characters from all languages". It needs to address universal representation of symbolics used in writing systems at large.

I take your point that it isn't just languages, but symbols also. I see no reason why UTF-8 is a better encoding for that purpose than the kind of simple encoding I've suggested.

> We want monoculture! That is to understand each other without all these "par-le-vu-france?" and codepages of various complexity (insanity).

I hate monoculture, but then I haven't had to decipher some screwed-up codepage in the middle of the night. ;) That said, you could standardize on UCS for your code space without using a bad encoding like UTF-8, as I said above.

> Want small - use compression schemes which are perfectly fine and get to the precious 1 byte per codepoint with exceptional speed.
> http://www.unicode.org/reports/tr6/

Correct me if I'm wrong, but it seems like that compression scheme simply adds a header and then uses a single-byte encoding, exactly what I suggested! :) But I get the impression that it's only for sending over the wire, i.e. transmission, so all the processing issues that UTF-8 introduces would still be there.

> And borrowing the arguments from that rant: locale is borked shit when it comes to encodings. Locales should be used for tweaking visuals like numbers, date display and so on.

Is that worse than every API simply assuming UTF-8, as he says? Broken locale support in the past, as you and others complain about, doesn't invalidate the concept. If they're screwing up something so simple, imagine how much worse everyone is screwing up something complex like UTF-8?
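The "easy slicing" dispute above comes down to indexing cost, which a minimal sketch can show (the function name is invented): finding the byte offset of the Nth code point in UTF-8 requires a linear scan for lead bytes, whereas a constant-width encoding computes the offset with a single multiplication.

```python
def utf8_index(data: bytes, n: int) -> int:
    """Return the byte offset of the nth code point - an O(n) scan."""
    count = 0
    for offset, b in enumerate(data):
        if b & 0xC0 != 0x80:      # lead byte (not 10xxxxxx): new code point
            if count == n:
                return offset
            count += 1
    raise IndexError(n)

data = "naïve café".encode('utf-8')
print(utf8_index(data, 3))        # 4: the 2-byte 'ï' shifted everything
# In fixed-width UTF-32 the answer would simply be 3 * 4, with no scan.
```

Whether that scan matters in practice is exactly what the rest of the thread argues about.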
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to H. S. Teoh

On Friday, 24 May 2013 at 22:44:24 UTC, H. S. Teoh wrote:
> I remember those bad ole days of gratuitously-incompatible encodings. I wish those days will never ever return again. You'd get a text file in some unknown encoding, and the only way to make any sense of it was to guess what encoding it might be and hope you get lucky. Not only so, the same language often has multiple encodings, so adding support for a single new language required supporting several new encodings and being able to tell them apart (often with no info on which they are, if you're lucky, or if you're unlucky, with *wrong* encoding type specs -- for example, I *still* get email from outdated systems that claim to be iso-8859 when it's actually KOI8R).

This is an argument for UCS, not UTF-8.

> Prepending the encoding to the data doesn't help, because it's pretty much guaranteed somebody will cut-n-paste some segment of that data and save it without the encoding type header (or worse, some program will try to "fix" broken low-level code by prepending a default encoding type to everything, regardless of whether it's actually in that encoding or not), thus ensuring nobody will be able to reliably recognize what encoding it is down the road.

This problem already exists for UTF-8, breaking ASCII compatibility in the process: http://en.wikipedia.org/wiki/Byte_order_mark Well, at the very least it adds garbage ASCII data in the front, just as my header would do. ;)

> For all of its warts, Unicode fixed a WHOLE bunch of these problems, and made cross-linguistic data sane to handle without pulling out your hair, many times over. And now we're trying to go back to that nightmarish old world again? No way, José!

No, I'm suggesting going back to one element of that "old world," single-byte encodings, but using UCS or some other standardized character set to avoid all those incompatible code pages you had to deal with.

> If you're really concerned about encoding size, just use a compression library -- they're readily available these days. Internally, the program can just use UTF-16 for the most part -- UTF-32 is really only necessary if you're routinely delving outside BMP, which is very rare.

True, but you're still doubling your string size with UTF-16 and non-ASCII text. My concerns are the following, in order of importance:

1. Lost programmer productivity due to these dumb variable-length encodings. That is the biggest loss from UTF-8's complexity.
2. Lost speed and memory due to using either an unnecessarily complex variable-length encoding or because you translated everything to 32-bit UTF-32 to get back to constant width.
3. Lost bandwidth from using a fatter encoding.

> As far as Phobos is concerned, Dmitry's new std.uni module has powerful code-generation templates that let you write code that operates directly on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe we're not quite there yet, but the foundations are in place, and I'm looking forward to the day when string functions will no longer have implicit conversion to UTF-32, but will directly manipulate UTF-8 using optimized state tables generated by std.uni.

There is no way this can ever be as performant as a constant-width single-byte encoding.

> +1. Using your own encoding is perfectly fine. Just don't do that for data interchange. Unicode was created because we *want* a single standard to communicate with each other without stupid broken encoding issues that used to be rampant on the web before Unicode came along.
>
> In the bad ole days, HTML could be served in any random number of encodings, often out-of-sync with what the server claims the encoding is, and browsers would assume arbitrary default encodings that for the most part *appeared* to work but are actually fundamentally b0rken. Sometimes webpages would show up mostly-intact, but with a few characters mangled, because of deviations / variations on codepage interpretation, or non-standard characters being used in a particular encoding. It was a total, utter mess, that wasted who knows how many man-hours of programming time to work around. For data interchange on the internet, we NEED a universal standard that everyone can agree on.

I disagree. This is not an indictment of multiple encodings; it is one of multiple unspecified or _broken_ encodings. Given how difficult UTF-8 is to get right, all you've likely done is replace multiple broken encodings with a single encoding with multiple broken implementations.

> UTF-8, for all its flaws, is remarkably resilient to mangling -- you can cut-n-paste any byte sequence and the receiving end can still make some sense of it. Not like the bad old days of codepages where you just get one gigantic block of gibberish. A properly-synchronizing UTF-8 function can still recover legible data, maybe with only a few characters at the ends truncated in the worst case. I don't see how any codepage-based encoding is an improvement over this.

Have you ever used this self-synchronizing feature of UTF-8? Have you ever heard of anyone using it? There is no reason why this kind of limited checking of data integrity should be rolled into the encoding. Maybe this made sense two decades ago when everyone had plans to stream text or something, but nobody does that nowadays. Just put a checksum in your header and you're good to go.

Unicode is still a "codepage-based encoding"; nothing has changed in that regard. All UCS did is standardize a bunch of pre-existing code pages, so that some of the redundancy was taken out. Unfortunately, the UTF-8 encoding then bloated the transmission format and tempted devs to use this unnecessarily complex format for processing too.
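The self-synchronization being debated can be demonstrated with a hedged sketch (not production code, names invented): because UTF-8 continuation bytes always match the bit pattern 10xxxxxx and are disjoint from lead bytes, a decoder can skip past a corrupted byte and resume at the next lead byte, losing only the damaged character.

```python
def resync_decode(data: bytes) -> str:
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:               # ASCII
            n = 1
        elif b >> 5 == 0b110:      # 2-byte sequence
            n = 2
        elif b >> 4 == 0b1110:     # 3-byte sequence
            n = 3
        elif b >> 3 == 0b11110:    # 4-byte sequence
            n = 4
        else:                      # stray continuation or invalid byte: skip
            i += 1
            continue
        try:
            out.append(data[i:i+n].decode('utf-8'))
            i += n
        except UnicodeDecodeError: # bad sequence: resynchronize on next byte
            i += 1
    return ''.join(out)

data = bytearray("привет mañana".encode('utf-8'))
data[3] = 0xFF                     # corrupt one byte mid-stream
print(resync_decode(bytes(data)))  # only the damaged 'р' is lost
```

In a hypothetical codepage with a single corrupted header byte, by contrast, the entire payload would be interpreted in the wrong codepage.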
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Walter Bright

On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
> One of the first, and best, decisions I made for D was it would be Unicode front to back.

That is why I asked this question here. I think D is still one of the few programming languages with such Unicode support.

> This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed.

I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also.

> That was the go-to solution in the 1980's, they were called "code pages". A disaster.

My understanding is that code pages were a "disaster" because they weren't standardized and often badly implemented. If you used UCS with a single-byte encoding, you wouldn't have that problem.

>> with the few exceptional languages with more than 256 characters encoded in two bytes.
>
> Like those rare languages Japanese, Korean, Chinese, etc. This too was done in the 80's with "Shift-JIS" for Japanese, and some other wacky scheme for Korean, and a third nutburger one for Chinese.

Of course, you have to have more than one byte for those languages, because they have more than 256 characters. So there will be no compression gain over UTF-8/16 there, but a big gain in parsing complexity with a simpler encoding, particularly when dealing with multi-language strings.

> I've had the misfortune of supporting all that in the old Zortech C++ compiler. It's AWFUL. If you think it's simpler, all I can say is you've never tried to write internationalized code with it.

Heh, I'm not saying "let's go back to badly defined code pages"; I'm saying "let's go back to single-byte encodings." The two are separate arguments.

> UTF-8 is heavenly in comparison. Your code is automatically internationalized. It's awesome.

At what cost? Most programmers completely punt on Unicode, because they just don't want to deal with the complexity. Perhaps you can deal with it and don't mind the performance loss, but I suspect you're in the minority.
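The size claims traded back and forth above are easy to measure directly. A quick sketch comparing the three standard encodings (BOM-less little-endian variants):

```python
samples = {
    "English": "The quick brown fox",
    "Russian": "Съешь же ещё этих мягких булок",
    "Chinese": "我能吞下玻璃而不伤身体",
}
for name, text in samples.items():
    sizes = {enc: len(text.encode(enc))
             for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(name, len(text), "code points:", sizes)

# English text is 1 byte per character in UTF-8; Cyrillic letters take 2
# (double a hypothetical single-byte code page); Chinese takes 3 in UTF-8
# but only 2 in UTF-16 - so which encoding is "fatter" depends on the text.
```

This is the trade-off both sides are describing: UTF-16 roughly doubles ASCII-heavy text, while a single-byte scheme cannot cover CJK at all without multi-byte extensions.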
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

I think you are a little confused about what Unicode actually is... Unicode has nothing to do with code pages, and nobody uses code pages any more except for compatibility with legacy applications (with good reason!). Unicode is:

1) A standardised numbering of a large number of characters
2) A set of standardised algorithms for operating on these characters
3) A set of standardised encodings for efficiently encoding sequences of these characters

You said that Phobos converts UTF-8 strings to UTF-32 before operating on them, but that's not true. As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient, so I don't really see the problem.

Also, your complaint that UTF-8 reserves the short characters for the English alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or carry more information (such as Chinese characters, where the same sentence takes only about 1/3 the number of characters).
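What "iterating over dchars" amounts to can be sketched outside D as well (a simplified decoder with no validation, names invented): code points are produced lazily from the UTF-8 bytes, so no UTF-32 copy of the whole string is ever materialized, though each step still pays the per-character decode cost.

```python
def code_points(data: bytes):
    """Lazily yield (codepoint, byte_length) pairs from UTF-8 data."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, n = b, 1
        elif b >> 5 == 0b110:
            cp, n = b & 0x1F, 2
        elif b >> 4 == 0b1110:
            cp, n = b & 0x0F, 3
        else:
            cp, n = b & 0x07, 4
        for c in data[i+1:i+n]:          # fold in the continuation bytes
            cp = (cp << 6) | (c & 0x3F)
        yield cp, n
        i += n

for cp, n in code_points("héllo".encode('utf-8')):
    print(hex(cp), n)    # 'é' (U+00E9) decodes from a 2-byte sequence
```

This mirrors the streaming decode D performs when a foreach loop ranges over dchar: one code point at a time, directly from the UTF-8 buffer.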
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Diggory

On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
> I think you are a little confused about what unicode actually is... Unicode has nothing to do with code pages and nobody uses code pages any more except for compatibility with legacy applications (with good reason!).

Incorrect. "Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode."
http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode

> Unicode is:
> 1) A standardised numbering of a large number of characters
> 2) A set of standardised algorithms for operating on these characters
> 3) A set of standardised encodings for efficiently encoding sequences of these characters

What makes you think I'm unaware of this? I have repeatedly differentiated between UCS (1) and UTF-8 (3).

> You said that phobos converts UTF-8 strings to UTF-32 before operating on them but that's not true. As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient so I don't really see the problem.

And what's a dchar? Let's check:

dchar : unsigned 32 bit UTF-32
http://dlang.org/type.html

Of course that's inefficient: you are translating your whole encoding over to a 32-bit encoding every time you need to process it. Walter as much as said so up above.

> Also your complaint that UTF-8 reserves the short characters for the english alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or carry more information (such as chinese characters where the same sentence takes only about 1/3 the number of characters).

The vast majority of non-English alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On 5/25/2013 12:33 AM, Joakim wrote:
> At what cost? Most programmers completely punt on unicode, because they just
> don't want to deal with the complexity. Perhaps you can deal with it and don't
> mind the performance loss, but I suspect you're in the minority.
I think you stand alone in your desire to return to code pages. I have years of experience with code pages and the unfixable misery they produce. This has disappeared with Unicode. I find your arguments unpersuasive when stacked against my experience. And yes, I have made a living writing high performance code that deals with characters, and you are quite off base with claims that UTF-8 has inevitable bad performance - though there is inefficient code in Phobos for it, to be sure.
My grandfather wrote a book that consists of mixed German, French, and Latin words, using special characters unique to those languages. Another failing of code pages is it fails miserably at any such mixed language text. Unicode handles it with aplomb.
I can't even write an email to Rainer Schütze in English under your scheme.
Code pages simply are no longer practical nor acceptable for a global community. D is never going to convert to a code page system, and even if it did, there's no way D will ever convince the world to abandon Unicode, and so D would be as useless as EBCDIC.
I'm afraid your quest is quixotic.
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
>> This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed.
>
> I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also.

For the record, I noticed that programmers (myself included) with an incomplete understanding of Unicode / UTF exaggerate this point, and sometimes needlessly assume that their code needs to operate on individual characters (code points) when in fact it does not - and that code will work just fine as if it was written to handle ASCII. The example Walter quoted (regex - assuming you don't want Unicode ranges or case-insensitivity) is one such case.

Another thing I noticed: sometimes when you think you really need to operate on individual characters (and that your code will not be correct unless you do that), the assumption will be incorrect due to the existence of combining characters in Unicode. Two of the often-quoted use cases of working on individual code points are calculating the string width (assuming a fixed-width font) and slicing the string - both of these will break with combining characters if those are not accounted for. I believe the proper way to approach such tasks is to implement the respective Unicode algorithms for them, which I believe are non-trivial and for which the relative impact of the overhead of working with a variable-width encoding is acceptable.

Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8?

Also, I don't think this has been posted in this thread. Not sure if it answers your points, though:
http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
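The combining-character pitfall described above is easy to demonstrate (a minimal sketch using Python's unicodedata module): two canonically equivalent strings can have different code-point counts, so code-point slicing and length are wrong as "character" operations even in a constant-width encoding like UTF-32.

```python
import unicodedata

composed = "café"                                    # 'é' as one code point, U+00E9
decomposed = unicodedata.normalize("NFD", composed)  # 'e' + combining acute U+0301

print(len(composed), len(decomposed))  # 4 vs 5 code points, same rendered text
print(decomposed[:4])                  # slicing by code point drops the accent
# Equality by code point also fails, though the strings are canonically
# equivalent; normalization is needed to compare them:
print(composed == decomposed)                                   # False
print(unicodedata.normalize("NFC", decomposed) == composed)     # True
```

This is why a fixed-width encoding alone does not buy correct "character" indexing: the Unicode segmentation and normalization algorithms are needed either way.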
Copyright © 1999-2021 by the D Language Foundation