May 24, 2013 Why UTF-8/16 character encodings?
On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
> toUpper/lower cannot be made in place if it should handle all Unicode. Some characters will change their length when converted to/from uppercase. Examples of these are the German double S and some Turkish I.

This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all? Does anybody really care about UTF-8 being "self-synchronizing," i.e. does anybody actually use that in this day and age? Sure, it's backwards-compatible with ASCII and the vast majority of usage is probably just ASCII, but that means the other languages don't matter anyway. Not to mention taking the valuable 8-bit real estate for English and dumping the longer encodings on everyone else.

I'd just use a single-byte header to signify the language and then put the vast majority of languages in a single-byte encoding, with the few exceptional languages with more than 256 characters encoded in two bytes. OK, that doesn't cover multi-language strings, but that is what, .000001% of usage? Make your header a little longer and you could handle those also. Yes, it wouldn't be strictly backwards-compatible with ASCII, but it would be so much easier to internationalize.

Of course, there's also the monoculture we're creating; I love this UTF-8 rant by tuomov, author of one of the first tiling window managers for Linux: http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?
May 24, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
> This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all?
Simple: backwards compatibility with all ASCII APIs (e.g. most C libraries), and because I don't want my strings to consume multiple bytes per character when I don't need it.
Your language header idea is no good for at least three reasons:
1. What happens if I want to take a substring slice of your string? I'll need to allocate a new string to add the header in.
2. What if I have a long string with the ASCII header and want to append a non-ASCII character on the end? I'll need to reallocate the whole string and widen it with the new header.
3. Even if I have a string that is 99% ASCII, I have to pay extra bytes for every character just because 1% wasn't ASCII. With UTF-8, I only pay the extra bytes when needed.
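A minimal D sketch of the properties being appealed to here: ASCII compatibility, paying for extra bytes only where they occur, and header-free slicing (the strings are just illustrative):

```d
import std.stdio;

void main()
{
    // ASCII text is already valid UTF-8: one byte per character,
    // readable by any ASCII-only C API.
    string ascii = "hello";
    writeln(ascii.length);        // 5 bytes for 5 characters

    // Non-ASCII characters cost extra bytes only where they occur.
    string mixed = "héllo";       // 'é' (U+00E9) encodes as 2 bytes
    writeln(mixed.length);        // 6 bytes for 5 characters

    // A slice of a UTF-8 string is itself a complete UTF-8 string;
    // no header has to be copied or re-attached.
    string slice = ascii[1 .. 4]; // "ell"
    writeln(slice);
}
```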
May 24, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On Friday, 24 May 2013 at 17:05:57 UTC, Joakim wrote:
> On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
>> toUpper/lower cannot be made in place if it should handle all Unicode. Some characters will change their length when converted to/from uppercase. Examples of these are the German double S and some Turkish I.
>
> This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all? Does anybody really care about UTF-8 being "self-synchronizing," ie does anybody actually use that in this day and age? Sure, it's backwards-compatible with ASCII and the vast majority of usage is probably just ASCII, but that means the other languages don't matter anyway. Not to mention taking the valuable 8-bit real estate for English and dumping the longer encodings on everyone else.
The German ß becomes SS when capitalised. It's not an encoding issue.
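A small D illustration of the point: Unicode's case-mapping rules turn the single code point U+00DF 'ß' into the two-letter sequence "SS", so the character count grows and an in-place conversion can't hold the result (the mapping is written out by hand here rather than relying on any particular library toUpper):

```d
import std.stdio;
import std.range : walkLength;

void main()
{
    string lower = "ß";   // U+00DF: one code point, two UTF-8 bytes
    string upper = "SS";  // its Unicode uppercase mapping: two code points

    // The code-point count changes (1 -> 2), which is why an in-place
    // toUpper over a preallocated buffer cannot work in the general case.
    writeln(lower.walkLength, " -> ", upper.walkLength); // prints "1 -> 2"
}
```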
May 24, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Peter Alexander

On Friday, 24 May 2013 at 17:43:03 UTC, Peter Alexander wrote:
> Simple: backwards compatibility with all ASCII APIs (e.g. most C libraries), and because I don't want my strings to consume multiple bytes per character when I don't need it.

And yet here we are today, where an early decision made solely to accommodate the authors of then-dominant all-ASCII APIs has now foisted an unnecessarily complex encoding on all of us, with reduced performance as the result. You do realize that my encoding would encode almost all languages' characters in single bytes, unlike UTF-8, right? Your latter argument is one against UTF-8.

> Your language header idea is no good for at least three reasons:
>
> 1. What happens if I want to take a substring slice of your string? I'll need to allocate a new string to add the header in.

Good point. The solution that comes to mind right now is that you'd parse my format and store it in memory as a String class, storing the chars in an internal array with the header stripped out and the language stored in a property. That way, even a slice could be made to refer to the same language, by referring to the language of the containing array.

Strictly speaking, this solution could also be implemented with UTF-8, simply by changing the format for the data structure you use in memory to the one I've outlined, as opposed to using the UTF-8 encoding for both transmission and processing. But if you're going to use my format for processing, you might as well use it for transmission also, since it is much smaller for non-ASCII text.

Before you ridicule my solution as somehow unworkable, let me remind you of the current monstrosity. Currently, the language is stored in every single UTF-8 character, by having the length vary from one to four bytes depending on the language. This leads to Phobos converting every UTF-8 string to UTF-32, so that it can easily run its algorithms on a constant-width 32-bit character set, and the resulting performance penalties. Perhaps the biggest loss is that programmers everywhere are pushed to wrap their heads around this mess, predictably leading to either ignorance or broken code. Which seems more unworkable to you?

> 2. What if I have a long string with the ASCII header and want to append a non-ASCII character on the end? I'll need to reallocate the whole string and widen it with the new header.

How often does this happen in practice? I suspect that this almost never happens. But if it does, it would be solved by the String class I outlined above, as the header isn't stored in the array anymore.

> 3. Even if I have a string that is 99% ASCII, I have to pay extra bytes for every character just because 1% wasn't ASCII. With UTF-8, I only pay the extra bytes when needed.

I don't understand what you mean here. If your string has a thousand non-ASCII characters, the UTF-8 version will have one or two thousand more characters, i.e. 1 or 2 KB more. My format would add a couple bytes in the header for each non-ASCII language character used, that's it. It's a clear win for my format.

In any case, I just came up with the simplest format I could off the top of my head; maybe there are gaping holes in it. But my point is that we should be able to come up with a much simpler format, one which keeps most characters to a single byte, not that my format is best. All I want to argue is that UTF-8 is the worst. ;)
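For illustration only, here is a hypothetical D sketch of the kind of String type described above; the name TaggedString, its fields, and the language-tag values are invented for this example and are not part of any proposal or library:

```d
// Hypothetical sketch only: a string type that stores the language tag
// once, out of band, instead of encoding it into every character.
struct TaggedString
{
    ubyte language;            // assumed single-byte language/code-page id
    immutable(ubyte)[] data;   // one byte per character in that language

    // A slice shares the underlying array, so it inherits the language
    // of the containing string without copying any header.
    TaggedString opSlice(size_t lo, size_t hi)
    {
        return TaggedString(language, data[lo .. hi]);
    }
}

void main()
{
    auto s = TaggedString(0x01 /* made-up tag for some language */,
                          cast(immutable(ubyte)[]) "hello");
    auto t = s[1 .. 4];    // still knows its language, no reallocation
    assert(t.language == s.language && t.data.length == 3);
}
```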
May 24, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On Friday, 24 May 2013 at 20:37:58 UTC, Joakim wrote:
>> 3. Even if I have a string that is 99% ASCII, I have to pay extra bytes for every character just because 1% wasn't ASCII. With UTF-8, I only pay the extra bytes when needed.

> I don't understand what you mean here. If your string has a thousand non-ASCII characters, the UTF-8 version will have one or two thousand more characters, i.e. 1 or 2 KB more. My format would add a couple bytes in the header for each non-ASCII language character used, that's it. It's a clear win for my format.
Sorry, I was a bit imprecise. Here's what I meant to write:
I don't understand what you mean here. If your string has a thousand non-ASCII characters, the UTF-8 version will have one or two thousand more bytes, i.e. 1 or 2 KB more. My format would add a couple bytes in the header for each non-ASCII language used, that's it. It's a clear win for my format.
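To put rough numbers on the byte-count claim, a small D check (the sample strings are arbitrary; in UTF-8 each Cyrillic letter takes 2 bytes and each common CJK character 3):

```d
import std.stdio;

void main()
{
    // A thousand such characters therefore add roughly 1,000-2,000 bytes
    // over a one-byte-per-character encoding.
    string russian = "привет";  // 6 Cyrillic letters
    string chinese = "你好";     // 2 CJK characters
    writeln(russian.length);    // 12 bytes
    writeln(chinese.length);    // 6 bytes
}
```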
May 24, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On 24-May-2013 21:05, Joakim wrote:
> On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
>> toUpper/lower cannot be made in place if it should handle all Unicode. Some characters will change their length when converted to/from uppercase. Examples of these are the German double S and some Turkish I.
>
> This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all? Does anybody really care about UTF-8 being "self-synchronizing," i.e. does anybody actually use that in this day and age? Sure, it's backwards-compatible with ASCII and the vast majority of usage is probably just ASCII, but that means the other languages don't matter anyway. Not to mention taking the valuable 8-bit real estate for English and dumping the longer encodings on everyone else.
>
> I'd just use a single-byte header to signify the language and then put the vast majority of languages in a single-byte encoding, with the few exceptional languages with more than 256 characters encoded in two bytes.

You seem to think that not only is UTF-8 a bad encoding, but that one unified encoding (code space) is bad as well(?).

Separate code spaces were the case before Unicode (and UTF-8). The problem is not only that without the header the text is meaningless (no easy slicing), but that the encoding of the data after the header depends on a variety of factors - a list of encodings, actually. Now everybody has to keep a (code) page per language just to know whether it's 2 bytes per char, 1 byte per char, or whatever. And you're still working on the assumption that there are no combining marks or region-specific stuff :)

In fact it was even "better": nobody ever talked about a header, they just assumed a codepage with some global setting. Imagine yourself creating a font rendering system these days - a hell of an exercise in frustration (okay, how do I render 0x88? mm, if that is in codepage XYZ then ...).

> OK, that doesn't cover multi-language strings, but that is what, .000001% of usage?

This just shows you don't care for multilingual stuff at all. Imagine any language tutor/translator/dictionary on the Web. For instance, most languages need to intersperse ASCII (also keep in mind e.g. HTML markup). Books often feature citations in the native language (or e.g. Latin) along with translations. Now also take into account math symbols, currency symbols and beyond. Also, these days cultures are mixing in wild combinations, so you might need to see the text even if you can't read it.

Unicode is not only "encode characters from all languages". It needs to address universal representation of the symbolics used in writing systems at large.

> Make your header a little longer and you could handle those also. Yes, it wouldn't be strictly backwards-compatible with ASCII, but it would be so much easier to internationalize. Of course, there's also the monoculture we're creating; love this UTF-8 rant by tuomov, author of one of the first tiling window managers for Linux:

We want monoculture! That is, to understand each other without all these "par-le-vu-france?" and codepages of various complexity (insanity).

Want small? Use compression schemes, which are perfectly fine and get you to the precious 1 byte per codepoint with exceptional speed: http://www.unicode.org/reports/tr6/

> http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06
>
> The emperor has no clothes, what am I missing?

And borrowing the arguments from that rant: locale is borked shit when it comes to encodings. Locales should be used for tweaking visuals like numbers, date display and so on.

-- Dmitry Olshansky
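Dmitry's aside about combining marks is easy to demonstrate in D; a minimal sketch using std.uni's normalize (the API shown here is taken from later Phobos documentation, so treat the exact names as an assumption):

```d
import std.stdio;
import std.range : walkLength;
import std.uni;

void main()
{
    string precomposed = "é";       // U+00E9, one code point
    string combining   = "e\u0301"; // 'e' + U+0301 combining acute, two code points

    writeln(precomposed.walkLength); // 1
    writeln(combining.walkLength);   // 2

    // Both render identically and normalize to the same forms,
    // which byte- or code-point-level comparisons alone would miss.
    assert(normalize!NFC(combining) == precomposed);
    assert(normalize!NFD(precomposed) == combining);
}
```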
May 24, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Dmitry Olshansky

On Sat, May 25, 2013 at 01:21:25AM +0400, Dmitry Olshansky wrote:
> 24-May-2013 21:05, Joakim wrote:
[...]
>> This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all? Does anybody really care about UTF-8 being "self-synchronizing," i.e. does anybody actually use that in this day and age? Sure, it's backwards-compatible with ASCII and the vast majority of usage is probably just ASCII, but that means the other languages don't matter anyway. Not to mention taking the valuable 8-bit real estate for English and dumping the longer encodings on everyone else.
>>
>> I'd just use a single-byte header to signify the language and then put the vast majority of languages in a single-byte encoding, with the few exceptional languages with more than 256 characters encoded in two bytes.
>
> You seem to think that not only is UTF-8 a bad encoding, but that one unified encoding (code space) is bad as well(?).
>
> Separate code spaces were the case before Unicode (and UTF-8). The problem is not only that without the header the text is meaningless (no easy slicing), but that the encoding of the data after the header depends on a variety of factors - a list of encodings, actually. Now everybody has to keep a (code) page per language just to know whether it's 2 bytes per char, 1 byte per char, or whatever. And you're still working on the assumption that there are no combining marks or region-specific stuff :)

I remember those bad ole days of gratuitously-incompatible encodings. I hope those days never, ever return. You'd get a text file in some unknown encoding, and the only way to make any sense of it was to guess what encoding it might be and hope you got lucky. Not only that, the same language often had multiple encodings, so adding support for a single new language required supporting several new encodings and being able to tell them apart (often with no info on which they are, if you're lucky, or if you're unlucky, with *wrong* encoding type specs -- for example, I *still* get email from outdated systems that claim to be iso-8859 when it's actually KOI8-R).

Prepending the encoding to the data doesn't help, because it's pretty much guaranteed somebody will cut-n-paste some segment of that data and save it without the encoding type header (or worse, some program will try to "fix" broken low-level code by prepending a default encoding type to everything, regardless of whether it's actually in that encoding or not), thus ensuring nobody will be able to reliably recognize what encoding it is down the road.

> In fact it was even "better": nobody ever talked about a header, they just assumed a codepage with some global setting. Imagine yourself creating a font rendering system these days - a hell of an exercise in frustration (okay, how do I render 0x88? mm, if that is in codepage XYZ then ...).

Not to mention, if the sysadmin changes the default locale settings, you may suddenly discover that a bunch of your text files have become gibberish, because some programs blindly assume that every text file is in the current locale-specified language.

I tried writing language-agnostic text-processing programs in C/C++ before the widespread adoption of Unicode. It was a living nightmare. The Posix spec *seems* to promise language-independence with its locale functions, but actually, the whole thing is one big inconsistent and under-specified mess with many unspecified, implementation-specific behaviours that you can't rely on. The APIs basically assume that you set your locale's language once and never change it, and that every single file you'll ever want to read is encoded in that particular encoding. If you try to read another encoding, too bad, you're screwed. There isn't even a standard for locale names that you could use to manually switch to inside your program (yes, there are de facto conventions, but there *are* systems out there that don't follow them). And many standard library functions are affected by locale settings (once you call setlocale, *anything* could change: string comparison, output encoding, etc.), making it a hairy mess to get input/output of multiple encodings to work correctly.

Basically, you have to write everything manually, because the standard library can't handle more than a single encoding correctly (well, not without extreme amounts of pain, that is). So you're back to manipulating bytes directly. Which means you have to keep large tables of every single encoding you ever wish to support. And encoding-specific code to deal with exceptions for those evil variant encodings that are supposedly the same as the official standard of that encoding, but actually have one or two subtle differences that cause your program to output embarrassing garbage characters every now and then.

For all of its warts, Unicode fixed a WHOLE bunch of these problems, and made cross-linguistic data sane to handle without pulling out your hair, many times over. And now we're trying to go back to that nightmarish old world again? No way, José!

[...]
>> Make your header a little longer and you could handle those also. Yes, it wouldn't be strictly backwards-compatible with ASCII, but it would be so much easier to internationalize. Of course, there's also the monoculture we're creating; love this UTF-8 rant by tuomov, author of one of the first tiling window managers for Linux:
>
> We want monoculture! That is, to understand each other without all these "par-le-vu-france?" and codepages of various complexity (insanity).

Yeah, those codepages were an utter nightmare to deal with. Everybody and his neighbour's dog invented their own codepage, sometimes multiple codepages for a single language, all of which are gratuitously incompatible with each other. Every codepage has its own peculiarities and exceptions, and programs have to know how to deal with all of them. Only to get broken again as soon as somebody invents yet another codepage two years later, or creates yet another codepage variant just for the heck of it.

If you're really concerned about encoding size, just use a compression library -- they're readily available these days. Internally, the program can just use UTF-16 for the most part -- UTF-32 is really only necessary if you're routinely delving outside the BMP, which is very rare. As far as Phobos is concerned, Dmitry's new std.uni module has powerful code-generation templates that let you write code that operates directly on UTF-8 without needing to convert to UTF-32 first. Well, OK, maybe we're not quite there yet, but the foundations are in place, and I'm looking forward to the day when string functions will no longer have implicit conversion to UTF-32, but will directly manipulate UTF-8 using optimized state tables generated by std.uni.

> Want small? Use compression schemes, which are perfectly fine and get you to the precious 1 byte per codepoint with exceptional speed: http://www.unicode.org/reports/tr6/

+1. Using your own encoding is perfectly fine. Just don't do that for data interchange. Unicode was created because we *want* a single standard to communicate with each other, without the stupid broken encoding issues that used to be rampant on the web before Unicode came along. In the bad ole days, HTML could be served in any number of encodings, often out of sync with what the server claimed the encoding was, and browsers would assume arbitrary default encodings that for the most part *appeared* to work but were actually fundamentally b0rken. Sometimes webpages would show up mostly intact, but with a few characters mangled, because of deviations / variations in codepage interpretation, or non-standard characters being used in a particular encoding. It was a total, utter mess that wasted who knows how many man-hours of programming time to work around. For data interchange on the internet, we NEED a universal standard that everyone can agree on.

>> http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06
>>
>> The emperor has no clothes, what am I missing?
>
> And borrowing the arguments from that rant: locale is borked shit when it comes to encodings. Locales should be used for tweaking visuals like numbers, date display and so on.
[...]

I found that rant rather incoherent. I didn't find any convincing arguments as to why we should return to the bad old scheme of codepages and gratuitous complexity, just a lot of grievances about why monoculture is "bad" without much supporting evidence.

UTF-8, for all its flaws, is remarkably resilient to mangling -- you can cut-n-paste any byte sequence and the receiving end can still make some sense of it. Not like the bad old days of codepages, where you just get one gigantic block of gibberish. A properly-synchronizing UTF-8 function can still recover legible data, maybe with only a few characters at the ends truncated in the worst case. I don't see how any codepage-based encoding is an improvement over this.

T

--
There are 10 kinds of people in the world: those who can count in binary, and those who can't.
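The resilience described above comes from UTF-8's self-synchronizing design: continuation bytes always match the bit pattern 10xxxxxx, so a decoder can find the next code-point boundary from any offset. A minimal D sketch (the corrupted-slice scenario is just an illustration):

```d
import std.stdio;

// Skip forward from an arbitrary (possibly mid-character) offset to the
// start of the next code point: any byte that is NOT 0b10xxxxxx begins one.
size_t nextBoundary(const(ubyte)[] data, size_t i)
{
    while (i < data.length && (data[i] & 0xC0) == 0x80)
        ++i;
    return i;
}

void main()
{
    import std.string : representation;

    auto bytes = "héllo, wörld".representation;
    // Pretend we were handed a slice starting in the middle of 'é' (index 2
    // lands on its continuation byte); we can still resynchronize.
    auto start = nextBoundary(bytes, 2);
    writeln(cast(string) bytes[start .. $]); // prints "llo, wörld"
}
```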
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On 5/24/2013 1:37 PM, Joakim wrote:
> This leads to Phobos converting every UTF-8 string to UTF-32, so that
> it can easily run its algorithms on a constant-width 32-bit character set, and
> the resulting performance penalties.
This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed.
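One concrete reason this works: bytes below 0x80 encode only ASCII characters and never occur inside a multi-byte UTF-8 sequence, so byte-level algorithms keyed on ASCII delimiters are already correct without decoding. A small D sketch (illustrative only):

```d
import std.stdio;
import std.string : representation;

void main()
{
    // Split on an ASCII delimiter by scanning raw UTF-8 bytes: no decoding
    // to UTF-32 is needed, because ',' (0x2C) can never appear inside a
    // multi-byte sequence (continuation bytes are always >= 0x80).
    auto s = "naïve, 日本語, straße";
    size_t start = 0;
    foreach (i, b; s.representation)
    {
        if (b == ',')
        {
            writeln(s[start .. i]);
            start = i + 2; // skip ", " (assumes a single space follows)
        }
    }
    writeln(s[start .. $]);
}
```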
> Yes, it wouldn't be strictly backwards-compatible with ASCII, but it would be so much easier to internationalize.
That was the go-to solution in the 1980s; they were called "code pages". A disaster.
> with the few exceptional languages with more than 256 characters encoded in two bytes.
Like those rare languages Japanese, Korean, Chinese, etc. This too was done in the 80's with "Shift-JIS" for Japanese, and some other wacky scheme for Korean, and a third nutburger one for Chinese.
I've had the misfortune of supporting all that in the old Zortech C++ compiler. It's AWFUL. If you think it's simpler, all I can say is you've never tried to write internationalized code with it.
UTF-8 is heavenly in comparison. Your code is automatically internationalized. It's awesome.
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to H. S. Teoh

On 5/24/2013 3:42 PM, H. S. Teoh wrote:
> I tried writing language-agnostic text-processing programs in C/C++
> before the widespread adoption of Unicode.
One of the first, and best, decisions I made for D was it would be Unicode front to back.
At the time, Unicode was poorly supported by operating systems and lots of software, and I encountered some initial resistance to it. But I believed Unicode was the inevitable future.
Code pages, Shift-JIS, EBCDIC, etc., should all be terminated with prejudice.
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Walter Bright

On 25 May 2013 11:58, Walter Bright <newshound2@digitalmars.com> wrote:
> On 5/24/2013 3:42 PM, H. S. Teoh wrote:
>> I tried writing language-agnostic text-processing programs in C/C++ before the widespread adoption of Unicode.
>
> One of the first, and best, decisions I made for D was it would be Unicode front to back.

Indeed, excellent decision! So when do we define operators for u × v and a · b, or maybe n²? ;)

> At the time, Unicode was poorly supported by operating systems and lots of software, and I encountered some initial resistance to it. But I believed Unicode was the inevitable future.
>
> Code pages, Shift-JIS, EBCDIC, etc., should all be terminated with prejudice.