May 31, 2016
Am Tue, 31 May 2016 16:29:33 +0000
schrieb Joakim <dlang@joakim.fea.st>:

> Part of it is the complexity of written language, part of it is bad technical decisions.  Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity.  I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine. Fast-forward years later and exactly the issues I raised are now causing pain.

Maybe you can dig up your old post and we can look at each of your complaints in detail.

> UTF-8 is an antiquated hack that needs to be eradicated.  It forces all languages other than English to be twice as long, for no good reason; have fun with that when you're downloading text on a 2G connection in the developing world.  It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time until UTF-8 is ditched.

You don't download twice the data. First of all, some
languages had two-byte encodings before UTF-8, and second,
web content is full of HTML syntax and is gzip-compressed
afterwards.
Take this Thai Wikipedia entry for example:
https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
The download of the gzipped html is 11% larger in UTF-8 than
in Thai TIS-620 single-byte encoding. And that is dwarfed by
the size of JS + images. (I don't have the numbers, but I
expect the effective overhead to be ~2%).
Ironically a lot of symbols we take for granted would then
have to be implemented as HTML entities using their Unicode
code points(sic!). Amongst them basic stuff like dashes, degree
(°) and minute (′), accents in names, non-breaking space or
footnotes (↑).

> D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable.  I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte.  Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_

That would have put D on an island. "Some kind of header" would be a horrible mess to have in strings, because you have to account for it when concatenating strings and scan for them all the time to see if there is some interspersed 2 byte encoding in the stream. That's hardly better than UTF-8. And yes, a huge amount of websites mix scripts and a lot of other text uses the available extra symbols like ° or α,β,γ.

> The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language!  This is madness.

No thx; madness was when we couldn't reliably open text files because the encoding was stored nowhere, and when you had to compile programs for each of a dozen codepages so that localized text would be rendered correctly. And your retro codepage system won't convince the world to drop Unicode either.

> Yes, the complexity of diacritics and combining characters will remain, but that is complexity that is inherent to the variety of written language.  UTF-8 is not: it is just a bad technical decision, likely chosen for ASCII compatibility and some misguided notion that being able to combine arbitrary language strings with no other metadata was worthwhile.  It is not.

The web proves you wrong. Scripts do get mixed often. Be it Wikipedia, a foreign language learning site or mathematical symbols.

-- 
Marco

May 31, 2016
On Tuesday, 31 May 2016 at 18:34:54 UTC, Jonathan M Davis wrote:
> On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d wrote:
>> UTF-8 is an antiquated hack that needs to be eradicated.  It forces all languages other than English to be twice as long, for no good reason; have fun with that when you're downloading text on a 2G connection in the developing world.  It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time until UTF-8 is ditched.
>
> Considering that *nix land uses UTF-8 almost exclusively, and many C libraries do even on Windows, I very much doubt that UTF-8 is going anywhere anytime soon - if ever. The Win32 API does use UTF-16, and Java and C# do, but the vast sea of code that is C or C++ generally uses UTF-8, as do plenty of other programming languages.

I agree that both UTF encodings are somewhat popular now.

> And even aside from English, most European languages are going to be more efficient with UTF-8, because they're still primarily ASCII even if they contain characters that are not. Stuff like Chinese is definitely worse in UTF-8 than it would be in UTF-16, but there are a lot of languages other than English which are going to encode better with UTF-8 than UTF-16 - let alone UTF-32.

And there are a lot more languages that will be twice as long as English, i.e. ASCII.
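
For concreteness, here's a quick sketch (Phobos only, toy samples) that prints the UTF-8 and UTF-16 byte counts of a few short strings; Latin-script text comes out smaller in UTF-8, Thai and CJK smaller in UTF-16, and Cyrillic about the same in both:

    import std.conv : to;
    import std.stdio : writefln;

    void main()
    {
        // string holds UTF-8 code units (1 byte each), wstring holds UTF-16
        // code units (2 bytes each), so .length gives the storage size.
        foreach (sample; ["hello", "grüße", "привет", "สวัสดี", "你好"])
        {
            writefln("%s: UTF-8 = %s bytes, UTF-16 = %s bytes",
                     sample, sample.length, sample.to!wstring.length * 2);
        }
    }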

> Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too much uses it for it to be going anywhere, and most folks have no problem with that. Any attempt to get rid of it would be a huge, uphill battle.

I disagree; it is inevitable.  Any tech so complex and inefficient cannot last long.

> But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even without involving the standard library - so anyone who wants to avoid UTF-8 is free to do so.

Yes, but not by using UTF-16/32, which use too much memory.  I've suggested a single-byte encoding for most languages instead, both in my last post and the earlier thread.

D could use this new encoding internally, while keeping its current UTF-8/16 strings around for any outside UTF-8/16 data passed in.  Any of that data run through algorithms that don't require decoding could be kept in UTF-8, but the moment any decoding is required, D would translate UTF-8 to the new encoding, which would be much easier for programmers to understand and manipulate. If UTF-8 output is needed, you'd have to encode back again.

Yes, this translation layer would be a bit of a pain, but the new encoding would be so much more efficient and understandable that it would be worth it, and you're already decoding and encoding back to UTF-8 for those algorithms now.  All that's changing is that you'd use a new and different encoding, rather than dchar, as the default.  If it succeeds for D, it could then be sold more widely as a replacement for UTF-8/16.

I think this would be the right path forward, not navigating this UTF-8/16 mess further.
May 31, 2016
On 05/31/2016 06:29 PM, Joakim wrote:
> D devs should lead the way in getting rid of the UTF-8 encoding, not
> bickering about how to make it more palatable.  I suggested a
> single-byte encoding for most languages, with double-byte for the ones
> which wouldn't fit in a byte.  Use some kind of header or other metadata
> to combine strings of different languages, _rather than encoding the
> language into every character!_

Guys, may I ask you to move this discussion to a new thread? I'd like to follow the (already crowded) autodecode thing, and this is really a separate topic.
May 31, 2016
On Tue, May 31, 2016 at 10:47:56PM +0300, Dmitry Olshansky via Digitalmars-d wrote:
> On 31-May-2016 01:00, Walter Bright wrote:
> > On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
> > > I don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units).
> > 
> > Yup. It isn't hard at all to use arrays of codeunits correctly.
> 
> Ehm, as long as all you care about is operating on substrings, I'd say. Working with individual characters requires either decoding or clever tricks like operating on the encoded UTF directly.
[...]

Working on individual characters needs byGrapheme, unless you know beforehand that the character(s) you're working with are ASCII or fit in a single code unit.
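
As a small illustration (a self-contained sketch using only Phobos), the three levels give different counts even for a single accented letter:

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "e\u0301";                 // 'e' + combining acute accent
        assert(s.length == 3);                // UTF-8 code units
        assert(s.walkLength == 2);            // code points (auto-decoded)
        assert(s.byGrapheme.walkLength == 1); // graphemes (user-perceived)
    }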

About "clever tricks", it's not really that hard.  I was thinking that things like s.canFind('Ш') should translate the 'Ш' into a UTF-8 byte sequence, and then do a substring search directly on the encoded string. This way, a large number of single-character algorithms don't even need to decode.  The way UTF-8 is designed guarantees that there will not be any false positives.  This will eliminate a lot of the current overhead of autodecoding.


T

-- 
Klein bottle for rent ... inquire within. -- Stephen Mulraney
May 31, 2016
On 31.05.2016 21:51, Steven Schveighoffer wrote:
> On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
>> On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via
>> Digitalmars-d wrote:
>> [...]
>>> Does walkLength yield the same number for all representations?
>>
>> Let's put the question this way. Given the following string, what do
>> *you* think walkLength should return?
>
> Compiler error.
>
> -Steve

What about e.g. joiner?
May 31, 2016
On Tue, May 31, 2016 at 10:38:03PM +0200, Timon Gehr via Digitalmars-d wrote:
> On 31.05.2016 21:51, Steven Schveighoffer wrote:
> > On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
> > > On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via
> > > Digitalmars-d wrote:
> > > [...]
> > > > Does walkLength yield the same number for all representations?
> > > 
> > > Let's put the question this way. Given the following string, what do *you* think walkLength should return?
> > 
> > Compiler error.
> > 
> > -Steve
> 
> What about e.g. joiner?

joiner is one of those algorithms that can work perfectly fine *without* autodecoding anything at all. The only time it'd actually need to decode would be if you're joining a set of UTF-8 strings with a UTF-16 delimiter, or some other such combination, which should be pretty rare. After all, within the same application you'd usually only be dealing with a single encoding rather than mixing UTF-8, UTF-16, and UTF-32 willy-nilly.

(Unless the code is specifically written for transcoding, in which case decoding is part of the job description, and the programmer should be expected to know how to do it properly without needing Phobos to do it for him.)

Even in the case of s.joiner('Ш'), joiner could easily convert that dchar into a short UTF-8 string and then operate directly on UTF-8.
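
Something along those lines, as a sketch (the helper name joinWithChar is made up; with std.utf.byCodeUnit on the inputs the whole pipeline could even stay at the code-unit level):

    import std.algorithm.iteration : joiner;
    import std.utf : encode;

    // Turn the dchar separator into its UTF-8 form once, up front,
    // so the separator itself never needs to be decoded again.
    auto joinWithChar(R)(R strings, dchar sep)
    {
        char[4] buf;
        immutable len = encode(buf, sep);
        return strings.joiner(buf[0 .. len].idup);
    }

    unittest
    {
        import std.algorithm.comparison : equal;
        assert(["foo", "bar"].joinWithChar('Ш').equal("fooШbar"));
    }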


T

-- 
Just because you survived after you did it, doesn't mean it wasn't stupid!
May 31, 2016
On Tuesday, 31 May 2016 at 20:20:46 UTC, Marco Leise wrote:
> Am Tue, 31 May 2016 16:29:33 +0000
> schrieb Joakim <dlang@joakim.fea.st>:
>
>> Part of it is the complexity of written language, part of it is bad technical decisions.  Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity.  I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine. Fast-forward years later and exactly the issues I raised are now causing pain.
>
> Maybe you can dig up your old post and we can look at each of your complaints in detail.

Not interested.  I believe you were part of that thread then.  Google it if you want to read it again.

>> UTF-8 is an antiquated hack that needs to be eradicated.  It forces all languages other than English to be twice as long, for no good reason; have fun with that when you're downloading text on a 2G connection in the developing world.  It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time until UTF-8 is ditched.
>
> You don't download twice the data. First of all, some
> languages had two-byte encodings before UTF-8, and second,
> web content is full of HTML syntax and is gzip-compressed
> afterwards.

The vast majority of languages can be encoded in a single byte, yet they are unnecessarily forced to two bytes by the inefficient UTF-8/16 encodings.  HTML syntax is a non sequitur; compression helps, but it isn't as efficient as a proper encoding.

> Take this Thai Wikipedia entry for example:
> https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
> The download of the gzipped html is 11% larger in UTF-8 than
> in Thai TIS-620 single-byte encoding. And that is dwarfed by
> the size of JS + images. (I don't have the numbers, but I
> expect the effective overhead to be ~2%).

Nobody on a 2G connection is waiting minutes to download such massive web pages.  They are mostly sending text to each other on their favorite chat app, waiting longer and using up more of their mobile data quota if they're forced to use bad encodings.

> Ironically a lot of symbols we take for granted would then
> have to be implemented as HTML entities using their Unicode
> code points(sic!). Amongst them basic stuff like dashes, degree
> (°) and minute (′), accents in names, non-breaking space or
> footnotes (↑).

No, they just don't use HTML, opting for much superior mobile apps instead. :)

>> D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable.  I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte.  Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_
>
> That would have put D on an island. "Some kind of header" would be a horrible mess to have in strings, because you have to account for it when concatenating strings and scan for them all the time to see if there is some interspersed 2 byte encoding in the stream. That's hardly better than UTF-8. And yes, a huge amount of websites mix scripts and a lot of other text uses the available extra symbols like ° or α,β,γ.

Let's see: a constant-time addition to a header or constantly decoding every character every time I want to manipulate the string... I wonder which is a better choice?!  You would not "intersperse" any other encodings, unless you kept track of those substrings in the header.  My whole point is that such mixing of languages or "extra symbols" is an extreme minority use case: the vast majority of strings are a single language.

>> The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language!  This is madness.
>
> No thx; madness was when we couldn't reliably open text files because the encoding was stored nowhere, and when you had to compile programs for each of a dozen codepages so that localized text would be rendered correctly. And your retro codepage system won't convince the world to drop Unicode either.

Unicode _is_ a retro codepage system; they merely standardized a bunch of the most popular codepages.  So that's not going away no matter what system you use. :)

>> Yes, the complexity of diacritics and combining characters will remain, but that is complexity that is inherent to the variety of written language.  UTF-8 is not: it is just a bad technical decision, likely chosen for ASCII compatibility and some misguided notion that being able to combine arbitrary language strings with no other metadata was worthwhile.  It is not.
>
> The web proves you wrong. Scripts do get mixed often. Be it Wikipedia, a foreign language learning site or mathematical symbols.

Those are some of the least-trafficked parts of the web, which itself is dying off as the developing world comes online through mobile apps, not the bloated web stack.

Anyway, I'm not interested in rehashing this dumb argument again. The UTF-8/16 encodings are a horrible mess, and D made a big mistake by baking them in.
May 31, 2016
On Tuesday, 31 May 2016 at 20:28:32 UTC, ag0aep6g wrote:
> On 05/31/2016 06:29 PM, Joakim wrote:
>> D devs should lead the way in getting rid of the UTF-8 encoding, not
>> bickering about how to make it more palatable.  I suggested a
>> single-byte encoding for most languages, with double-byte for the ones
>> which wouldn't fit in a byte.  Use some kind of header or other metadata
>> to combine strings of different languages, _rather than encoding the
>> language into every character!_
>
> Guys, may I ask you to move this discussion to a new thread? I'd like to follow the (already crowded) autodecode thing, and this is really a separate topic.

No, this is the root of the problem, but I'm not interested in debating it, so you can go back to discussing how to avoid the elephant in the room.
May 31, 2016
On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
> Let's put the question this way. Given the following string, what do
> *you*  think walkLength should return?
>
> 	şŭt̥ḛ́k̠

The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei
May 31, 2016
On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
> On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
>> Let's put the question this way. Given the following string, what do
>> *you*  think walkLength should return?
>>
>>     şŭt̥ḛ́k̠
>
> The number of code units in the string. That's the contract promised and
> honored by Phobos. -- Andrei

Code points I mean. -- Andrei
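
For reference, a quick sketch (Phobos only) that prints all three counts for the string in question:

    import std.range : walkLength;
    import std.stdio : writefln;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "şŭt̥ḛ́k̠";
        writefln("code units: %s, code points: %s, graphemes: %s",
                 s.length,                 // UTF-8 code units
                 s.walkLength,             // code points (the Phobos contract)
                 s.byGrapheme.walkLength); // user-perceived characters
    }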