June 01, 2016
Am Tue, 31 May 2016 15:47:02 -0700
schrieb Walter Bright <newshound2@digitalmars.com>:

> But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful pretty much only as a transitional encoding to talk with Windows APIs.

I think so too, although more APIs than just Windows use UTF-16. Think of Java or ICU. Aside from ICU's Java heritage, its developers found UTF-16 to be the fastest encoding for transcoding to and from Unicode, since a single UTF-16 code unit covers the characters of most 8-bit codepages. Qt also defined its character type as a UTF-16 code unit, but they probably regret it, as the 'charmap' program KCharSelect is now unable to show Unicode characters >= 0x10000.

-- 
Marco

May 31, 2016
On 5/31/2016 1:57 AM, Chris wrote:
> 1. Given your experience with Warp, how hard would it be to clean Phobos up?

It's not hard, it's just a bit tedious.

> 2. After recoding a number of Phobos functions, how much code actually broke
> (yours or someone else's)?

It's been a while so I don't remember exactly, but as I recall, if the API had to change, I created a new overload or a new name and left the old one as it was. For the std.path functions, I just changed them. While that technically changed the API, I'm not aware of any actual problems it caused.

(Decoding file strings is a latent bug anyway, as pointed out elsewhere in this thread. It's a change that had to be made sooner or later.)

May 31, 2016
On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
> On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
>> On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
>>> Let's put the question this way. Given the following string, what do
>>> *you*  think walkLength should return?
>>>
>>>     şŭt̥ḛ́k̠
>>
>> The number of code units in the string. That's the contract promised and
>> honored by Phobos. -- Andrei
>
> Code points I mean. -- Andrei

Yes, we know it's the contract. ***That's the problem.*** As everybody is saying, it *SHOULDN'T* be the contract.

Why shouldn't it be the contract? Because it's proven itself, both logically (as presented by pretty much everybody other than you in both this and other threads) and empirically (in phobos, warp, and other user code) to be both the least useful and most PITA option.
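
(For illustration, a minimal D sketch of the three answers one could reasonably give for that string, assuming current Phobos behavior; only the grapheme count matches the user-perceived characters:)

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "şŭt̥ḛ́k̠";

        writeln(s.byCodeUnit.walkLength);   // UTF-8 code units (raw bytes)
        writeln(s.walkLength);              // code points -- what auto-decoding counts
        writeln(s.byGrapheme.walkLength);   // graphemes -- the user-perceived characters
    }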

May 31, 2016
On 05/31/2016 01:23 PM, Andrei Alexandrescu wrote:
> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
>> The standard library has to fight against itself because of autodecoding!
>> The vast majority of the algorithms in Phobos are special-cased on
>> strings
>> in an attempt to get around autodecoding. That alone should highlight the
>> fact that autodecoding is problematic.
>
> The way I see it is it's specialization to speed things up without
> giving up the higher level abstraction. -- Andrei

Problem is, the "higher"[1] level abstraction you don't want to give up (i.e. working on code points) is rarely useful, and yet the default makes everyone pay its price anyway.

[1] It's really the mid-level abstraction - grapheme is the high-level one (and more likely useful).
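
(A small sketch of why the grapheme level is the one that usually matches intent: the same user-perceived character can be spelled as one code point or as several, so code-point equality is not character equality. This assumes std.uni.normalize with its default NFC form:)

    import std.stdio : writeln;
    import std.uni : normalize;

    void main()
    {
        string precomposed = "\u00E9";    // 'é' as a single code point
        string decomposed  = "e\u0301";   // 'e' followed by a combining acute accent

        // A code-point (or code-unit) comparison says these differ...
        writeln(precomposed == decomposed);                       // false
        // ...but after normalization they are the same character.
        writeln(normalize(precomposed) == normalize(decomposed)); // true
    }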

May 31, 2016
On 5/31/16 4:38 PM, Timon Gehr wrote:
> On 31.05.2016 21:51, Steven Schveighoffer wrote:
>> On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
>>> On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via
>>> Digitalmars-d wrote:
>>> [...]
>>>> Does walkLength yield the same number for all representations?
>>>
>>> Let's put the question this way. Given the following string, what do
>>> *you* think walkLength should return?
>>
>> Compiler error.
>
> What about e.g. joiner?

Compiler error. Better than what it does now.

-Steve
May 31, 2016
On Tuesday, May 31, 2016 20:38:14 Nick Sabalausky via Digitalmars-d wrote:
> On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
> > On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
> >> On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
> >>> Let's put the question this way. Given the following string, what do *you*  think walkLength should return?
> >>>
> >>>     şŭt̥ḛ́k̠
> >>
> >> The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei
> >
> > Code points I mean. -- Andrei
>
> Yes, we know it's the contract. ***That's the problem.*** As everybody is saying, it *SHOULDN'T* be the contract.
>
> Why shouldn't it be the contract? Because it's proven itself, both logically (as presented by pretty much everybody other than you in both this and other threads) and empirically (in phobos, warp, and other user code) to be both the least useful and most PITA option.

Exactly. Operating at the code point level rarely makes sense. What sorts of algorithms purposefully do that in a typical program? Unless you're doing very specific Unicode stuff or somehow know that your strings don't contain any graphemes that are made up of multiple code points, operating at the code point level is just bug-prone, and unless you're using dchar[] everywhere, it's slow to boot, because your strings have to be decoded whether the algorithm needs it or not.

I think that it's very safe to say that the vast majority of string algorithms are either able to operate at the code unit level without decoding (though possibly encoding another string to match - e.g. with a string comparison or search), or they have to operate at the grapheme level in order to deal with full characters. A code point is borderline useless on its own. It's just a step above the different UTF encodings without actually getting to proper characters.
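
(A rough sketch of the code-unit case, assuming std.utf.byCodeUnit to suppress auto-decoding: a substring search never needs to decode, because UTF-8 is self-synchronizing and raw code-unit matches cannot be false positives:)

    import std.algorithm.searching : canFind;
    import std.stdio : writeln;
    import std.utf : byCodeUnit;

    void main()
    {
        string haystack = "Привет, мир";   // multi-byte UTF-8 text
        string needle   = "мир";

        // Matching raw code units, nothing is ever decoded to dchar.
        writeln(haystack.byCodeUnit.canFind(needle.byCodeUnit));   // true
    }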

- Jonathan M Davis


May 31, 2016
On Tuesday, May 31, 2016 23:36:20 Marco Leise via Digitalmars-d wrote:
> Am Tue, 31 May 2016 16:56:43 -0400
>
> schrieb Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org>:
> > On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
> > > In the vast majority of cases what folks care about is full character
> >
> > How are you so sure? -- Andrei
>
> Because a full character is the typical unit of a written language. It's what we visualize in our heads when we think about finding a substring or counting characters. A special case of this is the reduction to ASCII where we can use code units in place of grapheme clusters.

Exactly. How many folks here have written code where the correct thing to do
is to search on code points? Under what circumstances is that even useful?
Code points are a mid-level abstraction between UTF-8/16 and graphemes that
is not particularly useful on its own. Yes, by using code points, we
eliminate the differences between the encodings, but how much code even
operates on multiple string types? Having all of your strings use the same
encoding fixes the consistency problem just as well as autodecoding to dchar
everywhere does - and without the efficiency hit. Typically, folks operate
on string or char[] unless they're talking to the Windows API, in which
case they need wchar[]. Our general recommendation is that D code operate
on UTF-8 except when it has to interact with something that uses a
different encoding (like the Win32 API). Ideally, such strings are converted
to UTF-8 once they enter the D code, processed as UTF-8, and converted back
to UTF-16 or whatever the target encoding is only when they need to be
output. Not much of anyone is recommending that you use dchar[] everywhere,
but that's essentially what the range API is trying to force.
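
(A hedged sketch of that boundary pattern; the wstring literal only stands in for data that would really come from a Win32 "W" call, and toUTF8/toUTF16z are one way to do the conversions:)

    import std.stdio : writeln;
    import std.utf : toUTF8, toUTF16z;

    void main()
    {
        // Pretend this UTF-16 data came back from a Win32 "W" API call.
        wstring fromWindows = "C:\\Users\\Пример\\file.txt"w;

        // Convert once at the boundary; keep all internal processing in UTF-8.
        string path = toUTF8(fromWindows);
        // ... the real string processing happens here, on UTF-8 ...

        // Convert back only when handing a string to the API again.
        const(wchar)* forApi = toUTF16z(path);
        writeln(path);
    }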

I think that it's very safe to say that the vast majority of string processing is looking to operate either on strings as a whole or on individual, full characters within a string. Code points are neither. While code may play tricks with Unicode to be efficient (e.g. operating at the code unit level where it can rather than decoding to either code points or graphemes), or it might make assumptions about its data being ASCII-only, aside from explicit Unicode processing code, I have _never_ seen code that was actually looking to logically operate on only pieces of characters. While it may operate on code units for efficiency, it's always looking to logically operate on the string as a unit or on whole characters.

Anyone looking to operate on code points is going to need to take into account the fact that they're not full characters, just like anyone who operates on code units needs to take into account the fact that they're not whole characters. Operating on code points as if they were characters - which is exactly what D currently does with ranges - is just plain wrong. We need to support operating at the code point level for those rare cases where it's actually useful, but autodecoding makes no sense. It incurs a performance penalty without actually giving correct results, except in those rare cases where you want code points instead of full characters. And only Unicode experts are ever going to want that. The average programmer who is not super Unicode-savvy doesn't even know what code points are. They're clearly going to be looking to operate on strings as sequences of characters, not sequences of code points. I don't see how anyone could expect otherwise. Code points are a mid-level Unicode abstraction that only those who are Unicode-savvy are going to know or care about, let alone want to operate on.
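
(A quick sketch of the failure mode: the auto-decoded element is a lone code point, only a piece of the character the user actually sees:)

    import std.range : front;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "e\u0301tude";   // "étude" spelled with a combining accent

        // The auto-decoded front is a bare 'e' -- only part of the character.
        writeln(s.front);                  // e
        // The grapheme view keeps the base letter and its combining mark together.
        auto first = s.byGrapheme.front;
        writeln(first.length);             // 2 code points in one user-perceived character
    }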

- Jonathan M Davis

June 01, 2016
On Wednesday, 1 June 2016 at 02:17:21 UTC, Jonathan M Davis wrote:
> ...

This thread is going in circles; the against crowd has stated each of their arguments very clearly at least five times in different ways.

The cost/benefit problems with auto decoding are as clear as day. If the evidence already presented in this thread (and in the many others) isn't enough to convince people of that, then I don't think anything else said will have an impact.

I don't want to sound like someone telling people not to discuss this anymore, but honestly, what is continuing this thread going to accomplish?
May 31, 2016
On 5/31/2016 4:00 PM, ag0aep6g wrote:
> Wikipedia says [1] that UCS-2 is essentially UTF-16 without surrogate pairs. I
> suppose you mean UTF-32/UCS-4.
> [1] https://en.wikipedia.org/wiki/UTF-16

Thanks for the correction.
June 01, 2016
On Tuesday, 31 May 2016 at 16:29:33 UTC, Joakim wrote:
> UTF-8 is an antiquated hack that needs to be eradicated.  It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world.

I assume you're talking about the web here. In that case, plain text makes up only a minor part of the traffic; the majority is images (binary data), JavaScript and stylesheets (almost pure ASCII), and HTML markup (ditto). The encoding overhead is simply not significant, even before taking compression into account, which is ubiquitous anyway.

> It is unnecessarily inefficient, which is precisely why auto-decoding is a problem.

No, inefficiency is the least of the problems with auto-decoding.

> It is only a matter of time till UTF-8 is ditched.

This is ridiculous, even if your other claims were true.

>
> D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable.  I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte.  Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_

I think I remember that post, and - sorry to be so blunt - it was one of the worst things I've ever seen proposed regarding text encoding.

>
> The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language!  This is madness.

No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.