June 01, 2016
On Tuesday, 31 May 2016 at 20:56:43 UTC, Andrei Alexandrescu wrote:
> On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
>> In the vast majority of cases what folks care about is full character
>
> How are you so sure? -- Andrei

He doesn't need to be sure. You are the one advocating for code points, so the burden is on you to present evidence that it's the correct choice.
June 01, 2016
On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:
> On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:
>> Wasn't the whole point of operating at the code point level by default to
>> make it so that code would be operating on full characters by default
>> instead of chopping them up as is so easy to do when operating at the code
>> unit level?
>
> The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units).

_Both_ are low-level representation-specific artifacts.
June 01, 2016
On Wednesday, 1 June 2016 at 01:13:17 UTC, Steven Schveighoffer wrote:
> On 5/31/16 4:38 PM, Timon Gehr wrote:
>> What about e.g. joiner?
>
> Compiler error. Better than what it does now.

I believe everything that does only concatenation will work correctly. That's why joiner() is one of those algorithms that should accept strings directly without going through any decoding (but it may need to recode the joining element itself, of course).
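
To make the concatenation point concrete, here is a minimal sketch in Python rather than D. It is not Phobos' joiner(); the helper name `join_utf8` and the sample words are made up for illustration. It joins chunks that are already UTF-8 encoded by copying their bytes, and only the separator is encoded:

```python
# Hypothetical helper for illustration only; this is not Phobos' joiner().
def join_utf8(parts, separator):
    """Join already-encoded UTF-8 chunks without decoding any of them."""
    # Only the separator is encoded; every part is copied byte for byte.
    return separator.encode("utf-8").join(parts)


# Sample words are assumptions chosen to include multi-byte sequences.
words = ["Sch\u00fctz", "na\u00efve", "\u65e5\u672c\u8a9e"]
encoded_parts = [word.encode("utf-8") for word in words]

# Join at the byte (code unit) level ...
joined_bytes = join_utf8(encoded_parts, " | ")
# ... and, for comparison, at the decoded text level.
joined_text = " | ".join(words)
```

Both routes should produce the same byte sequence, because concatenation never splits an existing multi-byte sequence.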
June 01, 2016
On Wednesday, 1 June 2016 at 10:04:42 UTC, Marc Schütz wrote:
> On Tuesday, 31 May 2016 at 16:29:33 UTC, Joakim wrote:
>> UTF-8 is an antiquated hack that needs to be eradicated.  It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world.
>
> I assume you're talking about the web here. In this case, plain text makes up only a minor part of the entire traffic, the majority of which is images (binary data), JavaScript and stylesheets (almost pure ASCII), and HTML markup (ditto). It's likely not significant even without taking compression into account, which is ubiquitous.

No, I explicitly said not the web in a subsequent post.  The ignorance here of what 2G speeds are like is mind-boggling.
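
For reference, the size difference under discussion can be measured directly. The following is a rough Python sketch; the sample words are arbitrary picks, and cp1251 merely stands in for "some single-byte encoding" (an assumption, not the encoding proposed earlier in the thread):

```python
# Arbitrary sample words, written as escapes so the source stays ASCII.
samples = {
    "English": "hello",
    "Russian": "\u043f\u0440\u0438\u0432\u0435\u0442",  # "privet"
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",    # "namaste"
}

# For each sample: (number of code points, number of UTF-8 bytes).
sizes = {
    name: (len(text), len(text.encode("utf-8")))
    for name, text in samples.items()
}

# The same Russian word in a legacy single-byte code page (assumed stand-in).
russian_cp1251 = len(samples["Russian"].encode("cp1251"))

for name, (code_points, utf8_bytes) in sizes.items():
    print(f"{name}: {code_points} code points, {utf8_bytes} UTF-8 bytes")
print(f"Russian in cp1251: {russian_cp1251} bytes")
```

This measures raw, uncompressed text only; it says nothing about what compression or the rest of the traffic does to the total.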

>> It is unnecessarily inefficient, which is precisely why auto-decoding is a problem.
>
> No, inefficiency is the least of the problems with auto-decoding.

Right... that's why this 200-post thread was spawned with that as the main reason.

>> It is only a matter of time till UTF-8 is ditched.
>
> This is ridiculous, even if your other claims were true.

The UTF-8 encoding is what's ridiculous.

>>
>> D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable.  I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte.  Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_
>
> I think I remember that post, and - sorry to be so blunt - it was one of the worst things I've ever seen proposed regarding text encoding.

Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.

>>
>> The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language!  This is madness.
>
> No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.

Lol, this may be the dumbest argument put forth yet.

I don't think anyone here even understands what a good encoding is and what it's for, which is why there's no point in debating this.
June 01, 2016
On 06/01/2016 06:25 AM, Marc Schütz wrote:
> On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:
>> On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:
>>> Wasn't the whole point of operating at the code point level by
>>> default to
>>> make it so that code would be operating on full characters by default
>>> instead of chopping them up as is so easy to do when operating at the
>>> code
>>> unit level?
>>
>> The point is to operate on representation-independent entities
>> (Unicode code points) instead of low-level representation-specific
>> artifacts (code units).
>
> _Both_ are low-level representation-specific artifacts.

Maybe this is a misunderstanding. Representation = how things are laid out in memory. What does associating numbers with various Unicode symbols have to do with representation? -- Andrei
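
A small Python sketch of the distinction being drawn here, assuming "representation" means the encoded bytes: the number assigned to a symbol stays the same, while the code units differ per encoding. The variable names are illustrative only.

```python
symbol = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The code point is just the number Unicode assigns to the symbol.
code_point = ord(symbol)

# The code units depend on the encoding chosen for storage.
utf8 = symbol.encode("utf-8")       # two 8-bit code units
utf16 = symbol.encode("utf-16-le")  # one 16-bit code unit (2 bytes)
utf32 = symbol.encode("utf-32-le")  # one 32-bit code unit (4 bytes)

print(f"U+{code_point:04X}", utf8.hex(), utf16.hex(), utf32.hex())
```

Whether the code point itself is also "representation" at a higher level is the question the rest of the thread argues about; the sketch only covers the code unit side.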

June 01, 2016
Am Wed, 01 Jun 2016 13:57:27 +0000
schrieb Joakim <dlang@joakim.fea.st>:

> No, I explicitly said not the web in a subsequent post.  The ignorance here of what 2G speeds are like is mind-boggling.

I've used 56k and had a phone conversation with my sister while she was downloading an 800 MiB file over 2G. You just learn to be patient (or you already are when the next major city is hundreds of kilometers away) and load only what you need. Your point about the costs convinced me more.

Here is one article spiced up with numbers and figures: http://www.thequint.com/technology/2016/05/30/almost-every-indian-may-be-online-if-data-cost-cut-to-one-third

But even if you could prove with a study that UTF-8 caused a
notable bandwidth cost in real life, it would - I think - be a
matter for regional ISPs to provide special servers and apps
that reduce data volume. There is also the overhead of
key exchange when establishing a secure connection:
http://stackoverflow.com/a/20306907/4038614
That is something every app should do, but it will increase
bandwidth use. Then there is the overhead of using XML in
applications like WhatsApp, which I presume is quite popular
around the world. I'm just trying to broaden the view a bit here.
This note from the XMPP spec that WhatsApp and Jabber use will
make you cringe: https://tools.ietf.org/html/rfc6120#section-11.6

-- 
Marco

June 01, 2016
On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:
>
> No, I explicitly said not the web in a subsequent post.  The ignorance here of what 2G speeds are like is mind-boggling.
>
It's not hard.  I think a lot of us remember when a 14.4 modem was cutting-edge.  Codepages and incompatible encodings were terrible then, too.

Never again.

> Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.

It _is_ kind of ludicrous, isn't it?  But it really is the least-bad option for the most text.  Sorry, bub.

>> No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.
>
> Lol, this may be the dumbest argument put forth yet.

This just makes it feel like you're trolling.  You're not just trolling, right?

> I don't think anyone here even understands what a good encoding is and what it's for, which is why there's no point in debating this.

And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past.  This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed.

If you have to deal with delivering the fastest possible i18n at GSM data rates, well, that's a tough problem and it sounds like you might need to do something pretty special. Turning the entire ecosystem into your special case is not the answer.

-Wyatt
June 01, 2016
On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
> On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:
>>
>> No, I explicitly said not the web in a subsequent post.  The ignorance here of what 2G speeds are like is mind-boggling.
>>
> It's not hard.  I think a lot of us remember when a 14.4 modem was cutting-edge.  Codepages and incompatible encodings were terrible then, too.
>
> Never again.
>
>> Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.
>
> It _is_ kind of ludicrous, isn't it?  But it really is the least-bad option for the most text.  Sorry, bub.
>
>>> No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.
>>
>> Lol, this may be the dumbest argument put forth yet.
>
> This just makes it feel like you're trolling.  You're not just trolling, right?
>
>> I don't think anyone here even understands what a good encoding is and what it's for, which is why there's no point in debating this.
>
> And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past.  This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed.

Indeed, Joakim's proposal is so insane it beggars belief (why not go back to Baudot code? It's only 5 bits, hurray, it's so much faster when used with flag semaphores).

As a programmer in the European Commission's translation unit, who has worked for 14 years on what is probably the biggest translation memory in the world, I can attest that Unicode is a blessing. I still remember the shit we had in our documents because of code pages, before most programs could handle UTF-8 or UTF-16 (and before 2004 we only had two alphabets to take care of, Western and Greek). What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual. Translators, of course, deal almost exclusively with documents that are at least bilingual: any document a translator works on must be able to present both the source and the target language. But even outside of that specific population, multilingual documents are very, very common.
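
For readers who never had to deal with code pages, here is a minimal Python sketch of the failure mode. The sample word and the choice of cp1253 (Greek) and cp1252 (Western) are assumptions picked to match the two alphabets mentioned above; this is not taken from any real document.

```python
# A Greek word ("Ellada"), written as escapes so the source stays ASCII.
text = "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"

# Stored by a program that assumes the Greek code page ...
stored = text.encode("cp1253")

# ... and opened by a program that assumes the Western code page.
# The bytes are unchanged, but every letter now shows as a different one.
misread = stored.decode("cp1252")

# With one shared encoding there is no code page to guess.
utf8_roundtrip = text.encode("utf-8").decode("utf-8")

print(text, "->", misread)
```

Nothing in the byte stream says which code page was intended, so a document mixing both alphabets cannot be stored in either one.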

>
> If you have to deal with delivering the fastest possible i18n at GSM data rates, well, that's a tough problem and it sounds like you might need to do something pretty special. Turning the entire ecosystem into your special case is not the answer.
>



June 01, 2016
On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:
> What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual.

That should be obvious to anyone living outside the USA.

June 01, 2016
On 06/01/2016 10:29 AM, Andrei Alexandrescu wrote:
> On 06/01/2016 06:25 AM, Marc Schütz wrote:
>> On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:
>>>
>>> The point is to operate on representation-independent entities
>>> (Unicode code points) instead of low-level representation-specific
>>> artifacts (code units).
>>
>> _Both_ are low-level representation-specific artifacts.
>
> Maybe this is a misunderstanding. Representation = how things are laid
> out in memory. What does associating numbers with various Unicode
> symbols have to do with representation? -- Andrei
>

As has been explained countless times already, code points are a non-1:1 internal representation of graphemes. Code points don't exist for their own sake; their entire existence is purely as a way to encode graphemes. Whether that technically qualifies as "memory representation" or not is irrelevant: it's still a low-level implementation detail of text.
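
A short Python sketch of the non-1:1 relationship, using only the standard library. The helper `approx_grapheme_count` is hypothetical and deliberately crude: it just skips combining marks, and is not a full implementation of Unicode grapheme cluster segmentation.

```python
import unicodedata

# The same visible character, encoded two ways.
precomposed = "\u00e9"   # one code point: e with acute
decomposed = "e\u0301"   # two code points: e + COMBINING ACUTE ACCENT


def approx_grapheme_count(text):
    """Rough approximation: count code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Code point counts differ even though a reader sees one character.
print(len(precomposed), len(decomposed))

# Normalization maps one form onto the other.
recomposed = unicodedata.normalize("NFC", decomposed)

# The approximate grapheme count is the same for both forms.
print(approx_grapheme_count(precomposed), approx_grapheme_count(decomposed))
```

Comparing or counting at the code point level treats these two strings as different, which is the sense in which code points are still an encoding detail.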