May 31, 2016
On 05/31/2016 03:34 PM, ag0aep6g wrote:
> On 05/31/2016 07:21 PM, Andrei Alexandrescu wrote:
>> Could you please substantiate that? My understanding is that code unit
>> is a higher-level Unicode notion independent of encoding, whereas code
>> point is an encoding-dependent representation detail. -- Andrei
>
> You got the terms mixed up. Code unit is lower level. Code point is
> higher level.

Apologies and thank you. -- Andrei

May 31, 2016
On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
> In the vast majority of cases what folks care about is full character

How are you so sure? -- Andrei
May 31, 2016
On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:
> Wasn't the whole point of operating at the code point level by default to
> make it so that code would be operating on full characters by default
> instead of chopping them up as is so easy to do when operating at the code
> unit level?

The point is to operate on representation-independent entities (Unicode code points) instead of low-level, representation-specific artifacts (code units). That's the contract, and it seems meaningful given that Unicode is defined in terms of code points as its abstract building block. If user code needs to go down to the code unit level, they can do so. If user code needs to go up to the grapheme level, they can do so. If anything this thread strengthens my opinion that autodecoding is a sweet spot. -- Andrei
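For illustration, here is a minimal D sketch of the three levels in question: code units, auto-decoded code points, and grapheme clusters. It assumes Phobos's std.utf.byCodeUnit and std.uni.byGrapheme behave as documented; the sample string is made up purely so the three counts differ.

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "noël" with the 'ë' stored in decomposed (NFD) form:
    // 'e' followed by U+0308 COMBINING DIAERESIS.
    string s = "noe\u0308l";

    assert(s.byCodeUnit.walkLength == 6);  // UTF-8 code units
    assert(s.walkLength == 5);             // auto-decoded code points (dchar)
    assert(s.byGrapheme.walkLength == 4);  // grapheme clusters ("characters")
}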
May 31, 2016
On 31.05.2016 22:20, Marco Leise wrote:
> On Tue, 31 May 2016 16:29:33 +0000, Joakim <dlang@joakim.fea.st> wrote:
>
>>> Part of it is the complexity of written language, part of it is
>>> bad technical decisions.  Building the default string type in D
>>> around the horrible UTF-8 encoding was a fundamental mistake,
>>> both in terms of efficiency and complexity.  I noted this in one
>>> of my first threads in this forum, and as Andrei said at the
>>> time, nobody agreed with me, with a lot of hand-waving about how
>>> efficiency wasn't an issue or that UTF-8 arrays were fine.
>>> Fast-forward years later and exactly the issues I raised are now
>>> causing pain.
> Maybe you can dig up your old post and we can look at each of
> your complaints in detail.
>

It is probably this one. Not sure what "exactly the issues" are though.

http://forum.dlang.org/thread/bwbuowkblpdxcpysejpb@forum.dlang.org
May 31, 2016
On Tue, May 31, 2016 at 05:01:17PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:
> > Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up as is so easy to do when operating at the code unit level?
> 
> The point is to operate on representation-independent entities
> (Unicode code points) instead of low-level representation-specific
> artifacts (code units).

This is basically saying that we operate on dchar[] by default, except that we disguise its detrimental memory usage consequences by compressing to UTF-8/UTF-16 and incurring the cost of decompression every time we access its elements.  Perhaps you love the idea of running an OS that stores all files in compressed form and always decompresses upon every syscall to read(), but I prefer a higher-performance system.


> That's the contract, and it seems meaningful
> seeing how Unicode is defined in terms of code points as its abstract
> building block.

Where's this contract stated, and when did we sign up for this?


> If user code needs to go down to the code unit level, they can do so. If user code needs to go up to the grapheme level, they can do so.

Only with much pain by using workarounds to bypass meticulously-crafted autodecoding algorithms in Phobos.


> If anything this thread strengthens my opinion that autodecoding is a sweet spot. -- Andrei

No, autodecoding is a stalemate that's neither fast nor correct.


T

-- 
"Real programmers can write assembly code in any language. :-)" -- Larry Wall
May 31, 2016
On Tue, 31 May 2016 13:06:16 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 05/31/2016 12:54 PM, Jonathan M Davis via Digitalmars-d wrote:
> > Equality does not require decoding. Similarly, functions like find don't either. Something like filter generally would, but it's also not particularly normal to filter a string on a by-character basis. You'd probably want to get to at least the word level in that case.
> 
> It's nice that the stdlib takes care of that.

Both "equality" and "find" require byGrapheme.

 ⇰ The equivalence algorithm first brings both strings into a
   common normalization form (NFD or NFC), a step that works on one
   grapheme cluster at a time, and only then does the binary
   comparison.
   http://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence

 ⇰ Find would yield false positives where a match is only the start
   of a grapheme cluster, i.e. it will match the 'o' in an NFD "ö"
   (simplified example).
   http://www.unicode.org/reports/tr10/#Searching
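
A hedged sketch of both points in Phobos terms, assuming std.uni's normalize/NFD and std.algorithm's canFind; the strings are illustrative only:

import std.algorithm.searching : canFind;
import std.uni : normalize, NFD;

void main()
{
    string composed   = "\u00F6";   // "ö" as one code point (NFC)
    string decomposed = "o\u0308";  // "ö" as 'o' + combining diaeresis (NFD)

    // Code-point equality misses canonical equivalence...
    assert(composed != decomposed);
    // ...but comparing a common normalization form does the right thing.
    assert(normalize!NFD(composed) == decomposed);

    // Code-point find gives a false positive: it matches the plain 'o'
    // that merely starts the decomposed "ö".
    assert(decomposed.canFind('o'));
}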

-- 
Marco

May 31, 2016
On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:

> If user code needs to go up to the grapheme level, they can do so. If anything this thread strengthens my opinion that autodecoding is a sweet spot. -- Andrei

The Unicode FAQ disagrees (http://unicode.org/faq/utf_bom.html):

"Q: How about using UTF-32 interfaces in my APIs?

A: Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16 APIs  the low level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the required functionality at the high levels."
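
Phobos appears to have a primitive in exactly that spirit: std.uni.graphemeStride, which (if I'm reading its documentation right) reports the length of the grapheme cluster starting at a given code-unit index, measured in code units. A small sketch with a made-up string:

import std.uni : graphemeStride;

void main()
{
    // 'x', then a decomposed "ö" ('o' + U+0308), then 'y':
    // 5 UTF-8 code units, 3 user-perceived characters.
    string s = "xo\u0308y";

    size_t i = 0;          // index in code units
    size_t graphemes = 0;
    while (i < s.length)
    {
        i += graphemeStride(s, i);  // cluster length at i, in code units
        ++graphemes;
    }
    assert(graphemes == 3);
}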


May 31, 2016
On Tue, 31 May 2016 16:56:43 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
> > In the vast majority of cases what folks care about is full character
> 
> How are you so sure? -- Andrei

Because a full character is the typical unit of a written language. It's what we visualize in our heads when we think about finding a substring or counting characters. A special case of this is the reduction to ASCII where we can use code units in place of grapheme clusters.
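
A quick illustration of that ASCII special case (a sketch only; it assumes the same Phobos byCodeUnit/byGrapheme primitives as above, and the strings are made up):

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // For pure ASCII, counting code units already counts characters...
    string a = "substring";
    assert(a.byCodeUnit.walkLength == a.byGrapheme.walkLength);

    // ...but the shortcut breaks outside ASCII: "naïve" with a
    // decomposed 'ï' has 7 UTF-8 code units but 5 characters.
    string s = "nai\u0308ve";
    assert(s.byCodeUnit.walkLength == 7);
    assert(s.byGrapheme.walkLength == 5);
}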

-- 
Marco

May 31, 2016
On 5/31/2016 1:20 PM, Marco Leise wrote:
> [...]

I agree. I dealt with the madness of code pages, Shift-JIS, EBCDIC, locales, etc., in the pre-Unicode days. Despite its problems, Unicode (and UTF-8) is a major improvement, and I mean major.

16 years ago, I bet that Unicode was the future, and events have shown that to be correct.

But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful pretty much only as a transitional encoding to talk with Windows APIs. Nobody uses UCS-2 (it consumes far too much memory).
June 01, 2016
On 06/01/2016 12:47 AM, Walter Bright wrote:
> But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so
> D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16
> is useful pretty much only as a transitional encoding to talk with
> Windows APIs. Nobody uses UCS-2 (it consumes far too much memory).

Wikipedia says [1] that UCS-2 is essentially UTF-16 without surrogate pairs. I suppose you mean UTF-32/UCS-4.


[1] https://en.wikipedia.org/wiki/UTF-16