May 30, 2016
On Monday, 30 May 2016 at 18:24:23 UTC, Andrei Alexandrescu wrote:
> That kind of makes this thread less productive than "How to improve autodecoding?" -- Andrei

Please don't misunderstand, I'm for fixing string behavior. But let's not pretend that this wouldn't be one of the largest (if not the largest) breaking changes since D2. As I said, straight up removing auto-decoding would break all string handling code.
May 30, 2016
On 30.05.2016 18:01, Andrei Alexandrescu wrote:
> On 05/28/2016 03:04 PM, Walter Bright wrote:
>> On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
>>> So it harkens back to the original mistake: strings should NOT be
>>> arrays with
>>> the respective primitives.
>>
>> An array of code units provides consistency, predictability,
>> flexibility, and performance. It's a solid base upon which the
>> programmer can build what he needs as required.
>
> Nope. Not buying it.

I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.
May 30, 2016
On 05/30/2016 03:00 PM, Jack Stouffer wrote:
> On Monday, 30 May 2016 at 18:24:23 UTC, Andrei Alexandrescu wrote:
>> That kind of makes this thread less productive than "How to improve
>> autodecoding?" -- Andrei
>
> Please don't misunderstand, I'm for fixing string behavior.

Surely the misunderstanding is not on this side of the table :o). By "that" I meant your assertion at face value (i.e. assuming it's a fact) "All string handling code would become broken, even if it appears to work at first". -- Andrei
May 30, 2016
On Mon, 30 May 2016 17:35:36 +0000, Chris <wendlec@tcd.ie> wrote:

> I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.

You have to compare it to the situation before, when every operating system with every localization had its own encoding. Have some text file with ASCII art in a DOS code page? It doesn't render on Windows even with the same locale. Open Cyrillic text on a Latin system? Indigestible. Someone wrote a website on Windows and incorrectly tagged it with an ISO charset? The browser has to fix it up for them.

One objection I remember was the Han Unification:
https://en.wikipedia.org/wiki/Han_unification
Not everyone liked how Chinese, Japanese, and Korean were represented with a common set of ideograms. At the time Unicode was still 16-bit, and the unified symbols would already make up 32% of all code points.

In my eyes, many of the perceived problems with Unicode stem from the fact that it raises awareness of different writing systems all over the globe in a way we never had to face before: when software was developed locally instead of globally on GitHub, when the target was Windows instead of cross-platform and mobile, we were lucky if we localized for a couple of Latin-script languages, and Asia was a real barrier.

I don't know what you and your colleague discussed about ICU, but it was probably whether you should add another dependency and what the alternatives are. In Linux user space, almost everything is an outside project, an extra library, most of them with alternatives. My own research led me to the conclusion that there is one set of libraries without real alternatives: ICU -> HarfBuzz -> Pango. That's the go-to chain for Unicode text, from text processing through shaping to layout. Moreover, many successful open-source projects make use of it: LibreOffice, SQLite, Qt, libxml2, and WebKit, to name a few. Unicode is here to stay, no matter what could have been done better in the past, and I think it is perfectly safe to bet on ICU on Linux to provide what e.g. Windows has built in.

Otherwise just do as Adam Ruppe said:
> Don't mess with strings. Get them from the user, store them without modification, spit them back out again.

:p

-- 
Marco

May 30, 2016
On 05/30/2016 03:04 PM, Timon Gehr wrote:
> On 30.05.2016 18:01, Andrei Alexandrescu wrote:
>> On 05/28/2016 03:04 PM, Walter Bright wrote:
>>> On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
>>>> So it harkens back to the original mistake: strings should NOT be
>>>> arrays with
>>>> the respective primitives.
>>>
>>> An array of code units provides consistency, predictability,
>>> flexibility, and performance. It's a solid base upon which the
>>> programmer can build what he needs as required.
>>
>> Nope. Not buying it.
>
> I'm buying it. IMO alias string=immutable(char)[] is the most useful
> choice, and auto-decoding ideally wouldn't exist.

Wouldn't D then be seen (and rightfully so) as largely not supporting Unicode, seeing as its many, many core generic algorithms seem to randomly work or not work on arrays of characters? Wouldn't ranges - the most important artifact of D's stdlib - default, for strings, to the least meaningful approach to strings (dumb code units)? Would a smattering of Unicode primitives in std.utf and friends entitle us to claim D had Unicode dyed in its wool? (None of these are rhetorical.)

I.e. wouldn't we be in a worse place than now? (This is rhetorical.) The best argument for autodecoding is to contemplate where we'd be without it: the ghetto of Unicode string handling.

I'm not going to debate this further (though I'll look for meaningful answers to the questions above). But this thread has been informative, even if it did little to change my conviction that autodecoding is a good thing for D, all things considered (i.e. given the earlier wrong decision to not encapsulate string as a separate type distinct from a bare array of code units). I'd be lying if I said it did nothing. It did, but only a little.

Funny thing is that's not even what's important. What's important is that autodecoding is here to stay - there's no realistic way to eliminate it from D. So the focus should be making autodecoding the best it could ever be.


Andrei

May 30, 2016
On Mon, May 30, 2016 at 03:28:38PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/30/2016 03:04 PM, Timon Gehr wrote:
> > On 30.05.2016 18:01, Andrei Alexandrescu wrote:
> > > On 05/28/2016 03:04 PM, Walter Bright wrote:
> > > > On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
> > > > > So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.
> > > > 
> > > > An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
> > > 
> > > Nope. Not buying it.
> > 
> > I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.
> 
> Wouldn't D then be seen (and rightfully so) as largely not supporting Unicode, seeing as its many many core generic algorithms seem to randomly work or not on arrays of characters?

They already randomly work or don't work on ranges of dchar. I hope we don't have to rehash all the examples of why things that seem to work, like count, filter, map, etc., actually *don't* work outside of a very narrow set of languages. The best part of all this is that they *both* don't work properly *and* make your program pay the performance overhead, even when you're not using them -- thanks to ubiquitous autodecoding.
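
To make that concrete, here's a minimal sketch (assuming nothing beyond std.algorithm and std.uni.byGrapheme) of how the default code-point view goes wrong as soon as combining characters show up:

    import std.algorithm : canFind, count;
    import std.uni : byGrapheme;

    void main()
    {
        // "é" written as 'e' (U+0065) followed by a combining acute (U+0301)
        string s = "e\u0301";

        // Autodecoding iterates by code point, so count sees two elements...
        assert(s.count == 2);
        // ...while the user-perceived character count is one.
        assert(s.byGrapheme.count == 1);

        // Searching for the precomposed form finds nothing, even though
        // both spellings render identically.
        dchar precomposed = '\u00E9';
        assert(!s.canFind(precomposed));
    }

The second count is what a human reader expects; the first is what the library silently gives you -- and you pay for the decode either way.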


> Wouldn't ranges - the most important artifact of D's stdlib - default for strings on the least meaningful approach to strings (dumb code units)?

No, ideally there should *not* be a default range type -- the user needs to specify what he wants to iterate by, whether code unit, code point, or grapheme, etc.
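
With the pieces Phobos already has -- byCodeUnit and byDchar in std.utf, byGrapheme in std.uni -- that could look roughly like this (a sketch, not a proposal for new API):

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit, byDchar;

    void main()
    {
        // "noël" with the diaeresis written as 'e' + U+0308 (combining form)
        string s = "noe\u0308l";

        assert(s.byCodeUnit.walkLength == 6); // UTF-8 code units
        assert(s.byDchar.walkLength    == 5); // code points
        assert(s.byGrapheme.walkLength == 4); // user-perceived characters
    }

The unit of iteration is spelled out at the call site instead of being chosen silently by the library.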


> Would a smattering of Unicode primitives in std.utf and friends entitle us to claim D had dyed Unicode in its wool? (All are not rhetorical.)

I have no idea what this means.


> I.e. wouldn't be in a worse place than now? (This is rhetorical.) The best argument for autodecoding is to contemplate where we'd be without it: the ghetto of Unicode string handling.

I've no idea what you're talking about. Without autodecoding we'd actually have faster string handling, and forcing the user to specify the unit of iteration would bring more Unicode awareness, which would improve the quality of string-handling code instead of proliferating today's wrong code that just happens to work in some languages but makes a hash of things everywhere else.


> I'm not going to debate this further (though I'll look for meaningful answers to the questions above). But this thread has been informative in that it did little to change my conviction that autodecoding is a good thing for D, all things considered (i.e. the wrong decision to not encapsulate string as a separate type distinct from bare array of code units). I'd lie if I said it did nothing. It did, but only a little.
> 
> Funny thing is that's not even what's important. What's important is that autodecoding is here to stay - there's no realistic way to eliminate it from D. So the focus should be making autodecoding the best it could ever be.
[...]

If I ever had to write string-heavy code, I'd probably fork Phobos just so I can get decent performance. Just sayin'.


T

-- 
People walk. Computers run.
May 30, 2016
On Monday, 30 May 2016 at 18:26:32 UTC, Adam D. Ruppe wrote:
> On Monday, 30 May 2016 at 17:14:47 UTC, Andrew Godfrey wrote:
>> I like "make string iteration explicit" but I wonder about other constructs. E.g. What about "sort an array of strings"? How would you tell a generic sort function whether you want it to interpret strings by code unit vs code point vs grapheme?
>
> The comparison predicate does that...
>
> sort!( (string a, string b) {
>   /* you interpret a and b here and return the comparison */
> })(["hi", "there"]);

Thanks! You left out some details but I think I see - an example predicate might be "cmp(a.byGrapheme, b.byGrapheme)" and by the looks of it, that code works in D today.

(However, "cmp(a, b)" would default to code points today, which is surprising to almost everyone, and that's more what this thread is about).
May 30, 2016
On 30.05.2016 21:28, Andrei Alexandrescu wrote:
> On 05/30/2016 03:04 PM, Timon Gehr wrote:
>> On 30.05.2016 18:01, Andrei Alexandrescu wrote:
>>> On 05/28/2016 03:04 PM, Walter Bright wrote:
>>>> On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
>>>>> So it harkens back to the original mistake: strings should NOT be
>>>>> arrays with
>>>>> the respective primitives.
>>>>
>>>> An array of code units provides consistency, predictability,
>>>> flexibility, and performance. It's a solid base upon which the
>>>> programmer can build what he needs as required.
>>>
>>> Nope. Not buying it.
>>
>> I'm buying it. IMO alias string=immutable(char)[] is the most useful
>> choice, and auto-decoding ideally wouldn't exist.
>
> Wouldn't D then be seen (and rightfully so) as largely not supporting
> Unicode, seeing as its many many core generic algorithms seem to
> randomly work or not on arrays of characters?

In D, enum does not mean enumeration, const does not mean constant, pure is not pure, lazy is not lazy, and char does not mean character.

> Wouldn't ranges - the most
> important artifact of D's stdlib - default for strings on the least
> meaningful approach to strings (dumb code units)?

I don't see how that's the least meaningful approach. It's the data that you actually have sitting in memory. It's the data that you can slice and index and get a length for in constant time.
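
Concretely, a small sketch of the difference between the array and the range view Phobos layers on top of it:

    import std.range.primitives : hasLength, isRandomAccessRange;

    void main()
    {
        string s = "hello";

        // The array itself: O(1) length, indexing and slicing over code units.
        assert(s.length == 5);
        assert(s[1 .. 4] == "ell");
        assert(s[4] == 'o');

        // The autodecoded range view of the same type is treated as a
        // bidirectional range of dchar, without those guarantees:
        static assert(!isRandomAccessRange!string);
        static assert(!hasLength!string);
    }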

> Would a smattering of
> Unicode primitives in std.utf and friends entitle us to claim D had dyed
> Unicode in its wool? (All are not rhetorical.)
>...

We should support Unicode by having all the required functionality and properly documenting the data formats used. What is the goal here? I.e. what does a language that has "Unicode dyed in its wool" have that other languages do not? Why isn't it enough to provide data types for UTF8/16/32 and Unicode algorithms operating on them?

> I.e. wouldn't be in a worse place than now? (This is rhetorical.) The
> best argument for autodecoding is to contemplate where we'd be without
> it: the ghetto of Unicode string handling.
> ...

Those questions seem to be mostly marketing concerns. I'm more concerned with whether I find it convenient to use. Autodecoding does not improve Unicode support.

> I'm not going to debate this further (though I'll look for meaningful
> answers to the questions above). But this thread has been informative in
> that it did little to change my conviction that autodecoding is a good
> thing for D, all things considered (i.e. the wrong decision to not
> encapsulate string as a separate type distinct from bare array of code
> units). I'd lie if I said it did nothing. It did, but only a little.
>
> Funny thing is that's not even what's important. What's important is
> that autodecoding is here to stay - there's no realistic way to
> eliminate it from D. So the focus should be making autodecoding the best
> it could ever be.
>
>
> Andrei
>

Sure, I didn't mean to engage in a debate (it seems there is no decision to be made here that might affect me in the future).
May 30, 2016
On Mon, 30 May 2016 17:14:47 +0000, Andrew Godfrey <X@y.com> wrote:

> I like "make string iteration explicit" but I wonder about other constructs. E.g. What about "sort an array of strings"? How would you tell a generic sort function whether you want it to interpret strings by code unit vs code point vs grapheme?

You are just scratching the surface! Unicode strings are sorted following the Unicode Collation Algorithm, which is described in the 86-page document at http://www.unicode.org/reports/tr10/ and implemented in the ICU library mentioned before.

Some obvious considerations from the description of the algorithm:

In Sweden z comes before ö, while in Germany it's the reverse.
In Germany, words in a dictionary are sorted differently from
lists of names in a phone book.
  dictionary: of < öf,
  phone book: öf < of
Spanish sorts 'll' as one character right after 'l'.

The default collation is selected on Windows through the control panel's localization app and on Linux (POSIX) via the LC_COLLATE environment variable. The actual string sorting in the user's locale can then be performed with the C library's strcoll (http://www.cplusplus.com/reference/cstring/strcoll/) or with OS-specific functions such as CompareStringEx on Windows (https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx).
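
In D that can look roughly like this -- a sketch that leans on the C runtime rather than ICU; error handling, thread safety, and embedded NULs are ignored:

    import core.stdc.locale : LC_COLLATE, setlocale;
    import core.stdc.string : strcoll;
    import std.stdio : writeln;
    import std.string : toStringz;

    // Compare two strings according to the current LC_COLLATE locale.
    int localeCompare(string a, string b)
    {
        return strcoll(a.toStringz, b.toStringz);
    }

    void main()
    {
        // "" picks up the collation from the user's environment.
        setlocale(LC_COLLATE, "");

        // Whether "öf" sorts before or after "of" now depends on the locale.
        writeln(localeCompare("öf", "of") < 0);
    }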

TL;DR: neither code points nor grapheme clusters are adequate for string sorting. Also, two strings may compare unequal byte for byte while actually being the same text in different normalization forms (e.g. umlauts on OS X (NFD) vs. the rest of the world (NFC)).

Admittedly I find myself using str1 == str2 without first normalizing both, because it is frigging convenient and fast.
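
When normalization does matter, std.uni has the pieces (a small sketch, assuming std.uni.normalize and the NFC/NFD form tags it provides):

    import std.uni : NFC, NFD, normalize;

    void main()
    {
        string precomposed = "\u00E9";  // 'é' as a single code point (NFC)
        string decomposed  = "e\u0301"; // 'e' + combining acute (NFD)

        // Byte for byte the two differ...
        assert(precomposed != decomposed);

        // ...but after normalizing to a common form they compare equal.
        assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));
        assert(normalize!NFD(precomposed) == normalize!NFD(decomposed));
    }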

-- 
Marco

May 30, 2016
On 5/30/2016 12:52 PM, H. S. Teoh via Digitalmars-d wrote:
> If I ever had to write string-heavy code, I'd probably fork Phobos just
> so I can get decent performance. Just sayin'.

When I wrote Warp, the only point of which was speed, I couldn't use Phobos because of autodecoding. I have since recoded a number of Phobos functions so they don't autodecode, so the situation is better.