May 29, 2016
On Sunday, 29 May 2016 at 13:04:18 UTC, Tobias M wrote:
> On Sunday, 29 May 2016 at 12:08:52 UTC, default0 wrote:
>> I am pretty sure that a single grapheme in Unicode does not correspond to your notion of "character", and that what you think of as a "character" is officially called a "grapheme cluster", not a "grapheme".
>
> Grapheme is a linguistic term. AFAIUI, a grapheme cluster is a cluster of code points representing a grapheme. It's called a "cluster" in the Unicode spec because there is no dedicated grapheme unit.

> I put "character" in quotes because the term is not really well defined. I just used it for a short and concise answer. I'm sure there's a better/more correct definition of grapheme/phoneme, but it's probably also much longer and more complicated.

Which is why we need to agree on terminology, i.e. be clear about when we're using linguistic terms and when we're using Unicode-specific terminology.
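
To make the distinction concrete in D terms, here is a minimal sketch (assuming std.utf.byCodeUnit and std.uni.byGrapheme behave as documented; the string is just an illustration):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "noël", with the 'ë' written as 'e' + U+0308 (combining diaeresis)
    string s = "noe\u0308l";

    writeln(s.byCodeUnit.walkLength); // 6 UTF-8 code units
    writeln(s.walkLength);            // 5 code points (via autodecoding)
    writeln(s.byGrapheme.walkLength); // 4 grapheme clusters ("characters")
}

Three different answers to "how long is this string?", depending on which unit you mean.
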
May 29, 2016
On 05/12/2016 08:47 PM, Jack Stouffer wrote:
>
> If you're serious about removing auto-decoding, which I think you and
> others have shown has merits, you have to have THE SIMPLEST migration
> path ever, or you will kill D. I'm talking a simple press of a button.
>
> I'm not exaggerating here. Python, a language which was much more
> popular than D at the time, came out with two versions in 2008: Python
> 2.6, which had numerous unicode problems, and Python 3.0, which fixed
> those problems. Almost eight years later, and Python 2 is STILL the more
> popular version despite Py3 having five major point releases since and
> Python 2 only getting security patches. Think the Tango vs Phobos
> problem, only a little worse.
>
> D is much less popular now than Python was at the time, and Python 2's
> problems were more straightforward than the auto-decoding problem.
> You'll need a very clear migration path, years-long deprecations, and
> automatic tools in order to make the transition work, or else D's usage
> will be permanently damaged.

As much as I agree on the importance of a good smooth migration path, I don't think the "Python 2 vs 3" situation is really all that comparable here. Unlike Python, we wouldn't be maintaining a "with auto-decoding" fork for years and years and years, ensuring nobody ever had a pressing reason to bother migrating. And on top of that, we don't have a culture and design philosophy that promotes "do the lazy thing first and the robust thing never". D users are more likely than dynamic language users to be willing to make a few changes for the sake of improvement.

Heck, we weather breaking fixes enough anyway. There was even one point within the last couple of years where something (I forget offhand what it was) was removed from std.datetime and its replacement was added *in the very same compiler release*. No transition period. It was an annoying pain (at least to me), but I got through it fine and never even entertained the thought of just sticking with the old compiler. Not sure most people even noticed it. Point is, in D, even when something does need to change, life goes on fine. As long as we don't maintain a long-term fork ;)

Naturally, minimizing breakage is important here, but I really don't think Python's UTF migration situation is all that comparable.

May 29, 2016
On 05/29/2016 09:42 AM, Tobias M wrote:
> On Friday, 27 May 2016 at 19:43:16 UTC, H. S. Teoh wrote:
>> On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via
>> Digitalmars-d wrote:
>>> On 5/27/16 3:10 PM, ag0aep6g wrote:
>>> > I don't think there is value in distinguishing by language. The
>>> > point of Unicode is that you shouldn't need to do that.
>>>
>>> It seems code points are kind of useless because they don't really
>>> mean anything, would that be accurate? -- Andrei
>>
>> That's what we've been trying to say all along! :-P  They're a kind of
>> low-level Unicode construct used for building "real" characters, i.e.,
>> what a layperson would consider to be a "character".
>
> Code points are *the fundamental unit* of unicode. AFAIK most (all?)
> algorithms in the unicode spec are defined in terms of code points.
> Sure, some algorithms also work on the code unit level. That can be used
> as an optimization, but they are still defined on code points.
>
> Code points are also abstracting over the different representations
> (UTF-...), providing a uniform "interface".

So now code points are good? -- Andrei

May 29, 2016
On Sun, May 29, 2016 at 03:55:22PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/29/2016 09:42 AM, Tobias M wrote:
> > On Friday, 27 May 2016 at 19:43:16 UTC, H. S. Teoh wrote:
> > > On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> > > > On 5/27/16 3:10 PM, ag0aep6g wrote:
> > > > > I don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that.
> > > > 
> > > > It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
> > > 
> > > That's what we've been trying to say all along! :-P  They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character".
> > 
> > Code points are *the fundamental unit* of unicode. AFAIK most (all?) algorithms in the unicode spec are defined in terms of code points. Sure, some algorithms also work on the code unit level. That can be used as an optimization, but they are still defined on code points.
> > 
> > Code points are also abstracting over the different representations (UTF-...), providing a uniform "interface".
> 
> So now code points are good? -- Andrei

It depends on what you're trying to accomplish. That's the point we're trying to get at.  For some operations, working with code points makes the most sense. But for other operations, it does not.  There is no one representation that is best for all situations; it needs to be decided on a case-by-case basis.  Which is why forcing everything to decode to code points eventually leads to problems.
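
To illustrate with a rough sketch (the function choices here are just one way to slice it):

import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "résumé";

    // Searching for an ASCII character: code units are enough, no
    // decoding required, and it's faster.
    assert(s.byCodeUnit.canFind('s'));

    // "How many characters does the user see?": code points would
    // overcount combining sequences; graphemes are the right unit here.
    assert("e\u0301".byGrapheme.walkLength == 1); // 'é' as 'e' + U+0301
}

Different tasks, different units; no single default is right for all of them.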


T

-- 
Customer support: the art of getting your clients to pay for your own incompetence.
May 29, 2016
On 05/12/2016 10:15 PM, Walter Bright wrote:
> On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
>> I am as unclear about the problems of autodecoding as I am about
>> the necessity to remove curl. Whenever I ask I hear some arguments
>> that work well emotionally but are scant on reason and engineering.
>> Maybe it's time to rehash them? I just did so about curl, no solid
>> argument seemed to come together. I'd be curious of a crisp list of
>> grievances about autodecoding. -- Andrei
> 
> Here are some that are not matters of opinion.
>
> 6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.

There are more than two choices here; see the related discussion on avoiding redundant Unicode validation: https://issues.dlang.org/show_bug.cgi?id=14519#c32.
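
For reference, here is a sketch of the two behaviors Walter describes, as Phobos exposes them today (assuming std.utf.decode throws UTFException and std.utf.byUTF substitutes U+FFFD, as documented):

import std.array : array;
import std.utf : byUTF, decode, UTFException;

void main()
{
    string bad = "abc\xFFdef"; // 0xFF is never valid in UTF-8

    // Choice 1: throw. This is what autodecoding does today, so no
    // algorithm built on top of it can be nothrow.
    size_t i = 3;
    try
    {
        decode(bad, i);
    }
    catch (UTFException e)
    {
        // invalid sequence detected
    }

    // Choice 2: substitute the replacement character and carry on.
    auto repaired = bad.byUTF!dchar.array; // the 0xFF byte becomes U+FFFD
}

And as the linked issue discusses, there are further options still, e.g. validating once up front and then iterating with no per-element checks at all.
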
May 29, 2016
On 5/29/2016 4:47 AM, Tobias Müller wrote:
> No, this is well-established terminology; you are confusing several things here:

For D, we should stick with the terminology as defined by Unicode.

May 29, 2016
On Sun, May 29, 2016 at 01:13:36PM +0000, Tobias M via Digitalmars-d wrote:
> On Sunday, 29 May 2016 at 12:41:50 UTC, Chris wrote:
> > Ok, you have a point there; to be precise, <sh> is a multigraph (a digraph) (cf. [1]). In French you can have multigraphs consisting of three or more characters, e.g. <eau> => /o/, and similarly in Irish <aoi> => /i:/. However, a phoneme is not necessarily a spoken "character", as <sh> represents one phoneme but consists of two "characters" or graphemes. <th> can represent two different phonemes (voiced and unvoiced "th", as in `this` vs. `thorough`).
> 
> What I meant was, a phoneme is the "character" (smallest unit) in a
> spoken language, not that it corresponds to a character (whatever that
> means).
[...]

Calling a phoneme a "character" is misleading.  A phoneme is a logical sound unit in a spoken language, whereas a "character" is a unit of written language.  The two do not necessarily have a direct correspondence (or even any correspondence whatsoever).

In a language like English, whose writing system was codified many hundreds of years ago, the spoken language has sufficiently diverged from the written language (specifically, in the way words are spelt) that the correspondence between the two is complex at best, downright arbitrary at worst.  For example, the 'o' in "women" and the 'i' in "fish" map to the same phoneme, the short /i/, in (common dialects of) spoken English, in spite of being two completely different characters. Therefore conflating "character" and "phoneme" is misleading and is only confusing the issue.

As far as Unicode is concerned, it is a standard for representing *written* text, not spoken language, so concepts like phonemes aren't even relevant in the first place.  Let's not get derailed from the present discussion by confusing the two.


T

-- 
What are you when you run out of Monet? Baroque.
May 30, 2016
On Sunday, 29 May 2016 at 17:35:35 UTC, Nick Sabalausky wrote:
> Unlike Python, we wouldn't be maintaining a "with auto-decoding" fork for years and years and years, ensuring nobody ever had a pressing reason to bother migrating.

If it happens, they'd better. The D1 fork was maintained for almost three years for a good reason.

> Heck, we weather breaking fixes enough anyway.

Not nearly on a scale similar to changing how strings are iterated; not since the D1/D2 split.

> It was an annoying pain (at least to me), but I got through it fine and never even entertained the thought of just sticking with the old compiler.
> Not sure most people even noticed it. Point is, in D, even when something does need to change, life goes on fine. As long as we don't maintain a long-term fork ;)

The problem is not active users. The problem is companies with >10K LOC and libraries that are no longer maintained. E.g., it took Sociomantic eight years after D2's release to switch only a few parts of their projects to D2. With the loss of old libraries/old code (even old answers on SO), you suddenly lose a lot of the network effect that makes programming languages so much more useful.

May 29, 2016
On 5/29/2016 5:56 PM, H. S. Teoh via Digitalmars-d wrote:
> As far as Unicode is concerned, it is a standard for representing
> *written* text, not spoken language, so concepts like phonemes aren't
> even relevant in the first place.  Let's not get derailed from the
> present discussion by confusing the two.

As far as D is concerned, we are not going to invent our own concepts around text that differ from Unicode's, or redefine Unicode terms. Unicode is what it is, and D is going to work with it.
May 30, 2016
On Sunday, 29 May 2016 at 17:35:35 UTC, Nick Sabalausky wrote:
> On 05/12/2016 08:47 PM, Jack Stouffer wrote:
>
> As much as I agree on the importance of a good smooth migration path, I don't think the "Python 2 vs 3" situation is really all that comparable here. Unlike Python, we wouldn't be maintaining a "with auto-decoding" fork for years and years and years, ensuring nobody ever had a pressing reason to bother migrating. And on top of that, we don't have a culture and design philosophy that promotes "do the lazy thing first and the robust thing never". D users are more likely than dynamic language users to be willing to make a few changes for the sake of improvement.
>
> Heck, we weather breaking fixes enough anyway. There was even one point within the last couple of years where something (I forget offhand what it was) was removed from std.datetime and its replacement was added *in the very same compiler release*. No transition period. It was an annoying pain (at least to me), but I got through it fine and never even entertained the thought of just sticking with the old compiler. Not sure most people even noticed it. Point is, in D, even when something does need to change, life goes on fine. As long as we don't maintain a long-term fork ;)
>
> Naturally, minimizing breakage is important here, but I really don't think Python's UTF migration situation is all that comparable.

I suggest providing an automatic tool (either within the compiler or as a separate program like dfix) to help with the transition. Ideally the tool would advise the user where potential problems are and how to fix them.

If it's true that auto decode is unnecessary in many cases, then it shouldn't affect the whole code base. But I might be mistaken here. Maybe we should make a list of the functions where auto decode does make a difference, see how common they are, and work out a strategy from there. Destroy.
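
As a strawman for what the tool's advice could look like (purely hypothetical rewrites; the actual rules would need to be worked out):

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void count(string s)
{
    // Before: relies on autodecoding, which silently counts code points
    // and can throw on invalid UTF.
    auto n = s.walkLength;

    // After: the tool suggests stating the intended unit explicitly.
    auto units = s.byCodeUnit.walkLength; // code units: nothrow, fast
    auto chars = s.byGrapheme.walkLength; // user-perceived characters
}

Code that only ever passes strings through unmodified (concatenation, I/O) should need no changes at all, which is exactly where such a list of affected functions would help.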