March 10, 2014
On Monday, 10 March 2014 at 14:05:39 UTC, dennis luehring wrote:
> On 07.03.2014 03:37, Walter Bright wrote:
>> In "Lots of low hanging fruit in Phobos" the issue came up about the automatic
>> encoding and decoding of char ranges.
>
> After reading many of the attached posts, the question is: what
> could D's future policy for introducing breaking changes be? It is
> not a solution to say it is not possible because of too many breaking changes; that will become more and more of a problem for D's evolution,
> much like C++.

I'm a newbie here but I've been waiting for D to mature for a long time. D IMHO has to stabilise now because:
* D needs a bigger community so that the big fish who have learnt the ins and outs don't get bored and move on due to lack of kudos etc.
* To get the bigger community D needs more _working_ libraries for major toolkits (GUI etc. etc.)
* Libraries will cease to work if there is significant change in D, and then can stay broken because there isn't the inertial mass of other developers to maintain them after the initial developer has moved on. You can see that this has happened a LOT.
* Anyway, the D that I read about in TDPL is already very exciting for programmers like myself; we just want that, thanks.

Breaking changes can go into D3, if and whenever that is. Keep breaking D2 now and it risks just being forevermore a playpen for computer scientist types.

Anyway who cares what I think but I think it reflects a lot of people's opinions too.


March 10, 2014
On Monday, 10 March 2014 at 14:11:13 UTC, Dicebot wrote:
> Historically 2 approaches have been practiced:
>
> 1) argue a lot and then do nothing
> 2) suddenly change something and tell users it was necessary

These are one and the same, just from the two opposing points of view.

> I also think that this is much more important issue than this whole thread but it does not seem to attract any real attention when mentioned.

You mean the whole policy on breaking changes?
March 10, 2014
>
> Historically 2 approaches have been practiced:
>
> 1) argue a lot and then do nothing

This happens (I think) because Andrei and Walter really value your and other experts' opinions, but nevertheless have to preserve the general way things work to protect the long-term future of D. They have to be open to persuasion, but it seems to me an argument would have to be very compelling to get them to change the basics now.

D is at that difficult 90% stage that we all know about where the boring difficult stuff is left to do. People like to discuss interesting new stuff which at the time seems oh-so-important.

March 10, 2014
On Monday, 10 March 2014 at 14:27:02 UTC, Vladimir Panteleev wrote:
> On Monday, 10 March 2014 at 14:11:13 UTC, Dicebot wrote:
>> Historically 2 approaches have been practiced:
>>
>> 1) argue a lot and then do nothing
>> 2) suddenly change something and tell users it was necessary
>
> These are one and the same, just from the two opposing points of view.

</sarcasm> :)

>
>> I also think that this is much more important issue than this whole thread but it does not seem to attract any real attention when mentioned.
>
> You mean the whole policy on breaking changes?

Yes. I gave up on this idea at some point, as there seemed to be consensus that no breaking changes would even be considered for D2 and that those that come from fixing bugs are not worth the fuss. This is exactly why I was so shocked that Walter even started this thread. If breaking changes are actually considered (rare or not), then it is absolutely critical to define the process for them and put a link to its description on the dlang.org front page.
March 10, 2014
On Mon, 10 Mar 2014 14:05:03 +0000,
"Andrea Fontana" <nospam@example.com> wrote:

> In Italian we need Unicode too. We have several accented letters, and programming languages often don't handle UTF-8 and other encodings so well...
> 
> In D I never had any problem with this, and I work a lot on text processing.
> 
> So my question: is there any problem I'm missing in D with Unicode support, or is it just a performance problem in algorithms?

The only real problem apart from potential performance issues I've seen mentioned in this thread is that indexing/slicing is done with code units. I think this:

auto index = countUntil(...);
auto slice = str[0 .. index];

is really the only problem with the current implementation.
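
To make the mismatch concrete, here is a minimal sketch, assuming the Phobos behaviour discussed in this thread: countUntil on a string counts decoded code points, while slicing works on code units. The sample text and the helper name prefixBefore are made up for illustration; std.string.indexOf is one existing function that returns a code-unit offset instead.

```d
import std.algorithm.searching : countUntil;
import std.string : indexOf;

// Hypothetical helper: returns the part of `str` before the first `c`,
// or the whole string if `c` does not occur.
string prefixBefore(string str, dchar c)
{
    // indexOf returns an offset in code units, which is what slicing expects.
    auto i = str.indexOf(c);
    return i < 0 ? str : str[0 .. i];
}

void main()
{
    string str = "héllo wörld";

    // countUntil decodes the string, so this is a count of code points.
    auto index = countUntil(str, 'w');
    assert(index == 6);

    // Slicing is done in code units; 'é' occupies two of them in UTF-8,
    // so the slice ends one character before the intended position.
    assert(str[0 .. index] == "héllo");

    // Using a code-unit offset gives the intended prefix.
    assert(prefixBefore(str, 'w') == "héllo ");
}
```

With pure ASCII input both approaches agree, which is why this kind of bug tends to stay unnoticed until non-ASCII text shows up.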


If we could start from scratch I'd say we keep operating on code points
by default but don't make strings arrays of char/wchar/dchar. Instead
they should be special types which do all operations (especially
indexing, slicing) on code points. This would be as safe as the current
implementation, always consistent but probably even slower in some
cases. Then offer some nice way to get the raw data for algorithms
which can deal with it.
However, I think it's too late to make these changes.
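
A rough sketch of what such a type might look like, purely as an illustration: CodePointString and its member names are hypothetical and not part of Phobos, and the linear-time indexing shows where the "probably even slower" cost would come from.

```d
import std.range : drop;
import std.range.primitives : front, walkLength;
import std.utf : toUTFindex;

// Hypothetical wrapper type; every operation is expressed in code points.
struct CodePointString
{
    private string raw;

    // Number of code points; O(n) because UTF-8 is variable-width.
    size_t length() { return raw.walkLength; }

    // n-th code point; also O(n), since earlier code points must be skipped.
    dchar opIndex(size_t n) { return raw.drop(n).front; }

    // Slice by code point positions, translated to code-unit offsets.
    CodePointString opSlice(size_t a, size_t b)
    {
        return CodePointString(raw[toUTFindex(raw, a) .. toUTFindex(raw, b)]);
    }

    // Escape hatch: the raw code units, for algorithms that can deal with them.
    string codeUnits() { return raw; }
}

void main()
{
    auto s = CodePointString("héllo");
    assert(s.length == 5);            // code points, not bytes
    assert(s[1] == 'é');
    assert(s[1 .. 3].codeUnits == "él");
    assert(s.codeUnits.length == 6);  // 'é' is two UTF-8 code units
}
```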
March 10, 2014
On Monday, 10 March 2014 at 13:18:50 UTC, Dicebot wrote:
> On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu wrote:
>> On 3/9/14, 6:47 AM, "Marc Schütz" <schuetzm@gmx.net> wrote:
>>> On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
>>>> 2) It is regression back to C++ days of no-one-cares-about-Unicode
>>>> pain. Thinking about strings as character arrays is so natural and
>>>> convenient that if language/Phobos won't punish you for that, it will
>>>> be extremely widespread.
>>>
>>> Not with Nick Sabalausky's suggestion to remove the implementation of
>>> front from char arrays. This way, everyone will be forced to decide
>>> whether they want code units or code points or something else.
>>
>> Such as giving up on that crappy language that keeps on breaking their code.
>>
>> Andrei
>
>
> That was more about "if you are crazy enough to even consider such breakage, this is closer to my personal perfection" than an actual proposal ;)

BTW, I don't believe it would be that bad, because there's a straightforward path of deprecation:

First, std.range.front for narrow strings (and dchar, for consistency) can be marked as deprecated. The deprecation message can say: "Please specify .byCodePoint()/.byCodeUnit()", guiding the users towards a better style (assuming one agrees that explicit is indeed better than implicit in this case).

After some time, the functionality can be moved into a compatibility module, with the deprecated functions still there, but now additionally telling the user about the quick fix of importing that module.

The deprecation period can be very long, and even if the functions should never be removed, at least everyone writing new code would do so in the new style.
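
As a sketch of the first step, D's deprecated attribute can carry exactly such a message. The function implicitFront below is a hypothetical stand-in for the implicitly decoding front, not the real Phobos declaration; std.utf.byCodeUnit and std.utf.byDchar are used here as explicit adapters, on the assumption that they are available in the Phobos version at hand.

```d
import std.utf : byCodeUnit, byDchar;

// Hypothetical stand-in for the implicitly decoding `front` of narrow strings.
// Callers still compile, but the compiler reports the message below.
deprecated("Please specify .byCodePoint()/.byCodeUnit()")
dchar implicitFront(string s)
{
    return s.byDchar.front;
}

void main()
{
    string s = "é";

    // New style: the caller states the level explicitly.
    assert(s.byCodeUnit.front == 0xC3); // first UTF-8 code unit of 'é'
    assert(s.byDchar.front == 'é');     // decoded code point

    // Old style: same result as before, plus a deprecation message at compile time.
    assert(implicitFront(s) == 'é');
}
```

By default the compiler only warns about deprecated symbols, so existing code keeps building during the deprecation period.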
March 10, 2014
On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
> My app deals with Unicode Arabic text that is 'out there', and the Unicode™ support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to various ways that other developers have sequenced the code points.
>
> So, my needs as a 'user' are:
> * I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
> * I want to iterate over code points. I don't care about the raw data.
> * When I get the length of my string it should be the number of code points.
> * When I index my string it should return the nth code point.
> * When I manipulate my strings I want to work with code points
> ... you get the drift.

Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters...
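
For reference, a small sketch of how the three levels differ for one Arabic letter carrying two diacritics. The sample sequence is chosen for illustration only; byGrapheme is the existing std.uni function.

```d
import std.range.primitives : walkLength;
import std.uni : byGrapheme;

void main()
{
    // Arabic letter beh followed by two combining marks (fatha, shadda).
    string s = "\u0628\u064E\u0651";

    assert(s.length == 6);                // UTF-8 code units
    assert(s.walkLength == 3);            // code points
    assert(s.byGrapheme.walkLength == 1); // one user-perceived character
}
```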
March 10, 2014
On Monday, 10 March 2014 at 18:54:26 UTC, Marc Schütz wrote:
> On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
>> My app deals with Unicode Arabic text that is 'out there', and the Unicode™ support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to various ways that other developers have sequenced the code points.
>>
>> So, my needs as a 'user' are:
>> * I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
>> * I want to iterate over code points. I don't care about the raw data.
>> * When I get the length of my string it should be the number of code points.
>> * When I index my string it should return the nth code point.
>> * When I manipulate my strings I want to work with code points
>> ... you get the drift.
>
> Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters...

I checked the terminology before posting so I'm pretty sure. Arabic has a code page for the logical characters, one code point for each letter of the alphabet and others for various diacritics.

Because of the 'shaping' each logical character has various glyphs, found on other code pages.

Text editing programs tend to store typed Arabic as the user entered it, and because there can be more than one diacritic per alphabetic letter the sequence varies as to how the user sequenced them.
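
For sequences that Unicode treats as canonically equivalent, normalization is one way to compare text regardless of the order in which the marks were typed. The sketch below assumes std.uni.normalize applies canonical reordering to the two marks shown; sameText is a hypothetical helper name, and this does not cover orderings or symbols that Unicode does not define as equivalent.

```d
import std.uni : NFC, normalize;

// Hypothetical helper: compares two strings after canonical normalization.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string a = "\u0628\u064E\u0651"; // beh, fatha, shadda
    string b = "\u0628\u0651\u064E"; // beh, shadda, fatha

    assert(a != b);         // different code point sequences
    assert(sameText(a, b)); // equal once the marks are canonically ordered
}
```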
March 10, 2014
On Monday, 10 March 2014 at 18:54:26 UTC, Marc Schütz wrote:
> On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
>> My app deals with Unicode Arabic text that is 'out there', and the Unicode™ support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to various ways that other developers have sequenced the code points.
>>
>> So, my needs as a 'user' are:
>> * I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
>> * I want to iterate over code points. I don't care about the raw data.
>> * When I get the length of my string it should be the number of code points.
>> * When I index my string it should return the nth code point.
>> * When I manipulate my strings I want to work with code points
>> ... you get the drift.
>
> Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters...

Adding to my other comment: I don't expect a string type to understand Arabic and merge the diacritics for me. In fact there are other symbols (code points) that can also be present, for instance instructions on how Quranic text is to be read. These issues have not been standardised and, I would say, are not well understood generally.
March 10, 2014
On 3/7/2014 8:40 AM, Michel Fortin wrote:
> On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS@lycos.com> said:
>
>> Walter Bright:
>>
>>> I understand this all too well. (Note that we currently have a
>>> different silent problem: unnoticed large performance problems.)
>>
>> On the other hand your change could introduce Unicode-related bugs in
>> future code (that the current Phobos avoids) (and here I am not
>> talking about code breakage).
>
> The way Phobos works isn't any more correct than dealing with code
> units. Many graphemes span multiple code points -- because of
> combined diacritics or character variant modifiers -- and decoding at
> the code-point level is thus often insufficient for correctness.
>

Well, it is *more* correct, as many western languages are more likely in current Phobos to "just work" in most cases. It's just that things still aren't completely correct overall.

> From my experience, I'd suggest these basic operations for a "string
> range" instead of the regular range interface:
>
> .empty
> .frontCodeUnit
> .frontCodePoint
> .frontGrapheme
> .popFrontCodeUnit
> .popFrontCodePoint
> .popFrontGrapheme
> .codeUnitLength (aka length)
> .codePointLength (for dchar[] only)
> .codePointLengthLinear
> .graphemeLengthLinear
>
> Someone should be able to mix all the three 'front' and 'pop' function
> variants above in any code dealing with a string type. In my XML parser
> for instance I regularly use frontCodeUnit to avoid the decoding penalty
> when matching the next character with an ASCII one such as '<' or '&'.
> An API like the one above forces you to be aware of the level you're
> working on, making bugs and inefficiencies stand out (as long as you're
> familiar with each representation).
>
> If someone wants to use a generic array/range algorithm with a string,
> my opinion is that he should have to wrap it in a range type that maps
> front and popFront to one of the above variants. Having to do that should
> make it obvious that there's an inefficiency there, as you're using an
> algorithm that wasn't tailored to work with strings and that more
> decoding than strictly necessary is being done.
>

I actually like this suggestion quite a bit.
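
A minimal sketch of how that interface could be laid over a UTF-8 string, using existing std.utf and std.uni functions. The struct name StringRange is hypothetical, the member names follow the list quoted above, codePointLength (for dchar[] only) is left out, and no validation or error handling is attempted.

```d
import std.range.primitives : walkLength;
import std.uni : Grapheme, byGrapheme, decodeGrapheme, graphemeStride;
import std.utf : decode, stride;

// Hypothetical "string range" exposing all three levels side by side.
struct StringRange
{
    string data;

    bool empty() { return data.length == 0; }

    // Cheap: no decoding at all.
    char frontCodeUnit() { return data[0]; }
    // Decodes one code point without consuming it.
    dchar frontCodePoint() { size_t i = 0; return decode(data, i); }
    // Decodes one grapheme from a copy, so `data` is left untouched.
    Grapheme frontGrapheme() { auto rest = data; return decodeGrapheme(rest); }

    void popFrontCodeUnit() { data = data[1 .. $]; }
    void popFrontCodePoint() { data = data[stride(data, 0) .. $]; }
    void popFrontGrapheme() { data = data[graphemeStride(data, 0) .. $]; }

    size_t codeUnitLength() { return data.length; }
    size_t codePointLengthLinear() { return data.walkLength; }
    size_t graphemeLengthLinear() { return data.byGrapheme.walkLength; }
}

void main()
{
    // 'e' plus a combining acute accent, followed by ASCII markup.
    auto r = StringRange("e\u0301<b>");
    assert(r.codeUnitLength == 6);
    assert(r.codePointLengthLinear == 5);
    assert(r.graphemeLengthLinear == 4);

    assert(r.frontGrapheme.length == 2); // two code points in one grapheme
    r.popFrontGrapheme();
    assert(r.frontCodeUnit == '<');      // ASCII match without decoding
}
```

The last two lines mirror the XML parser use case: skip a whole grapheme, then test the next character against an ASCII one at the code-unit level.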