March 10, 2014
I'm not sure I understood the point of this (long) thread.
Is the main problem that decode() is called even when it's not needed?

Well, in that case it's not a problem only for strings. I ran into the
same issue when writing other ranges, for example when reading binary
data from a db stream: front represents a single row, and I decode it
every time, even when it's not needed.
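
One mitigation I've considered is caching the decoded value, so that
repeated front calls at the same position don't pay twice. A rough
sketch (Row, decodeRow and the blobs are hypothetical stand-ins for my
db code):

    alias Row = string;                    // hypothetical row type
    Row decodeRow(const(ubyte)[] blob)     // hypothetical decoder
    {
        return cast(string) blob.idup;
    }

    struct Rows
    {
        ubyte[][] blobs;    // raw row data from the db stream
        private Row cached;
        private bool haveCached;

        @property bool empty() const { return blobs.length == 0; }

        @property Row front()
        {
            if (!haveCached)   // decode at most once per position
            {
                cached = decodeRow(blobs[0]);
                haveCached = true;
            }
            return cached;
        }

        void popFront()
        {
            blobs = blobs[1 .. $];
            haveCached = false;   // new position, cache is stale
        }
    }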

On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote:
> In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges.
>
> Throughout D's history, there are regular and repeated proposals to redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e. so D will automatically generate code to decode and encode on every attempt to index char[].
>
> I have strongly objected to these proposals on the grounds that:
>
> 1. It is a MAJOR performance problem to do this.
>
> 2. Very, very few manipulations of strings ever actually need decoded values.
>
> 3. D is a systems/native programming language, and systems/native programming languages must not hide the underlying representation (I make similar arguments about proposals to make ints issue errors on overflow, etc.).
>
> 4. Users should choose when decode/encode happens, not the language.
>
> and I have been successful at heading these off. But one slipped by me. See this in std.array:
>
>   @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
>   {
>     assert(a.length, "Attempting to fetch the front of an empty array of " ~
>            T.stringof);
>     size_t i = 0;
>     return decode(a, i);
>   }
>
> What that means is that if I implement an algorithm that accepts, as input, an InputRange of chars, it will ALWAYS try to decode it. This means that even:
>
>    from.copy(to)
>
> will decode 'from', and then re-encode it for 'to'. And it will do it SILENTLY. The user won't notice, and he'll just assume that D performance sux. Even if he does notice, his options to make his code run faster are poor.
>
> If the user wants decoding, it should be explicit, as in:
>
>     from.decode.copy(encode!to)
>
> The USER should decide where and when the decoding goes. 'decode' should be just another algorithm.
>
> (Yes, I know that std.algorithm.copy() has some specializations to take care of this. But these specializations would have to be written for EVERY algorithm, which is thoroughly unreasonable. Furthermore, copy()'s specializations only apply if BOTH source and destination are arrays. If just one is, the decode/encode penalty applies.)
>
> Is there any hope of fixing this?
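
For what it's worth, the workaround available today is to drop down to
ubyte before copying, which sidesteps the auto-decoding front/popFront
entirely. A minimal sketch (assuming std.algorithm.copy, and that 'to'
is at least as long as 'from'):

    import std.algorithm : copy;

    void copyNoDecode(const(char)[] from, char[] to)
    {
        // ubyte has no special-cased range primitives, so copy moves
        // code units directly, with no decode/encode round trip.
        copy(cast(const(ubyte)[]) from, cast(ubyte[]) to);
    }
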
March 10, 2014
On 3/10/2014 6:21 AM, ponce wrote:
> On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote:
>>
>> Yea, I've had problems before - completely unnecessary problems that
>> were *not* helpful or indicative of latent bugs - which were a direct
>> result of Phobos being overly pedantic and eager about UTF validation.
>> And yet the implicit UTF validation has never actually *helped* me in
>> any way.
>
>
>>> self-imposed limitation
> For greater good.
>
> I find this article very telling about why strings should be converted
> to UTF-8 as often as possible.
> http://www.utf8everywhere.org/
>
> I agree 100% with its content: it's impossibly hard to handle
> encodings sanely on Windows (even more so in a team) without following
> the drastic rules the article lays out.
>

I may have missed it, but I don't see where it says anything about validation or immediate sanitization of invalid sequences. It's mostly "UTF-16 sucks and so does Windows" (not that I'm necessarily disagreeing with it). (ot: Kinda wish they hadn't used such a hard-to-read font...)

March 10, 2014
On 3/9/2014 11:27 AM, Vladimir Panteleev wrote:
> On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
>> On topic, I think D's implicit default decode to dchar is *infinity*
>> times better than C++'s char-based strings. While imperfect in terms
>> of graphemes, it was still a design decision made of win.
>
> Care to elaborate?
>

It's simple: breaking things for all non-English languages is worse than breaking things for non-western[1] languages only. It's still breakage, and that *is* bad, but there's no question which breakage is significantly larger.

[1] (And yes, I realize "western" is a gross over-simplification here. The point is "one working language" vs "several working languages".)

March 10, 2014
On Monday, 10 March 2014 at 11:04:43 UTC, Nick Sabalausky wrote:
>
> I may have missed it, but I don't see where it says anything about validation or immediate sanitization of invalid sequences. It's mostly "UTF-16 sucks and so does Windows" (not that I'm necessarily disagreeing with it). (ot: Kinda wish they hadn't used such a hard-to-read font...)

I should have highlighted it: their recommendations for proper encoding handling on Windows are in section 5 ("How to do text on Windows").

One of them is "std::strings and char*, anywhere in the program, are considered UTF-8 (if not said otherwise)."

I find it interesting that D tends to enforce this lesson, learned from mixed-encoding codebases.

March 10, 2014
On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu wrote:
> On 3/9/14, 6:47 AM, "Marc Schütz" <schuetzm@gmx.net> wrote:
>> On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
>>> 2) It is a regression back to the C++ days of no-one-cares-about-Unicode
>>> pain. Thinking about strings as character arrays is so natural and
>>> convenient that, if the language/Phobos doesn't punish you for it, it
>>> will be extremely widespread.
>>
>> Not with Nick Sabalausky's suggestion to remove the implementation of
>> front from char arrays. This way, everyone will be forced to decide
>> whether they want code units or code points or something else.
>
> Such as giving up on that crappy language that keeps on breaking their code.
>
> Andrei


That was more "if you are crazy enough to even consider such breakage, this is closer to my personal ideal" than an actual proposal ;)
March 10, 2014
On Friday, 7 March 2014 at 19:43:57 UTC, Walter Bright wrote:
> On 3/7/2014 7:03 AM, Dicebot wrote:
>> 1) It is a huge breakage and you have been refusing to do one even for more
>> important problems. What is behind this sudden change of mind?
>
> 1. Performance Performance Performance

Not important enough. D has always been a "safe by default, fast when asked" language, not the other way around. There is no fundamental performance problem here, only a lack of knowledge about Phobos.

> 2. The current behavior is surprising (it sure surprised me, I didn't notice it until I looked at the assembler to figure out why the performance sucked)

That may imply that better documentation is needed. You were only surprised because of a wrong initial assumption about what the `char[]` type means.

> 3. Weirdnesses like ElementEncodingType

ElementEncodingType is extremely annoying, but I think it is just a side effect of a bigger problem: how string algorithms are currently handled. It does not need to be that way.
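
(For those who haven't run into it, the weirdness looks like this in
practice; a small sketch using std.range's definitions:)

    import std.range : ElementType, ElementEncodingType;

    // Narrow strings iterate as decoded dchars...
    static assert(is(ElementType!string == dchar));
    // ...but are stored as UTF-8 code units, hence the second trait:
    static assert(is(ElementEncodingType!string == immutable(char)));

    // For non-string ranges the two traits agree:
    static assert(is(ElementType!(int[]) == int));
    static assert(is(ElementEncodingType!(int[]) == int));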

> 4. Strange behavior differences between char[], char*, and InputRange!char types

Again, there is nothing strange about it. `char[]` is a special type with special semantics that is defined in the documentation, and that definition is followed consistently everywhere except raw array indexing/slicing (which I find unfortunate, but also beyond feasible fixing).

> 5. Funky anomalous issues with writing OutputRange!char (the put(T) must take a dchar)

Bad but not worth even a small breaking change.
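
(For reference, the anomaly is that a character sink receives decoded
code points and has to re-encode them itself. A minimal sketch, using
std.utf.encode:)

    import std.utf : encode;

    // A UTF-8 accumulator. Note that put takes a dchar, not a char:
    // string algorithms hand output ranges decoded code points.
    struct Utf8Sink
    {
        char[] data;

        void put(dchar c)
        {
            char[4] buf;
            immutable n = encode(buf, c); // back to UTF-8 code units
            data ~= buf[0 .. n];
        }
    }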

>> 2) lack of a convenient .raw property which will effectively do cast(ubyte[])
>
> I've done the cast as a workaround, but when working with generic code it turns out the ubyte type becomes viral - you have to use it everywhere. So you end up with ubyte <=> char casts in unexpected places all over the code. You also wind up with ugly ubyte <=> dchar casts, with the commensurate risk that you goofed and have a truncation bug.

Of course it is viral, because you never want char[] at all if you don't work with Unicode (or if you work with it at the raw byte level). And in that case it is your responsibility to do manual decoding when appropriate. Squeezing out that kind of performance often means going low-level, with all the associated risks; there is nothing special about char[] here. It is not a common use case.
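
(Concretely, the .raw property from point 2 would be little more than a
wrapper around the cast; a hypothetical sketch, not in Phobos:)

    // Hypothetical: view a string's storage as raw bytes, giving one
    // cast site instead of casts scattered through user code.
    @property immutable(ubyte)[] raw(string s) pure nothrow
    {
        return cast(immutable(ubyte)[]) s;
    }

With that, "abc".raw iterates ubyte by ubyte, with no decoding anywhere.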

> Essentially, the auto-decode makes trivial code look better, but if you're writing a more comprehensive string processing program, and care about performance, it makes a regular ugly mess of things.

And this is how it should be. Again, I am all for creating a language that favors performance-critical power-programming needs over common/casual ones, but that is not what D is, and you have been making such choices consistently for quite a long time now (array literals that allocate; I will never forgive that). Suddenly changing your mind only because you have encountered this specific issue personally, as opposed to just hearing reports of it, does not fit the role of a language author. It does not really matter whether any new approach is itself good or bad - being unpredictable is reputation damage D simply can't afford.
March 10, 2014
On Monday, 10 March 2014 at 10:52:02 UTC, Andrea Fontana wrote:
> I'm not sure I understood the point of this (long) thread.
> Is the main problem that decode() is called even when it's not needed?
>

I'd like to offer up one D 'user' perspective; it's just a single data point, but perhaps useful. I write applications that process Arabic, and I'm thinking about converting one of those apps from Python to D, for performance reasons.

My app deals with Unicode Arabic text that is 'out there', and the Unicode(TM) support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of the sequencing of diacritics, etc. Even the code page can vary. Therefore my code has to cater to the various ways that other developers have sequenced the code points.

So, my needs as a 'user' are:
* I want to encode all incoming data into Unicode immediately, usually UTF-8, if it isn't already.
* I want to iterate over code points. I don't care about the raw data.
* When I get the length of my string it should be the number of code points.
* When I index my string it should return the nth code point.
* When I manipulate my strings I want to work with code points.
... you get the drift.

If I want to access the raw data, which I don't, then I'm very happy to cast to ubyte etc.

If encode/decode is a performance issue, then perhaps there could be a cache of recently used strings holding the code-point representation.

BTW, to answer a question in the thread: yes, the data is stored left-to-right and visualised right-to-left.
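
For what it's worth, much of that list is already expressible in today's D, precisely because of the auto-decoding this thread is arguing about. A minimal sketch (using std.range's walkLength and dropExactly):

    import std.range;

    void main()
    {
        string s = "سلام";          // 4 Arabic code points, 8 UTF-8 code units

        assert(s.length == 8);      // .length counts code units (bytes)
        assert(s.walkLength == 4);  // traversal counts decoded code points

        foreach (dchar c; s)
        {
            // c is one full code point, decoded from UTF-8
        }

        // The nth code point is O(n), since UTF-8 is variable-width:
        assert(s.dropExactly(2).front == 'ا');
    }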



March 10, 2014
In Italian we need Unicode too. We have several accented letters, and programming languages often don't handle UTF-8 and other encodings very well...

In D I've never had any problem with this, and I work a lot on text processing.

So my question: is there any problem with D's Unicode support that I'm missing, or is it just a performance problem in the algorithms?

If the problem is the performance of algorithms that use .front() but don't need to understand its data, why don't we add a .rawFront() property, implemented only where it makes sense, with a "fallback" like:

import std.range : isInputRange;
auto rawFront(R)(R range) if (isInputRange!R && !__traits(compiles, range.rawFront)) { return range.front; }

This way copy() and other algorithms can use rawFront(), and it stays backward compatible with ranges that don't define it.
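
For example, a hypothetical overload for narrow strings would just return the first code unit (again, rawFront is not in Phobos; this is only a sketch):

// Hypothetical: the narrow-string case returns a raw char,
// where .front would decode and return a dchar.
@property char rawFront(const(char)[] a)
{
    assert(a.length, "Attempting to fetch the front of an empty string");
    return a[0];
}

So "étude".rawFront is just the first UTF-8 code unit (0xC3), while "étude".front decodes the whole 'é'.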

But I guess I'm missing the point :)


On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
> On Monday, 10 March 2014 at 10:52:02 UTC, Andrea Fontana wrote:
>> I'm not sure I understood the point of this (long) thread.
>> Is the main problem that decode() is called even when it's not needed?
>>
>
> I'd like to offer up one D 'user' perspective; it's just a single data point, but perhaps useful. I write applications that process Arabic, and I'm thinking about converting one of those apps from Python to D, for performance reasons.
>
> My app deals with Unicode Arabic text that is 'out there', and the Unicode(TM) support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of the sequencing of diacritics, etc. Even the code page can vary. Therefore my code has to cater to the various ways that other developers have sequenced the code points.
>
> So, my needs as a 'user' are:
> * I want to encode all incoming data into Unicode immediately, usually UTF-8, if it isn't already.
> * I want to iterate over code points. I don't care about the raw data.
> * When I get the length of my string it should be the number of code points.
> * When I index my string it should return the nth code point.
> * When I manipulate my strings I want to work with code points.
> ... you get the drift.
>
> If I want to access the raw data, which I don't, then I'm very happy to cast to ubyte etc.
>
> If encode/decode is a performance issue, then perhaps there could be a cache of recently used strings holding the code-point representation.
>
> BTW, to answer a question in the thread: yes, the data is stored left-to-right and visualised right-to-left.

March 10, 2014
On 07.03.2014 03:37, Walter Bright wrote:
> In "Lots of low hanging fruit in Phobos" the issue came up about the automatic
> encoding and decoding of char ranges.

after reading many of the attached posts, the question is: what could
D's future process for introducing breaking changes look like? It's not
a solution to say it's impossible because there would be too many
breaking changes - that will become more and more of a problem for D's
evolution, much like it has for C++.

March 10, 2014
On Monday, 10 March 2014 at 14:05:39 UTC, dennis luehring wrote:
> On 07.03.2014 03:37, Walter Bright wrote:
>> In "Lots of low hanging fruit in Phobos" the issue came up about the automatic
>> encoding and decoding of char ranges.
>
> after reading many of the attached posts, the question is: what could
> D's future process for introducing breaking changes look like? It's not
> a solution to say it's impossible because there would be too many
> breaking changes - that will become more and more of a problem for D's
> evolution, much like it has for C++.

Historically, two approaches have been practiced:

1) argue a lot and then do nothing
2) suddenly change something and tell users it was necessary

I also think this is a much more important issue than this whole thread, but it does not seem to attract any real attention when mentioned.