March 09, 2014
09-Mar-2014 07:53, Vladimir Panteleev wrote:
> On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote:
> I don't understand this argument. Iterating by code unit is not
> meaningless if you don't want to extract meaning from each unit
> iteration. For example, if you're parsing JSON or XML, you only care
> about the syntax characters, which are all ASCII. And there is no
> confusion of "what exactly are we counting here".
>
>>> This was debated... people should not be looking at individual code
>>> points, unless they really know what they're doing.
>>
>> Should they be looking at code units instead?
>
> No. They should only be looking at substrings.

This. Anyhow, searching for a dchar makes sense for _some_ languages; the problem is that it shouldn't decode the whole string but rather encode the needle properly and search for that.
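
Roughly what I mean, as a sketch (findEncoded is a made-up name; std.utf.encode and std.algorithm.find do the actual work):

import std.algorithm : find;
import std.string : representation;
import std.utf : encode;

// Encode the dchar needle once, then search the raw UTF-8 code units of
// the haystack - no per-character decoding of the haystack itself.
const(char)[] findEncoded(const(char)[] haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle);                 // needle as UTF-8
    auto hit = haystack.representation.find(buf[0 .. len].representation);
    return haystack[$ - hit.length .. $];                // map back to char[]
}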

Basically the whole thread is about:
how do I work efficiently (no-decoding) with UTF-8/UTF-16 in cases where it obviously can be done?

The current situation is bad in that it undermines writing decode-less generic code. One easily falls into the auto-decode trap on the first .front, especially when called from some standard algorithm. The algorithm sees char[]/wchar[] and switches into decode mode via some special case. If it did that with _all_ char/wchar random-access ranges it would at least be consistent.
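
To illustrate the trap - a small sketch of the current behaviour as I understand it (example is just a throwaway name):

import std.range;
import std.string : representation;

void example()
{
    char[] s = "häh".dup;

    // .front on char[] silently decodes: you get a dchar, not a char.
    static assert(is(typeof(s.front) == dchar));

    // Opting out today means dropping down to ubyte[] via .representation.
    static assert(is(typeof(s.representation.front) == ubyte));
}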

That, and wrapping your head around 2 sets of constraints. The amount of code around 2 types - wchar[]/char[] - is way too much, that much is clear.

-- 
Dmitry Olshansky
March 09, 2014
09-Mar-2014 21:54, Vladimir Panteleev wrote:
> On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote:
>> wc
>
> What should wc produce on a Sanskrit text?
>
> The problem is that such questions quickly become philosophical.

Technically it could use the word-breaking algorithm to count words.
Or count grapheme clusters, or count code points - it all may have value, depending on the user and the writing system.
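
E.g. a quick sketch - each of these is a different but legitimate answer (counting words via the word-breaking algorithm is not shown):

import std.range : walkLength;
import std.uni : byGrapheme;

void counts(string s)
{
    auto codeUnits  = s.length;                // UTF-8 code units
    auto codePoints = s.walkLength;            // code points (decoded walk)
    auto graphemes  = s.byGrapheme.walkLength; // user-perceived characters
}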


-- 
Dmitry Olshansky
March 09, 2014
On 3/9/14, 11:34 AM, Dmitry Olshansky wrote:
> 09-Mar-2014 07:53, Vladimir Panteleev wrote:
>> On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote:
>> I don't understand this argument. Iterating by code unit is not
>> meaningless if you don't want to extract meaning from each unit
>> iteration. For example, if you're parsing JSON or XML, you only care
>> about the syntax characters, which are all ASCII. And there is no
>> confusion of "what exactly are we counting here".
>>
>>>> This was debated... people should not be looking at individual code
>>>> points, unless they really know what they're doing.
>>>
>>> Should they be looking at code units instead?
>>
>> No. They should only be looking at substrings.
>
> This. Anyhow, searching for a dchar makes sense for _some_ languages;
> the problem is that it shouldn't decode the whole string but rather
> encode the needle properly and search for that.

That's just an optimization. Conceptually what happens is we're looking for a code point in a sequence of code points.

> Basically the whole thread is about:
> how do I work efficiently (no-decoding) with UTF-8/UTF-16 in cases where
> it obviously can be done?
>
> The current situation is bad in that it undermines writing decode-less
> generic code.

s/undermines writing/makes writing explicit/

> One easily falls into the auto-decode trap on the first .front,
> especially when called from some standard algorithm. The algorithm sees
> char[]/wchar[] and switches into decode mode via some special case. If
> it did that with _all_ char/wchar random-access ranges it would at
> least be consistent.
>
> That and wrapping your head around 2 sets of constraints. The amount of
> code around 2 types - wchar[]/char[] is way too much, that much is clear.

We're engineers so we should quantify. Ideally that would be as simple as "git grep isNarrowString|wc -l", which currently prints 42, of all numbers :o).

Overall I suspect there are a few good simplifications we can make by using isNarrowString and .representation.
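
The kind of pattern I have in mind, roughly - just a sketch, allASCII is a made-up example:

import std.algorithm : all;
import std.string : representation;
import std.traits : isNarrowString;

bool allASCII(R)(R r)
{
    static if (isNarrowString!R)
        // Bypass auto-decoding: inspect raw code units via .representation.
        return r.representation.all!(u => u < 0x80);
    else
        return r.all!(c => c < 0x80);
}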


Andrei

March 09, 2014
On Sunday, 9 March 2014 at 14:57:32 UTC, Peter Alexander wrote:
> You have mentioned case-insensitive searching, but I think I've adequately demonstrated that this doesn't work in general by code point: you need to normalize and take locales into account.

I don't understand what your argument is. Is it "by code point is not 100% correct, so let's just drop it and go for raw code units instead"?

We *are* arguing about whether or not "front/popFront" should decode by dchar, right...?

You mention the algorithms "Searching, equality testing, copying, sorting, hashing, splitting, joining...". I said "by codepoint is not correct", but I still think it's a hell of a lot more accurate than by codeunit. Unless you want to ignore any and all algorithms that take a predicate?

You say "unless you are willing to ignore languages like Turkish", but... If you don't decode front, than aren't you just ignoring *all* languages that basically aren't English....?

As I said, maybe by codepoint is not correct, but if it isn't, I think we should be moving further *towards* the correct behavior by default, not away from it.
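
A trivial sketch of the difference I mean:

import std.algorithm : count;
import std.string : representation;

unittest
{
    auto s = "cafè";

    // By code point: the predicate sees a decoded 'è' and matches once.
    assert(s.count!(c => c == 'è') == 1);

    // By code unit: 'è' spans two UTF-8 code units, so no single unit
    // ever compares equal to it.
    assert(s.representation.count!(c => c == 'è') == 0);
}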
March 09, 2014
09-Mar-2014 22:41, Andrei Alexandrescu wrote:
> On 3/9/14, 11:34 AM, Dmitry Olshansky wrote:
>> This. Anyhow, searching for a dchar makes sense for _some_ languages;
>> the problem is that it shouldn't decode the whole string but rather
>> encode the needle properly and search for that.
>
> That's just an optimization. Conceptually what happens is we're looking
> for a code point in a sequence of code points.

Yup. It's still not a good idea to introduce this into std.algorithm in a non-generic way.

>> That and wrapping your head around 2 sets of constraints. The amount of
>> code around 2 types - wchar[]/char[] is way too much, that much is clear.
>
> We're engineers so we should quantify. Ideally that would be as simple
> as "git grep isNarrowString|wc -l" which currently prints 42 of all
> numbers :o).

Add to that some uses of isSomeString and ElementEncodingType: 138 and 80 respectively.

And in most cases it means that nice generic code was hacked to care about 2 types in particular. That is what bothers me.

> Overall I suspect there are a few good simplifications we can make by
> using isNarrowString and .representation.

Okay, putting potential breakage aside.
Let me sketch up an additive way of improving the current situation.

1. Say we recognize any indexable entity of char/wchar/dchar, that however has .front returning a dchar as a "narrow string". Nothing fancy - it's just a generalization of isNarrowString. At least a range over Array!char will work as string now.

2. Likewise, representation must be made something more explicit, say byCodeUnit, and work on any isNarrowString per above. The opposite of that is byCodePoint. (A small sketch of what I mean follows after this list.)

3. ElementEncodingType is too verbose and misleading. Something more explicit would be useful. ItemType/UnitType maybe?

4. We lack lots of good stuff from the Unicode standard. Some recently landed in std.uni. We need many more, and should deprecate the crappy ones in std.string (e.g. text wrapping is one).

5. Most algorithms conceptually decode, but may be enhanced to work directly on UTF-8/UTF-16. That, together with 1, should IMHO solve most of our problems.

6. Take into account ASCII and maybe other alphabets? Should be as trivial as .assumeASCII and then on you march with all of std.algo/etc.
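
To make point 2 concrete - a rough sketch only; today byCodeUnit would amount to little more than std.string.representation under a clearer name (byCodePoint and assumeASCII omitted):

import std.string : representation;
import std.traits : isNarrowString;

// An explicit, readable opt-out of decoding for narrow strings.
auto byCodeUnit(S)(S str) if (isNarrowString!S)
{
    return str.representation;
}

unittest
{
    assert("häh".byCodeUnit.length == 4); // 4 UTF-8 code units, 3 code points
}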

-- 
Dmitry Olshansky
March 09, 2014
On 3/9/14, 12:25 PM, Dmitry Olshansky wrote:
> Okay putting potential breakage aside.
> Let me sketch up an additive way of improving current situation.

Now you're talking.

> 1. Say we recognize any indexable entity of char/wchar/dchar, that
> however has .front returning a dchar as a "narrow string". Nothing fancy
> - it's just a generalization of isNarrowString. At least a range over
> Array!char will work as string now.

Wait, why is dchar[] a narrow string?

> 2. Likewise representation must be made something more explicit say
> byCodeUnit and work on any isNarrowString per above. The opposite of
> that is byCodePoint.

Fine.

> 3. ElementEncodingType is too verbose and misleading. Something more
> explicit would be useful. ItemType/UnitType maybe?

We're stuck with that name.

> 4. We lack lots of good stuff from the Unicode standard. Some recently
> landed in std.uni. We need many more, and should deprecate the crappy
> ones in std.string (e.g. text wrapping is one)

Add away.

> 5. Most algorithms conceptually decode, but may be enhanced to work
> directly on UTF-8/UTF-16. That together with 1, should IMHO solve most
> of our problems.

Great!

> 6. Take into account ASCII and maybe other alphabets? Should be as
> trivial as .assumeASCII and then on you march with all of std.algo/etc.

Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost (that can be avoided) and that we should go farther into the future instead of catering to an obsolete representation.


Andrei


March 09, 2014
On Sunday, 9 March 2014 at 19:40:32 UTC, Andrei Alexandrescu wrote:
>> 6. Take into account ASCII and maybe other alphabets? Should be as
>> trivial as .assumeASCII and then on you march with all of std.algo/etc.
>
> Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost (that can be avoided) and that we should go farther into the future instead of catering to an obsolete representation.
>
>
> Andrei

When I've wanted to write code especially for ASCII, I think it hasn't been for use in generic algorithms anyway. Mostly it's stuff for manipulating segments of memory in a particular way, as seen here in my library, which does some work to generate D code.

https://github.com/w0rp/dsmoke/blob/master/source/smoke/string_util.d#L45

Anything else would be something like running through an algorithm and then copying data into a new array or similar, and that would miss the point. When it comes to generic algorithms and ASCII I think UTF-x is sufficient.
March 09, 2014
09-Mar-2014 23:40, Andrei Alexandrescu wrote:
> On 3/9/14, 12:25 PM, Dmitry Olshansky wrote:
>> Okay putting potential breakage aside.
>> Let me sketch up an additive way of improving current situation.
>
> Now you're talking.
>
>> 1. Say we recognize any indexable entity of char/wchar/dchar, that
>> however has .front returning a dchar as a "narrow string". Nothing fancy
>> - it's just a generalization of isNarrowString. At least a range over
>> Array!char will work as string now.
>
> Wait, why is dchar[] a narrow string?

Indeed `...entity of char/wchar/dchar` --> `...entity of char/wchar`.

>> 3. ElementEncodingType is too verbose and misleading. Something more
>> explicit would be useful. ItemType/UnitType maybe?
>
> We're stuck with that name.

Too bad, but we have renamed imports... if only they worked correctly. But let's not derail.

[snip]

Great, so this may be turned into a smallish DIP or Bugzilla enhancements.

>> 6. Take into account ASCII and maybe other alphabets? Should be as
>> trivial as .assumeASCII and then on you march with all of std.algo/etc.
>
> Walter is against that. His main argument is that UTF already covers
> ASCII with only a marginal cost

He certainly doesn't have things like case-insensitive matching or collation on his list. Some cute tables are what "directly to the UTF" algorithms require for almost anything beyond simple-minded "find me a substring".

Walter would certainly have a different stance the moment he observed the extra bulk of object code these require.

> (that can be avoided)

How? I'm not talking about `x < 0x80` branches - those wouldn't cost a dime.
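
(For reference, the kind of branch I mean - a rough sketch of the usual ASCII fast path; frontSketch is made up:)

import std.utf : decode;

// Assumes s is non-empty, valid UTF-8.
dchar frontSketch(const(char)[] s)
{
    if (s[0] < 0x80)      // ASCII fast path: one well-predicted branch
        return s[0];
    size_t i = 0;
    return decode(s, i);  // multi-byte sequence: full UTF-8 decode
}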

I really don't feel strongly about the 6th point. I see it as a good idea to allow custom alphabets and reap performance benefits where it makes sense, though the need for that is less urgent.

> and that we should
> go farther into the future instead of catering to an obsolete
> representation.

That is something I agree with.

-- 
Dmitry Olshansky
March 09, 2014
On 3/9/2014 1:26 PM, Andrei Alexandrescu wrote:
> On 3/9/14, 6:34 AM, Jakob Ovrum wrote:
>
>> `byCodeUnit` is essentially std.string.representation.
>
> Actually not because for reasons that are unclear to me people really
> want the individual type to be char, not ubyte.
>

Probably because char *is* D's type for UTF-8 code units.

March 09, 2014
On 3/9/2014 11:21 AM, Vladimir Panteleev wrote:
> On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote:
>>> - In lots of places, I've discovered that Phobos did UTF decoding
>>> (thus murdering performance) when it didn't need to. Such cases
>>> included format (now fixed), appender (now fixed), startsWith (now
>>> fixed - recently), skipOver (still unfixed). These have caused latent
>>> bugs in my programs that happened to be fed non-UTF data. There's no
>>> reason for why D should fail on non-UTF data if it has no reason to
>>> decode it in the first place! These failures have only served to
>>> identify places in Phobos where redundant decoding was occurring.
>>
>> With all due respect, the D string type is exclusively for UTF-8 strings.
>> If it is not valid UTF-8, it should never have been a D string in the
>> first place. For other cases, ubyte[] is there.
>
> This is an arbitrary self-imposed limitation caused by the choice in how
> strings are handled in Phobos.

Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way.