May 26, 2016
On Thu, May 26, 2016 at 12:00:54PM -0400, Andrei Alexandrescu via Digitalmars-d wrote: [...]
> On 05/12/2016 04:15 PM, Walter Bright wrote:
[...]
> > 4. Autodecoding is slow and has no place in high speed string processing.
> 
> I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing.
> 
> Also allow me to point that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.

General Unicode strings have a lot of non-ASCII characters. Why are we only optimizing for the ASCII case?


> > 5. Very few algorithms require decoding.
> 
> The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):
> 
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
> 
> However the following do require autodecoding:
> 
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters

Question: what should count return, given a string containing (1)
combining diacritics, or (2) Korean text? Or (3) zero-width spaces?
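To make the question concrete, here is a quick Python sketch (Python's len counts code points, much like walkLength over an autodecoded D string; the example strings are made up for illustration). In all three cases the code point count diverges from what a reader would call the number of characters:

```python
import unicodedata

# (1) Combining diacritics: "e" + U+0301 COMBINING ACUTE ACCENT
# renders as a single é but is two code points.
e_acute = "e\u0301"
print(len(e_acute))  # 2 code points, 1 visible character

# (2) Korean written as conjoining Jamo: three code points that
# render as the single syllable block 한 (U+D55C).
hangul_jamo = "\u1112\u1161\u11ab"
print(len(hangul_jamo))                                # 3
print(len(unicodedata.normalize("NFC", hangul_jamo)))  # 1 after composition

# (3) Zero-width space U+200B: invisible, yet counted as a code point.
zws = "a\u200bb"
print(len(zws))  # 3 code points, 2 visible characters
```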


> Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.

The problem is that often such decisions can only be made by the user, because it depends on what the user wants to accomplish.  What should count return, given some Unicode string?  If the user wants to determine the size of a buffer (e.g., to store a string minus some characters to be stripped), then count should return the byte count. If the user wants to count the number of matching visual characters, then count should return the number of graphemes. If the user wants to determine the visual width of the (filtered) string, then count should not be used at all, but instead a font metric algorithm.  (I can't think of a practical use case where you'd actually need to count code points(!).)

Having the library arbitrarily choose one use case over the others
(especially one that seems the least applicable to practical situations)
just doesn't seem right to me at all. Rather, the user ought to specify
what exactly is to be counted, i.e., s.byCodeUnit.count(),
s.byCodePoint.count(), or s.byGrapheme.count().


[...]
> > 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.
> 
> Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)

Therefore, instead of:

	myString.splitter!"abc".joiner!"def".count;

we have to write:

	myString.representation
		.splitter!("abc".representation)
		.joiner!("def".representation)
		.count;

Great.


[...]
> Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[].

That is a strawman. We are not arguing for eliminating the distinction between char[] and ubyte[]. Rather, the complaint is that autodecoding represents a constant overhead in string processing that's often *unnecessary*. Many string operations don't *need* to autodecode, and even those that may seem like they do, are often better implemented differently.

For example, filtering a string by a non-ASCII character can actually be done via substring search -- expand the non-ASCII character into its 1 to 4 code units, and then do the equivalent of C's strstr().  This will have no false positives thanks to the way UTF-8 is designed.  It eliminates the overhead of decoding every single character -- in implementation terms, it could, for example, linearly scan the string for the needle's first byte without decoding, which is a lot faster than decoding every single character and then comparing with the target. Only when the first byte matches does it need to do the slightly more expensive substring comparison.
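The no-false-positives property comes from UTF-8's self-synchronization: lead bytes and continuation bytes occupy disjoint ranges, so one character's encoding can never begin in the middle of another's. A quick Python check (the property is language-agnostic; the example text is made up):

```python
text = "naïve café naïveté"
needle = "é"

# Search the raw UTF-8 bytes without decoding anything.
raw = text.encode("utf-8")
raw_needle = needle.encode("utf-8")  # b'\xc3\xa9'
byte_hits = raw.count(raw_needle)

# Identical result to a decoded, code-point-level search.
assert byte_hits == text.count(needle)
print(byte_hits)  # 2
```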

Similarly, splitter does not need to operate on code points at all. It's unnecessarily slow that way. Most use cases of splitter have lots of data in between delimiters, which means most of the work done by autodecoding is wasted.  Instead, splitter should just scan for the substring to split on -- again, the design of UTF-8 guarantees there will be no false positives -- and only put in the effort where it's actually needed: at the delimiters, not the data in between.
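The same idea sketched in Python, with a made-up multi-byte delimiter: splitting the raw UTF-8 bytes on the encoded delimiter yields exactly the pieces a decode-first split would, because the delimiter's byte sequence cannot occur inside any other character's encoding:

```python
data = "alpha→beta→gamma"  # '→' is U+2192, three bytes in UTF-8
delim = "→"

# Split without decoding: operate on raw bytes only.
byte_parts = data.encode("utf-8").split(delim.encode("utf-8"))

# Same result as decoding first and splitting at code point level.
decoded_parts = [p.encode("utf-8") for p in data.split(delim)]
assert byte_parts == decoded_parts
print(byte_parts)  # [b'alpha', b'beta', b'gamma']
```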

The same could be said of joiner, and many other common string algorithms.

There aren't many algorithms that actually need to decode; decoding should be restricted to them, rather than an overhead applied across the board.


[...]
> Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.
[...]

We already have a clear definition: char, wchar, and dchar are UTF-8, UTF-16, and UTF-32 code units respectively, and a dchar, being 32 bits wide, is also a complete Unicode code point. That's all there is to it.

If we want Phobos to truly be able to take advantage of the fact that char[], wchar[], dchar[] contain Unicode strings, we need to stop the navel gazing at what byte representations and bits mean, and look at the bigger picture.  Consider char[] as a unit in itself, a complete Unicode string -- the actual code units don't really matter, as they are just an implementation detail. What you want is for a Phobos algorithm to decide, OK, in order to produce output X, it's faster to do substring scanning, and in order to produce output Y, it's better to decode first.  In other words, decoding or not decoding ought to be a decision made at the algorithm level (or higher), depending on the need at hand. It should not be baked into the lower-level internals of how strings are handled, such that higher-level algorithms are straitjacketed and forced to work with the decoded stream, even when they don't actually *need* decoding to do what they want.

In the cases where Phobos is unable to make a decision (e.g., what should count return -- which depends on what the user is trying to accomplish), it should be left to the user. The user shouldn't have to work against a default setting that only works for a subset of use cases.


T

-- 
Without geometry, life would be pointless. -- VS
May 26, 2016
On 05/26/2016 07:23 PM, H. S. Teoh via Digitalmars-d wrote:
> Therefore, instead of:
>
> 	myString.splitter!"abc".joiner!"def".count;
>
> we have to write:
>
> 	myString.representation
> 		.splitter!("abc".representation)
> 		.joiner!("def".representation)
> 		.count;

No, that's not necessary (or correct). -- Andrei
May 27, 2016
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
>> 4. Autodecoding is slow and has no place in high speed string processing.
>
> I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D.

It is completely wasted mental effort.

>> 5. Very few algorithms require decoding.
>
> The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):
>
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

As far as I can see, the language currently does not provide the facilities to implement the above without autodecoding.

> However the following do require autodecoding:
>
> s.walkLength

Usage of the result of this expression will be incorrect in many foreseeable cases.

> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation

Ditto.

> s.count!(c => c >= 32) // non-control characters

Ditto, with a big red flag. If you are dealing with control characters, the code is likely low-level enough that you need to be explicit in what you are counting. It is likely not what actually needs to be counted. Such confusion can lead to security risks.

> Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.

It should be explicit.

>> 7. Autodecode cannot be used with unicode path/filenames, because it is
>> legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
>> out in the wild that pure Unicode is not universal - there's lots of
>> dirty Unicode that should remain unmolested, and autocode does not play
>> with that.
>
> If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.

This is not practical. Do you really see changing std.file and std.path to accept ubyte[] for all path arguments?

>> 8. In my work with UTF-8 streams, dealing with autodecode has caused me
>> considerably extra work every time. A convenient timesaver it ain't.
>
> Objection. Vague.

I can confirm this vague subjective observation. For example, DustMite reimplements some std.string functions in order to be able to handle D files with invalid UTF-8 characters.

>> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
>> importing std.array one way or another, and then autodecode is there.
>
> Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)

This is neither easy nor practical. It makes writing reliable string handling code a chore in D. Because it is difficult to find all places where this must be done, it is not possible to do on a program-wide scale, thus bugs can only be discovered when this or that component fails because it was not tested with Unicode strings.

>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
>> benefit of being arrays in the first place.
>
> First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.
>
> Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte?
>
> This is a fundamental question for which we need a rigorous answer.

Why?

> What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any other)?
>
> If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF8 code point.

I don't follow this line of reasoning at all.

>> 11. Indexing an array produces different results than autodecoding,
>> another glaring special case.
>
> This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.

There is no convincing argument why indexing and slicing should not simply operate on code units.

> Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.

I don't follow. Though, making char implicitly convertible to wchar and dchar has clearly been a mistake.

May 27, 2016
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
>> 11. Indexing an array produces different results than autodecoding,
>> another glaring special case.
>
> This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.

Sounds like you want to say that string should be smarter than an array of code units in dealing with Unicode. As I understand it, the design rationale behind strings being plain arrays of code units is that it's impractical for the string to be smarter than an array of code units -- it just won't cut it -- while a plain array provides a simple and easy to understand implementation of a string.
May 27, 2016
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
> This might be a good time to discuss this a tad further. I'd appreciate if the debate stayed on point going forward. Thanks!
>
> My thesis: the D1 design decision to represent strings as char[] was disastrous and probably one of the largest weaknesses of D1. The decision in D2 to use immutable(char)[] for strings is a vast improvement but still has a number of issues. The approach to autodecoding in Phobos is an improvement on that decision.

It is not, which has been shown by various posts in this thread. Iterating by code points is at least as wrong as iterating by code units; it can be argued it is worse because it sometimes makes the fact that it's wrong harder to detect.

> The insistent shunning of a user-defined type to represent strings is not good and we need to rid ourselves of it.

While this may be true, it has nothing to do with auto decoding. I assume you would want such a user-defined string type to auto-decode as well, right?

>
> On 05/12/2016 04:15 PM, Walter Bright wrote:
>> 5. Very few algorithms require decoding.
>
> The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):
>
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')

Yes.

> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

Ideally yes, but this is a special case that cannot be detected by `count`.

>
> However the following do require autodecoding:
>
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters

No, they do not need _auto_ decoding, they need a decision _by the user_ what they should be decoded to. Code units? Code points? Graphemes? Words? Lines?

>
> Currently the standard library operates at code point level

Because it auto decodes.

> even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.

No one wants to take that second part away. For example, `find` can provide an overload that accepts `const(char)[]` directly, while `walkLength` doesn't, requiring a decision by the caller.

>> 7. Autodecode cannot be used with unicode path/filenames, because it is
>> legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
>> out in the wild that pure Unicode is not universal - there's lots of
>> dirty Unicode that should remain unmolested, and autocode does not play
>> with that.
>
> If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.

I believe a library type would be more appropriate than bare `ubyte[]`. It should provide conversion between the OS encoding (which can be detected automatically) and UTF strings, for example. And it should be used for any "strings" that come from outside the program, like main's arguments, env variables...

>> 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
>> importing std.array one way or another, and then autodecode is there.
>
> Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)

This would no longer work if char[] and char ranges were to be treated identically.

>
>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
>> benefit of being arrays in the first place.
>
> First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.
>
> Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte?
>
> This is a fundamental question for which we need a rigorous answer. What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any other)?

Agreed.

>
> If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF8 code point.

Distinguishing them is the right thing to do, but auto decoding is not the way to achieve that, see above.
May 27, 2016
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
[snip]
>
> I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing.

I disagree. "if used naively" shouldn't be the default. A user (naively) expects string algorithms to work as efficiently as possible without overheads. To tell the user later that s/he shouldn't _naively_ have used a certain algorithm provided by the library is a bit cynical. Having to redesign a code base because of hidden behavior is a big turn off, having to go through Phobos to determine where the hidden pitfalls are is not the user's job.

> Also allow me to point that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.

And what if you deal with non-ASCII-heavy text? Does the user have to guess and micro-optimize for simple use cases?

>> 5. Very few algorithms require decoding.
>
> The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):
>
> s.find("abc")
> s.findSplit("abc")
> s.findSplit('a')
> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
>
> However the following do require autodecoding:
>
> s.walkLength
> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> s.count!(c => c >= 32) // non-control characters
>
> Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.

But how is the user supposed to know without being a core contributor to Phobos? If using a library method that works well in one case can slow down your code in a slightly different case, something is wrong with the language/library design. For simple cases the burden shouldn't be on the user, or, if it is, s/he should be informed about it in order to be able to make well-informed decisions. Personally I wouldn't mind having to decide in each case what I want (provided I have a best practices cheat sheet :)), so I can get the best out of it. But having to keep guessing, testing, and benchmarking each string-handling library function is not good at all.

[snip]
May 27, 2016
On 5/27/16 7:19 AM, Chris wrote:
> On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
> [snip]
>>
>> I would agree only with the amendment "...if used naively", which is
>> important. Knowledge of how autodecoding works is a prerequisite for
>> writing fast string code in D. Also, little code should deal with one
>> code unit or code point at a time; instead, it should use standard
>> library algorithms for searching, matching etc. When needed, iterating
>> every code unit is trivially done through indexing.
>
> I disagree.

Misunderstanding.

> "if used naively" shouldn't be the default. A user (naively)
> expects string algorithms to work as efficiently as possible without
> overheads.

That's what happens with autodecoding.

>> Also allow me to point that much of the slowdown can be addressed
>> tactically. The test c < 0x80 is highly predictable (in ASCII-heavy
>> text) and therefore easily speculated. We can and we should arrange
>> code to minimize impact.
>
> And what if you deal with non-ASCII heavy text? Does the user have to
> guess an micro-optimize for simple use cases?

Misunderstanding.

>>> 5. Very few algorithms require decoding.
>>
>> The key here is leaving it to the standard library to do the right
>> thing instead of having the user wonder separately for each case.
>> These uses don't need decoding, and the standard library correctly
>> doesn't involve it (or if it currently does it has a bug):
>>
>> s.find("abc")
>> s.findSplit("abc")
>> s.findSplit('a')
>> s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
>>
>> However the following do require autodecoding:
>>
>> s.walkLength
>> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
>> s.count!(c => c >= 32) // non-control characters
>>
>> Currently the standard library operates at code point level even
>> though inside it may choose to use code units when admissible. Leaving
>> such a decision to the library seems like a wise thing to do.
>
> But how is the user supposed to know without being a core contributor to
> Phobos?

Misunderstanding. All examples work properly today because of autodecoding. -- Andrei

May 27, 2016
On 5/27/16 6:56 AM, Marc Schütz wrote:
> It is not, which has been shown by various posts in this thread.

Couldn't quite find strong arguments. Could you please be more explicit on which you found most convincing? -- Andrei
May 27, 2016
On 5/27/16 6:26 AM, Kagamin wrote:
> As I understand, design rationale
> behind strings being plain arrays of code units is that it's impractical
> for the string to smarter than array of code units - it just won't cut
> it, while plain array provides simple and easy to understand
> implementation of string.

That's my understanding too. And I think the design rationale is wrong. -- Andrei
May 27, 2016
On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
>>> However the following do require autodecoding:
>>>
>>> s.walkLength
>>> s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
>>> s.count!(c => c >= 32) // non-control characters
>>>
>>> Currently the standard library operates at code point level even
>>> though inside it may choose to use code units when admissible. Leaving
>>> such a decision to the library seems like a wise thing to do.
>>
>> But how is the user supposed to know without being a core contributor to
>> Phobos?
>
> Misunderstanding. All examples work properly today because of
> autodecoding. -- Andrei

They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes.

https://dpaste.dzfl.pl/817dec505fd2
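The point is easy to demonstrate outside D as well. In Python (whose len, like walkLength, counts code points), two canonically equivalent spellings of the same character count differently:

```python
import unicodedata

composed = "\u00e9"     # é as one precomposed code point
decomposed = "e\u0301"  # é as e + combining acute accent

# Same "character" to a reader, same string after normalization...
assert unicodedata.normalize("NFC", decomposed) == composed

# ...yet code point counting sees 1 vs 2.
print(len(composed), len(decomposed))  # 1 2
```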