September 10, 2018
On Saturday, 8 September 2018 at 15:36:25 UTC, Steven Schveighoffer wrote:
> On 8/9/18 2:44 AM, Walter Bright wrote:

>
> So it turns out that technically the problem here, even though it seemed like an autodecoding problem, is a problem with splitter.
>
> splitter doesn't deal with encodings of character ranges at all.
>
> For instance, when you have this:
>
> "abc 123".byCodeUnit.splitter;
>
> What happens is splitter only has one overload that takes one parameter, and that requires a character *array*, not a range.
>
> So the byCodeUnit result is aliased-this to its original, and surprise! the elements from that splitter are string.
>
> Next, I tried to use a parameter:
>
> "abc 123".byCodeUnit.splitter(" ");
>
> Nope, still devolves to string. It turns out it can't figure out how to split character ranges using a character array as input.
>
> The only thing that does seem to work is this:
>
> "abc 123".byCodeUnit.splitter(" ".byCodeUnit);
>

After a while your code will be cluttered with absurd stuff like this: `.byCodeUnit`, `.byGrapheme`, `.array` etc. Due to my experience with `splitter` et al., I tried to create my own parser to have better control over every step. After a few *minutes* of testing things I ran into this bug [1] that didn't get fixed till early 2018. I never started to write my own step-by-step parser. I'm glad I didn't.

I wish people began to realize that string handling is a basic necessity and that the correct handling of strings is of utmost importance. Please keep us updated on how things work out (or not) for you.

[Please, nobody answer my post pointing out that a) we don't understand Unicode and b) that it's an insult to the Universe to draw attention to flaws that keep pestering us on an almost daily basis - without trying to fix them ourselves stante pede. As is clear from Steve's efforts, the Universe doesn't seem to care.]

[1] https://issues.dlang.org/show_bug.cgi?id=16739

[snip]
September 10, 2018
On Monday, September 10, 2018 2:45:27 AM MDT Chris via Digitalmars-d wrote:

> After a while your code will be cluttered with absurd stuff like this. `.byCodeUnit`, `.byGrapheme`, `.array` etc. Due to my experience with `splitter` et. al. I tried to create my own parser to have better control over every step. After a few *minutes* of testing things I ran into this bug [1] that didn't get fixed till early 2018. I never started to write my own step-by-step parser. I'm glad I didn't.
>
> [1] https://issues.dlang.org/show_bug.cgi?id=16739
>
> [snip]

I suspect that it didn't get found sooner simply because using Unicode in a switch statement is rare. Usually, Unicode characters are found in program input and not in the program itself. And grammars typically only involve ASCII characters (even D, which supports Unicode characters in identifiers, doesn't have any Unicode in any of its symbols). So, while I completely agree that using Unicode in switch statements should work, it doesn't really surprise me that it was broken. That's really a large part of the Unicode problem. Regardless of how a particular language or library attempts to make using Unicode sane, a large percentage of programmers don't ever do anything with Unicode characters (even if their programs are often used in environments where they will end up processing Unicode characters), and even when a programmer's native tongue requires Unicode characters, their programs frequently do not. So, it becomes very easy to write code that doesn't work properly with Unicode and have no clue that it doesn't.

Fortunately, D does provide better tools than many languages for handling Unicode, but the auto-decoding mess has made it considerably worse.

Still, even if we'd gotten it right, some portion of the code out there would have to use something like byCodeUnit, byCodePoint, or byGrapheme, because efficient Unicode processing requires that you deal with all of that mess. The code that doesn't have to do any of that is generally code that treats strings as opaque data. Once you actually have to do string processing, you're pretty much screwed.

Doing everything at the grapheme level would eliminate most of the problems with regard to user-friendliness, but it would kill efficiency. So, as far as I can tell, there really isn't a great solution to be had. Unicode is simply too complicated and messy by its very nature. Now, we've definitely made mistakes with Phobos that make it worse, but the only programs that are going to avoid this whole mess do so either by not dealing with Unicode, by handling it incorrectly, or by handling it inefficiently. I think that it's pretty much a pipe dream to be able to have completely sane and efficient string handling using Unicode as it's currently defined.

Regardless, we need to do a better job of it in D than we have been.

- Jonathan M Davis



September 10, 2018
On 9/10/18 1:45 AM, Chris wrote:

> After a while your code will be cluttered with absurd stuff like this. `.byCodeUnit`, `.byGrapheme`, `.array` etc. Due to my experience with `splitter` et. al. I tried to create my own parser to have better control over every step.

I considered that, but I'm still trying to make this buffer reference thing work. Phobos just needs to be fixed. This is actually not as hopeless as I once thought. But what needs to happen is that all of Phobos's algorithms need to be tested with byCodeUnit et al.

> After a few *minutes* of testing things I ran into this bug [1] that didn't get fixed till early 2018. I never started to write my own step-by-step parser. I'm glad I didn't.

It actually was fixed accidentally in 2017 in this PR: https://github.com/dlang/druntime/pull/1952. The bug was closed in 2018 when someone noticed the code no longer failed.

Essentially, the whole string switch algorithm was replaced with a completely rewritten better approach. This is a great example of why we should be moving more of the compiler magic into the library -- it's just easier to write and understand there.

> I wish people began to realize that string handling is a basic necessity and that the correct handling of strings is of utmost importance. Please keep us updated on how things work out (or not) for you.

Absolutely, D needs to have great support for string parsing and manipulation. The potential is awesome.

I will keep it up. What I'm trying to fix is the fact that using std.algorithm to extract pieces from a buffer, and then using the position in that buffer to determine things (i.e. parsing), is really difficult without some stupid requirements like pointer math.
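
To illustrate the kind of pointer math meant here, a minimal sketch follows. The helper `offsetIn` is hypothetical (it is not part of Phobos or of the buffer code discussed in this thread) and assumes the piece really is a slice of the buffer:

```d
import std.algorithm.iteration : splitter;

// Hypothetical helper: position of `part` inside `buf`.
// Assumes `part` is a slice of `buf`; the assert checks that assumption.
size_t offsetIn(const(char)[] buf, const(char)[] part)
{
    assert(part.ptr >= buf.ptr
        && part.ptr + part.length <= buf.ptr + buf.length);
    return part.ptr - buf.ptr; // pointer difference gives the offset
}

void main()
{
    string buf = "abc 123";
    auto parts = buf.splitter(' ');   // elements are slices of buf

    assert(offsetIn(buf, parts.front) == 0);
    parts.popFront();
    assert(parts.front == "123");
    assert(offsetIn(buf, parts.front) == 4);
}
```

This only works because the elements are plain slices of the original array; once a wrapper range sits in between, the `.ptr` trick is no longer directly available.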

> [Please, nobody answer my post pointing out that a) we don't understand Unicode and b) that it's an insult to the Universe to draw attention to flaws that keep pestering us on an almost daily basis - without trying to fix them ourselves stante pede. As is clear from Steve's efforts, the Universe doesn't seem to care.]

I don't characterize it as the universe not caring. Phobos has a legacy problem with string handling, and it needs to somehow be addressed -- either by painfully extracting the problem, or painfully working around it. I don't think anyone here thinks there isn't a problem or that it's insulting to bring it up. But anything that needs to be done is painful either way, which is why it's not happening very fast.

-Steve
September 10, 2018
On 9/8/18 8:36 AM, Steven Schveighoffer wrote:
> I'll work on adding some issues to the tracker, and potentially doing some PRs so they can be fixed.

https://issues.dlang.org/show_bug.cgi?id=19238
https://github.com/dlang/phobos/pull/6700

-Steve

September 10, 2018
On 9/8/18 8:36 AM, Steven Schveighoffer wrote:
> On 8/9/18 2:44 AM, Walter Bright wrote:
>> On 8/8/2018 2:01 PM, Steven Schveighoffer wrote:
>>> Here's where I'm struggling -- because a string provides indexing, slicing, length, etc. but Phobos ignores that. I can't make a new type that does the same thing. Not only that, but I'm finding the specializations of algorithms only work on the type "string", and nothing else.
>>
>> One of the worst things about autodecoding is it is special, it *only* steps in for strings. Fortunately, however, that specialness enabled us to save things with byCodePoint and byCodeUnit.
> 
> So it turns out that technically the problem here, even though it seemed like an autodecoding problem, is a problem with splitter.
> 
> splitter doesn't deal with encodings of character ranges at all.
> 
> For instance, when you have this:
> 
> "abc 123".byCodeUnit.splitter;
> 
> What happens is splitter only has one overload that takes one parameter, and that requires a character *array*, not a range.
> 
> So the byCodeUnit result is aliased-this to its original, and surprise! the elements from that splitter are string.
> 
> Next, I tried to use a parameter:
> 
> "abc 123".byCodeUnit.splitter(" ");
> 
> Nope, still devolves to string. It turns out it can't figure out how to split character ranges using a character array as input.

Hm... I made some erroneous assumptions in determining these problems.

1. There is no alias this for the source in ByCodeUnitImpl. I'm not sure how it was working when I tested before, but byCodeUnit definitely doesn't have it, and doesn't compile with the no-arg splitter call.
2. The .splitter(" ") does actually work and return a range of ByCodeUnitImpl elements.

So some of my analysis must have been based on bad testing.
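
For reference, a small sketch of the two calls that do work, assuming a default Phobos build (auto-decoding enabled, so `byCodeUnit` returns a wrapper type rather than the string itself):

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : splitter;
import std.utf : byCodeUnit;

void main()
{
    // String separator: the elements stay code-unit ranges.
    auto a = "abc 123".byCodeUnit.splitter(" ");
    static assert(!is(typeof(a.front) == string));
    assert(a.front.equal("abc"));
    a.popFront();
    assert(a.front.equal("123"));

    // Code-unit separator: same result.
    auto b = "abc 123".byCodeUnit.splitter(" ".byCodeUnit);
    static assert(!is(typeof(b.front) == string));
    assert(b.front.equal("abc"));
    b.popFront();
    assert(b.front.equal("123"));
}
```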

However, the issue with the no-arg splitter is still there, and I still think it should be fixed.

I'll have to figure out why my specialized range doesn't allow splitting based on " ".

-Steve
September 10, 2018
On 9/10/18 8:58 AM, Steven Schveighoffer wrote:
> I'll have to figure out why my specialized range doesn't allow splitting based on " ".

And the answer is: I'm an idiot. Forgot to define empty :) Also my slicing operator accepted ints and not size_t.
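
A rough sketch of what the range needs, for anyone hitting the same thing. `CharWindow` is a hypothetical stand-in, not the actual bufref code, and it splits on a single `char` to stay on the simplest splitter overload; the string-separator overload has the same `hasSlicing` requirement:

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : splitter;
import std.range.primitives : hasLength, hasSlicing, isForwardRange;

// Hypothetical stand-in for a custom character range.
struct CharWindow
{
    string data;

    // Easy to forget: without `empty`, isForwardRange (and so hasSlicing) fails.
    @property bool empty() const { return data.length == 0; }
    @property char front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
    @property CharWindow save() { return this; }

    @property size_t length() const { return data.length; }
    alias opDollar = length;

    // size_t bounds, so slicing with r.length as a bound compiles.
    CharWindow opSlice(size_t lo, size_t hi)
    {
        return CharWindow(data[lo .. hi]);
    }
}

static assert(isForwardRange!CharWindow);
static assert(hasLength!CharWindow);
static assert(hasSlicing!CharWindow);

void main()
{
    auto parts = CharWindow("abc 123").splitter(' ');
    static assert(is(typeof(parts.front) == CharWindow));

    assert(parts.front.equal("abc"));
    parts.popFront();
    assert(parts.front.equal("123"));
    parts.popFront();
    assert(parts.empty);
}
```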

-Steve

September 10, 2018
On 9/10/18 12:46 PM, Steven Schveighoffer wrote:
> On 9/10/18 8:58 AM, Steven Schveighoffer wrote:
>> I'll have to figure out why my specialized range doesn't allow splitting based on " ".
> 
> And the answer is: I'm an idiot. Forgot to define empty :) Also my slicing operator accepted ints and not size_t.

I guess a better error message would be in order.

September 11, 2018
On Monday, 10 September 2018 at 20:44:46 UTC, Andrei Alexandrescu wrote:
> On 9/10/18 12:46 PM, Steven Schveighoffer wrote:
>> On 9/10/18 8:58 AM, Steven Schveighoffer wrote:
>>> I'll have to figure out why my specialized range doesn't allow splitting based on " ".
>> 
>> And the answer is: I'm an idiot. Forgot to define empty :) Also my slicing operator accepted ints and not size_t.
>
> I guess a better error message would be in order.

https://github.com/dlang/DIPs/pull/131 will help narrow down the cause.
September 11, 2018
On 9/10/18 1:44 PM, Andrei Alexandrescu wrote:
> On 9/10/18 12:46 PM, Steven Schveighoffer wrote:
>> On 9/10/18 8:58 AM, Steven Schveighoffer wrote:
>>> I'll have to figure out why my specialized range doesn't allow splitting based on " ".
>>
>> And the answer is: I'm an idiot. Forgot to define empty :) Also my slicing operator accepted ints and not size_t.
> 
> I guess a better error message would be in order.
> 

A better error message would help prevent the painful diagnosis that I had to do to actually find the issue.

So the error I got was this:

source/bufref.d(346,36): Error: template std.algorithm.iteration.splitter cannot deduce function from argument types !()(Result, string), candidates are:
/Users/steves/.dvm/compilers/dmd-2.081.0/osx/bin/../../src/phobos/std/algorithm/iteration.d(3792,6):        std.algorithm.iteration.splitter(alias pred = "a == b", Range, Separator)(Range r, Separator s) if (is(typeof(binaryFun!pred(r.front, s)) : bool) && (hasSlicing!Range && hasLength!Range || isNarrowString!Range))
/Users/steves/.dvm/compilers/dmd-2.081.0/osx/bin/../../src/phobos/std/algorithm/iteration.d(4163,6):        std.algorithm.iteration.splitter(alias pred = "a == b", Range, Separator)(Range r, Separator s) if (is(typeof(binaryFun!pred(r.front, s.front)) : bool) && (hasSlicing!Range || isNarrowString!Range) && isForwardRange!Separator && (hasLength!Separator || isNarrowString!Separator))
/Users/steves/.dvm/compilers/dmd-2.081.0/osx/bin/../../src/phobos/std/algorithm/iteration.d(4350,6):        std.algorithm.iteration.splitter(alias isTerminator, Range)(Range r) if (isForwardRange!Range && is(typeof(unaryFun!isTerminator(r.front))))
/Users/steves/.dvm/compilers/dmd-2.081.0/osx/bin/../../src/phobos/std/algorithm/iteration.d(4573,6):        std.algorithm.iteration.splitter(C)(C[] s) if (isSomeChar!C)

This means I had to look at each line, figure out which overload I'm calling, and then copy all the constraints locally, seeing which ones were true and which ones false.

But it didn't stop there. The problem was hasSlicing!Range. If you look at hasSlicing, it looks like this:

enum bool hasSlicing(R) = isForwardRange!R
    && !isNarrowString!R
    && is(ReturnType!((R r) => r[1 .. 1].length) == size_t)
    && (is(typeof(lvalueOf!R[1 .. 1]) == R) || isInfinite!R)
    && (!is(typeof(lvalueOf!R[0 .. $])) || is(typeof(lvalueOf!R[0 .. $]) == R))
    && (!is(typeof(lvalueOf!R[0 .. $])) || isInfinite!R
        || is(typeof(lvalueOf!R[0 .. $ - 1]) == R))
    && is(typeof((ref R r)
    {
        static assert(isForwardRange!(typeof(r[1 .. 2])));
    }));

Now I had to instrument a whole slew of items. I pasted this whole thing into my code, added an alias to my range type for R, and then changed the big boolean expression to a bunch of static asserts.

Then I found the true culprit was isForwardRange!R. This led me to question my sanity again, and I finally realized I had forgotten the empty function.
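
The approach can be sketched roughly as follows. `NoEmpty` is a hypothetical range with the same mistake (it is not the real code), and the individual checks mirror the primitives that isInputRange looks for:

```d
import std.range.primitives : hasSlicing, isForwardRange, isInputRange;

// Hypothetical range that forgot to define `empty`.
struct NoEmpty
{
    string data;
    @property char front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
    @property NoEmpty save() { return this; }
    @property size_t length() const { return data.length; }
    alias opDollar = length;
    NoEmpty opSlice(size_t lo, size_t hi) { return NoEmpty(data[lo .. hi]); }
}

alias R = NoEmpty;

// Work down from the failing constraint, one clause at a time.
static assert(!hasSlicing!R);      // the clause the splitter constraint trips on
static assert(!isForwardRange!R);  // first clause of hasSlicing
static assert(!isInputRange!R);    // which isForwardRange requires

// Check each input range primitive on its own.
static assert( is(typeof((R r) => r.front)));
static assert( is(typeof((R r) => r.popFront())));
static assert(!is(typeof((R r) => r.empty)));  // the actual culprit

void main() {}
```

When hunting for an unknown failure, each line would be written as a positive `static assert`, and the first one the compiler rejects points at the culprit.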

A fabulous fantastic mechanism that would have saved me some time is simply coloring the clauses of the template constraint that failed red, the ones that passed green, and the ones that weren't evaluated grey.

Furthermore, it would be good to recursively continue this for red clauses like `hasSlicing`, which have so much underneath. Either that, or provide a way to trigger the colored evaluation on demand.

If I were a dmd guru, I'd look at doing this myself. I may still try and hack it in just to see if I can do it.

------

Finally, there is a possible bug in the definition of hasSlicing: it doesn't require the slice parameters be size_t, but there are places (e.g. inside std.algorithm.searching.find) that pass in range.length .. range.length for slicing the range. In my implementation I had used ints as the parameters for opSlice. So I started seeing errors deep inside std.algorithm saying there was no overload for slicing. Again the sanity was questioned, and I figured out the error and now it's actually working.

-Steve
September 11, 2018
On 9/10/18 7:00 PM, Nicholas Wilson wrote:
> On Monday, 10 September 2018 at 20:44:46 UTC, Andrei Alexandrescu wrote:
>> On 9/10/18 12:46 PM, Steven Schveighoffer wrote:
>>> On 9/10/18 8:58 AM, Steven Schveighoffer wrote:
>>>> I'll have to figure out why my specialized range doesn't allow splitting based on " ".
>>>
>>> And the answer is: I'm an idiot. Forgot to define empty :) Also my slicing operator accepted ints and not size_t.
>>
>> I guess a better error message would be in order.
> 
> https://github.com/dlang/DIPs/pull/131 will help narrow down the cause.

While this would help eventually, I'd prefer something that just transforms all the existing code into useful error messages. See my response to Andrei.

-Steve