More fun with autodecoding
August 06, 2018
I wanted to share a story where I actually tried to add a new type with autodecoding and failed.

I want to create a wrapper type that forwards an underlying range type but adds one feature -- tracking in the original range where you were. This is in a new library I'm writing for parsing.

So my first idea was to just forward all methods from the given range manually -- I need to override certain ones that affect the offset into the original range.

However, typically parsing is done from text.

I realized: strings are a range of dchar, but I need length and other things forwarded so the wrapper can be a drop-in replacement for strings (I treat strings/wstrings as character buffers in iopipe). However, Phobos will then assume length() is the number of dchar elements, assume it has indexing, etc.! Here is a case where I can't repeat Phobos' auto-decoding mistake for my own type. I never thought I'd have that problem...
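A quick standalone sketch of the mismatch (not part of the wrapper): Phobos range primitives present a string as decoded dchar elements, while .length still counts code units:

```d
import std.range.primitives : ElementType, front;

void main()
{
    string s = "é";  // one code point, two UTF-8 code units
    static assert(is(ElementType!string == dchar)); // ranges see strings as dchar
    assert(s.length == 2);   // but .length counts code units
    assert(s.front == 'é');  // .front auto-decodes to a dchar
}
```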

So I thought, maybe I'll just alias this the underlying range and only override the parts that are needed. I end up with a nice tiny definition, and things are looking pretty good:

    static struct Result
    {
        private size_t pos; // offset of _buffer within the original range
        B _buffer;          // B is the wrapped range type (e.g. string)
        alias _buffer this;

        // implement the slice operations
        size_t[2] opSlice(size_t dim)(int start, int end) if (dim == 0)
        in
        { assert(start >= 0 && end <= _buffer.length); }
        do
        {
            return [start, end];
        }

        Result opIndex(size_t[2] dims)
        {
            return Result(pos + dims[0], _buffer[dims[0] .. dims[1]]);
        }

        void popFront()
        {
            import std.traits : isNarrowString;
            static if(isNarrowString!B)
            {
                auto prevLen = _buffer.length;
                _buffer.popFront;
                pos += prevLen - _buffer.length;
            }
            else
            {
                _buffer.popFront;
                ++pos;
            }
        }

        // the specialized buffer reference accessor.
        @property auto bufRef()
        {
            return BufRef(pos, _buffer.length);
        }
    }

Note already the sucky part in popFront.
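To spell out why the length subtraction is needed (a standalone sketch, separate from the wrapper): popFront on a narrow string removes a whole code point, which may span several code units, so a plain ++pos would drift out of sync:

```d
import std.range.primitives : popFront;

void main()
{
    string s = "héllo";       // 'é' is 2 UTF-8 code units; 6 total
    auto before = s.length;
    s.popFront();             // pops 'h': one code unit
    assert(before - s.length == 1);

    before = s.length;
    s.popFront();             // pops 'é': two code units at once
    assert(before - s.length == 2);
}
```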

But then I got a surprise when I went to use it:

    import std.algorithm : splitter;
    auto buf = "hi there this is a sentence";
    auto split1 = buf.bwin.splitter; // bwin wraps buf in the Result type above
    auto split2 = buf.splitter; // normal split range
    while(!split1.empty)
    {
        assert(split1.front == split2.front);
        assert(split1.front.bufRef.concrete(buf) == split2.front); // FAILS!
        split1.popFront;
        split2.popFront;
    }

What happened? It turns out, the splitter looks for length and indexing *OR* that it is a narrow string. Splitter is trying to ignore the fact that Phobos forces autodecoding on char arrays to achieve performance. With this taken into account, I think my type does not pass any of the constraints for any of the overloads (not 100% sure on that), so it devolves to just using the alias this'd element directly, completely circumventing the point of my wrapper. The error I get is "no member `bufRef` for type `string`".

My next attempt will be to use byCodeUnit when I detect a narrow string, which hopefully will work OK. But I'm not sure if the performance is going to be the same, since now it will likely FORCE autodecoding on the algorithms that have specialized versions to AVOID autodecoding (I think).
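For reference, a minimal sketch of what byCodeUnit provides -- a random-access range of code units that skips decoding entirely:

```d
import std.utf : byCodeUnit;

void main()
{
    auto r = "héllo".byCodeUnit;
    assert(r.length == 6);  // length in code units -- no decoding
    assert(r[0] == 'h');    // random access and slicing work
    assert(r.front == 'h'); // front is a char, not a decoded dchar
}
```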

I'm very tempted to start writing my own parsing utilities and avoid using Phobos algorithms...

-Steve
August 08, 2018
On Monday, 6 August 2018 at 13:57:10 UTC, Steven Schveighoffer wrote:
>
> I'm very tempted to start writing my own parsing utilities and avoid using Phobos algorithms...
>
> -Steve

Oh yes; the good old autodecoding.
August 08, 2018
On 8/6/2018 6:57 AM, Steven Schveighoffer wrote:
> But I'm not sure if the performance is going to be the same, since now it will likely FORCE autodecoding on the algorithms that have specialized versions to AVOID autodecoding (I think).

Autodecoding is expensive which is why the algorithms defeat it. Nearly none actually need it.

You can get decoding if needed by using .byDchar or .by!dchar (forgot which it was).
August 08, 2018
On 8/8/18 4:13 PM, Walter Bright wrote:
> On 8/6/2018 6:57 AM, Steven Schveighoffer wrote:
>> But I'm not sure if the performance is going to be the same, since now it will likely FORCE autodecoding on the algorithms that have specialized versions to AVOID autodecoding (I think).
> 
> Autodecoding is expensive which is why the algorithms defeat it. Nearly none actually need it.
> 
> You can get decoding if needed by using .byDchar or .by!dchar (forgot which it was).

There are byCodePoint and byCodeUnit, of which byCodePoint forces auto-decoding.

The problem is, I want to use this wrapper just like it was a string in all respects (including the performance gains had by ignoring auto-decoding).

Not trying to give too much away about the library I'm writing, but the problem I'm trying to solve is parsing out tokens from a buffer. I want to delineate the whole, as well as the parts, but it's difficult to get back to the original buffer once you split and slice up the buffer using phobos functions.

Consider that you are searching for something in a buffer. Phobos provides all you need to narrow down your range to the thing you are looking for. But it doesn't give you a way to figure out where you are in the whole buffer.

Up till now, I've done it by weird length math, but it gets tiring (see for instance: https://github.com/schveiguy/fastaq/blob/master/source/fasta/fasta.d#L125). I just want to know where the darned thing I've narrowed down is in the original range!
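The "weird length math" in question looks something like this minimal standalone illustration: since slices share memory with the original array, the offset of a narrowed slice falls out of a length subtraction:

```d
import std.algorithm.searching : find;

void main()
{
    auto buf = "name: value";
    auto rest = buf.find(":");                // narrowed slice into buf
    size_t offset = buf.length - rest.length; // the "weird length math"
    assert(offset == 4);
    assert(buf[offset] == ':');
}
```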

So this wrapper I thought would be a way to use things like you always do, but at any point, you just extract a piece of information (a buffer reference) that shows where it is in the original buffer. It's quite easy to do that part, the problem is getting it to be a drop-in replacement for the original type.

Here's where I'm struggling -- because a string provides indexing, slicing, length, etc. but Phobos ignores that. I can't make a new type that does the same thing. Not only that, but I'm finding the specializations of algorithms only work on the type "string", and nothing else.

I'll try using byCodeUnit and see how it fares.

-Steve
August 08, 2018
On 8/8/2018 2:01 PM, Steven Schveighoffer wrote:
> Here's where I'm struggling -- because a string provides indexing, slicing, length, etc. but Phobos ignores that. I can't make a new type that does the same thing. Not only that, but I'm finding the specializations of algorithms only work on the type "string", and nothing else.

One of the worst things about autodecoding is it is special, it *only* steps in for strings. Fortunately, however, that specialness enabled us to save things with byCodePoint and byCodeUnit.
August 09, 2018
On Wednesday, 8 August 2018 at 21:01:18 UTC, Steven Schveighoffer wrote:
> Not trying to give too much away about the library I'm writing, but the problem I'm trying to solve is parsing out tokens from a buffer. I want to delineate the whole, as well as the parts, but it's difficult to get back to the original buffer once you split and slice up the buffer using phobos functions.

I wonder if there are some parallels in the tsv utilities I wrote. The tsv parser is extremely simple, byLine and splitter on a char buffer. Most of the tools just iterate the split result in order, but a couple do things like operate on a subset of fields, potentially reordered. For these a separate structure is created that maps back to the original buffer to avoid copying. Likely quite simple compared to what you are doing.

The csv2tsv tool may be more interesting. Parsing is relatively simple, mostly identifying field values in the context of CSV escape syntax. It's modeled as reading an infinite stream of utf-8 characters, byte-by-byte. Occasionally the bytes forming the value need to be modified due to the escape syntax, but most of the time the characters in the original buffer remain untouched and parsing is identifying the start and end positions.

The infinite stream is constructed by reading fixed-size blocks from the input stream and concatenating them with joiner. This eliminates the need to worry about utf-8 characters spanning block boundaries, but it comes at a cost: either write byte-at-a-time, or make an extra copy (also byte-at-a-time). Making an extra copy is faster, so that's what the code does. But, as a practical matter, large blocks could often be written directly from the original input buffer.
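The construction described reads something like this sketch (block size and data are illustrative, not from the tool): joiner presents the fixed-size blocks as one continuous byte stream, so a multi-byte UTF-8 sequence straddling a block boundary is a non-issue when consuming byte-by-byte:

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : joiner;
import std.range : chunks;

void main()
{
    // Simulate reading fixed-size blocks from an input stream.
    ubyte[] data = cast(ubyte[]) "abc,déf,ghi".dup;
    auto stream = data.chunks(3).joiner; // the 2-byte 'é' straddles a block edge
    assert(stream.equal(data));          // byte-by-byte, boundaries invisible
}
```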

If I wanted to make it faster than it currently is, that's what I'd do. But I don't see an easy way to do it with Phobos ranges. At minimum I'd have to be able to run code when the joiner operation hits block boundaries, and it'd also be necessary to create a mapping back to the original input buffer.

Autodecoding comes into play of course. Basically, splitter on char arrays is fine, but in a number of cases it's necessary to work using ubyte to avoid the performance penalty.

--Jon
September 08, 2018
On 8/9/18 2:44 AM, Walter Bright wrote:
> On 8/8/2018 2:01 PM, Steven Schveighoffer wrote:
>> Here's where I'm struggling -- because a string provides indexing, slicing, length, etc. but Phobos ignores that. I can't make a new type that does the same thing. Not only that, but I'm finding the specializations of algorithms only work on the type "string", and nothing else.
> 
> One of the worst things about autodecoding is it is special, it *only* steps in for strings. Fortunately, however, that specialness enabled us to save things with byCodePoint and byCodeUnit.

So it turns out that technically the problem here, even though it seemed like an autodecoding problem, is a problem with splitter.

splitter doesn't deal with encodings of character ranges at all.

For instance, when you have this:

"abc 123".byCodeUnit.splitter;

What happens is splitter only has one overload that takes one parameter, and that requires a character *array*, not a range.

So the byCodeUnit result is aliased-this to its original, and surprise! the elements from that splitter are string.

Next, I tried to use a parameter:

"abc 123".byCodeUnit.splitter(" ");

Nope, still devolves to string. It turns out it can't figure out how to split character ranges using a character array as input.

The only thing that does seem to work is this:

"abc 123".byCodeUnit.splitter(" ".byCodeUnit);

But this goes against most algorithms in Phobos that deal with character ranges -- generally you can use any width character range, and it just works. Having a drop-in replacement for string would require splitter to handle these transcodings (and I think in general, algorithms should be able to handle them as well). Not only that, but the specialized splitter that takes no separator can split on multiple spaces, a feature I want to have for my drop-in replacement.
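For completeness, here is the working form as a runnable sketch -- both the haystack and the separator converted with byCodeUnit:

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : splitter;
import std.utf : byCodeUnit;

void main()
{
    // Only works when both sides are code-unit ranges:
    auto parts = "abc 123".byCodeUnit.splitter(" ".byCodeUnit);
    assert(parts.front.equal("abc".byCodeUnit));
    parts.popFront();
    assert(parts.front.equal("123".byCodeUnit));
}
```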

I'll work on adding some issues to the tracker, and potentially doing some PRs so they can be fixed.

-Steve
September 09, 2018
On 9/8/18 8:36 AM, Steven Schveighoffer wrote:


Sent this when I was on a plane, and for some reason it posted with the timestamp when I hit "send later", not when I connected just now. So this is to bring the previous message back to the forefront.

-Steve
September 10, 2018
On Saturday, 8 September 2018 at 15:36:25 UTC, Steven Schveighoffer wrote:
> On 8/9/18 2:44 AM, Walter Bright wrote:
>> On 8/8/2018 2:01 PM, Steven Schveighoffer wrote:
>>> Here's where I'm struggling -- because a string provides indexing, slicing, length, etc. but Phobos ignores that. I can't make a new type that does the same thing. Not only that, but I'm finding the specializations of algorithms only work on the type "string", and nothing else.
>> 
>> One of the worst things about autodecoding is it is special, it *only* steps in for strings. Fortunately, however, that specialness enabled us to save things with byCodePoint and byCodeUnit.
>
> So it turns out that technically the problem here, even though it seemed like an autodecoding problem, is a problem with splitter.
>
> splitter doesn't deal with encodings of character ranges at all.

This could partially explain why, when I tried byCodeUnit and friends a while ago, I concluded it wasn't a reasonable approach: splitter is in the middle of much of what I've written.

Even if splitter is changed I'll still be very doubtful about the byCodeUnit approach as a work-around. An automated way to validate that it is engaged only when necessary would be very helpful (@noautodecode perhaps? :))

--Jon

September 10, 2018
On Saturday, September 8, 2018 9:36:25 AM MDT Steven Schveighoffer via Digitalmars-d wrote:
> On 8/9/18 2:44 AM, Walter Bright wrote:
> > On 8/8/2018 2:01 PM, Steven Schveighoffer wrote:
> >> Here's where I'm struggling -- because a string provides indexing, slicing, length, etc. but Phobos ignores that. I can't make a new type that does the same thing. Not only that, but I'm finding the specializations of algorithms only work on the type "string", and nothing else.
> >
> > One of the worst things about autodecoding is it is special, it *only* steps in for strings. Fortunately, however, that specialness enabled us to save things with byCodePoint and byCodeUnit.
>
> So it turns out that technically the problem here, even though it seemed like an autodecoding problem, is a problem with splitter.
>
> splitter doesn't deal with encodings of character ranges at all.
>
> For instance, when you have this:
>
> "abc 123".byCodeUnit.splitter;
>
> What happens is splitter only has one overload that takes one parameter, and that requires a character *array*, not a range.
>
> So the byCodeUnit result is aliased-this to its original, and surprise! the elements from that splitter are string.
>
> Next, I tried to use a parameter:
>
> "abc 123".byCodeUnit.splitter(" ");
>
> Nope, still devolves to string. It turns out it can't figure out how to split character ranges using a character array as input.
>
> The only thing that does seem to work is this:
>
> "abc 123".byCodeUnit.splitter(" ".byCodeUnit);
>
> But this goes against most algorithms in Phobos that deal with character ranges -- generally you can use any width character range, and it just works. Having a drop-in replacement for string would require splitter to handle these transcodings (and I think in general, algorithms should be able to handle them as well). Not only that, but the specialized splitter that takes no separator can split on multiple spaces, a feature I want to have for my drop-in replacement.
>
> I'll work on adding some issues to the tracker, and potentially doing some PRs so they can be fixed.

Well, plenty of algorithms don't care one whit about strings specifically, and thus their behavior is really dependent on what the element type of the range is (e.g. for byCodeUnit, filter would filter code units, and sort would sort code units, and arguably, that's what they should do). However, a big problem with a number of the functions in Phobos that specifically operate on ranges of characters is that they tend to assume that a range of characters means a range of dchar. Some of the functions in Phobos have been fixed to be more flexible and operate on arbitrary ranges of char, wchar, or dchar, but it's mostly happened because of a bug report about a particular function not working with something like byCodeUnit, whereas what we really need to happen is to have tests added for all of the functions in Phobos which specifically operate on ranges of characters to ensure that they do the correct thing when given a range of char, wchar, dchar - or graphemes (much as we talk about graphemes being the correct level for some types of string processing, nothing in Phobos outside of std.uni currently does anything with byGrapheme, even in tests).

And of course, with those tests, we'll inevitably find that a number of those functions won't work correctly and will need to be fixed. But as annoying as all of that is, it's work that needs to be done regardless of the situation with auto-decoding, since these functions need to work with arbitrary ranges of characters and not just ranges of dchar. And for those functions that don't need to try to avoid auto-decoding, they should then not even care whether strings are ranges of code units or code points, which should then reduce the impact of auto-decoding. And actually, a lot of the code that specializes on narrow strings to avoid auto-decoding would probably work whether auto-decoding was there or not. So, once we've actually managed to ensure that Phobos in general works with arbitrary ranges of characters, the main breakage that would be caused by removing auto-decoding (in Phobos at least) would be any code that used strings with functions that weren't specifically written to do something special for strings, and while I'm not at all convinced that we then have a path towards removing auto-decoding, it would minimize auto-decoding's impact, and with auto-decoding's impact minimized as much as possible, maybe at some point, we'll actually manage to figure out how to remove it.

But in any case, the issues that you're running into with splitter are a symptom of a larger problem with how Phobos currently handles ranges of characters. And when this sort of thing comes up, I'm reminded that I should take the time to start adding the appropriate tests to Phobos, and then I never get around to it - as with too many things. I really should fix that. :|

- Jonathan M Davis


