March 04, 2013
On Sat, 02 Mar 2013 17:29:38 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 3/2/13 2:32 PM, Steven Schveighoffer wrote:
>> This is not a personal attack, please don't take it that way.
>
> Not at all. I'm just confused about this exchange. You mention some correct facts ("MySplitter finished in 6 seconds instead of 10"), then some mistaken suppositions ("a string-specific version could be written"), and derive conclusions from both ("Maybe there is a bad constraint somewhere"). Hard to deal with the mix.

I didn't realize the situation.  I assumed that there wasn't a version of splitter for strings that took array-specific shortcuts.  My corrected statement should read "the existing string-specific version could be improved."

My conclusion of "Maybe there is a bad constraint somewhere" was not derived from that, it was based on your statement elsewhere that "I wrote a custom splitter specialized for arrays. It's as fast."

Given that my tests have shown I can write a faster one quite easily than the implementation selected by phobos, and that you said you wrote one that's "as fast," I took that to mean you had written one that was more optimized than the chosen splitter version (and logically assumed you had included this version in Phobos with the intent it would be chosen when compiling with strings), leading me to suggest possibly that due to some bug, the "fast" implementation wasn't being chosen.  I didn't realize that "as fast" didn't mean "as fast as yours".  I actually don't know what you meant by that now.

>> My
>> anecdotal tests with hand writing a custom splitter range to handle the
>> OP's program gave me a 40% savings. Whether it's find or not, I'm not
>> sure, but there definitely is room for improvement.
>
> I think I understand where the confusion comes from. If you're referring to MySplitter, that's not comparable. It uses this at the core:
>
> for(; i + 1 < source.length; i++)
> {
>      if(source[i] == '\r' && source[i + 1] == '\n')
>      {
>          found = true;
>          break;
>      }
>      ...
> }
>
> This is not close to the code that would work with arrays. We could specialize things for small arrays, but that hasn't been done yet. My point is it's not comparable with the classic brute force subsequence search.

Very good point, here is a new version that takes any string as a separator:

struct MySplitter
{
    private string s;
    private string separator;
    private string source;
    this(string src, string sep)
    {
        source = src;
        separator = sep;
        popFront();
    }

    @property string front()
    {
        return s;
    }

    @property bool empty()
    {
        return s.ptr == null;
    }

    void popFront()
    {
        if(!source.length)
        {
            s = source;
            source = null;
        }
        else
        {
            size_t i = 0;
            bool found = false;
            for(; i + separator.length <= source.length; i++)
            {
                if(source[i] == separator[0])
                {
                    found = true;
                    for(size_t j = i+1, k=1; k < separator.length; ++j, ++k)
                        if(source[j] != separator[k])
                        {
                            found = false;
                            break;
                        }
                    if(found)
                        break;
                }
            }
            s = source[0..i];
            if(found)
                source = source[i + separator.length..$];
            else
                source = source[$..$];
        }
    }
}

Takes 7 seconds on my machine instead of 6, but not 10 like std.algorithm.splitter.  I don't even like the loop that well, it looks crude, I can probably optimize it further.

And it does not use any of your specified tricks.

Is this sufficiently comparable, or am I missing something else?

-Steve
March 04, 2013
On Sun, 03 Mar 2013 22:58:07 -0500, deadalnix <deadalnix@gmail.com> wrote:

> On Monday, 4 March 2013 at 03:20:57 UTC, jerro wrote:
>>> Maybe it is time to look at the python implementation and see why it is faster.
>>
>> It isn't faster:
>>
>> $ time python3 test.py
>>
>> real    0m14.217s
>> user    0m14.209s
>> sys     0m0.004s
>> $ gdmd -O -inline -release -noboundscheck test
>> $ time ./test
>>
>> real    0m5.323s
>> user    0m5.312s
>> sys     0m0.008s
>>
>> D code here uses the same string as the python code, not the one in cvk012c's D code.
>
> Using noboundcheck isn't fair as you give up on safety, so you are not equivalent to python either.

In this type of test, there is no danger with using noboundscheck.  It's a very fair switch to use, D is just able to do it where python is not.  A sufficiently smart compiler could eliminate the bounds check for this code, since it can be proven not to go out of bounds (in fact, a simple run without the noboundscheck proves it in very deterministic code like this).

-Steve
March 04, 2013
On 3/4/13 10:39 AM, Steven Schveighoffer wrote:
> struct MySplitter
> {
> private string s;
> private string separator;
> private string source;
> this(string src, string sep)
> {
> source = src;
> separator = sep;
> popFront();
> }
>
> @property string front()
> {
> return s;
> }
>
> @property bool empty()
> {
> return s.ptr == null;
> }
>
> void popFront()
> {
> if(!source.length)
> {
> s = source;
> source = null;
> }
> else
> {
> size_t i = 0;
> bool found = false;
> for(; i + separator.length <= source.length; i++)
> {
> if(source[i] == separator[0])
> {
> found = true;
> for(size_t j = i+1, k=1; k < separator.length; ++j, ++k)
> if(source[j] != separator[k])
> {
> found = false;
> break;
> }
> if(found)
> break;
> }
> }
> s = source[0..i];
> if(found)
> source = source[i + separator.length..$];
> else
> source = source[$..$];
> }
> }
> }
>
> Takes 7 seconds on my machine instead of 6, but not 10 like
> std.algorithm.splitter. I don't even like the loop that well, it looks
> crude, I can probably optimize it further.
>
> And it does not use any of your specified tricks.
>
> Is this sufficiently comparable, or am I missing something else?
>
> -Steve

That's comparable, please file an enh request.


Thanks,

Andrei
March 04, 2013
On Sat, 02 Mar 2013 12:20:22 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
>> On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
>> <schveiguy@yahoo.com> wrote:
>>
>>> So it's just pure heuristics. Not hard to see how that would affect a
>>> 10 million cycle program.
>>>
>>> We may be able to create a string-specific version of splitter that
>>> would take advantage of the representation.
>>
>> Just found a disturbing artifact: splitter(message, '\n') is more than
>> twice as slow as splitter(message, "\n")
>>
>> -Steve
>
> That is a problem. Could you please file a but report?

http://d.puremagic.com/issues/show_bug.cgi?id=9645

In trying to make a minimal test case, I found that the algorithm performs worse vs. the substring version as the number of content characters between separators grows.  It actually starts out performing better than the substring version, by quite a bit (I would think we should be able to optimize there as well, the difference was about half) when the ratio is 1:1 (i.e. splitting "ababababa" on 'a' or "a"), but gets worse as you put more content between the separators.  That should be a large clue.

-Steve
March 04, 2013
On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> That's comparable, please file an enh request.

http://d.puremagic.com/issues/show_bug.cgi?id=9646
March 04, 2013
On Saturday, 2 March 2013 at 17:20:23 UTC, Andrei Alexandrescu wrote:
> On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
>> On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
>> <schveiguy@yahoo.com> wrote:
>>
>>> So it's just pure heuristics. Not hard to see how that would affect a
>>> 10 million cycle program.
>>>
>>> We may be able to create a string-specific version of splitter that
>>> would take advantage of the representation.
>>
>> Just found a disturbing artifact: splitter(message, '\n') is more than
>> twice as slow as splitter(message, "\n")
>>
>> -Steve
>
> That is a problem. Could you please file a but report?
>
> Thanks,
>
> Andrei

I would have filed a bug report, but I was WAY too busy.
March 04, 2013
On Saturday, 2 March 2013 at 17:21:17 UTC, Andrei Alexandrescu
wrote:
> On 3/2/13 12:20 PM, Andrei Alexandrescu wrote:
>> On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
>>> On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
>>> <schveiguy@yahoo.com> wrote:
>>>
>>>> So it's just pure heuristics. Not hard to see how that would affect a
>>>> 10 million cycle program.
>>>>
>>>> We may be able to create a string-specific version of splitter that
>>>> would take advantage of the representation.
>>>
>>> Just found a disturbing artifact: splitter(message, '\n') is more than
>>> twice as slow as splitter(message, "\n")
>>>
>>> -Steve
>>
>> That is a problem. Could you please file a but report?
>
> s/but/bug/
>
> :o)

Ohhhhh!
>
> Andrei
March 05, 2013
On 03/03/2013 07:31 PM, bearophile wrote:
> jerro:
>
>> $ time python3 test.py
>
> Are Python3 strings still like wstrings/dstrings, or have they applied
> the patch that makes them UTF8?
>
> Bye,
> bearophile

Looks like 3.3 will store them as utf8, but not 3.2 or below.
May 22, 2016
On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote:
> On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
>
>> That's comparable, please file an enh request.
>
> http://d.puremagic.com/issues/show_bug.cgi?id=9646


Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely:

dmd -O -release -inline -noboundscheck:
std.algorithm.splitter took 5 secs, 170 ms, and 656 μs
MySplitter took 4 secs, 835 ms, 748 μs, and 1 hnsec
my_splitter took 4 secs, 862 ms, 914 μs, and 4 hnsecs

ldc2 -O -release:
std.algorithm.splitter took 3 secs, 540 ms, and 84 μs
MySplitter took 2 secs, 288 ms, and 535 μs
my_splitter took 3 secs, 463 ms, and 897 μs

CPU: Intel i7 M 620 2.67GHz
dmd: v2.070.2
ldc: 0.17.1 (based on DMD v2.068.2 and LLVM 3.7.1)

auto my_splitter(alias pred = "a == b", Range, Separator)(Range r, Separator s)
if (is(typeof(binaryFun!pred(r.front, s.front)) : bool)
        && (hasSlicing!Range || isNarrowString!Range)
        && isForwardRange!Separator
        && (hasLength!Separator || isNarrowString!Separator))
{
    import std.algorithm.searching : find;
    import std.conv : unsigned;

    static struct Result
    {
    private:
        Range _input;
        Separator _separator;
        size_t _next_index;
        bool _empty;

        @property auto separatorLength() { return _separator.length; }

        void findIndex()
        {
            auto found = find!pred(_input, _separator);
            _next_index = _input.length - found.length;
        }

    public:
        this(Range input, Separator separator)
        {
            _input = input;
            _separator = separator;
            _empty = false;
            findIndex();
        }

        @property Range front()
        {
            assert(!empty);
            return _input[0 .. _next_index];
        }

        static if (isInfinite!Range)
        {
            enum bool empty = false;  // Propagate infiniteness
        }
        else
        {
            @property bool empty()
            {
                return _empty;
            }
        }

        void popFront()
        {
            assert(!empty);
            if (_input.empty) {
                _empty = true;
                assert(_next_index == 0);
            } else if (_input.length == _next_index) {
                _input = _input[$ .. $];
                _empty = true;
            } else {
                _input = _input[_next_index + separatorLength .. $];
                findIndex();
            }
        }

        static if (isForwardRange!Range)
        {
            @property typeof(this) save()
            {
                auto ret = this;
                ret._input = _input.save;
                return ret;
            }
        }
    }

    return Result(r, s);
}

May 23, 2016
On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:
> On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote:
>> [...]
>
>
> Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely:
>
> [...]

have you thought about opening a PR to improve `splitter`?