March 22, 2015
On 3/22/15 10:13 AM, tcak wrote:
> On Sunday, 22 March 2015 at 16:03:11 UTC, Andrei Alexandrescu wrote:
>> On 3/22/15 3:10 AM, Vladimir Panteleev wrote:
>>> On Sunday, 22 March 2015 at 07:03:14 UTC, Andrei Alexandrescu wrote:
>>>> * For each line read there was a call to malloc() and one to free(). I
>>>> set things up so that the buffer used for reading is reused, simply by
>>>> making the buffer static.
>>>
>>> What about e.g.
>>>
>>> zip(File("a.txt").byLine, File("b.txt").byLine)
>>
>> No matter, the static buffer is copied into the result. -- Andrei
>
> I didn't see the code though; won't using a "static" buffer make the
> function thread-unsafe?

D's statics are thread-local. -- Andrei
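A minimal sketch of what thread-local statics mean in practice (names here are illustrative): every thread gets its own copy of a module-level or static variable unless it is marked shared or __gshared, so a static buffer cannot be clobbered by another thread.

```d
import core.thread : Thread;

int[] buffer; // module-level, thread-local by default in D

void main()
{
    buffer = [1, 2, 3]; // the main thread's copy

    auto t = new Thread({
        // A new thread sees a fresh, unshared `buffer`.
        assert(buffer is null);
        buffer = [4, 5, 6]; // does not affect the main thread
    });
    t.start();
    t.join();

    assert(buffer == [1, 2, 3]); // main thread's copy is untouched
}
```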

March 23, 2015
On 3/22/15 3:03 AM, Andrei Alexandrescu wrote:

> * assumeSafeAppend() was unnecessarily used once per line read. Its
> removal led to a whopping 35% speedup on top of everything else. I'm not
> sure what it does, but boy does it take its sweet time. Maybe someone
> should look into it.

That's not expected. assumeSafeAppend should be pretty quick, and DEFINITELY should not be a significant percentage of reading lines. I will look into it.

Just to verify, your test application was a simple byLine loop?

-Steve
March 23, 2015
On 3/23/15 7:52 AM, Steven Schveighoffer wrote:
> On 3/22/15 3:03 AM, Andrei Alexandrescu wrote:
>
>> * assumeSafeAppend() was unnecessarily used once per line read. Its
>> removal led to a whopping 35% speedup on top of everything else. I'm not
>> sure what it does, but boy does it take its sweet time. Maybe someone
>> should look into it.
>
> That's not expected. assumeSafeAppend should be pretty quick, and
> DEFINITELY should not be a significant percentage of reading lines. I
> will look into it.

Thanks!

> Just to verify, your test application was a simple byLine loop?

Yes, the code was that in http://stackoverflow.com/questions/28922323/improving-line-wise-i-o-operations-in-d/29153508#29153508


Andrei

March 23, 2015
On Sunday, 22 March 2015 at 07:03:14 UTC, Andrei Alexandrescu wrote:
> I just took a look at making byLine faster. It took less than one evening:
>
> https://github.com/D-Programming-Language/phobos/pull/3089
>
> I confess I am a bit disappointed with the leadership being unable to delegate this task to a trusty lieutenant in the community. There's been a bug open on this for a long time; it gets regularly discussed here (with the wrong conclusions about performance bottlenecks ("we must redo D's I/O because FILE* is killing it!") drawn from unverified assumptions), and the techniques used to get a marked improvement in the diff above are trivial fare for any software engineer. The following factors each had a significant impact on speed:
>
> * On OSX (which I happened to test with) getdelim() exists but wasn't being used. I made the implementation use it.
>
> * There was one call to fwide() per line read. I used simple caching (a stream's width cannot be changed once set, making it a perfect candidate for caching).
>
> (As an aside there was some unreachable code in ByLineImpl.empty, which didn't impact performance but was overdue for removal.)
>
> * For each line read there was a call to malloc() and one to free(). I set things up so that the buffer used for reading is reused, simply by making the buffer static.
>
> * assumeSafeAppend() was unnecessarily used once per line read. Its removal led to a whopping 35% speedup on top of everything else. I'm not sure what it does, but boy does it take its sweet time. Maybe someone should look into it.
>
> Destroy.
>
>
> Andrei
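The fwide() caching and static-buffer reuse described above can be sketched roughly like this (hypothetical names; the actual pull request may differ in detail):

```d
import core.stdc.stdio : FILE;
import core.stdc.wchar_ : fwide;

struct ByLineImpl
{
    FILE* fp;
    bool widthKnown;
    int cachedWidth;

    // A stream's orientation cannot change once set, so one
    // fwide() call can be cached instead of one call per line.
    int width()
    {
        if (!widthKnown)
        {
            cachedWidth = fwide(fp, 0);
            widthKnown = true;
        }
        return cachedWidth;
    }

    // Reusing one static (thread-local) buffer avoids a
    // malloc()/free() pair per line read.
    static char[] buf;
}
```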

What would be really great would be a performance test suite for phobos. D is reaching a point where "It'll probably be fast because we did it right" or "I remember it being fast-ish 3 years ago when I wrote a small toy test" isn't going to cut it. Real data is needed, with comparisons to other languages where possible.
March 23, 2015
On Monday, 23 March 2015 at 15:00:07 UTC, John Colvin wrote:
> What would be really great would be a performance test suite for phobos.

I'm working on it:

https://github.com/D-Programming-Language/phobos/pull/2995

March 23, 2015
On Monday, 23 March 2015 at 15:00:07 UTC, John Colvin wrote:

> What would be really great would be a performance test suite for phobos. D is reaching a point where "It'll probably be fast because we did it right" or "I remember it being fast-ish 3 years ago when I wrote a small toy test" isn't going to cut it. Real data is needed, with comparisons to other languages where possible.

I made the same test in C# using a 30MB plain ASCII text file. Compared to the fastest method proposed by Andrei, the results are not the best:

D:
readText.representation.count!(c => c == '\n') - 428 ms
byChunk(4096).joiner.count!(c => c == '\n') - 1160 ms

C#:
File.ReadAllLines.Length - 216 ms;

Win64, D 2.066.1; optimizations were turned on in both cases.

The .NET code is clearly not performance-oriented (http://referencesource.microsoft.com/#mscorlib/system/io/file.cs,675b2259e8706c26); I suspect that the .NET runtime is performing some optimizations under the hood.
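For reference, the D side of such a measurement could be reproduced with a harness along these lines (the file name and timing details are illustrative; std.datetime.stopwatch is the current module name for StopWatch):

```d
import std.algorithm : count, joiner;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.file : readText;
import std.stdio : File, writeln;
import std.string : representation;

void main()
{
    // Variant 1: read the whole file, validate as UTF-8, count newlines.
    auto sw = StopWatch(AutoStart.yes);
    auto n1 = readText("input.txt").representation.count!(c => c == '\n');
    writeln("readText: ", n1, " in ", sw.peek.total!"msecs", " ms");

    // Variant 2: stream the file in 4096-byte chunks.
    sw.reset();
    auto n2 = File("input.txt").byChunk(4096).joiner.count!(c => c == '\n');
    writeln("byChunk:  ", n2, " in ", sw.peek.total!"msecs", " ms");
}
```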


March 23, 2015
On 3/23/15 10:43 AM, rumbu wrote:
> On Monday, 23 March 2015 at 15:00:07 UTC, John Colvin wrote:
>
>> What would be really great would be a performance test suite for
>> phobos. D is reaching a point where "It'll probably be fast because we
>> did it right" or "I remember it being fast-ish 3 years ago when I
>> wrote a small toy test" isn't going to cut it. Real data is needed,
>> with comparisons to other languages where possible.
>
> I made the same test in C# using a 30MB plain ASCII text file. Compared
> to the fastest method proposed by Andrei, the results are not the best:
>
> D:
> readText.representation.count!(c => c == '\n') - 428 ms
> byChunk(4096).joiner.count!(c => c == '\n') - 1160 ms
>
> C#:
> File.ReadAllLines.Length - 216 ms;
>
> Win64, D 2.066.1, Optimizations were turned on in both cases.
>
> The .net code is clearly not performance oriented
> (http://referencesource.microsoft.com/#mscorlib/system/io/file.cs,675b2259e8706c26),
> I suspect that .net runtime is performing some optimizations under the
> hood.

At this point it comes down to the performance of std.algorithm.count, which could and should be improved. This code runs about 2.5x faster than count and brings it into the zone of wc -l, which is probably near the lower bound achievable:

  import core.stdc.string : memchr;
  import std.file : readText;
  import std.string : representation;

  size_t linect = 0;
  auto bytes = args[1].readText.representation;
  for (auto p = bytes.ptr, lim = p + bytes.length;; )
  {
    auto r = cast(immutable(ubyte)*) memchr(p, '\n', lim - p);
    if (!r) break;
    ++linect;
    p = r + 1;
  }

Would anyone want to put some work into accelerating count?


Andrei
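One way such an acceleration could look is a memchr fast path for the common case of counting a single byte; this is a hypothetical sketch, not actual Phobos code:

```d
import core.stdc.string : memchr;

// Hypothetical fast path: counting occurrences of one byte in a
// ubyte[] can delegate to C's memchr instead of a generic
// element-by-element loop.
size_t fastCount(const(ubyte)[] haystack, ubyte needle)
{
    size_t n = 0;
    auto p = haystack.ptr;
    const lim = p + haystack.length;
    while (p < lim)
    {
        auto r = cast(const(ubyte)*) memchr(p, needle, lim - p);
        if (r is null) break;
        ++n;
        p = r + 1;
    }
    return n;
}
```

A generic count could dispatch to such a routine when the predicate is a simple equality test against a single byte.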

March 23, 2015
> I made the same test in C# using a 30MB plain ASCII text file. Compared to the fastest method proposed by Andrei, the results are not the best:
>
> D:
> readText.representation.count!(c => c == '\n') - 428 ms
> byChunk(4096).joiner.count!(c => c == '\n') - 1160 ms
>
> C#:
> File.ReadAllLines.Length - 216 ms;
>
> Win64, D 2.066.1, Optimizations were turned on in both cases.
>
> The .net code is clearly not performance oriented (http://referencesource.microsoft.com/#mscorlib/system/io/file.cs,675b2259e8706c26), I suspect that .net runtime is performing some optimizations under the hood.

Does the C# version validate the input? Using std.file.read instead of readText.representation halves the runtime on my machine.
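The std.file.read variant mentioned here skips UTF validation entirely; a sketch (file name is illustrative):

```d
import std.algorithm : count;
import std.file : read;
import std.stdio : writeln;

void main()
{
    // read() returns untyped bytes and performs no UTF validation,
    // unlike readText(), which validates the whole file as UTF-8.
    auto bytes = cast(ubyte[]) read("input.txt");
    writeln(bytes.count!(c => c == '\n'));
}
```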
March 23, 2015
On Monday, 23 March 2015 at 19:25:08 UTC, Tobias Pankrath wrote:
>> I made the same test in C# using a 30MB plain ASCII text file. Compared to the fastest method proposed by Andrei, the results are not the best:
>>
>> D:
>> readText.representation.count!(c => c == '\n') - 428 ms
>> byChunk(4096).joiner.count!(c => c == '\n') - 1160 ms
>>
>> C#:
>> File.ReadAllLines.Length - 216 ms;
>>
>> Win64, D 2.066.1, Optimizations were turned on in both cases.
>>
>> The .net code is clearly not performance oriented (http://referencesource.microsoft.com/#mscorlib/system/io/file.cs,675b2259e8706c26), I suspect that .net runtime is performing some optimizations under the hood.
>
> Does the C# version validate the input? Using std.file.read instead of readText.representation halves the runtime on my machine.

Source code is available at the link above. Since the C# version works internally with streams and UTF-16 chars, the pseudocode looks like this:

---
initialize a LIST with 16 items;
while (!eof)
{
  read 4096 bytes into a buffer;
  decode them to UTF-16 in a wchar[] buffer;
  while (more data in the buffer)
  {
    read from buffer until (\n or \r\n or \r);
    discard the end of line;
    if (no more space in LIST)
      double its size;
    add the line to LIST;
  }
}
return number of items in the LIST;
---
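For comparison, the same chunked approach can be sketched in D; the buffer size and newline handling mirror the pseudocode, and this is illustrative only, not the .NET implementation:

```d
import std.stdio : File;

// Count lines the way the pseudocode above splits them:
// \n, \r\n, and a lone \r each terminate a line.
size_t countLines(string path)
{
    auto f = File(path, "rb");
    ubyte[4096] storage;
    size_t lines = 0;
    bool prevCR = false;
    foreach (chunk; f.byChunk(storage[]))
    {
        foreach (b; chunk)
        {
            if (b == '\n')
            {
                ++lines;          // counts both "\n" and "\r\n" once
                prevCR = false;
            }
            else
            {
                if (prevCR) ++lines; // a lone \r terminated a line
                prevCR = (b == '\r');
            }
        }
    }
    if (prevCR) ++lines; // trailing lone \r
    return lines;
}
```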

Since this code is clearly not the best for this task, I looked into the jitted code and, as I suspected, it seems that the .NET runtime is smart enough to recognize this pattern and is doing the following:
- file is mapped into memory using CreateFileMapping
- does not perform any decoding, since \r and \n are ASCII
- does not create any list
- searches incrementally for \r, \r\n, \n using CompareStringA and LOCALE_INVARIANT and increments at each end of line
- there is no temporary memory allocation since searching is performed directly on the mapping handle
- returns the count.

March 23, 2015
On 3/23/15 2:13 PM, rumbu wrote:
>
> Since this code is clearly not the best for this task, as I suspected, I
> looked into jitted code and it seems that the .net runtime is smart
> enough to recognize this pattern and is doing the following:
> - file is mapped into memory using CreateFileMapping
> - does not perform any decoding, since \r and \n are ASCII
> - does not create any list
> - searches incrementally for \r, \r\n, \n using CompareStringA and
> LOCALE_INVARIANT and increments at each end of line
> - there is no temporary memory allocation since searching is performed
> directly on the mapping handle
> - returns the count.

This is great investigative and measuring work. Thanks! -- Andrei