September 14, 2015
On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote:
> I decided to give the code a spin with `gdc -O3 -pg`. Turns out that the hotspot is in std.array.split, contrary to expectations. :-)  Here are the first few lines of the gprof output:
>
> [...]

Perhaps using the new rangified splitter instead of split would help.
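For reference, a minimal sketch of the difference (the tab-separated line is just a made-up stand-in for the input format): std.array.split eagerly allocates an array of slices, while std.algorithm.splitter yields the same slices lazily.

import std.array : split;
import std.algorithm : splitter;

void main()
{
    auto line = "query1\tsubject1\t90.5";

    auto eager  = line.split('\t');     // allocates a new array of slices up front
    auto fields = line.splitter('\t');  // lazy range over the same slices, no array allocation

    foreach (field; fields)
    {
        // process each field as it is produced
    }
}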
September 14, 2015
On Mon, Sep 14, 2015 at 08:07:45PM +0000, Kapps via Digitalmars-d-learn wrote:
> On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote:
> >I decided to give the code a spin with `gdc -O3 -pg`. Turns out that the hotspot is in std.array.split, contrary to expectations. :-) Here are the first few lines of the gprof output:
> >
> >[...]
> 
> Perhaps using the new rangified splitter instead of split would help.

I tried it. It was slower, surprisingly. I didn't dig deeper into why.


T

-- 
I see that you JS got Bach.
September 15, 2015
On 15/09/15 5:41 AM, NX wrote:
> On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole wrote:
>> A lot of this hasn't been covered I believe.
>>
>> http://dpaste.dzfl.pl/f7ab2915c3e1
>
> I believe that should be:
> foreach (query, ref value; hitlists)
> Since an assignment is happening there..?

Probably.
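For what it's worth, a minimal sketch of why the ref matters (the element type of hitlists is a guess here): without ref, value is a copy of the stored slice, so appending to it does not update the associative array; with ref the stored entry itself is updated.

void main()
{
    int[][string] hitlists;
    hitlists["query1"] = [1, 2];

    // `value` refers to the slice stored in the AA, so the append is visible
    // through hitlists; without `ref`, only a local copy would be extended.
    foreach (query, ref value; hitlists)
        value ~= 3;

    assert(hitlists["query1"] == [1, 2, 3]);
}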
September 15, 2015
On Monday, 14 September 2015 at 15:04:12 UTC, John Colvin wrote:
>
> I've had nothing but trouble when using different versions of libc. It would be easier to do this instead: http://wiki.dlang.org/Building_LDC_from_source
>
> I'm running a build of LDC git HEAD right now on an old server with 2.11, I'll upload the result somewhere once it's done if it might be useful

Thanks for the offer, but don't go out of your way for my sake. Maybe I'll just build this in a clean environment instead of on my work computer to get rid of all the hassle. The Red Hat llvm-devel packages are broken: they depend on libffi-devel, which is unavailable. Getting the build environment up to speed on my main machine would take a lot more time than I have right now.

I tried building LDC from scratch, but it fails because of missing LLVM components, despite having LLVM 3.4.2 installed (though without the devel components).
September 15, 2015
On Monday, 14 September 2015 at 16:13:14 UTC, Edwin van Leeuwen wrote:
> See this link for clarification on what the columns/numbers in the profile file mean
> http://forum.dlang.org/post/f9gjmo$2gce$1@digitalmars.com
>
> It is still difficult to parse though. I myself often use sysprof (only available on linux), which automatically ranks by time spent.

Thanks for the link. I read up on what everything means, but the problem isn't finding what consumes the most time; it's that I don't know the standard library well enough to map the largest consumers back to actual parts of my code :).
September 15, 2015
On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole wrote:
>
> A lot of this hasn't been covered I believe.
>
> http://dpaste.dzfl.pl/f7ab2915c3e1
>
> 1) You don't need to convert char[] to string via to!string. That's too much work; cast it instead.
> 2) You don't need byKey; use the foreach key, value syntax. That way you won't go around modifying things unnecessarily.
>
> Ok, I disabled the GC + reserved a bunch of memory. It probably won't help much actually. In fact it may make it fail, so keep that in mind.
>
> Humm, what else.
>
> I'm worried about that first foreach. I don't think it needs to exist as it does; I believe an input range would be far better. Use a buffer to store the Hit[]s, with a subset per set of them.
>
> If the first foreach is an input range, then things become slightly easier in the second. Now you can turn that into its own input range.
> Also that .array usage concerns me. Many an allocation there! Which is why the input range should be what it returns.
>
> The last foreach is, let's assume, a dummy. Keep in mind, stdout is expensive here. DO NOT USE. If you must output, buffer it and write in large quantities.
>
>
> Based upon what I can see, you are definitely not able to use your CPUs to the max. There is no way they are the limiting factor here. Maybe your usage of a single core is, but not the CPUs themselves.
>
> The thing is, you cannot use multiple threads on that first foreach loop to speed things up. That needs to happen all on one thread.
> Instead, after that thread you need to push the result into another.
>
> Perhaps, per thread, one lock (mutex) + buffer for hits. Go round robin over all the threads. If a mutex is still locked, you'll need to wait. In this situation a locked mutex means all your worker threads are working, so you can't do anything more anyway.
>
> Of course after all this, the HDD may still be getting hit too hard, in which case I would recommend memory-mapping the file. That should allow the OS to handle reading it into memory more efficiently. But you'll need to rework the .byLine usage for that.
>
>
> Wow that was a lot at 4:30am! So don't take it too seriously. I'm sure somebody else will rip that to shreds!

Thanks for your suggestions! That sure is a lot of details. I'll have to go through them carefully to understand what to do with all this. Going multithreaded sounds fun, but would effectively kill off all of my spare time, so I might have to skip that. :)

Using char[] all around might be a good idea, but it doesn't seem like the string conversions are really that taxing. What are the arguments for working on char[] arrays rather than strings?

I'm aware that printing output like that is a performance killer, but it's not supposed to write anything in the final program. It's just there for me to be able to compare the results to my Python code.
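On the output point, a minimal sketch of the "buffer and write in large quantities" idea (the loop body is just a placeholder): accumulate the formatted text in an appender and flush it with a single write, instead of calling writeln once per result.

import std.array : appender;
import std.format : formattedWrite;
import std.stdio : write;

void main()
{
    auto buf = appender!string();
    foreach (i; 0 .. 100_000)
        buf.formattedWrite("%s\t%s\n", i, i * i); // accumulate everything in memory
    write(buf.data);                              // one large write instead of 100,000 small ones
}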
September 15, 2015
On Monday, 14 September 2015 at 18:08:31 UTC, John Colvin wrote:
> On Monday, 14 September 2015 at 17:51:43 UTC, CraigDillabaugh wrote:
>> On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote:
>>> [...]
>>
>> I am going to go off the beaten path here.  If you really want speed
>> for a file like this one way of getting that is to read the file
>> in as a single large binary array of ubytes (or in blocks if its too big)
>> and parse the lines yourself. Should be fairly easy with D's array slicing.
>
> my favourite for streaming a file:
> enum chunkSize = 4096;
> File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner()

Is this an efficient way of reading this type of file? What should one keep in mind when choosing chunkSize?
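For context, a rough sketch of that pattern in use (the line-counting loop and the file argument are just placeholders): byChunk reads fixed-size ubyte[] blocks, and joiner presents them as one continuous character range. A chunkSize of 4096 (or a multiple of it) is a common choice since it matches typical page and filesystem block sizes; larger chunks mean fewer read calls at the cost of a bigger buffer.

import std.stdio : File, writeln;
import std.algorithm : map, joiner;

enum chunkSize = 4096; // one block per read; often a multiple of the page size

void main(string[] args)
{
    auto chars = File(args[1])
        .byChunk(chunkSize)                 // lazily read ubyte[] blocks of chunkSize bytes
        .map!(chunk => cast(char[]) chunk)  // reinterpret each block as char[]
        .joiner();                          // flatten into one continuous range of chars

    size_t lines;
    foreach (c; chars)
        if (c == '\n')
            ++lines;
    writeln(lines, " lines");
}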
September 15, 2015
On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote:
> I tried implementing a crude version of this (see code below), and found that manually calling GC.collect() even as frequently as once every 5000 loop iterations (for a 500,000 line test input file) still gives about 15% performance improvement over completely disabling the GC.  Since most of the arrays involved here are pretty small, the frequency could be reduced to once every 50,000 iterations and you'd pretty much get the 20% performance boost for free, and still not run out of memory too quickly.

Interesting, I'll have to go through your code to understand exactly what's going on. I also noticed some GC-related stuff high up in my profiling, but had no idea what could be done about that. Appreciate the suggestions!
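As a rough illustration of the idea (the file argument and the per-line split are placeholders; the 5,000 interval is the figure from the quoted post): disable automatic collections up front, then call GC.collect() yourself at a fixed iteration interval so the garbage from small per-line allocations is still reclaimed.

import core.memory : GC;
import std.stdio : File;
import std.array : split;

void main(string[] args)
{
    GC.disable();                       // no automatic collection pauses
    size_t i;
    foreach (line; File(args[1]).byLine)
    {
        auto fields = line.split;       // small per-line allocation
        // ... do something with fields ...
        if (++i % 5_000 == 0)
            GC.collect();               // reclaim garbage at a controlled interval
    }
    GC.enable();
}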

September 15, 2015
On Tuesday, 15 September 2015 at 08:51:02 UTC, Fredrik Boulund wrote:
> Using char[] all around might be a good idea, but it doesn't seem like the string conversions are really that taxing. What are the arguments for working on char[] arrays rather than strings?

No, casting to string would be incorrect here because the line buffer is reused; your original usage of to!string is correct.
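To illustrate the point (the file argument is a placeholder): byLine reuses its internal buffer between iterations, so a stored slice of it goes stale; to!string (or .idup) copies the characters into fresh immutable storage, which is why it is needed before keeping a line around.

import std.stdio : File;
import std.conv : to;

void main(string[] args)
{
    string[] kept;
    foreach (line; File(args[1]).byLine)
    {
        // `line` is a slice of a buffer that byLine overwrites on the next
        // iteration; copy it before storing it.
        kept ~= line.to!string;
    }
}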
September 15, 2015
On 15/09/15 9:00 PM, Kagamin wrote:
> On Tuesday, 15 September 2015 at 08:51:02 UTC, Fredrik Boulund wrote:
>> Using char[] all around might be a good idea, but it doesn't seem like
>> the string conversions are really that taxing. What are the arguments
>> for working on char[] arrays rather than strings?
>
> No, casting to string would be incorrect here because the line buffer is
> reused; your original usage of to!string is correct.

I made the assumption it wasn't allocating. Ehhh.