Thread overview | |||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
January 26, 2021 200-600x slower Dlang performance with nested foreach loop | ||||
---|---|---|---|---|
| ||||
Greetings Dlang wizards, I seek knowledge/understanding of a very frustrating phenomenon I've experienced over the past several days. The problem space: 1) Read a list of strings from a file 2) De-duplicate all strings into the subset of unique strings 3) Sort the subset of unique strings by descending length and then by ascending lexicographic identity 4) Iterate through the sorted subset of unique strings, identifying smaller sequences with perfect identity to their largest possible parent string I have written a Dlang program that performantly progresses through step #3 above. I used a built-in AA (associative array) to uniquely de-duplicate the initial set of strings and then used multiSort(). Performance was good up till this point, especially with use of the LDC compiler. Things went sideways at step #4: because multiSort() returns a SortedRange, I used .array to convert the returned SortedRange into an array of type string[]. This appeared to work, and neither DMD nor LDC threw any warnings/errors for doing this. With the formally returned array, I then attempted to construct a double foreach loop to iterate through the sorted array of unique strings and find substring matches. foreach( i, ref pStr; sortedArr ) { foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] ) { if( indexOf( pStr, cStr ) > -1 ) { // ... } } } Before adding the code excerpt above, the Dlang program was taking ~1 second on an input file containing approx. 64,000 strings. By adding the code above, the program now takes 6 minutes to complete. An attempt was made to more efficiently perform ASCII-only substring searching by converting the sorted string[] into ubyte[][] and then using countUntil() instead of indexOf(), but this had an effect that was completely opposite to what I had previously experienced: the program then took over 20 minutes to complete! Thus, I am entirely baffled. My first attempt to solve this problem space used a small Perl program to perform steps 1 through 3, which would then pipe intermediate output to a small Dlang program handling only step #4 using dynamic arrays (no use of AAs) of ubyte[][] with use of countUntil(). The Dlang code for the nested foreach block above is essentially near-identical between my two Dlang implementations. Yet, the second implementation--where I'm trying to solve the entire problem space in D--has absolutely failed in terms of performance. Perl+D, ubyte[][], countUntil() :: under 2 seconds only D, string[], indexOf() :: ~6 minutes only D, ubyte[][], countUntil() :: >20 minutes Please advise. This nightmarish experience is shaking my confidence in using D. |
January 26, 2021 Re: 200-600x slower Dlang performance with nested foreach loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to methonash | On Tuesday, 26 January 2021 at 17:40:36 UTC, methonash wrote: > foreach( i, ref pStr; sortedArr ) > { > foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] ) > { > if( indexOf( pStr, cStr ) > -1 ) > { > // ... > } > } > } > > Before adding the code excerpt above, the Dlang program was taking ~1 second on an input file containing approx. 64,000 strings. > > By adding the code above, the program now takes 6 minutes to complete. It would be much easier for us to help you with this if you could post the full program, or at the very least a reduced version that reproduces the same issue. [1] Since your attempts so far have failed to fix the problem, it is quite likely that some part of the code you do not suspect is actually to blame. [1] https://idownvotedbecau.se/nomcve/ |
January 26, 2021 Re: 200-600x slower Dlang performance with nested foreach loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to methonash | On Tuesday, 26 January 2021 at 17:40:36 UTC, methonash wrote:
> Greetings Dlang wizards,
>
> I seek knowledge/understanding of a very frustrating phenomenon I've experienced over the past several days.
>
> [...]
Source please 👍
|
January 26, 2021 Re: 200-600x slower Dlang performance with nested foreach loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to Paul Backus | On Tuesday, 26 January 2021 at 17:56:22 UTC, Paul Backus wrote:
>
> It would be much easier for us to help you with this if you could post the full program, or at the very least a reduced version that reproduces the same issue. [1] Since your attempts so far have failed to fix the problem, it is quite likely that some part of the code you do not suspect is actually to blame.
I cannot post the full source code.
Regarding a reduced version reproducing the issue: well, that's exactly what the nested foreach loop does. Without it, the program reaches that point quickly.
With the nested foreach block, it slows to a crawl. More specifically, commenting-out the indexOf() or countUntil() sub-blocks preserves fast performance, but I'm not sure if that may be related to compiler optimizations realizing that there's nothing but "dead/nonexistent code" inside the loops and generating a binary that just never goes there.
If this may help: I've composed the second Dlang implementation as one extended block of code within main() and am thinking of soon refactoring the code into different functions. I remain pessimistic of whether this may help.
Is there any possibility this could be GC-related?
|
January 26, 2021 Re: 200-600x slower Dlang performance with nested foreach loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to methonash | On Tue, Jan 26, 2021 at 05:40:36PM +0000, methonash via Digitalmars-d-learn wrote: [...] > 1) Read a list of strings from a file > 2) De-duplicate all strings into the subset of unique strings > 3) Sort the subset of unique strings by descending length and then by > ascending lexicographic identity > 4) Iterate through the sorted subset of unique strings, identifying smaller > sequences with perfect identity to their largest possible parent string [...] > Things went sideways at step #4: because multiSort() returns a SortedRange, I used .array to convert the returned SortedRange into an array of type string[]. Do not do this. Every time you call .array it allocates a new array and copies all its contents over. If this code runs frequently, it will cause a big performance hit, not to mention high GC load. The function you're looking for is .release, not .array. [...] > With the formally returned array, I then attempted to construct a double foreach loop to iterate through the sorted array of unique strings and find substring matches. > > foreach( i, ref pStr; sortedArr ) > { > foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] ) > { > if( indexOf( pStr, cStr ) > -1 ) > { > // ... > } > } > } > > Before adding the code excerpt above, the Dlang program was taking ~1 second on an input file containing approx. 64,000 strings. > > By adding the code above, the program now takes 6 minutes to complete. That nested loop is an O(n^2) algorithm. Meaning it will slow down *very* quickly as the size of the array n increases. You might want to think about how to improve this algorithm. > An attempt was made to more efficiently perform ASCII-only substring searching by converting the sorted string[] into ubyte[][] and then using countUntil() instead of indexOf(), but this had an effect that was completely opposite to what I had previously experienced: the program then took over 20 minutes to complete! How are you doing the conversion? If you're using std.conv.to or something like that, it will definitely cause a big performance hit because of the needless allocations and copying. You probably want a direct cast instead. I.e., you want to reinterpret the array reference, not transcribe a copy of it into a ubyte[][]. Probably what you're looking for is to use .representation and .countUntil, or maybe just .canFind to bypass the decoding overhead. (If indeed that is the bottleneck; it may not be. Have you used a profiler to identify where the hotspot is?) > Thus, I am entirely baffled. > > My first attempt to solve this problem space used a small Perl program to perform steps 1 through 3, which would then pipe intermediate output to a small Dlang program handling only step #4 using dynamic arrays (no use of AAs) of ubyte[][] with use of countUntil(). Using AA's may not necessarily improve performance. It depends on what your code does with it. Because AA's require random access to memory, it's not friendly to the CPU cache hierarchy, whereas traversing linear arrays is more cache-friendly and in some cases will out-perform AA's. > The Dlang code for the nested foreach block above is essentially near-identical between my two Dlang implementations. Yet, the second implementation--where I'm trying to solve the entire problem space in D--has absolutely failed in terms of performance. > > Perl+D, ubyte[][], countUntil() :: under 2 seconds > only D, string[], indexOf() :: ~6 minutes > only D, ubyte[][], countUntil() :: >20 minutes > > Please advise. This nightmarish experience is shaking my confidence in using D. First of all, you need to use a profiler to identify where the hotspots are. Otherwise you could well be pouring tons of effort into "optimizing" code that doesn't actually need optimizing, while completely missing the real source of the problem. Whenever you run into performance problems, do not assume you know where the problem is, profile, profile, profile! Second, you only posted a small fragment of your code, so it's hard to say where the problem really is. I can only guess based on what you described. If you could post the entire program, or at least a complete, compilable and runnable excerpt thereof that displays the same (or similar) performance problems, then we could better help you pinpoint where the problem is. In general, though, things to look for are: (1) Unnecessary memory allocations, e.g., calling .array on a SortedRange when you really should be using .release, or calling a conversion function to transcribe an array instead of just casting to reinterpret it; (2) Algorithms with poor big-O characteristics, e.g., the O(n^2) nested loop you have above. (3) Expensive operations inside inner loops, because the loop nesting magnifies any slowness in the code. But above all, before randomly changing your code in the hopes that you will hit upon a solution, use a profiler. Don't shoot in the dark; use a profiler to identify the actual performance bottlenecks. Don't waste time optimizing things that don't need optimizing; focus your efforts on the hotspots identified by your profiler. (And don't think you're smarter than your profiler -- I've been coding for ~20 years, and I still find that my bottlenecks are usually not where I think they are. Profile, profile, profile!) T -- What do you call optometrist jokes? Vitreous humor. |
January 26, 2021 Re: 200-600x slower Dlang performance with nested foreach loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to methonash | On Tue, Jan 26, 2021 at 06:13:54PM +0000, methonash via Digitalmars-d-learn wrote: [...] > I cannot post the full source code. Then we are limited in how much we can help you. > Regarding a reduced version reproducing the issue: well, that's exactly what the nested foreach loop does. Without it, the program reaches that point quickly. By "reduced version" we mean a code exceprt that's (1) compilable, (2) runnable, and (3) exhibits the same problem that you're seeing in your full code. Posting an isolated loop cut out of some unknown function in some unknown module somewhere is not helping us identify where your problem is. > With the nested foreach block, it slows to a crawl. More specifically, commenting-out the indexOf() or countUntil() sub-blocks preserves fast performance, but I'm not sure if that may be related to compiler optimizations realizing that there's nothing but "dead/nonexistent code" inside the loops and generating a binary that just never goes there. When in doubt, use a disassembler to see exactly what is the generated code. > If this may help: I've composed the second Dlang implementation as one extended block of code within main() and am thinking of soon refactoring the code into different functions. I remain pessimistic of whether this may help. As I said in my other reply: don't guess, profile! Randomly changing your code in the hopes that it will help, in general, won't. > Is there any possibility this could be GC-related? Without more information and a complete, compilable example, it's anybody's guess. T -- Debian GNU/Linux: Cray on your desktop. |
January 26, 2021 Re: 200-600x slower Dlang performance with nested foreach loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to methonash | On 1/26/21 12:40 PM, methonash wrote:
> My first attempt to solve this problem space used a small Perl program to perform steps 1 through 3, which would then pipe intermediate output to a small Dlang program handling only step #4 using dynamic arrays (no use of AAs) of ubyte[][] with use of countUntil().
>
> The Dlang code for the nested foreach block above is essentially near-identical between my two Dlang implementations. Yet, the second implementation--where I'm trying to solve the entire problem space in D--has absolutely failed in terms of performance.
>
> Perl+D, ubyte[][], countUntil() :: under 2 seconds
> only D, string[], indexOf() :: ~6 minutes
> only D, ubyte[][], countUntil() :: >20 minutes
Maybe try a different approach.
Replace the perl code with D, and still have it output to the same small D program that processes the results. It seems from your description that everything is "identical" that if your conclusions are correct, it should be at least as fast as the Perl+D version.
But I think you are missing something else.
-Steve
|
January 26, 2021 Re: 200-600x slower Dlang performance with nested foreach loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to methonash | On Tuesday, 26 January 2021 at 17:40:36 UTC, methonash wrote:
> foreach( i, ref pStr; sortedArr )
> {
> foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] )
> {
> if( indexOf( pStr, cStr ) > -1 )
> {
> // ...
> }
> }
> }
>
> Before adding the code excerpt above, the Dlang program was taking ~1 second on an input file containing approx. 64,000 strings.
What's the typical length of your strings?
|
January 26, 2021 Re: 200-600x slower Dlang performance with nested foreach loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to mw | On Tuesday, 26 January 2021 at 21:55:47 UTC, mw wrote:
> On Tuesday, 26 January 2021 at 17:40:36 UTC, methonash wrote:
>> foreach( i, ref pStr; sortedArr )
>> {
>> foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] )
>> {
>> if( indexOf( pStr, cStr ) > -1 )
>> {
>> // ... yourInnerOp
>> }
>> }
>> }
>>
>> Before adding the code excerpt above, the Dlang program was taking ~1 second on an input file containing approx. 64,000 strings.
>
> What's the typical length of your strings?
Actually, I think it's reasonable given your algo:
Your algo (double-loop) is O(n^2), n = 64,000
so the loop will run n^2 = 4,096,000,000 i.e 4G
Suppose your CPU is 2GHz, and suppose each loop operation take just 1 machine cycle (very unlikely), this algo will take 2 seconds.
However, string searching i.e `indexOf`, or `yourInnerLoop` can easily take hundreds of cycles, let's suppose it's 100 machine cycles (still a very low estimate), then the algo will take ~200 seconds = ~3.3 minutes.
If you want, you can try to rewrite your algo in Java or Python, and compare the run time with the Dlang version.
|
January 26, 2021 Re: 200-600x slower Dlang performance with nested foreach loop | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Tuesday, 26 January 2021 at 18:17:31 UTC, H. S. Teoh wrote: > Do not do this. Every time you call .array it allocates a new array and copies all its contents over. If this code runs frequently, it will cause a big performance hit, not to mention high GC load. > > The function you're looking for is .release, not .array. Many thanks for the tip! I look forward to trying this soon. For reference, the .array call is only performed once. > That nested loop is an O(n^2) algorithm. Meaning it will slow down *very* quickly as the size of the array n increases. You might want to think about how to improve this algorithm. Nice observation, and yes, this would typically be an O(n^2) approach. However, due to subsetting the input dataset to unique strings and then sorting in descending length, one might notice that the inner foreach loop does not iterate over all of n, only on the iterator value i+1 through the end of the array. Thus, I believe this would then become approximately O(n^2/2). More precisely, it should be O( ( n^2 + n ) / 2 ). Further: the original dataset has 64k strings. Squaring that yields 4.1 billion string comparisons. Once uniquely de-duplicated, the dataset is reduced to ~46k strings. Considering roughly O(n^2/2) at this level, this yields 1.06 billion string comparisons. So, performing steps 1 through 3 improves the brute-force string comparison problem four-fold in my test development dataset. > Using AA's may not necessarily improve performance. It depends on what your code does with it. Because AA's require random access to memory, it's not friendly to the CPU cache hierarchy, whereas traversing linear arrays is more cache-friendly and in some cases will out-perform AA's. I figured a built-in AA might be an efficient path to performing unique string de-duplication. If there's a more performant method available, I'll certainly try it. > First of all, you need to use a profiler to identify where the hotspots are. Otherwise you could well be pouring tons of effort into "optimizing" code that doesn't actually need optimizing, while completely missing the real source of the problem. Whenever you run into performance problems, do not assume you know where the problem is, profile, profile, profile! Message received. Given that D is the first compiled language I've semi-seriously dabbled with, I have no real experience with profiler usage. > Second, you only posted a small fragment of your code, so it's hard to say where the problem really is. I can only guess based on what you described. If you could post the entire program, or at least a complete, compilable and runnable excerpt thereof that displays the same (or similar) performance problems, then we could better help you pinpoint where the problem is. Yes, I'll be looking to present a complete, compilable, and executable demo of code for this issue if/when subsequent efforts continue to fail. For professional reasons (because I no longer work in academia), I cannot share the original source code for the issue presented here, but I can attempt to reproduce it in a minimally complete form for a public dataset. |
Copyright © 1999-2021 by the D Language Foundation