January 27, 2021
On Tuesday, 26 January 2021 at 23:57:43 UTC, methonash wrote:
>> Using AA's may not necessarily improve performance.  It depends on what your code does with it.  Because AA's require random access to memory, it's not friendly to the CPU cache hierarchy, whereas traversing linear arrays is more cache-friendly and in some cases will out-perform AA's.
>
> I figured a built-in AA might be an efficient path to performing unique string de-duplication. If there's a more performant method available, I'll certainly try it.

You could try sorting the array first, and then using `uniq` [1] to discard duplicate elements. There's an example in the docs that shows how to do this in-place (without allocating additional memory).

[1] http://phobos.dpldocs.info/std.algorithm.iteration.uniq.html
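For instance, here's a minimal sketch of the in-place approach (adapted from the pattern shown in the `uniq` docs; the sample data below is made up for illustration):

```d
// Minimal sketch of in-place de-duplication via sort + uniq.
// The sample data is made up for illustration.
import std.algorithm.iteration : uniq;
import std.algorithm.mutation : copy;
import std.algorithm.sorting : sort;

void main()
{
    string[] seqs = ["ba", "a", "cba", "a", "ba"];
    seqs.sort();                      // make duplicates adjacent
    auto rem = seqs.uniq.copy(seqs);  // compact unique elements to the front
    seqs = seqs[0 .. $ - rem.length]; // trim the leftover tail
    assert(seqs == ["a", "ba", "cba"]);
}
```

This never allocates: `uniq` lazily skips adjacent duplicates, and `copy` writes them back over the front of the same array, which is safe because the writes always lag the reads.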
January 26, 2021
On Wed, Jan 27, 2021 at 01:28:33AM +0000, Paul Backus via Digitalmars-d-learn wrote:
> On Tuesday, 26 January 2021 at 23:57:43 UTC, methonash wrote:
> > > Using AA's may not necessarily improve performance.  It depends on what your code does with it.  Because AA's require random access to memory, it's not friendly to the CPU cache hierarchy, whereas traversing linear arrays is more cache-friendly and in some cases will out-perform AA's.
> > 
> > I figured a built-in AA might be an efficient path to performing unique string de-duplication. If there's a more performant method available, I'll certainly try it.
> 
> You could try sorting the array first, and then using `uniq` [1] to discard duplicate elements. There's an example in the docs that shows how to do this in-place (without allocating additional memory).
> 
> [1] http://phobos.dpldocs.info/std.algorithm.iteration.uniq.html

Yes, definitely try this.  This will completely eliminate the overhead of using an AA, which has to allocate memory (at least) once per entry added.  Especially since the data has to be sorted eventually anyway, you might as well sort first then use the sortedness as a convenient property for fast de-duplication.  Since .uniq traverses the range linearly, this will be cache-friendly, and along with eliminating GC load should give you a speed boost.


T

-- 
Nearly all men can stand adversity, but if you want to test a man's character, give him power. -- Abraham Lincoln
January 27, 2021
On Wednesday, 27 January 2021 at 02:14:39 UTC, H. S. Teoh wrote:
> Yes, definitely try this.  This will completely eliminate the overhead of using an AA, which has to allocate memory (at least) once per entry added.  Especially since the data has to be sorted eventually anyway, you might as well sort first then use the sortedness as a convenient property for fast de-duplication.  Since .uniq traverses the range linearly, this will be cache-friendly, and along with eliminating GC load should give you a speed boost.
>
>
> T

Associative arrays allocate per entry added?!

https://github.com/dlang/druntime/blob/master/src/rt/aaA.d#L205 Oh God, associative arrays allocate per entry added!
January 27, 2021
On Wednesday, 27 January 2021 at 14:15:26 UTC, FeepingCreature wrote:
>
> Associative arrays allocate per entry added?!
>
> https://github.com/dlang/druntime/blob/master/src/rt/aaA.d#L205 Oh God, associative arrays allocate per entry added!

Maybe it's to avoid invalidating the result of `key in aa` when adding or removing entries? The spec doesn't say anything about it either way [1], but allowing invalidation would make AAs much more difficult to use in @safe code.

[1] https://dlang.org/spec/hash-map.html
January 27, 2021
On Wednesday, 27 January 2021 at 15:12:32 UTC, Paul Backus wrote:
>
> Maybe it's to avoid invalidating the result of `key in aa` when adding or removing entries? The spec doesn't say anything about it either way [1], but allowing invalidation would make AAs much more difficult to use in @safe code.
>
> [1] https://dlang.org/spec/hash-map.html

Correction: the GC would take care of the safety issue, of course. I haven't had my morning tea yet, and clearly I'm not thinking straight. :)
January 27, 2021
On 1/27/21 10:14 AM, Paul Backus wrote:
> On Wednesday, 27 January 2021 at 15:12:32 UTC, Paul Backus wrote:
>>
>> Maybe it's to avoid invalidating the result of `key in aa` when adding or removing entries? The spec doesn't say anything about it either way [1], but allowing invalidation would make AAs much more difficult to use in @safe code.

Yes, that's the reason.

> 
> Correction: the GC would take care of the safety issue, of course. I haven't had my morning tea yet, and clearly I'm not thinking straight. :)

No, it wouldn't. The problem is not a GC issue, but an issue with aliasing expectations (if you rehash, you completely rearrange the buckets).

so if you have:

auto x = key in aa;

aa[someOtherKey] = value; // rehashes

Code at this point currently can count on x pointing at the item being used in the AA.

If we change it now, lots of code will break.

This is not truly a horrible issue; you can use a custom implementation (see any number of container libraries on code.dlang.org).

What would be nice, though, is to provide an actual template that builtin AAs are an alias for (like string), and then if you want slightly different behavior, you provide different parameters.

But at the end of the day, the builtin AAs will ALWAYS do this. We can't change it now.
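As a hypothetical illustration of that guarantee (example data made up):

```d
// Sketch of the pointer-stability guarantee described above: because each
// AA entry is allocated separately, a pointer obtained via `in` stays
// valid across later insertions, including ones that trigger a rehash.
import std.conv : to;

void main()
{
    int[string] aa;
    aa["key"] = 42;
    auto x = "key" in aa;          // pointer into the AA entry
    foreach (i; 0 .. 10_000)       // force growth and rehashing
        aa[i.to!string] = i;
    assert(*x == 42);              // x still points at the live entry
}
```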

-Steve
January 30, 2021
Greetings all,

Many thanks for sharing your collective perspective and advice thus far! It has been very helpful and instructive. I return bearing live data and a minimally complete, compilable, and executable program to experiment with and potentially optimize. The dataset can be pulled from here:

https://filebin.net/qf2km1ea9qgd5skp/seqs.fasta.gz?t=97kgpebg

Running "cksum" on this file:

1477520542 2199192 seqs.fasta.gz

Naturally, you'll need to gunzip this file. The decompressed file contains strings on every even-numbered line that have already been reduced to the unique de-duplicated subset, and they have already been sorted by descending length and alphabetical identity.

From my initial post, the focus is now entirely on step #4: finding/removing strings that can be entirely absorbed (substringed) by their largest possible parent.

And now for the code:


import std.stdio : writefln, File, stdin;
import std.conv : to;
import std.string : indexOf;

void main()
{
        string[] seqs;

        foreach( line; stdin.byLine() )
        {
                if( line[ 0 ] == '>' ) continue;
                else seqs ~= to!string( line );
        }

        foreach( i; 0 .. seqs.length )
        {
                if( seqs[ i ].length == 0 ) continue;

                foreach( j; i + 1 .. seqs.length )
                {
                        if( seqs[ j ].length == 0 ||
                            seqs[ i ].length == seqs[ j ].length ) continue;

                        if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
                        {
                                seqs[ j ] = "";

                                writefln( "%s contains %s", i, j );
                        }
                }
        }
}


Compile the source and then run the executable via redirecting stdin:

./substr < seqs.fasta

See any straightforward optimization paths here?

Out of curiosity, I experimented with string[] and ubyte[][] and several search functions (indexOf, canFind, countUntil) to find the best performer. My off-the-cuff results:

string[] with indexOf() :: 26.5-27 minutes
string[] with canFind() :: >28 minutes
ubyte[][] with canFind() :: 27.5 minutes
ubyte[][] with countUntil() :: 27.5 minutes

As a result, the code above uses string[] with indexOf(). Tests were performed on an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz.

I have additional questions/concerns/confusion surrounding the foreach() syntax I have had to apply above, but performance remains my chief immediate concern.
January 30, 2021
On 1/30/21 6:13 PM, methonash wrote:
> Greetings all,
> 
> Many thanks for sharing your collective perspective and advice thus far! It has been very helpful and instructive. I return bearing live data and a minimally complete, compilable, and executable program to experiment with and potentially optimize. The dataset can be pulled from here:
> 
> https://filebin.net/qf2km1ea9qgd5skp/seqs.fasta.gz?t=97kgpebg
> 
> Running "cksum" on this file:
> 
> 1477520542 2199192 seqs.fasta.gz
> 
> Naturally, you'll need to gunzip this file. The decompressed file contains strings on every even-numbered line that have already been reduced to the unique de-duplicated subset, and they have already been sorted by descending length and alphabetical identity.
> 
>  From my initial post, the focus is now entirely on step #4: finding/removing strings that can be entirely absorbed (substringed) by their largest possible parent.
> 
> And now for the code:
> 
> 
> import std.stdio : writefln, File, stdin;
> import std.conv : to;
> import std.string : indexOf;
> 
> void main()
> {
>          string[] seqs;
> 
>          foreach( line; stdin.byLine() )
>          {
>                  if( line[ 0 ] == '>' ) continue;
>                  else seqs ~= to!string( line );
>          }
> 
>          foreach( i; 0 .. seqs.length )
>          {
>                  if( seqs[ i ].length == 0 ) continue;
> 
>                  foreach( j; i + 1 .. seqs.length )
>                  {
>                          if( seqs[ j ].length == 0 ||
>                              seqs[ i ].length == seqs[ j ].length ) continue;
> 
>                          if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
>                          {
>                                  seqs[ j ] = "";
> 
>                                  writefln( "%s contains %s", i, j );
>                          }
>                  }
>          }
> }
> 
> 
> Compile the source and then run the executable via redirecting stdin:
> 
> ./substr < seqs.fasta
> 
> See any straightforward optimization paths here?
> 
> Out of curiosity, I experimented with string[] and ubyte[][] and several search functions (indexOf, canFind, countUntil) to find the best performer. My off-the-cuff results:
> 
> string[] with indexOf() :: 26.5-27 minutes
> string[] with canFind() :: >28 minutes
> ubyte[][] with canFind() :: 27.5 minutes
> ubyte[][] with countUntil() :: 27.5 minutes
> 
> As a result, the code above uses string[] with indexOf(). Tests were performed on an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz.
> 
> I have additional questions/concerns/confusion surrounding the foreach() syntax I have had to apply above, but performance remains my chief immediate concern.

The code looks pretty minimal.

I'd suggest trying it in reverse. If you have the sequence "cba", "ba", "a", then determining "a" is in "ba" is probably cheaper than determining "a" is in "cba".

Are you still convinced that it's possible to do it in under 2 seconds? That would seem a huge discrepancy. If not, what specifically are you looking for in terms of performance?

-Steve
January 31, 2021
On Tuesday, 26 January 2021 at 23:57:43 UTC, methonash wrote:
clip
>> That nested loop is an O(n^2) algorithm. Meaning it will slow down *very* quickly as the size of the array n increases.  You might want to think about how to improve this algorithm.
>
> Nice observation, and yes, this would typically be an O(n^2) approach.
>
> However, due to subsetting the input dataset to unique strings and then sorting in descending length, one might notice that the inner foreach loop does not iterate over all of n, only on the iterator value i+1 through the end of the array.
>
> Thus, I believe this would then become approximately O(n^2/2). More precisely, it should be O( ( n^2 - n ) / 2 ), since the inner loop performs n-1, n-2, ..., 1, 0 comparisons as i advances.
>

But that is still O(n^2), you've only changed the constant.
January 31, 2021
On Sunday, 31 January 2021 at 00:53:05 UTC, Steven Schveighoffer wrote:
> I'd suggest trying it in reverse. If you have the sequence "cba", "ba", "a", then determining "a" is in "ba" is probably cheaper than determining "a" is in "cba".

I have user requirements that this application track string IDs that get collapsed under parents. To minimize/simplify array concatenations, I figured that going in descending order might make those operations less expensive.

Though I think this is also worth trying.

> Are you still convinced that it's possible to do it in under 2 seconds? That would seem a huge discrepancy. If not, what specifically are you looking for in terms of performance?

Good question. As a sanity check, at some point this past week, I re-wrote everything in a single contained Perl program, and running that program on the dataset I've shared above took well over 2 hours of wall time.

Considering that Perl's index() function is pre-compiled, I imagine that D running 4-5x faster is a pretty good speed boost, as it may be benefiting mainly from compiled rather than interpreted loops.

What confuses me, at this point, is this: I originally wrote the D code using foreach in this style:

foreach( i, ref parentString; strings )
{
    foreach( j, ref childString; strings[ i + 1 .. $ ] )
    {
        // ...
    }
}

Essentially, the value of j printed to stdout should always be larger than the value of i. Yet, when running in the above style, the values of j reported to stdout inevitably become smaller than i, suggesting that the loop is somehow traversing backwards. How can this be explained?
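One likely explanation, sketched below with made-up data: the index variable in a foreach over a slice is relative to the slice, not the original array, so j restarts at 0 on every outer iteration.

```d
// Sketch: in foreach (j, elem; strings[i + 1 .. $]), j indexes the slice,
// so it restarts at 0 for each i and can print as "smaller than i".
// The position in the original array is i + 1 + j.
void main()
{
    string[] strings = ["dddd", "ccc", "bb", "a"];
    foreach (i, parent; strings)
    {
        foreach (j, child; strings[i + 1 .. $])
        {
            assert(child is strings[i + 1 + j]); // j is slice-relative
        }
    }
}
```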