February 19, 2021
On Wednesday, 17 February 2021 at 04:10:24 UTC, tsbockman wrote:
> I spent some time experimenting with this problem, and here is the best solution I found, assuming that perfect de-duplication is required. (I'll put the code up on GitHub / dub if anyone wants to have a look.)

It would be interesting to see how the performance compares to tsv-uniq (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The prebuilt binaries turn on all the optimizations (https://github.com/eBay/tsv-utils/releases).

tsv-uniq wasn't included in the comparative benchmarks I published, but I did run my own benchmarks, and it holds up well. However, it should not be hard to beat; what might be more interesting is how large the delta is.

tsv-uniq uses the most straightforward approach: popping things into an associative array. No custom data structures. It needs enough memory to hold all the unique keys, so it won't handle arbitrarily large data sets. It would be interesting to see how the straightforward approach compares with the more highly tuned approach.
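
For illustration, the core of that straightforward approach looks roughly like the following (a minimal sketch, not tsv-uniq's actual code):

    import std.stdio;

    void main(string[] args)
    {
        bool[string] seen;                  // set of lines already printed
        auto input = args.length > 1 ? File(args[1]) : stdin;

        foreach (line; input.byLine)
        {
            if (line !in seen)
            {
                seen[line.idup] = true;     // idup: byLine reuses its buffer
                writeln(line);
            }
        }
    }

The real tool layers key-field selection and other options on top, but the hot loop is essentially an associative array lookup plus an idup the first time a line is seen.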

--Jon

February 23, 2021
On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt wrote:
> On Wednesday, 17 February 2021 at 04:10:24 UTC, tsbockman wrote:
>> I spent some time experimenting with this problem, and here is the best solution I found, assuming that perfect de-duplication is required. (I'll put the code up on GitHub / dub if anyone wants to have a look.)
>
> It would be interesting to see how the performance compares to tsv-uniq (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The prebuilt binaries turn on all the optimizations (https://github.com/eBay/tsv-utils/releases).

My program (called line-dedup below) is modestly faster than yours, with the gap gradually widening as files get bigger. Similarly, when not using a memory-mapped scratch file, my program is modestly less memory hungry than yours, with the gap gradually widening as files get bigger.

In neither case is the difference very exciting though; the real benefit of my algorithm is that it can process files too large for physical memory. It might also handle frequent hash collisions better, and could be upgraded to handle huge numbers of very short lines efficiently.

Test 0 (139 MiB, 1537300 unique lines of 2000004):
    sort source | uniq > dest
        Wall time: 2.04 s
        User time: 9.34 s
        System time: 0.23 s
        Max resident set: 385 MiB
    tsv-uniq source > dest
        Wall time: 0.82 s
        User time: 0.73 s
        System time: 0.13 s
        Max resident set: 229 MiB
    line-dedup source dest
        Wall time: 0.58 s
        User time: 0.52 s
        System time: 0.05 s
        Max resident set: 206 MiB

Test 1 (1.4 GiB, 14338280 unique lines of 20000004):
    sort source | uniq > dest
        Wall time: 22.9 s
        User time: 107 s
        System time: 2.16 s
        Max resident set: 3.76 GiB
    tsv-uniq source > dest
        Wall time: 9.94 s
        User time: 9.25 s
        System time: 1.3 s
        Max resident set: 2.34 GiB
    line-dedup source dest
        Wall time: 6.34 s
        User time: 5.88 s
        System time: 0.44 s
        Max resident set: 1.98 GiB

Test 2 (4.2 GiB, 44379959 unique lines of 60000004):
    sort source | uniq > dest
        Wall time: 87.1 s
        User time: 352.3 s
        System time: 7.41 s
        Max resident set: 7.84 GiB
    tsv-uniq source > dest
        Wall time: 42.3 s
        User time: 40.7 s
        System time: 4.82 s
        Max resident set: 7.86 GiB
    line-dedup source dest
        Wall time: 23 s
        User time: 19.6 s
        System time: 1.4 s
        Max resident set: 5.96 GiB

The test files have a fractal distribution, with many lines having a few duplicates, and a few lines having many duplicates.
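
The actual generator isn't shown here, but one hypothetical way to produce that kind of skewed duplicate distribution is to draw line IDs from an approximate power law, for example:

    import std.math : pow;
    import std.random;
    import std.stdio;

    void main()
    {
        enum totalLines  = 2_000_000;   // lines written (duplicates included)
        enum distinctIds = 1_600_000;   // upper bound on unique lines
        enum skew        = 3.0;         // larger = heavier duplication of low IDs

        auto rng = Random(42);
        foreach (i; 0 .. totalLines)
        {
            // Approximate power-law sample over [1, distinctIds] by inverse
            // transform: small IDs are drawn far more often than large ones.
            immutable id = cast(size_t)(distinctIds * pow(uniform01(rng), skew)) + 1;
            writeln("line-", id);
        }
    }
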
February 23, 2021
On Tuesday, 23 February 2021 at 00:08:40 UTC, tsbockman wrote:
> On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt wrote:
>> It would be interesting to see how the performance compares to tsv-uniq (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The prebuilt binaries turn on all the optimizations (https://github.com/eBay/tsv-utils/releases).
>
> My program (called line-dedup below) is modestly faster than yours, with the gap gradually widening as files get bigger. Similarly, when not using a memory-mapped scratch file, my program is modestly less memory hungry than yours, with the gap gradually widening as files get bigger.
>
> In neither case is the difference very exciting though; the real benefit of my algorithm is that it can process files too large for physical memory. It might also handle frequent hash collisions better, and could be upgraded to handle huge numbers of very short lines efficiently.

Thanks for running the comparison! I appreciate seeing how other implementations compare.

I'd characterize the results a bit differently though. Based on the numbers, line-dedup is materially faster than tsv-uniq, at least on the tests run. To your point, it may not make much practical difference on data sets that fit in memory; tsv-uniq is fast enough for most needs. But it's still a material performance delta. Nice job!

I also agree that the bigger pragmatic benefit is fast processing of files much larger than will fit in memory. There are other useful problems like this. One I often need is creating a random weighted ordering: easy to do for data sets that fit in memory, but hard to do fast for data sets that do not.
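
One common in-memory technique for weighted random ordering is to assign each record the key u^(1/weight), with u uniform in (0,1), and sort by that key descending (the Efraimidis-Spirakis method). Roughly, as an illustrative sketch:

    import std.algorithm : sort;
    import std.math : pow;
    import std.random;
    import std.stdio;
    import std.typecons : Tuple, tuple;

    void main()
    {
        // Illustrative only: (record, weight) pairs already loaded in memory.
        Tuple!(string, double)[] records = [
            tuple("apple", 5.0), tuple("banana", 1.0), tuple("cherry", 0.1),
        ];

        auto rng = Random(unpredictableSeed);
        auto keyed = new Tuple!(double, string)[](records.length);
        foreach (i, rec; records)
            keyed[i] = tuple(pow(uniform01(rng), 1.0 / rec[1]), rec[0]);

        // Larger weights produce keys closer to 1, so they tend to sort earlier.
        keyed.sort!((a, b) => a[0] > b[0]);

        foreach (k; keyed)
            writeln(k[1]);
    }

The hard part is doing something equivalent in a streaming or external-memory fashion when the records don't fit in RAM.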

--Jon

