October 06, 2022

On Thursday, 6 October 2022 at 21:36:48 UTC, rassoc wrote:

> And what kind of testing was that? Mind sharing? Because I did the following real quick and wasn't able to measure a "two orders of magnitude" difference. Sure, the regex version came out on top, but both were faster than the Ruby baseline I cooked up.

Originally I just loaded a one-megabyte file and searched the whole thing. I changed it to split the text into 40,000 lines instead; regex is about ten times faster then. I compile with -release -O -inline. Here is the second version:

import std;
import std.datetime.stopwatch;

enum FILE = "test.lst";
string text;
string needle;

void test(bool delegate(string haystack) dg)
{

    auto sw = StopWatch(AutoStart.yes);
    int counter = 0;
    foreach (line; lineSplitter(text)) {
        if (dg(line))
            counter++;
    }
    sw.stop();
    writefln("%s", sw.peek());
    writefln("counter: %s", counter);
}

void main(char[][] args)
{
    enforce(args.length > 1, "Need a needle argument.");

    text = cast(string)read(FILE);
    needle = args[1].idup;
    auto re = regex(to!string(escaper(needle)), "i");
    string needleLower = needle.toLower();

    test((h) => !!h.matchFirst(re));
    test((h) => h.asLowerCase().canFind(needleLower));
}
October 07, 2022
On 10/7/22 01:39, torhu via Digitalmars-d-learn wrote:
> regex is about ten times faster then.

Interesting! Using your code, I'm seeing a 1.5x max difference for ldc, nothing close to 10x. Welp, the woes of superficial benchmarking. :)

October 07, 2022

On Friday, 7 October 2022 at 00:57:38 UTC, rassoc wrote:

> On 10/7/22 01:39, torhu via Digitalmars-d-learn wrote:
>> regex is about ten times faster then.
>
> Interesting! Using your code, I'm seeing a 1.5x max difference for ldc, nothing close to 10x. Welp, the woes of superficial benchmarking. :)

Benchmark results depend on many things, such as the actual text in both needle and haystack and the needle length. Are we dealing with Unicode text, by the way? One example is searching for something like "äußere" in https://www.gutenberg.org/ebooks/6343.txt.utf-8

If it's source code, then searching for "sqlite3_value_bytes16" in the sqlite3.c file from https://www.sqlite.org/2022/sqlite-amalgamation-3390400.zip may be a good test too.

I'm getting at least 5x difference in favor of regex with LDC on these two examples.

Also are we allowed to artificially construct needle and haystack to blow up this test rather than only benchmarking it on typical real data?

October 07, 2022

On Friday, 7 October 2022 at 06:34:50 UTC, Siarhei Siamashka wrote:

> Also are we allowed to artificially construct needle and haystack to blow up this test rather than only benchmarking it on typical real data?

Such as generating the input data via running:

python -c "print(('a' * 49 + 'b') * 20000)" > test.lst

And then using "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the character 'a' replicated 50 times) as the needle to search for. Much longer needles work even better. In Linux the command line size is limited to 128K, so there's huge room for improvement.
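To see why this input is pathological, here is a small Python sketch (Python only because it's convenient for counting; the numbers are a property of the brute-force algorithm, not of any particular language) that counts character comparisons done by a naive search on a scaled-down version of the same data:

```python
def naive_find(haystack, needle):
    """Brute-force search; returns (match_index, total_char_comparisons)."""
    comparisons = 0
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break
            j += 1
        else:  # no break: full match at position i
            return i, comparisons
    return -1, comparisons

# scaled-down version of the input above: runs of 49 'a's, never 50
haystack = ("a" * 49 + "b") * 2_000
needle = "a" * 50

index, comparisons = naive_find(haystack, needle)
print(index)                        # -1, the needle never occurs
print(comparisons / len(haystack))  # ~25 comparisons per haystack character
```

Almost every starting position yields a long partial match before hitting a 'b', so the naive search does work proportional to needle length times haystack length.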

October 07, 2022

On Friday, 7 October 2022 at 07:16:19 UTC, Siarhei Siamashka wrote:

> On Friday, 7 October 2022 at 06:34:50 UTC, Siarhei Siamashka wrote:
>> Also are we allowed to artificially construct needle and haystack to blow up this test rather than only benchmarking it on typical real data?
>
> Such as generating the input data via running:
>
> python -c "print(('a' * 49 + 'b') * 20000)" > test.lst
>
> And then using "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the character 'a' replicated 50 times) as the needle to search for. Much longer needles work even better. In Linux the command line size is limited to 128K, so there's huge room for improvement.

https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/

"the longer the pattern is, the faster the algorithm goes"

October 07, 2022

On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:

> https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/
>
> "the longer the pattern is, the faster the algorithm goes"

Yes, that's how substring search works in the standard libraries of other programming languages. Now please take torhu's test program (posted a few comments above) and generate input for it using

python -c "print(('a' * 49 + 'b') * 20000)" > test.lst

Run it to search for "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the character 'a' replicated 50 times) and then compare its performance against the other similar programs doing the same thing:

Python:

import sys
assert len(sys.argv) >= 2, "Need a needle argument."
needle = sys.argv[1].lower()
print(sum([1 if line.lower().find(needle) != -1 else 0 for line in open("test.lst")]))

Ruby/Crystal:

abort "Need a needle argument." unless ARGV.size >= 1
needle = ARGV[0].downcase
puts File.open("test.lst").each_line.sum {|line| line.downcase.index(needle) ? 1 : 0 }

In other programming languages, the generic substring search is tuned to be fast on any input, but Phobos uses a simple-minded slow search algorithm by default. The Boyer-Moore algorithm can be used if explicitly requested, but it has some gotchas of its own.

October 08, 2022
On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn wrote:
> On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
> python -c "print(('a' * 49 + 'b') * 20000)" > test.lst

That's generating a file with a single line:

$> wc -l test.lst
1 test.lst

Going with an appropriate 100k mixed line file and your mentioned needle, D is still quite a bit slower, but the results aren't as drastic and nowhere near "two orders of magnitude".

$> crystal build --release -o search-cr search.cr
abort "Need a needle argument." unless ARGV.size >= 1
needle = ARGV[0].downcase
puts File.open("words.txt").each_line.count(&.downcase.includes? needle)

$> ldc2 -O2 --release -of search-ldc search.d
import std;
void main(string[] args) {
    enforce(args.length > 1, "Need a needle argument.");
    auto needle = args[1].toLower;
    File("words.txt").byLine.count!(ln => ln.asLowerCase.canFind(needle)).writeln;
}
October 09, 2022

On Saturday, 8 October 2022 at 01:07:46 UTC, rassoc wrote:

> On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn wrote:
>> On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
>> python -c "print(('a' * 49 + 'b') * 20000)" > test.lst
>
> That's generating a file with a single line:
>
> $> wc -l test.lst
> 1 test.lst

If you insist on having 100K lines in the input data, then you can try a different test (run it as a shell script):

python -c "print(('a' * 9999 + 'b') * 89 + '\\n' * 99999 + 'a' * 10000)" > words.txt
cp words.txt test.lst
wc -l words.txt

NEEDLE=`python -c "print('a' * 10000, end='')"`

echo "=== Python ==="
time python search.py "$NEEDLE"

echo "=== Ruby ==="
time ruby search.rb "$NEEDLE"

echo "=== Crystal ==="
time ./search-cr "$NEEDLE"

echo "=== D (LDC) ==="
time ./search-ldc "$NEEDLE"

This tests a 1MB file with 100K lines and a 10K-character needle. Results from my computer (with Python 3.10, because earlier versions of Python are slower):

100000 words.txt
=== Python ===
1

real    0m0.083s
user    0m0.072s
sys     0m0.010s
=== Ruby ===
1

real    0m0.313s
user    0m0.313s
sys     0m0.000s
=== Crystal ===
1

real    0m1.492s
user    0m1.482s
sys     0m0.010s
=== D (LDC) ===
1

real    1m10.314s
user    1m10.234s
sys     0m0.050s
> Going with an appropriate 100k mixed line file and your mentioned needle, D is still quite a bit slower, but the results aren't as drastic

Your "appropriate" file is also entirely artificial and happens to be specifically crafted to work best with the current ".canFind" implementation from Phobos. The primary source of slowness are partial matches. Such as the "int i = " prefix when searching for "int i = 0;" substring in a string, which contains "int i = 1;". The time is wasted on comparing the initial 8 characters before encountering a mismatch and bailing out. But when searching in a randomly generated gibberish, the chances of encountering long partial matches are very low. At least lower than in the real text files.

> and nowhere near "two orders of magnitude".

Does the difference really have to be two orders of magnitude for you to acknowledge that there might be a performance problem in Phobos? Moreover, you are quoting torhu, who compared two D implementations compiled by DMD (regex vs. canFind) rather than standard libraries of different programming languages.

> import std;
> void main(string[] args) {
>     enforce(args.length > 1, "Need a needle argument.");
>     auto needle = args[1].toLower;
>     File("words.txt").byLine.count!(ln => ln.asLowerCase.canFind(needle)).writeln;
> }

It's a nice-looking one-liner implementation in D, and everything is fine, except that similar one-liners implemented in other programming languages are faster and more versatile (they can handle any input data without catastrophic performance pitfalls).

BTW, changing ".asLowerCase" to ".toLower" introduces an extra memory allocation, but still significantly improves handling of the worst case. Conversion to lowercase is expensive for Unicode characters, and revisiting the same character to repeat that conversion again and again is bad for performance.
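A rough Python analogue of this trade-off (counting `.lower()` calls instead of D's lazy `asLowerCase` range, so the numbers are illustrative only):

```python
lower_calls = 0

def lower(c):
    """Lowercase one character, counting how often conversion happens."""
    global lower_calls
    lower_calls += 1
    return c.lower()

def icase_find_on_the_fly(haystack, needle_lower):
    """Naive case-insensitive search lowercasing haystack chars on the fly."""
    for i in range(len(haystack) - len(needle_lower) + 1):
        if all(lower(haystack[i + j]) == c for j, c in enumerate(needle_lower)):
            return i
    return -1

haystack = "A" * 1000          # worst case: long partial matches everywhere
needle = "a" * 100 + "b"       # never matches

icase_find_on_the_fly(haystack, needle)
on_the_fly = lower_calls       # same characters re-lowercased at every retry

lower_calls = 0
lowered = "".join(lower(c) for c in haystack)  # convert once up front
lowered.find(needle)
cached = lower_calls           # exactly one conversion per character

print(on_the_fly, cached)
```

Lowercasing up front costs one pass (and, in D, an allocation), but the on-the-fly variant converts each character once per overlapping retry position, which is where the worst-case time goes.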

October 09, 2022
On 10/9/22 03:08, Siarhei Siamashka via Digitalmars-d-learn wrote:
> Does the difference really have to be two orders of magnitude for you to acknowledge that there might be a performance problem in Phobos? [...] Except that similar one-liners implemented using other programming languages are faster and more versatile (can handle any input data without catastrophic performance pitfalls).

Oh, I get all that, there's no reason to argue with me or win me over; I can see that the implementation is subpar. Since I'm never hitting this performance bottleneck in my code, and being a regular dev and not a core maintainer, I simply haven't been motivated enough to contribute an improvement. A change like that isn't going to happen in the forums. Various optimizations have made it into Phobos in the past; I don't think you would get any pushback if you can show that a new implementation improves the situation in almost all cases while maintaining compatibility.

October 13, 2022

On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven Schveighoffer wrote:

> On 10/5/22 12:59 PM, torhu wrote:
>> I need a case-insensitive check to see if a string contains another string for a "quick filter" feature. It should preferably be perceived as instant by the user, and needs to check a few thousand strings in typical cases. Is a regex the best option, or what would you suggest?
>
> https://dlang.org/phobos/std_uni.html#asLowerCase
>
> bool isearch(S1, S2)(S1 haystack, S2 needle)
> {
>     import std.uni;
>     import std.algorithm;
>     return haystack.asLowerCase.canFind(needle.asLowerCase);
> }
>
> untested.
>
> -Steve

This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct.

For example, Turkish will not work properly with it.

Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
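Python's built-in str methods use the same locale-independent default Unicode case mapping that asLowerCase/toLower implement, so the Turkish dotted/dotless "i" problem from the article is easy to demonstrate:

```python
print("I".lower())       # 'i' -- correct for English, but in Turkish
                         #        'I' should lowercase to dotless 'ı'
print("ı".upper())       # 'I' -- Turkish dotless ı uppercases to plain I
print(len("İ".lower()))  # 2  -- 'İ' lowercases to 'i' plus a combining
                         #       dot above, not to a plain 'i'

# so a naive lowercase-and-compare search misses Turkish matches:
print("İSTANBUL".lower() == "istanbul")  # False
```

Getting this right requires locale-aware case folding (e.g. via ICU), which neither the Phobos one-liner nor the default case mapping in most standard libraries provides.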