Replacing tango.text.Ascii.isearch

Oct 05, 2022

torhu

Oct 05, 2022

Oct 05, 2022

Oct 05, 2022

Oct 05, 2022

Oct 05, 2022

Oct 05, 2022

Oct 06, 2022

Oct 06, 2022

Oct 06, 2022

Oct 06, 2022

Oct 07, 2022

Oct 07, 2022

Oct 07, 2022

Oct 07, 2022

Oct 07, 2022

Oct 08, 2022

Oct 09, 2022

Oct 09, 2022

Oct 13, 2022

Oct 13, 2022

Oct 13, 2022

Oct 13, 2022

Oct 13, 2022

Oct 13, 2022

Oct 13, 2022

Oct 13, 2022

Oct 25, 2022

Oct 25, 2022

Oct 26, 2022

Oct 26, 2022

Oct 26, 2022

Oct 26, 2022

Oct 26, 2022

Oct 28, 2022

Oct 28, 2022

October 06, 2022

Re: Replacing tango.text.Ascii.isearch

Posted by Siarhei Siamashka
in reply to torhu

Permalink

Siarhei Siamashka

Posted in reply to torhu

Permalink

On Wednesday, 5 October 2022 at 21:50:32 UTC, torhu wrote:

I did some basic testing, and regex was two orders of magnitude faster. So now I know, I guess.

Substring search functionality is currently in a very bad shape in Phobos. I discovered this myself a few weeks ago when I was trying to solve the https://www.facebook.com/codingcompetitions/hacker-cup/2022/round-1/problems/A2 puzzle using D language. A part of the solution requires a fast substring search (actually a subarray search) and Phobos doesn't offer anything with even remotely acceptable performance.

Phobos does have a Boyer-Moore implementation. This algorithm is historically famous in computer science as one of the first attempts to optimize substring search, but it also has pathologically bad performance on certain input data and probably shouldn't be recommended for any practical use nowadays. The users of old versions of Python discovered this too: https://codeforces.com/blog/entry/106849?#comment-952483

The standard 'find' function from the following simple example also becomes awfully slow when arrays 'a' and 'b' are large and/or are adversarially constructed:

import std;
void main() {
  auto a = [1, 2, 3, 4];
  auto b = [2, 3];
  writefln("Is %s a subarray of %s? The answer is %s.",
           b, a, a.find(b).empty ? "no" : "yes");
}

I think that the best fit for your use case is the https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm and there's an old issue in bugzilla about this: https://issues.dlang.org/show_bug.cgi?id=16066

BTW, if anyone is curious, one of the possible solutions for the hacker-cup/2022/round-1/problems/A2 puzzle in D language can be found here: https://codeforces.com/blog/entry/106768?#comment-952808

On 10/5/22 23:50, torhu via Digitalmars-d-learn wrote: > I did some basic testing, and regex was two orders of magnitude faster. So now I know, I guess. And what kind of testing was that? Mind to share? Because I did the following real quick and wasn't able to measure a "two orders of magnitude" difference. Sure, the regex version came on top, but they were both faster than the ruby baseline I cooked up. First, generate a word file with 100k entries of various lengths: $> dmd -run words.d foobaz 100000 --- import std; string randomWord(ulong n) { static chars = letters.array; return generate!(() => chars.choice).take(n).text; } void main(string[] args) { enforce(args.length == 3, "Usage: dmd -run words.d needle num"); auto f = File("words.txt", "w"); foreach (i; 0..args[2].to!ulong) { ulong n = uniform(0, 50), m = uniform(0, 50); if (i % 2 == 0) f.writeln(randomWord(n), args[1], randomWord(m)); else f.writeln(randomWord(n + m)); } } --- And then for the actual measuring: $> dmd -O -version={range,regex} -of=search-{range,regex} search.d $> ldc -O -d-version={range,regex} -of=search-{range,regex}-ldc search.d $> time ./search-{range,regex}{,-ldc} foobaz --- import std; void main(string[] args) { enforce(args.length == 2, "Usage: search 'needle'"); version (regex) auto rx = regex(args[1], "i"); else version (range) auto needle = args[1].asLowerCase.text; else static assert(0, "use -version={regex,range}"); ulong matches; foreach (line; File("words.txt").byLine) { version (regex) if (line.matchFirst(rx)) matches++; version (range) if (line.asLowerCase.canFind(needle)) matches++; } writeln(matches); } ---

Forums