I need a case-insensitive check to see if a string contains another string for a "quick filter" feature. It should preferrably be perceived as instant by the user, and needs to check a few thousand strings in typical cases. Is a regex the best option, or what would you suggest?
Thread overview | ||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
October 05, 2022 Replacing tango.text.Ascii.isearch | ||||
---|---|---|---|---|
| ||||
October 05, 2022 Re: Replacing tango.text.Ascii.isearch | ||||
---|---|---|---|---|
| ||||
Posted in reply to torhu | On 10/5/22 12:59 PM, torhu wrote: >I need a case-insensitive check to see if a string contains another string for a "quick filter" feature. It should preferrably be perceived as instant by the user, and needs to check a few thousand strings in typical cases. Is a regex the best option, or what would you suggest? https://dlang.org/phobos/std_uni.html#asLowerCase
untested. -Steve |
October 05, 2022 Re: Replacing tango.text.Ascii.isearch | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven Schveighoffer wrote: I wanted to do some quick benchmarking to figure out what works. When I run this:
It prints this: >2 weeks, 6 days, 9 hours, 34 minutes, 43 secs, 214 ms, 946 ╬╝s, and 7 hnsecs Am I doing something wrong here? |
October 05, 2022 Re: Replacing tango.text.Ascii.isearch | ||||
---|---|---|---|---|
| ||||
Posted in reply to torhu | On Wednesday, 5 October 2022 at 20:40:46 UTC, torhu wrote: >Am I doing something wrong here? Right, you can instantiate structs without arguments. It's been ten years since I last used D, I was thinking of structs like if they were classes. |
October 05, 2022 Re: Replacing tango.text.Ascii.isearch | ||||
---|---|---|---|---|
| ||||
Posted in reply to torhu | On Wednesday, 5 October 2022 at 20:45:55 UTC, torhu wrote: >On Wednesday, 5 October 2022 at 20:40:46 UTC, torhu wrote: >Am I doing something wrong here? Right, you can instantiate structs without arguments. It's been ten years since I last used D, I was thinking of structs like if they were classes. I think there should be sensible default here, seems like an easy trap to remove. |
October 05, 2022 Re: Replacing tango.text.Ascii.isearch | ||||
---|---|---|---|---|
| ||||
Posted in reply to torhu | On 10/5/22 13:40, torhu wrote:
> auto sw = StopWatch();
Either this:
auto sw = StopWatch(AutoStart.yes);
or this:
auto sw = StopWatch();
sw.start();
Ali
|
October 05, 2022 Re: Replacing tango.text.Ascii.isearch | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven Schveighoffer wrote: >
untested. -Steve I did some basic testing, and regex was two orders of magnitude faster. So now I know, I guess. |
October 06, 2022 Re: Replacing tango.text.Ascii.isearch | ||||
---|---|---|---|---|
| ||||
Posted in reply to torhu | On Wednesday, 5 October 2022 at 21:50:32 UTC, torhu wrote: >I did some basic testing, and regex was two orders of magnitude faster. So now I know, I guess. Substring search functionality is currently in a very bad shape in Phobos. I discovered this myself a few weeks ago when I was trying to solve the https://www.facebook.com/codingcompetitions/hacker-cup/2022/round-1/problems/A2 puzzle using D language. A part of the solution requires a fast substring search (actually a subarray search) and Phobos doesn't offer anything with even remotely acceptable performance. Phobos does have a Boyer-Moore implementation. This algorithm is historically famous in computer science as one of the first attempts to optimize substring search, but it also has pathologically bad performance on certain input data and probably shouldn't be recommended for any practical use nowadays. The users of old versions of Python discovered this too: https://codeforces.com/blog/entry/106849?#comment-952483 The standard 'find' function from the following simple example also becomes awfully slow when arrays 'a' and 'b' are large and/or are adversarially constructed:
I think that the best fit for your use case is the https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm and there's an old issue in bugzilla about this: https://issues.dlang.org/show_bug.cgi?id=16066 BTW, if anyone is curious, one of the possible solutions for the hacker-cup/2022/round-1/problems/A2 puzzle in D language can be found here: https://codeforces.com/blog/entry/106768?#comment-952808 |
October 06, 2022 Re: Replacing tango.text.Ascii.isearch | ||||
---|---|---|---|---|
| ||||
Posted in reply to Siarhei Siamashka | On Thursday, 6 October 2022 at 08:15:10 UTC, Siarhei Siamashka wrote: >On Wednesday, 5 October 2022 at 21:50:32 UTC, torhu wrote: Please don’t tell us that D will be slower than Python again?) |
October 06, 2022 Re: Replacing tango.text.Ascii.isearch | ||||
---|---|---|---|---|
| ||||
Posted in reply to torhu | On 10/5/22 23:50, torhu via Digitalmars-d-learn wrote: > I did some basic testing, and regex was two orders of magnitude faster. So now I know, I guess. And what kind of testing was that? Mind to share? Because I did the following real quick and wasn't able to measure a "two orders of magnitude" difference. Sure, the regex version came on top, but they were both faster than the ruby baseline I cooked up. First, generate a word file with 100k entries of various lengths: $> dmd -run words.d foobaz 100000 --- import std; string randomWord(ulong n) { static chars = letters.array; return generate!(() => chars.choice).take(n).text; } void main(string[] args) { enforce(args.length == 3, "Usage: dmd -run words.d needle num"); auto f = File("words.txt", "w"); foreach (i; 0..args[2].to!ulong) { ulong n = uniform(0, 50), m = uniform(0, 50); if (i % 2 == 0) f.writeln(randomWord(n), args[1], randomWord(m)); else f.writeln(randomWord(n + m)); } } --- And then for the actual measuring: $> dmd -O -version={range,regex} -of=search-{range,regex} search.d $> ldc -O -d-version={range,regex} -of=search-{range,regex}-ldc search.d $> time ./search-{range,regex}{,-ldc} foobaz --- import std; void main(string[] args) { enforce(args.length == 2, "Usage: search 'needle'"); version (regex) auto rx = regex(args[1], "i"); else version (range) auto needle = args[1].asLowerCase.text; else static assert(0, "use -version={regex,range}"); ulong matches; foreach (line; File("words.txt").byLine) { version (regex) if (line.matchFirst(rx)) matches++; version (range) if (line.asLowerCase.canFind(needle)) matches++; } writeln(matches); } --- |