Thread overview
std.regex is fat
Oct 12, 2018
Chris Katko
Oct 12, 2018
Alex
Oct 14, 2018
Chris Katko
Oct 14, 2018
Chris Katko
Oct 14, 2018
Adam D. Ruppe
Oct 14, 2018
Chris Katko
Oct 14, 2018
Adam D. Ruppe
October 12, 2018
Like, insanely fat.

All I wanted was a simple regex. The second I included a regex function, my program would no longer compile: "out of memory for fork".

/usr/bin/time -v reports it went from 150MB of RAM for D, DAllegro, and Allegro5.

To over 650MB of RAM, and from 1.5 seconds to >5.5 seconds to compile. Now I have to close all my Chrome tabs just to compile.

Just for one line of regex. And I get it, it's the overhead of the library import, not the single line. But good gosh, more than 3X the RAM of the entire project for a single library import?

Something doesn't add up!


October 12, 2018
On Friday, 12 October 2018 at 13:25:33 UTC, Chris Katko wrote:
> Like, insanely fat.
>
> All I wanted was a simple regex. The second I included a regex function, my program would no longer compile: "out of memory for fork".
>
> /usr/bin/time -v reports it went from 150MB of RAM for D, DAllegro, and Allegro5.
>
> To over 650MB of RAM, and from 1.5 seconds to >5.5 seconds to compile. Now I have to close all my Chrome tabs just to compile.
>
> Just for one line of regex. And I get it, it's the overhead of the library import, not the single line. But good gosh, more than 3X the RAM of the entire project for a single library import?
>
> Something doesn't add up!

Hm... maybe you ran into this:
https://forum.dlang.org/post/mailman.3091.1517866806.9493.digitalmars-d@puremagic.com

October 14, 2018
On Friday, 12 October 2018 at 13:42:34 UTC, Alex wrote:
> On Friday, 12 October 2018 at 13:25:33 UTC, Chris Katko wrote:
>> Like, insanely fat.
>>
>> All I wanted was a simple regex. The second I included a regex function, my program would no longer compile: "out of memory for fork".
>>
>> /usr/bin/time -v reports it went from 150MB of RAM for D, DAllegro, and Allegro5.
>>
>> To over 650MB of RAM, and from 1.5 seconds to >5.5 seconds to compile. Now I have to close all my Chrome tabs just to compile.
>>
>> Just for one line of regex. And I get it, it's the overhead of the library import, not the single line. But good gosh, more than 3X the RAM of the entire project for a single library import?
>>
>> Something doesn't add up!
>
> Hm... maybe you ran into this:
> https://forum.dlang.org/post/mailman.3091.1517866806.9493.digitalmars-d@puremagic.com

So wait, their solution was to simply REMOVE std.regex from isEmail? That doesn't solve the regex problem at all. And from what I read in that thread, this penalty is paid per template INSTANTIATION, which could explode.

 1 - Does anyone know WHY it's so incredibly fat?

 2 - If this isn't going to be fixed anytime soon, shouldn't there be a DISCLAIMER on the documentation? (+potential workarounds like keeping regex queries in their own file.)

I mean, this kind of thing shouldn't require looking through forums. It's a clear bug, and if it's a WONTFIX (even temporarily), it should be documented clearly as such.

If I'm running into this issue, how many other people already did, and possibly even gave up on using D?
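
A rough sketch of the "keep regex queries in their own file" workaround mentioned above, shown as a single file for brevity. The function name containsMatch is hypothetical. Whether this actually lowers peak compiler memory is an assumption: the idea only helps if the regex code is built as its own compilation unit (for example with dmd -c) so the rest of the project does not have to be compiled together with it.

```d
import std.stdio : writeln;

/// Hypothetical wrapper: callers only see a plain, non-template function,
/// so std.regex stays an implementation detail of this body.
bool containsMatch(string line, string pattern)
{
    // Function-local import keeps std.regex out of the module-level imports.
    import std.regex : matchFirst, regex;

    // regex() builds the pattern at run time (this is not ctRegex).
    return !matchFirst(line, regex(pattern)).empty;
}

void main()
{
    writeln(containsMatch("writeln(\"hi\");", `write`)); // a match is expected
    writeln(containsMatch("readln();", `write`));        // no match is expected
}
```

In a real project the wrapper would live in its own module, and the rest of the code would call containsMatch without importing std.regex itself.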


October 14, 2018
On Sunday, 14 October 2018 at 02:44:55 UTC, Chris Katko wrote:
> On Friday, 12 October 2018 at 13:42:34 UTC, Alex wrote:
>> [...]
>
> So wait, their solution was to simply REMOVE std.regex from isEmail? That doesn't solve the regex problem at all. And from what I read in that thread, this penalty is paid per template INSTANTIATION, which could explode.
>
> [...]

For comparison, I just tested and grep uses about 4 MB of RAM to run.

So it's not the regex. It's the dmd / templates / CTFE, right?

October 14, 2018
On Sunday, 14 October 2018 at 03:07:59 UTC, Chris Katko wrote:
> For comparison, I just tested and grep uses about 4 MB of RAM to run.

Running and compiling are two entirely different things. Running the D regex code should be comparable, but compiling it is slow, in great part because of internal templates...

There was an effort to speed up the template code, but it is still not complete.

October 14, 2018
On Sunday, 14 October 2018 at 02:44:55 UTC, Chris Katko wrote:
> So wait, their solution was to simply REMOVE std.regex from isEmail?

That was ctRegex, which is different than regex.

> That doesn't solve the regex problem at all. And from what I read in that thread, this penalty is paid per template INSTANTIATION which could explode.

Template instantiation is a big issue for ctRegex, but not for regular regex.
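
To make the regex / ctRegex distinction concrete, here is a minimal sketch that uses both forms on the same pattern. The helper names firstNumberRuntime and firstNumberCompileTime are made up for illustration. The first builds the pattern when the program runs, while the second asks the compiler to generate a matcher for the pattern during compilation, which is where the extra compile-time cost discussed in this thread would be expected to show up.

```d
import std.regex : ctRegex, matchFirst, regex;
import std.stdio : writeln;

/// Runtime variant: the pattern is parsed when this function executes.
string firstNumberRuntime(string s)
{
    auto re = regex(`\d+`);
    auto m = matchFirst(s, re);
    return m.empty ? "" : m.hit;
}

/// Compile-time variant: ctRegex instantiates templates for this
/// specific pattern while the program is being compiled.
string firstNumberCompileTime(string s)
{
    auto re = ctRegex!(`\d+`);
    auto m = matchFirst(s, re);
    return m.empty ? "" : m.hit;
}

void main()
{
    // Both variants should find the same first run of digits.
    writeln(firstNumberRuntime("abc 42 def"));
    writeln(firstNumberCompileTime("abc 42 def"));
}
```

Both functions are called the same way; only the construction of the pattern differs.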

October 14, 2018
On Sunday, 14 October 2018 at 03:26:33 UTC, Adam D. Ruppe wrote:
> On Sunday, 14 October 2018 at 03:07:59 UTC, Chris Katko wrote:
>> For comparison, I just tested and grep uses about 4 MB of RAM to run.
>
> Running and compiling are two entirely different things. Running the D regex code should be comparable, but compiling it is slow, in great part because of internal templates...
>
> There was an effort to speed up the template code, but it is still not complete.

I know that. I figured people would miss my point on it though so I should have clarified. That's why I said it's likely the templates/DMD that's exploding--not the actual regex action.

For a simple program, it takes ~100-150MB of RAM to compile. Adding a single regex (not a compile-time regex) balloons that to 550MB and 5 seconds of compile time.

-----------

Anyhow, I wrote my own simple "dgrep" and compared the results with grep. It's very competitive: (NOT to be confused with the above RAM stats for COMPILING)


Command being timed: "sh -c cat dgrep.d | ./dgrep 'write' "
	User time (seconds): 0.00
	System time (seconds): 0.00
	Percent of CPU this job got: 0%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3192
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 301
	Voluntary context switches: 5
	Involuntary context switches: 124
	Swaps: 0
	File system inputs: 8
	File system outputs: 8
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Command being timed: "sh -c cat dgrep.d | grep 'write'"
	User time (seconds): 0.00
	System time (seconds): 0.00
	Percent of CPU this job got: 0%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2224
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 2
	Minor (reclaiming a frame) page faults: 282
	Voluntary context switches: 10
	Involuntary context switches: 0
	Swaps: 0
	File system inputs: 760
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

So I have to say I'm impressed with the actual performance of the regular expressions engine--especially considering "grep" is, IIRC, considered a fine-tuned beast.
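
The source of "dgrep" was not posted, so the following is only a guess at what such a tool might look like: a minimal filter that reads standard input and prints the lines matching a pattern given as the first argument. The helper name lineMatches is hypothetical.

```d
import std.regex : Regex, matchFirst, regex;
import std.stdio : stderr, stdin, writeln;

/// Returns true if `line` contains at least one match for `re`.
bool lineMatches(string line, Regex!char re)
{
    return !matchFirst(line, re).empty;
}

int main(string[] args)
{
    if (args.length < 2)
    {
        stderr.writeln("usage: dgrep PATTERN < input");
        return 2;
    }

    // Build the pattern once at run time, then reuse it for every line.
    auto re = regex(args[1]);

    // byLineCopy yields each input line as a fresh string.
    foreach (line; stdin.byLineCopy)
    {
        if (lineMatches(line, re))
            writeln(line);
    }
    return 0;
}
```

It would be invoked the same way as in the timing output above, for example: cat dgrep.d | ./dgrep 'write'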