Thread overview
Going on std.regex & std.uni bug-fixing hunt
Sep 05, 2017
Dmitry Olshansky
Sep 06, 2017
Jon Degenhardt
Sep 06, 2017
Dmitry Olshansky
Sep 10, 2017
Chad Joan
Sep 10, 2017
Dmitry Olshansky
Sep 10, 2017
Chad Joan
Sep 10, 2017
Dmitry Olshansky
September 05, 2017
It's been tough time on D front for me, going down from about ~1 week of activity during July to ~2-3 days in August.

One thing I realised is that doing a new GC is going to be a long battle.
Before I'm too deep down this rabbit hole I decided to first address the long-standing backlog of issues of std.regex and std.uni.

My burndown list for std.regex:

https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr

Should get to about 5-6 enhacements.

Same for std.uni:

https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216641&query_format=advanced&resolution=---&short_desc=std.uni&short_desc_type=allwordssubstr

This should be actually 0.


Anything not showing up on these lists will not get fixed anytime soon.

The starting point for std.regex:
https://github.com/dlang/phobos/pull/5722

September 06, 2017
On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky wrote:
> Before I'm too deep down this rabbit hole I decided to first address the long-standing backlog of issues of std.regex and std.uni.

Not intending to add to your workload, but I added an enhancement request for something I hit ran into recently: Having a version of wcwidth/wcswidth functions available as part of unicode support. These functions calculate the width of a character in fixed-width fonts, they are useful in command-line apps. (CJK characters typically have a width of two.) Character width is part of the unicode standard. Issue: https://issues.dlang.org/show_bug.cgi?id=17810

--Jon


September 06, 2017
On Wednesday, 6 September 2017 at 05:23:51 UTC, Jon Degenhardt wrote:
> On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky wrote:
>> Before I'm too deep down this rabbit hole I decided to first address the long-standing backlog of issues of std.regex and std.uni.
>
> Not intending to add to your workload, but I added an enhancement request for something I hit ran into recently: Having a version of wcwidth/wcswidth functions available as part of unicode support. These functions calculate the width of a character in fixed-width fonts, they are useful in command-line apps.

Yeah, there is quite a few missing bits from std.uni. I didn't expect this one though.

Hopefully we can add them on pay as go principle, so if you don't use it you don't pay for it.

>
> --Jon


September 10, 2017
On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky wrote:
> It's been tough time on D front for me, going down from about ~1 week of activity during July to ~2-3 days in August.
>
> One thing I realised is that doing a new GC is going to be a long battle.
> Before I'm too deep down this rabbit hole I decided to first address the long-standing backlog of issues of std.regex and std.uni.
>
> My burndown list for std.regex:
>
> https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr
> ...

I was working on std.regex a bit myself, so I created this bug report to capture some of the findings/progress:
https://issues.dlang.org/show_bug.cgi?id=17820

It seems like something you might be interested in, or might even have a small chance of fixing in the course of other things.

...

There are other regex improvements I might be interested in, but I'm not sure I have time to make bug reports for them right now.  I might be convinced to fast track them if someone wants to make legitimate effort towards fixing them, otherwise I'll eventually get around to writing the reports and/or making PRs someday.

Examples:

-- Calls to malloc in the CTFE path cause some regexes to fail at compile time.  I suspect this happens due to the Captures (n > smallString) condition when the number of possible captures is greater than 3, but I haven't tested it (time consuming...).

-- I remember being unable to iterate over named captures.  But I'm not confident that I'm remembering this correctly, and I'm not sure if it's still true.

-- The Captures struct does not specify what value is returned for submatches that were in the branch of an alternation that wasn't taken or in a repetition that matched 0 or more than 1 times.

-- The Captures struct does not seem to have a way to access all of the strings matched by a submatch in repetition context, not to mention nested repetition contexts.


I'm not sure how much those mentions help without proper bug reports, but at least I got it off my chest (for the time being) without having to spend my whole weekend writing bug reports ;)

...

Dmitry, I appreciate your working towards making the regex module easier to work on.  Thanks.

...

I'm curious what you're thinking about when you mention something ambitious like writing a new GC :)
(like this https://imgur.com/cWa4evD)

I can't help but fantasize about cheap ways to get GC allocations to parallelize well and not end up writing an entire generational collector!  But I doubt I'll ever have the opportunity to work on such things.  I hope your GC attempt works out!

September 10, 2017
On Sunday, 10 September 2017 at 00:16:10 UTC, Chad Joan wrote:
> On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky wrote:
>> My burndown list for std.regex:
>>
>> https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr
>> ...
>
> I was working on std.regex a bit myself, so I created this bug report to capture some of the findings/progress:
> https://issues.dlang.org/show_bug.cgi?id=17820
>
> It seems like something you might be interested in, or might even have a small chance of fixing in the course of other things.

Yeah, well known problem. Solution is to keep a bit of memory cached eg  in TLS variable.

>
> ...
>
> There are other regex improvements I might be interested in, but I'm not sure I have time to make bug reports for them right now.  I might be convinced to fast track them if someone wants to make legitimate effort towards fixing them, otherwise I'll eventually get around to writing the reports and/or making PRs someday.
>
> Examples:
>
> -- Calls to malloc in the CTFE path cause some regexes to fail at compile time.  I suspect this happens due to the Captures (n
> > smallString) condition when the number of possible captures
> is greater than 3, but I haven't tested it (time consuming...).
>

Sholudn't be a problem, but please report an example.

> -- I remember being unable to iterate over named captures.  But I'm not confident that I'm remembering this correctly, and I'm not sure if it's still true.
>

Would be nice and simple enhancement.

> -- The Captures struct does not specify what value is returned for submatches that were in the branch of an alternation that wasn't taken or in a repetition that matched 0 or more than 1 times.

As every engine out there the value is "", empty string.

>
> -- The Captures struct does not seem to have a way to access all of the strings matched by a submatch in repetition context, not to mention nested repetition contexts.
>

Just like any other regex library.

>
> I'm not sure how much those mentions help without proper bug reports, but at least I got it off my chest (for the time being) without having to spend my whole weekend writing bug reports ;)
>

Well they are warmly welcome shouldypu get to it.

> ...
>
> Dmitry, I appreciate your working towards making the regex module easier to work on.  Thanks.
>
> ...
>
> I'm curious what you're thinking about when you mention something ambitious like writing a new GC :)
> (like this https://imgur.com/cWa4evD)
>
> I can't help but fantasize about cheap ways to get GC allocations to parallelize well and not end up writing an entire generational collector!

ThreadCache can go a long way to help that.

> But I doubt I'll ever have the opportunity to work on such things.  I hope your GC attempt works out!

Me too. It's won't be trivial effort though.


September 10, 2017
On Sunday, 10 September 2017 at 11:47:02 UTC, Dmitry Olshansky wrote:
> On Sunday, 10 September 2017 at 00:16:10 UTC, Chad Joan wrote:
>> I was working on std.regex a bit myself, so I created this bug report to capture some of the findings/progress:
>> https://issues.dlang.org/show_bug.cgi?id=17820
>>
>> It seems like something you might be interested in, or might even have a small chance of fixing in the course of other things.
>
> Yeah, well known problem. Solution is to keep a bit of memory cached eg  in TLS variable.
>

Indeed.

Is there another issue I can mark it as a duplicate of?

>>
>> [...]
>> -- The Captures struct does not specify what value is returned for submatches that were in the branch of an alternation that wasn't taken or in a repetition that matched 0 or more than 1 times.
>
> As every engine out there the value is "", empty string.
>

I usually don't refer to other libraries while using a library.  If an API doesn't define something, then it is, by definition, undefined behavior, and thus quite undesirable to rely upon.

This one seems pretty easy to fix though.  I will probably make a documentation PR at some point.

>>
>> -- The Captures struct does not seem to have a way to access all of the strings matched by a submatch in repetition context, not to mention nested repetition contexts.
>>
>
> Just like any other regex library.
>

Counterexample: https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.captures(v=vs.110).aspx#code-snippet-3

I actually have a strong interest in this.  And not because I need to write regular expressions that extract lists of patterns all the time (well, it might've happened).  More importantly: this would make it easier to integrate Phobos' regex engine into a parser generator framework.  Current plans involve regular expression + parsing expression grammars.  I'm pretty sure it is possible to mechanically convert a subset of PEGs into Regexes and gain some useful optimizations, but this requires granular control over regular expression captures to be able to extract the text matched by the original PEG symbols.

>>
>> I'm not sure how much those mentions help without proper bug reports, but at least I got it off my chest (for the time being) without having to spend my whole weekend writing bug reports ;)
>>
>
> Well they are warmly welcome shouldypu get to it.
>

Thanks!

>> ...
>>
>> Dmitry, I appreciate your working towards making the regex module easier to work on.  Thanks.
>>
>> ...
>>
>> I'm curious what you're thinking about when you mention something ambitious like writing a new GC :)
>> (like this https://imgur.com/cWa4evD)
>>
>> I can't help but fantasize about cheap ways to get GC allocations to parallelize well and not end up writing an entire generational collector!
>
> ThreadCache can go a long way to help that.
>

Google didn't help me with this one.  Any chance I could get a link?

>> But I doubt I'll ever have the opportunity to work on such things.  I hope your GC attempt works out!
>
> Me too. It's won't be trivial effort though.

Good luck!
September 10, 2017
On Sunday, 10 September 2017 at 18:54:21 UTC, Chad Joan wrote:
> On Sunday, 10 September 2017 at 11:47:02 UTC, Dmitry Olshansky wrote:
>>
>> Yeah, well known problem. Solution is to keep a bit of memory cached eg  in TLS variable.
>>
>
> Indeed.
>
> Is there another issue I can mark it as a duplicate of?

No it's just somthing that showed up in a number of benchmarks people posted, never got around to it. So just file it so I'll have an issue to tick once I fix this.

>
>>>
>>> [...]
>>> -- The Captures struct does not specify what value is returned for submatches that were in the branch of an alternation that wasn't taken or in a repetition that matched 0 or more than 1 times.
>>
>> As every engine out there the value is "", empty string.
>>
>
> I usually don't refer to other libraries while using a library.
>  If an API doesn't define something, then it is, by definition, undefined behavior, and thus quite undesirable to rely upon.

In many way we just copy ECMAScript regex semantics, with some extensions from Python and Perl.

>
> This one seems pretty easy to fix though.  I will probably make a documentation PR at some point.
>

Please do.
>>>
>>> -- The Captures struct does not seem to have a way to access all of the strings matched by a submatch in repetition context, not to mention nested repetition contexts.
>>>
>>
>> Just like any other regex library.
>>
>
> Counterexample: https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.captures(v=vs.110).aspx#code-snippet-3
>

Horrible, horrible idea. Performance-wise that is.

> I actually have a strong interest in this.  And not because I need to write regular expressions that extract lists of patterns all the time (well, it might've happened).  More importantly: this would make it easier to integrate Phobos' regex engine into a parser generator framework.

No-no-no. Please don't.

> Current plans involve regular expression + parsing expression grammars.  I'm pretty sure it is possible to mechanically convert a subset of PEGs into Regexes and gain some useful optimizations, but this requires granular control over regular expression captures to be able to extract the text matched by the original PEG symbols.

This is heavily misguided, but a common idea. PEGs are actually way simpler then regex. PEGs * and + have nothing to do with regex * and + qualifiers.

In PEG [ab]*b will never match because [ab]* eats any sequence of a-s and b-s and never backtracks.

In regex [ab]* will match b, ab, bb, .... because regex 'backtracks'.

Thus one can easily see that PEGs constructs are trivial to implement, and provide quite a number of optimizations as well.

I'd argue that PEG can be made to run faster in 'parse' scenario whereas nothing beats _simple_ regex in 'search' scenario.

>> ThreadCache can go a long way to help that.
>>
>
> Google didn't help me with this one.  Any chance I could get a link?
The manpage for jemalloc mentions it explicitly search for tcache. Also look through jemalloc and related papers, I think they all call it thread cache.

 https://linux.die.net/man/3/jemalloc

In essene you keep a small cache of free memory in TLS to serve next allocations faster. Some suggest a per-cpu cache, which is bit trickier but also interestingly avoid contention.

>>> things.  I hope your GC attempt works out!
>>
>> Me too. It's won't be trivial effort though.
>
> Good luck!

Thanks!