[Issue 10772] std.regex.splitter generates spurious empty elements with empty delimiter
December 18, 2013
https://d.puremagic.com/issues/show_bug.cgi?id=10772


Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh@gmail.com


--- Comment #1 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2013-12-18 13:10:33 PST ---
(In reply to comment #0)
> CODE:
> --------
> void main() {
>     import std.string, std.stdio, std.regex;
>     string s = "test";
>     writeln(std.regex.splitter(s.toUpper, regex("")));
> }
> --------
> 
> Output:
> --------
> ["", "T", "E", "S", "T", ""]
> --------
> 
> The first and last empty elements should not be included in the result. Cf.
> Perl's split(//, "test").

The matter is more or less trivial; the only question is what you actually want me to do.

No matter how I read this passage:
http://perldoc.perl.org/functions/split.html
I can only gather that a 0-width match at the front of the input is never
produced (okay..). What the heck must be happening with the one at the end
isn't clear to me at all.

Given that the following test produces ["T", "E", "S", "T"], we may just ignore 0-width matches at both ends to be in line with it. The behaviour needs to be documented, though. Anyway, does that seem good?

void main() {
    import std.string, std.algorithm, std.stdio;//, std.regex;
    string s = "test";
    writeln(splitter(s.toUpper, ""));
}
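
To make the "ignore 0-width matches at both ends" idea concrete, here is a minimal sketch. It is not the Phobos implementation: `Span` and `splitIgnoringEdgeEmptyMatches` are hypothetical names, and the match positions are supplied by hand rather than produced by std.regex.

```d
import std.stdio : writeln;

// Hypothetical helper type for illustration; not part of Phobos.
// Describes one match as the half-open range [begin, end).
struct Span { size_t begin, end; }

// Splits s at the given matches, skipping any 0-width match that sits
// at the very start or the very end of the input.
string[] splitIgnoringEdgeEmptyMatches(string s, const Span[] matches)
{
    string[] pieces;
    size_t last = 0; // end of the previous accepted match
    foreach (m; matches)
    {
        immutable zeroWidth = m.begin == m.end;
        immutable atEdge = m.begin == 0 || m.begin == s.length;
        if (zeroWidth && atEdge)
            continue; // behave as if there were no match here
        pieces ~= s[last .. m.begin];
        last = m.end;
    }
    pieces ~= s[last .. $]; // remainder after the last accepted match
    return pieces;
}

void main()
{
    // Assumed input: empty matches at offsets 0 through 4 of "TEST".
    auto matches = [Span(0, 0), Span(1, 1), Span(2, 2), Span(3, 3), Span(4, 4)];
    writeln(splitIgnoringEdgeEmptyMatches("TEST", matches)); // ["T", "E", "S", "T"]
}
```

With the two edge matches skipped, only the matches at offsets 1, 2 and 3 split the input, which gives the same four pieces as the std.algorithm test above.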

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
December 18, 2013
https://d.puremagic.com/issues/show_bug.cgi?id=10772



--- Comment #2 from hsteoh@quickfur.ath.cx 2013-12-18 14:28:29 PST ---
$ perl -e'print join(":", split(//, "test")), "\n";'
t:e:s:t
$


So yes, I expect std.regex.splitter to return ["t", "e", "s", "t"] when the delimiter is empty.

December 19, 2013
https://d.puremagic.com/issues/show_bug.cgi?id=10772



--- Comment #3 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2013-12-19 08:51:38 PST ---
(In reply to comment #2)
> $ perl -e'print join(":", split(//, "test")), "\n";'
> t:e:s:t
> $
> 
> 
> So yes, I expect std.regex.splitter to return ["t", "e", "s", "t"] when the delimiter is empty.

But then there is this:
perl -e'print join(":", split(/, /, ", test, ")), "\n";'
:test

That makes no sense to me whatsoever.

We have tests already (always had, even before 2.056) that state the opposite.
In fact they make sure that both zero-width pieces are found (at start and at
end).
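
For reference, the two Perl results can be reproduced with a small model of the rules as perldoc's split page describes them for the default case (no LIMIT): a 0-width match at the start of the input never produces a field, a positive-width match at the start does produce an empty leading field, and empty trailing fields are stripped. This is only a sketch of that reading, not Perl's implementation; `Span` and `perlLikeSplit` are hypothetical names, and the match positions are written out by hand.

```d
import std.array : join;
import std.stdio : writeln;

// Hypothetical helper type for illustration; half-open range [begin, end).
struct Span { size_t begin, end; }

// Models the default (no LIMIT) field rules described in perldoc -f split.
string[] perlLikeSplit(string s, const Span[] matches)
{
    string[] fields;
    size_t last = 0; // end of the previous accepted match
    foreach (m; matches)
    {
        // A 0-width match at the very start never produces a field.
        if (m.begin == 0 && m.end == 0)
            continue;
        fields ~= s[last .. m.begin];
        last = m.end;
    }
    fields ~= s[last .. $];
    // Empty trailing fields are stripped; an empty leading field is kept.
    while (fields.length && fields[$ - 1].length == 0)
        fields = fields[0 .. $ - 1];
    return fields;
}

void main()
{
    // ", " matches at offsets 0..2 and 6..8 of ", test, ".
    writeln(perlLikeSplit(", test, ", [Span(0, 2), Span(6, 8)]).join(":")); // :test

    // An empty pattern matches at offsets 0 through 4 of "test".
    auto empties = [Span(0, 0), Span(1, 1), Span(2, 2), Span(3, 3), Span(4, 4)];
    writeln(perlLikeSplit("test", empties).join(":")); // t:e:s:t
}
```

Under this reading the leading ":" in ":test" comes from the positive-width match at offset 0, while the field after the final ", " is empty and is therefore dropped.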

December 19, 2013
https://d.puremagic.com/issues/show_bug.cgi?id=10772


Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |pull


--- Comment #4 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2013-12-19 10:11:20 PST ---
https://github.com/D-Programming-Language/phobos/pull/1790

My compromise is to follow std.algorithm's splitter and special-case 0-width matches as if they were 0-width needles.

December 19, 2013
https://d.puremagic.com/issues/show_bug.cgi?id=10772


monarchdodra@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |monarchdodra@gmail.com


--- Comment #5 from monarchdodra@gmail.com 2013-12-19 10:18:12 PST ---
Arguably, 0-width splitting makes no sense: If it were to rigorously follow the
rules, then you'd simply end up with an infinite amount of leading tokens.
["", "T", "E", "S", "T", ""]
Makes no sense to me. Why is there an empty leading/trailing token, but none
between each letter?

This means that in regards to 0-length splitting, it should either be an *error*, or have a *special behavior*

std.algorithm.split simply special-cases this to do what seems most useful (what is documented by Perl, AFAIK). I think having regex do the same is most sensible.

December 19, 2013
https://d.puremagic.com/issues/show_bug.cgi?id=10772



--- Comment #6 from monarchdodra@gmail.com 2013-12-19 10:22:46 PST ---
(In reply to comment #5)
> Arguably, 0-width splitting makes no sense: If it were to rigorously follow the
> rules, then you'd simply end up with an infinite amount of leading tokens.
> ["", "T", "E", "S", "T", ""]
> Makes no sense to me. Why is there an empty leading/trailing token, but none
> between each letter?
> 
> This means that in regards to 0-length splitting, it should either be an *error*, or have a *special behavior*
> 
> std.algorithm.split simply special-cases this to do what seems most useful (what is documented by Perl, AFAIK). I think having regex do the same is most sensible.

Hum... Actually, now that I think about it, std.algorithm's splitter works on a constant-length separator. I'm not sure how you'd handle a separator that may or may not be empty...

See the examples in your pull.

December 19, 2013
https://d.puremagic.com/issues/show_bug.cgi?id=10772



--- Comment #7 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2013-12-19 10:48:49 PST ---
(In reply to comment #5)
> Arguably, 0-width splitting makes no sense: If it were to rigorously follow the
> rules, then you'd simply end up with an infinite amount of leading tokens.
> ["", "T", "E", "S", "T", ""]
> Makes no sense to me. Why is there an empty leading/trailing token, but none
> between each letter?
> 
> This means that in regards to 0-length splitting, it should either be an *error*, or have a *special behavior*

There is already a special case: a 0-width regular expression match advances the input by one code point. It's exactly this special case that produces the string above, so there is nothing to worry about.
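
A minimal sketch of how that advance-by-one-code-point rule leads to the output above. It models an empty pattern only and is not the std.regex code; `splitOnEmptyMatches` is a hypothetical name.

```d
import std.stdio : writeln;
import std.utf : stride;

// Hypothetical model: the empty pattern matches with width 0 at every
// position, including the end of input, and after each such match the
// scanner is forced forward by one code point.
string[] splitOnEmptyMatches(string s)
{
    string[] pieces;
    size_t last = 0; // end of the previous match
    size_t pos = 0;  // where the next match is attempted
    while (true)
    {
        // 0-width match at pos: emit everything since the previous match.
        pieces ~= s[last .. pos];
        last = pos;
        if (pos == s.length)
            break;
        pos += stride(s, pos); // forced advance by one code point
    }
    pieces ~= s[last .. $]; // remainder after the final match (empty here)
    return pieces;
}

void main()
{
    writeln(splitOnEmptyMatches("TEST")); // ["", "T", "E", "S", "T", ""]
}
```

The match at offset 0 emits the leading "", and the match at the end of input is followed by an empty remainder, which is the trailing "". Between letters each match is separated from the previous one by exactly one code point, so no empty piece appears there.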

> 
> std.algorithm.split simply special-cases this to do what seems most useful (what is documented by Perl, AFAIK). I think having regex do the same is most sensible.

Adding a special case on top of a special case is kind of bad. The more I look at this, the clearer it is to me that we either have to decipher Perl's behaviour or give up.
