regex issue (page 2)

March 19, 2012
On Monday, 19 March 2012 at 13:27:03 UTC, Jay Norwood wrote:
> ok, global.  So the document implies that I should be able to get a single match object with a count of the submatches.  So I think maybe I've jumped to the wrong conclusion about how to use it, thinking I could just use "\n" and the "g" flag to get all the matches for the range of "\n".  So it looks like instead that the term "submatches" needs more explanation.  What exactly constitutes a submatch?  I inferred it just meant any single match among many.
>
>   //create static regex at compile-time, contains fast native code
>   enum ctr = ctRegex!(`^.*/([^/]+)/?$`);
>
>   //works just like normal regex:
>   auto m2 = match("foo/bar", ctr);   //first match found here if any
>   assert(m2);   // be sure to check if there is a match, before examining contents!
>   assert(m2.captures[1] == "bar");//captures is a range of submatches, 0 - full match
>
>
> btw, I couldn't get this \p option to work for the uni properties.  Can you provide some example of that which works?
>
> \p{PropertyName}  Matches character that belongs to unicode PropertyName set. Single letter abbreviations could be used without surrounding {,}.


So, to answer my own question, it appears that the (regex) portion is what is considered a submatch and gets counted.

So counting lines would take a pattern that has a (\n) in it, although I'll have to figure out exactly what that will be.


(regex)  Matches subexpression regex, saving matched portion of text for later retrieval.
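
To make the "submatch equals capture group" reading concrete, here is a minimal sketch using the same pattern as the quoted documentation example. It uses a run-time `regex` rather than `ctRegex`, and `lastComponent` is a hypothetical helper name, not something from std.regex:

```d
import std.regex;
import std.stdio;

// Hypothetical helper: returns the last path component, or null if no match.
string lastComponent(string path)
{
    auto re = regex(`^.*/([^/]+)/?$`); // exactly one (...) capture group
    auto m = match(path, re);
    if (m.empty)
        return null;
    // captures[0] is the whole match, captures[1] is the first (...) group
    return m.captures[1];
}

void main()
{
    writeln(lastComponent("foo/bar"));
    auto m = match("foo/bar", regex(`^.*/([^/]+)/?$`));
    // The number of captures is fixed by the pattern: full match plus one group.
    writeln(m.captures.length);
}
```

The number of captures depends only on how many groups the pattern has, not on how many times the pattern matches the input.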




March 19, 2012
On 19.03.2012 17:27, Jay Norwood wrote:
> On Monday, 19 March 2012 at 08:05:18 UTC, Dmitry Olshansky wrote:
>> As I said in the main D group, that's wrong: regex doesn't only count
>> matches. It finds slices that match.
>> Thus, to be more efficient, it returns a lazy range that searches
>> on request. "g" means global :)
>> Then code like this is cool and fast:
>> foreach(m; match(input, ctr))
>> {
>> if(m.hit == "magic we are looking for")
>> break; // <<< ---- no greedy find it all syndrome
>> }
>>
>
> ok, global. So the document implies that I should be able to get a
> single match object with a count of the submatches. So I think maybe
> I've jumped to the wrong conclusion about how to use it, thinking I
> could just use "\n" and the "g" flag to get all the matches for the range
> of "\n". So it looks like instead that the term "submatches" needs more
> explanation. What exactly constitutes a submatch? I inferred it just
> meant any single match among many.

Maybe replacing "submatch" with "capture" would help. But I thought it was easy to see that any subexpression in a regex, e.g. "(\w+)", is captured into a submatch. Are you aware that sub-expressions in a regex are also extracted from the text?

>
> //create static regex at compile-time, contains fast native code
> enum ctr = ctRegex!(`^.*/([^/]+)/?$`);
>
> //works just like normal regex:
> auto m2 = match("foo/bar", ctr); //first match found here if any
> assert(m2); // be sure to check if there is a match, before examining
> contents!
> assert(m2.captures[1] == "bar");//captures is a range of submatches, 0 -
> full match

BTW, the example above should make it clear what captures are.

>
>
> btw, I couldn't get this \p option to work for the uni properties. Can
> you provide some example of that which works?
>
> \p{PropertyName} Matches character that belongs to unicode PropertyName
> set. Single letter abbreviations could be used without surrounding {,}.
>

Ouch, I see that the docs are no good :)
But well, they are reference-like anyway; you might want to take a look at a longer, more thorough overview:
http://blackwhale.github.com/regular-expression.html


-- 
Dmitry Olshansky
March 19, 2012
On 19.03.2012 17:39, Jay Norwood wrote:
> On Monday, 19 March 2012 at 13:27:03 UTC, Jay Norwood wrote:
>> ok, global. So the document implies that I should be able to get a
>> single match object with a count of the submatches. So I think maybe
>> I've jumped to the wrong conclusion about how to use it, thinking I
>> could just use "\n" and the "g" flag to get all the matches for the range
>> of "\n". So it looks like instead that the term "submatches" needs
>> more explanation. What exactly constitutes a submatch? I inferred it
>> just meant any single match among many.
>>
>> //create static regex at compile-time, contains fast native code
>> enum ctr = ctRegex!(`^.*/([^/]+)/?$`);
>>
>> //works just like normal regex:
>> auto m2 = match("foo/bar", ctr); //first match found here if any
>> assert(m2); // be sure to check if there is a match, before examining
>> contents!
>> assert(m2.captures[1] == "bar");//captures is a range of submatches, 0
>> - full match
>>
>>
>> btw, I couldn't get this \p option to work for the uni properties. Can
>> you provide some example of that which works?
>>
>> \p{PropertyName} Matches character that belongs to unicode
>> PropertyName set. Single letter abbreviations could be used without
>> surrounding {,}.
>
>
> so, to answer my own question, it appears that the (regex) is the
> portion that is considered a submatch that gets counted.
>
> so counting lines would be something that has a (\n) in it, although
> I'll have to figure out what that will be exactly.

That's right; however, counting is completely separate from regex. You'd want to use count from std.algorithm:
count(match(....,"\n"));

or more unicode-friendly:
count(match(...., regex("$","m"))); //note the multi-line flag

Also observe that there is simply no way to get more than a constant number of submatches.

>
> (regex) Matches subexpression regex, saving matched portion of text for
> later retrieval.
>

An example of unicode properties:
\p{WhiteSpace} matches any unicode whitespace char


-- 
Dmitry Olshansky
March 19, 2012
On Monday, 19 March 2012 at 13:55:39 UTC, Dmitry Olshansky wrote:
> That's right; however, counting is completely separate from regex. You'd want to use count from std.algorithm:
> count(match(....,"\n"));
>
> or more unicode-friendly:
> count(match(...., regex("$","m"))); //note the multi-line flag

This only sets l_cnt to 1

void wcp_cnt_match1 (string fn)
{
	string input = cast(string)std.file.read(fn);
	enum ctr =  ctRegex!("$","m");
	ulong l_cnt = std.algorithm.count(match(input,ctr));
}

This works ok, but though concise it is not very fast

void wcp (string fn)
{
	string input = cast(string)std.file.read(fn);
	ulong l_cnt = std.algorithm.count(input,"\n");
}


>
> Also observe that there is simply no way to get more than a constant number of submatches.
>
>>
>> (regex) Matches subexpression regex, saving matched portion of text for
>> later retrieval.
>>
>
> An example of unicode properties:
> \p{WhiteSpace} matches any unicode whitespace char

This fails to build, so I'd guess \p support is missing.

void wcp (string fn)
{
	enum ctr =  ctRegex!("\p{WhiteSpace}","m");
}

------ Build started: Project: a7, Configuration: Release Win32 ------
Building Release\a7.exe...
a7.d(210): undefined escape sequence \p

Building Release\a7.exe failed!
Details saved as "file://G:\d\a7\a7\Release\a7.buildlog.html"
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========


March 20, 2012
On Monday, 19 March 2012 at 19:24:30 UTC, Jay Norwood wrote:
> This fails to build, so I'd guess \p support is missing.
>
> void wcp (string fn)
> {
> 	enum ctr =  ctRegex!("\p{WhiteSpace}","m");
> }
>
> ------ Build started: Project: a7, Configuration: Release Win32
> ------
> Building Release\a7.exe...
> a7.d(210): undefined escape sequence \p
>
> Building Release\a7.exe failed!
> Details saved as "file://G:\d\a7\a7\Release\a7.buildlog.html"
> ========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped
> ==========

So I tried something a little different, and this apparently gets further along to another error message.  But it looks like at this point it decides that the unicode properties are not available at compile time...


void wcp_bug_no_p(string fn)
{
	enum ctr =  ctRegex!(r"\p{WhiteSpace}","m");
}


------ Build started: Project: a7, Configuration: Debug Win32 ------
Building Debug\a7.exe...
G:\d\dmd2\windows\bin\..\..\src\phobos\std\regex.d(786): Error: static variable unicodeProperties cannot be read at compile time
G:\d\dmd2\windows\bin\..\..\src\phobos\std\regex.d(786):        called from here: assumeSorted(unicodeProperties)
G:\d\dmd2\windows\bin\..\..\src\phobos\std\regex.d(1937):        called from here: getUnicodeSet(result[0u..k],negated,cast(bool)(this.re_flags & cast(RegexOption)2u))
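
Since the error above comes from evaluating the Unicode tables at compile time, one possible workaround is to build the same pattern at run time with `regex` instead of `ctRegex`. This is only a sketch under that assumption (whether `ctRegex` still has this limitation depends on the compiler and Phobos version), and `countWhitespace` is a made-up helper name:

```d
import std.regex;

// Hypothetical helper: counts Unicode whitespace characters in s.
size_t countWhitespace(string s)
{
    // regex() compiles the pattern at run time, so the Unicode property
    // tables do not have to be readable during compilation.
    // The r"..." string keeps the backslash for the regex engine.
    auto re = regex(r"\p{WhiteSpace}", "g");
    size_t n = 0;
    foreach (m; match(s, re))
        ++n;
    return n;
}

void main()
{
    import std.stdio : writeln;
    writeln(countWhitespace("a b\tc\n"));
}
```

In real code the compiled regex would normally be built once and reused rather than rebuilt on every call.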



March 20, 2012
On 19.03.2012 23:24, Jay Norwood wrote:
> On Monday, 19 March 2012 at 13:55:39 UTC, Dmitry Olshansky wrote:
>> That's right; however, counting is completely separate from regex.
>> You'd want to use count from std.algorithm:
>> count(match(....,"\n"));
>>
>> or more unicode-friendly:
>> count(match(...., regex("$","m"))); //note the multi-line flag
>
Ehm, I forgot the "g" flag myself, so it should be

count(match(...., regex("$","gm")));

and

count(match(...., regex("\n","g")));
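
For reference, a self-contained sketch of the "\n" variant with the "g" flag, assuming a Phobos version where `match` with flags is available; `countNewlines` is a hypothetical name used only for illustration:

```d
import std.algorithm : count;
import std.regex;

// Hypothetical helper: counts "\n" matches in input via a global regex.
size_t countNewlines(string input)
{
    // With "g", match() yields every match lazily; without it, the range
    // holds only the first match, so count() would return at most 1.
    return count(match(input, regex("\n", "g")));
}

void main()
{
    import std.stdio : writeln;
    writeln(countNewlines("a\nb\nc\n"));
}
```

The missing "g" flag is consistent with the earlier observation that the count came out as 1.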

Note that if your task is to split a buffer on exactly the '\n' byte, then a loop with memchr is about as fast as it gets; no amount of magic compiler optimization would make other, more generic ways better (even theoretically). What optimizations *could* do is narrow the difference.
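
A minimal sketch of such a memchr loop, using the C binding in `core.stdc.string`; the function name `countLinesMemchr` is an assumption for illustration, and no timing claims are made here:

```d
import core.stdc.string : memchr;

// Hypothetical helper: counts '\n' bytes in buf by repeated memchr calls.
size_t countLinesMemchr(const(char)[] buf)
{
    if (buf.length == 0)
        return 0;

    size_t n = 0;
    const(char)* p = buf.ptr;
    const(char)* end = p + buf.length;
    while (p < end)
    {
        // Search the remaining bytes for the next newline.
        auto hit = cast(const(char)*) memchr(p, '\n', cast(size_t)(end - p));
        if (hit is null)
            break;      // no more newlines
        ++n;
        p = hit + 1;    // continue just past the newline found
    }
    return n;
}

void main()
{
    import std.stdio : writeln;
    writeln(countLinesMemchr("a\nb\nc\n"));
}
```

This works on raw bytes, which is safe for '\n' in UTF-8 because that byte never appears inside a multi-byte sequence.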

> This only sets l_cnt to 1
>
> void wcp_cnt_match1 (string fn)
> {
> string input = cast(string)std.file.read(fn);
> enum ctr = ctRegex!("$","m");
> ulong l_cnt = std.algorithm.count(match(input,ctr));
> }
>
> This works ok, but though concise it is not very fast
>
> void wcp (string fn)
> {
> string input = cast(string)std.file.read(fn);
> ulong l_cnt = std.algorithm.count(input,"\n");
> }
>
>

BTW, I suggest separating I/O from the actual work or, better yet, timing both separately via std.datetime.StopWatch.
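
One way to time the two phases separately might look like the sketch below. It assumes a current Phobos, where `StopWatch` lives in `std.datetime.stopwatch` (the older `std.datetime.StopWatch` named above has a different interface); `timed` is a made-up helper and the file name comes from the command line:

```d
import core.time : Duration;
import std.datetime.stopwatch : StopWatch, AutoStart;

// Hypothetical helper: runs dg once and returns the elapsed time.
Duration timed(scope void delegate() dg)
{
    auto sw = StopWatch(AutoStart.yes);
    dg();
    return sw.peek();
}

void main(string[] args)
{
    import std.algorithm : count;
    import std.file : read;
    import std.stdio : writefln;

    if (args.length < 2)
        return;

    string input;
    size_t lines;
    // Phase 1: file I/O only.
    auto ioTime = timed(delegate() { input = cast(string) read(args[1]); });
    // Phase 2: the actual line counting on the in-memory buffer.
    auto workTime = timed(delegate() { lines = count(input, '\n'); });

    writefln("read: %s, count: %s, lines: %s", ioTime, workTime, lines);
}
```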

> This fails to build, so I'd guess \p support is missing.
>
> void wcp (string fn)
> {
> enum ctr = ctRegex!("\p{WhiteSpace}","m");
> }
>
> ------ Build started: Project: a7, Configuration: Release Win32
> ------
> Building Release\a7.exe...
> a7.d(210): undefined escape sequence \p
>

Not a bug; \p is being treated as a string-literal escape sequence by the compiler.
How do you think \n works in your non-regex examples ? ;)
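
A small illustration of the point, as a sketch: in a double-quoted D literal the compiler processes backslash escapes itself, while WYSIWYG strings (`r"..."` or backquotes) pass the backslash through untouched, which is what a regex pattern needs.

```d
import std.stdio;

void main()
{
    // The compiler turns "\n" into a single newline character.
    assert("\n".length == 1);

    // In a WYSIWYG string the backslash survives, so these are the same text:
    assert(r"\p{WhiteSpace}" == "\\p{WhiteSpace}");
    assert(`\p{WhiteSpace}`.length == 14);

    // "\p{WhiteSpace}" in double quotes would not compile:
    // \p is not an escape sequence the compiler knows.
    writeln(r"\p{WhiteSpace}");
}
```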


-- 
Dmitry Olshansky
March 20, 2012
On Tuesday, 20 March 2012 at 10:28:11 UTC, Dmitry Olshansky wrote:
> Note that if your task is to split a buffer on exactly the '\n' byte, then a loop with memchr is about as fast as it gets; no amount of magic compiler optimization would make other, more generic ways better (even theoretically). What optimizations *could* do is narrow the difference.
>

ok, I'll use memchr.

>> This works ok, but though concise it is not very fast
>>
>> void wcp (string fn)
>> {
>> string input = cast(string)std.file.read(fn);
>> ulong l_cnt = std.algorithm.count(input,"\n");
>> }
>>
>>
>
> BTW, I suggest separating I/O from the actual work or, better yet, timing both separately via std.datetime.StopWatch.

I'm timing with the stopwatch.  I have separate functions where I've measured an empty function and just the file reads with an empty loop, so I can see the deltas.  All of these are executed inside a parallel foreach loop, so 7 threads are reading different files, and since that is the end target, the overall measurement in that context is more meaningful to me.  File I/O is on the order of 25ms for chunk reads or 30ms for full-file reads in these results, as the full test reads about 20MB in total from a 510 series SSD on SATA3.  The reads are done in parallel by the threads in the thread pool.  Each file is 2MB.  So any total times you see in my comments are for 10 tasks executed in a parallel foreach loop, with the file read portion previously timed at around 30ms.
>
>> This fails to build, so I'd guess \p support is missing.
>>
>> void wcp (string fn)
>> {
>> enum ctr = ctRegex!("\p{WhiteSpace}","m");
>> }
>>
>> ------ Build started: Project: a7, Configuration: Release Win32
>> ------
>> Building Release\a7.exe...
>> a7.d(210): undefined escape sequence \p
>>
>
> Not a bug; \p is being treated as a string-literal escape sequence by the compiler.
> How do you think \n works in your non-regex examples ? ;)

yes, thanks.  I read your other link and that was helpful.   I think I presumed that the escape handling was something belonging to stdio, while regex would have its own valid escapes that would include \p.  But I see now that the string literals have their own set of escapes.

March 20, 2012
On 21 March 2012 04:26, Jay Norwood <jayn@prismnet.com> wrote:
> yes, thanks.  I read your other link and that was helpful.   I think I presumed that the escape handling was something belonging to stdio, while regex would have its own valid escapes that would include \p.  But I see now that the string literals have their own set of escapes.

Can you imagine the madness if escapes were specific to stdio, or some other library! "Ok, and I'll just send this newline over the network... Dammit, std.network doesn't escape \n". It also means that you have perfect consistency between usages of strings, with no strange alternative meanings for the same escape sequence...

--
James Miller