August 02, 2012
Re: Let's stop parser Hell
On Thursday, August 02, 2012 07:06:25 Christophe Travert wrote:
> "Jonathan M Davis" , dans le message (digitalmars.D:173942), a écrit :
> > It may very well be a good idea to templatize Token on range type. It
> > would be nice not to have to templatize it, but that may be the best
> > route to go. The main question is whether str is _always_ a slice (or the
> > result of takeExactly) of the original range. I _think_ that it is, but
> > I'd have to make sure of that. If it's not and can't be for whatever
> > reason, then that poses a problem.
> 
> It can't if it is a simple input range! Like a file read with most
> 'lazy' methods. Then you need either to transform the input range in a
> forward range using a range adapter that performs buffering, or perform
> your own buffering internally. You also have to decide how long the
> token will be valid (I believe if you want lexing to be blazing fast,
> you don't want to allocate for each token).

My lexer specifically requires a forward range. The more that I deal with input 
ranges, the more I'm convinced that they're nearly useless. If you need 
even _one_ character of lookahead, then an input range doesn't fly at all, and 
considering the performance gains from using slicing (or takeExactly), I just 
don't think that it makes sense to operate on an input range. Besides, if 
anyone wants full performance, they'll probably need to use one of the built-
in string types. Any range of dchar which needs to decode on the call to front 
or popFront will take a serious performance hit. It'll work, but it's not 
really advisable if you need performance.
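
As a sketch of the lookahead point (hypothetical helper, not the actual 
lexer): a forward range's save() gives an independent checkpoint, which is 
exactly what even one character of lookahead needs, and which a plain input 
range cannot provide.

import std.range.primitives;

// Hypothetical sketch. save() yields an independent copy of the range,
// so we can peek ahead without consuming 'input'.
bool startsWithLineComment(R)(R input)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    auto peek = input.save;           // checkpoint; 'input' is untouched
    if (peek.empty || peek.front != '/') return false;
    peek.popFront();                  // one character of lookahead
    return !peek.empty && peek.front == '/';
}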

> Also, you said in this thread that you only need to consider ASCII
> characters in the lexer because non-ASCII characters are only used in
> non-keyword identifiers. That is not entirely true: EndOfLine defines 2
> non-ASCII characters, namely LINE SEPARATOR and PARAGRAPH SEPARATOR.
>   http://dlang.org/lex.html#EndOfLine
>   Maybe they should be dropped, since other non-ASCII whitespace is not
> supported. You may want the line count to be consistent with other
> programs. I don't know what text-processing programs usually consider an
> end of line.

I'm well aware of all of that. Almost all of the lexer can operate entirely on 
ASCII, with a few exceptions, and even in some of those cases, decoding isn't 
required (e.g. lineSep and paraSep can be dealt with as code units rather than 
having to decode to compare against them). The lexer that I'm writing will 
follow the spec. And any lexer that wants to get into Phobos will need to do 
the same. So, stuff like lineSep and the end of file characters that the spec 
has will be supported.
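
To make the code-unit idea concrete, a hedged sketch (the helper name is 
mine, not from the lexer): lineSep (U+2028) and paraSep (U+2029) encode in 
UTF-8 as 0xE2 0x80 0xA8 and 0xE2 0x80 0xA9, so a UTF-8 lexer can compare 
raw code units directly.

// Hypothetical helper: matches lineSep/paraSep without decoding.
bool isLineOrParaSep(const(char)[] s)
{
    // U+2028 -> E2 80 A8, U+2029 -> E2 80 A9 in UTF-8
    return s.length >= 3 && s[0] == 0xE2 && s[1] == 0x80
        && (s[2] == 0xA8 || s[2] == 0xA9);
}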

- Jonathan M Davis
August 06, 2012
Re: Let's stop parser Hell
On 07/31/2012 03:27 PM, Dmitry Olshansky wrote:
> On 31-Jul-12 20:10, Kai Meyer wrote:
>> On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
>>> There are more and more projects requiring parsing D code (IDE plugins,
>>> DCT and dfmt now).
>>>
>>> We definitely need a _standard_ fast D parser (suitable as tokenizer).
>>> We already have a good one at least in Visual D. Maybe dmd's parser is
>>> faster. If so, it can be (easily?) rewritten in D. We also already have
>>> some other parsers (in C# from Mono-D etc.).
>>>
>>> What about to get one and call it standard?
>>>
>>
>> I know I'm a little late coming into this conversation. This seems like
>> a nice thread to toss myself into. I've started working on a generic
>> lexer that is based off of a defined grammar.
>>
>> As I read through the thread (I unfortunately don't have enough time to
>> read every post, but I skimmed through as many as I could, and read the
>> ones that seemed important), it seems like we need a parser in D that
>> can lex D, and provide a Range of tokens that could be consumed.
>>
>> With some very minor tweaks, and a properly written Grammar class, I
>> basically have it already done. D was going to be one of the first
>> languages I would have written a definition for.
>>
>> https://github.com/kai4785/firstfront
>>
>> I haven't had time to look through Pegged, but there are some
>> differences in my approach that I can immediately see in Pegged's.
>>
>> 1) Pegged does compile-time or run-time code generation for your parser.
>> Mine is run-time only and regex based.
>> 2) Pegged uses PEG format, which I have been introduced to through this
>> thread. One of my goals was to actually invent something like PEG. This
>> will save me time :)
>>
>
>> I would love to receive some critical feedback on my approach as well as
>> the actual code base.
>
> Okay. Critics go as follows:
>
> First, as an author of std.regex I'm pleased that you find it suitable
> for use in a lexer. :)
>
> Still there is a BIG word of warning:
>
> A lexer based on trying out an array of regexes until one matches is
> NOT fast and not even close to the fast ones. It is acceptable as a proof
> of concept/prototype kind of thing only.
>
> To help this use case I thought of making multi-regex matching a-la:
> match(text, regex1, regex2, regex3...);
> then lexing would be effectively a loop of the form:
>
> foreach(tok; match(text, regex1, regex2, regex3...)){
>     switch(tok[0]){ // suppose match returns a tuple: N of the regex
>                     // matched + the usual piece of text
>     ...
>     }
> }
> (with some glitches on /+ ... +/ and other non-regular stuff).
>
> But until such time it's not to be used seriously in this context.
>
> The reason is that each call to match does have some overhead to set up
> the regex engine; it's constant, but it's huge compared to what usual
> lexers typically do. The other thing is that you should use ctRegex for
> this kind of job if you can (i.e. the compiler doesn't explode on it).
>
>
> (keep in mind I only glimpsed the source, will post more feedback later)
>
>>
>> To build it, you'll need cmake, and cmaked2 from here:
>> http://code.google.com/p/cmaked2/wiki/GettingStarted
>>
>> Or just dmd *.d :) I haven't tried to build it on Windows yet, but I
>> don't see anything that immediately jumps out as not cross-platform.
>> I've been working on it on both Fedora and CentOS.
>>
>
> rdmd for the win!
>
>

I cut my teeth on perl, so everything gets compared to perl in my mind.

In perl, I can 'precompile' a regular expression, so I can avoid some 
overhead. So something like this:

while(<STDIN>){
    if($_ =~ m/<someregex>/){
        some work;
    }
}

Would end up re-compiling the regex on each line from STDIN. Perl uses 
the term "compiling" the regular expression, which may be different from 
what you call "setup regex engine".

Does/Can D's std.regex offer something like this? If not, I would be 
interested in why not.

-Kai Meyer
August 06, 2012
Re: Let's stop parser Hell
> I cut my teeth on perl, so everything gets compared to perl in my mind.
>
> In perl, I can 'precompile' a regular expression, so I can avoid some
> overhead. So something like this:
>
> while(<STDIN>){
>     if($_ =~ m/<someregex>/){
>         some work;
>     }
> }
>
> Would end up re-compiling the regex on each line from STDIN. Perl uses the
> term "compiling" the regular expression, which may be different than what
> you call "setup regex engine".
>
> Does/Can D's std.regex offer something like this? If not, I would be
> interested in why not.

D does have compiled regex, it's called ctRegex (meaning compile-time regex):

http://dlang.org/phobos/std_regex.html#ctRegex

The tests done recently put it among the fastest regex engines known.
Caution: not every regex extension known to mankind is implemented
here!
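
The two forms sit side by side in std.regex; a sketch using the current 
API (matchFirst; the 2012-era entry point was match instead):

import std.regex;

void main()
{
    auto rt = regex(`\d+`);      // runtime-compiled pattern (bytecode)
    enum ct = ctRegex!(`\d+`);   // compile-time pattern (native code)

    assert(matchFirst("abc123", rt).hit == "123");
    assert(matchFirst("abc123", ct).hit == "123");
}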
August 06, 2012
Re: Let's stop parser Hell
On 06-Aug-12 22:52, Philippe Sigaud wrote:
>> I cut my teeth on perl, so everything gets compared to perl in my mind.
>>
>> In perl, I can 'precompile' a regular expression, so I can avoid some
>> overhead. So something like this:
>>
>> while(<STDIN>){
>>      if($_ =~ m/<someregex>/){
>>          some work;
>>      }
>> }
>>
>> Would end up re-compiling the regex on each line from STDIN. Perl uses the
>> term "compiling" the regular expression, which may be different than what
>> you call "setup regex engine".

Of course the regex is first compiled to bytecode (the same thing as 
"compile" in perl). Moreover, if you use a regex pattern directly it is 
compiled on first use and put into a TLS cache of compiled patterns. From 
then on it's used in compiled form. (There are about 8 entries in the 
cache; don't rely on it too much.)

What I mean by "setup engine" is the extra cost spent on _starting_ a 
match with a compiled pattern. For one thing it adds 1 call to malloc/free.
Again, I think the same thing applies in Perl. In other words, doing a 
continuous search (via foreach(x; match(..., "g" )) ) is much faster 
than calling match on individual pieces over and over again in a cycle.
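
The D analogue of Perl's precompiled pattern, as a sketch (the pattern 
\w+ is just a stand-in, and matchFirst/matchAll are the current API 
names): compile once outside the loop, and prefer the continuous-search 
form for repeated matches over one string.

import std.regex, std.stdio;

void main()
{
    auto re = regex(`\w+`);           // compiled once, reused per line
    foreach (line; stdin.byLine)
        if (matchFirst(line, re))     // no recompilation here
        { /* some work */ }

    // continuous search over one string - cheaper than many calls:
    foreach (m; matchAll("one two three", re))
        writeln(m.hit);
}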

>>
>> Does/Can D's std.regex offer something like this? If not, I would be
>> interested in why not.
>
Of course it does; in fact, you can't match without compiling the pattern 
first (there's just a shortcut that does it for you behind the scenes).

> D does have compiled regex, it's called ctRegex (meaning compile-time regex):
>
> http://dlang.org/phobos/std_regex.html#ctRegex
>

And there is a second version - compiled native code. Unlike perl it's 
not bytecode and thus usually much faster.
Frankly, the slowest regex I've seen is in Python; the second most 
sucky one is PCRE (but it is incredibly popular somehow). Perl is not bad 
but usually slower than the top dogs from C++ & std.regex.

> The tests done recently put it among the fastest regex engine known.

Yeah, on top of said chart. Needless to say, the test favored my 
implementation; ctRegex is not the fastest regex engine in general (yet).

> Caution: not every regex extension known to mankind is implemented
> here!

Sure as hell. In fact, the most problematic thing is that the parser often 
fails during CTFE.
Also, I have a solid plan on enhancing a bunch of things, effectively 
making std.regex v2, but no sooner than October-November.


-- 
Dmitry Olshansky
August 06, 2012
Re: Let's stop parser Hell
On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

> Of course the regex is first compiled to bytecode (the same thing as "compile"
> in perl). Moreover, if you use a regex pattern directly it is compiled on first
> use and put into a TLS cache of compiled patterns. From then on it's used in
> compiled form. (There are about 8 entries in the cache; don't rely on it too much.)

Btw, I wanted to ask you that for a long time: what do you mean by
'compiled to bytecode', for D source?

> And there is a second version - compiled native code. Unlike perl it's not
> bytecode and thus usually much faster.

Which?


> Frankly, the slowest regex I've seen is in Python; the second most sucky
> one is PCRE (but it is incredibly popular somehow). Perl is not bad but usually
> slower than the top dogs from C++ & std.regex.

Do people *really* need speed, or a great number of extensions?

> Sure as hell. In fact, the most problematic thing is that parser often fails
> during CTFE.

For example?

> Also, I have a solid plan on enhancing a bunch of things, effectively making
> std.regex v2, but no sooner than October-November.

That's good to know.
August 06, 2012
Re: Let's stop parser Hell
On 06-Aug-12 23:43, Philippe Sigaud wrote:
> On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
>
>> Of course the regex is first compiled to bytecode (the same thing as "compile"
>> in perl). Moreover, if you use a regex pattern directly it is compiled on first
>> use and put into a TLS cache of compiled patterns. From then on it's used in
>> compiled form. (There are about 8 entries in the cache; don't rely on it too much.)
>
> Btw, I wanted to ask you that for a long time: what do you mean by
> 'compiled to bytecode', for D source?
>

By this:

auto r = regex(some_string); // where some_string can come from user input

r contains bytecode that matches the pattern.

Unlike ctRegex, which outputs D source code and compiles it to native 
code.
>> And there is a second version - compiled native code. Unlike perl it's not
>> bytecode and thus usually much faster.
>
> Which?

Compiled native code is faster.
>
>
>> Frankly, the slowest regex I've seen is in Python; the second most sucky
>> one is PCRE (but it is incredibly popular somehow). Perl is not bad but usually
>> slower than the top dogs from C++ & std.regex.
>
> Do people *really* need speed, or a great number of extensions?
>

They want both. In my experience you can't satisfy everybody.
I think features are not what people look for in regex; even basic 
ECMA-262 stuff is way above what most programmers need to do. Unlike 
extra speed, which even newbies could use :)
And while I'm at it - we already have a damn good collection of 
extensions, way too many I'd say.

For example, I've yet to see one regex engine that supports Unicode to 
the same level as std.regex; namely, I haven't seen a single one with the 
full set operations (not only union but subtraction, intersection etc.) 
inside of a char class [...]. Funnily, even the ICU one didn't support 
them a year ago; dunno about now.

Also, some extensions come from an implementation's inherent inefficiency, 
e.g. (as in Perl) possessive quantifiers and atomic groups. No wonder it's 
so hard to make look-behind unrestricted in Perl; the whole thing is a mess.
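
As a sketch of the set operations mentioned (the operator syntax here is 
taken from the std.regex documentation - `--` difference, `&&` 
intersection, `||` union, `~~` symmetric difference - and is worth 
verifying there):

import std.regex;

void main()
{
    // difference: ASCII lowercase letters minus the vowels
    auto consonants = regex(`[a-z--[aeiou]]+`);
    // e.g. matchFirst("rhythm", consonants) matches the whole word
}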


>> Sure as hell. In fact, the most problematic thing is that parser often fails
>> during CTFE.
>
> For example?

Ehm. See bugzilla; search for ctRegex.
Not yet filed: the aforementioned non-union set operations usually fail, 
plus the fact that Unicode properties are not readable at CT (thus \p{xxx} 
fails).

>
>> Also, I have a solid plan on enhancing a bunch of things, effectively making
>> std.regex v2, but no sooner than October-November.
>
> That's good to know.
>


-- 
Dmitry Olshansky
August 06, 2012
Re: Let's stop parser Hell
On 06-Aug-12 23:59, Dmitry Olshansky wrote:
>> And there is a second version - compiled native code. Unlike perl it's not
>> bytecode and thus usually much faster.
>
> Which?

Compiled native code is faster. Perl is slower.


-- 
Dmitry Olshansky
August 07, 2012
Re: Let's stop parser Hell
On Mon, Aug 6, 2012 at 10:00 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
> On 06-Aug-12 23:59, Dmitry Olshansky wrote:
>>
>>>> And there is a second version - compiled native code. Unlike perl it's
>>>> not bytecode and thus usually much faster.
>>
>> Which?
>
>
> Compiled native code is faster. Perl is slower.

Sorry, I meant: what second version? How do you distinguish between
bytecode-producing and compiled-code producing ctRegex?
August 07, 2012
Re: Let's stop parser Hell
On Tuesday, 7 August 2012 at 04:58:05 UTC, Philippe Sigaud wrote:
> Sorry, I meant: what second version? How do you distinguish 
> between
> bytecode-producing and compiled-code producing ctRegex?

As far as I know, ctRegex _always_ produces machine code. 
Bytecode is what the runtime implementation, i.e. regex, uses.

David
August 07, 2012
Re: Let's stop parser Hell
On Tue, Aug 7, 2012 at 7:02 AM, David Nadlinger <see@klickverbot.at> wrote:

> As far as I know, ctRegex _always_ produces machine code. Bytecode is what
> the runtime implementation, i.e. regex, uses.

That's what I had in mind too, but Dmitry seemed to say things are
more complicated. Maybe I misunderstood.
And do you know why it's called bytecode? I never heard of 'bytecode'
for D outside of std.regex. Does that just mean 'intermediate-form D
code', as in 'a bunch of small and easily optimized instructions'?