August 02, 2012
On Thursday, August 02, 2012 07:06:25 Christophe Travert wrote:
> "Jonathan M Davis" , dans le message (digitalmars.D:173942), a écrit :
> > It may very well be a good idea to templatize Token on range type. It would be nice not to have to templatize it, but that may be the best route to go. The main question is whether str is _always_ a slice (or the result of takeExactly) of the original range. I _think_ that it is, but I'd have to make sure of that. If it's not and can't be for whatever reason, then that poses a problem.
> 
> It can't if it is a simple input range! Like a file read with most 'lazy' methods. Then you need either to transform the input range in a forward range using a range adapter that performs buffering, or perform your own buffering internally. You also have to decide how long the token will be valid (I believe if you want lexing to be blazing fast, you don't want to allocate for each token).

My lexer specifically requires a forward range. The more that I deal with input ranges, the more that I'm convinced that they're nearly useless. If you need even _one_ character of lookahead, then an input range doesn't fly at all, and considering the performance gains in using slicing (or takeExactly), I just don't think that it makes sense to operate on an input range. Besides, if anyone wants full performance, they'll probably need to use one of the built-in string types. Any range of dchar which needs to decode on the call to front or popFront will take a serious performance hit. It'll work, but it's not really advisable if you need performance.
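The one-character-of-lookahead point is exactly what separates the two range kinds; a minimal sketch (the `peek` helper is hypothetical, not part of the lexer under discussion):

```d
import std.range.primitives;

// Hypothetical helper: look one element ahead without consuming input.
// Only a forward range can do this - .save snapshots the current
// position, a primitive that plain input ranges do not provide.
dchar peek(R)(R r)
    if (isForwardRange!R)
{
    auto copy = r.save;   // cheap copy of the current position
    copy.popFront();
    return copy.empty ? dchar.init : copy.front;
}

void main()
{
    auto src = "+=";
    // decide between '+' and '+=' without consuming the range
    assert(src.front == '+' && peek(src) == '=');
}
```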

> Also, you said in this thread that you only need to consider ASCII
> characters in the lexer because non-ASCII characters are only used in
> non-keyword identifiers. That is not entirely true: EndOfLine defines 2
> non-ASCII characters, namely LINE SEPARATOR and PARAGRAPH SEPARATOR.
>   http://dlang.org/lex.html#EndOfLine
>   Maybe they should be dropped, since other non-ASCII whitespace is not
> supported. You may want the line count to be consistent with other
> programs. I don't know what text-processing programs usually consider an
> end of line.

I'm well aware of all of that. Almost all of the lexer can operate entirely on ASCII, with a few exceptions, and even in some of those cases, decoding isn't required (e.g. lineSep and paraSep can be dealt with as code units rather than having to decode to compare against them). The lexer that I'm writing will follow the spec. And any lexer that wants to get into Phobos will need to do the same. So, stuff like lineSep and the end of file characters that the spec has will be supported.
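The code-unit trick works because lineSep (U+2028) and paraSep (U+2029) have fixed three-byte UTF-8 encodings; a hedged sketch (the helper name is made up for illustration):

```d
// Hypothetical helper: detect U+2028 (lineSep) / U+2029 (paraSep) at the
// start of a UTF-8 slice by comparing raw code units, with no decoding.
// Their UTF-8 encodings are 0xE2 0x80 0xA8 and 0xE2 0x80 0xA9.
bool startsWithLineOrParaSep(const(char)[] s)
{
    return s.length >= 3
        && s[0] == 0xE2 && s[1] == 0x80
        && (s[2] == 0xA8 || s[2] == 0xA9);
}

void main()
{
    assert(startsWithLineOrParaSep("\u2028rest of line"));
    assert(startsWithLineOrParaSep("\u2029"));
    assert(!startsWithLineOrParaSep("\n"));
}
```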

- Jonathan M Davis
August 06, 2012
On 07/31/2012 03:27 PM, Dmitry Olshansky wrote:
> On 31-Jul-12 20:10, Kai Meyer wrote:
>> On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
>>> There are more and more projects requiring parsing D code (IDE plugins,
>>> DCT and dfmt now).
>>>
>>> We definitely need a _standard_ fast D parser (suitable as tokenizer).
>>> We already have a good one at least in Visual D. Maybe dmd's parser is
>>> faster. If so, it can be (easily?) rewritten in D. We also already have
>>> some other parsers (in C# from Mono-D etc.).
>>>
>>> What about to get one and call it standard?
>>>
>>
>> I know I'm a little late coming into this conversation. This seems like
>> a nice thread to toss myself into. I've started working on a generic
>> lexer that is based off of a defined grammar.
>>
>> As I read through the thread (I unfortunately don't have enough time to
>> read every post, but I skimmed through as many as I could, and read the
>> ones that seemed important), it seems like we need a parser in D that
>> can lex D, and provide a Range of tokens that could be consumed.
>>
>> With some very minor tweaks, and a properly written Grammar class, I
>> basically have it already done. D was going to be one of the first
>> languages I would have written a definition for.
>>
>> https://github.com/kai4785/firstfront
>>
>> I haven't had time to look through Pegged, but there are some
>> differences in my approach that I can immediately see in Pegged's.
>>
>> 1) Pegged does compile-time or run-time code generation for your parser.
>> Mine is run-time only and regex based.
>> 2) Pegged uses PEG format, which I have been introduced to through this
>> thread. One of my goals was to actually invent something like PEG. This
>> will save me time :)
>>
>
>> I would love to receive some critical feedback on my approach as well as
>> the actual code base.
>
> Okay. Critique goes as follows:
>
> First, as the author of std.regex, I'm pleased that you find it suitable
> for use in a lexer. :)
>
> Still there is a BIG word of warning:
>
> A lexer based on trying out an array of regexes until one matches is
> NOT fast and not even close to the fast ones. It is acceptable as a
> proof-of-concept/prototype kind of thing only.
>
> To help this use case I thought of making multi-regex matching a-la:
> match(text, regex1, regex2, regex3...);
> then lexing would be effectively a loop of form:
>
> foreach(tok; match(text, regex1, regex2, regex3...)){
>     switch(tok[0]){ // suppose match returns a tuple - N of regex matched + the usual piece of text
>     ...
>     }
> }
> (with some glitches on /+ ... +/ and other non-regular stuff).
>
> But until such time it's not to be used seriously in this context.
>
> The reason is that each call to match has some overhead to set up the
> regex engine; it's constant, but it's huge compared to what usual lexers
> typically do. The other thing is that you should use ctRegex for this
> kind of job if you can (i.e. if the compiler doesn't explode on it).
>
>
> (keep in mind I only glimpsed the source, will post more feedback later)
>
>>
>> To build it, you'll need cmake, and cmaked2 from here:
>> http://code.google.com/p/cmaked2/wiki/GettingStarted
>>
>> Or just dmd *.d :) I haven't tried to build it on Windows yet, but I
>> don't see anything that immediately jumps out as not cross-platform.
>> I've been working on it on both Fedora and CentOS.
>>
>
> rdmd for the win!
>
>

I cut my teeth on perl, so everything gets compared to perl in my mind.

In perl, I can 'precompile' a regular expression, so I can avoid some overhead. So something like this:

while(<STDIN>){
    if($_ =~ m/<someregex>/){
        some work;
    }
}

Would end up re-compiling the regex on each line from STDIN. Perl uses the term "compiling" the regular expression, which may be different from what you call "setup regex engine".

Does/Can D's std.regex offer something like this? If not, I would be interested in why not.

-Kai Meyer
August 06, 2012
> I cut my teeth on perl, so everything gets compared to perl in my mind.
>
> In perl, I can 'precompile' a regular expression, so I can avoid some overhead. So something like this:
>
> while(<STDIN>){
>     if($_ =~ m/<someregex>/){
>         some work;
>     }
> }
>
> Would end up re-compiling the regex on each line from STDIN. Perl uses the term "compiling" the regular expression, which may be different than what you call "setup regex engine".
>
> Does/Can D's std.regex offer something like this? If not, I would be interested in why not.

D does have compiled regex, it's called ctRegex (meaning compile-time regex):

http://dlang.org/phobos/std_regex.html#ctRegex

The tests done recently put it among the fastest regex engines known. Caution: not every regex extension known to mankind is implemented here!
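A minimal sketch of both flavors, assuming the current std.regex API (in older releases the entry point was match rather than matchFirst):

```d
import std.regex;

void main()
{
    // Runtime-compiled: the pattern is compiled to bytecode once, and
    // the resulting engine object is reused across calls - this is the
    // analogue of Perl's "precompiled" regex.
    auto rt = regex(`[0-9]+`);
    assert(matchFirst("abc123", rt)[0] == "123");

    // Compile-time: ctRegex generates native code during compilation.
    enum ct = ctRegex!(`[0-9]+`);
    assert(matchFirst("abc123", ct)[0] == "123");
}
```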
August 06, 2012
On 06-Aug-12 22:52, Philippe Sigaud wrote:
>> I cut my teeth on perl, so everything gets compared to perl in my mind.
>>
>> In perl, I can 'precompile' a regular expression, so I can avoid some
>> overhead. So something like this:
>>
>> while(<STDIN>){
>>      if($_ =~ m/<someregex>/){
>>          some work;
>>      }
>> }
>>
>> Would end up re-compiling the regex on each line from STDIN. Perl uses the
>> term "compiling" the regular expression, which may be different than what
>> you call "setup regex engine".

Of course the regex is first compiled to bytecode (the same thing as "compile" in Perl). Moreover, if you use a regex pattern directly, it is compiled on first use and put into a TLS cache of compiled patterns; from then on the compiled form is used. (There are about 8 entries in the cache; don't rely on it too much.)

What I mean by "setup engine" is extra cost spent on _starting_ matching with compiled pattern. For one thing it adds 1 call to malloc/free.
Again I think in Perl the same thing applies. In other words doing continuous search (via foreach(x; match(..., "g" )) ) is much faster then calling match on individual pieces over and over again in cycle.
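The continuous-search idiom above can be sketched with the current std.regex names (matchAll; older releases spelled it match with the "g" flag):

```d
import std.regex;

void main()
{
    auto re = regex(`[a-z]+`);

    // One continuous search: the engine is set up once and then
    // iterated over every match, instead of paying the per-call
    // setup cost on each individual piece of text.
    size_t count;
    foreach (m; matchAll("one two three", re))
        ++count;
    assert(count == 3);
}
```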

>>
>> Does/Can D's std.regex offer something like this? If not, I would be
>> interested in why not.
>
Of course it does, in fact you can't match without compiling pattern first (it just provides a shortcut that does it for you behind the scenes).

> D does have compiled regex, it's called ctRegex (meaning compile-time regex):
>
> http://dlang.org/phobos/std_regex.html#ctRegex
>

And there is a second version - compiled native code. Unlike Perl it's not bytecode and thus usually much faster.
Frankly, the slowest regex I've seen is in Python; the second most sucky one is PCRE (but it is incredibly popular somehow). Perl is not bad but usually slower than the top dogs from C++ & std.regex.

> The tests done recently put it among the fastest regex engine known.

Yeah, on top of said chart. Needless to say the test favored my implementation; ctRegex is not the fastest regex engine in general (yet).

> Caution: not every regex extension known to mankind is implemented
> here!

Sure as hell. In fact, the most problematic thing is that the parser often fails during CTFE.
Also I have a solid plan on enhancing a bunch of things, effectively making a std.regex v2, but no sooner than October-November.


-- 
Dmitry Olshansky
August 06, 2012
On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

> Of course the regex is first compiled to bytecode (the same thing as "compile" in Perl). Moreover, if you use a regex pattern directly, it is compiled on first use and put into a TLS cache of compiled patterns; from then on the compiled form is used. (There are about 8 entries in the cache; don't rely on it too much.)

Btw, I wanted to ask you that for a long time: what do you mean by 'compiled to bytecode', for D source?

> And there is a second version - compiled native code. Unlike perl it's not bytecode and thus usually much faster.

Which?


> Frankly the most slow regex I've seen is in Python, the second most sucky one is PCRE (but is incredibly popular somehow). Perl is not bad but usually slower then top dogs from C++ & std.regex.

Do people *really* need speed, or a great number of extensions?

> Sure as hell. In fact, the most problematic thing is that parser often fails during CTFE.

For example?

> Also I have a solid plan on enhancing a bunch of things effectively making std.regex v2 but no sooner then October-November.

That's good to know.
August 06, 2012
On 06-Aug-12 23:43, Philippe Sigaud wrote:
> On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
>
>> Of course regex is first compiled to bytecode (same thing as "compile" in
>> perl). Moreover if you use regex pattern directly it is compiled on first
>> use and put into TLS cache of compiled patterns. From now on it's used in
>> compiled form. (There are about 8 entries in the cache; don't rely on it too much.)
>
> Btw, I wanted to ask you that for a long time: what do you mean by
> 'compiled to bytecode', for D source?
>

By this:
auto r = regex(some_string); // where some_string can come from user input
r contains bytecode that matches the pattern.

Unlike ctRegex, which outputs D source code and compiles it to native code.
>> And there is a second version - compiled native code. Unlike perl it's not
>> bytecode and thus usually much faster.
>
> Which?

Compiled native code is faster.
>
>
>> Frankly the most slow regex I've seen is in Python, the second most sucky
>> one is PCRE (but is incredibly popular somehow). Perl is not bad but usually
>> slower then top dogs from C++ & std.regex.
>
> Do people *really* need speed, or a great number of extensions?
>

They want both. In my experience you can't satisfy everybody.
I think features are not what people look for in regex; even the basic ECMA-262 stuff is way beyond what most programmers need. Unlike extra speed, which even newbies could use :)
And while I'm at it - we already have a damn good collection of extensions, way too many I'd say.

For example, I've yet to see another regex engine that supports Unicode to the same level as std.regex; namely, I haven't seen a single one with full set operations (not only union but subtraction, intersection etc.) inside a char class [...]. Funny, even the ICU one didn't support them a year ago; dunno about now.

Also some extensions come from an implementation's inherent inefficiency, e.g. (as in Perl) possessive quantifiers and atomic groups. No wonder it's so hard to make look-behind unrestricted in Perl; the whole thing is a mess.


>> Sure as hell. In fact, the most problematic thing is that parser often fails
>> during CTFE.
>
> For example?

Ehm. See bugzilla, search for ctRegex.
Not yet filed: the aforementioned non-union set operations usually fail, plus Unicode properties are not readable at CT (thus \p{xxx} fails).

>
>> Also I have a solid plan on enhancing a bunch of things effectively making
>> std.regex v2 but no sooner then October-November.
>
> That's good to know.
>


-- 
Dmitry Olshansky
August 06, 2012
On 06-Aug-12 23:59, Dmitry Olshansky wrote:
>> And there is a second version - compiled native code. Unlike perl it's not
>> bytecode and thus usually much faster.
>
> Which?

Compiled native code is faster. Perl is slower.


-- 
Dmitry Olshansky
August 07, 2012
On Mon, Aug 6, 2012 at 10:00 PM, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
> On 06-Aug-12 23:59, Dmitry Olshansky wrote:
>>
>>> And there is a second version - compiled native code. Unlike perl it's not
>>> bytecode and thus usually much faster.
>>
>>
>> Which?
>
>
> Compiled native code is faster. Perl is slower.

Sorry, I meant: what second version? How do you distinguish between bytecode-producing and compiled-code producing ctRegex?
August 07, 2012
On Tuesday, 7 August 2012 at 04:58:05 UTC, Philippe Sigaud wrote:
> Sorry, I meant: what second version? How do you distinguish between
> bytecode-producing and compiled-code producing ctRegex?

As far as I know, ctRegex _always_ produces machine code. Bytecode is what the runtime implementation, i.e. regex, uses.

David
August 07, 2012
On Tue, Aug 7, 2012 at 7:02 AM, David Nadlinger <see@klickverbot.at> wrote:

> As far as I know, ctRegex _always_ produces machine code. Bytecode is what the runtime implementation, i.e. regex, uses.

That's what I had in mind too, but Dmitry seemed to say things are
more complicated. Maybe I misunderstood.
And do you know why it's called bytecode? I'd never heard of 'bytecode'
for D outside of std.regex. Does that just mean 'intermediate-form D
code', as in 'a bunch of small and easily optimized instructions'?