Thread overview | |||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
February 21, 2006 Template regexes, version 2 | ||||
---|---|---|---|---|
| ||||
I've had another go at doing compile-time regexps. With all the bugfixes in recent releases, it's now possible to do it with mixins, generating local functions instead of global ones. This provides _considerably_ more flexibility. The ultimately generated functions can look like: char [] firstmatch(char [] s) { char [][] results; int someOtherResult; void func(char [] s) { // this parses the string, using the original regexp string. // intermediate results (eg expressions in parentheses) are // in the local variables: results[][], someOtherResult, etc. } func(s); return results[0]; } Usage is like: char [] x = firstmatch!("ab+")(someStr); It seems to me that there are 3 types of regexes: * pure static -- where the regex string is a string literal, known at compile time * pure dynamic -- eg, as used in a grep utility. * pseudo-static. By this I mean regexps where the structure is constant, but some strings are replaced with variables. Perl-ish languages deal with the last case by allowing embedded variables in strings. The template regexes Ive been developing can cope with them by using additional parameters and extended syntax, eg @1 for the string value of the first parameter. eg char [] var1 = "here"; char [] input = " a56aherea"; char [] x = firstmatch!("a@1a")(input, var1); var1 = "56"; char [] y = firstmatch!("a@1a")(input, var1); would give x == "aherea", y=="aherea". As far as runtime efficiency goes, it's almost ideal, except that consecutive terms result in heavily nested trivial functions: eg for the trivial "x\dy": bool func(char [] s) { bool a(char [] s) { bool a(char [] s) { return s[0]=='y'; } return isdigit(s[0]) && a(s[1..$]); } return s[0]=='x' && a(s[1..$]); } and we need to rely on the optimiser to turn this into: s[0]=='x' && isdigit(s[1]) && s[2]=='y'; How good is DMD at performing this optimisation? A nested function that is only called once _should_ always be inlined, no matter how deeply nested it is, but reality could be quite different... Many optimisations are possible with this scheme (for example, the inner functions could just use int parameters, so that they don't need to pass s[] repeatedly; almost all of the optimisations from the runtime std.regexp could be applied). It should also be possible to implement Perl 6-style regexps. ...And then I return to the D newsgroups after a week and find the goalposts have moved: regexps are now built into the language. The mixin regexps are only at an early stage of development, but given the current discussions I think it's important to know what can be done with templates (probably more than most people expect). In the case of what I've called "pseudo-static" regexps, they are arguably more powerful than the built-in regexps of DMD 0.147. I don't know where to go from here. There are a few possibilities: 1. use template regexps as a demonstration of the power of D templates. --> Implement reduced syntax, keep the code simple and not well optimised; more of a proof-of-concept; go for "maximum coolness". 2. Like 1, but with optimisations; try to be the fastest regexp of any language. (cleaner-faster-better ... but we use the built-in regexps anyway <g>). 3. Create a standard library. This is what I was aiming for, but I don't think it makes sense any more. 4. potential integration into the language. Is this even possible? Probably the most sensible is to go with 1, to wake up the C++/boost crowd. Hopefully this should also provide some direction for the built-in syntax. Thoughts? |
February 21, 2006 Re: Template regexes, version 2 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Don Clugston | Awesome! Don Clugston wrote: > I've had another go at doing compile-time regexps. With all the bugfixes in recent releases, it's now possible to do it with mixins, generating local functions instead of global ones. This provides _considerably_ more flexibility. > > The ultimately generated functions can look like: > > char [] firstmatch(char [] s) > { > char [][] results; > int someOtherResult; > > void func(char [] s) { > // this parses the string, using the original regexp string. > // intermediate results (eg expressions in parentheses) are > // in the local variables: results[][], someOtherResult, etc. > } > func(s); > return results[0]; > } > > Usage is like: > char [] x = firstmatch!("ab+")(someStr); Nice!! > It seems to me that there are 3 types of regexes: > * pure static -- where the regex string is a string literal, known at compile time > * pure dynamic -- eg, as used in a grep utility. > * pseudo-static. By this I mean regexps where the structure is constant, but some strings are replaced with variables. Pure static is what I always wanted with regexps in D! Pure dynamic has less of utility to me, but others may disagree. And that is taken care of by Walter now. Pseudo-static is cool!! And I believe *immensely* useful. If I do log analyzers, net statistics programs, serious data mining frameworks, then pseudo-static regexps is what I do all day long! And, at the other end, for even most 'dscript' tasks this is the core! > As far as runtime efficiency goes, it's almost ideal, Amazing! > ...And then I return to the D newsgroups after a week and find the goalposts have moved: regexps are now built into the language. Awwww, they're just runtime. Mere syntax sherades to smoke-and-mirror away the fact. > The mixin regexps are only at an early stage of development, but given the current discussions I think it's important to know what can be done > with templates (probably more than most people expect). Most people couldn't. :-) But then again, "a language can't be Serious if all its features are graspable to VB programmers". > In the case of what I've called "pseudo-static" regexps, they are arguably more powerful than the built-in regexps of DMD 0.147. I DEMAND that pseudo-static regexps be in the very next release of DMD. > I don't know where to go from here. There are a few possibilities: > 1. use template regexps as a demonstration of the power of D templates. > --> Implement reduced syntax, keep the code simple and not well optimised; more of a proof-of-concept; go for "maximum coolness". From a "marketing point of view", that may be wise. > 2. Like 1, but with optimisations; try to be the fastest regexp of any language. (cleaner-faster-better ... but we use the built-in regexps anyway <g>). I probably would not use the "built-ins" (if that refers to the current runtime library-come-syntax stuff) hardly at all. > 3. Create a standard library. This is what I was aiming for, > but I don't think it makes sense any more. Of course it does! > 4. potential integration into the language. Is this even possible? Wanna guess my vote on this?! > Probably the most sensible is to go with 1, to wake up the C++/boost crowd. Why wake 'em up???? Unless... you expect them all to abandon C++ on sight and stampede over here? Heh, I wonder what 20 Dons would achieve together! |
February 21, 2006 Re: Template regexes, version 2 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Don Clugston | > The mixin regexps are only at an early stage of development, but given the
> current discussions I think it's important to know what can be done
> with templates (probably more than most people expect). In the case of
> what I've called "pseudo-static" regexps, they are arguably more powerful
> than the built-in regexps of DMD 0.147.
>
> I don't know where to go from here. There are a few possibilities:
> 1. use template regexps as a demonstration of the power of D templates.
> --> Implement reduced syntax, keep the code simple and not well optimised;
> more of a proof-of-concept; go for "maximum coolness".
> 2. Like 1, but with optimisations; try to be the fastest regexp of any
> language. (cleaner-faster-better ... but we use the built-in regexps
> anyway <g>).
> 3. Create a standard library. This is what I was aiming for, but I don't
> think it makes sense any more.
> 4. potential integration into the language. Is this even possible?
>
> Probably the most sensible is to go with 1, to wake up the C++/boost
> crowd. Hopefully this should also provide some direction for the built-in
> syntax.
> Thoughts?
You are the man! Keep up the good work!
As far as what direction to go, a stardard library makes perfect sense. If Walter's kludge for regex's doesn't fit with this new approach, perhaps he can adapt it to make it work. Even without this, I would much rather have faster regex's than language integration.
BTW, I developed a new syntax that is a spinoff of regular expressions (using C++). It includes syntax to manipulate a stack and store values in a parse tree. Using this language, I was able to define a parser for a modern object-oriented language in like 230 lines of code. I could show you this stuff if you are interested.
-Craig
|
February 21, 2006 Re: Template regexes, version 2 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Craig Black | Craig Black wrote:
> Don Clugston wrote:
> BTW, I developed a new syntax that is a spinoff of regular
> expressions (using C++). It includes syntax to manipulate a stack
> and store values in a parse tree. Using this language, I was able to
> define a parser for a modern object-oriented language in like 230
> lines of code. I could show you this stuff if you are interested.
>
> -Craig
I'm not even gonna try to imagine what happens when Don gets his hands on that stuff. And I used to be pretty good at imagining stuff.
|
February 21, 2006 Re: Template regexes, version 2 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Don Clugston | "Don Clugston" <dac@nospam.com.au> wrote in message news:dtev42$1g98$1@digitaldaemon.com... > ...And then I return to the D newsgroups after a week and find the goalposts have moved: regexps are now built into the language. Don't worry about that, they'll be gone again in the next version <g>. Based on the considerable feedback about this issue, I think I've got a better design. |
February 21, 2006 Re: Template regexes, version 2 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Don Clugston | "Don Clugston" <dac@nospam.com.au> wrote in message news:dtev42$1g98$1@digitaldaemon.com... > Perl-ish languages deal with the last case by allowing embedded variables > in strings. The template regexes Ive been developing can cope with them by > using additional parameters and extended syntax, eg @1 for the string > value of the first parameter. > eg > > char [] var1 = "here"; > char [] input = " a56aherea"; > char [] x = firstmatch!("a@1a")(input, var1); > var1 = "56"; > char [] y = firstmatch!("a@1a")(input, var1); > > would give x == "aherea", y=="aherea". Great! But can I suggest using the format used by std.regexp.RegExp.replace() ? It uses $$, $&, $`, $', and $n. > How good is DMD at performing this optimisation? Not good enough. But I wouldn't worry about that for the moment. > I don't know where to go from here. There are a few possibilities: > 1. use template regexps as a demonstration of the power of D templates. > --> Implement reduced syntax, keep the code simple and not well optimised; > more of a proof-of-concept; go for "maximum coolness". > 2. Like 1, but with optimisations; try to be the fastest regexp of any > language. (cleaner-faster-better ... but we use the built-in regexps > anyway <g>). > 3. Create a standard library. This is what I was aiming for, but I don't > think it makes sense any more. > 4. potential integration into the language. Is this even possible? > > Probably the most sensible is to go with 1, to wake up the C++/boost > crowd. Hopefully this should also provide some direction for the built-in > syntax. > Thoughts? #1. No question about it! |
February 21, 2006 Re: Template regexes, version 2 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Georg Wrede | In article <43FB4A18.4080904@nospam.org>, Georg Wrede says...
>
>Craig Black wrote:
> > Don Clugston wrote:
>
>> BTW, I developed a new syntax that is a spinoff of regular expressions (using C++). It includes syntax to manipulate a stack and store values in a parse tree. Using this language, I was able to define a parser for a modern object-oriented language in like 230 lines of code. I could show you this stuff if you are interested.
>>
>> -Craig
>
>I'm not even gonna try to imagine what happens when Don gets his hands on that stuff. And I used to be pretty good at imagining stuff.
You and me both. I'm just as excited as everyone else.
- Eric Anderton at yahoo
|
February 21, 2006 Re: Template regexes, version 2 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Craig Black Attachments:
| Since it seems that there may be some interest, and to prove that I'm not bull-shitting, I thought I would post the code that I mentioned earlier. It describes a parser for an unnamed object-oriented language of my design (that rips off a lot of stuff from D). It's actually less than 230 lines if you don't count the comments and whitespace. It's all one big string that gets parsed at run-time, but perhaps D could turn it into optimized source code. const char *languageParser = "Real::(!Delims&(-?(/d+(/.(/d)+)?)|(/d*/.(/d)+)([eE][/-+]?/d{1,3})?)@~;)" // A real number "Char::(!Delims&(\')@~&((\\)?.)@~\';;)" // char literal "Text::(!Delims&(\")@~&([^\".]*)@~\";;)" // text literal // Delimeters "Comment1::(//(^(/n).)*)" "Comment2::(/*(^(/*//).)*/*//)" "Delims::(/s(//(!Comment1|!Comment2)/s)*)" // // Expressions // "BiOp1:([/./^])" // . ^ (exponentiation) "BiOp2:([///*/%/#])" // / div * mul % mod # concat "BiOp3:([/-+])" // - + "BiOp4:(=|/!=|<|>|<=|>=|in|/~|/!/~)" // = != < > <= >= in ~ !~ "BiOp5:(/&|/||/./.)" // & | .. (range operator) "Parens:(/(!Expr/))" // Expression bound by parenthesis "MParam:(((`=)@~&(/i)@~;;=)?!Expr)" "MParens:(&(/()@~(!MParam(,!MParam)*)?/);)" // 0 or more comma delimited expressions bound by parens "Index:(&(/[)@~!Expr(,!Expr)*/];)" // Array index "Tuple1:(&(/[)@~!MParam(,!MParam)*/];)" // Array index tuple "Tuple2:(&(/()@~!MParam(,!MParam)*/);)" // Parens tuple "Tuple:(!Tuple1|!Tuple2)" "TBang:(&(/!)@~;)" // template instantiation bang "DotExpr:(&(/.)@~;&(/i)@~;)" // dots in an id expression "IdExpr:((`/:idexpr)@~&(/i)@~;(!DotExpr)*(((!TBang)?!MParens)|!Index)*;)" // Id expression "LExpr:(!Real|!Text|!Char|!TypeofExpr|!IdExpr|!Tuple|!Parens|!MethodExpr|!DelegateExpr|!ClosureExpr|!PrefixExpr)" // Prefix operators "PrefixOp:([/*/-/!/&]|sizeof|typeid|typeof)" "PrefixExpr:(&(!PrefixOp)@~!LExpr;)" // Exponentiation "BiExpr1:((`t1)@~!LExpr&(!BiOp1)$~!BiSub2;)" // Multiplication, division, and modulus "BiSub2:(!BiExpr1|!LExpr)" "BiExpr2:((`t2)@~!BiSub2&(!BiOp2)$~!BiSub3;)" // Addition and Subtraction "BiSub3:(!BiExpr2|!BiSub2)" "BiExpr3:((`t3)@~!BiSub3&(!BiOp3)$~!BiSub4;)" // Comparisons == != < > <= >= "BiSub4:(!BiExpr3|!BiSub3)" "BiExpr4:((`t4)@~!BiSub4&(!BiOp4)$~!BiSub5;)" // & | .. "BiSub5:(!BiExpr4|!BiSub4)" "BiSub6:(!BiExpr5|!BiSub5)" "BiExpr5:((`t5)@~!BiSub5&(!BiOp5)$~!BiSub6;)" "Keywords:(module|interface|class|enum|static|const|delegate|closure|method)" // Type expression "TypeofExpr:(&(typeof)@~!Expr(/.&(/i)@~;)*;)" "TypeDots:(^(!Keywords)&(/i)@~(/.&(/i)@~;)*;)" "Array:(&(/[)@~(!Expr(,!Expr)*)?/];)" // Array type "Const:((&(/!)@~;)|((const(`/!))@~;))" "MTypeExpr:(((!Const)?(&(/*)@~;|!Array))*(!Const)?(!DelegateType|!ClosureType|!TypeofExpr|!TypeDots))" "TypeExpr:((`/:type)@~!MTypeExpr;)" // delegate, closure types // @void(int) myDelegate = @myMethod; // @void(int) myDelegate = @myInstance.myMethod; // #void(int) myClosure = #myMethod(5); // #void(int) myDelegate = #myInstance.myMethod(5); "SID:(&(/i)@~;^(/i))" "DParam:((`/:var)@~!TypeExpr(!SID(,!SID)*)?(,;!DParam)?)" "DParens:(/((!DParam;)?/))" "Delegate:(/@|delegate)" "Closure:(/#|closure)" "DelegateType:((`/@)@~!Delegate!TypeExpr!DParens;)" "ClosureType:((`/#)@~!Closure!TypeExpr!DParens;)" // Anonymous method expression // %void(int a) myDelegate = %void(int a) Write(a);; "AParam:((`/:var)@~!TypeExpr!SID(,!SID)*(,;!AParam)?)" "AParens:(/((!AParam;)?/))" "MethodExpr:((`/%)@~!Method!TypeExpr!AParens!Block;)" "DelegateExpr:((`/@)@~!Delegate(!IdExpr/.)?&(/i)@~;;)" // Closure expression "ClosureExpr:((`/#)@~!Closure!IdExpr;)" // New expression "NewExpr:(&(new)@~!TypeExpr(!MParens)?;)" // CastExpr "CastExpr:((`cast)@~!Expr&((!Static)?as)$~!TypeExpr;)" // An expression "BiOps:(!BiOp1|!BiOp2|!BiOp3|!BiOp4|!BiOp5)" "PreLExpr:(!LExpr^(!BiOps))" "Expr:(^[/]/;/)/}](!PreLExpr|!BiExpr5|!BiExpr4|!BiExpr3|" "!BiExpr2|!BiExpr1|!NewExpr|!CastExpr))" // // Statements // // Assignment statement "AssOp:(=|/+=|/-=|/*=|//=|/^=)" "AssSt:((`ass)@~!IdExpr(&(!AssOp)$~)!Expr;)" // Delete statement "DeleteSt:(&(delete)@~!IdExpr;/;)" // Return, break, continue statements "ReturnSt:(&(return)@~(!Expr)?;/;)" "BreakSt:(&(break)@~;/;)" "ContinueSt:(&(continue)@~;/;)" // Postfix statement "PostfixOp:(/+/+|/-/-)" "PostfixSt:((`pf)@~!IdExpr&(!PostfixOp)$~;)" // Variable declaration statement "Var:(&(/i)@~(=(!NewExpr|!Expr))?;)" // var = expr "Static:((&(/$)@~;)|(static(`/$)@~;))" "VarSt:((`/:var)@~(!Static)?!TypeExpr!Var(,!Var)*;/;)" "SVarSt:((`/:var)@~!TypeExpr!Var(,!Var)*;/;)" // If statement "IfSt:(&(if)@~!Expr(/:)?!Block(else!Block)?;)" // Static if statement "SIfSt:(&(!Static(if))@~!Expr(/:)?!Block(else!Block)?;)" // Switch statement "CaseSt:((`case)@~(case)?!Expr*(/:)?!Block;)" "DefaultSt:(&(default)@~(/:)?!Block;)" "SwitchSt:(&(switch)@~/{(!CaseSt)*(!DefaultSt)?/};)" // While statement "WhileSt:(&(while)@~!Expr!Block;)" // Do statement "DoSt:(&(do)@~!Block(while)!Expr;/;)" // Method call statement "MethodSt:((`/(/))@~!IdExpr;/;)" // Foreach statement "FEVarDecl:((`/:var)@~!TypeExpr&(/i)@~;;)" "ForeachHeader:(!FEVarDecl(,!FEVarDecl;)?/;!Expr(&(step)@~!Expr;)?)" "ForeachSt:(&(foreach)@~/(!ForeachHeader/)!Block(else!Block);)" // For statement "FVarDecl:((`/:var)@~!TypeExpr&(/i)@~;(=!Expr);)" "ForInit:(!FVarDecl|!AssSt)" "FStatement:(!AssSt|!PostfixSt)" "ForSt:(&(for)@~/((!ForInit(,!ForInit)*)?/;!Expr/;(!FStatement(,!FStatement)*)?/)!Block;)" // In and out blocks can include preconditions and postconditions "InBlock:(&(in)@~!Block;)" "OutBlock:(&(out)@~!Block;)" // Trace block "TraceBlock:(&(trace)@~!Block;)" // Echo, assert "EchoSt:(&(echo)@~!Expr;)" "AssertSt:(&(assert)@~!Expr;)" // Exceptions "CatchSt:((`catch)@~(catch)?&(/i)@~;*(/:)?!Block;)" "DefaultSt:(&(default)@~(/:)?!Block;)" "TryBlock:(&(try)@~/{(!CatchSt)*(!DefaultSt)?/};)" "ThrowSt:(&(throw)@~;)" // Enumeration definition "EnumNode:(&(/i)@~(=!Expr);)" "EnumDef:(&(enum)@~/{!EnumNode(,!EnumNode)*/};)" // Method definition "TID:(&(/i)@~^(/i)(=!Expr)?;)" "TParam:((`/:var)@~(&(in|out|inout)@~;)?!TypeExpr!TID(,!TID)*(,;!TParam)?)" "TParens:(/((!TParam;)?/))" "Method:(/%|method)" "SMethodDef:((`/%)@~!Method!TypeExpr&(/i)@@~;!TParens!Block;)" "MMethodDef:((`/%)@~(!Static|!Const)?!Method!TypeExpr&(/i)@@~;!TParens!Block;)" "IMethodDef:((`/%)@~(!Const)?!Method!TypeExpr&(/i)@@~;!TParens/;)" //"Construct:(&(construct)@~!Block;)" //"Destruct:(&(destruct)@~!Block;)" //"Startup:..." //"Terminate:..." //"Set:(&(set)@~!TParens!Block;)" //"Get:(!TypeExprget!Block;)" // Operator overloading "OpOverload:(&(operator)@~(!PrefixO|!PostfixO|!AssignO|!BinaryO)!Block;)" "BinaryO:(&(^(/.)!BiOps)@~;!TypeExpr/(!TypeExpr&(/i)@~;,!TypeExpr&(/i)@~;/))" "AssignO:(&(^(=)!AssOp)@~;void/(!TypeExpr&(/i)@~;/))" "PrefixO:(&(/-/!)@~;!TypeExpr/(/))" "PostfixO:(&(/+/+|/-/-)@~;void/(/))" "BinaryOverload:(&(operator)@~!BinaryO!Block;)" // Class definition "ProtSt:(&(public|private|protected)@~;/:)" "ClassSt:(!ProtSt|!Alias|!TypeDef|!VarSt|!MMethodDef|!ClassDef|!EnumDef|!InterfaceDef|!OpOverload)" "ClassBlock:(&(/{)@~(!ClassSt)*/};)" "TemplateParam:((&(int)@~;)?(&(/i)@~;))" "TemplateParams:(!TemplateParam(,!TemplateParam)*)" "ClassDef:(&(class)@~&(/i)@~(/(!TemplateParams/))?;(/:!IdExpr(,!IdExpr)*)?!ClassBlock;)" // Interface definition "InterfaceSt:(!ProtSt|!Alias|!TypeDef|!IMethodDef|!ClassDef|!EnumDef|!InterfaceDef)" "InterfaceBlock:(&(/{)@~(!InterfaceSt)*/};)" "InterfaceDef:(&(interface)@~&(/i)@~;(/:!IdExpr(,!IdExpr)*)?!InterfaceBlock;)" "Alias:(&(alias)@~&(/i)@~;=!TypeExpr;)" "TypeDef:(&(typedef)@~&(/i)@~;=!TypeExpr;)" // import "ImportNode:(&(/i)@~(/.&(/i)@@~;)*;)" "ImportSt:(&(import)@~!ImportNode(,!ImportNode)*;/;)" // error checking, exceptions // in, out, inout params // in, out code blocks "ModuleInit:(&(module)@~&(/i)@@~;/:)" "ModuleSt:(!ProtSt|!Alias|!TypeDef|!ImportSt|!SVarSt|!SMethodDef|!ClassDef|" "!EnumDef|!InterfaceDef|!BinaryOverload)" "ModuleDef:(!ModuleInit(!ModuleSt)*)" // A statement "Statement:(!IfSt|!ForSt|!ForeachSt|(!PostfixSt/;)|(!AssSt/;)|!MethodSt|" "!VarSt|!InBlock|!OutBlock|!TraceBlock|!TryBlock|!ThrowSt|!EchoSt|" "!AssertSt|!SwitchSt|!WhileSt|!DoSt|!ReturnSt|!BreakSt|!ContinueSt|" "!DeleteSt|!SIfSt|!EnumDef|!SMethodDef|!ClassDef)" // A statement block "Block:((`/{)@~(/{(!Statement)*/}|!Statement);)"; |
February 21, 2006 Re: Template regexes, version 2 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Craig Black | Craig Black wrote:
> Since it seems that there may be some interest, and to prove that I'm not bull-shitting, I thought I would post the code that I mentioned earlier. It describes a parser for an unnamed object-oriented language of my design (that rips off a lot of stuff from D). It's actually less than 230 lines if you don't count the comments and whitespace. It's all one big string that gets parsed at run-time, but perhaps D could turn it into optimized source code.
>
<snip>
That looks scary :)
|
February 21, 2006 Re: Template regexes, version 2 | ||||
---|---|---|---|---|
| ||||
Posted in reply to Ivan Senji | > That looks scary :)
Be afraid, be very afraid.
|
Copyright © 1999-2021 by the D Language Foundation