compile-time regex redux

String mixins, in order to be useful, need an ability to manipulate strings at compile time. Currently, the core operations on strings that can be done are:

1) indexed access
2) slicing
3) comparison
4) getting the length
5) concatenation

Any other functionality can be built up from these using template metaprogramming.

The problem is that parsing strings using templates generates a large number of template instantiations, is (relatively) very slow, and consumes a lot of memory (at compile time, not runtime). For example, ParseInteger would need 4 template instantiations to parse 5678, and each template instantiation would also include the rest of the input as part of the template instantiation's mangled name.

At some point, this will prove a barrier to large scale use of this feature.

Andrei suggested using compile time regular expressions to shoulder much of the burden, reducing parsing of any particular token to one instantiation.

The last time I introduced core regular expressions into D, it was soundly rejected by the community and was withdrawn, and for good reasons.

But I think we now have good reasons to revisit this, at least for compile time use only. For example:

	("aa|b" ~~ "ababb") would evaluate to "ab"

I expect one would generally only see this kind of thing inside templates, not user code.

February 07, 2007

Re: compile-time regex redux

Posted by Andrei Alexandrescu (See Website For Email)
in reply to Walter Bright

Permalink

Andrei Alexandrescu (See Website For Email)

Posted in reply to Walter Bright

Permalink

Walter Bright wrote:
> String mixins, in order to be useful, need an ability to manipulate strings at compile time. Currently, the core operations on strings that can be done are:
> 
> 1) indexed access
> 2) slicing
> 3) comparison
> 4) getting the length
> 5) concatenation
> 
> Any other functionality can be built up from these using template metaprogramming.
> 
> The problem is that parsing strings using templates generates a large number of template instantiations, is (relatively) very slow, and consumes a lot of memory (at compile time, not runtime). For example, ParseInteger would need 4 template instantiations to parse 5678, and each template instantiation would also include the rest of the input as part of the template instantiation's mangled name.
> 
> At some point, this will prove a barrier to large scale use of this feature.
> 
> Andrei suggested using compile time regular expressions to shoulder much of the burden, reducing parsing of any particular token to one instantiation.

Let's also note for future memento that storing the md5 hash of the name instead of the full name is an additional posibility.

> The last time I introduced core regular expressions into D, it was soundly rejected by the community and was withdrawn, and for good reasons.
> 
> But I think we now have good reasons to revisit this, at least for compile time use only. For example:
> 
>     ("aa|b" ~~ "ababb") would evaluate to "ab"
> 
> I expect one would generally only see this kind of thing inside templates, not user code.

The more traditional way is to mention the string first and pattern second, so:

("ababb" ~~ "aa|b") // match this guy against this pattern

And I think it returns "b" - juxtaposition has a higher priority than "|", so your pattern is "either two a's or one b". :o)

One program I highly recommend for playing with regexes is The Regex Coach: http://weitz.de/regex-coach/.


Andrei

February 07, 2007

Re: compile-time regex redux

Posted by Robby
in reply to Andrei Alexandrescu (See Website For Email)

Permalink

Robby

Posted in reply to Andrei Alexandrescu (See Website For Email)

Permalink

Andrei Alexandrescu (See Website For Email) wrote:
> Walter Bright wrote:
[snipped]
>> The last time I introduced core regular expressions into D, it was soundly rejected by the community and was withdrawn, and for good reasons.
>>
>> But I think we now have good reasons to revisit this, at least for compile time use only. For example:
>>
>>     ("aa|b" ~~ "ababb") would evaluate to "ab"
>>
>> I expect one would generally only see this kind of thing inside templates, not user code.
> 
> The more traditional way is to mention the string first and pattern second, so:
> 
> ("ababb" ~~ "aa|b") // match this guy against this pattern
> 
> And I think it returns "b" - juxtaposition has a higher priority than "|", so your pattern is "either two a's or one b". :o)
> 
> One program I highly recommend for playing with regexes is The Regex Coach: http://weitz.de/regex-coach/.
> 
> 
> Andrei

I wasn't here during the first round of regexes so bare with me. Though I can assume that with D's growing visibility the past few months, I'm probably not the only one. Having used Ruby for years I'm quite! fond of them, some general questions.

What version of regex would be the target? There's a few variations out there.

Probably a couple of green questions, what is the benefit of having a compile time regex implementation over a "really fast" implementation? Or having the expression compiled and the string allowed runtime?

Assuming ~~ will be the syntax used?

Sidebar chuckle :
http://dev.perl.org/perl6/doc/design/syn/S05.html
Um, wow. They've really turned 5's implementation on its head.. could be interesting to watch the reaction from that.. could be louder than the vb guys when vb.net was first released..

February 07, 2007

Re: compile-time regex redux

Posted by Walter Bright
in reply to Andrei Alexandrescu (See Website For Email)

Permalink

Walter Bright

Posted in reply to Andrei Alexandrescu (See Website For Email)

Permalink

Andrei Alexandrescu (See Website For Email) wrote:
> Walter Bright wrote:
>> But I think we now have good reasons to revisit this, at least for compile time use only. For example:
>>
>>     ("aa|b" ~~ "ababb") would evaluate to "ab"
>>
>> I expect one would generally only see this kind of thing inside templates, not user code.
> 
> The more traditional way is to mention the string first and pattern second, so:
> 
> ("ababb" ~~ "aa|b") // match this guy against this pattern
> 
> And I think it returns "b" - juxtaposition has a higher priority than "|", so your pattern is "either two a's or one b". :o)

My bad. Some more things to think about:

1) Returning the left match, the match, the right match?
2) Returning values of parenthesized expressions?
3) Some sort of sed-like replacement syntax?

An alternative is to have the compiler recognize std.Regexp names as being built-in.

February 07, 2007

Re: compile-time regex redux

Posted by Walter Bright
in reply to Robby

Permalink

Walter Bright

Posted in reply to Robby

Permalink

Robby wrote:
> What version of regex would be the target? There's a few variations out there.

The version would behave identically to the D runtime library std.regexp, for consistency's sake. (And that version is compatible with Javascript v3.)

> Probably a couple of green questions, what is the benefit of having a compile time regex implementation over a "really fast" implementation? Or having the expression compiled and the string allowed runtime?

The need is for compile time, not runtime, manipulation of string literals.

February 07, 2007

Re: compile-time regex redux

Posted by Bill Baxter
in reply to Walter Bright

Permalink

Bill Baxter

Posted in reply to Walter Bright

Permalink

Walter Bright wrote:
> String mixins, in order to be useful, need an ability to manipulate strings at compile time. Currently, the core operations on strings that can be done are:
> 
> 1) indexed access
> 2) slicing
> 3) comparison
> 4) getting the length
> 5) concatenation
> 
> Any other functionality can be built up from these using template metaprogramming.
> 
> The problem is that parsing strings using templates generates a large number of template instantiations, is (relatively) very slow, and consumes a lot of memory (at compile time, not runtime). For example, ParseInteger would need 4 template instantiations to parse 5678, and each template instantiation would also include the rest of the input as part of the template instantiation's mangled name.
> 
> At some point, this will prove a barrier to large scale use of this feature.
> 
> Andrei suggested using compile time regular expressions to shoulder much of the burden, reducing parsing of any particular token to one instantiation.

That would help I suppose, but at the same time regexps themselves have a tendancy to end up being 'write-only' code.  The heavy use of them in perl is I think a large part of what gives it a rep as a write-only language.   Heh heh.  I just found this regexp for matching RFC 822 email addresses:
    http://www.regular-expressions.info/email.html
(the one at the bottom of the page)


--bb

February 07, 2007

Re: compile-time regex redux

Posted by Sean Kelly
in reply to Walter Bright

Permalink

Sean Kelly

Posted in reply to Walter Bright

Permalink

Walter Bright wrote:
> 
> But I think we now have good reasons to revisit this, at least for compile time use only. For example:
> 
>     ("aa|b" ~~ "ababb") would evaluate to "ab"
> 
> I expect one would generally only see this kind of thing inside templates, not user code.

Just a quick comment--I want to think about this a bit more.  If we are given compile-time regular expressions it may be useful if we could obtain more information than this.  For example, I would probably also want to know where in the source string the match begins.

Sean

February 07, 2007

Re: compile-time regex redux

Posted by janderson
in reply to Walter Bright

Permalink

janderson

Posted in reply to Walter Bright

Permalink

Walter Bright wrote:
> String mixins, in order to be useful, need an ability to manipulate strings at compile time. Currently, the core operations on strings that can be done are:
> 
> At some point, this will prove a barrier to large scale use of this feature.

While I'm a fan of regex I'm not sure it meets the goal of scale.  I can imagine that regex expressions + templates will get unreadable (or at least very slow to read) as programs get larger.

-Joel

February 07, 2007

Re: compile-time regex redux

Posted by kenny
in reply to Walter Bright

Permalink

kenny

Posted in reply to Walter Bright

Permalink

Walter Bright wrote:
> Andrei Alexandrescu (See Website For Email) wrote:
>> Walter Bright wrote:
>>> But I think we now have good reasons to revisit this, at least for compile time use only. For example:
>>>
>>>     ("aa|b" ~~ "ababb") would evaluate to "ab"
>>>
>>> I expect one would generally only see this kind of thing inside templates, not user code.
>>
>> The more traditional way is to mention the string first and pattern second, so:
>>
>> ("ababb" ~~ "aa|b") // match this guy against this pattern
>>
>> And I think it returns "b" - juxtaposition has a higher priority than "|", so your pattern is "either two a's or one b". :o)
> 
> My bad. Some more things to think about:
> 
> 1) Returning the left match, the match, the right match?
> 2) Returning values of parenthesized expressions?
> 3) Some sort of sed-like replacement syntax?
> 
> An alternative is to have the compiler recognize std.Regexp names as being built-in.

Walter, I don't hate regex -- I just don't use it. It seems to me that to figure out regex syntax takes longer than writing quick for/while statements, and I usually forget cases in regex too...

just being able to write like I can in D with compile time variables would be so much easier for me, and it would only require one template function instead of 35 to parse a simple string... for example.

1. A while back, I needed something very quickly to remove whitespace. it took me much less time with loops than I ever could have done with a regex. I want to be able to do the same in templates, if possible. I will be trying to reproduce later this, but I think that it will require a lot of templates.
2. what about building associative arrays out of a string? I have this function from existing code. It didn't take too long to write. I want to be able to write something like this in templates to build assoc arrays dynamically.

I know I'm asking for a lot, but the way templates handle string are still kinda weird to me. Would string parsing in this sort of way be absolutely impossible with templates? I have not had good luck with it. Perhaps I missed something...


EXAMPLES BELOW



















--- whitespace removal ---
char[] t = text.dup;
char[] new_text;
uint len = new_text.length = t.length;
new_text.length = 0;

t = replace(t, "\r\n", "\n");
t = replace(t, "\r", "\n");
t = replace(t, "\t", " ");

int i = 0;
len = t.length;
while(i < len) {
	if(t[i] == '/' && t[i+1] == '/') {
		if(i == 0 || t[i-1] == ' ' || t[i-1] == '\n') {
			while(i < len) {
				if(t[i] == '\n') {
					break;
				}
				
				t[i++] = '\n';
			}
		}
	}
	
	i++;
}

for(i = 0; i < len; i++) {
	if(t[i] < 0x20) {
		if(t[i] == '\n') {
			i++;
			while(i < len && t[i] == ' ') {
				i++;
			}
			i--;
		} else {
			t[i] = ' ';
			i--;
		}
	} else if(!(t[i] == ' ' && i > 0 && t[i-1] == ' ')) {
		new_text ~= t[i];
	}
}

if(new_text[0] == ' ') {
	new_text = new_text[1 .. length-1];
}

if(new_text[length-1] == ' ') {
	new_text.length = new_text.length-1;
}


--- ASSOC ARRAY BUILDING ---

char[][char[]] parse_options(char[] text) {
	char[][char[]] options;
	text = strip(text);
	uint text_len = text.length;
	
	uint i = 0;
	while(text[i] == '{' && text[text_len-1] == '}') {
		text_len--;
		i++;
	}
	
	if(i > 0) {
		text = strip(text[i .. text_len]);
		text_len = text.length;
		i = 0;
	}
	
	for(;i < text_len; i++) {
		if(text[i] != ' ' && text[i] != ',') {
			
			// { label: "options, yeah", label2: `variable`, label3: {label1: "lala:", label2: `variable2`}}
			//   ^^^^^
			uint start = i;
			while(text[i] != ':') {
				
				if(text[i] == ' ') {
					log_warning!("found a space in your label... expecting ':' in '^'", text[start .. i]);
				}
				
				if(++i >= text_len) {
					log_error!("expected label... but not found '^'", text[start .. i]);
					goto return_options;
				}
			}
			
			char[] label = strip(text[start .. i]);
			
			// { label: "options, yeah", label2: `variable`, label3: {label1: "lala:", label2: `variable2`}}
			//         ^
			i++;
			
			while(text[i] == ' ') {
				if(++i >= text_len) {
					log_error!("label has no value '^'", text);
					goto return_options;
				}
			}
			
			uint def_start = i++;
			switch(text[def_start]) {
			case '{':
				// { label: "options, yeah", label2: `variable`, label3: {label1: "lala:", label2: `variable2`}}
				//                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				uint scopee = 1;
				while(true) {
					if(++i >= text_len) {
						log_error!("cannot find end to text string in label '^'", label);
						goto return_options;
					}
					
					if(text[i] == '{') {
						scopee++;
					} else if(text[i] == '}') {
						if(scopee == 1) {
							break;
						}
						
						scopee--;
					}
					
					// skip text
					if(text[i] == '"' || text[i] == '\'' || text[i] == '`') {
						char delim = text[i];
						i++;
						if(i >= text_len) break;
						while(text[i] != delim || (text[i] == delim && text[i-1] == '\\')) {
							if(++i >= text_len) {
								log_error!("cannot find end to text string in label '^'", label);
								goto return_options;
							}
						}
					}
				}
				
				options[label] = strip(text[def_start .. i+1]);
				assert(strip(text[def_start .. i+1])[0] == '{');
				assert(strip(text[def_start .. i+1])[length-1] == '}');
				
			break;
			case '"', '`', '\'':
				// { label: "options, yeah", label2: `variable`, label3: {label1: "lala:", label2: `variable2`}}
				//           ^^^^^^^^^^^^^            ^^^^^^^^
				char delim = text[def_start];
				char[] string = "";
				while(text[i] != delim || (text[i] == delim && text[i-1] == '\\')) {
					if(text[i] == delim && text[i-1] == '\\') {
						string[length-1] = delim;
					} else {
						string ~= text[i];
					}
					
					if(++i >= text_len) {
						log_error!("cannot find end to text string in label '^'", label);
						goto return_options;
					}
				}
				
				options[label] = string;
			break;
			default:
				// { label: "options, yeah", label2: variable, label3: {label1: "lala:", label2: `variable2`}}
				//                                   ^^^^^^^^
				while(text[i] != ' ' && text[i] != ',' && ++i < text_len) { }
				options[label] = text[def_start .. i];
			}
		}
	}
	
return_options:
	return options;
}

February 07, 2007

Re: compile-time regex redux

Posted by Miles
in reply to Walter Bright

Permalink

Miles

Posted in reply to Walter Bright

Permalink

Walter Bright wrote:
> The problem is that parsing strings using templates generates a large number of template instantiations, is (relatively) very slow, and consumes a lot of memory (at compile time, not runtime). For example, ParseInteger would need 4 template instantiations to parse 5678,

Why instead of doing perversions with templates, don't you add a proper compile-time D interpreter for the purposes of generating code? Doing loops with template recursion sucks a lot. It really looks the wrong approach for the problem.

D already has static if() and tuple-foreach(). Just adding compile-time
variables and a static for() will be great. A compile-time D interpreter
will be awesome.

Top | Forum index | About this forum

Forums