September 10, 2007
Walter Bright wrote:
> Stewart Gordon wrote:
> 
>> Maybe.  But still, nested comments are likely to be supported by more code editors than such an unusual feature as delimited strings.
> 
> Delimited strings are standard practice in Perl. C++0x is getting delimited strings. Code editors that can't handle them are going to become rapidly obsolete.
> 
> The more unusual feature is the token delimited strings.

Which, since there's no nesting going on, are actually very easy to match. The Pygments lexer matches them with the following regex:

q"([a-zA-Z_]\w*)\n.*?\n\1"

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
September 10, 2007
Hello Walter,

Thanks for the release. Could you clarify a few things regarding the new string literals for me, please?

Example:
q"/abc/def/" // Is this "abc/def" or is this an error?

Token string examples:
q{__TIME__} // Should special tokens be evaluated? Would that result in a different string than "__TIME__"?
q{666, this is super __EOF__} // Should __EOF__ be evaluated here causing the token string to be unterminated?
q{#line 4 "path/to/file"
} // Should the special token sequence be evaluated here?

You provided the following example on the lexer page:
q{ 67QQ }            // error, 67QQ is not a valid D token
Isn't your comment wrong? I see two valid tokens there: an integer "67" and an identifier "QQ".

Regards,
Aziz
September 10, 2007
Kirk McDonald wrote:
> Walter Bright wrote:
>> The more unusual feature is the token delimited strings.
> 
> Which, since there's no nesting going on, are actually very easy to match. The Pygments lexer matches them with the following regex:
> 
> q"([a-zA-Z_]\w*)\n.*?\n\1"

I meant the:

	q{ these must be valid D tokens { and brackets nest } /* ignore this } */ };

September 10, 2007
Walter Bright wrote:
> Kirk McDonald wrote:
> 
>> Walter Bright wrote:
>>
>>> The more unusual feature is the token delimited strings.
>>
>> Which, since there's no nesting going on, are actually very easy to match. The Pygments lexer matches them with the following regex:
>>
>> q"([a-zA-Z_]\w*)\n.*?\n\1"
> 
> I meant the:
> 
>     q{ these must be valid D tokens { and brackets nest } /* ignore this } */ };
> 

Those are also fairly easy. The Pygments lexer only highlights the opening q{ and the closing }. The tokens inside of the string are highlighted normally.

Since this lexer is the one used by Dsource, I've thrown together a wiki page showing it off:

http://www.dsource.org/projects/dsource/wiki/DelimitedStringHighlighting

A note about this lexer: It uses a combination of regular expressions, a state machine, and a stack. When a regex matches, you usually just specify that the matching text should be highlighted as such-and-such a token. In some cases, though, you want to push a particular state onto the stack, which will then swap in a different set of regexes, until such time as this new state pops itself off the stack.

Also, it is of course written in Python, so the code below is Python code.

For instance, the rule for the "heredoc" strings, which I mentioned previously, looks like this:

        (r'q"([a-zA-Z_]\w*)\n.*?\n\1"', String),

That is, it takes the chunk of text matched by that regex, and highlights it as a string.

The entry point for token strings is the following rule:

        (r'q{', String, 'token_string'),

Or: Highlight the token "q{" as a string, then push the 'token_string' state onto the stack. (This third argument is optional, and most of the rules do not have it.) The 'token_string' state looks like this:

        'token_string': [
            (r'{', Punctuation, 'token_string_nest'),
            (r'}', String, '#pop'),
            include('root'),
        ],
        'token_string_nest': [
            (r'{', Punctuation, '#push'),
            (r'}', Punctuation, '#pop'),
            include('root'),
        ],

include('root') tells it to include the contents of the 'root' state. (That's the state the D lexer starts out in; it holds all of the regular token rules.) '#push' means to push the current state onto the stack again, and '#pop' means to pop off of the stack. By putting the rules for '{' and '}' before the include('root') line, we override their default behavior. (Which is just to be highlighted as punctuation.)

These two nearly-identical states are needed because we want to highlight '}' as a string only when it closes the token string itself. When '}' closes a nested brace, we want to highlight it as regular punctuation and pop off of the stack.

Even if the above is gibberish to you, I still assert that it's quite straightforward, and indeed is very much like how the nesting /+ +/ comments were already highlighted. (Albeit without the include('root') call, and with only one extra state.)

All of this is built on the Pygments lexer framework. All I had to do was define the big list of regexes, and the occasional extra state (as I've outlined above).
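
To make that concrete, here's a minimal self-contained sketch of how those pieces fit together as a Pygments RegexLexer. (A toy lexer with a made-up name, not the actual Dsource lexer; the 'root' state is cut down to a couple of stand-in rules.)

    from pygments.lexer import RegexLexer, include
    from pygments.token import Punctuation, String, Text

    class TokenStringDemoLexer(RegexLexer):
        """Toy lexer: q{...} token strings plus catch-all text."""
        name = 'TokenStringDemo'
        tokens = {
            'root': [
                (r'q{', String, 'token_string'),  # push 'token_string'
                (r'[^{}q]+', Text),               # stand-in for the real D rules
                (r'q', Text),
                (r'[{}]', Punctuation),
            ],
            'token_string': [
                (r'{', Punctuation, 'token_string_nest'),
                (r'}', String, '#pop'),           # outermost } ends the string
                include('root'),
            ],
            'token_string_nest': [
                (r'{', Punctuation, '#push'),
                (r'}', Punctuation, '#pop'),      # nested } is plain punctuation
                include('root'),
            ],
        }

Feeding it q{ a { b } c } highlights the opening q{ and the final } as String, and the inner brace pair as ordinary Punctuation, just as described above.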

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
September 11, 2007
Kirk McDonald wrote:
> Walter Bright wrote:
>> Kirk McDonald wrote:
>>
>>> Walter Bright wrote:
>>>
>>>> The more unusual feature is the token delimited strings.
>>>
>>> Which, since there's no nesting going on, are actually very easy to match. The Pygments lexer matches them with the following regex:
>>>
>>> q"([a-zA-Z_]\w*)\n.*?\n\1"
>>
>> I meant the:
>>
>>     q{ these must be valid D tokens { and brackets nest } /* ignore this } */ };
>>
> 
> Those are also fairly easy. The Pygments lexer only highlights the opening q{ and the closing }. The tokens inside of the string are highlighted normally.
> 
> Since this lexer is the one used by Dsource, I've thrown together a wiki page showing it off:
> 
> http://www.dsource.org/projects/dsource/wiki/DelimitedStringHighlighting
> 

That's pretty danged nifty.  Any chance, however, that it could apply a slight background color to the token string?

-- Chris Nicholson-Sauls
September 11, 2007
Kirk McDonald wrote:
> Those are also fairly easy. The Pygments lexer only highlights the opening q{ and the closing }. The tokens inside of the string are highlighted normally.

Sweet!
September 11, 2007
Chris Nicholson-Sauls wrote:
> Kirk McDonald wrote:
> 
>> Walter Bright wrote:
>>
>>> Kirk McDonald wrote:
>>>
>>>> Walter Bright wrote:
>>>>
>>>>> The more unusual feature is the token delimited strings.
>>>>
>>>> Which, since there's no nesting going on, are actually very easy to match. The Pygments lexer matches them with the following regex:
>>>>
>>>> q"([a-zA-Z_]\w*)\n.*?\n\1"
>>>
>>> I meant the:
>>>
>>>     q{ these must be valid D tokens { and brackets nest } /* ignore this } */ };
>>>
>>
>> Those are also fairly easy. The Pygments lexer only highlights the opening q{ and the closing }. The tokens inside of the string are highlighted normally.
>>
>> Since this lexer is the one used by Dsource, I've thrown together a wiki page showing it off:
>>
>> http://www.dsource.org/projects/dsource/wiki/DelimitedStringHighlighting
>>
> 
> That's pretty danged nifty.  Any chance, however, that it could apply a slight background color to the token string?
> 
> -- Chris Nicholson-Sauls

Not really. It would require defining a background-highlighted variant of every existing token, and then updating all of the styles to provide coloring for that background... Pygments simply isn't set up for that kind of manipulation. In fact, it would be even harder to highlight the whole thing as a string than to highlight it the way it is now. (Unless I simply ignored the requirement that its contents consist only of valid tokens.)

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
September 11, 2007
Aziz K. wrote:
> Could you clarify a few things regarding the new string literals for me, please?
> 
> Example:
> q"/abc/def/" // Is this "abc/def" or is this an error?

Error.

> Token string examples:
> q{__TIME__} // Should special tokens be evaluated? Would that result in a different string than "__TIME__"?

No, no.

> q{666, this is super __EOF__} // Should __EOF__ be evaluated here causing the token string to be unterminated?

Yes (__EOF__ is not a token, it's an end of file)

> q{#line 4 "path/to/file"
> } // Should the special token sequence be evaluated here?

No.

> You provided the following example on the lexer page:
> q{ 67QQ }            // error, 67QQ is not a valid D token
> Isn't your comment wrong? I see two valid tokens there: an integer "67" and an identifier "QQ".

I think you're right.
September 11, 2007
Thanks for clarifying. While implementing the scanning methods for the new string literals in my lexer, I found a few other ambiguities:

q"∆abcdef∆" // Might be superfluous to ask, but are (non-alpha) Unicode character delimiters allowed?
q" abcdef " // "abcdef". Allowed?

q"
äöüß
" // "äöüß". Should leading newlines be skipped or are they allowed as delimiters?

q"EOF
abcdefEOF" // Valid? Or is \nEOF a requirement? If so, how would you write such a string excluding the last newline? Because you say in the specs that the last newline is part of the string. Maybe it shouldn't be?
q"EOF
abcdef
  EOF" // Provided the previous example is an error. Is indenting the matching delimiter allowed (with " \t\v\f")?

Walter Bright wrote:
> Aziz K. wrote:
>> q{666, this is super __EOF__} // Should __EOF__ be evaluated here causing the token string to be unterminated?
>
> Yes (__EOF__ is not a token, it's an end of file)
Are you sure you want __EOF__ to really mean end of file, like '\0' and 0x1A (^Z)? Every time one encounters '_', one would have to look ahead for "_EOF__" and make sure it's not followed by a valid identifier character. I have twelve places where I check for \0 and ^Z. It wouldn't be that hard to adapt the code, but I'm sure it would adversely impact the speed of a D lexer in general.
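
For illustration, a rough sketch of that lookahead (a hypothetical helper, not actual code from my lexer; ASCII identifier characters only, for brevity):

    def at_eof_token(src, i):
        # i points at a '_' in src; __EOF__ only counts as end of file
        # when it is not merely a prefix of a longer identifier.
        if src.startswith('__EOF__', i):
            j = i + len('__EOF__')
            return j == len(src) or not (src[j].isalnum() or src[j] == '_')
        return False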

Regards,
Aziz
September 11, 2007
Kirk McDonald wrote:

> Walter Bright wrote:
>> Stewart Gordon wrote:
>> 
>>> Maybe.  But still, nested comments are likely to be supported by more code editors than such an unusual feature as delimited strings.
>> 
>> 
>> Delimited strings are standard practice in Perl. C++0x is getting delimited strings. Code editors that can't handle them are going to become rapidly obsolete.
>> 
>> The more unusual feature is the token delimited strings.
> 
> Which, since there's no nesting going on, are actually very easy to match. The Pygments lexer matches them with the following regex:
> 
> q"([a-zA-Z_]\w*)\n.*?\n\1"

It's great to see Pygments handles so many possible syntaxes. Unfortunately, backreferences are not part of formal regular expressions, so tools built on strictly regular engines can't use that trick. I've noticed two kinds of problems in tools:

a) some can't handle backreferences, but provide support for nested comments as a special case. So comments are no problem then, but all delimited strings are.

b) some lexers handle both nested comments and delimited strings, but all delimiters must be enumerated in the language definition. Even worse, some highlighters only handle delimited comments, not strings.

Maybe the new features (which save on average fewer than 5 characters of typing per string) are more important than tool support? Maybe all tools should be rewritten in Python & Pygments?