September 11, 2007
"Walter Bright" <newshound1@digitalmars.com> wrote in message news:fc45ic$1k04$1@digitalmars.com...
> Stewart Gordon wrote:
>> Maybe.  But still, nested comments are probably likely to be supported by more code editors than such an unusual feature as delimited strings.
>
> Delimited strings are standard practice in Perl.

But how many editors do a good job of syntax-highlighting Perl anyway, considering the mutual dependence between the lexer and the parser?

> C++0x is getting delimited strings.  Code editors that can't handle them are going to become rapidly obsolete.

Maybe.  But an editor being obsolete doesn't stop people from using it and even liking it for the features it does have.  Take the number of people still using TextPad, for instance.

> The more unusual feature is the token delimited strings.

Indeed.

Stewart. 

September 11, 2007
Jari-Matti Mäkelä wrote:
> Kirk McDonald wrote:
> 
>> Walter Bright wrote:
>>> Stewart Gordon wrote:
>>>
>>>> Maybe.  But still, nested comments are probably likely to be supported
>>>> by more code editors than such an unusual feature as delimited strings.
>>>
>>> Delimited strings are standard practice in Perl. C++0x is getting
>>> delimited strings. Code editors that can't handle them are going to
>>> become rapidly obsolete.
>>>
>>> The more unusual feature is the token delimited strings.
>> Which, since there's no nesting going on, are actually very easy to
>> match. The Pygments lexer matches them with the following regex:
>>
>> q"([a-zA-Z_]\w*)\n.*?\n\1"
> 
> It's great to see Pygments handles so many possible syntaxes. Unfortunately
> backreferences are not part of regular expressions. I've noticed two kinds
> of problems in tools:
> 
> a) some can't handle backreferences, but provide support for nested comments
> as a special case. So comments are no problem then, but all delimited
> strings are.
> 
> b) some lexers handles both nested comments and delimited strings, but all
> delimiters must be enumerated in the language definition. Even worse, some
> highlighters only handle delimited comments, not strings.
> 
> Maybe the new features (= one saves on average < 5 characters of typing per
> string) are more important than tool support? Maybe all tools should be
> rewritten in Python & Pygments?

D's delimited strings can (luckily) be scanned with regular languages, because the enclosing double quotes are required. Otherwise the lexical structure wouldn't even be context-free, and it would be a nightmare for automatically generated lexers.
Therefore you can match q"[^"]*" and check the delimiters during (context-sensitive) semantic analysis.
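Something along these lines, as a rough sketch in Python (the helper name and the bracket table are made up for illustration, not taken from any real tool):

import re

# coarse pass: grab anything that looks like q"..." as a single token
coarse = re.compile(r'q"[^"]*"')

def delimiters_match(lexeme):
    # lexeme looks like q"<open>...<close>"; strip the leading q" and the trailing "
    body = lexeme[2:-1]
    if not body:
        return False
    pairs = {'[': ']', '(': ')', '<': '>', '{': '}'}
    close = pairs.get(body[0], body[0])  # non-bracket delimiters close with themselves
    return body.endswith(close)

m = coarse.search('auto s = q"(hello)";')
print(m.group(0), delimiters_match(m.group(0)))  # q"(hello)" True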
September 11, 2007
Jascha Wetzel wrote:

> D's delimited strings can (luckily) be scanned with regular languages, because the enclosing double quotes are required. else the lexical structure wouldn't even be context free and a nightmare for automatically generated lexers.

Right, thanks.

> therefore you can match q"[^"]*" and check the delimiters during (context sensitive) semantic analysis.

But then e.g. a syntax highlighter needs the semantic info to change the style of the text within the delimiters, and the analyser also needs to check whether the two delimiters match. Like I said above, if the tool doesn't provide enough support, you're stuck. I haven't searched for all corner cases, but wasn't the old grammar scannable and highlightable with plain regular expressions (except for the nested comments, of course)?
September 11, 2007
Jari-Matti Mäkelä wrote:
> Jascha Wetzel wrote:
> 
>> D's delimited strings can (luckily) be scanned with regular languages,
>> because the enclosing double quotes are required. else the lexical
>> structure wouldn't even be context free and a nightmare for
>> automatically generated lexers.
> 
> Right, thanks.
> 
>> therefore you can match q"[^"]*" and check the delimiters during
>> (context sensitive) semantic analysis.
> 
> But e.g. syntax highlighting needs the semantic info to change the style of
> the text within the delimiters. The analyser also needs to check whether
> the two delimiters match. Like I said above, if the tool doesn't provide
> enough support, you're stuck. I haven't searched for all corner cases, but
> wasn't the old grammar scannable and highlightable with plain regular
> expressions (except the nested comments of course).

Before, the lexical structure was context-free because of nested comments and floats of the form "[0-9]+\.". The latter can be matched with regexps if they support lookaheads, though.
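For instance, with Python's re (just to illustrate the lookahead point; the real float rules are more involved than this):

import re

# "1." is a float, but in "1..2" the 1 must stay an integer followed by "..",
# so the float rule needs a negative lookahead after the dot
float_like = re.compile(r'[0-9]+\.(?!\.)')

print(float_like.match('1.'))    # matches "1."
print(float_like.match('1..2'))  # None, so "1" can be lexed as an integer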
If you stick to the specs verbatim, q"EOS...EOS" as a whole is a string literal. Assuming that all tokens/lexemes are atomic, a lexer can't "look inside" the string literal. From that point of view, the lexical structure is still context-free.
If possible, I'd add a thin wrapper around an automatically generated lexer that checks the delimiters in a post-process.
September 11, 2007
Jascha Wetzel wrote:

> before, the lexical structure was context free because of nested comments and floats of the form "[0-9]+\.". the latter can be matched with regexps if they support lookaheads, though.

Nested comments don't necessarily need much more than a constant-size counter, either.
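Roughly like this (a made-up Python sketch, not any tool's actual code):

def scan_nested_comment(text, start):
    # return the index just past the /+ ... +/ comment opened at 'start'
    assert text[start:start + 2] == '/+'
    depth, i = 1, start + 2
    while i < len(text) and depth > 0:
        pair = text[i:i + 2]
        if pair == '/+':
            depth += 1
            i += 2
        elif pair == '+/':
            depth -= 1
            i += 2
        else:
            i += 1
    return i

src = 'a /+ outer /+ inner +/ still outer +/ b'
print(src[scan_nested_comment(src, 2):])  # prints " b"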

> if you stick to the specs verbatim, q"EOS...EOS" as a whole is a string literal. assuming that all tokens/lexemes are atomic, a lexer can't "look inside" the string literal. from that point of view, the lexical structure it's still context free.

But does a simple tool have to be so complex?

> if possible, i'd add a thin wrapper around an automatically generated lexer that checks the delimiters in a postprocess.

That's a bit harder with e.g. closed-source tools.


Btw, is this a bug?

auto foo = q"EOS
EOS
EOS";

doesn't compile with dmd 2.004. Or is the closing " always supposed to follow a newline plus the matching identifier?
September 11, 2007
Jascha Wetzel wrote:
> Jari-Matti Mäkelä wrote:
> 
>> Kirk McDonald wrote:
>>
>>> Walter Bright wrote:
>>>
>>>> Stewart Gordon wrote:
>>>>
>>>>> Maybe.  But still, nested comments are probably likely to be supported
>>>>> by more code editors than such an unusual feature as delimited strings.
>>>>
>>>>
>>>> Delimited strings are standard practice in Perl. C++0x is getting
>>>> delimited strings. Code editors that can't handle them are going to
>>>> become rapidly obsolete.
>>>>
>>>> The more unusual feature is the token delimited strings.
>>>
>>> Which, since there's no nesting going on, are actually very easy to
>>> match. The Pygments lexer matches them with the following regex:
>>>
>>> q"([a-zA-Z_]\w*)\n.*?\n\1"
>>
>>
>> It's great to see Pygments handles so many possible syntaxes. Unfortunately
>> backreferences are not part of regular expressions. I've noticed two kinds
>> of problems in tools:
>>
>> a) some can't handle backreferences, but provide support for nested comments
>> as a special case. So comments are no problem then, but all delimited
>> strings are.
>>
>> b) some lexers handles both nested comments and delimited strings, but all
>> delimiters must be enumerated in the language definition. Even worse, some
>> highlighters only handle delimited comments, not strings.
>>
>> Maybe the new features (= one saves on average < 5 characters of typing per
>> string) are more important than tool support? Maybe all tools should be
>> rewritten in Python & Pygments?
> 
> 
> D's delimited strings can (luckily) be scanned with regular languages, because the enclosing double quotes are required. else the lexical structure wouldn't even be context free and a nightmare for automatically generated lexers.
> therefore you can match q"[^"]*" and check the delimiters during (context sensitive) semantic analysis.

Is the following a valid string?

q"/foo " bar/"

The grammar does not make it clear. The Pygments lexer treats it as though it is, under the assumption that the string continues until the first matching /" is found.

Walter also said, in another branch of the thread, that this is not valid:

q"/foo/bar/"

Since it isn't all /that/ hard to match these examples, I wonder why they are disallowed. Just to simplify the lexer that much more?

And, ah! I have found a bug in the Pygments lexer already:

auto a = q"/foo/";
auto b = q"/bar/";

Everything from the opening of the first string literal to the end of the second is highlighted. Oops. I have a fix for the lexer, dsource will be updated at some point.
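For the curious, the symptom looks like a classic greedy match. To illustrate with plain Python re (this is not the actual Pygments code or the patch):

import re

src = 'auto a = q"/foo/"; auto b = q"/bar/";'
print(re.findall(r'q"/.*/"', src))   # ['q"/foo/"; auto b = q"/bar/"'] -- greedy, spans both literals
print(re.findall(r'q"/.*?/"', src))  # ['q"/foo/"', 'q"/bar/"'] -- non-greedy, stops at the first /"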

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
September 11, 2007
Jari-Matti Mäkelä wrote:
> Kirk McDonald wrote:
> 
> 
>>Walter Bright wrote:
>>
>>>Stewart Gordon wrote:
>>>
>>>
>>>>Maybe.  But still, nested comments are probably likely to be supported
>>>>by more code editors than such an unusual feature as delimited strings.
>>>
>>>
>>>Delimited strings are standard practice in Perl. C++0x is getting
>>>delimited strings. Code editors that can't handle them are going to
>>>become rapidly obsolete.
>>>
>>>The more unusual feature is the token delimited strings.
>>
>>Which, since there's no nesting going on, are actually very easy to
>>match. The Pygments lexer matches them with the following regex:
>>
>>q"([a-zA-Z_]\w*)\n.*?\n\1"
> 
> 
> It's great to see Pygments handles so many possible syntaxes. Unfortunately
> backreferences are not part of regular expressions. I've noticed two kinds
> of problems in tools:
> 
> a) some can't handle backreferences, but provide support for nested comments
> as a special case. So comments are no problem then, but all delimited
> strings are.
> 
> b) some lexers handles both nested comments and delimited strings, but all
> delimiters must be enumerated in the language definition. Even worse, some
> highlighters only handle delimited comments, not strings.
> 
> Maybe the new features (= one saves on average < 5 characters of typing per
> string) are more important than tool support? Maybe all tools should be
> rewritten in Python & Pygments?

While D now requires a fairly powerful lexer to lex properly, it's still easier to lex than, for example, Ruby. Ruby's heredoc strings are more complicated than D's. Even Pygments requires some advanced callback trickery to lex them properly.

Docs on Ruby's "here document" string literals:
http://docs.huihoo.com/ruby/ruby-man-1.4/syntax.html#here_doc

Pygments's Ruby lexer:
http://trac.pocoo.org/browser/pygments/trunk/pygments/lexers/agile.py#L260

Also, the lexical phase is still entirely independent of the syntactical and semantic phases, even if it is a little more difficult than it was before.

My point is simply that any tool capable of lexing Ruby -- and there are a number of these -- is more than powerful enough to lex D. So the bar is high, but quite reachable.

I do not think it is extraordinary that a tool written in Python would take advantage of Python's regular expressions' features.
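For example, the pattern quoted above works directly with Python's re module (illustrative only; the real lexer has more context around it):

import re

s = 'auto a = q"EOS\nhello\nworld\nEOS";'
m = re.search(r'q"([a-zA-Z_]\w*)\n.*?\n\1"', s, re.DOTALL)
print(m.group(0))  # the whole q"EOS ... EOS" literal
print(m.group(1))  # EOS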

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
September 11, 2007
Jari-Matti Mäkelä wrote:
> Kirk McDonald wrote:
> 
>> Walter Bright wrote:
>>> Stewart Gordon wrote:
>>>
>>>> Maybe.  But still, nested comments are probably likely to be supported
>>>> by more code editors than such an unusual feature as delimited strings.
>>>
>>> Delimited strings are standard practice in Perl. C++0x is getting
>>> delimited strings. Code editors that can't handle them are going to
>>> become rapidly obsolete.
>>>
>>> The more unusual feature is the token delimited strings.
>> Which, since there's no nesting going on, are actually very easy to
>> match. The Pygments lexer matches them with the following regex:
>>
>> q"([a-zA-Z_]\w*)\n.*?\n\1"
> 
> It's great to see Pygments handles so many possible syntaxes. Unfortunately
> backreferences are not part of regular expressions. I've noticed two kinds
> of problems in tools:
> 
> a) some can't handle backreferences, but provide support for nested comments
> as a special case. So comments are no problem then, but all delimited
> strings are.
> 
> b) some lexers handles both nested comments and delimited strings, but all
> delimiters must be enumerated in the language definition. Even worse, some
> highlighters only handle delimited comments, not strings.
> 
> Maybe the new features (= one saves on average < 5 characters of typing per
> string) are more important than tool support? Maybe all tools should be
> rewritten in Python & Pygments?

Ok, why would syntax highlighting have to be implemented with a regexp in the first place?

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
September 11, 2007
Jari-Matti Mäkelä wrote:
> Jascha Wetzel wrote:
> 
>> before, the lexical structure was context free because of nested
>> comments and floats of the form "[0-9]+\.". the latter can be matched
>> with regexps if they support lookaheads, though.
> 
> Nested comments don't necessarily need much more than a constant size
> counter, either.

It makes the lexical grammar context-free rather than regular, though, so it cannot be implemented with regular expressions alone.

> Btw, is this a bug?
> 
> auto foo = q"EOS
> EOS
> EOS";
> 
> doesn't compile with dmd 2.004. Or is the " always supposed to follow \n +
> matching identifier?

Yep, since a non-nesting delimiter may only appear twice.
September 11, 2007
Kirk McDonald wrote:
> Jascha Wetzel wrote:
>> therefore you can match q"[^"]*" and check the delimiters during (context sensitive) semantic analysis.
> 
> Is the following a valid string?
> 
> q"/foo " bar/"

Oh, you're right, of course...

> Walter also said, in another branch of the thread, that this is not valid:
> 
> q"/foo/bar/"
> 
> Since it isn't all /that/ hard to match these examples, I wonder why they are disallowed. Just to simplify the lexer that much more?

What string would that represent?
foo/bar
foobar
foo