August 01, 2012
Re: Let's stop parser Hell
On 2012-08-01 14:44, Philippe Sigaud wrote:

> Every time I think I understand D strings, you prove me wrong. So, I
> *still* don't get how that works:
>
> say I have
>
> auto s = " - some greek or chinese chars, mathematical symbols, whatever - "d;
>
> Then, the "..." part is lexed as a string literal. How can the string
> field in the Token magically contain UTF32 characters? Or, are they
> automatically cut in four nonsense chars each? What about comments
> containing non-ASCII chars? How can code coming after the lexer make
> sense of it?
>
> As Jacob says, many people code in English. That's right, but
>
> 1- they most probably use their own language for internal documentation
> 2- any i18n part of a code base will have non-ASCII chars
> 3- D is supposed to accept UTF-16 and UTF-32 source code.
>
> So, wouldn't it make sense to at least provide an option on the lexer
> to specifically store identifier lexemes and comments as a dstring?

I'm not quite sure how it works either. But I'm thinking like this:

The string representing what's in the source code can be either UTF-8 or 
the encoding of the file. I'm not sure if the lexer needs to re-encode 
the string if it's not in the same encoding as the file.

Then there's another field/function that returns the processed token, 
i.e. for a token of the type "int" it will return an actual int. This 
function will return different types of string depending on the type of 
the string literal the token represents.
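To make that idea concrete, here is a rough sketch of what such a function could look like. Everything in it is an assumption for illustration: the names `LiteralValue` and `processedValue` are made up, it only handles decimal integers and plain, well-formed "..." literals without escape sequences, and it is not taken from any existing lexer.

```d
import std.conv : to;
import std.variant : Algebraic;

// Hypothetical: one value type that can hold any of the processed literals.
alias LiteralValue = Algebraic!(long, string, wstring, dstring);

// Hypothetical: turn a raw lexeme into its processed value.
// Assumes a well-formed lexeme and no escape sequences.
LiteralValue processedValue(string lexeme)
{
    if (lexeme.length >= 2 && lexeme[0] == '"')
    {
        // The postfix decides which string type the value gets.
        switch (lexeme[$ - 1])
        {
            case 'w': return LiteralValue(lexeme[1 .. $ - 2].to!wstring);
            case 'd': return LiteralValue(lexeme[1 .. $ - 2].to!dstring);
            case 'c': return LiteralValue(lexeme[1 .. $ - 2]);
            default:  return LiteralValue(lexeme[1 .. $ - 1]); // no postfix
        }
    }
    // Anything else is treated as a decimal integer in this sketch.
    return LiteralValue(lexeme.to!long);
}

void main()
{
    assert(processedValue("42").get!long == 42);
    assert(processedValue(`"abc"`).get!string == "abc");
    assert(processedValue(`"abc"d`).get!dstring == "abc"d);
}
```

The raw lexeme stays a plain string here; only the processed value changes type with the postfix.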

-- 
/Jacob Carlborg
August 01, 2012
Re: Let's stop parser Hell
On Wednesday, August 01, 2012 13:30:31 Jacob Carlborg wrote:
> On 2012-07-31 23:20, Jonathan M Davis wrote:
> > I'm actually quite far along with one now - one which is specifically
> > written and optimized for lexing D. I'll probably be done with it not too
> > long after the 2.060 release (though we'll see). Writing it has been
> > going surprisingly quickly actually, and I've already found some bugs in
> > the spec as a result (some of which have been fixed, some of which I
> > still need to create pull requests for). So, regardless of what happens
> > with my lexer, at least the spec will be more accurate.
> 
> BTW, do you have the code online somewhere?

No, because I'm still in the middle of working on it.

- Jonathan M Davis
August 01, 2012
Re: Let's stop parser Hell
01.08.2012 1:20, Jonathan M Davis wrote:
> On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
>> Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
>> for D.
>
> I'm actually quite far along with one now - one which is specifically written
> and optimized for lexing D. I'll probably be done with it not too long after
> the 2.060 release (though we'll see). Writing it has been going surprisingly
> quickly actually, and I've already found some bugs in the spec as a result
> (some of which have been fixed, some of which I still need to create pull
> requests for). So, regardless of what happens with my lexer, at least the spec
> will be more accurate.
>
> - Jonathan M Davis
>

Good. I'll wait for the announcement.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij
August 01, 2012
Re: Let's stop parser Hell
On Wednesday, August 01, 2012 14:44:29 Philippe Sigaud wrote:
> Every time I think I understand D strings, you prove me wrong. So, I
> *still* don't get how that works:
> 
> say I have
> 
> auto s = " - some greek or chinese chars, mathematical symbols, whatever -
> "d;
> 
> Then, the "..." part is lexed as a string literal. How can the string
> field in the Token magically contain UTF32 characters?

It contains Unicode. The lexer is lexing whatever encoding the source is in, 
which has _nothing_ to do with the d on the end. It could be UTF-8, or UTF-16, 
or UTF-32. If we supported other encodings in ranges, it could be one of 
those. Which of those it is is irrelevant. As far as the value of the literal 
goes, these two strings are identical:

"ウェブサイト"
"\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"

The encoding of the source file is irrelevant. By tacking a d on the end

"ウェブサイト"d
"\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"d

you're just telling the compiler that you want the value that it generates to 
be in UTF-32. The source code could be in any of the supported encodings, and 
the string could be held in any encoding until the object code is actually 
generated.
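A small sketch of that point, assuming a UTF-8 source file. The variable names are made up for the example; the lengths count code units, which is why they differ per encoding even though the value is the same.

```d
import std.conv : to;
import std.stdio : writeln;

// The same value written two ways: raw characters and \u escapes.
string raw     = "ウェブサイト";
string escaped = "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8";

// The postfix only selects the encoding of the generated value.
wstring utf16 = "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"w;
dstring utf32 = "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"d;

void main()
{
    assert(raw == escaped);

    // .length counts code units, not characters.
    assert(escaped.length == 18); // 6 characters, 3 UTF-8 code units each
    assert(utf16.length == 6);    // 6 UTF-16 code units
    assert(utf32.length == 6);    // 6 UTF-32 code units

    // Re-encoding at run time gives the same values the postfixes produced.
    assert(escaped.to!wstring == utf16);
    assert(escaped.to!dstring == utf32);

    writeln(raw);
}
```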

> So, wouldn't it make sense to at least provide an option on the lexer
> to specifically store identifier lexemes and comments as a dstring?

You mean make it so that Token is 

struct Token(R)
{
   TokenType    type;
   R       str;
   LiteralValue value;
   SourcePos    pos;
}

instead of

struct Token
{
   TokenType    type;
   string       str;
   LiteralValue value;
   SourcePos    pos;
}

or do you mean something else? I may do something like that, but I would point 
out that if R doesn't have slicing, then that doesn't work. So, str can't 
always be the same type as the original range. For ranges with no slicing, it 
would have to be something else (probably either string or 
typeof(takeExactly(range))). However, making str R _does_ come at the cost of 
complicating code using the lexer, since instead of just using Token, you have 
to worry about whether it's a Token!string, Token!dstring, etc, and whether 
it's worth that complication is debatable. By far the most common use case is 
to lex string, and if str is string, and R is not, then you incur the penalty 
of converting R to string. So, the common use case is fast, and the uncommon 
use case still works but is slower, and the user of the lexer doesn't have to 
care what the original range type was.

It could go either way. I used string on first pass, but as I said, I could 
change it to R later if that makes more sense. I'm not particularly hung up on 
that little detail at this point, and that's probably one of the things that 
can be changed reasonably easily later.
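For comparison, a minimal sketch of the two variants side by side. `TokenType` and `SourcePos` are hypothetical stand-ins (their real definitions are not shown here), the `LiteralValue` field is left out to keep it short, and `TokenOf`, `makeToken` and `makeTokenOf` are names invented for the example.

```d
import std.conv : to;

// Hypothetical stand-ins for types not shown in this thread.
enum TokenType { identifier, stringLiteral }
struct SourcePos { size_t line, col; }

// Variant A: str is always string, whatever was lexed.
struct Token
{
    TokenType type;
    string    str;
    SourcePos pos;
}

// Variant B: str has the type of the input being lexed.
struct TokenOf(R)
{
    TokenType type;
    R         str;
    SourcePos pos;
}

// With variant A, a non-string input pays for a conversion per lexeme.
Token makeToken(dstring source, size_t begin, size_t end)
{
    return Token(TokenType.identifier, source[begin .. end].to!string,
                 SourcePos(1, begin + 1));
}

// With variant B, the lexeme is just a slice of the input.
TokenOf!R makeTokenOf(R)(R source, size_t begin, size_t end)
{
    return TokenOf!R(TokenType.identifier, source[begin .. end],
                     SourcePos(1, begin + 1));
}

void main()
{
    dstring src = "auto gr\u00F6\u00DFe = 1;"d;

    auto a = makeToken(src, 5, 10);
    auto b = makeTokenOf(src, 5, 10);

    static assert(is(typeof(a.str) == string));
    static assert(is(typeof(b.str) == dstring));

    assert(a.str == "gr\u00F6\u00DFe");
    assert(b.str is src[5 .. 10]); // same memory, no copy
}
```

Code that consumes variant B has to be generic over `TokenOf!string`, `TokenOf!dstring` and so on, which is the complication mentioned above.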

- Jonathan M Davis
August 01, 2012
Re: Let's stop parser Hell
On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:

> "ウェブサイト"
> "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"
>
> The encoding of the source file is irrelevant.

do you mean I can do:

string field = "ウェブサイト";

?

Geez, just tested it, it works. Even writeln(field) correctly outputs
the Japanese chars and dmd doesn't choke on it.
Bang, back to state 0: I don't get how D strings work.



> You mean make it so that Token is
>
> struct Token(R)
> {
>     TokenType    type;
>     R       str;
>     LiteralValue value;
>     SourcePos    pos;
> }
>
> instead of
>
> struct Token
> {
>     TokenType    type;
>     string       str;
>     LiteralValue value;
>     SourcePos    pos;
> }
>
> or do you mean something else?

Right, this.
August 01, 2012
Re: Let's stop parser Hell
On 2012-08-01 19:50, Philippe Sigaud wrote:
> On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
>
>> "ウェブサイト"
>> "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"
>>
>> The encoding of the source file is irrelevant.
>
> do you mean I can do:
>
> string field = "ウェブサイト";
>
> ?
>
> Geez, just tested it, it works. Even writeln(field) correctly outputs
> the Japanese chars and dmd doesn't choke on it.
> Bang, back to state 0: I don't get how D strings work.

Unicode supports three encodings: UTF-8, UTF-16 and UTF-32. All these 
encodings can store every character in the Unicode standard. What's 
different is how the characters are stored and how many bytes a single 
character takes to store in the string. For example:

string str = "ö";

The above character will take up two bytes in the string. On the other 
hand, this won't work:

char c = 'ö';

The reason is that the above character needs two bytes to be 
stored but "char" can only store one byte. Therefore you need to store 
the character in a type where it fits, i.e. "wchar" or "dchar". Or you 
can use a string, which can hold as many bytes as you need.
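A quick way to check this, as a sketch. The helper names `utf8Units` and `utf16Units` are made up; `codeLength` is from std.utf. The character is written as the escape '\u00F6' (ö) so the example does not depend on how the file itself is saved.

```d
import std.utf : codeLength;

// Number of code units one code point needs in each encoding.
size_t utf8Units(dchar c)  { return codeLength!char(c); }
size_t utf16Units(dchar c) { return codeLength!wchar(c); }

void main()
{
    assert(utf8Units('\u00F6') == 2);  // two bytes in UTF-8
    assert(utf16Units('\u00F6') == 1); // one 16-bit unit in UTF-16

    string s = "\u00F6";
    assert(s.length == 2);             // a string just holds both bytes

    // char c = '\u00F6';              // the case that does not work
    wchar w = '\u00F6';                // fits
    dchar d = '\u00F6';                // fits
    assert(w == d);
}
```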

Don't know if that makes it clearer.

-- 
/Jacob Carlborg
August 01, 2012
Re: Let's stop parser Hell
On Wednesday, August 01, 2012 19:50:10 Philippe Sigaud wrote:
> On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> > "ウェブサイト"
> > "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"
> > 
> > The encoding of the source file is irrelevant.
> 
> do you mean I can do:
> 
> string field = "ウェブサイト";
> 
> ?
> 
> Geez, just tested it, it works. Even writeln(field) correctly outputs
> the Japanese chars and dmd doesn't choke on it.
> Bang, back to state 0: I don't get how D strings work.

From http://dlang.org/lex.html:

D source text can be in one of the following formats:
* ASCII
* UTF-8
* UTF-16BE
* UTF-16LE
* UTF-32BE
* UTF-32LE

So, yes, you can stick Unicode characters directly in D code. Though I wonder 
about the correctness of the spec here. It claims that if there's no BOM, then 
it's ASCII, but unless vim inserts BOM markers into all of my .d files, none of 
them have BOM markers, and yet I can put Unicode in a .d file just fine with vim. 
I should probably study up on BOMs.

In any case, the source is read in whatever encoding it's in. String literals 
then all become UTF-8 in the final object code unless they're marked as 
specifically being another type via the postfix letter or they're inferred as 
being another type (e.g. when you assign a string literal to a dstring). 
Regardless, what's in the final object code is based on the types that the type 
system marks strings as, not what the encoding of the source code was.

So, a lexer shouldn't care about what the encoding of the source is beyond 
what it takes to convert it to a format that it can deal with and potentially 
being written in a way which makes handling a particular encoding more 
efficient. The values of literals and the like are completely unaffected 
regardless.
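As a rough illustration, assuming the lexer works on UTF-8 internally: the helper `toLexerInput` below is hypothetical, while `toUTF8` is the std.utf function. The same source text arriving in three encodings ends up as the same input, and the type of a literal comes from the type system rather than from the file.

```d
import std.utf : toUTF8;

// Hypothetical helper: bring any supported source encoding to UTF-8
// before handing it to the lexer.
string toLexerInput(S)(S source)
{
    return toUTF8(source);
}

void main()
{
    // The same source text as it might arrive from three differently
    // encoded files.
    string  fromUtf8  = "auto s = \"\u30A6\u30A7\u30D6\";";
    wstring fromUtf16 = "auto s = \"\u30A6\u30A7\u30D6\";"w;
    dstring fromUtf32 = "auto s = \"\u30A6\u30A7\u30D6\";"d;

    assert(toLexerInput(fromUtf16) == fromUtf8);
    assert(toLexerInput(fromUtf32) == fromUtf8);

    // Literal types: string by default, otherwise taken from the target.
    auto plain = "\u30A6";
    dstring inferred = "\u30A6";
    static assert(is(typeof(plain) == string));
    assert(plain.length == 3 && inferred.length == 1);
}
```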

- Jonathan M Davis
August 01, 2012
Re: Let's stop parser Hell
On 2012-08-01 20:24, Jonathan M Davis wrote:

> D source text can be in one of the following formats:
> * ASCII
> * UTF-8
> * UTF-16BE
> * UTF-16LE
> * UTF-32BE
> * UTF-32LE
>
> So, yes, you can stick Unicode characters directly in D code. Though I wonder
> about the correctness of the spec here. It claims that if there's no BOM, then
> it's ASCII, but unless vim inserts BOM markers into all of my .d files, none of
> them have BOM markers, and yet I can put Unicode in a .d file just fine with vim.
> I should probably study up on BOMs.
>
> In any case, the source is read in whatever encoding it's in. String literals
> then all become UTF-8 in the final object code unless they're marked as
> specifically being another type via the postfix letter or they're inferred as
> being another type (e.g. when you assign a string literal to a dstring).
> Regardless, what's in the final object code is based on the types that the type
> system marks strings as, not what the encoding of the source code was.
>
> So, a lexer shouldn't care about what the encoding of the source is beyond
> what it takes to convert it to a format that it can deal with and potentially
> being written in a way which makes handling a particular encoding more
> efficient. The values of literals and the like are completely unaffected
> regardless.

But if you read a source file which is encoded using UTF-16 you would 
need to re-encode that to store it in the "str" field in your Token struct?

If that's the case, wouldn't it be better to make Token a template to be 
able to store all Unicode encodings without re-encoding? Although I 
don't know if that will complicate the rest of the lexer.

-- 
/Jacob Carlborg
August 01, 2012
Re: Let's stop parser Hell
On Wednesday, August 01, 2012 20:29:45 Jacob Carlborg wrote:
> But if you read a source file which is encoded using UTF-16 you would
> need to re-encode that to store it in the "str" field in your Token struct?

Currently, yes.

> If that's the case, wouldn't it be better to make Token a template to be
> able to store all Unicode encodings without re-encoding? Although I
> don't know if that will complicate the rest of the lexer.

It may very well be a good idea to templatize Token on range type. It would be 
nice not to have to templatize it, but that may be the best route to go. The 
main question is whether str is _always_ a slice (or the result of 
takeExactly) of the original range. I _think_ that it is, but I'd have to make 
sure of that. If it's not and can't be for whatever reason, then that poses a 
problem. If Token _does_ get templatized, then I believe that R will end up 
being the original type in the case of the various string types or a range 
which has slicing, but it'll be the result of takeExactly(range, len) for 
everything else.
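Roughly, the type selection described above could be sketched like this. `LexemeType` and `CharStream` are hypothetical names invented for the example; `hasSlicing`, `takeExactly` and `isSomeString` are the Phobos ones.

```d
import std.algorithm : equal;
import std.range : hasSlicing, takeExactly;
import std.traits : isSomeString;

// Hypothetical helper: the type a lexeme would have for input range R.
template LexemeType(R)
{
    static if (isSomeString!R || hasSlicing!R)
        alias LexemeType = R;                              // a slice of the input
    else
        alias LexemeType = typeof(takeExactly(R.init, 0)); // a counted view
}

// Hypothetical input range over characters that offers no slicing.
struct CharStream
{
    dstring data;
    @property bool empty() const { return data.length == 0; }
    @property dchar front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
}

static assert(is(LexemeType!string == string));
static assert(is(LexemeType!dstring == dstring));
static assert(!is(LexemeType!CharStream == CharStream));

void main()
{
    auto input = CharStream("while (x)"d);

    // Without slicing, the lexeme is the first n elements of the range.
    LexemeType!CharStream lexeme = takeExactly(input, 5);
    assert(equal(lexeme, "while"d));
}
```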

I just made str a string to begin with, since it was simple, and I was still 
working on a lot of the initial design and how I was going to go about things. 
If it makes more sense for it to be templated, then it'll be changed so that
it's templated.

- Jonathan M Davis
August 01, 2012
Re: Let's stop parser Hell
On Wed, Aug 1, 2012 at 8:24 PM, Jacob Carlborg <doob@me.com> wrote:

> Don't know if that makes it clearer.

It does! Particularly this:

> All these encodings can store *every* character in the Unicode standard. What's
> different is how the characters are stored and how many bytes a single
> character takes to store in the string.
(emphasis mine)

I somehow thought that with UTF-8 you were limited to a part of
Unicode, and to another, bigger part with UTF-16.
I equated Unicode with UTF-32.
This is what completely warped my vision. It's good to learn something
new everyday, I guess.
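For anyone else who had the same picture, a small check using a character outside the 16-bit range. The function name `roundTrips` is made up; U+1D11E (the musical G clef) is just a convenient example of such a character.

```d
import std.conv : to;

// Convert UTF-32 -> UTF-8 -> UTF-16 -> UTF-32 and see whether anything is lost.
bool roundTrips(dstring text)
{
    string  utf8  = text.to!string;
    wstring utf16 = utf8.to!wstring;
    return utf16.to!dstring == text;
}

void main()
{
    dstring clef = "\U0001D11E"d;

    // Every encoding can hold it; only the number of code units differs.
    assert(clef.length == 1);            // one UTF-32 code unit
    assert(clef.to!wstring.length == 2); // a surrogate pair in UTF-16
    assert(clef.to!string.length == 4);  // four bytes in UTF-8

    assert(roundTrips(clef));
}
```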

Thanks Jacob!