August 01, 2012
On 2012-08-01 14:44, Philippe Sigaud wrote:

> Every time I think I understand D strings, you prove me wrong. So, I
> *still* don't get how that works:
>
> say I have
>
> auto s = " - some greek or chinese chars, mathematical symbols, whatever - "d;
>
> Then, the "..." part is lexed as a string literal. How can the string
> field in the Token magically contain UTF-32 characters? Or, are they
> automatically cut in four nonsense chars each? What about comments
> containing non-ASCII chars? How can code coming after the lexer make
> sense of it?
>
> As Jacob says, many people code in English. That's right, but
>
> 1- they most probably use their own language for internal documentation
> 2- any i18n part of a code base will have non-ASCII chars
> 3- D is supposed to accept UTF-16 and UTF-32 source code.
>
> So, wouldn't it make sense to at least provide an option on the lexer
> to specifically store identifier lexemes and comments as a dstring?

I'm not quite sure how it works either. But I'm thinking like this:

The string representing what's in the source code can be either UTF-8 or the encoding of the file. I'm not sure whether the lexer needs to re-encode the string when it's not in the same encoding as the file.

Then there's another field/function that returns the processed token value, i.e. for an integer literal token it will return an actual int. This function would return different string types depending on the type of the string literal the token represents.
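Something like this rough sketch is what I'm imagining (the names and the enum here are made up for illustration, not taken from the actual lexer):

import std.conv : to;

enum TokenType { identifier, intLiteral, stringLiteral }

struct Token
{
    TokenType type;
    string    str;  // the raw lexeme text as it appeared in the source

    // The processed value: an int for an integer literal, a dstring for
    // a "..."d literal, and so on. to!T parses or re-encodes as needed.
    T value(T)()
    {
        return to!T(str);
    }
}

unittest
{
    auto tok = Token(TokenType.intLiteral, "42");
    assert(tok.value!int() == 42);
    assert(Token(TokenType.stringLiteral, "foo").value!dstring() == "foo"d);
}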

-- 
/Jacob Carlborg
August 01, 2012
On Wednesday, August 01, 2012 13:30:31 Jacob Carlborg wrote:
> On 2012-07-31 23:20, Jonathan M Davis wrote:
> > I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see). Writing it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate.
> 
> BTW, do you have the code online somewhere?

No, because I'm still in the middle of working on it.

- Jonathan M Davis
August 01, 2012
On 01.08.2012 1:20, Jonathan M Davis wrote:
> On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
>> Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
>> for D.
>
> I'm actually quite far along with one now - one which is specifically written
> and optimized for lexing D. I'll probably be done with it not too long after
> the 2.060 release (though we'll see). Writing it has been going surprisingly
> quickly actually, and I've already found some bugs in the spec as a result
> (some of which have been fixed, some of which I still need to create pull
> requests for). So, regardless of what happens with my lexer, at least the spec
> will be more accurate.
>
> - Jonathan M Davis
>

Good. I'll wait for the announcement.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij
August 01, 2012
On Wednesday, August 01, 2012 14:44:29 Philippe Sigaud wrote:
> Every time I think I understand D strings, you prove me wrong. So, I *still* don't get how that works:
> 
> say I have
> 
> auto s = " - some greek or chinese chars, mathematical symbols, whatever - "d;
> 
> Then, the "..." part is lexed as a string literal. How can the string field in the Token magically contain UTF-32 characters?

It contains Unicode. The lexer is lexing whatever encoding the source is in, which has _nothing_ to do with the d on the end. It could be UTF-8, UTF-16, or UTF-32. If we supported other encodings in ranges, it could be one of those as well. Which of those it is is irrelevant. As far as the value of the literal goes, these two strings are identical:

"ウェブサイト"
"\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"

The encoding of the source file is irrelevant. By tacking a d on the end

"ウェブサイト"d "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"d

you're just telling the compiler that you want the value that it generates to be in UTF-32. The source code could be in any of the supported encodings, and the string could be held in any encoding until the object code is actually generated.
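To make that concrete, all of the following holds no matter which of the supported encodings the source file itself uses:

string  s8  = "ウェブサイト";    // ends up as UTF-8 in the object code
wstring s16 = "ウェブサイト"w;   // UTF-16
dstring s32 = "ウェブサイト"d;   // UTF-32

assert(s8  == "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8");
assert(s16 == "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"w);
assert(s32 == "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"d);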

> So, wouldn't it make sense to at least provide an option on the lexer to specifically store identifier lexemes and comments as a dstring?

You mean make it so that Token is

struct Token(R)
{
    TokenType    type;
    R            str;
    LiteralValue value;
    SourcePos    pos;
}

instead of

struct Token
{
    TokenType    type;
    string       str;
    LiteralValue value;
    SourcePos    pos;
}

or do you mean something else? I may do something like that, but I would point out that if R doesn't have slicing, then that doesn't work. So, str can't always be the same type as the original range. For ranges with no slicing, it would have to be something else (probably either string or typeof(takeExactly(range))).

However, making str R _does_ come at the cost of complicating code using the lexer: instead of just using Token, you have to worry about whether it's a Token!string, a Token!dstring, etc., and whether it's worth that complication is debatable. By far the most common use case is to lex string, and if str is string and R is not, then you incur the penalty of converting R to string. So, the common use case is fast, the uncommon use case still works but is slower, and the user of the lexer doesn't have to care what the original range type was.
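To illustrate that complication with a stubbed-down example (this is not the lexer's actual Token, just the shape of the problem):

enum TokenType { identifier }

struct Token(R)
{
    TokenType type;
    R         str;
}

void main()
{
    Token!string  t8;
    Token!dstring t32;

    // Unrelated types, so any code that consumes tokens has to be
    // templated (or duplicated) rather than just taking a plain Token.
    static assert(!is(typeof(t8) == typeof(t32)));
}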

It could go either way. I used string on first pass, but as I said, I could change it to R later if that makes more sense. I'm not particularly hung up on that little detail at this point, and that's probably one of the things that can be changed reasonably easily later.

- Jonathan M Davis
August 01, 2012
On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:

> "ウェブサイト"
> "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"
>
> The encoding of the source file is irrelevant.

do you mean I can do:

string field = "ウェブサイト";

?

Geez, just tested it, it works. Even writeln(field) correctly outputs
the Japanese chars and dmd doesn't choke on it.
Bang, back to state 0: I don't get how D strings work.



> You mean make it so that Token is
>
> struct Token(R)
> {
>     TokenType    type;
>     R       str;
>     LiteralValue value;
>     SourcePos    pos;
> }
>
> instead of
>
> struct Token
> {
>     TokenType    type;
>     string       str;
>     LiteralValue value;
>     SourcePos    pos;
> }
>
> or do you mean something else?

Right, this.
August 01, 2012
On 2012-08-01 19:50, Philippe Sigaud wrote:
> On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
>
>> "ウェブサイト"
>> "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"
>>
>> The encoding of the source file is irrelevant.
>
> do you mean I can do:
>
> string field = "ウェブサイト";
>
> ?
>
> Geez, just tested it, it works. Even writeln(field) correctly outputs
> the Japanese chars and dmd doesn't choke on it.
> Bang, back to state 0: I don't get how D strings work.

Unicode supports three encodings: UTF-8, UTF-16 and UTF-32. All these encodings can store every character in the Unicode standard. What's different is how the characters are stored and how many bytes a single character takes to store in the string. For example:

string str = "ö";

The above character will take up two bytes in the string. On the other hand, this won't work:

char c = 'ö';

The reason for that is that the above character needs two bytes to be stored, but "char" can only store one byte. Therefore you need to store the character in a type where it fits, i.e. "wchar" or "dchar". Or you can use a string, which can store as many bytes as needed.
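For example (the lengths below are code units, which is what .length counts for the string types):

string s = "ö";
assert(s.length == 2);      // two UTF-8 code units for the one character

wstring w = "ö"w;
assert(w.length == 1);      // a single UTF-16 code unit

dchar dc = 'ö';             // a dchar always holds a full code point
assert(dc == '\u00F6');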

Don't know if that makes it clearer.

-- 
/Jacob Carlborg
August 01, 2012
On Wednesday, August 01, 2012 19:50:10 Philippe Sigaud wrote:
> On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg@gmx.com>
wrote:
> > "ウェブサイト"
> > "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"
> > 
> > The encoding of the source file is irrelevant.
> 
> do you mean I can do:
> 
> string field = "ウェブサイト";
> 
> ?
> 
> Geez, just tested it, it works. Even writeln(field) correctly outputs
> the Japanese chars and dmd doesn't choke on it.
> Bang, back to state 0: I don't get how D strings work.

From http://dlang.org/lex.html

D source text can be in one of the following formats:
* ASCII
* UTF-8
* UTF-16BE
* UTF-16LE
* UTF-32BE
* UTF-32LE

So, yes, you can stick Unicode characters directly in D code. Though I wonder about the correctness of the spec here. It claims that if there's no BOM, then it's ASCII, but unless vim inserts BOM markers into all of my .d files, none of them have BOM markers, but I can put Unicode in a .d file just fine with vim. I should probably study up on BOMs.

In any case, the source is read in whatever encoding it's in. String literals then all become UTF-8 in the final object code unless they're marked as specifically being another type via the postfix letter or they're inferred as being another type (e.g. when you assign a string literal to a dstring). Regardless, what's in the final object code is based on the types that the type system marks strings as, not what the encoding of the source code was.
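For instance, with the postfixes and with inference from the target type:

auto    a = "abc";      // string (UTF-8) by default
auto    w = "abc"w;     // wstring (UTF-16) via the postfix
auto    d = "abc"d;     // dstring (UTF-32) via the postfix
dstring e = "abc";      // inferred as UTF-32 from the type it's assigned to

static assert(is(typeof(a) == string));
static assert(is(typeof(e) == dstring));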

So, a lexer shouldn't care about what the encoding of the source is beyond what it takes to convert it to a format that it can deal with and potentially being written in a way which makes handling a particular encoding more efficient. The values of literals and the like are completely unaffected regardless.

- Jonathan M Davis
August 01, 2012
On 2012-08-01 20:24, Jonathan M Davis wrote:

> D source text can be in one of the following formats:
> * ASCII
> * UTF-8
> * UTF-16BE
> * UTF-16LE
> * UTF-32BE
> * UTF-32LE
>
> So, yes, you can stick unicode characters directly in D code. Though I wonder
> about the correctness of the spec here. It claims that if there's no BOM, then
> it's ASCII, but unless vim inserts BOM markers into all of my .d files, none of
> them have BOM markers, but I can put Unicode in a .d file just fine with vim. I
> should probably study up on BOMs.
>
> In any case, the source is read in whatever encoding it's in. String literals
> then all become UTF-8 in the final object code unless they're marked as
> specifically being another type via the postfix letter or they're inferred as
> being another type (e.g. when you assign a string literal to a dstring).
> Regardless, what's in the final object code is based on the types that the type
> system marks strings as, not what the encoding of the source code was.
>
> So, a lexer shouldn't care about what the encoding of the source is beyond
> what it takes to convert it to a format that it can deal with and potentially
> being written in a way which makes handling a particular encoding more
> efficient. The values of literals and the like are completely unaffected
> regardless.

But if you read a source file which is encoded using UTF-16, you would need to re-encode it to store it in the "str" field in your Token struct?

If that's the case, wouldn't it be better to make Token a template to be able to store all Unicode encodings without re-encoding? Although I don't know if that will complicate the rest of the lexer.

-- 
/Jacob Carlborg
August 01, 2012
On Wednesday, August 01, 2012 20:29:45 Jacob Carlborg wrote:
> But if you read a source file which is encoded using UTF-16 you would
> need to re-encode that to store it in the "str" field in your Token struct?

Currently, yes.

> If that's the case, wouldn't it be better to make Token a template to be able to store all Unicode encodings without re-encoding? Although I don't know if that will complicate the rest of the lexer.

It may very well be a good idea to templatize Token on range type. It would be nice not to have to templatize it, but that may be the best route to go. The main question is whether str is _always_ a slice (or the result of takeExactly) of the original range. I _think_ that it is, but I'd have to make sure of that. If it's not and can't be for whatever reason, then that poses a problem. If Token _does_ get templatized, then I believe that R will end up being the original type in the case of the various string types or a range which has slicing, but it'll be the result of takeExactly(range, len) for everything else.
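Roughly, the templated version would look something like this sketch (the surrounding types are just stubs here, not the real ones):

import std.range  : hasSlicing, isForwardRange, takeExactly;
import std.traits : isSomeString;

enum TokenType { identifier /* ... */ }
struct LiteralValue { /* ... */ }
struct SourcePos { size_t line, col; }

struct Token(R)
    if (isForwardRange!R)
{
    // A slice of the original range when slicing is available (the string
    // types qualify), otherwise the result of takeExactly.
    static if (isSomeString!R || hasSlicing!R)
        alias StrType = R;
    else
        alias StrType = typeof(takeExactly(R.init, 0));

    TokenType    type;
    StrType      str;
    LiteralValue value;
    SourcePos    pos;
}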

I just made str a string to begin with, since it was simple, and I was still working on a lot of the initial design and how I was going to go about things. If it makes more sense for it to be templated, then it'll be changed so that it's templated.

- Jonathan M Davis
August 01, 2012
On Wed, Aug 1, 2012 at 8:24 PM, Jacob Carlborg <doob@me.com> wrote:

> Don't know if that makes it clearer.

It does! Particularly this:

> All these encodings can store *every* character in the Unicode standard. What's different is how the characters are stored and how many bytes a single character takes to store in the string.
(emphasis mine)

I somehow thought that with UTF-8 you were limited to a part of
Unicode, and to another, bigger part with UTF-16.
I equated Unicode with UTF-32.
This is what completely warped my vision. It's good to learn something
new every day, I guess.

Thanks Jacob!