August 01, 2012
On 8/1/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote:
> I somehow thought that with UTF-8 you were limited to a part of
> Unicode, and to another, bigger part with UTF-16.
> I equated Unicode with UTF-32.
> This is what completely warped my vision. It's good to learn something
> new every day, I guess.

I think many people viewed Unicode this way at first. But there is a metric ton of cool info out there if you want to get to know more about unicode (this may or may not be interesting reading material), e.g.:

http://www.catch22.net/tuts/introduction-unicode
http://icu-project.org/docs/papers/forms_of_unicode/
http://stackoverflow.com/questions/222386/what-do-i-need-to-know-about-unicode

I used to have more of these links but lost them. There's even a gigantic book about unicode (Unicode Demystified).
August 01, 2012
On Wed, Aug 1, 2012 at 10:54 PM, Andrej Mitrovic <andrej.mitrovich@gmail.com> wrote:
> On 8/1/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote:
>> I somehow thought that with UTF-8 you were limited to a part of
>> Unicode, and to another, bigger part with UTF-16.
>> I equated Unicode with UTF-32.
>> This is what completely warped my vision. It's good to learn something
>> new every day, I guess.
>
> I think many people viewed Unicode this way at first. But there is a metric ton of cool info out there if you want to get to know more about unicode

I will, but not yet. I've a few books on parsing and compilers to read
before that.
I just read http://www.joelonsoftware.com/articles/Unicode.html,
though, and I'm a bit disappointed that char 7 (\u0007) does not make
my computer beep. I remember now having my computer beep on char 7
during the 80s, when ASCII was the only thing that existed.
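For anyone who wants to try it: BEL is code point U+0007, writable in D with the classic '\a' escape. A small sketch (whether anything actually beeps depends on the terminal emulator; many modern ones flash instead):

```d
import std.stdio;

void main()
{
    // '\a' is ASCII BEL, the same code point as '\u0007'.
    assert('\a' == '\u0007');
    assert('\a' == 7);
    write('\a'); // the terminal decides whether to beep, flash, or ignore it
}
```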
August 01, 2012
On 02-Aug-12 01:23, Philippe Sigaud wrote:
> On Wed, Aug 1, 2012 at 10:54 PM, Andrej Mitrovic
> <andrej.mitrovich@gmail.com> wrote:
>> On 8/1/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote:
>>> I somehow thought that with UTF-8 you were limited to a part of
>>> Unicode, and to another, bigger part with UTF-16.
>>> I equated Unicode with UTF-32.
>>> This is what completely warped my vision. It's good to learn something
>>> new every day, I guess.
>>
>> I think many people viewed Unicode this way at first. But there is a
>> metric ton of cool info out there if you want to get to know more
>> about unicode
>
> I will, but not yet. I've a few books on parsing and compilers to read
> before that.
> I just read http://www.joelonsoftware.com/articles/Unicode.html,
> though, and I'm a bit disappointed that char 7 (\u0007) does not make
> my computer beep. I remember now having my computer beep on char 7
> during the 80s, when ASCII was the only thing that existed.
>
Once you have time to learn some unicode, check out this page:
http://unicode.org/cldr/utility/index.jsp

I've found these tools to be incredibly useful.

-- 
Dmitry Olshansky
August 01, 2012
On 8/1/12, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
> Once you have time to learn some unicode, check out this page: http://unicode.org/cldr/utility/index.jsp
>
> I've found these tools to be incredibly useful.

Didn't know about that one, cool! Also might come in handy: http://people.w3.org/rishida/scripts/uniview/
August 01, 2012
On Wednesday, August 01, 2012 22:47:47 Philippe Sigaud wrote:
> I somehow thought that with UTF-8 you were limited to a part of
> Unicode, and to another, bigger part with UTF-16.
> I equated Unicode with UTF-32.
> This is what completely warped my vision. It's good to learn something
> new every day, I guess.

I guess that that would explain why you didn't understand what I was saying. I was highly confused as to what was confusing about what I was saying, but it didn't even occur to me that you had that sort of misunderstanding. You really should get a better grip on unicode if you want to be writing code that lexes or parses it efficiently (though it sounds like you're reading up on a lot already right now).

- Jonathan M Davis
August 02, 2012
On Thu, Aug 2, 2012 at 1:29 AM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> On Wednesday, August 01, 2012 22:47:47 Philippe Sigaud wrote:
>> I somehow thought that with UTF-8 you were limited to a part of
>> Unicode, and to another, bigger part with UTF-16.
>> I equated Unicode with UTF-32.
>> This is what completely warped my vision. It's good to learn something
>> new every day, I guess.
>
> I guess that that would explain why you didn't understand what I was saying. I was highly confused as to what was confusing about what I was saying, but it didn't even occur to me that you had that sort of misunderstanding. You really should get a better grip on unicode if you want to be writing code that lexes or parses it efficiently (though it sounds like you're reading up on a lot already right now).

I knew about the 1-, 2-, and 4-byte schemes and such. But, somehow, for
me, string == only-almost-ASCII characters.
Anyway, it all *clicked* into place right afterwards and your answers
are perfectly clear to me now.
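The 1-2-4 byte distinction is easy to see in D itself, since the three string types spell the same code points with different code-unit counts. A small sketch (`std.utf.count` gives the code-point count):

```d
import std.utf : count;

void main()
{
    // U+00E9 'é' takes 2 UTF-8 code units; U+1D11E '𝄞' takes 4 in
    // UTF-8 and a surrogate pair (2 units) in UTF-16.
    string  s8  = "a é 𝄞";
    wstring s16 = "a é 𝄞"w;
    dstring s32 = "a é 𝄞"d;

    assert(s8.length  == 9); // UTF-8 code units
    assert(s16.length == 6); // UTF-16 code units
    assert(s32.length == 5); // UTF-32 code units == code points
    assert(count(s8)  == 5); // all three encode the same 5 code points
}
```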
August 02, 2012
On 2012-08-01 22:54, Andrej Mitrovic wrote:

> I think many people viewed Unicode this way at first. But there is a
> metric ton of cool info out there if you want to get to know more
> about unicode (this may or may not be interesting reading material),
> e.g.:
>
> http://www.catch22.net/tuts/introduction-unicode
> http://icu-project.org/docs/papers/forms_of_unicode/
> http://stackoverflow.com/questions/222386/what-do-i-need-to-know-about-unicode
>
> I used to have more of these links but lost them. There's even a
> gigantic book about unicode (Unicode Demystified).
>

This is a good read as well:

http://www.joelonsoftware.com/articles/Unicode.html

-- 
/Jacob Carlborg
August 02, 2012
On 2012-08-01 22:47, Philippe Sigaud wrote:

> I somehow thought that with UTF-8 you were limited to a part of
> Unicode, and to another, bigger part with UTF-16.
> I equated Unicode with UTF-32.
> This is what completely warped my vision. It's good to learn something
> new everyday, I guess.
>
> Thanks Jacob!
>

You're welcome :)

-- 
/Jacob Carlborg
August 02, 2012
On 2012-08-01 22:10, Jonathan M Davis wrote:

> It may very well be a good idea to templatize Token on range type. It would be
> nice not to have to templatize it, but that may be the best route to go. The
> main question is whether str is _always_ a slice (or the result of
> takeExactly) of the original range. I _think_ that it is, but I'd have to make
> sure of that. If it's not and can't be for whatever reason, then that poses a
> problem. If Token _does_ get templatized, then I believe that R will end up
> being the original type in the case of the various string types or a range
> which has slicing, but it'll be the result of takeExactly(range, len) for
> everything else.

To me a string type would be enough. I don't need support for ranges. How about adding a union instead?

enum StringType
{
    utf8,
    utf16,
    utf32
}

struct Token
{
    StringType stringType;

    union
    {
        string strc;
        wstring strw;
        dstring strd;
    }

    @property T str(T = string)()
    {
        static if (is(T == string))
        {
            assert(stringType == StringType.utf8);
            return strc;
        }
        else static if (is(T == wstring))
        {
            assert(stringType == StringType.utf16);
            return strw;
        }
        else static if (is(T == dstring))
        {
            assert(stringType == StringType.utf32);
            return strd;
        }
        else
            static assert(0, "unsupported string type");
    }
}

Most use cases would look like this:

string s = token.str;

> I just made str a string to begin with, since it was simple, and I was still
> working on a lot of the initial design and how I was going to go about things.
> If it makes more sense for it to be templated, then it'll be changed so that
> it's templated.

I completely understand that.

-- 
/Jacob Carlborg
August 02, 2012
"Jonathan M Davis" , dans le message (digitalmars.D:173942), a écrit :
> It may very well be a good idea to templatize Token on range type. It would be nice not to have to templatize it, but that may be the best route to go. The main question is whether str is _always_ a slice (or the result of takeExactly) of the original range. I _think_ that it is, but I'd have to make sure of that. If it's not and can't be for whatever reason, then that poses a problem.

It can't if it is a simple input range, like a file read with most 'lazy' methods. Then you need either to transform the input range into a forward range using a range adapter that performs buffering, or to perform your own buffering internally. You also have to decide how long the token will remain valid (I believe that if you want lexing to be blazing fast, you don't want to allocate for each token).
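A minimal sketch of the internal-buffering option (all names here are made up for illustration): characters are copied out of the input range as they are consumed, and each finished token costs one allocation, which is exactly the per-token overhead warned about above.

```d
import std.range.primitives; // front/popFront for arrays

struct TokenBuffer(R)
{
    R input;    // any input range of characters
    char[] buf; // scratch space for the token currently being lexed

    // Consume one character from the input into the current token.
    void take()
    {
        buf ~= input.front; // appending a dchar to char[] encodes it as UTF-8
        input.popFront();
    }

    // Finish the current token; the .idup is the per-token allocation.
    string lexeme()
    {
        auto s = buf.idup;
        buf.length = 0;
        return s;
    }
}

void main()
{
    auto b = TokenBuffer!string("import std");
    foreach (_; 0 .. 6)
        b.take();
    assert(b.lexeme == "import");
}
```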

Maybe you want you lexer to work with range of strings too, like File.byLine or File.byChunk (the latter require buffering if you split in the middle of a token...). But that may wait until a nice API for files, streams, etc. is found.

> If Token _does_ get templatized, then I believe that R will end up being the original type in the case of the various string types or a range which has slicing, but it'll be the result of takeExactly(range, len) for everything else.

A range which has slicing doesn't necessarily return its own type when opSlice is used, according to hasSlicing. I'm pretty sure parts of Phobos don't take that into account. However, the result of takeExactly will always be the right type, since it uses opSlice when it can, so you can just use that.

Making a generic lexer that works with any forward range of dchar and returns a range of tokens without decoding literals seems to be a good first step.

> I just made str a string to begin with, since it was simple, and I was still working on a lot of the initial design and how I was going to go about things. If it makes more sense for it to be templated, then it'll be changed so that it's templated.

string may not be where you want to start, because it is a specialization for which you need to optimize UTF-8 decoding.

Also, you said in this thread that you only need to consider ASCII
characters in the lexer because non-ASCII characters are only used in
non-keyword identifiers. That is not entirely true: EndOfLine defines 2
non-ASCII characters, namely LINE SEPARATOR and PARAGRAPH SEPARATOR.
  http://dlang.org/lex.html#EndOfLine
  Maybe they should be dropped, since other non-ASCII whitespace is not
supported. You may want the line count to be consistent with other
programs. I don't know what text-processing programs usually consider an
end of line.
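If the two separators are kept, the end-of-line test has to look beyond ASCII. A hypothetical helper matching the grammar page (ignoring the end-of-file cases):

```d
// Per http://dlang.org/lex.html#EndOfLine, a line can end in '\r', '\n',
// "\r\n", U+2028 LINE SEPARATOR, or U+2029 PARAGRAPH SEPARATOR -- the
// last two are the non-ASCII cases discussed above.
bool isEndOfLine(dchar c)
{
    return c == '\n' || c == '\r' || c == '\u2028' || c == '\u2029';
}

void main()
{
    assert(isEndOfLine('\u2028'));
    assert(isEndOfLine('\u2029'));
    assert(isEndOfLine('\n'));
    assert(!isEndOfLine('\t'));
}
```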

-- 
Christophe