August 01, 2012
Re: Let's stop parser Hell
On 8/1/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote:
> I somehow thought that with UTF-8 you were limited to a part of
> Unicode, and to another, bigger part with UTF-16.
> I equated Unicode with UTF-32.
> This is what completely warped my vision. It's good to learn something
> new everyday, I guess.

I think many people viewed Unicode this way at first. But there is a
metric ton of cool info out there if you want to get to know more
about Unicode (this may or may not be interesting reading material),
e.g.:

http://www.catch22.net/tuts/introduction-unicode
http://icu-project.org/docs/papers/forms_of_unicode/
http://stackoverflow.com/questions/222386/what-do-i-need-to-know-about-unicode

I used to have more of these links but lost them. There's even a
gigantic book about Unicode (Unicode Demystified).
August 01, 2012
Re: Let's stop parser Hell
On Wed, Aug 1, 2012 at 10:54 PM, Andrej Mitrovic
<andrej.mitrovich@gmail.com> wrote:
> On 8/1/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote:
>> I somehow thought that with UTF-8 you were limited to a part of
>> Unicode, and to another, bigger part with UTF-16.
>> I equated Unicode with UTF-32.
>> This is what completely warped my vision. It's good to learn something
>> new everyday, I guess.
>
> I think many people viewed Unicode this way at first. But there is a
> metric ton of cool info out there if you want to get to know more
> about unicode

I will, but not yet. I've a few books on parsing and compilers to read
before that.
I just read http://www.joelonsoftware.com/articles/Unicode.html,
though, and I'm a bit disappointed that char 7 (\u0007, BEL) does not
make my computer beep. I remember making my computer beep with char 7
back in the '80s, when ASCII was the only thing that existed.
August 01, 2012
Re: Let's stop parser Hell
On 02-Aug-12 01:23, Philippe Sigaud wrote:
> On Wed, Aug 1, 2012 at 10:54 PM, Andrej Mitrovic
> <andrej.mitrovich@gmail.com> wrote:
>> On 8/1/12, Philippe Sigaud <philippe.sigaud@gmail.com> wrote:
>>> I somehow thought that with UTF-8 you were limited to a part of
>>> Unicode, and to another, bigger part with UTF-16.
>>> I equated Unicode with UTF-32.
>>> This is what completely warped my vision. It's good to learn something
>>> new everyday, I guess.
>>
>> I think many people viewed Unicode this way at first. But there is a
>> metric ton of cool info out there if you want to get to know more
>> about unicode
>
> I will, but not yet. I've a few books on parsing and compilers to read
> before that.
> I just read http://www.joelonsoftware.com/articles/Unicode.html,
> though, and I'm a bit disappointed that char 7 (\u007) does not make
> my computer beep. I remember now having my computer beep on char 7
> during the 80s when ASCII was the only thing that existed.
>
Once you have time to learn some Unicode, check out this page:
http://unicode.org/cldr/utility/index.jsp

I've found these tools to be incredibly useful.

-- 
Dmitry Olshansky
August 01, 2012
Re: Let's stop parser Hell
On 8/1/12, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:
> Once you have time to learn some unicode, check out this page:
> http://unicode.org/cldr/utility/index.jsp
>
> I've found these tools to be incredibly useful.

Didn't know about that one, cool! Also might come in handy:
http://people.w3.org/rishida/scripts/uniview/
August 01, 2012
Re: Let's stop parser Hell
On Wednesday, August 01, 2012 22:47:47 Philippe Sigaud wrote:
> I somehow thought that with UTF-8 you were limited to a part of
> Unicode, and to another, bigger part with UTF-16.
> I equated Unicode with UTF-32.
> This is what completely warped my vision. It's good to learn something
> new everyday, I guess.

I guess that would explain why you didn't understand what I was saying. I 
was highly confused as to what was confusing about it, but it 
didn't even occur to me that you had that sort of misunderstanding. You really 
should get a better grip on Unicode if you want to write code that lexes 
or parses it efficiently (though it sounds like you're reading up on a lot 
already right now).

- Jonathan M Davis
August 02, 2012
Re: Let's stop parser Hell
On Thu, Aug 2, 2012 at 1:29 AM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> On Wednesday, August 01, 2012 22:47:47 Philippe Sigaud wrote:
>> I somehow thought that with UTF-8 you were limited to a part of
>> Unicode, and to another, bigger part with UTF-16.
>> I equated Unicode with UTF-32.
>> This is what completely warped my vision. It's good to learn something
>> new everyday, I guess.
>
> I guess that that would explain why you didn't understand what I was saying. I
> was highly confused as to what was confusing about what I was saying, but it
> didn't even occur to me that you had that sort of misunderstanding. You really
> should get a better grip on unicode if you want to be writing code that lexes
> or parses it efficiently (though it sounds like you're reading up on a lot
> already right now).

I knew about the 1/2/4-byte schemes and such. But somehow, for me, a
string meant almost-only-ASCII characters.
Anyway, it all *clicked* into place right afterwards and your answers
are perfectly clear to me now.
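
The point that clicked here can be demonstrated in a few lines of D: every UTF encoding covers the whole of Unicode, they just use different numbers of code units per code point. A minimal sketch (the specific character U+1F34E is my choice for illustration):

```d
import std.utf : toUTF8, toUTF16;

void main()
{
    // U+1F34E lies outside the Basic Multilingual Plane, yet every
    // UTF encoding can represent it -- only the code-unit count differs.
    dstring d = "\U0001F34E";  // 1 code unit in UTF-32
    wstring w = toUTF16(d);    // 2 code units (a surrogate pair) in UTF-16
    string  s = toUTF8(d);     // 4 code units (bytes) in UTF-8

    assert(d.length == 1);
    assert(w.length == 2);
    assert(s.length == 4);
}
```

So UTF-8 is not "a part of Unicode": all three encodings are complete, and `length` counts code units, not characters.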
August 02, 2012
Re: Let's stop parser Hell
On 2012-08-01 22:54, Andrej Mitrovic wrote:

> I think many people viewed Unicode this way at first. But there is a
> metric ton of cool info out there if you want to get to know more
> about unicode (this may or may not be interesting reading material),
> e.g.:
>
> http://www.catch22.net/tuts/introduction-unicode
> http://icu-project.org/docs/papers/forms_of_unicode/
> http://stackoverflow.com/questions/222386/what-do-i-need-to-know-about-unicode
>
> I used to have more of these links but lost them. There's even a
> gigantic book about unicode (Unicode Demystified).
>

This is a good read as well:

http://www.joelonsoftware.com/articles/Unicode.html

-- 
/Jacob Carlborg
August 02, 2012
Re: Let's stop parser Hell
On 2012-08-01 22:47, Philippe Sigaud wrote:

> I somehow thought that with UTF-8 you were limited to a part of
> Unicode, and to another, bigger part with UTF-16.
> I equated Unicode with UTF-32.
> This is what completely warped my vision. It's good to learn something
> new everyday, I guess.
>
> Thanks Jacob!
>

You're welcome :)

-- 
/Jacob Carlborg
August 02, 2012
Re: Let's stop parser Hell
On 2012-08-01 22:10, Jonathan M Davis wrote:

> It may very well be a good idea to templatize Token on range type. It would be
> nice not to have to templatize it, but that may be the best route to go. The
> main question is whether str is _always_ a slice (or the result of
> takeExactly) of the original range. I _think_ that it is, but I'd have to make
> sure of that. If it's not and can't be for whatever reason, then that poses a
> problem. If Token _does_ get templatized, then I believe that R will end up
> being the original type in the case of the various string types or a range
> which has slicing, but it'll be the result of takeExactly(range, len) for
> everything else.

To me a string type would be enough. I don't need support for ranges. 
How about adding a union instead?

enum StringType
{
    utf8,
    utf16,
    utf32
}

struct Token
{
    StringType stringType;

    union
    {
        string strc;
        wstring strw;
        dstring strd;
    }

    @property T str (T = string) ()
    {
        static if (is(T == string))
        {
            assert(stringType == StringType.utf8);
            return strc;
        }
        ...
    }
}

Most use cases would look like this:

string s = token.str;
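
Filling in the branches elided by the `...` above, a complete, compilable version of the union sketch might look like this (the `wstring`/`dstring` branches and the `main` are my completion, not from the original post):

```d
enum StringType
{
    utf8,
    utf16,
    utf32
}

struct Token
{
    StringType stringType;

    union
    {
        string  strc;
        wstring strw;
        dstring strd;
    }

    // Templated accessor: picks the union member matching T and
    // asserts that the stored encoding agrees with the request.
    @property T str(T = string)()
    {
        static if (is(T == string))
        {
            assert(stringType == StringType.utf8);
            return strc;
        }
        else static if (is(T == wstring))
        {
            assert(stringType == StringType.utf16);
            return strw;
        }
        else static if (is(T == dstring))
        {
            assert(stringType == StringType.utf32);
            return strd;
        }
        else
            static assert(0, "unsupported string type");
    }
}

void main()
{
    Token token;
    token.stringType = StringType.utf8;
    token.strc = "foo";

    string s = token.str;   // T defaults to string, as in the common case
    assert(s == "foo");
}
```

The trade-off versus templatizing `Token` itself is that the encoding check moves from compile time to a runtime assertion.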

> I just made str a string to begin with, since it was simple, and I was still
> working on a lot of the initial design and how I was going to go about things.
> If it makes more sense for it to be templated, then it'll be changed so that
> it's templated.

I completely understand that.

-- 
/Jacob Carlborg
August 02, 2012
Re: Let's stop parser Hell
"Jonathan M Davis" , dans le message (digitalmars.D:173942), a écrit :
> It may very well be a good idea to templatize Token on range type. It would be 
> nice not to have to templatize it, but that may be the best route to go. The 
> main question is whether str is _always_ a slice (or the result of 
> takeExactly) of the original range. I _think_ that it is, but I'd have to make 
> sure of that. If it's not and can't be for whatever reason, then that poses a 
> problem.

It can't be if it is a simple input range, like a file read with most 
'lazy' methods. Then you need either to transform the input range into a 
forward range using a range adapter that performs buffering, or to perform 
your own buffering internally. You also have to decide how long the 
token will remain valid (I believe that if you want lexing to be blazing 
fast, you don't want to allocate for each token).

Maybe you want your lexer to work with ranges of strings too, like 
File.byLine or File.byChunk (the latter requires buffering if you split 
in the middle of a token...). But that may wait until a nice API for 
files, streams, etc. is found.

> If Token _does_ get templatized, then I believe that R will end up 
> being the original type in the case of the various string types or a range 
> which has slicing, but it'll be the result of takeExactly(range, len) for 
> everything else.

A range which has slicing doesn't necessarily return its own type when 
opSlice is used, according to hasSlicing. I'm pretty sure parts of 
Phobos don't take that into account. However, the result of 
takeExactly will always be the right type, since it uses opSlice when it 
can, so you can just use that.

Making a generic lexer that works with any forward range of dchar and 
returns a range of tokens without decoding literals seems 
to be a good first step.
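
The takeExactly approach can be sketched with a small hypothetical helper (`takeToken` is my name for illustration, not an existing API): it peels off the next `len` elements of a forward range as a token payload, and because takeExactly uses opSlice when available, the payload has a predictable type whether or not the range itself slices.

```d
import std.range : ElementType, isForwardRange, popFrontN, save, takeExactly;

// Hypothetical helper: returns the next `len` elements of a forward
// range as a token payload, then advances the range past them.
auto takeToken(R)(ref R input, size_t len)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    auto payload = input.save.takeExactly(len);
    input.popFrontN(len);   // consume what we just sliced off
    return payload;
}

void main()
{
    import std.algorithm.comparison : equal;

    dstring src = "int x;"d;
    auto r = src[];              // a forward range of dchar

    auto tok = takeToken(r, 3);
    assert(tok.equal("int"d));   // the payload is a slice of the source
    assert(r.equal(" x;"d));     // the range has advanced past the token
}
```

For a pure input range this helper would not compile (no `save`), which is exactly the buffering problem described above.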

> I just made str a string to begin with, since it was simple, and I was still 
> working on a lot of the initial design and how I was going to go about things. 
> If it makes more sense for it to be templated, then it'll be changed so that
> it's templated.

string may not be where you want to start, because it is a 
specialization for which you need to optimize UTF-8 decoding.

Also, you said in this thread that you only need to consider ASCII 
characters in the lexer because non-ASCII characters only occur in 
non-keyword identifiers. That is not entirely true: EndOfLine defines two 
non-ASCII characters, namely LINE SEPARATOR and PARAGRAPH SEPARATOR. 
 http://dlang.org/lex.html#EndOfLine
 Maybe they should be dropped, since other non-ASCII whitespace is not 
supported. You may want the line count to be consistent with other 
programs. I don't know what text-processing programs usually consider an 
end of line.
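
Concretely, a line-counting predicate that honors the EndOfLine characters listed at http://dlang.org/lex.html#EndOfLine would have to accept the two separators as well (this is a sketch of the check, not code from any existing lexer):

```d
// A lexer checking only '\n' and '\r' would miscount lines in source
// containing LINE SEPARATOR (U+2028) or PARAGRAPH SEPARATOR (U+2029).
bool isDEndOfLine(dchar c)
{
    return c == '\n' || c == '\r' || c == '\u2028' || c == '\u2029';
}

void main()
{
    assert(isDEndOfLine('\n'));
    assert(isDEndOfLine('\u2028'));   // LINE SEPARATOR
    assert(isDEndOfLine('\u2029'));   // PARAGRAPH SEPARATOR
    assert(!isDEndOfLine(' '));
}
```

Most other tools (and most editors) count only '\n' and '\r', which is the consistency concern raised above.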

-- 
Christophe