March 08, 2012
On 2012-03-08 08:21, Zach the Mystic wrote:
> On Thursday, 8 March 2012 at 04:56:07 UTC, Jonathan M Davis wrote:
>> If you took it from ddmd, then it's definitely going to have to be GPL.
>>
>> Now, there is interest in having a D parser and lexer in Phobos. I don't
>> know if your version will fit the bill (e.g. it must have a range-based
>> API), but we need one at some point. The original idea was to more or
>> less directly port dmd's lexer and parser with some adjustments to the
>> API as necessary (primarily to make it range-based). But no one has had
>> the time to complete such a project yet (I originally volunteered to do
>> it, but I just haven't had the time).
>>
>> When that project was proposed, Walter agreed to let that port be Boost
>> rather than GPL (since he holds the copyright and the port would be
>> going in Phobos, which uses Boost).
>>
>> The problem with what you have (even if the API and implementation were
>> perfect) is that it comes from ddmd, which had other contributors
>> working on it. So, you would have to get permission from not only Walter
>> but all of the relevant ddmd contributors. If you were able to do
>> _that_, and it could get past the review process, then what you've done
>> could be put into Phobos. But that requires that you take the time and
>> effort to take care of getting the appropriate permissions, making sure
>> that the API and implementation are acceptable for Phobos, and putting
>> it through the Phobos review process. It would be great if you could do
>> that though.
>>
>> - Jonathan M Davis
>
> This is great news. I was really worried that the license was etched in
> stone. I'll need help finding out who owns the code, plus legal advice
> if the process is more than just getting a simple confirmation email
> from each of the original authors.

You have my blessing for the small changes I contributed.


-- 
/Jacob Carlborg
March 08, 2012
On 2012-03-08 09:20, Jonathan M Davis wrote:
> On Thursday, March 08, 2012 09:11:03 Jacob Carlborg wrote:
>> On 2012-03-08 05:54, Jonathan M Davis wrote:
>>> On Thursday, March 08, 2012 03:12:48 Zach the Mystic wrote:
>>>> On Thursday, 8 March 2012 at 01:43:26 UTC, Daniel Murphy wrote:
>>>>> "Zach the Mystic"<reachMINUSTHISzachgmail@dot.com>   wrote in
>>>>> message
>>>>> news:afqmbmvuvizvgfooefqj@forum.dlang.org...
>>>>>
>>>>>> I'll gladly put a license on it if the leaders of the
>>>>>> community tell me which one to use ( Artistic, libpng, Boost ).
>>>>>>
>>>>>> Zach
>>>>>
>>>>> It will need to be the same license as the frontend (GPL/Artistic).
>>>>> It should be at the top of each C++ source file.
>>>>
>>>> It looks like the license is going to have to be GPL because it
>>>> says so strictly in dmd's readme.txt. Somehow that license scares
>>>> me, though. The "Free Software Foundation" seems like a very
>>>> Orwellian institution to me. I hope it doesn't scare users away.
>>>
>>> If you took it from ddmd, then it's definitely going to have to be GPL.
>>>
>>> Now, there is interest in having a D parser and lexer in Phobos. I don't
>>> know if your version will fit the bill (e.g. it must have a range-based
>>> API), but we need one at some point. The original idea was to more or
>>> less directly port dmd's lexer and parser with some adjustments to the
>>> API as necessary (primarily to make it range-based). But no one has had
>>> the time to complete such a project yet (I originally volunteered to do
>>> it, but I just haven't had the time).
>>>
>>> When that project was proposed, Walter agreed to let that port be Boost
>>> rather than GPL (since he holds the copyright and the port would be going
>>> in Phobos, which uses boost).
>>>
>>> The problem with what you have (even if the API and implementation were
>>> perfect) is that it comes from ddmd, which had other contributors working
>>> on it. So, you would have to get permission from not only Walter but all
>>> of the relevant ddmd contributors. If you were able to do _that_, and it
>>> could get past the review process, then what you've done could be put
>>> into Phobos. But that requires that you take the time and effort to take
>>> care of getting the appropriate permissions, making sure that the API and
>>> implementation are acceptable for Phobos, and putting it through the
>>> Phobos review process. It would be great if you could do that though.
>>>
>>> - Jonathan M Davis
>>
>> It would be nice if the frontend written in D (whichever it will be)
>> could be used by DMD. Then there wouldn't be any problems of being out
>> of sync.
>
> Well, having it be easier to keep in sync is one of the reasons that it was
> originally proposed to simply port the lexer in dmd's frontend to D. But
> having it _be_ the frontend of dmd would be even better. I don't know how
> interested Walter is or isn't in that, but having it be a port of the current
> frontend would likely make convincing him easier than it would be if it were
> one written from scratch.
>
> - Jonathan M Davis

Exactly, that's what I'm thinking as well. Maybe it can then gradually be modified to better support an API useful for things other than just the compiler. Or a higher-level API could be built on top of it that the compiler need not use.

-- 
/Jacob Carlborg
March 08, 2012
On 08.03.2012 22:46, Jonathan M Davis wrote:
> On Thursday, March 08, 2012 22:03:12 Dmitry Olshansky wrote:
>> On 08.03.2012 11:48, Jonathan M Davis wrote:
>>> A range is not necessarily a dynamic array, though a dynamic array is a
>>> range. The lexer is going to need to take a range of dchar (which may or
>>> may not be an array), and it's probably going to need to return a range
>>> of tokens. The parser would then take a range of tokens and then output
>>> the AST in some form or other - it probably couldn't be a range, but I'm
>>> not sure. And while the lexer would need to operate on generic ranges of
>>> dchar, it would probably have to be special-cased for strings in a number
>>> of places in order to make it faster (e.g. checking the first char in a
>>> string rather than using front when it's known that the value being
>>> checked against is an ASCII character and will therefore fit in a single
>>> char - front has to decode the next character, which is less efficient).
>>
>> Simply put, the decision on decoding should belong to the lexer. Thus
>> strings should be wrapped as input ranges of char, wchar & dchar
>> respectively.
>
> ??? The normal way to handle this is to simply special-case certain
> operations. e.g.
>
> static if (is(Unqual!(ElementEncodingType!R) == char))
> { ... }

Is ElementEncodingType aware of anything other than string/wstring?

> else
> { ... }
>
> I'm not sure that wrapping char and wchar arrays in structs that treat them as
> ranges of char or wchar is a good idea. At minimum, I'm not aware of anything
> in Phobos currently working that way (unless you did something like that in
> std.regex?).

Well, it's not that I'm insisting on _wrapping_ the arrays or some such in a general sense. And yes, I had some hard experience in std.regex with UTF decoding and a range design that doesn't exactly fit.

What I'm truly against is going overly generic and automatically stomping on your performance. That being said, the general design of string ranges carries unnecessary pessimization already: its popFront does a UTF length lookup, and an identical operation is performed when decoding the first code point (.front).
At any rate, building a lexer on top of ranges in the current setting means using an abstraction that _hides_ decoding inside. That's a bad idea; it's losing the fight from the start. Why? Because the range can't know whether decoding is needed in this particular state of the lexer or not; it is generality that kills this by definition.

Yeah, in general auto-magically decoding ranges are OK, as long as the work you do has a cost much higher than decoding (~10 times would be fine), or is something you strictly can't do otherwise. A lexer doesn't meet either criterion.

> Everything either treats them as generic ranges of dchar or
> special cases them. And when you want to be most efficient with string
> processing, I would think that you'd want to treat them exactly as the arrays
> of code units that they are rather than ranges - in which case treating them
> as generic ranges of dchar in most places and then special casing certain
> sections of code which can take advantage of the fact that they're arrays of
> code units seems like the way to go.


Yeah, no arguing that. The thing is that the lexer as a whole is precisely one of these special cases. It's good as long as it's fast, and that requires more control than "a generic input range of dchar".

Now, speaking outside of this specific problem.
Basically I would propose formalizing the kind of range that the current string/wstring is. And that is a VariableLengthEncoding range (VLE range), a two-in-one: a random-access code-unit range and a bidirectional 'code point' range. I've heard of attempts at this concept before, but now, with a use case at hand, it could become more important.
The problem is, I think, that the current InputRange is insufficient, as it requires calculating the length of the first element twice: once in front and again in popFront.
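
To illustrate (a sketch; frontOf/popFrontOf are names of mine that mirror what front/popFront do for a string today):

import std.utf : decode, stride;

dchar frontOf(string s)
{
    size_t i = 0;
    return decode(s, i); // decode computes the sequence length once...
}

void popFrontOf(ref string s)
{
    s = s[stride(s, 0) .. $]; // ...and stride computes it all over again
}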



-- 
Dmitry Olshansky
March 08, 2012
On Friday, March 09, 2012 00:54:48 Dmitry Olshansky wrote:
> On 08.03.2012 22:46, Jonathan M Davis wrote:
> > On Thursday, March 08, 2012 22:03:12 Dmitry Olshansky wrote:
> >> On 08.03.2012 11:48, Jonathan M Davis wrote:
> >>> A range is not necessarily a dynamic array, though a dynamic array is a
> >>> range. The lexer is going to need to take a range of dchar (which may or
> >>> may not be an array), and it's probably going to need to return a range
> >>> of tokens. The parser would then take a range of tokens and then output
> >>> the AST in some form or other - it probably couldn't be a range, but I'm
> >>> not sure. And while the lexer would need to operate on generic ranges of
> >>> dchar, it would probably have to be special-cased for strings in a
> >>> number
> >>> of places in order to make it faster (e.g. checking the first char in a
> >>> string rather than using front when it's known that the value being
> >>> checked against is an ASCII character and will therefore fit in a single
> >>> char - front has to decode the next character, which is less efficient).
> >> 
> >> Simply put, the decision on decoding should belong to the lexer. Thus strings should be wrapped as input ranges of char, wchar & dchar respectively.
> > 
> > ??? The normal way to handle this is to simply special-case certain operations. e.g.
> > 
> > static if (is(Unqual!(ElementEncodingType!R) == char))
> > { ... }
> 
> Is ElementEncodingType aware of anything other than string/wstring?

No. It uses isNarrowString. So, what you'd end up doing is having the lexer accept a generic range of dchar and then have specializations where appropriate for narrow strings. Nothing in Phobos' string handling really supports the idea of dealing with generic char or wchar ranges. It all processes ranges of dchar and specializes on narrow strings where appropriate.
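
For illustration, a minimal sketch of that pattern (skipWhitespace is a made-up helper, not Phobos code):

import std.range;
import std.traits : isNarrowString;

void skipWhitespace(R)(ref R input)
    if (isInputRange!R)
{
    static if (isNarrowString!R)
    {
        // ASCII whitespace never occurs inside a multi-byte UTF-8/UTF-16
        // sequence, so raw code units can be tested without decoding.
        size_t i = 0;
        while (i < input.length &&
               (input[i] == ' ' || input[i] == '\t' ||
                input[i] == '\r' || input[i] == '\n'))
            ++i;
        input = input[i .. $]; // slicing, no decoding
    }
    else
    {
        while (!input.empty &&
               (input.front == ' ' || input.front == '\t' ||
                input.front == '\r' || input.front == '\n'))
            input.popFront();
    }
}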

But is there really a use case for a generic range of char or wchar? I don't know. In general, I really don't think that there is. When dealing with ranges of characters, they're essentially always either strings or strings which have been wrapped in other ranges (generally by calling some other range-based function such as take or filter). And those range-based functions pretty much inevitably need to treat the strings as ranges of dchar to do what they do (potentially with specific optimizations for strings). They aren't designed to operate on ranges of char or wchar, and the only reason to make them do so would be if there were a use case where you needed a range of char or wchar which was not an array. But they're all either arrays or wrapped arrays. So, no such use case currently exists with Phobos, and I really question the usefulness of trying to optimize on a generic range of char or wchar - especially when many of the optimizations for arrays involve random access ranges, and if you have a random access range of char or wchar, I _really_ question that it would ever be anything other than an array.

So, I'd advise against trying to operate on ranges of char or wchar and just stick to operating on ranges of dchar with optimizations for narrow strings where appropriate.

> Now, speaking outside of this specific problem.
> Basically I would propose formalizing the kind of range that the current
> string/wstring is. And that is a VariableLengthEncoding range (VLE range),
> a two-in-one: a random-access code-unit range and a bidirectional 'code
> point' range. I've heard of attempts at this concept before, but now,
> with a use case at hand, it could become more important.

There has been some talk of VLE ranges but really only with regards to the fact that strings are a case of that. Nothing has been done to generalize it. It may be something that should be looked into, but until we've formalized that, I don't think that something like the lexer should be built around the idea. It would be safer to stick with what we've been doing - operating on ranges of dchar and special-casing for narrow strings where appropriate. If need be, it should be possible to adjust it to use VLEs later.

> The problem is, I think, that the current InputRange is insufficient, as it requires calculating the length of the first element twice: once in front and again in popFront.

Perhaps an optional function should be added to input ranges which both returns front and pops it. But VLE ranges such as narrow strings are probably the only ones which would really care, and we can already special case for strings, so it would probably only be of real value for general VLEs. However, since VLEs would arguably be a new range type, they could just implement the new function, and anything which special-cases VLEs can take advantage of it, while they still work just fine as input ranges. We could even add a free function which both pops and returns front which uses front and popFront for most ranges and the new function for VLEs, so you can code with that free function in cases where you intend to both take front and pop it off without caring about what type of range you're dealing with and without forcing ranges in general to implement an extra function.
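
A rough sketch of what that free function might look like (consumeFront is a hypothetical name):

import std.range;
import std.traits : isNarrowString;
import std.utf : decode;

ElementType!R consumeFront(R)(ref R range)
    if (isInputRange!R)
{
    static if (isNarrowString!R)
    {
        size_t i = 0;
        immutable dchar c = decode(range, i); // one decode yields value and length
        range = range[i .. $];
        return c;
    }
    else
    {
        auto e = range.front;
        range.popFront();
        return e;
    }
}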

There's definitely more work and designing to be done here, but there are definitely possibilities.

Still, in the interim, I'd advise simply special casing on narrow strings in the lexer rather than trying to use VLEs initially.

- Jonathan M Davis
March 09, 2012
On Thursday, 8 March 2012 at 19:36:32 UTC, Jacob Carlborg wrote:
>> This is great news. I was really worried that the license was etched in
>> stone. I'll need help finding out who owns the code, plus legal advice
>> if the process is more than just getting a simple confirmation email
>> from each of the original authors.
>
> You have my blessing for the small changes I contributed.

Great!
March 09, 2012
On Thursday, 8 March 2012 at 07:49:57 UTC, Jonathan M Davis wrote:
> Regardless, you need to familiarize yourself with ranges if you want to get
> the lexer and parser ready for inclusion in Phobos.

I have to admit that I don't currently feel competent to do this work. I'm too green. But I do think that the program is already very close to what is needed. I also would assume that those experienced with dmd's C++ incarnation will find themselves on rather familiar soil if they set to work on the D version, so that might speed things up a little if they also know about ranges.

March 09, 2012
On 09.03.2012 1:12, Jonathan M Davis wrote:
> On Friday, March 09, 2012 00:54:48 Dmitry Olshansky wrote:
>> On 08.03.2012 22:46, Jonathan M Davis wrote:
>>> On Thursday, March 08, 2012 22:03:12 Dmitry Olshansky wrote:
>>>> On 08.03.2012 11:48, Jonathan M Davis wrote:
>>>>> A range is not necessarily a dynamic array, though a dynamic array is a
>>>>> range. The lexer is going to need to take a range of dchar (which may or
>>>>> may not be an array), and it's probably going to need to return a range
>>>>> of tokens. The parser would then take a range of tokens and then output
>>>>> the AST in some form or other - it probably couldn't be a range, but I'm
>>>>> not sure. And while the lexer would need to operate on generic ranges of
>>>>> dchar, it would probably have to be special-cased for strings in a
>>>>> number
>>>>> of places in order to make it faster (e.g. checking the first char in a
>>>>> string rather than using front when it's known that the value being
>>>>> checked against is an ASCII character and will therefore fit in a single
>>>>> char - front has to decode the next character, which is less efficient).
>>>>
>>>> Simply put, the decision on decoding should belong to the lexer. Thus
>>>> strings should be wrapped as input ranges of char, wchar & dchar
>>>> respectively.
>>>
>>> ??? The normal way to handle this is to simply special-case certain
>>> operations. e.g.
>>>
>>> static if (is(Unqual!(ElementEncodingType!R) == char))
>>> { ... }
>>
>> Is ElementEncodingType aware of anything other than string/wstring?
>
> No. It uses isNarrowString. So, what you'd end up doing is having the lexer
> accept a generic range of dchar and then have specializations where
> appropriate for narrow strings. Nothing in Phobos' string handling really
> supports the idea of dealing with generic char or wchar ranges. It all
> processes ranges of dchar and specializes on narrow strings where appropriate.
>

The concept that *only* strings need special-casing is broken. I recall somebody already stumbled on this: any range that returns char (a wrapper range over, say, a memory-mapped file) bypasses all the intricate special-casing that does decoding in std.algorithm and friends. So, to put it simply, there is no way to tap into this *decoding by default* infrastructure.

> But is there really a use case for a generic range of char or wchar? I don't
> know. In general, I really don't think that there is.

A memory-mapped file wrapper is one, just off the top of my head. There could be plenty of them; one needs only to look. "I don't know" is not a solid argument, sorry.
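
Something along these lines, say (a sketch, untested):

import std.mmfile;

// An input range of char over a memory-mapped file. It is not an
// array, so isNarrowString-based special cases never see it.
struct MmCharRange
{
    private MmFile file;
    private ulong pos;

    this(string path) { file = new MmFile(path); }

    @property bool empty() { return pos >= file.length; }
    @property char front() { return cast(char) file[pos]; }
    void popFront() { ++pos; }
}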

> When dealing with ranges
> of characters, they're essentially always either strings or strings which have
> been wrapped in other ranges (generally by calling some other range-based
> function such as take or filter). And those range-based functions pretty much
> inevitably need to treat the strings as ranges of dchar to do what they do
> (potentially with specific optimizations for strings). They aren't designed to
> operate on ranges of char or wchar, and the only reason to make them do so
> would be if there were a use case where you needed a range of char or wchar
> which was not an array. But they're all either arrays or wrapped arrays. So,
> no such use case currently exists with Phobos, and I really question the
> usefulness of trying to optimize on a generic range of char or wchar -
> especially when many of the optimizations for arrays involve random access
> ranges, and if you have a random access range of char or wchar, I _really_
> question that it would ever be anything other than an array.
>

And that is the problem: if you fail to see why we need to stop pretending that all char/wchar ranges are arrays, then I have failed somewhere. In the same vein one could argue that there is no random-access range but the array, yet there are others. For instance, if we explore containers, there could be tries, deques and whatnot of char/wchar with their respective ranges.

> So, I'd advise against trying to operate on ranges of char or wchar and just
> stick to operating on ranges of dchar with optimizations for narrow strings
> where appropriate.

Probably the fact that in a lexer it's not "some place for optimizations", it's the whole thing, was missed. That's why I'm looking for a more or less generic yet efficient way.

>
>> Now, speaking outside of this specific problem.
>> Basically I would propose formalizing the kind of range that the current
>> string/wstring is. And that is a VariableLengthEncoding range (VLE
>> range), a two-in-one: a random-access code-unit range and a bidirectional
>> 'code point' range. I've heard of attempts at this concept before, but
>> now, with a use case at hand, it could become more important.
>
> There has been some talk of VLE ranges but really only with regards to the
> fact that strings are a case of that. Nothing has been done to generalize it.
> It may be something that should be looked into, but until we've formalized
> that, I don't think that something like the lexer should be built around the
> idea. It would be safer to stick with what we've been doing - operating on
> ranges of dchar and special-casing for narrow strings where appropriate. If
> need be, it should be possible to adjust it to use VLEs later.
>
>> The problem is, I think, that the current InputRange is insufficient, as
>> it requires calculating the length of the first element twice: once in
>> front and again in popFront.
>
> Perhaps an optional function should be added to input ranges which both
> returns front and pops it. But VLE ranges such as narrow strings are
> probably the only ones which would really care, and we can already special
> case for strings, so it would probably only be of real value for general VLEs.

The goal is to make std.algorithm general when it comes to UTF-x ranges, and a VLE range seems the best-suited abstraction level so far. Other things like base64-encoded stuff could fit there too, though it needs some thought.

> However, since VLEs would arguably be a new range type, they could just
> implement the new function, and anything which special-cases VLEs can take
> advantage of it, while they still work just fine as input ranges. We could even
> add a free function which both pops and returns front which uses front and
> popFront for most ranges and the new function for VLEs, so you can code with
> that free function in cases where you intend to both take front and pop it off
> without caring about what type of range you're dealing with and without
> forcing ranges in general to implement an extra function.
>
> There's definitely more work and designing to be done here, but there are
> definitely possibilities.
>
> Still, in the interim, I'd advise simply special casing on narrow strings in
> the lexer rather than trying to use VLEs initially.
>

Aye, and write two copies of the lexer, yikes. I would just ignore *default* array ranges for the moment and do whatever is convenient & fast, then convince people to provide a proper abstraction.
The problem with *default* ranges is that they have to do extra work to maintain their invariants; e.g. even an array has to adjust its length.
As the OP mentioned, popFront is --length, ++ptr; whereas ptr++ alone is all that's required. So clearly a wrapper over the array that doesn't need to adjust a length could be faster. A la:
struct Wrapped(E){
private: // not visible
	E[] arr;
	size_t idx;
public:
	@property auto front(){ return arr[idx]; }
...
}
Or even a two-pointer (current + end) version.
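
E.g. (again just a sketch):

// popFront is a single pointer increment; no length to maintain.
struct PtrRange(E)
{
    private E* cur, end;

    this(E[] arr)
    {
        cur = arr.ptr;
        end = arr.ptr + arr.length;
    }

    @property bool empty() const { return cur == end; }
    @property E front() { return *cur; }
    void popFront() { ++cur; }
}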

-- 
Dmitry Olshansky
March 09, 2012
On Friday, March 09, 2012 11:53:51 Dmitry Olshansky wrote:
> The goal is to make std.algorithm general when it comes to UTF-x ranges, and a VLE range seems the best-suited abstraction level so far. Other things like base64-encoded stuff could fit there too, though it needs some thought.

My point is that it doesn't make sense to try and design the lexer around an as yet undesigned and unused VLE range. Either the lexer should just move forward with how things are currently done and then be adjusted later to use VLE ranges once they've been sorted out, or it should be postponed until VLE ranges are sorted out.

- Jonathan M Davis
March 09, 2012
On 09.03.2012 11:58, Jonathan M Davis wrote:
> On Friday, March 09, 2012 11:53:51 Dmitry Olshansky wrote:
>> The goal is to make std.algorithm general when it comes to UTF-x ranges,
>> and a VLE range seems the best-suited abstraction level so far. Other
>> things like base64-encoded stuff could fit there too, though it needs
>> some thought.
>
> My point is that it doesn't make sense to try and design the lexer around an
> as yet undesigned and unused VLE range.

Interface-wise, yes.

> Either the lexer should just move
> forward with how things are currently done and then be adjusted later to use
> VLE ranges once they've been sorted out, or it should be postponed until VLE
> ranges are sorted out.
>
Or it could use whatever abstraction it needs internally, providing (for starters) a (w|d)string interface.
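
I.e. something like this (a sketch, the names are mine):

// The lexer is instantiated per string type and is free to work on raw
// code units internally, while the public surface stays plain.
struct Lexer(String)
    if (is(String == string) || is(String == wstring) || is(String == dstring))
{
    private String input;

    this(String input) { this.input = input; }

    // token-level API would go here
}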

-- 
Dmitry Olshansky
March 09, 2012
> Now, there is interest in having a D parser and lexer in Phobos. I don't know
> if your version will fit the bill (e.g. it must have a range-based API), but we
> need one at some point. The original idea was to more or less directly port
> dmd's lexer and parser with some adjustments to the API as necessary
> (primarily to make it range-based). But no one has had the time to complete
> such a project yet (I originally volunteered to do it, but I just haven't had
> the time).
>
I did write a complete D lexer some time ago. I'd consider it a little too CTFE-heavy for Phobos though.
https://gist.github.com/1255439 - generic lexer generator
https://gist.github.com/1262321 - D lexer