Lexer in D - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » Learn » Lexer in D

Thread overview

Lexer in D
Mar 02, 2013 Namespace
Mar 02, 2013 Dmitry Olshansky
Mar 02, 2013 Namespace
Mar 02, 2013 David
Mar 02, 2013 Dmitry Olshansky
Mar 02, 2013 Namespace
Mar 02, 2013 Namespace
Mar 02, 2013 Namespace
Mar 02, 2013 Dmitry Olshansky
Mar 02, 2013 Namespace
Mar 02, 2013 Dmitry Olshansky
Mar 02, 2013 Dmitry Olshansky
Mar 02, 2013 Namespace
Mar 02, 2013 Dmitry Olshansky
Mar 02, 2013 Namespace
Mar 02, 2013 Namespace
Mar 02, 2013 Dmitry Olshansky
Mar 02, 2013 Zach the Mystic
Mar 02, 2013 Dmitry Olshansky
Mar 02, 2013 Namespace
Mar 02, 2013 Namespace
Mar 02, 2013 Namespace
Mar 03, 2013 Dmitry Olshansky
Mar 03, 2013 Namespace
Mar 03, 2013 Dmitry Olshansky
Mar 03, 2013 Namespace
Mar 03, 2013 Namespace
Mar 03, 2013 Dmitry Olshansky
Mar 03, 2013 Namespace
Mar 03, 2013 Dmitry Olshansky
Mar 03, 2013 Namespace
Mar 04, 2013 Namespace
Mar 04, 2013 Namespace
Mar 04, 2013 Dmitry Olshansky
Mar 04, 2013 Namespace
Mar 04, 2013 Dmitry Olshansky
Mar 04, 2013 Namespace
Mar 04, 2013 Dmitry Olshansky
Mar 11, 2013 Namespace
Mar 11, 2013 Namespace
Mar 11, 2013 FG
Mar 11, 2013 Namespace
Mar 11, 2013 David
Mar 11, 2013 Namespace
Mar 11, 2013 monarch_dodra
Mar 11, 2013 Namespace
Mar 11, 2013 monarch_dodra
Mar 11, 2013 Namespace
Mar 13, 2013 Namespace
Mar 14, 2013 Dmitry Olshansky
Mar 14, 2013 Namespace
Mar 14, 2013 Dmitry Olshansky
Mar 14, 2013 Artur Skawina
Mar 14, 2013 Dmitry Olshansky
Mar 15, 2013 Namespace
Mar 21, 2013 Namespace
Mar 23, 2013 Namespace
Mar 24, 2013 Namespace
Mar 24, 2013 Dmitry Olshansky
Mar 24, 2013 Namespace
Mar 24, 2013 Brian Schott
Mar 24, 2013 Namespace
Mar 14, 2013 Dmitry Olshansky
Mar 15, 2013 Namespace
Mar 04, 2013 Brian Schott
Mar 04, 2013 Dmitry Olshansky
Mar 03, 2013 Dmitry Olshansky
Mar 03, 2013 Jonathan M Davis
Mar 03, 2013 Dmitry Olshansky

March 02, 2013

Posted by Namespace

Namespace

For one of our courses this year we have a very cool topic: Lexer and Compiler.
In the past I have already tried some experiments in this direction with Romulus and Remus, but I've never measured my Lexing time.
That's why I wanted to start again to write a lexer (hitherto purely for the learning effect).
I looked at the code of some of the other lexers, that are also written in D, to get some inspiration from, but currently my Lexer is a little bit slow, it requires for example ~200 msecs for std.datetime.
So I wanted to ask whether there are nice people who help me to improve our lexer.

Code: https://github.com/Dgame/Puzzle/blob/master/lexer.d

Thanks in advance. :)

March 02, 2013

Posted by Dmitry Olshansky
in reply to Namespace

Dmitry Olshansky

Posted in reply to Namespace

02-Mar-2013 13:33, Namespace пишет:
> For one of our courses this year we have a very cool topic: Lexer and
> Compiler.
> In the past I have already tried some experiments in this direction with
> Romulus and Remus, but I've never measured my Lexing time.
> That's why I wanted to start again to write a lexer (hitherto purely for
> the learning effect).
> I looked at the code of some of the other lexers, that are also written
> in D, to get some inspiration from, but currently my Lexer is a little
> bit slow, it requires for example ~200 msecs for std.datetime.
> So I wanted to ask whether there are nice people who help me to improve
> our lexer.
>
> Code: https://github.com/Dgame/Puzzle/blob/master/lexer.d
>
> Thanks in advance. :)

See this one I'm currently investing my time into:

https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer

Other then this use cachegrind and compile with ldc/gdc :)

Also for dmd check the flags are "-release -O -inline  -noboundscheck"

-- 
Dmitry Olshansky

March 02, 2013

Posted by Namespace
in reply to Dmitry Olshansky

Namespace

Posted in reply to Dmitry Olshansky

I'm already using these flags. But thanks for your Link I will take a look at it. :)

March 02, 2013

Posted by David
in reply to Namespace

David

Posted in reply to Namespace

Am 02.03.2013 10:33, schrieb Namespace:
> For one of our courses this year we have a very cool topic: Lexer and
> Compiler.
> In the past I have already tried some experiments in this direction with
> Romulus and Remus, but I've never measured my Lexing time.
> That's why I wanted to start again to write a lexer (hitherto purely for
> the learning effect).
> I looked at the code of some of the other lexers, that are also written
> in D, to get some inspiration from, but currently my Lexer is a little
> bit slow, it requires for example ~200 msecs for std.datetime.
> So I wanted to ask whether there are nice people who help me to improve
> our lexer.
> 
> Code: https://github.com/Dgame/Puzzle/blob/master/lexer.d
> 
> Thanks in advance. :)

Making `types` and `keywords` an array might improve the speed a little bit:

if (id in types) -> if (types.canFind(id))
(same with keywords)

March 02, 2013

Posted by Dmitry Olshansky
in reply to David

Dmitry Olshansky

Posted in reply to David

02-Mar-2013 15:04, David пишет:
> Am 02.03.2013 10:33, schrieb Namespace:
>> For one of our courses this year we have a very cool topic: Lexer and
>> Compiler.
>> In the past I have already tried some experiments in this direction with
>> Romulus and Remus, but I've never measured my Lexing time.
>> That's why I wanted to start again to write a lexer (hitherto purely for
>> the learning effect).
>> I looked at the code of some of the other lexers, that are also written
>> in D, to get some inspiration from, but currently my Lexer is a little
>> bit slow, it requires for example ~200 msecs for std.datetime.
>> So I wanted to ask whether there are nice people who help me to improve
>> our lexer.
>>
>> Code: https://github.com/Dgame/Puzzle/blob/master/lexer.d
>>
>> Thanks in advance. :)
>
> Making `types` and `keywords` an array might improve the speed a little bit:
>
> if (id in types) -> if (types.canFind(id))
> (same with keywords)

That would be slower as canFind is linear search.
Simpler way is use first letter and length to select a bucket of keywords.

But otherwise, yes, the first advice is never use built-in AA as these take quite some time to allocate. Plus they not that fast to lookup as they used mod-prime not mod-power-of-2 to pick a bucket.


-- 
Dmitry Olshansky

March 02, 2013

Posted by Namespace
in reply to Dmitry Olshansky

Namespace

Posted in reply to Dmitry Olshansky

With canFind it takes ~300 msecs. :/
Any idea how I could improve it?

> Simpler way is use first letter and length to select a bucket of keywords.

I don't understand how you could do this without AA's and how this could be more performant than canFind.

March 02, 2013

Posted by Namespace
in reply to Namespace

Namespace

Posted in reply to Namespace

I sorted both, types and keywords, alphabetical and wrote my own 'canFind':

bool canFind(const size_t size)(ref const string[size] values, string value) pure nothrow {
	if (value[0] < 'm' || value[0] == '_') {
		foreach (string _val; values) {
			if (_val == value) return true;
		}
	} else {
		foreach_reverse (string _val; values) {
			if (_val == value) return true;
		}
	}
	
	return false;
}

That brings ~20 msecs, so I have ~280 mesecs instead of ~ 300. But it isn't perfect.

March 02, 2013

Posted by Namespace
in reply to Dmitry Olshansky

Namespace

Posted in reply to Dmitry Olshansky

> That would be slower as canFind is linear search.
> Simpler way is use first letter and length to select a bucket of keywords.

I think I get it.
Did you mean something like this?

[code]
immutable ubyte[char] typeMap;
immutable ubyte[char] keywordMap;

static this() {
	typeMap = [
		'd' : 7, 'l' : 16, 'i' : 12, 'u' : 20, 'b' : 0, 'f' : 10, 'r' : 17, 'v' : 25, 'c' : 2, 's' : 18, 'w' : 26
	];
	
	keywordMap = [
		'_' : 0, 'a' : 6, 'b' : 12, 'c' : 14, 'd' : 20, 'e' : 26, 'f' : 30, 'g' : 36, 'i' : 37, 'l' : 45,
		'm' : 46, 'n' : 49, 'o' : 52, 'p' : 54, 'r' : 60, 's' : 62, 't' : 69, 'u' : 77, 'v' : 79, 'w' : 81
	];
}

bool isKeyword(string value) pure nothrow {
	if (value[0] !in keywordMap) return false;
	
	foreach (string kword; keywords[keywordMap[value[0]] .. $]) {
		if (kword == value) return true;
		if (kword[0] != value[0]) return false;
	}
	
	return false;
}

bool isType(string value) pure nothrow {
	if (value[0] !in typeMap) return false;
	
	foreach (string type; types[typeMap[value[0]] .. $]) {
		if (type == value) return true;
		if (type[0] != value[0]) return false;
	}
	
	return false;
}
[/code]

I'm now by ~230 msecs.

March 02, 2013

Posted by Dmitry Olshansky
in reply to Namespace

Dmitry Olshansky

Posted in reply to Namespace

02-Mar-2013 18:53, Namespace пишет:
>> That would be slower as canFind is linear search.
>> Simpler way is use first letter and length to select a bucket of
>> keywords.
>
> I think I get it.
> Did you mean something like this?
>

No-no and no.

Just translate list of keywords/types to a bunch of switch-cases that will give you near optimal performance. Use CTFE to construct that string of D code.

In fact I proposed to do just 2 levels of switches: first switch by length then by first char.

Even simple if you don't want to go for CTFE code-generation is to make plain array of array. The length of array is exactly 256 i.e.

Even this should have decent speed:

string[][256] keywords = [
null,
null,
...
//at index 'i'
['int', 'if', 'invariant', 'idouble', 'ifloat' ...]
//array of strings starting with char for each
...
];

bool isKeyword(string value)
{
	string[] toSearch = keywords[value[0]];
	return toSearch.canFind(value);
}

Better yet only store [1..$] parts of keywords and search for value[1..$].

Even faster is to make the same for each possible length of keyword.

> [code]
> immutable ubyte[char] typeMap;
> immutable ubyte[char] keywordMap;
>
> static this() {
>      typeMap = [
>          'd' : 7, 'l' : 16, 'i' : 12, 'u' : 20, 'b' : 0, 'f' : 10, 'r' :
> 17, 'v' : 25, 'c' : 2, 's' : 18, 'w' : 26
>      ];
>
>      keywordMap = [
>          '_' : 0, 'a' : 6, 'b' : 12, 'c' : 14, 'd' : 20, 'e' : 26, 'f' :
> 30, 'g' : 36, 'i' : 37, 'l' : 45,
>          'm' : 46, 'n' : 49, 'o' : 52, 'p' : 54, 'r' : 60, 's' : 62, 't'
> : 69, 'u' : 77, 'v' : 79, 'w' : 81
>      ];
> }
>
> bool isKeyword(string value) pure nothrow {
>      if (value[0] !in keywordMap) return false;
>
>      foreach (string kword; keywords[keywordMap[value[0]] .. $]) {
>          if (kword == value) return true;
>          if (kword[0] != value[0]) return false;
>      }
>
>      return false;
> }
>
> bool isType(string value) pure nothrow {
>      if (value[0] !in typeMap) return false;
>
>      foreach (string type; types[typeMap[value[0]] .. $]) {
>          if (type == value) return true;
>          if (type[0] != value[0]) return false;
>      }
>
>      return false;
> }
> [/code]
>
> I'm now by ~230 msecs.


-- 
Dmitry Olshansky

March 02, 2013

Posted by Namespace
in reply to Dmitry Olshansky

Namespace

Posted in reply to Dmitry Olshansky

> No-no and no.
>
> Just translate list of keywords/types to a bunch of switch-cases that will give you near optimal performance. Use CTFE to construct that string of D code.
> In fact I proposed to do just 2 levels of switches: first switch by length then by first char.

I don't understand.
You don't mean something like:

switch (value) {
case "if": isKeyword = true; break;
...

right?

> Even simple if you don't want to go for CTFE code-generation is to make plain array of array. The length of array is exactly 256 i.e.
>
> Even this should have decent speed:
>
> string[][256] keywords = [
> null,
> null,
> ...
> //at index 'i'
> ['int', 'if', 'invariant', 'idouble', 'ifloat' ...]
> //array of strings starting with char for each
> ...
> ];
>
> bool isKeyword(string value)
> {
> 	string[] toSearch = keywords[value[0]];
> 	return toSearch.canFind(value);
> }

Ok I will try it. Thank you.

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation