February 06, 2011
Hello D-istos,


I am currently implementing a kind of lexing toolkit; it is my first attempt at such a thing. Below are design questions on the topic. I would also like to know whether you think such a module would be useful for the community of D programmers, and what advantages it would have, knowing that D can directly link to C lexers like flex (I have some ideas on the question, indeed).


1. Lexeme types

Lexeme types defined by client code need to carry at least 2 pieces of information:
* a code representing the type
* a regex format (string)

If I decide that type codes are strings, then we get a very nice source format for "morphologies":
    string[2][] morphology = [
        [ "SPC" ,       `[\ \t\n]*` ],
        [ "ASSIGN" ,    `=` ],
        [ "integer" ,   `[\+\-]?[1-9]+*` ],
        ...
    ];
A side advantage being that writing out a morphology or a single lexeme type brings up a meaningful name (instead of a clueless nominal number: http://en.wikipedia.org/wiki/Nominal_number).
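As a side note, such a plain table is also trivial for client code to query. A minimal sketch (the helper name formatOf is just an invention of mine):

```d
// Minimal sketch: look up a lexeme type's regex format by its type code,
// assuming the string[2][] morphology layout shown above.
string formatOf(string[2][] morphology, string code) {
    foreach (lexType; morphology)
        if (lexType[0] == code)
            return lexType[1];   // the format string
    return null;                 // unknown type code
}

void main() {
    string[2][] morphology = [
        [ "SPC" ,    `[\ \t\n]*` ],
        [ "ASSIGN" , `=` ],
    ];
    assert(formatOf(morphology, "ASSIGN") == `=`);
    assert(formatOf(morphology, "nope") is null);
}
```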

But: using strings as type codes is obviously useless overhead from the strict point of view of functionality; codes just need to be unique, thus a plain enum of uints or even ubytes used as nominals is a correct choice.
If I choose uint codes, then lexeme types must be structs (or else tuples, but they're worse). In this case, I can take the opportunity to add a 'mode' field, which would give eg:
    LexemeType[] morphology = [
        LexemeType( "SPC" ,       `[\ \t\n]*` ,      SKIP ),
        LexemeType( "ASSIGN" ,    `=` ,              MARK ),
        LexemeType( "integer" ,   `[\+\-]?[0-9]+` ,  DATA ),
        ...
    ];
Far more annoying to write, ain't it?
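For completeness, here is what the underlying definitions could look like; the names Mode, SKIP/MARK/DATA, and the fields are only my current working assumptions (members qualified as Mode.SKIP etc. here):

```d
// Hypothetical sketch of the definitions assumed by the table above.
enum Mode { SKIP, MARK, DATA }

struct LexemeType {
    string name;    // meaningful type code, eg "ASSIGN"
    string format;  // regex format (string)
    Mode   mode;    // predefined handling of matched slices
}

void main() {
    auto morphology = [
        LexemeType( "SPC" ,    `[\ \t\n]*` , Mode.SKIP ),
        LexemeType( "ASSIGN" , `=` ,         Mode.MARK ),
    ];
    assert(morphology[0].name == "SPC");
    assert(morphology[1].mode == Mode.MARK);
}
```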

Also, a 'mode' field is nearly useless as of now:
(1) for MARKs, I cannot avoid reading the slice anyway (see above), so why not store it, since there is no (additional) copy;
(2) for SKIP'ped lexemes, I have a practical alternative allowing the parser to skip optional and non-significant tokens (still a bit stupid to record tokens just to ignore them later, but...).


2. Match actions

I do not have any match action system yet. Actually, a 'mode' field would implement a kind of very special, predefined action. Is more really needed? Typically, in my experience of parsing, useful match actions happen at a higher level, namely at parsing rather than lexing time:
* Structure the AST, eg discard MARK tokens or flatten lists.
* Handle data, eg convert numbers or drop '"' from strings.
Structural actions can only be handled by the parser, I guess, while operations on data are nicely placed in dedicated Node type constructors.
What kinds of typical actions would really be useful for client code at lexing time, especially ones allowing parser simplification (other than handling SKIP tokens)?
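For concreteness, the kind of hook I have in mind would be no more than an optional callback per lexeme type, run on the matched slice before the token is recorded; this is a pure sketch, all names and the signature are invented:

```d
// Hypothetical sketch: a match action as an optional per-type callback,
// applied to the matched slice at lexing time.
alias Action = string function(string);

// eg drop the surrounding '"' from a string literal
string unquote(string slice) {
    return slice[1 .. $ - 1];
}

struct Lexeme {
    string type;
    string data;
}

// run the action, if any, before recording the token
Lexeme makeLexeme(string type, string slice, Action action) {
    return Lexeme(type, action is null ? slice : action(slice));
}

void main() {
    assert(makeLexeme("string", `"abc"`, &unquote).data == "abc");
    assert(makeLexeme("SPC", "  ", null).data == "  ");
}
```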


External points of view warmly welcome :-)

Denis
-- 
_________________
vita es estrany
spir.wikidot.com