Jump to page: 1 2
Thread overview
struct vs class for a simple token in my d lexer
May 14, 2012
Roman D. Boiko
May 14, 2012
Tobias Pankrath
May 14, 2012
Roman D. Boiko
May 14, 2012
Tobias Pankrath
May 14, 2012
Roman D. Boiko
May 14, 2012
Jonathan M Davis
May 14, 2012
Roman D. Boiko
May 14, 2012
Dmitry Olshansky
May 14, 2012
Dmitry Olshansky
May 14, 2012
Roman D. Boiko
May 15, 2012
Era Scarecrow
May 15, 2012
Era Scarecrow
May 15, 2012
Roman D. Boiko
May 15, 2012
Roman D. Boiko
May 14, 2012
Dmitry Olshansky
May 14, 2012
Tobias Pankrath
May 14, 2012
Dmitry Olshansky
May 14, 2012
Roman D. Boiko
May 14, 2012
(Subj.) I'm in doubt which to choose for my case, but this is a generic question.

http://forum.dlang.org/post/odcrgqxoldrktdtarskf@forum.dlang.org

Cross-posting here. I would appreciate any feedback. (Whether to reply in this or that thread is up to you.) Thanks
May 14, 2012
On Monday, 14 May 2012 at 15:10:25 UTC, Roman D. Boiko wrote:
> (Subj.) I'm in doubt which to choose for my case, but this is a generic question.
>
> http://forum.dlang.org/post/odcrgqxoldrktdtarskf@forum.dlang.org
>
> Cross-posting here. I would appreciate any feedback. (Whether to reply in this or that thread is up to you.) Thanks


struct Token
{
    TokenType type;
    string content; // type may as well be template parameter
    Area loc;   // basically: ((x1, y1), (x2, y2))
}

That's how my Token look like. The 2 or 3 times every token get's copied isn't worth a heap allocation with small type like this.
May 14, 2012
Quoting your post in another thread:

On Monday, 14 May 2012 at 15:10:25 UTC, Roman D. Boiko wrote:
> Making it a class would give several benefits:

>* allow not to worry about allocating a big array of tokens. E.g., on 64-bit OS the largest module in Phobos (IIRC, the std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require 4 times smaller chunk of contiguous memory if it was an array of class objects, because each would consume only 8 bytes instead of 32.

You'll still have count the space the tokens claim on the heap. So it's
basically the 500k tokens plus 500k references. I'm not sure, why you would need such a big array of tokens, though.

Aren't they produced by the lexer to be directly consumed and discarded by the parser?

> * allow subclassing, for example, for storing strongly typed literal values; this flexibility could also facilitate future extensibility (but it's difficult to predict which kind of extension may be needed)

If performance matters, why would you subclass and risk a virtual method call for something as basic as tokens?


> * there would be no need to copy data from tokens into AST, passing an object would be enough (again, copy 8 instead of 32 bytes); the same applies to passing into methods - no need to pass by ref to minimise overhead

I'm using string to store source content in tokens. Because of the way string in D works, there is no need for data copies.

> These considerations are mostly about performance. I think there is also some impact on design, but couldn't find anything significant (given that currently I see a token as merely a datastructure without associated behavior).

IMO token are value types.


May 14, 2012
On Monday, 14 May 2012 at 15:44:54 UTC, Tobias Pankrath wrote:
> On Monday, 14 May 2012 at 15:10:25 UTC, Roman D. Boiko wrote:
>> (Subj.) I'm in doubt which to choose for my case, but this is a generic question.
>>
>> http://forum.dlang.org/post/odcrgqxoldrktdtarskf@forum.dlang.org
>>
>> Cross-posting here. I would appreciate any feedback. (Whether to reply in this or that thread is up to you.) Thanks
>
>
> struct Token
> {
>     TokenType type;
>     string content; // type may as well be template parameter
>     Area loc;   // basically: ((x1, y1), (x2, y2))
> }
>
> That's how my Token look like. The 2 or 3 times every token get's copied isn't worth a heap allocation with small type like this.
Yes, except that the array of tokens for big files requires a lot of **continuous** space (13.5M for std.datetime, up to 2M for a few others in phobos & druntime). The array of references would 4 times less. Out of memory usually happens when the system cannot find a big enough chunck of memory, not when there is no available memory.

On the contrary, I think that 10-20M on rare cases is not too much.

Why I use the array I'll describe in the next post.
May 14, 2012
On Monday, 14 May 2012 at 15:53:34 UTC, Tobias Pankrath wrote:
> Quoting your post in another thread:
>
> On Monday, 14 May 2012 at 15:10:25 UTC, Roman D. Boiko wrote:
>> Making it a class would give several benefits:
>
>>* allow not to worry about allocating a big array of tokens. E.g., on 64-bit OS the largest module in Phobos (IIRC, the std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require 4 times smaller chunk of contiguous memory if it was an array of class objects, because each would consume only 8 bytes instead of 32.
>
> You'll still have count the space the tokens claim on the heap. So it's
> basically the 500k tokens plus 500k references. I'm not sure, why you would need such a big array of tokens, though.
>
> Aren't they produced by the lexer to be directly consumed and discarded by the parser?
I use sorted array of tokens for efficient 0(log N) lookup by its index (the first code unit of token). (Since tokens are created in increasing order of start indices, no further sorting is needed.) Lookup is used for two purposes:
* find the token corresponding to location of cursor (e.g., for auto-complete)
* combined with 0(log M) lookup in the ordered array of first line code unit indices, calculate the Location (line number and column number) for start / end of a token on demand (they are not pre-calculated because not used frequently); this approach also makes it easy to calculate Location either taking into account special token sequences (#line 3 "ab/c.d"), or ignoring them.

>> * allow subclassing, for example, for storing strongly typed literal values; this flexibility could also facilitate future extensibility (but it's difficult to predict which kind of extension may be needed)
>
> If performance matters, why would you subclass and risk a virtual method call for something as basic as tokens?
Agree, but not sure. That's why I created this thread.

>> * there would be no need to copy data from tokens into AST, passing an object would be enough (again, copy 8 instead of 32 bytes); the same applies to passing into methods - no need to pass by ref to minimise overhead
>
> I'm using string to store source content in tokens. Because of the way string in D works, there is no need for data copies.
The same do I. But size of string field is still 16 bytes (half of my token size).

>> These considerations are mostly about performance. I think there is also some impact on design, but couldn't find anything significant (given that currently I see a token as merely a datastructure without associated behavior).
>
> IMO token are value types.
The value type might be implemented as struct or class.
May 14, 2012
On Monday, May 14, 2012 17:10:23 Roman D. Boiko wrote:
> (Subj.) I'm in doubt which to choose for my case, but this is a
> generic question.
> 
> http://forum.dlang.org/post/odcrgqxoldrktdtarskf@forum.dlang.org
> 
> Cross-posting here. I would appreciate any feedback. (Whether to reply in this or that thread is up to you.) Thanks

For the question in general, there are primarily 2 questions to consider:

1. Do you need inheritance/polymorphism?

2. Do you need a value type or a reference type?

If you need inheritance/polymorphism, you have to use classes, because structs don't have inheritance.

If you want a value type, you have to use structs, because classes are reference types.

If you want a reference type, then you can use either a class or a struct, but it's easier with classes, because classes are always reference types, whereas structs are naturally value types, so it can take more work to make them reference types depending on what its member variables are.

One other thing to consider is deterministic destruction. The only way to get deterministic destruction is to have a struct on the stack. Value types will naturally get that, but if you want a reference type with determinstic destruction, then you're probably going to have to use a ref-counted struct.

In the vast majority of cases, that should be enough to decide whether you need a class or a struct. The main case that it doesn't is the case where you want a reference type and don't necessarily want a class, and that decision can get complicated.

In the case of a token in a lexer, I would definitely expect that to be a value type, which means using a struct.

- Jonathan M Davis
May 14, 2012
On 14.05.2012 19:10, Roman D. Boiko wrote:
> (Subj.) I'm in doubt which to choose for my case, but this is a generic
> question.
>
> http://forum.dlang.org/post/odcrgqxoldrktdtarskf@forum.dlang.org
>
> Cross-posting here. I would appreciate any feedback. (Whether to reply
> in this or that thread is up to you.) Thanks
On 14.05.2012 19:10, Roman D. Boiko wrote:
Oops, sorry I meant to post to NG only :) Repost:

Clearly you are puting too much pressure on Token.

In my mind it should be real simple:

struct Token{
    uint col, line;
    uint flags;//indicated info about token, serves as both type tag and flag set;
//indicates proper type once token was cooked (like "31.415926" -> 3.145926e1) i.e. values are calculated
    union {
        string     chars;
        float     f_val;
        double     d_val;
        uint     uint_val;
        long     long_val;
        ulnog     ulong_val;
        //... anything else you may need (8 bytes are plenty)
    }//even then you may use up to 12bytes
    //total size == 24 or 20
};

Where:
    Each raw token at start has chars == slice of characters in text (or if not UTF-8 source = copy of source). Except for keywords and operators.
    Cooking is a process of calculating constant values and such (say populating symbols table will putting symbol id into token instead of leaving string slice). Do it on the fly or after the whole source - let the user choose.

Value types have nice property of being real fast, I suggest you to do at least some syntetic tests before going with ref-based stuff. Pushing 4 word is cheap, indirection never is. Classes also have hidden mutex _monitor_ field so using 'class' can be best described as suicide.

Yet you may go with freelist of tokens (leaving them as structs). It's an old and proven way.


About row/col - if this stuff is encoded into Finite Automation (and you sure want to do something like DFA) it comes out almost at no cost.
The only disadvantage is complicating DFA tables with some irregularities of "true Unicode line ending sequences". It's more of nuisance then real problem though.

P.S. if you real bend on performance, I suggest to run sparate Aho-Corassic-style thing for keywords, it would greatly simplify (=speed up) DFA structure if keywords are not hardcoded into automation.


-- 
Dmitry Olshansky
May 14, 2012
> struct Token{
>     uint col, line;
>     uint flags;//indicated info about token, serves as both type tag and flag set;
> //indicates proper type once token was cooked (like "31.415926" -> 3.145926e1) i.e. values are calculated
>     union {
>         string     chars;
>         float     f_val;
>         double     d_val;
>         uint     uint_val;
>         long     long_val;
>         ulnog     ulong_val;
>         //... anything else you may need (8 bytes are plenty)
>     }//even then you may use up to 12bytes
>     //total size == 24 or 20
> };

I wouldn't convert the string representation to a value unless needed. Even the parser does not need to know.
May 14, 2012
On Monday, 14 May 2012 at 16:41:39 UTC, Jonathan M Davis wrote:
> On Monday, May 14, 2012 17:10:23 Roman D. Boiko wrote:
>> (Subj.) I'm in doubt which to choose for my case, but this is a
>> generic question.
>> 
>> http://forum.dlang.org/post/odcrgqxoldrktdtarskf@forum.dlang.org
>> 
>> Cross-posting here. I would appreciate any feedback. (Whether to
>> reply in this or that thread is up to you.) Thanks
>
> For the question in general, there are primarily 2 questions to consider:
>
> 1. Do you need inheritance/polymorphism?
> don't have inheritance.
>
> 2. Do you need a value type or a reference type?
>
> If you need inheritance/polymorphism, you have to use classes, because structs
>
> If you want a value type, you have to use structs, because classes are
> reference types.
>
> If you want a reference type, then you can use either a class or a struct, but
> it's easier with classes, because classes are always reference types, whereas
> structs are naturally value types, so it can take more work to make them
> reference types depending on what its member variables are.
>
> One other thing to consider is deterministic destruction. The only way to get
> deterministic destruction is to have a struct on the stack. Value types will
> naturally get that, but if you want a reference type with determinstic
> destruction, then you're probably going to have to use a ref-counted struct.
>
> In the vast majority of cases, that should be enough to decide whether you
> need a class or a struct. The main case that it doesn't is the case where you
> want a reference type and don't necessarily want a class, and that decision
> can get complicated.
>
> In the case of a token in a lexer, I would definitely expect that to be a value
> type, which means using a struct.
>
> - Jonathan M Davis
Semantics is clearly of value type. But it would be nice to
minimize size of instance to be copied. (I could simply split it
into two, if some part is not used in most scenarios.)

Just curious: how to implement a struct as ref type? I could add
indirection (but then why not just use a class?), or simply pass
by ref into all methods. Are there any other options?

Thanks to everybody for your feedback, it looks like I'll stay
with struct.
May 14, 2012
On 14.05.2012 21:16, Tobias Pankrath wrote:
>
>> struct Token{
>> uint col, line;
>> uint flags;//indicated info about token, serves as both type tag and
>> flag set;
>> //indicates proper type once token was cooked (like "31.415926" ->
>> 3.145926e1) i.e. values are calculated
>> union {
>> string chars;
>> float f_val;
>> double d_val;
>> uint uint_val;
>> long long_val;
>> ulnog ulong_val;
>> //... anything else you may need (8 bytes are plenty)
>> }//even then you may use up to 12bytes
>> //total size == 24 or 20
>> };
>
> I wouldn't convert the string representation to a value unless needed.
> Even the parser does not need to know.

Thx for snipping out the gist of it :)
See last sentence, flexibility is the king:
<<<
Where:
    Each raw token at start has chars == slice of characters in text (or if not UTF-8 source = copy of source). Except for keywords and operators.
    Cooking is a process of calculating constant values and such (say populating symbols table will putting symbol id into token instead of leaving string slice). Do it on the fly or after the whole source - let the user choose.
>>>

-- 
Dmitry Olshansky
« First   ‹ Prev
1 2