[dox] Fixing the lexical rule for BinaryInteger (page 2) - D Programming Language Discussion Forum


August 17, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by H. S. Teoh
in reply to Andre Artus

On Sat, Aug 17, 2013 at 04:02:40AM +0200, Andre Artus wrote:
> On Saturday, 17 August 2013 at 00:51:57 UTC, H. S. Teoh wrote:
> >On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
> >>On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
> >[...]
> >>>2. Your BinaryInteger and HexadecimalInteger only allow for
> >>>one of
> >>>the following (reduced) cases:
> >>>
> >>>0b1__ : works
> >>>0b_1_ : fails
> >>>0b__1 : fails
> >>
> >>It's my opinion that the compiler should reject all of these because
> >>I think of the underscore as a separator between digits, but I'm
> >>constantly fighting the "spec, dmd, and idiom all disagree"
> >>issue.
> >[...]
> >
> >I remember reading this part of the spec on dlang.org, and I wonder if it was worded the way it is just for simplicity, because to specify something like "_ must appear between digits" involves some complicated BNF rules, which maybe seems like overkill for a single literal.
> >
> >But sometimes it is good to be precise, if we want to enforce "proper" conventions for underscores:
> >
> ><binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
> >
> ><binaryDigits> ::= <binaryDigit> <binaryDigits>
> >		| <binaryDigit>
> >
> ><underscoreBinaryDigits> ::= ""
> >		| "_" <binaryDigits>
> >		| "_" <binaryDigits> <underscoreBinaryDigits>
> >
> ><binaryDigit> ::= "0"
> >		| "1"
> >
> >This BNF spec forces "_" to only appear between two binary digits, and never more than a single _ in a row.
> 
> Yup, that's the issue. Coding the actual behaviour by hand, or doing it with a regular expression, is close to trivial.
> 
> >You can also make your parser only pick up <binaryDigit> when performing semantic on binary literals, so the other stuff is ignored and only serves to enforce syntax.
> 
> Pushing it up to the parser is an option in implementation, but I don't see that making the specification easier (it's 3:40 in the morning here, so I am very likely not thinking too clearly about this).

I didn't mean to push it up to the parser. I was just using BNF to show that it's possible to specify the behaviour precisely. And also that it's rather convoluted just for something as intuitively straightforward as an integer literal. Which is a likely reason why the current specs are a bit blurry about what should/shouldn't be allowed.
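
For anyone building a tool, the strict BNF quoted above collapses to one short regular expression. Here is a minimal Python sketch, assuming the rule is exactly "0b, one or more binary digits, then any number of (one underscore followed by one or more binary digits)". The names STRICT_BINARY and is_strict_binary are made up for illustration, and the sketch ignores the uppercase prefix and any integer suffixes.

```python
import re

# Regex form of the strict BNF: "_" may appear only between two binary
# digits, and never more than one "_" in a row.
STRICT_BINARY = re.compile(r"0b[01]+(?:_[01]+)*")


def is_strict_binary(text: str) -> bool:
    """Return True if the whole string is a literal under the strict rule."""
    return STRICT_BINARY.fullmatch(text) is not None


if __name__ == "__main__":
    # The three reduced cases from earlier in the thread, plus a sane literal.
    for literal in ["0b0011_1101", "0b1__", "0b_1_", "0b__1"]:
        print(literal, is_strict_binary(literal))
```

Under this reading all three reduced cases (0b1__, 0b_1_, 0b__1) are rejected, which is the behaviour Brian argued for.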


> >I'd be surprised if there's any D code out there that doesn't fit this spec, to be honest.
> 
> It's not what I would call best practice, but the following is possible in the current compiler:
> 
> >	auto myBin1 = 0b0011_1101; // Sane
> >	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
> >	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1
> 
> Which means tools built against the documented spec are going to choke on these weird cases. Personally I would prefer if the more questionable options were not allowed as they potentially defeat the goal of improving clarity. But, that's a breaking change.

I know that, but I'm saying that hardly *any* code would break if we made DMD reject things like this. I don't think anybody in their right mind would write code like that. (Unless they were competing in the IODCC... :-P)

The issue here is that when specs / DMD / TDPL don't agree, then it's not always clear which of the three is wrong. Perhaps *all* of them are wrong. Just because DMD accepts invalid code doesn't mean it should be part of the specs, for example. It could constitute a DMD bug.


> >But if you want to accept "strange" literals like 0b__1__, you could do something like:
> >
> ><binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>
> >
> ><underscoreBinaryDigits> ::= "_"
> >		| "_" <underscoreBinaryDigits>
> >		| <binaryDigit>
> >		| <binaryDigit> <underscoreBinaryDigits>
> >		| ""
> >
> ><binaryDigit> ::= "0"
> >		| "1"
> >
> >The odd form of the rule for <binaryLiteral> is to ensure that there's at least one binary digit in the string, whereas <underscoreBinaryDigits> is just a wildcard anything-goes rule that takes any combination of 0, 1, and _, including the empty string.
> >
> 
> The rule that matches the DMD compiler is actually very easy to do in ANTLR4, i.e.
> 
> BinaryLiteral   : '0b' [_01]* [01] [_01]* ;
> 
> 
> I'm a bit too tired to fully pay attention, but it seems you are saying that "0b" (no additional numbers) should match, which I believe it should not (although I admit to not testing this). If it does then I would consider that a bug.

No, the BNF rules I wrote are equivalent to your ANTLR4 spec. Which is equivalent to the regex I posted later.
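
To make the equivalence concrete, here is a small Python sketch that transcribes the quoted ANTLR4 rule into a regex and checks the edge cases discussed. The helper names (PERMISSIVE_BINARY, is_permissive_binary, binary_value) are invented for this example, and the value computation simply assumes underscores carry no meaning, as the myBin2/myBin3 comments suggest.

```python
import re

# Direct transcription of the lexer rule
#   BinaryLiteral : '0b' [_01]* [01] [_01]* ;
# The middle [01] is what guarantees at least one real digit.
PERMISSIVE_BINARY = re.compile(r"0b[_01]*[01][_01]*")


def is_permissive_binary(text: str) -> bool:
    """Return True if the whole string matches the permissive rule."""
    return PERMISSIVE_BINARY.fullmatch(text) is not None


def binary_value(text: str) -> int:
    """Value of an accepted literal, ignoring every underscore."""
    if not is_permissive_binary(text):
        raise ValueError(f"not a binary literal: {text!r}")
    return int(text[2:].replace("_", ""), 2)


if __name__ == "__main__":
    for literal in ["0b", "0b___", "0b_______1", "0b____1___"]:
        print(literal, is_permissive_binary(literal))
```

A bare "0b", or "0b" followed only by underscores, does not match, because the rule demands one digit somewhere in the string.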


> It's not a problem implementing the rule, I am more concerned with documenting it in a clear and unambiguous way so that people building tools from it can get it right. BNF isn't always the easiest way to do so, but it's what's being used.

Well, you could bug Walter about what *should* be accepted, and if he agrees to restrict it to having _ only between two digits, then you'd file a bug against DMD. Again, I seriously doubt that such a change would cause any code breakage, because writing 0b1 as 0b____1____ is just so ridiculous that any such code *should* be broken.


T

-- 
Prosperity breeds contempt, and poverty breeds consent. -- Suck.com

August 17, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by Andre Artus
in reply to H. S. Teoh

>>>>> Andre Artus wrote:
>>>>> 2. Your BinaryInteger and HexadecimalInteger only allow for
>>>>> one of the following (reduced) cases:
>>>>> 
>>>>> 0b1__ : works
>>>>> 0b_1_ : fails
>>>>> 0b__1 : fails

>>>> Brian Schott wrote:
>>>> It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits,
>>>> but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.

I agree with you, Brian, all three of these constructions go contrary to the goal of making the code clearer. I would not be too surprised if a significant number of programmers would see those as different numbers, at least until they paused to take it in.

>>> [...]
>>> H. S. Teoh wrote:
>>> 
>>> I remember reading this part of the spec on dlang.org, and I wonder if it was worded the way it is just for simplicity, because to specify something like "_ must appear between digits" involves some complicated BNF rules, which maybe seems like overkill for a single literal.

I think that you are right.

>>> H. S. Teoh wrote:
>>> 
>>> But sometimes it is good to be precise, if we want to enforce
>>> "proper" conventions for underscores:
>>> 
>>> <binaryLiteral>  ::= "0b" <binaryDigits>
>>>                     <underscoreBinaryDigits>
>>> 
>>> <binaryDigits>   ::= <binaryDigit> <binaryDigits>
>>>                    | <binaryDigit>
>>> 
>>> <underscoreBinaryDigits>
>>>                  ::= ""
>>>                    | "_" <binaryDigits>
>>>                    | "_" <binaryDigits> <underscoreBinaryDigits>
>>> 
>>> <binaryDigit>    ::= "0"
>>>                    | "1"
>>> 
>>> This BNF spec forces "_" to only appear between two binary digits, and never more than a single _ in a row.


>> Andre Artus wrote:
>> Yup, that's the issue. Coding the actual behaviour by hand, or doing it with a regular expression, is close to trivial.

>>> H. S. Teoh wrote:
>>> You can also make your parser only pick up <binaryDigit> when
>>> performing semantic on binary literals, so the other stuff is ignored and only serves to enforce syntax.

>> Andre Artus wrote:
>> Pushing it up to the parser is an option in implementation, but I don't see that making the specification easier (it's 3:40 in the morning here, so I am very likely not thinking too clearly about this).


> H. S. Teoh wrote:
> I didn't mean to push it up to the parser.

Sorry, I misunderstood. I had been up for over 21 hours when I wrote that, so it was getting a bit difficult for me to concentrate. I got the impression you were saying that the parser would be responsible for extracting the binary digits.

> H. S. Teoh wrote:
> I was just using BNF to show that it's possible to specify the behaviour precisely.
> And also that it's rather convoluted just for something as intuitively straightforward as an integer literal. Which is a likely reason why the current specs are a bit blurry about what should/shouldn't be allowed.

I don't think I've seen lexemes defined using (a variant of) BNF before; most often some form of regular expression is used. One could cut down and clarify the page describing the lexical syntax significantly by employing simple regular expressions.


>>> H. S. Teoh wrote:
>>> I'd be surprised if there's any D code out there that doesn't fit this spec, to be honest.

>> Andre Artus wrote:
>> It's not what I would call best practice, but the following is
>> possible in the current compiler:
>> 
>>> 	auto myBin1 = 0b0011_1101; // Sane
>>> 	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
>>> 	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1
>> 
>> Which means tools built against the documented spec are going to choke on these weird cases. Personally I would prefer if the more questionable options were not allowed as they potentially defeat the goal of improving clarity. But, that's a breaking change.

> H. S. Teoh wrote:
> I know that, but I'm saying that hardly *any* code would break if we made DMD reject things like this. I don't think anybody in their right mind would write code like that. (Unless they were competing in the IODCC... :-P)

I agree that the compiler should probably break that code; I believe some breaking changes are good when they help the programmer fix potential bugs. But I am also someone who compiles with "Treat warnings as errors".

> H. S. Teoh wrote:
> The issue here is that when specs / DMD / TDPL don't agree, then it's not always clear which of the three is wrong. Perhaps *all* of them are wrong. Just because DMD accepts invalid code doesn't mean it should be part of the specs, for example. It could constitute a DMD bug.

It would be good to get some clarification on this.

>>> H. S. Teoh wrote:
>>> But if you want to accept "strange" literals like 0b__1__, you could do something like:
>>> 
>>> <binaryLiteral>          ::= "0b" <underscoreBinaryDigits>
>>>                              <binaryDigit>
>>>                              <underscoreBinaryDigits>
>>> 
>>> <underscoreBinaryDigits> ::= "_"
>>>                            | "_" <underscoreBinaryDigits>
>>>                            | <binaryDigit>
>>>                            | <binaryDigit> <underscoreBinaryDigits>
>>>                            | ""
>>> 
>>> <binaryDigit>            ::= "0"
>>>                            | "1"
>>> 
>>> The odd form of the rule for <binaryLiteral> is to ensure that
>>> there's at least one binary digit in the string, whereas
>>> <underscoreBinaryDigits> is just a wildcard anything-goes rule that takes any combination of 0, 1, and _, including the
>>> empty string.

>> Andre Artus wrote:
>> The rule that matches the DMD compiler is actually very easy to do in ANTLR4, i.e.
>> 
>> BinaryLiteral   : '0b' [_01]* [01] [_01]* ;
>> 
>> I'm a bit too tired to fully pay attention, but it seems you are saying that "0b" (no additional numbers) should match, which I believe it should not (although I admit to not testing this). If it does then I would consider that a bug.

> H. S. Teoh wrote:
> No, the BNF rules I wrote are equivalent to your ANTLR4 spec. Which is equivalent to the regex I posted later.

I should have paid better attention, as I missed the <binaryDigit> in <binaryLiteral>. To be honest I was having a hard time focusing due to lack of sleep and a pervading stench of paint fumes seeping in from the adjacent building.

>> Andre Artus wrote:
>> It's not a problem implementing the rule, I am more concerned with documenting it in a clear and unambiguous way so that people building tools from it can get it right. BNF isn't always the easiest way to do so, but it's what's being used.

> H. S. Teoh wrote:
> Well, you could bug Walter about what *should* be accepted,

I'm not sure how to go about that.

> H. S. Teoh wrote:
> and if he agrees to restrict it to having _ only between two digits, then you'd file a bug against DMD.

Well if we could get a ruling on this then we could include HexadecimalInteger in the ruling as it has similar behaviour in DMD.


The current specification for DecimalInteger also allows a trailing sequence of underscores. It also does not include the sign as part of the token value.

Possible regex alternatives (note I do not include the sign, as per current spec).

(0|[1-9]([_]*[0-9])*)

or arguably better
(0|[1-9]([_]?[0-9])*)

> H. S. Teoh wrote:
> Again, I seriously doubt that such a change would cause any code breakage, because writing 0b1 as 0b____1____ is just so ridiculous that any such code *should* be broken.

Agreed.

August 17, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by H. S. Teoh
in reply to Andre Artus

On Sat, Aug 17, 2013 at 11:29:03PM +0200, Andre Artus wrote: [...]
> >H. S. Teoh wrote:
> >I was just using BNF to show that it's possible to specify the
> >behaviour precisely.  And also that it's rather convoluted just for
> >something as intuitively straightforward as an integer literal. Which
> >is a likely reason why the current specs are a bit blurry about what
> >should/shouldn't be allowed.
> 
> I don't think I've seen lexemes defined using (a variant of) BNF before; most often some form of regular expression is used. One could cut down and clarify the page describing the lexical syntax significantly by employing simple regular expressions.

You're right, I think the D specs page on literals using BNF is a bit of an overkill. Maybe it should be rewritten using regexen. It would be easier to understand, for one thing.


[...]
> >H. S. Teoh wrote:
> >I know that, but I'm saying that hardly *any* code would break if
> >we made DMD reject things like this. I don't think anybody in
> >their right mind would write code like that. (Unless they were
> >competing in the IODCC... :-P)
> 
> I agree that the compiler should probably break that code; I believe some breaking changes are good when they help the programmer fix potential bugs. But I am also someone who compiles with "Treat warnings as errors".

Walter is someone who believes that compilers should only have errors, not warnings. :)


[...]
> >>Andre Artus wrote:
> >>It's not a problem implementing the rule, I am more concerned
> >>with documenting it in a clear and unambiguous way so that
> >>people building tools from it can get it right. BNF isn't always
> > >>the easiest way to do so, but it's what's being used.
> 
> >H. S. Teoh wrote:
> >Well, you could bug Walter about what *should* be accepted,
> 
> I'm not sure how to go about that.

Email him and ask? :)


> >H. S. Teoh wrote:
> >and if he agrees to restrict it to having _ only between two
> >digits, then you'd file a bug against DMD.
> 
> Well if we could get a ruling on this then we could include HexadecimalInteger in the ruling as it has similar behaviour in DMD.
> 
> The current specification for DecimalInteger also allows a trailing sequence of underscores. It also does not include the sign as part of the token value.

Yeah that sounds like a bug in the specs.


> Possible regex alternatives (note I do not include the sign, as per
> current spec).
> 
> (0|[1-9]([_]*[0-9])*)
> 
> or arguably better
> (0|[1-9]([_]?[0-9])*)
[...]

I think it should be:

	(0|[1-9]([0-9]*(_[0-9]+)*)?)

That is, either it's a 0, or a single digit from 1-9, or 1-9 followed by (zero or more digits 0-9 followed by zero or more (underscore followed by one or more digits 0-9)). This enforces only a single underscore between digits, and no preceding/trailing underscores. So it would exclude things like 12_____34, which is just as ridiculous as 123___, and only allow 12_34.
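
A quick way to sanity-check that reading is to run the pattern against the examples just mentioned. This is only a Python sketch: the pattern is copied as written above, while the names DECIMAL and is_decimal are illustrative and not from any D tool.

```python
import re

# The proposed pattern, copied verbatim. fullmatch() anchors it at both
# ends, so no explicit ^ or $ is needed.
DECIMAL = re.compile(r"(0|[1-9]([0-9]*(_[0-9]+)*)?)")


def is_decimal(text: str) -> bool:
    """Return True if the whole string is accepted by the proposed pattern."""
    return DECIMAL.fullmatch(text) is not None


if __name__ == "__main__":
    for literal in ["0", "7", "12_34", "12_____34", "123___", "_1"]:
        print(literal, is_decimal(literal))
```

Note that, as in the current spec, the sign is not part of the token, so something like -12_34 is rejected by the pattern itself and would be handled as a unary minus applied to 12_34.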


T

-- 
Blunt statements really don't have a point.

August 17, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by Andre Artus
in reply to H. S. Teoh

> [...]
>>> H. S. Teoh wrote:
>>> I was just using BNF to show that it's possible to specify the behaviour precisely.  And also that it's rather convoluted just for something as intuitively straightforward as an integer literal. Which is a likely reason why the current specs are a bit blurry about what should/shouldn't be allowed.

>> Andre Artus wrote:
>> I don't think I've seen lexemes defined using (a variant of) BNF before; most often some form of regular expression is used. One could cut down and clarify the page describing the lexical syntax significantly by employing simple regular expressions.

> H. S. Teoh wrote:
> You're right, I think the D specs page on literals using BNF is a bit of an overkill. Maybe it should be rewritten using regexen.
> It would be easier to understand, for one thing.

I would not mind doing this; I'll see what Walter says.

It would also be quite easy to generate syntax diagrams from a reg-expr.

> [...]
>>> H. S. Teoh wrote:
>>> I know that, but I'm saying that hardly *any* code would break if
>>> we made DMD reject things like this. I don't think anybody in
>>> their right mind would write code like that. (Unless they were
>>> competing in the IODCC... :-P)
>> 
>> I agree that the compiler should probably break that code; I believe some breaking changes are good when they help the programmer fix potential bugs. But I am also someone who compiles with "Treat warnings as errors".

> H. S. Teoh wrote:
> Walter is someone who believes that compilers should only have errors, not warnings. :)

That can go both ways, but I suspect you mean that in the good way.


> [...]
>>>> Andre Artus wrote:
>>>> It's not a problem implementing the rule, I am more concerned
>>>> with documenting it in a clear and unambiguous way so that
>>>> people building tools from it can get it right. BNF isn't always the easiest way to do so, but it's what's being used.
>> 
>>> H. S. Teoh wrote:
>>> Well, you could bug Walter about what *should* be accepted,
>> 
>> I'm not sure how to go about that.

> H. S. Teoh wrote:
> Email him and ask? :)

I'll try that.

>>> H. S. Teoh wrote:
>>> and if he agrees to restrict it to having _ only between two
>>> digits, then you'd file a bug against DMD.
>> 
>> Well if we could get a ruling on this then we could include
>> HexadecimalInteger in the ruling as it has similar behaviour in DMD.
>> 
>> The current specification for DecimalInteger also allows a trailing sequence of underscores. It also does not include the sign as part of the token value.

> H. S. Teoh wrote:
> Yeah that sounds like a bug in the specs.

Yes, I believe so. The same issues are under "Floating Point Literals". Should be easy to fix.

>> Possible regex alternatives (note I do not include the sign, as per current spec).
>> 
>> (0|[1-9]([_]*[0-9])*)
>> 
>> or arguably better
>> (0|[1-9]([_]?[0-9])*)
> [...]
>
> I think it should be:
>
> 	(0|[1-9]([0-9]*(_[0-9]+)*)?)
>
> That is, either it's a 0, or a single digit from 1-9, or 1-9 followed by (zero or more digits 0-9 followed by zero or more (underscore followed by one or more digits 0-9)). This enforces only a single underscore between digits, and no preceding/trailing underscores. So it would exclude things like 12_____34, which is just as ridiculous as 123___, and only allow 12_34.

I concur with your assessment.
I believe my second reg-ex is functionally equivalent to the one you propose (test results below). Although I would concede that yours may be easier to grok.


The following match my regex (assuming it's whitespace delimited)

1
1_1
1_2_3_4_5_6_7_8_9_0
1234_45_15
1234567_8_90
123456789_0
1_234567890
12_34567890
123_4567890
1234_567890
12345_67890
123456_7890
1234567_890
12345678_90
123456789_0
123_45_6_789012345_67890

Whereas these do not

_1
1_
_1_
1______1
-12_34
-1234
123_45_6__789012345_67890
1234567890_
_1234567890_
_1234567890
1234567890_
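
Hand-picked samples can only suggest that the two patterns agree. A bounded brute-force comparison gives a bit more confidence. The Python sketch below enumerates every string up to a small length over a reduced alphabet and reports any string on which the two regexes disagree. The alphabet "019_" and the length limit are assumptions chosen to keep the search small ("0" and "9" stand in for the digit classes), so this is evidence of equivalence, not a proof.

```python
import re
from itertools import product

# Second alternative from my earlier post, and the pattern proposed in reply.
OPTIONAL_UNDERSCORE = re.compile(r"(0|[1-9]([_]?[0-9])*)")
PROPOSED = re.compile(r"(0|[1-9]([0-9]*(_[0-9]+)*)?)")


def disagreements(max_len: int = 7, alphabet: str = "019_") -> list[str]:
    """Return every string up to max_len on which the two patterns differ."""
    differing = []
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            text = "".join(chars)
            first = OPTIONAL_UNDERSCORE.fullmatch(text) is not None
            second = PROPOSED.fullmatch(text) is not None
            if first != second:
                differing.append(text)
    return differing


if __name__ == "__main__":
    print(disagreements())
```

An empty list means no counterexample exists within the searched bounds.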

August 18, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by Andre Artus
in reply to Andre Artus

[...]

For fun I made a scanner rule that forces BinaryInteger to conform to a power-of-2 grouping of nibbles. I think it loses its clarity after 16 bits.

I made the underscore optional between nibbles, but required for groups of 2 bytes and above.

Some passing cases from my test inputs.

0b00010001
0b0001_0001
0b00010001_0001_0001
0b00010001_00010001
0b0001_0001_00010001
0b00010001_00010001
0b00010001_00010001_00010001_00010001
0b00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001
0b00010001_00010001_00010001_00010001_00010001_0001_0001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001

It loses some of the value of arbitrary grouping, specifically the ability to group bits in a bitmask by function.
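
The scanner rule itself is not shown here, so the following Python sketch is only one possible reading of the description, not the actual rule. It assumes that a byte is eight binary digits with an optional underscore between its two nibbles, that consecutive bytes must be separated by exactly one underscore, and that the total number of bytes must be a power of two. The names BYTE, GROUPED and is_grouped_binary are hypothetical.

```python
import re

# Assumption: one byte is 8 binary digits, optionally written as 4_4.
BYTE = r"[01]{4}_?[01]{4}"
# Assumption: bytes are joined by a mandatory single underscore.
GROUPED = re.compile(rf"0b{BYTE}(?:_{BYTE})*")


def is_grouped_binary(text: str) -> bool:
    """Check the assumed grouping rule: whole bytes, power-of-two count."""
    if GROUPED.fullmatch(text) is None:
        return False
    # Count digits after the "0b" prefix; the regex guarantees whole bytes.
    byte_count = sum(ch in "01" for ch in text[2:]) // 8
    # A positive integer is a power of two iff it has exactly one bit set.
    return (byte_count & (byte_count - 1)) == 0


if __name__ == "__main__":
    samples = [
        "0b00010001",
        "0b0001_0001",
        "0b00010001_0001_0001",
        "0b0001_0001_00010001",
        "0b00010001_00010001_00010001",  # three bytes
    ]
    for literal in samples:
        print(literal, is_grouped_binary(literal))
```

Under these assumptions the shorter passing cases listed above are accepted, while a three-byte literal or two bytes written without a separating underscore are rejected.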
