[dox] Fixing the lexical rule for BinaryInteger - D Programming Language Discussion Forum

August 16, 2013

[dox] Fixing the lexical rule for BinaryInteger

Posted by Andre Artus

The documentation on the lexical rules for BinaryInteger (http://dlang.org/lex.html#BinaryInteger) has a few issues:

> BinaryInteger:
>    BinPrefix BinaryDigits

The nonterminal BinaryDigits does not exist.


> BinaryDigitsUS:
>    BinaryDigitUS
>    BinaryDigitUS BinaryDigitsUS

The construction for BinaryDigitsUS currently allows for the following:

_(_)*, e.g. 0b_, 0b__, 0b___ etc.


These are clearly not allowed by the compiler.


I have put up a change on GitHub [1], but there is a clear problem. The DMD compiler allows for any of the following (reduced cases):

a. 0b__1
b. 0b_1_
c. 0b1__

My change disallows the second case (b), but it is in line with how the other integers are specified.

This is a specification problem (limitation of BNF), not an implementation problem. In plain English one would just say that the BinaryDigitsUS sequence should contain at least one BinaryDigit character.
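The plain-English rule can be sketched as a small hand-written check. This is only a minimal sketch: `isLenientBinaryInteger` is a hypothetical helper made up for illustration, not DMD's lexer code, and it validates a whole string rather than tokenizing source text. It assumes the prefix may be `0b` or `0B`.

```d
import std.stdio : writeln;

// Hypothetical helper for illustration; not part of DMD or Phobos.
// Accepts: a binary prefix, then any mix of '0', '1' and '_',
// provided at least one binary digit appears somewhere in the sequence.
bool isLenientBinaryInteger(string s)
{
    // Prefix plus at least one more character.
    if (s.length < 3 || s[0] != '0' || (s[1] != 'b' && s[1] != 'B'))
        return false;

    bool sawDigit = false;
    foreach (char c; s[2 .. $])
    {
        if (c == '0' || c == '1')
            sawDigit = true;
        else if (c != '_')
            return false; // only 0, 1 and _ may follow the prefix
    }
    return sawDigit; // underscores alone are not enough
}

void main()
{
    // The three reduced cases (a, b, c), then two underscore-only/empty cases.
    foreach (lit; ["0b__1", "0b_1_", "0b1__", "0b_", "0b"])
        writeln(lit, " -> ", isLenientBinaryInteger(lit));
}
```

Under these assumptions the three reduced cases are accepted, while a prefix followed only by underscores (or by nothing) is rejected.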

I'm busy working on the HexadecimalInteger, which has related issues.

1. https://github.com/andre-artus/dlang.org/blob/LexBinaryDigit/lex.dd

August 16, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by Brian Schott
in reply to Andre Artus

I've been doing some work with the language grammar specification. You may find these resources useful:

http://d.puremagic.com/issues/show_bug.cgi?id=10233
https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

August 16, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by Andre Artus
in reply to Brian Schott

On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
> I've been doing some work with the language grammar specification. You may find these resources useful:
>
> http://d.puremagic.com/issues/show_bug.cgi?id=10233
> https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

You have done impressive work on your grammar; I just have some small issues.

1. I run into a number of errors trying to generate the Java code; I'm using ANTLR 4.1.

2. Your BinaryInteger and HexadecimalInteger only allow for one of the following (reduced) cases:

0b1__ : works
0b_1_ : fails
0b__1 : fails

Same with HexadecimalInteger.

3. The imports don't allow for all cases.

4. How are you handling the scope attribute specifier in the "attribute ':'" case, e.g. "public:"?

There seem to be a few more places where it diverges a bit from what the compiler currently accepts.

I'm not arguing for the wisdom of writing code like what I am about to show, but the following compiles with the current release build of DMD. It may not parse with DGrammar, and will quite likely balk in the scanner:

module main;

public:
static:
import std.stdio;

int main(string[] argv)
{
	auto myBin = 0b0011_1101;

	writefln("%1$x\t%1$.8b\t%1$s", myBin);

	auto myBin2 = 0b_______1;

	writefln("%1$x\t%1$.8b\t%1$s", myBin2);

	auto myBin3 = 0b____1___;

	writefln("%1$x\t%1$.8b\t%1$s", myBin3);

	auto myHex1 = 0x1__;
	writefln("%1$x\t%1$.8b\t%1$s", myHex1);

	auto myHex2 = 0x_1_;
	writefln("%1$x\t%1$.8b\t%1$s", myHex2);

	auto myHex3 = 0x__1;
	writefln("%1$x\t%1$.8b\t%1$s", myHex3);

	return 0;
}

August 16, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by Brian Schott
in reply to Andre Artus

On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
> On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
>> I've been doing some work with the language grammar specification. You may find these resources useful:
>>
>> http://d.puremagic.com/issues/show_bug.cgi?id=10233
>> https://github.com/Hackerpilot/DGrammar/blob/master/D.g4
>
> You have done impressive work on your grammar; I just have some small issues.
>
> 1. I run into a number of errors trying to generate the Java code; I'm using ANTLR 4.1.

I'm aware of that. If you're able to get ANTLR to actually produce a working parser for D I'd be happy to merge your pull request. I haven't been able to get any parser generators to work for D.

> 2. Your BinaryInteger and HexadecimalInteger only allow for one of the following (reduced) cases:
>
> 0b1__ : works
> 0b_1_ : fails
> 0b__1 : fails

It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits, but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.

> Same with HexadecimalInteger.
>
> 3. The imports don't allow for all cases.

https://github.com/Hackerpilot/DGrammar/issues

> 4. How are you handling the scope attribute specifier in the "attribute ':'" case, e.g. "public:"?
>
> There seem to be a few more places where it diverges a bit from what the compiler currently accepts.
>
> I'm not arguing for the wisdom of writing code like what I am about to show, but the following compiles with the current release build of DMD. It may not parse with DGrammar, and will quite likely balk in the scanner:
>
> module main;
>
> public:
> static:
> import std.stdio;
>
> int main(string[] argv)
> {
> 	auto myBin = 0b0011_1101;
>
> 	writefln("%1$x\t%1$.8b\t%1$s", myBin);
>
> 	auto myBin2 = 0b_______1;
>
> 	writefln("%1$x\t%1$.8b\t%1$s", myBin2);
>
> 	auto myBin3 = 0b____1___;
>
> 	writefln("%1$x\t%1$.8b\t%1$s", myBin3);
>
> 	auto myHex1 = 0x1__;
> 	writefln("%1$x\t%1$.8b\t%1$s", myHex1);
>
> 	auto myHex2 = 0x_1_;
> 	writefln("%1$x\t%1$.8b\t%1$s", myHex2);
>
> 	auto myHex3 = 0x__1;
> 	writefln("%1$x\t%1$.8b\t%1$s", myHex3);
>
> 	
> 	return 0;
> }

I wrote that grammar as part of my work on DCD and DScanner. My lexer, parser, and AST library need some more testing. Please download DScanner and run it with either the --ast or --syntaxCheck options. If you find issues, please report them on Github.

August 16, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by Andre Artus
in reply to Brian Schott

On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
> I've been doing some work with the language grammar specification. You may find these resources useful:
>
> http://d.puremagic.com/issues/show_bug.cgi?id=10233
> https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

I have fixed up a few issues in DGrammar.g4; I will put them up on GitHub if you are interested.

According to the Definitive ANTLR Reference, the following words are reserved in ANTLR grammars:
import, fragment, lexer, parser, grammar, returns, locals, throws, *catch*, *finally*, mode, options, tokens.

The two I marked above caused problems when generating.

I don't know whether you are in the middle of trying to fix the indirect left recursion issue but I see that terminals tied to "unaryExpression" are duplicated all over the place. I can fix the recursion issue, and clean up the dups if that would help you.

August 16, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by Brian Schott
in reply to Andre Artus

On Friday, 16 August 2013 at 23:07:38 UTC, Andre Artus wrote:
> On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
>> I've been doing some work with the language grammar specification. You may find these resources useful:
>>
>> http://d.puremagic.com/issues/show_bug.cgi?id=10233
>> https://github.com/Hackerpilot/DGrammar/blob/master/D.g4
>
> I have fixed up a few issues in DGrammar.g4; I will put them up on GitHub if you are interested.
>
> According to the Definitive ANTLR Reference, the following words are reserved in ANTLR grammars:
> import, fragment, lexer, parser, grammar, returns, locals, throws, *catch*, *finally*, mode, options, tokens.
>
> The two I marked above caused problems when generating.

I must have missed those when I pulled the grammar out of my parser's DDOC comments.

> I don't know whether you are in the middle of trying to fix the indirect left recursion issue but I see that terminals tied to "unaryExpression" are duplicated all over the place. I can fix the recursion issue, and clean up the dups if that would help you.

It would. I'm not actively working on that grammar.

August 16, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by Andre Artus
in reply to Brian Schott

-- SNIP --

> I wrote that grammar as part of my work on DCD and DScanner. My lexer, parser, and AST library need some more testing. Please download DScanner and run it with either the --ast or --syntaxCheck options. If you find issues, please report them on Github.

I forked just under an hour ago; am I on old bits?
I have fixed all but one of the build issues. I would like to fix the last one before I commit, as I don't like to leave my repos in a broken state.

I'll continue the discussion on GitHub.

August 17, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by H. S. Teoh
in reply to Brian Schott

On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
> On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
[...]
> >2. Your BinaryInteger and HexadecimalInteger only allow for one of
> >the following (reduced) cases:
> >
> >0b1__ : works
> >0b_1_ : fails
> >0b__1 : fails
> 
> It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits, but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.
[...]

I remember reading this part of the spec on dlang.org, and I wonder if it was worded the way it is just for simplicity, because to specify something like "_ must appear between digits" involves some complicated BNF rules, which maybe seems like overkill for a single literal.

But sometimes it is good to be precise, if we want to enforce "proper" conventions for underscores:

<binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>

<binaryDigits> ::= <binaryDigit> <binaryDigits>
		| <binaryDigit>

<underscoreBinaryDigits> ::= ""
		| "_" <binaryDigits>
		| "_" <binaryDigits> <underscoreBinaryDigits>

<binaryDigit> ::= "0"
		| "1"

This BNF spec forces "_" to appear only between two binary digits, and never more than a single _ in a row. You can also make your parser pick up only <binaryDigit> when performing semantic analysis on binary literals, so the other stuff is ignored and only serves to enforce syntax.
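As a rough illustration of the strict rule and of picking up only the digits afterwards, here is a minimal sketch. Both `isStrictBinaryLiteral` and `binaryValue` are hypothetical names invented for this example; they work on a whole string, accept only the lowercase "0b" prefix used in the BNF above, and do no overflow checking.

```d
import std.stdio : writeln;

// Hypothetical helper: accepts "0b", then digit groups separated by
// single underscores. Every '_' must sit between two binary digits.
bool isStrictBinaryLiteral(string s)
{
    if (s.length < 3 || s[0 .. 2] != "0b")
        return false;

    bool prevWasDigit = false;
    foreach (char c; s[2 .. $])
    {
        if (c == '0' || c == '1')
            prevWasDigit = true;
        else if (c == '_' && prevWasDigit)
            prevWasDigit = false; // a digit must follow before the next '_' or the end
        else
            return false; // leading or doubled '_', or a foreign character
    }
    return prevWasDigit; // the literal must end on a digit
}

// Hypothetical helper: computes the value from the digits only,
// skipping every '_'. Assumes the text after the two-character prefix
// contains nothing but '0', '1' and '_'.
ulong binaryValue(string s)
{
    ulong value = 0;
    foreach (char c; s[2 .. $])
        if (c == '0' || c == '1')
            value = value * 2 + (c - '0');
    return value;
}

void main()
{
    foreach (lit; ["0b0011_1101", "0b1_1", "0b1__1", "0b_1", "0b1_"])
        writeln(lit, " strict? ", isStrictBinaryLiteral(lit));

    writeln(binaryValue("0b0011_1101"));
}
```

Keeping validation and value extraction separate mirrors the suggestion above: the syntax rule decides where underscores may go, while the value depends only on the digits.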

I'd be surprised if there's any D code out there that doesn't fit this spec, to be honest.

But if you want to accept "strange" literals like 0b__1__, you could do something like:

<binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>

<underscoreBinaryDigits> ::= "_"
		| "_" <underscoreBinaryDigits>
		| <binaryDigit>
		| <binaryDigit> <underscoreBinaryDigits>
		| ""

<binaryDigit> ::= "0"
		| "1"

The odd form of the rule for <binaryLiteral> is to ensure that there's at least one binary digit in the string, whereas <underscoreBinaryDigits> is just a wildcard anything-goes rule that takes any combination of 0, 1, and _, including the empty string.

T

-- 
That's not a bug; that's a feature!

August 17, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by H. S. Teoh

On Fri, Aug 16, 2013 at 05:50:24PM -0700, H. S. Teoh wrote: [...]
> <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
> 
> <binaryDigits> ::= <binaryDigit> <binaryDigits>
> 		| <binaryDigit>
> 
> <underscoreBinaryDigits> ::= ""
> 		| "_" <binaryDigits>
> 		| "_" <binaryDigits> <underscoreBinaryDigits>
> 
> <binaryDigit> ::= "0"
> 		| "1"

Regex equivalent:

	0b(0|1)(0|1)*(_(0|1)(0|1)*)*


[...]
> <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>
> 
> <underscoreBinaryDigits> ::= "_"
> 		| "_" <underscoreBinaryDigits>
> 		| <binaryDigit>
> 		| <binaryDigit> <underscoreBinaryDigits>
> 		| ""
> 
> <binaryDigit> ::= "0"
> 		| "1"
[...]

Regex equivalent:

	0b(0|1|_)*(0|1)(0|1|_)*
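The two regexes can be tried against the reduced cases from earlier in the thread. This is just a quick sketch using Phobos' std.regex; the helper names `matchesStrict` and `matchesLenient` are made up for illustration, and the `^` and `$` anchors are added so that the whole string has to match.

```d
import std.regex : matchFirst, regex;
import std.stdio : writefln;

// Hypothetical helper: the strict form, with single underscores only between digits.
bool matchesStrict(string s)
{
    return !matchFirst(s, regex(`^0b(0|1)(0|1)*(_(0|1)(0|1)*)*$`)).empty;
}

// Hypothetical helper: the lenient form, any mix of 0, 1 and _ with at least one digit.
bool matchesLenient(string s)
{
    return !matchFirst(s, regex(`^0b(0|1|_)*(0|1)(0|1|_)*$`)).empty;
}

void main()
{
    foreach (lit; ["0b0011_1101", "0b1__", "0b_1_", "0b__1", "0b_", "0b"])
        writefln("%-12s strict=%s lenient=%s",
                 lit, matchesStrict(lit), matchesLenient(lit));
}
```

Both forms require at least one binary digit after the prefix, so neither should match a bare "0b".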


T

-- 
"How are you doing?" "Doing what?"

August 17, 2013

Re: [dox] Fixing the lexical rule for BinaryInteger

Posted by Andre Artus
in reply to H. S. Teoh

On Saturday, 17 August 2013 at 00:51:57 UTC, H. S. Teoh wrote:
> On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
>> On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
> [...]
>> >2. Your BinaryInteger and HexadecimalInteger only allow for one of
>> >the following (reduced) cases:
>> >
>> >0b1__ : works
>> >0b_1_ : fails
>> >0b__1 : fails
>> 
>> It's my opinion that the compiler should reject all of these because
>> I think of the underscore as a separator between digits, but I'm
>> constantly fighting the "spec, dmd, and idiom all disagree" issue.
> [...]
>
> I remember reading this part of the spec on dlang.org, and I wonder if
> it was worded the way it is just for simplicity, because to specify
> something like "_ must appear between digits" involves some complicated
> BNF rules, which maybe seems like overkill for a single literal.
>
> But sometimes it is good to be precise, if we want to enforce "proper"
> conventions for underscores:
>
> <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
>
> <binaryDigits> ::= <binaryDigit> <binaryDigits>
> 		| <binaryDigit>
>
> <underscoreBinaryDigits> ::= ""
> 		| "_" <binaryDigits>
> 		| "_" <binaryDigits> <underscoreBinaryDigits>
>
> <binaryDigit> ::= "0"
> 		| "1"
>
> This BNF spec forces "_" to only appear between two binary digits, and never more than a single _ in a row.

Yup, that's the issue. Coding the actual behaviour by hand, or doing it with a regular expression, is close to trivial.

> You can also make your parser only
> pick up <binaryDigit> when performing semantic analysis on binary literals, so
> the other stuff is ignored and only serves to enforce syntax.

Pushing it up to the parser is an option in implementation, but I don't see that making the specification easier (it's 3:40 in the morning here, so I am very likely not thinking too clearly about this).

>
> I'd be surprised if there's any D code out there that doesn't fit this
> spec, to be honest.

It's not what I would call best practice, but the following is possible in the current compiler:

> 	auto myBin1 = 0b0011_1101; // Sane
> 	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
> 	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1

This means that tools built against the documented spec are going to choke on these weird cases. Personally, I would prefer that the more questionable options not be allowed, as they potentially defeat the goal of improving clarity. But that's a breaking change.

>
> But if you want to accept "strange" literals like 0b__1__, you could do
> something like:
>
> <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>
>
> <underscoreBinaryDigits> ::= "_"
> 		| "_" <underscoreBinaryDigits>
> 		| <binaryDigit>
> 		| <binaryDigit> <underscoreBinaryDigits>
> 		| ""
>
> <binaryDigit> ::= "0"
> 		| "1"
>
> The odd form of the rule for <binaryLiteral> is to ensure that there's
> at least one binary digit in the string, whereas
> <underscoreBinaryDigits> is just a wildcard anything-goes rule that
> takes any combination of 0, 1, and _, including the empty string.
>

The rule that matches the DMD compiler is actually very easy to do in ANTLR4, i.e.

BinaryLiteral   : '0b' [_01]* [01] [_01]* ;

I'm a bit too tired to fully pay attention, but it seems you are saying that "0b" (no additional numbers) should match, which I believe it should not (although I admit to not testing this). If it does then I would consider that a bug.

It's not a problem implementing the rule; I am more concerned with documenting it in a clear and unambiguous way so that people building tools from it can get it right. BNF isn't always the easiest way to do so, but it's what's being used.
