suggestion: clean white space / end of line definition
October 28, 2006
Current definition (http://www.digitalmars.com/d/lex.html):
> EndOfLine:
>	\u000D
>	\u000A
>	\u000D \u000A
>	EndOfFile
>
> WhiteSpace:
>	Space
>	Space WhiteSpace
>
> Space:
>	\u0020
>	\u0009
>	\u000B
>	\u000C

DMD's frontend, however, doesn't strictly conform to those definitions:

doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
html.c:351: \u000D and \u000A are treated as spaces too
html.c:683: \u00A0 is treated as a space only if it was encountered via an HTML entity
inifile.c:264: \u000D and \u000A are treated as spaces too
lexer.c:2360: \u000B and \u000C aren't treated as spaces
lexer.c: treats \u2028 and \u2029 as line separators too

The oddest case is entity.c:577:
it treats "\&nbsp;" as "\u0020" instead of "\u00A0"

suggested definition:
> EndOfLine:
>	Unicode(all non-tailorable Line Breaking Classes causing a line break)
>	EndOfFile
>
> WhiteSpace:
>	Space
>	Space WhiteSpace
>
> Space:
>	( Unicode(General_Category == Space_Separator)
>		|| Unicode(Bidi_Class == Segment_Separator)
>		|| Unicode(Bidi_Class == Whitespace)
>	) && !EndOfLine

this expands to:
> EndOfLine:
>	000A		// LINE FEED
>	000B		// LINE TABULATION
>	000C		// FORM FEED
>	000D		// CARRIAGE RETURN
>	000D 000A	// CARRIAGE RETURN followed by LINE FEED
>	0085		// NEXT LINE
>	2028		// LINE SEPARATOR
>	2029		// PARAGRAPH SEPARATOR
>
> Space:
>	Unicode(General_Category == Space_Separator) && !EndOfLine
>		0020       // SPACE
>		00A0       // NO-BREAK SPACE
>		1680       // OGHAM SPACE MARK
>		180E       // MONGOLIAN VOWEL SEPARATOR
>		2000..200A // EN QUAD..HAIR SPACE
>		202F       // NARROW NO-BREAK SPACE
>		205F       // MEDIUM MATHEMATICAL SPACE
>		3000       // IDEOGRAPHIC SPACE
>
>	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
>		0009	// CHARACTER TABULATION
>		001F	// INFORMATION SEPARATOR ONE
>
>	Unicode(Bidi_Class == Whitespace) && !EndOfLine
>		<all part of the Space_Separator listing>
>

Thomas

October 31, 2006
Thomas Kuehne wrote:
> DMD's frontend, however, doesn't strictly conform to those definitions:
> 
> doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
> html.c:351: \u000D and \u000A are treated as spaces too
> html.c:683: \u00A0 is treated as a space only if it was encountered via an HTML entity
> inifile.c:264: \u000D and \u000A are treated as spaces too
> lexer.c:2360: \u000B and \u000C aren't treated as spaces
> lexer.c: treats \u2028 and \u2029 as line separators too
> 
> The oddest case is entity.c:577:
> it treats "\&nbsp;" as "\u0020" instead of "\u00A0"

Thanks, I'll try to get those fixed.


> suggested definition:
>> EndOfLine:
>> 	Unicode(all non-tailorable Line Breaking Classes causing a line break)
>> 	EndOfFile
>>
>> WhiteSpace:
>> 	Space
>> 	Space WhiteSpace
>>
>> Space:
>> 	( Unicode(General_Category == Space_Separator)
>> 		|| Unicode(Bidi_Class == Segment_Separator)
>> 		|| Unicode(Bidi_Class == Whitespace)
>> 	) && !EndOfLine
> 
> this expands to:
>> EndOfLine:
>> 	000A		// LINE FEED
>> 	000B		// LINE TABULATION
>> 	000C		// FORM FEED
>> 	000D		// CARRIAGE RETURN
>> 	000D 000A	// CARRIAGE RETURN followed by LINE FEED
>> 	0085		// NEXT LINE
>> 	2028		// LINE SEPARATOR
>> 	2029		// PARAGRAPH SEPARATOR
>>
>> Space:
>> 	Unicode(General_Category == Space_Separator) && !EndOfLine
>> 		0020       // SPACE
>> 		00A0       // NO-BREAK SPACE
>> 		1680       // OGHAM SPACE MARK
>> 		180E       // MONGOLIAN VOWEL SEPARATOR
>> 		2000..200A // EN QUAD..HAIR SPACE
>> 		202F       // NARROW NO-BREAK SPACE
>> 		205F       // MEDIUM MATHEMATICAL SPACE
>> 		3000       // IDEOGRAPHIC SPACE
>>
>> 	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
>> 		0009	// CHARACTER TABULATION
>> 		001F	// INFORMATION SEPARATOR ONE
>>
>> 	Unicode(Bidi_Class == Whitespace) && !EndOfLine
>> 		<all part of the Space_Separator listing>

Is it really worth doing all that?
October 31, 2006
Walter Bright wrote on 2006-10-31:
> Thomas Kuehne wrote:

<snip>

>> suggested definition:
>>> EndOfLine:
>>> 	Unicode(all non-tailorable Line Breaking Classes causing a line break)
>>> 	EndOfFile
>>>
>>> WhiteSpace:
>>> 	Space
>>> 	Space WhiteSpace
>>>
>>> Space:
>>> 	( Unicode(General_Category == Space_Separator)
>>> 		|| Unicode(Bidi_Class == Segment_Separator)
>>> 		|| Unicode(Bidi_Class == Whitespace)
>>> 	) && !EndOfLine
>> 
>> this expands to:
>>> EndOfLine:
>>> 	000A		// LINE FEED
>>> 	000B		// LINE TABULATION
>>> 	000C		// FORM FEED
>>> 	000D		// CARRIAGE RETURN
>>> 	000D 000A	// CARRIAGE RETURN followed by LINE FEED
>>> 	0085		// NEXT LINE
>>> 	2028		// LINE SEPARATOR
>>> 	2029		// PARAGRAPH SEPARATOR
>>>
>>> Space:
>>> 	Unicode(General_Category == Space_Separator) && !EndOfLine
>>> 		0020       // SPACE
>>> 		00A0       // NO-BREAK SPACE
>>> 		1680       // OGHAM SPACE MARK
>>> 		180E       // MONGOLIAN VOWEL SEPARATOR
>>> 		2000..200A // EN QUAD..HAIR SPACE
>>> 		202F       // NARROW NO-BREAK SPACE
>>> 		205F       // MEDIUM MATHEMATICAL SPACE
>>> 		3000       // IDEOGRAPHIC SPACE
>>>
>>> 	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
>>> 		0009	// CHARACTER TABULATION
>>> 		001F	// INFORMATION SEPARATOR ONE
>>>
>>> 	Unicode(Bidi_Class == Whitespace) && !EndOfLine
>>> 		<all part of the Space_Separator listing>
>
> Is it really worth doing all that?

What is actually changing for EndOfLine?
	000A	new
	000B	formerly white space
	000C	formerly white space
	0085	new
	2028	implemented but undocumented
	2029	implemented but undocumented

\v and \f were probably defined as white space due to
C's isspace. Please note, however, that \r and \n are recognised
by isspace too. Implementing 2028 and 2029 seems implicit due to
the use of UTF encodings.

All the different line endings can be converted to '\n' for
non-UTF-8 D files in Module::parse. UTF-8 encoded HTML sources
can use a similar approach in html.c (GDC currently uses an
isLineSeperator there). UTF-8 encoded D files would require
support at
lexer.c: 303,709,763,835,1113,1301,1375,1457,1520,1520,2258,2272,2386.
The alternative and more robust solution would be a 'new line cleanup'
at module.c:485 and a goto from module.c:523. That way, all the
'\r', LS and PS tests sprinkled around lexer.c and html.c could be
removed.

In my opinion the EndOfLine change is well worth it.


The SPACE change was prompted by the broken
00A0 (NO-BREAK SPACE) kludges in html.c and entity.c.
The issue isn't that the idea was bad, but that the
reasons weren't laid out properly. If 00A0 is to be considered
a SPACE, then why 00A0 and not character foo-bar? At least the
2000..200A range will pose the same problem 00A0 did originally.
Using the Unicode standard as the reference would direct all further
debates about whether a character is a space to the Unicode consortium
and keep D out of potentially lengthy debates.

Changes would be required somewhere around lexer.c:490,1331,2218,2368,2375,2404

Using a function like

// returns NULL or end of white space
char* isUniSpace(char*)

would also clean up white space parsing.
lexer.c currently tests for '\t' on 6 occasions,
7 times for ' ' and only 3 times for '\f' and '\v' each.

Thomas


November 01, 2006
There is a problem though with replacing it all with a function - lexing speed. Lexing speed is critically dependent on being able to consume whitespace fast, hence all the inline code to do it. Running the source through two passes makes it half as fast.
November 01, 2006
Walter Bright wrote on 2006-11-01:
> There is a problem though with replacing it all with a function - lexing speed. Lexing speed is critically dependent on being able to consume whitespace fast, hence all the inline code to do it. Running the source through two passes makes it half as fast.

Here is a faster mock-up (untested!) using functions.
Use of macros is certainly possible too.

Thomas

# unsigned char* isEndOfLine(unsigned char* input){
#     switch(input[0]){
#	 /* covered by the lexer:
#	  * case 0x0A:    // LINE FEED
#	  */
#	 case 0x0B:    // LINE TABULATION
#	 case 0x0C:    // FORM FEED
#	     return input;
#	 case 0x0D:    // CARRIAGE RETURN
#	     if(input[1] == 0x0A){
#		 return input + 1;
#	     }
#	     return input;
#	 case 0xC2:    // NEXT LINE
#	     if(input[1] == 0x85){
#		 return input + 1;
#	     }
#	     break;
#	 case 0xE2:    // LINE SEPARATOR || PARAGRAPH SEPARATOR
#	     if((input[1] == 0x80) && ((input[2] == 0xA8) || (input[2] == 0xA9))){
#		 return input + 2;
#	     }
#	     break;
#	 default:
#	     break;
#     }
#
#     return 0;
# }
#
# unsigned char* isSpace(unsigned char* input){
#     switch(input[0]){
#	 /* covered by the lexer:
#	  * case 0x20:    // SPACE
#	  */
#	 case 0x09:    // CHARACTER TABULATION
#	 case 0x1F:    // INFORMATION SEPARATOR ONE
#	     return input;
#	 case 0xC2:
#	     if(input[1] == 0xA0){
#		 // NO-BREAK SPACE
#		 return input + 1;
#	     }
#	     break;
#	 case 0xE1:
#	     switch(input[1]){
#		 case 0x9A:	// U+1680 is E1 9A 80 in UTF-8, not E1 A9 80
#		     if(input[2] == 0x80){
#			 // OGHAM SPACE MARK
#			 return input + 2;
#		     }
#		     break;
#		 case 0xA0:
#		     if(input[2] == 0x8E){
#			 // MONGOLIAN VOWEL SEPARATOR
#			 return input + 2;
#		     }
#		     break;
#		 default:
#		     break;
#	     }
#	     break;
#	 case 0xE2:
#	     switch(input[1]){
#		 case 0x80:
#		     if((0x80 <= input[2]) && (input[2] <= 0x8A)){
#			 // EN QUAD..HAIR SPACE
#			 return input + 2;
#		     }else if(input[2] == 0xAF){
#			 // NARROW NO-BREAK SPACE
#			 return input + 2;
#		     }
#		     break;
#		 case 0x81:
#		     if(input[2] == 0x9F){
#			 // MEDIUM MATHEMATICAL SPACE
#			 return input + 2;
#		     }
#		     break;
#		 default:
#		     break;
#	     }
#	     break;
#	 case 0xE3:
#	     if((input[1] == 0x80) && (input[2] == 0x80)){
#		 // IDEOGRAPHIC SPACE
#		 return input + 2;
#	     }
#	     break;
#	 default:
#	     break;
#     }
#     return 0;
# }
#
# void lexer(){
#     unsigned char* p;
#     unsigned char* tmp;
#     while (1)
#     {
#	 switch (*p)
#	 {
#	     Lspace:
#	     case ' ':
#		 p++;
#		 continue;	    // skip white space
#
#	     Lnew_line:
#	     case '\n':
#		 p++;
#		 //loc.linnum++;
#		 continue;	    // skip white space
#
# /* a lot more code goes here */
#
#	     default:
#		 if((tmp = isEndOfLine(p))){
#		     p = tmp;
#		     goto Lnew_line;
#		 }
#		 if((tmp = isSpace(p))){
#		     p = tmp;
#		     goto Lspace;
#		 }
#
# /* a lot more code goes here */
#	 }
#     }
# }

November 02, 2006
Thomas Kuehne wrote:
> Here is a faster mock-up

(Apologies in advance, and totally ignoring the good code, standards compliance and some other good things,) I have to ask:

Is this a Good Thing?

Admittedly not having thought through this issue myself, all I have is a gut feeling. But that gut feeling says that source code (especially in a systems language in the C family) should strive to hinder all kinds of Funny Stuff from entering the toolchain.

Accepting "foreign" characters within strings (and possibly even in comments) is OK; accepting them in the source code itself is my issue here.

We can already have variable names in D written in Afghan and Negro-Potamian, which I definitely don't consider a good idea. If we were to follow this line of thought, then the next thing we know somebody might demand to have all D keywords translated to every single language in the bushes, and to have the compiler accept them as Equal synonyms to the Original Keywords. (This actually happened with the CP/M operating system in Finland in the early eighties! You don't want to hear the whole story.)

What will this do to cross-cultural study, reuse, and copying of example code? Won't it eventually compartmentalize almost all code written outside of the Anglo-Centric world? That is, alienate it from us, but also from each of the other cultures too.

And who says parentheses and operators should only be the ones you need a Western keyboard to type? I bet there are cultures that use (or will insist on using, once the rumour is it's possible) some preposterous ink blots instead, for example.

And the next thing of course would be the idiot Humanists who'd demand that a non-breaking space really has to be equal to the underscore "for people think in words, and subjecting humans to CamelCase or under_scored names constitutes deplorable Oppression". And this kind of people refuse to see the [to us] obvious horrible ramifications of it.

And this I wrote in spite of my mother tongue needing non-ASCII characters in every single sentence.

But, as I said at the outset, this is just a gut feeling, so I'm not pressing the issue as if it were something I'd analyzed through-and-through.

---

Now, what is obvious, however, is that the current compiler *should* be consistent with whitespace and the like, instead of haphazardly enumerating some of them each time. No argument there.

November 03, 2006
Georg Wrede wrote on 2006-11-02:
> Thomas Kuehne wrote:
>> Here is a faster mock-up
>
> (Apologies in advance, and totally ignoring the good code, standards compliance and some other good things,) I have to ask:
>
> Is this a Good Thing?
>
> Admittedly not having thought through this issue myself, all I have is a gut feeling. But that gut feeling says that source code (especially in a systems language in the C family) should strive to hinder all kinds of Funny Stuff from entering the toolchain.
>
> Accepting "foreign" characters within strings (and possibly even in comments) is OK, but in the source code itself, that's my issue here.
>
> We can already have variable names in D written in Afghan and Negro-Potamian, which I definitely don't consider a good idea. If we were to follow this line of thought, then the next thing we know somebody might demand to have all D keywords translated to every single language in the bushes, and to have the compiler accept them as Equal synonyms to the Original Keywords. (This actually happened with the CP/M operating system in Finland in the early eighties! You don't want to hear the whole story.)

Keywords are a few "magic" words, teaching those doesn't require any knowledge of the natural language they were taken from. I definitely agree with your view on keywords. The rest however ... sounds like a typical culture-centric view.

Forcing everyone - especially beginners and non-IT people - to use English isn't a viable solution. No, transliteration doesn't cut it:

mama ma ma ma

hint: this is a variant of a Chinese-language joke and involves 4 different characters.

In addition, there are quite a few words and concepts that have no English equivalent. For simplicity's sake, let's use an ASCII-representable German word: "Heimat"

* home - too narrow
* native country - quite often wrong

> What will this do to cross-cultural study, reuse, and copying of example code? Won't it eventually compatmentalize most all code written outside of the Anglo-Centric world? That is, alienate it from us, but also from each of the other cultures too.

That's what coding standards are for. The same reuse issue applies to C/C++ and the preprocessor, and that seems to work reasonably well.

Thomas