Thread overview
Newline character set in the D lexer - NEL
Aug 31, 2020
Cecil Ward
Sep 04, 2020
Cecil Ward
Sep 04, 2020
NilsLankila
Sep 08, 2020
Cecil Ward
Sep 08, 2020
James Blachly
Aug 31, 2020
Nils Lankila
Aug 31, 2020
Paul Backus
August 31, 2020
Would there be any benefit in the following suggestion? Add the Unicode character NEL (U+0085) to the set of EndOfLine characters in the lexer?

Cecil Ward.
August 31, 2020
On Monday, 31 August 2020 at 01:49:06 UTC, Cecil Ward wrote:
> Would there be any benefit in the following suggestion? Add the Unicode character NEL (U+0085) to the set of EndOfLine characters in the lexer?
>
> Cecil Ward.

I personally think we should have these definitions:

             /*  NUL    EM    SUB */
EndOfFile   = { 0x00 | 0x19 | 0x1A | PhysicalEndOfFile };
             /*  LF     FF     CR      CR LF     NEL     LSEP     PSEP  */
EndOfLine   = { 0x0A | 0x0C | 0x0D | 0x0D 0x0A | 0x85 | 0x2028 | 0x2029 | EndOfFile };

             /*  HT     VT     SP    NBSP    NQSP     MQSP     ENSP     EMSP     3/MSP */
WhiteSpace  = { 0x09 | 0x0B | 0x20 | 0xA0 | 0x2000 | 0x2001 | 0x2002 | 0x2003 | 0x2004

             /*  4/MSP    6/MSP     FSP      PSP     THSP      HSP     ZWSP     NNBSP */
              | 0x2005 | 0x2006 | 0x2007 | 0x2008 | 0x2009 | 0x200A | 0x200B | 0x202F

             /*  MMSP      WJ      IDSP    ZWNBSP */
              | 0x205F | 0x2060 | 0x3000 | 0xFEFF | EndOfLine };

The current definition of D source files is missing quite a lot of these :-(

EM = end of medium (what, if not this, should end a file?!)
NEL = Next Line
LSEP = Line Separator
PSEP = Paragraph Separator

NBSP = non-breaking space
NQSP = ENSP = N-wide space
MQSP = EMSP = M-wide space
3/MSP = 1/3 M-wide space (three spaces together are as wide as an M)
4/MSP = 1/4 M-wide space
6/MSP = 1/6 M-wide space
FSP = figure space
PSP = punctuation space
THSP = thin space
HSP = hair space
ZWSP = zero width space
NNBSP = narrow non-breaking space
MMSP = medium mathematical space
WJ = word joiner (zero-width character that prevents a line break between the characters on either side)
IDSP = ideographic space (same width as a Chinese character)
ZWNBSP = zero-width non-breaking space
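
For illustration, the proposed sets can be written down as lookup tables. This is only a minimal Python sketch of the definitions above, not the dmd lexer: the names END_OF_FILE, END_OF_LINE, WHITE_SPACE, terminator_length, is_white_space and split_lines are hypothetical, and PhysicalEndOfFile is modeled simply as running off the end of the string.

```python
# Proposed sets from the definitions above, as code points.
# PhysicalEndOfFile is not a code point; it is handled by the loop bounds.
END_OF_FILE = frozenset({0x00, 0x19, 0x1A})                       # NUL, EM, SUB
END_OF_LINE = frozenset({0x0A, 0x0C, 0x0D, 0x85, 0x2028, 0x2029})  # LF FF CR NEL LSEP PSEP
WHITE_SPACE = frozenset({
    0x09, 0x0B, 0x20, 0xA0,
    *range(0x2000, 0x200C),          # U+2000 .. U+200B
    0x202F, 0x205F, 0x2060, 0x3000, 0xFEFF,
})


def terminator_length(text: str, i: int) -> int:
    """Return how many characters at text[i:] form one line terminator (0 if none)."""
    if text.startswith("\r\n", i):
        return 2                     # CR LF counts as a single EndOfLine
    if i < len(text) and ord(text[i]) in END_OF_LINE:
        return 1
    return 0


def is_white_space(cp: int) -> bool:
    """WhiteSpace includes EndOfLine, which in turn includes EndOfFile."""
    return cp in WHITE_SPACE or cp in END_OF_LINE or cp in END_OF_FILE


def split_lines(text: str) -> list[str]:
    """Split text into lines, stopping at the first EndOfFile character."""
    lines = []
    start = i = 0
    while i < len(text):
        if ord(text[i]) in END_OF_FILE:
            break                    # everything after NUL/EM/SUB is ignored
        n = terminator_length(text, i)
        if n:
            lines.append(text[start:i])
            i += n
            start = i
        else:
            i += 1
    lines.append(text[start:i])      # last line, ended by EndOfFile
    return lines


print(split_lines("a\r\nb\u0085c\u2028d\x1aignored"))  # ['a', 'b', 'c', 'd']
```

Note that under these proposed sets FF (0x0C) also ends a line, which is part of the suggestion above rather than something the sketch adds.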
August 31, 2020
On Monday, 31 August 2020 at 01:49:06 UTC, Cecil Ward wrote:
> Would there be any benefit in the following suggestion? Add the Unicode character NEL (U+0085) to the set of EndOfLine characters in the lexer?
>
> Cecil Ward.

Pardon me, but why bother when ASCII already gives us all we need for spaces and newlines, with fast decoding (< 80h)?
August 31, 2020
On Monday, 31 August 2020 at 09:39:12 UTC, Nils Lankila wrote:
> On Monday, 31 August 2020 at 01:49:06 UTC, Cecil Ward wrote:
>> Would there be any benefit in the following suggestion? Add the Unicode character NEL (U+0085) to the set of EndOfLine characters in the lexer?
>>
>> Cecil Ward.
>
> Pardon me, but why bother when ASCII already gives us all we need for spaces and newlines, with fast decoding (< 80h)?

D already recognizes some non-ASCII characters as spaces and line separators [1], so the decision to "bother" has already been made.

[1] https://dlang.org/spec/lex.html#character_set
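
To illustrate the decoding cost in question: in UTF-8, NEL (U+0085) is the two-byte sequence C2 85, while U+2028 and U+2029 are E2 80 A8 and E2 80 A9, so a scanner can keep a single `< 0x80` test for the ASCII path and only look for the longer sequences behind it. Below is a minimal Python sketch, assuming valid UTF-8 input and assuming NEL were added to EndOfLine as this thread proposes; count_line_terminators is a hypothetical name, not a function from dmd.

```python
# UTF-8 encodings of the non-ASCII line terminators.
NEL = b"\xc2\x85"        # U+0085 (the proposed addition)
LSEP = b"\xe2\x80\xa8"   # U+2028
PSEP = b"\xe2\x80\xa9"   # U+2029


def count_line_terminators(src: bytes) -> int:
    """Count line terminators in UTF-8 source, treating CR LF as one."""
    count = 0
    i = 0
    n = len(src)
    while i < n:
        b = src[i]
        if b < 0x80:
            # ASCII fast path: this one comparison skips all multi-byte checks.
            if b == 0x0A:
                count += 1
            elif b == 0x0D:
                count += 1
                if i + 1 < n and src[i + 1] == 0x0A:
                    i += 1           # swallow the LF of a CR LF pair
            i += 1
        elif src.startswith(NEL, i):
            count += 1
            i += 2
        elif src.startswith((LSEP, PSEP), i):
            count += 1
            i += 3
        else:
            i += 1                   # some other non-ASCII byte, not a terminator
    return count


print(count_line_terminators("a\nb\r\nc\u0085d\u2028e".encode("utf-8")))  # 4
```

Because C2 and E2 are lead bytes and can never appear as continuation bytes in valid UTF-8, stepping one byte at a time through other non-ASCII sequences cannot produce a false match.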
September 04, 2020
I agree with Dominikus.

Note to the earlier poster: NEL was used, and just possibly may still be used, by IBM mainframe users; XML 1.1 understands NEL, IIRC.

See https://www.w3.org/TR/newline/ and
https://www.w3.org/International/questions/qa-controls
September 04, 2020
On Friday, 4 September 2020 at 00:48:59 UTC, Cecil Ward wrote:
> I agree with Dominikus.
>
> Note to the earlier poster: NEL was used, and just possibly may still be used, by IBM mainframe users; XML 1.1 understands NEL, IIRC.
>
> See https://www.w3.org/TR/newline/ and
> https://www.w3.org/International/questions/qa-controls

Given the lack of answers, I would suggest going ahead with a PR, or at least opening an issue. Lexing is not a big deal, but if nobody cares, this will never be done.
September 08, 2020
On Friday, 4 September 2020 at 05:28:47 UTC, NilsLankila wrote:
>
> Given the lack of answers, I would suggest going ahead with a PR, or at least opening an issue. Lexing is not a big deal, but if nobody cares, this will never be done.

Agreed, Nils. Mind you, someone cared enough to include U+2028 and U+2029 in the lexer spec.

I have no idea how to initiate a "PR". Perhaps someone could help me with this?
September 08, 2020
PR = "Pull Request".

The easy way is to fork the project on GitHub, clone your fork, make your changes, and push them back. This can be on ~master of your own fork, or ideally on a separate branch.

Then, on GitHub, go to the original project and start a new pull request. It should automagically detect that you've made changes (again, ideally in a branch of your fork) and offer to make a pull request with your changes against ~master (or whatever is set as the default branch for the project).

James

On 9/8/20 2:42 AM, Cecil Ward wrote:
> On Friday, 4 September 2020 at 05:28:47 UTC, NilsLankila wrote:
>>
>> Given the lack of answers, I would suggest going ahead with a PR, or at least opening an issue. Lexing is not a big deal, but if nobody cares, this will never be done.
> 
> Agreed, Nils. Mind you, someone cared enough to include U+2028 and U+2029 in the lexer spec.
> 
> I have no idea how to initiate a "PR". Perhaps someone could help me with this?