Newline character set in the D lexer - NEL

Aug 31, 2020

Cecil Ward

Aug 31, 2020

Dominikus Dittes Scherkl

August 31, 2020

Re: Newline character set in the D lexer - NEL

Posted by Dominikus Dittes Scherkl
in reply to Cecil Ward


On Monday, 31 August 2020 at 01:49:06 UTC, Cecil Ward wrote:
> Would there be any benefit from the following suggestion? Add the character Unicode NEL U+0085 into the set of EndOfLine characters in the lexer ?
>
> Cecil Ward.

I personally think we should have these definitions:

             /*  NUL    EM    SUB */
EndOfFile   = { 0x00 | 0x19 | 0x1A | PhysicalEndOfFile };
             /*  LF     FF     CR      CR LF     NEL     LSEP     PSEP  */
EndOfLine   = { 0x0A | 0x0C | 0x0D | 0x0D 0x0A | 0x85 | 0x2028 | 0x2029 | EndOfFile };

             /*  HT     VT     SP    NBSP    NQSP     MQSP     ENSP     EMSP     3/MSP */
WhiteSpace  = { 0x09 | 0x0B | 0x20 | 0xA0 | 0x2000 | 0x2001 | 0x2002 | 0x2003 | 0x2004

             /*  4/MSP    6/MSP     FSP      PSP     THSP      HSP     ZWSP     NNBSP */
              | 0x2005 | 0x2006 | 0x2007 | 0x2008 | 0x2009 | 0x200A | 0x200B | 0x202F

             /*  MMSP      WJ      IDSP    ZWNBSP */
              | 0x205F | 0x2060 | 0x3000 | 0xFEFF | EndOfLine };

The definition of D source files misses quite a lot of them :-(

EM = end of medium (what, if not this, should end a file?!?)
NEL = Next Line
LSEP = Line Separator
PSEP = Paragraph Separator

NBSP = non-breaking space
NQSP = ENSP = N-wide space
MQSP = EMSP = M-wide space
3/MSP = 1/3 M-wide space (three spaces together are as wide as an M)
4/MSP = 1/4 M-wide space
6/MSP = 1/6 M-wide space
FSP = figure space
PSP = punctuation space
THSP = thin space
HSP = hair space
ZWSP = zero width space
NNBSP = narrow non-breaking space
MMSP = medium mathematical space
WJ = word joiner (zero-width character that prevents a line break between the words it joins)
IDSP = ideographic space (same width as a Chinese character)
ZWNBSP = zero-width non-breaking space
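
For illustration only, the proposed sets can be sketched as a small table-driven classifier. This is Python for brevity, not code from the D front end, and the names END_OF_FILE, END_OF_LINE, WHITE_SPACE and classify are made up for this sketch. It assumes the input has already been decoded to code points and reports the most specific category, treating CR LF as one line ending.

```python
# Hypothetical sketch of the proposed sets; values are code points.
END_OF_FILE = frozenset({0x00, 0x19, 0x1A})            # NUL, EM, SUB
END_OF_LINE = frozenset({0x0A, 0x0C, 0x0D,             # LF, FF, CR
                         0x85, 0x2028, 0x2029}) | END_OF_FILE
WHITE_SPACE = frozenset({0x09, 0x0B, 0x20, 0xA0,       # HT, VT, SP, NBSP
                         *range(0x2000, 0x200C),       # U+2000 .. U+200B
                         0x202F, 0x205F, 0x2060,
                         0x3000, 0xFEFF}) | END_OF_LINE


def classify(text, i):
    """Return (category, length in code points) for the item at text[i]."""
    if i >= len(text):
        return ("EndOfFile", 0)        # PhysicalEndOfFile
    cp = ord(text[i])
    if cp in END_OF_FILE:
        return ("EndOfFile", 1)
    if cp == 0x0D and text[i + 1:i + 2] == "\n":
        return ("EndOfLine", 2)        # CR LF is a single line ending
    if cp in END_OF_LINE:
        return ("EndOfLine", 1)
    if cp in WHITE_SPACE:
        return ("WhiteSpace", 1)
    return ("Other", 1)
```

With these tables, classify("a\u0085b", 1) reports an EndOfLine of length 1, which is the behaviour the original suggestion asks for.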

On Monday, 31 August 2020 at 01:49:06 UTC, Cecil Ward wrote:
> Would there be any benefit from the following suggestion? Add the character Unicode NEL U+0085 into the set of EndOfLine characters in the lexer ?
>
> Cecil Ward.

Pardon me, but why bother when ASCII already gives us all we need for spaces and newlines, with fast decoding (< 80h)?
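
To make the cost question concrete, here is a rough sketch of a byte-level scanner that keeps an ASCII fast path and only looks at multi-byte sequences when the lead byte is 80h or above. It is Python for brevity and is not how the D lexer is actually written; the function names line_end_length and count_line_ends are invented for this example, and the input is assumed to be valid UTF-8. In UTF-8, NEL is the two bytes C2 85, and LSEP and PSEP are E2 80 A8 and E2 80 A9.

```python
def line_end_length(buf, i):
    """Length in bytes of the line ending starting at buf[i], or 0 if none."""
    b = buf[i]
    if b < 0x80:
        # ASCII fast path: no multi-byte decoding needed.
        if b == 0x0A:                                  # LF
            return 1
        if b == 0x0D:                                  # CR or CR LF
            return 2 if buf[i + 1:i + 2] == b"\n" else 1
        return 0
    # Slow path, reached only for non-ASCII bytes.
    if buf[i:i + 2] == b"\xc2\x85":                    # NEL, U+0085
        return 2
    if buf[i:i + 3] in (b"\xe2\x80\xa8", b"\xe2\x80\xa9"):  # LSEP, PSEP
        return 3
    return 0


def count_line_ends(buf):
    """Count line endings in a UTF-8 encoded bytes object."""
    count = 0
    i = 0
    while i < len(buf):
        step = line_end_length(buf, i)
        if step:
            count += 1
        i += step or 1                                 # skip one byte if no match
    return count
```

In this sketch, pure-ASCII source never reaches the slow path, so accepting NEL adds one two-byte comparison only for input that already contains non-ASCII bytes.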

On Monday, 31 August 2020 at 09:39:12 UTC, Nils Lankila wrote:
> On Monday, 31 August 2020 at 01:49:06 UTC, Cecil Ward wrote:
>> Would there be any benefit from the following suggestion? Add the character Unicode NEL U+0085 into the set of EndOfLine characters in the lexer ?
>>
>> Cecil Ward.
>
> Pardon me but why bother while ascii gives already all we need to put spaces and new lines with fast decode (< 80h) ?

D already recognizes some non-ASCII characters as spaces and line separators [1], so the decision to "bother" has already been made.

[1] https://dlang.org/spec/lex.html#character_set

I agree with Dominikus.

Note to earlier poster: NEL was used, and just possibly may still be used, by IBM mainframe users; XML 1.1 understands NEL, iirc.

See https://www.w3.org/TR/newline/ and https://www.w3.org/International/questions/qa-controls

On Friday, 4 September 2020 at 00:48:59 UTC, Cecil Ward wrote:
> I agree with Dominikus
>
> Note to earlier poster: NEL was used and just possibly may still be used by IBM mainframe users; XML 1.1 understands NEL iirc;
>
> see https://www.w3.org/TR/newline/ and
>
> https://www.w3.org/International/questions/qa-controls

Given the lack of answers, I would suggest going ahead with a PR, or at least opening an issue. Lexing is not a big deal, but if nobody cares this will never be done.

On Friday, 4 September 2020 at 05:28:47 UTC, NilsLankila wrote:
>
> Given the lack of answers I would suggest to go ahead with a PR or at least open an issue. Lexing is not a big deal but if nobody cares this will never be done.

Agreed, Nils. Mind you, someone cared enough to include U+2028 and U+2029 in the lexer spec.

I have no idea how to initiate a "PR". Perhaps someone could help me with this?

PR = "Pull Request". The easy way is to fork the project on GitHub, clone your fork, make your changes, and push them back. The changes can go in ~master on your own fork or, ideally, in a separate branch.

Then, on GitHub, go to the original project and start a new pull request. It should automagically detect that you've made changes (again, ideally in a branch of your fork) and offer to make a pull request with your changes against ~master (or whatever is set as the default branch for the project).

James

On 9/8/20 2:42 AM, Cecil Ward wrote:
> On Friday, 4 September 2020 at 05:28:47 UTC, NilsLankila wrote:
>>
>> Given the lack of answers I would suggest to go ahead with a PR or at least open an issue. Lexing is not a big deal but if nobody cares this will never be done.
>
> Agreed, Nils. Mind you someone cared enough to include U+2028 and U+2029 in the lexer spec.
>
> I have no idea how to initiate a "PR". Perhaps someone could help me with this?
