Non-ASCII in the future in the lexer (page 2)


June 01, 2023

Re: Non-ASCII in the future in the lexer

Posted by Quirin Schroll
in reply to Cecil Ward

TL;DR: What you want can be gained using smart fonts or other smart UI tools.

On Wednesday, 31 May 2023 at 06:23:43 UTC, Cecil Ward wrote:

> Unicode has been around for 30 years now and yet it is not getting fully used in programming languages for example. We are still stuck in our minds with ASCII only. Should we in future start mining the riches of unicode when we make changes to the grammar of programming languages (and other grammars)?

The gain is too little for the cost. In some circumstances the gain is even negative, and that will happen exactly where it is most unfortunate.

> Would it be worthwhile considering wider unicode alternatives for keywords that we already have? Examples: comparison operators and other operators. We have unicode symbols for
>
> ≤ less than or equal <=
> ≥ greater than or equal >=
>
> a proper multiplication sign ‘×’, like an x, as well as the * that we have been stuck with since the beginning of time.
>
> ± plus or minus might come in useful someday, can’t think what for.

I can: ± could be used for in-place negation. Let’s say you have:

ref int f(); // is costly or has side-effects

To negate the result in-place, you have to do:

int* p = &f();
*p = -*p;

or

(ref int x) { x = -x; }(f());
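
For illustration, the second workaround can be given a name, so that `f` is still evaluated only once. This is a minimal sketch: `negate`, `storage` and `calls` are names invented here (not from the post or from Phobos), and `f` is a stand-in for the costly function.

```d
// Hypothetical stand-ins for the costly function from the post.
int storage = 5;
int calls = 0;

ref int f()
{
    ++calls;          // record how often the "costly" function runs
    return storage;
}

// Negates its argument in place; accepts any int lvalue,
// including the result of a function that returns by ref.
void negate(ref int x)
{
    x = -x;
}

unittest
{
    int before = storage;
    int callsBefore = calls;
    negate(f());
    assert(storage == -before);
    assert(calls == callsBefore + 1); // f was evaluated exactly once
}
```

A dedicated operator would save the helper, but the helper already avoids the pointer temporary.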

> I have … as one character; would be nice to have that as an alternative to .. (two ASCII fullstops) maybe?
>
> I realise that this issue is hardly about the cure for world peace, but there seems to be little reason to be confined to ASCII forever when there are better suited alternatives and things that might spark the imagination of designers.

The problem is fonts that don’t support certain characters and editors that default to legacy encodings. One can still decipher FranÃ§ais, but a Ã— b (UTF-8 read as Windows-1252) is a problem, because who knows what the character was.

It’s not that the gain is rather little, it’s the potential for high cost. A lot of people will avoid those like the plague because of legacy issues.
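
The garbling described above can be reproduced with a short sketch. It assumes Phobos' `std.encoding` (`transcode` and `Windows1252String`); the function name `misreadAsWindows1252` is made up for this example. Non-ASCII characters are written as escapes so the example itself survives a wrong encoding.

```d
import std.encoding : transcode, Windows1252String;

// Reinterprets the UTF-8 bytes of `utf8` as Windows-1252 text and converts
// the result back to UTF-8, i.e. what a misconfigured editor would display.
string misreadAsWindows1252(string utf8)
{
    auto legacy = cast(Windows1252String) utf8; // same bytes, different label
    string shown;
    transcode(legacy, shown);
    return shown;
}

unittest
{
    // U+00E7 is C3 A7 in UTF-8; read byte by byte that is U+00C3, U+00A7.
    assert(misreadAsWindows1252("Fran\u00E7ais") == "Fran\u00C3\u00A7ais");
    // U+00D7 is C3 97 in UTF-8; byte 97 is U+2014 in Windows-1252.
    assert(misreadAsWindows1252("a \u00D7 b") == "a \u00C3\u2014 b");
}
```

In the first case the surrounding letters make the original word guessable; in the second, nothing in the context says which operator was meant.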

> One extreme case or two: Many editors now automatically employ ‘ ’ supposed to be 6-9 quotes, instead of ASCII '', so too with “ ” (6-9 matching pair).

Many document processors do that. Whoever writes code in them is doing it wrong.

> When Walter was designing the literal strings lexical items many items needed to be found for all the alternatives. And we have « » which are familiar to French speakers? It would be very nice not to fall over on 6-9 quotes anyway, and just accept them as an alternative.

Accepting them is one possibility. Having an editor that replaces “” by "" and ‘’ by '' is another. Any regex-replace can easily be used for that: replace ‘([^’]*)’ by '$1'.
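
As a sketch of that replacement, assuming Phobos' `std.regex` (the function name `asciiQuotes` is invented for this example, and the double-quote pattern is added by analogy with the single-quote one given above):

```d
import std.regex : regex, replaceAll;

// Replaces matched pairs of typographic quotes with ASCII quotes.
string asciiQuotes(string text)
{
    // Single quotes U+2018 ... U+2019, the pattern from the post.
    text = replaceAll(text, regex(`‘([^’]*)’`), "'$1'");
    // Double quotes U+201C ... U+201D, by analogy.
    text = replaceAll(text, regex(`“([^”]*)”`), `"$1"`);
    return text;
}

unittest
{
    assert(asciiQuotes("writeln(“hi”); char c = ‘x’;")
        == `writeln("hi"); char c = 'x';`);
}
```

Only matched pairs are rewritten, so a lone typographic apostrophe inside a comment is left alone.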

> The second case that comes to mind: I was thinking about regex grammars and XML’s grammar, and I think one or both can now handle all kinds of unicode whitespace.

Definitely not regex. It’s not standardized at all.

XML is quite a non-problem because it directly supports specifying an encoding.

> That’s the kind of thinking I’m interested in. It would be good to handle all kinds of whitespace, as we do all kinds of newline sequences. We probably already do both well. And no one complains saying ‘we ought not bother with tab’, so handling U+0085 and the various whitespace types such as &nbsp in our lexicon of our grammar is to me a no-brainer.
>
> And what use might we find some day for § and ¶ ? Could be great for some new exotic grammatical structural pattern. Look at the mess that C++ got into with the syntax of templates. They needed something other than < >. Almost anything. They could have done no worse with « ».
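
To make the whitespace suggestion in the quoted paragraph concrete, here is a minimal sketch of a token splitter that treats U+0085 and the no-break space like a tab. It relies on `std.uni.isWhite` from Phobos; `tokenize` is a name invented for this example, and this is not how D's actual lexer is written.

```d
import std.uni : isWhite;

// Splits `src` into runs of non-whitespace characters, where "whitespace"
// means any character accepted by std.uni.isWhite.
string[] tokenize(string src)
{
    string[] tokens;
    size_t start = 0;
    bool inToken = false;
    // Decoding foreach: `i` is the byte offset at which `c` begins.
    foreach (i, dchar c; src)
    {
        if (isWhite(c))
        {
            if (inToken)
            {
                tokens ~= src[start .. i];
                inToken = false;
            }
        }
        else if (!inToken)
        {
            start = i;
            inToken = true;
        }
    }
    if (inToken)
        tokens ~= src[start .. $];
    return tokens;
}

unittest
{
    // U+0085 (next line), U+00A0 (no-break space), tab and space all separate.
    assert(tokenize("a\u0085b\u00A0c\td e") == ["a", "b", "c", "d", "e"]);
}
```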

As a German, I find «» and ‹› a little irritating, because we’re using them like this: »« and ›‹. The Swiss use «content» and the French use « content » (with half-spaces).

C++ was wrong on template syntax, but they were right on using ASCII. D has good template syntax, and it’s ASCII.

> Another point: These exotics are easy to find in your text editor because they won’t be overused.

Citation needed.

> As for usability, some of our tools now have or could have ‘favourite characters’ or ‘snippet’ text strings in a place in the ui where they are readily accessible. I have a unicode character map app and also a file with my unicode favourite characters in it. So there are things that we can do ourselves. And having a favourites comment block in a starter template file might be another example.

If you employ tooling, the best option is to leave the source code as-is and use an OpenType font or other UI-oriented tools.

> Argument against: would complicate our regexes with a new need for multiple alternatives as in [xyz] rather than just one possible character in a search or replace operation. But I think that some regex engines are unicode aware and can understand concepts like all x-characters where x is some property or defines a subset.

Making grep harder to use is definitely a deal-breaker.
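
To illustrate the search cost: once an operator has two spellings, every search for it needs an alternation. A small sketch using Phobos' `std.regex`, with `countLessOrEqual` as a name invented for this example:

```d
import std.regex : matchAll, regex;

// Counts comparisons written either as ASCII `<=` or as U+2264.
size_t countLessOrEqual(string source)
{
    size_t n = 0;
    foreach (m; matchAll(source, regex(`<=|≤`)))
        ++n;
    return n;
}

unittest
{
    assert(countLessOrEqual("if (a <= b && c ≤ d) {}") == 2);
}
```

With ASCII-only source, a plain fixed-string search for `<=` would find every occurrence.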

> I have a concern. I love the betterC idea. Something inside my head tells me not to move too far from C. But we have already left the grammar of C behind, for good reason. C doesn’t have .. or … ( :-) ) nor does it have $. So that train has left. But I’m talking about things that C is never going to have.

Unicode has U+2025 ‥ for you as well.

C is overly restrictive. It’s based not on ASCII but on a proper subset of ASCII that’s compatible with even older standards like EBCDIC. These days, ASCII support is quite a safe bet. Unicode support isn’t.

> One point of clarification: I am not talking about D runtime. I’m confining myself to D’s lexer and D’s grammar.

It sounds great in theory, but if any tool in your chain has no support for that, you’re out. I was running into that on Windows recently. Not D related.

I’m a Unicode fan. I created my own keyboard layout which puts a lot of nice stuff on AltGr and dead key sequences (e.g. proper quotation marks, currency symbols, math symbols, the complete Greek alphabet) while leaving anything that is printed on the keys where it was. Yet I fail to see the advantage of × over * and similar in code. There are several fonts that visually replace <= by a wider ≤ sign, != by a wide ≠, etc. If you want alternatives, use a font. It’s non-intrusive to the source code. It’s a million times better than Unicode in source. I don’t use those fonts because for some reason, they add a plethora of things that make sense in certain languages, e.g. replace >> by a ligature (think of »). That makes sense when it’s an operator, but it doesn’t when it’s two closing angle brackets (cf. Java or C++).

June 02, 2023

Re: Non-ASCII in the future in the lexer

Posted by Richard (Rikki) Andrew Cattermole
in reply to Quirin Schroll

On 02/06/2023 3:47 AM, Quirin Schroll wrote:
>     The second case that comes to mind: I was thinking about regex
>     grammars and XML’s grammar, and I think one or both can now handle
>     all kinds of unicode whitespace.
> 
> Definitely not regex. It’s not standardized at all.

Not the point of the above but related: https://www.unicode.org/reports/tr18/

Unicode for regex is in fact standardized :)

June 01, 2023

Re: Non-ASCII in the future in the lexer

Posted by Walter Bright
in reply to Timon Gehr

On 6/1/2023 6:49 AM, Timon Gehr wrote:
> I am just using the Agda input mode in emacs, so e.g., I just type "\to" and I get "→", "\'a" and I get "á", etc.

https://agda.readthedocs.io/en/v2.6.3/tools/emacs-mode.html

https://github.com/DigitalMars/med/blob/master/src/med/more.d#L350

June 01, 2023

Re: Non-ASCII in the future in the lexer

Posted by Timon Gehr
in reply to Walter Bright

On 6/1/23 20:20, Walter Bright wrote:
> On 6/1/2023 6:49 AM, Timon Gehr wrote:
>> I am just using the Agda input mode in emacs, so e.g., I just type "\to" and I get "→", "\'a" and I get "á", etc.
> 
> https://agda.readthedocs.io/en/v2.6.3/tools/emacs-mode.html
> 

Only this part is relevant:
https://agda.readthedocs.io/en/v2.6.3/tools/emacs-mode.html#unicode-input

(Agda has an emacs mode for the language and an input mode. I am using the input mode even for D code. There's also a TeX input mode, but the Agda input mode has more convenient bindings, so I am using that.)

> https://github.com/DigitalMars/med/blob/master/src/med/more.d#L350

June 01, 2023

Re: Non-ASCII in the future in the lexer

Posted by Cecil Ward
in reply to Quirin Schroll

On Thursday, 1 June 2023 at 15:47:00 UTC, Quirin Schroll wrote:
> [...]

About the search in your text editor. You

I had thought of ‘×’ for cross-product maybe. :-)

I don’t want you all to misunderstand me here, I’m not suggesting that I can defend all of these ideas, I’m just trying to free up our imagination. If we decide that we really want some perfect symbol for a new situation, maybe something already established, perhaps in maths or elsewhere, then I’m merely saying that we should perhaps remember that unicode exists and is not a new weird thing anymore.

The usability thing is not something that I’m too worried about because solutions will rise to meet problems. I have my favourite little snippets of IPA characters in a document and I keep that handy. My iPad has installable keyboard handlers of all sorts, including polytonic Ancient Greek.

What made me think about this topic though is looking at my iPad’s virtual keyboard. The character … is no less accessible than ‘a’, and é and ß are just a long press. The ± £ § ¥ € characters on my iPad are no less accessible than ASCII. Over time, maybe keyboards will evolve, as they already have with the iPad.

But we absolutely should not be ignoring usability here. We should ‘game’ how users will cope when affected by more adventurous proposals.

A thought: those of us who hate new keywords (I am not one; I even love Ada!) would be able to consider mining Unicode for single-character or few-character symbols instead of long English words that might cause breakage, apart from restricting the space remaining for user-defined identifiers.

Staying inside ASCII’s 95 characters forever is a bit like being a caged animal that, when freed, doesn’t want to leave its small world. We’re so very used to ASCII.

The takeaway here is just ‘remember unicode exists’, and the usability situation for some users is now first class if you are either lucky, like iPad owners, or else you set yourself up with some simple aids in the right way for what works for you. Having unicode in the back of our minds might help us become beloved by mathematicians, because we’ve made the ‘perfect fit’ choice. But the usability thing has to always be kept in mind, and tips and links towards little apps ought to be readily handed out. We probably want to have long-winded ASCII fallbacks for users who really don’t like the keyboard situation though.

June 01, 2023

Re: Non-ASCII in the future in the lexer

Posted by H. S. Teoh
in reply to Cecil Ward

On Thu, Jun 01, 2023 at 08:54:19PM +0000, Cecil Ward via Digitalmars-d wrote: [...]
> I don’t want you all to misunderstand me here, I’m not suggesting that
> I can defend all of these ideas, I’m just trying to free up our
> imagination. If we decide that we really want some perfect symbol for
> a new situation, maybe something already established, perhaps in maths
> or elsewhere, then I’m merely saying that we should perhaps remember
> that unicode exists and is not a new weird thing anymore.
> 
> The usability thing is not something that I’m too worried about because solutions will rise to meet problems. I have my favourite little snippets of IPA characters in a document and I keep that handy. My iPad has installable keyboard handlers of all sorts, including poly tonic ancient greek.

Coincidentally, I recently wrote a program (in D, of course :-P) that translates ASCII transcriptions of IPA into Unicode.  And many years ago, I also wrote a program (in C -- this was before I discovered D) that translated ASCII wrapped inside <grk>...</grk> or <rus>...</rus> tags into polytonic Greek or Cyrillic.  In my text editor I could just type out the desired ASCII transcriptions, select the text, and pipe it through these programs to get the Unicode out.
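
A tag-driven transliteration filter of the kind described could be sketched as follows. This is not the program from the post: the six-letter mapping is a deliberately tiny, made-up table, and `greekLetter`, `toGreek` and `expandTags` are names invented for this example. Only the `<grk>...</grk>` tag comes from the description above.

```d
import std.regex : regex, replaceAll;

// Hypothetical, deliberately incomplete ASCII-to-Greek mapping.
dchar greekLetter(dchar c)
{
    switch (c)
    {
        case 'a': return 'α';
        case 'b': return 'β';
        case 'g': return 'γ';
        case 'l': return 'λ';
        case 'o': return 'ο';
        case 's': return 'σ';
        default:  return c; // leave unknown characters unchanged
    }
}

// Transliterates one ASCII transcription, character by character.
string toGreek(string ascii)
{
    string result;
    foreach (dchar c; ascii)
        result ~= greekLetter(c); // appending a dchar encodes it as UTF-8
    return result;
}

// Replaces each <grk>...</grk> span with its transliteration.
string expandTags(string text)
{
    auto re = regex(`<grk>(.*?)</grk>`);
    return replaceAll!(m => toGreek(m[1]))(text, re);
}

unittest
{
    // lambda, omicron, gamma, omicron, sigma (no final-sigma handling here)
    assert(expandTags("say <grk>logos</grk>")
        == "say \u03BB\u03BF\u03B3\u03BF\u03C3");
}
```

A filter like this reads text and returns text, so it can sit behind an editor's "pipe selection through command" feature, which is the workflow described above.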

> What made me think about this topic though is looking at my iPad’s
> virtual keyboard. The character … is no less accessible than ‘a’, and
> é and ß are just a long press. The ± £ § ¥ € characters on my iPad are
> no less accessible than ASCII. Over time, maybe keyboards will evolve
> seeing as it has already been with the iPad.
[...]

I believe that the next step is USB/WiFi touchscreen keyboards that can be reconfigured to any symbol set by software. All we need is a long, horizontal device with a touchscreen mounted on a suitable support that makes it comfortable to type on, then have a standard API for software to configure whatever symbols it wishes the user to use on it. Instantly switch to APL symbols and back, for example. Or, for that matter, have the layout completely software-driven: imagine instantly switching from a typewriter keyboard to a piano keyboard, for example, for easy music input. Or a guitar fretboard for instant MIDI improvisation.

T

-- 
When you breathe, you inspire. When you don't, you expire. -- The Weekly Reader

June 02, 2023

Re: Non-ASCII in the future in the lexer

Posted by Abdulhaq
in reply to H. S. Teoh

On Thursday, 1 June 2023 at 22:04:11 UTC, H. S. Teoh wrote:
>
> I believe that the next step is to USB/WiFi touchscreen keyboards that can be reconfigured to any symbol set by software.  All we need is a long, horizontal device with a touchscreen mounted on suitable support that makes it comfortable to type on, then have a standard API for software to configure whatever symbols it wishes the user to use on it. Instantly switch to APL symbols and back, for example.  Or, for that matter, have the layout completely software-driven: imagine instantly switching from a typewriter keyboard to a piano keyboard, for example, for easy music input. Or a guitar fret for instant MIDI improvisation.
>
Of course it's subjective but I strongly dislike typing on touchscreens and am surprised to find a programmer who prefers them.

Also when playing the guitar, the way the string is struck and the position and force of the finger on the fretboard allows great variation in the sound. The idea of somehow even attempting to simulate that on a touch screen makes me feel sad for the loss of virtuosity even just thinking about it :-) . Similarly for the loss of key weight and travel on a piano keyboard. It all reminds me of how the virtual world is generally supplanting the real world, with the massive loss that entails.

I obviously got out of bed on the wrong side today :-)
