Non-ASCII in the future in the lexer
May 31, 2023
Cecil Ward
May 31, 2023

What do you think? It occurred to me that as the language develops we are occasionally having discussions about new keywords, or even changing them, for example: s/body/do/ some while back.

Unicode has been around for 30 years now and yet it is not getting fully used in programming languages for example. We are still stuck in our minds with ASCII only. Should we in future start mining the riches of unicode when we make changes to the grammar of programming languages (and other grammars)?

Would it be worthwhile considering wider unicode alternatives for keywords that we already have? Examples: comparison operators and other operators. We have unicode symbols for

≤     less than or equal <=
≥    greater than or equal >=

a proper multiplication sign ‘×’, like an x, as well as the * that we have been stuck with since the beginning of time.

± 	plus or minus might come in useful someday, can’t think what for.

I have … as one character; would be nice to have that as an alternative to .. (two ASCII fullstops) maybe?

I realise that this issue is hardly about the cure for world peace, but there seems to be little reason to be confined to ASCII forever when there are better suited alternatives and things that might spark the imagination of designers.

One extreme case or two. First: many editors now automatically insert ‘ ’ (the curly 6-9 quotes) instead of the ASCII ' ', and likewise “ ” (a matching 6-9 pair). When Walter was designing the lexical rules for string literals, many delimiters needed to be found for all the alternatives. We also have « », which are familiar to French speakers. It would be very nice not to fall over on 6-9 quotes anyway, and just accept them as an alternative.

The second case that comes to mind: I was thinking about regex grammars and XML’s grammar, and I think one or both can now handle all kinds of Unicode whitespace. That’s the kind of thinking I’m interested in. It would be good to handle all kinds of whitespace, as we do all kinds of newline sequences. We probably already do both well. And no one complains saying ‘we ought not bother with tab’, so handling U+0085 and the various whitespace types such as the no-break space (&nbsp; in HTML) in the lexer is to me a no-brainer.
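
For illustration, here is a minimal sketch of what Unicode-aware whitespace skipping could look like in D. It relies on `std.uni.isWhite` and `std.utf.decode` from Phobos; the helper name `skipWhitespace` is made up for this example and says nothing about how the real D lexer is written.

```d
import std.uni : isWhite;
import std.utf : decode;

/// Returns the index of the first non-whitespace code point at or after `pos`.
/// Hypothetical helper for illustration only, not taken from any real lexer.
size_t skipWhitespace(string src, size_t pos)
{
    while (pos < src.length)
    {
        size_t next = pos;
        const dchar c = decode(src, next); // advances `next` past one code point
        if (!isWhite(c))
            break;
        pos = next;
    }
    return pos;
}

void main()
{
    // tab, NEL (U+0085), no-break space (U+00A0), then ordinary code
    string src = "\t\u0085\u00A0int x;";
    assert(src[skipWhitespace(src, 0) .. $] == "int x;");
}
```

The point of decoding whole code points rather than bytes is that the non-ASCII whitespace characters occupy more than one byte in UTF-8.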

And what use might we find some day for § and ¶ ? Could be great for some new exotic grammatical structural pattern. Look at the mess that C++ got into with the syntax of templates. They needed something other than < >. Almost anything. They could have done no worse with « ».

Another point: These exotics are easy to find in your text editor because they won’t be overused.

As for usability, some of our tools now have, or could have, ‘favourite characters’ or ‘snippet’ text strings in a place in the UI where they are readily accessible. I have a Unicode character map app and also a file with my favourite Unicode characters in it. So there are things that we can do ourselves. Having a favourites comment block in a starter template file might be another example.

Argument against: would complicate our regexes with a new need for multiple alternatives as in  [xyz] rather than just one possible character in a search or replace operation. But I think that some regex engines are unicode aware and can understand concepts like all x-characters where x is some property or defines a subset.

I have a concern. I love the betterC idea. Something inside my head tells me not to move too far from C. But we have already left the grammar of C behind, for good reason. C doesn’t have .. or … ( :-) ) nor does it have $. So that train has left. But I’m talking about things that C is never going to have.

One point of clarification: I am not talking about D runtime. I’m confining myself to D’s lexer and D’s grammar.
May 31, 2023
On Wednesday, 31 May 2023 at 06:23:43 UTC, Cecil Ward wrote:
>
>
> What do you think? It occurred to me that as the language develops we are occasionally having discussions about new keywords, or even changing them, for example: s/body/do/ some while back.
>
> Unicode has been around for 30 years now and yet it is not getting fully used in programming languages for example. We are still stuck in our minds with ASCII only. Should we in future start mining the riches of unicode when we make changes to the grammar of programming languages (and other grammars)?

I'm fully with you, but the problem is not to have any Unicode symbols in the grammar as operators or delimiters or whatever. It's the input method. Most keyboards don't have them on the keys and any other method is awfully slow. Even a well-designed selector table is slow if it is needed often - and most editors are far from providing such.
May 31, 2023
Some interesting food for thought. Thanks for taking the time to post this.
May 31, 2023
On 5/31/2023 1:22 AM, Dom DiSc wrote:
> I'm fully with you, but the problem is not to have any Unicode symbols in the grammar as operators or delimiters or whatever. It's the input method. Most keyboards don't have them on the keys and any other method is awfully slow. Even a well-designed selector table is slow if it is needed often - and most editors are far from providing such.

I use PuTTY a lot to access computers remotely in text mode. With some experimentation, some Unicode characters are rendered, but some aren't, like the 6-9 quotes. Maybe the programming world isn't quite ready for them yet.
May 31, 2023
On 31/05/2023 8:47 PM, Walter Bright wrote:
> Maybe the programming world isn't quite ready for them yet.

s/programming/Windows/

Try the ConEmu terminal emulator; it supports PuTTY. I find ConEmu works very well with Cygwin for Unicode printing.

An extension of this is that we really need a proper console module in Phobos that uses the 16-bit (wide-character) Windows console API, as that makes Unicode output work out of the box.
May 31, 2023
On Wed, May 31, 2023 at 06:23:43AM +0000, Cecil Ward via Digitalmars-d wrote:
> What do you think? It occurred to me that as the language develops we are occasionally having discussions about new keywords, or even changing them, for example: s/body/do/ some while back.
> 
> Unicode has been around for 30 years now and yet it is not getting fully used in programming languages for example. We are still stuck in our minds with ASCII only. Should we in future start mining the riches of unicode when we make changes to the grammar of programming languages (and other grammars)?

D already supports Unicode identifiers.  For example, this is valid D today:

	int функция(int параметр) {
		return (параметр > 0) ? 2*функция(параметр-1) + 1 : 2;
	}

Of course, current language keywords are English- (and ASCII-) only.


> Would it be worthwhile considering wider unicode alternatives for keywords that we already have? Examples: comparison operators and other operators. We have unicode symbols for
> 
> ≤     less than or equal <=
> ≥    greater than or equal >=
> 
> a proper multiplication sign ‘×’, like an x, as well as the * that we have been stuck with since the beginning of time.

This is all great, but as someone else has already said, the input
method could be a problem area.  On my PC, I've set up XKB input with a
compose key such that many of these symbols are relatively easily
accessible; for example, Compose + < + = produces ≤; and Compose + v + /
produces √.  However, some symbols are more tricky to input, and some
are not accessible this way.  While it's always possible to, e.g., use a
character map widget to select a particular symbol, that significantly
slows down how fast you can type code, which negatively affects
productivity.

One dream I've always had is the so-called software-controlled keyboard: instead of a keyboard with physical keys, you'd have a keyboard that's actually a touchscreen, with keys that can be replaced from software. So for example, when writing D + Unicode symbols, you'd switch to "Unicode D" layout where symbols like ≤, ≥, ×, etc. are easily accessible.  We already have this on our mobile devices, in fact, to various degrees of customizability.  It just has to be taken to the next step of allowing easy remapping of keyboard layouts and switching between them.  Each future programming language, for example, could come with its own layout having language-specific symbols easily accessible.


> ± 	plus or minus might come in useful someday, can’t think what for.

In one of my projects, there's a vector calculator program where ± produces an expression that returns a list of values produced by all possible combinations of signs where the ± operator appears.  It's very useful for certain applications, like combinatorial polytopes where ± appears frequently.
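
A rough sketch of the sign-expansion idea, for readers who want something concrete. This is not the calculator's actual code; the function name `signCombinations` and the choice to represent the result as an array of arrays are assumptions made for the example.

```d
/// Applies every combination of signs to the given magnitudes.
/// Hypothetical illustration of what a list of ± values could expand to.
double[][] signCombinations(const double[] mags)
{
    double[][] result;
    // Each bit of `mask` decides the sign of one component.
    foreach (size_t mask; 0 .. size_t(1) << mags.length)
    {
        auto row = new double[mags.length];
        foreach (i, m; mags)
            row[i] = ((mask >> i) & 1) ? -m : m;
        result ~= row;
    }
    return result;
}

void main()
{
    // (±1, ±2) expands to four points
    assert(signCombinations([1.0, 2.0]) ==
           [[1.0, 2.0], [-1.0, 2.0], [1.0, -2.0], [-1.0, -2.0]]);
}
```

With n occurrences of ± the result has 2^n entries, which is why this is handy for generating vertex coordinates.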


[...]
> Argument against: would complicate our regexes with a new need for multiple alternatives as in  [xyz] rather than just one possible character in a search or replace operation. But I think that some regex engines are unicode aware and can understand concepts like all x-characters where x is some property or defines a subset.

std.regex *is* unicode-aware, BTW. Check this out:

````d
import std;
string преобразовать(string текст) {
	return текст.replaceAll(regex(`[а-я]`), "X");
}
void main() {
	writeln("blah blah это не правда blah blah".преобразовать);
}
````

Output:

````
blah blah XXX XX XXXXXX blah blah
````

It correctly handles ranges of non-ASCII characters.
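
On the "multiple alternatives" worry, a search that accepts both spellings of an operator can be a plain alternation. The following is only a sketch: the function names are invented for the example, and it assumes a hypothetical future in which `≤` and `≥` are accepted as operators.

```d
import std.range : walkLength;
import std.regex : matchAll, regex, replaceAll;

/// Counts "less than or equal" operators in either spelling.
size_t countLessOrEqual(string code)
{
    return code.matchAll(regex(`<=|≤`)).walkLength;
}

/// Rewrites the Unicode spellings back to ASCII, e.g. for older tools.
string toAsciiOperators(string code)
{
    return code.replaceAll(regex(`≤`), "<=").replaceAll(regex(`≥`), ">=");
}

void main()
{
    string code = "if (a <= b && c ≤ d && e ≥ f) {}";
    assert(countLessOrEqual(code) == 2);
    assert(toAsciiOperators(code) == "if (a <= b && c <= d && e >= f) {}");
}
```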


T

-- 
Real Programmers use "cat > a.out".
May 31, 2023
On 5/31/2023 8:13 AM, H. S. Teoh wrote:
> This is all great, but as someone else has already said, the input
> method could be a problem area.  On my PC, I've set up XKB input with a
> compose key such that many of these symbols are relatively easily
> accessible; for example, Compose + < + = produces ≤; and Compose + v + /
> produces √.  However, some symbols are more tricky to input, and some
> are not accessible this way.

I've struggled with that, too. On MicroEmacs, I fixed ^X-U to scroll through the various incarnations of a letter. So, placing the cursor on a, and hitting ^X-U, changes it to a with an umlaut, a with an accent, etc. On a -, it scrolls through the various - variations. On ", it scrolls through the quoting symbols.

Of course, this is pretty limited.

June 01, 2023
On Thu, Jun 1, 2023 at 4:35 PM Walter Bright via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
>
> On 5/31/2023 8:13 AM, H. S. Teoh wrote:
> > This is all great, but as someone else has already said, the input
> > method could be a problem area.  On my PC, I've set up XKB input with a
> > compose key such that many of these symbols are relatively easily
> > accessible; for example, Compose + < + = produces ≤; and Compose + v + /
> > produces √.  However, some symbols are more tricky to input, and some
> > are not accessible this way.
>
> I've struggled with that, too. On MicroEmacs, I fixed ^X-U to scroll through the various incarnations of a letter. So, placing the cursor on a, and hitting ^X-U, changes it to a with an umlaut, a with an accent, etc. On a -, it scrolls through the various - variations. On ", it scrolls through the quoting symbols.
>
> Of course, this is pretty limited.
>

The compose key on X11 (Linux) is user-configurable, so you can use it to do basically whatever you want. There are extra bindings available online for the Greek alphabet and mathematical symbols. It's controlled from a configuration file whose syntax looks something like this:
<Multi_key> <asciitilde> <asciitilde>          : "≈"

On Windows there is at least one add-on that provides this functionality
and is user-configurable.
I don't know what the situation is on Mac or on Wayland.

As low-hanging fruit, I would like to see constants such as MATH_PI
also defined by their symbol (e.g. π).
I think one of the most important qualities of code is readability, and
getting the balance between verbosity and terseness right is important.
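
Since D already accepts Unicode letters in identifiers, a user can define such a name today. A small sketch, assuming nothing beyond `std.math.PI` and `std.math.isClose`; the alias `π` and the function `circleArea` are made up for the example and are not part of Phobos.

```d
import std.math : PI, isClose;

/// Hypothetical user-defined alias; Phobos itself only provides the ASCII name PI.
enum π = PI;

double circleArea(double r)
{
    return π * r * r;
}

void main()
{
    assert(isClose(circleArea(2.0), 4.0 * PI));
}
```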

I would also like syntax like the following to be possible:

if (0 < x ≤ 8) {}    (lowers to: if (x > 0 && x <= 8) {})
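
Until something like that exists, the same check can be wrapped in a small helper. This is only a sketch of the proposed lowering; the name `inLeftOpenInterval` is invented here.

```d
/// True when lo < x <= hi, i.e. what `lo < x ≤ hi` would lower to.
/// Hypothetical helper, not an existing library function.
bool inLeftOpenInterval(T)(T lo, T x, T hi)
{
    return lo < x && x <= hi;
}

void main()
{
    assert(inLeftOpenInterval(0, 8, 8));   // upper bound is included
    assert(!inLeftOpenInterval(0, 0, 8));  // lower bound is excluded
    assert(!inLeftOpenInterval(0, 9, 8));
}
```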

June 01, 2023
On 6/1/23 08:31, Walter Bright wrote:
> On 5/31/2023 8:13 AM, H. S. Teoh wrote:
>> This is all great, but as someone else has already said, the input
>> method could be a problem area.  On my PC, I've set up XKB input with a
>> compose key such that many of these symbols are relatively easily
>> accessible; for example, Compose + < + = produces ≤; and Compose + v + /
>> produces √.  However, some symbols are more tricky to input, and some
>> are not accessible this way.
> 
> I've struggled with that, too. On MicroEmacs, I fixed ^X-U to scroll through the various incarnations of a letter. So, placing the cursor on a, and hitting ^X-U, changes it to a with an umlaut, a with an accent, etc. On a -, it scrolls through the various - variations. On ", it scrolls through the quoting symbols.
> 
> Of course, this is pretty limited.
> 

I am just using the Agda input mode in emacs, so e.g., I just type "\to" and I get "→", "\'a" and I get "á", etc. Many editors have similar plugins. This also works perfectly over ssh. In any case, the approach I have taken with my own lexers is that Unicode is supported, but never required. E.g., people can just write "->" instead of "→" and this is the case for all Unicode syntax elements (except if you have to match an identifier name I guess). After that, whether or not non-ASCII tokens are used at all becomes a question of code style and formatting.
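
To make the "supported, but never required" approach concrete, here is a minimal sketch in D. The token set, the enum `Tok` and the functions `classify` and `spell` are all assumptions made for this example, not code from any existing lexer.

```d
/// Hypothetical token kinds that have both an ASCII and a Unicode spelling.
enum Tok { arrow, lessEqual, greaterEqual, notEqual, unknown }

/// Both spellings map to the same token, so the parser never sees the difference.
Tok classify(string lexeme)
{
    switch (lexeme)
    {
        case "->", "→": return Tok.arrow;
        case "<=", "≤": return Tok.lessEqual;
        case ">=", "≥": return Tok.greaterEqual;
        case "!=", "≠": return Tok.notEqual;
        default:        return Tok.unknown;
    }
}

/// What a formatter could emit, depending on the project's code style.
string spell(Tok t, bool unicode)
{
    final switch (t)
    {
        case Tok.arrow:        return unicode ? "→" : "->";
        case Tok.lessEqual:    return unicode ? "≤" : "<=";
        case Tok.greaterEqual: return unicode ? "≥" : ">=";
        case Tok.notEqual:     return unicode ? "≠" : "!=";
        case Tok.unknown:      return "";
    }
}

void main()
{
    assert(classify("→") == classify("->"));
    assert(spell(classify("<="), true) == "≤");
    assert(spell(classify("≠"), false) == "!=");
}
```

Because the choice of spelling lives only in the lexer and the formatter, switching a code base between the two styles is a mechanical rewrite.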

In my experience, many programmers are too lazy (and/or ideologically against Unicode) to set up simple Unicode input and still prefer to write ASCII, but I much prefer reading Unicode. Further down the road, I plan to address this disconnect using an automatic code formatter.
June 01, 2023
On Wednesday, 31 May 2023 at 08:47:04 UTC, Walter Bright wrote:
> I use PuTTY a lot to access computers remotely in text mode. With some experimentation, some Unicode characters are rendered, but some aren't, like the 6-9 quotes. Maybe the programming world isn't quite ready for them yet.

Do you have the Consolas font selected under Window -> Appearance in the configuration?