Arbitrary identifiers - syntax - D Programming Language Discussion Forum

July 04, 2023

Posted by Cecil Ward

This is an intentionally vague post about an idea without a clear solution, so this is not a concrete proposal, but is intended to solicit suggestions and ideas.

In mathematics or physics, you might have variables such as t and t′. The second character of the latter is U+2032 (prime), and there’s also a similar glyph at U+02B9. I posted a while back about the use of Unicode, and in that I was thinking about text in various non-English human languages. The docs say that D identifiers such as variable names are chosen from a subset of Unicode defined by an appendix of C99. This gives a massive list of acceptable characters in umpteen writing systems and human languages. How does D deal with that in the lexer? Enormous table lookup? I would be interested to know, compiler authors.

However, in maths many symbols, such as my earlier example, contain characters that are not legal in identifiers, because Unicode considers them to be punctuation or some similar non-identifier concept. How can we make D maths-friendly? Yes, we can and do write things like t_prime, but it doesn’t look great, and it’s long-winded. Yes, I hear you about the ease of use of Unicode, but that was discussed before and belongs to the earlier thread. Is there a way of allowing (almost) ‘arbitrary’ content in identifiers in D’s grammar? Think of the kind of syntax that uses "my file.ext"-style double quoting for otherwise unacceptable filenames, such as this example with a space in it.

Is it at all possible that a future D might have a mechanism like that to accommodate arbitrary identifiers for maths? Maybe even a kind of extensible lexer? Perhaps that is way too hard, and an easier but less attractive solution like the bracketing could be found. But whatever is suggested would have to be compact, neat and minimal, so that D statements and expressions could clearly resemble mathematical equations.

I thought about all the imaginative literal string syntax that we already have, where a lot of work was done to make literal strings more workable in various use-cases.

I’d be very interested to hear suggestions as to how we do special relativity with t, t′, and then t″. It may be just simply too hard to do it cleverly. I’m thinking about making D the most maths-friendly language. Let’s displace Fortran ;-). (We would need to make complex numbers friendlier for that though, maybe with more of the syntactic sugar brought back, but that’s another story.) I think it would possibly be a good idea to restrict ‘arbitrary’ characters to a certain subset, not allowing absolutely any Unicode character: no whitespace, no control characters, no existing D tokens such as ‘=’. Maybe disallow all punctuation characters that are already ‘taken’ in D, that is, already in use in the existing lexer’s grammar, but I’m unsure about that. What to do about ‘-’ (hyphen-minus)? It is allowed in some languages, such as XSLT, and is used there a lot. Perhaps ban it because of the confusion with minus for subtraction. I don’t know. It doesn’t seem to be used in physics, for that same reason.

Thoughts?

Ah huh!

This is something that I am very familiar with, as I'm updating dmd to use UAX31 identifiers (Unicode 15).

What you are wanting is called Medial.

The definition of a UAX31 identifier is: ``<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*``
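To make that grammar concrete, here is a rough sketch of a matcher for it. The three predicates are stand-ins I made up for illustration: a real lexer would use the XID_Start / XID_Continue tables, and the Medial set here is just two assumed characters (U+00B7 and U+2019), not something dmd defines.

```d
import std.uni : isAlpha, isNumber;

// Hypothetical stand-ins for the UAX31 character classes (assumptions).
bool isStart(dchar c)    { return c == '_' || isAlpha(c); }
bool isContinue(dchar c) { return c == '_' || isAlpha(c) || isNumber(c); }
// Assumed Medial set, for illustration only.
bool isMedial(dchar c)   { return c == '\u00B7' || c == '\u2019'; }

// Matches: <Start> <Continue>* (<Medial> <Continue>+)*
bool isIdentifier(const(dchar)[] s)
{
    if (s.length == 0 || !isStart(s[0]))
        return false;

    size_t i = 1;
    while (i < s.length && isContinue(s[i]))
        ++i;

    // Each remaining group must be one Medial followed by 1+ Continue.
    while (i < s.length)
    {
        if (!isMedial(s[i]))
            return false;
        ++i;
        const begin = i;
        while (i < s.length && isContinue(s[i]))
            ++i;
        if (i == begin)
            return false; // a Medial may not end the identifier
    }
    return true;
}
```

With these assumed predicates, ``isIdentifier("a\u00B7b"d)`` is accepted while ``isIdentifier("a\u00B7"d)`` is rejected, because the grammar requires at least one Continue character after every Medial.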

For possible characters for Medial: https://unicode.org/reports/tr31/#Table_Optional_Medial

https://unicode.org/reports/tr31

As for how to represent it... the way that dmd does it currently is with a ``wchar[2][]`` and then a binary search with a start + end. This of course isn't standard and is not the best.
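Roughly, that kind of lookup could look like the sketch below. The table contents and the names ``sampleRanges`` and ``inRanges`` are invented for illustration; this is not dmd's actual table or code.

```d
// Each entry is an inclusive [start, end] pair; entries are sorted and
// do not overlap. A tiny illustrative sample, not dmd's real table.
immutable wchar[2][] sampleRanges = [
    [0x00AA, 0x00AA],
    [0x00B5, 0x00B5],
    [0x00C0, 0x00D6],
    [0x00D8, 0x00F6],
    [0x0391, 0x03A9],
];

// Binary search over the pairs; the search window is [lo, hi).
bool inRanges(const wchar[2][] table, dchar c)
{
    size_t lo = 0, hi = table.length;
    while (lo < hi)
    {
        const mid = lo + (hi - lo) / 2;
        if (c < table[mid][0])
            hi = mid;          // c lies before this range
        else if (c > table[mid][1])
            lo = mid + 1;      // c lies after this range
        else
            return true;       // start <= c <= end
    }
    return false;
}
```

Note that a ``wchar`` pair can only describe ranges inside the BMP, so anything above U+FFFF needs a different element type.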

The standard solution as per Unicode Demystified (strongly recommend buying it if you are interested in this subject) is to use an inversion list, which stores just the start of each range and uses whether the index is odd or even to determine if a character is in the range or not. You would use a search algorithm like binary search to do the lookup.
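A minimal sketch of an inversion list lookup, assuming a plain binary search. The sample data and the names ``sampleInversionList`` and ``inInversionList`` are mine, purely for illustration.

```d
// Sorted boundaries. An even index starts a run that is "in" the set;
// the following odd index is the first code point that is "out" again.
// Here: U+00C0..U+00D6 and U+00D8..U+00F6 (illustrative sample only).
immutable dchar[] sampleInversionList = [0x00C0, 0x00D7, 0x00D8, 0x00F7];

bool inInversionList(const dchar[] list, dchar c)
{
    // Binary search for the number of boundaries that are <= c.
    size_t lo = 0, hi = list.length;
    while (lo < hi)
    {
        const mid = lo + (hi - lo) / 2;
        if (list[mid] <= c)
            lo = mid + 1;
        else
            hi = mid;
    }
    // lo boundaries are <= c, so the last one crossed has index lo - 1.
    // If that index is even (lo is odd), c is inside a range.
    return (lo & 1) == 1;
}
```

Compared with start/end pairs, a range whose end is directly followed by another set needs no extra entry, and set operations such as complement become a matter of adding or removing a leading boundary.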

I will be switching dmd over, should my C23 PR go in, to an inversion list + Fibonacci search to take advantage of ASCII, BMP, then per-plane probabilities. I've been talking about this quite a bit recently on the Discord #langdev channel.
