September 24, 2018
On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:
> 2. There are no rules about what *encoding* is acceptable, it's implementation defined. So various compilers have different rules as to what will be accepted in the actual source code. In fact, I read somewhere that not even ASCII is guaranteed to be supported.
>
Indeed. IBM mainframes have C compilers too, but not ASCII: they code in EBCDIC. That's why, for instance, code like

     if (c >= 'A' && c <= 'Z') printf("CAPITAL LETTER\n");

is not portable: the letters A-Z are not contiguous in EBCDIC.
September 24, 2018
On 9/24/18 3:18 PM, Patrick Schluter wrote:
> On Monday, 24 September 2018 at 13:26:14 UTC, Steven Schveighoffer wrote:
>> 2. There are no rules about what *encoding* is acceptable, it's implementation defined. So various compilers have different rules as to what will be accepted in the actual source code. In fact, I read somewhere that not even ASCII is guaranteed to be supported.
>>
> Indeed. IBM mainframes have C compilers too, but not ASCII: they code in EBCDIC. That's why, for instance, code like
> 
>       if (c >= 'A' && c <= 'Z') printf("CAPITAL LETTER\n");
> 
> is not portable: the letters A-Z are not contiguous in EBCDIC.

Right. But it's just a side-note -- I'd guess all modern compilers support ASCII, and definitely ones that we would want to interoperate with.

Besides, that example is more concerned with *input data* encoding, not *source code* encoding. If the above is written in ASCII, then I would assume that the bytes in the source file are the ASCII bytes, and probably the IBM compilers would not know what to do with such files (it would all be gibberish if you opened it in an EBCDIC editor). You'd first have to translate it to EBCDIC, which is a red flag that this likely isn't going to work :)
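To make the byte-level point concrete, here's a tiny illustrative sketch (mine, nothing IBM-specific; the 0xC1 value is from EBCDIC code page 037):

    // The same character has different byte values under different
    // encodings: 'A' is 0x41 in ASCII/UTF-8 but 0xC1 in EBCDIC (CP037),
    // so source bytes only mean something relative to the encoding the
    // compiler assumes.
    void main()
    {
        import std.stdio : writefln;
        writefln("'A' as stored in this UTF-8 source file: 0x%02X", cast(ubyte) 'A');
        // An EBCDIC toolchain would instead expect 0xC1 for the same letter.
    }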

-Steve
September 25, 2018
On 9/23/2018 12:06 PM, Abdulhaq wrote:
> The early history of computer science is completely dominated by cultures who use latin script based characters,

Small character sets are much more implementable on primitive systems like telegraphs and electro-mechanical ttys.

It wasn't even practical to display a rich character set until the early 1980's or so. There wasn't enough memory. Glass ttys at the time could barely, and I mean barely, display ASCII. I know because I designed and built one.
September 25, 2018
On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:
> In all seriousness I hate it when someone thinks it's funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it.
> (This is already supported in D.)

I just want to chime in that I've definitely used Greek letters in "ordinary" code - it's handy when writing math and feeling lazy.

Note that on Linux, with a simple configuration tweak (Windows key mapped to Compose, and https://gist.githubusercontent.com/zkat/6718053/raw/4535a2e2a988aa90937a69dbb8f10eb6a43b4010/.XCompose ), you can for instance type "<windows key> l a m" to get the lambda symbol, or other Greek letters, very easily.
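For example, something along these lines compiles in D today (a quick illustrative sketch; the identifiers are arbitrary):

    import std.math : sqrt;
    import std.stdio : writeln;

    // Standard deviation of xs around a given mean μ.
    double σ(double[] xs, double μ)
    {
        double sum = 0;
        foreach (x; xs)
            sum += (x - μ) * (x - μ);
        return sqrt(sum / xs.length);
    }

    void main()
    {
        auto λ = [1.0, 2.0, 3.0];
        writeln(σ(λ, 2.0));
    }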
September 25, 2018
When I write code that I expect to be used only around here, I generally write the code itself in English but the comments in my own language. I agree that in general it's better to stick with English in identifiers, since the programming language and the standard library are in English.

On Tuesday, 25 September 2018 at 09:28:33 UTC, FeepingCreature wrote:
> On Friday, 21 September 2018 at 23:17:42 UTC, Seb wrote:
>> In all seriousness I hate it when someone thinks it's funny to use the lambda symbol as an identifier and I have to copy that symbol whenever I want to use it because there's no convenient way to type it.
>> (This is already supported in D.)
>
> I just want to chime in that I've definitely used Greek letters in "ordinary" code - it's handy when writing math and feeling lazy.

On the other hand, Unicode identifiers still have their value IMO. The quote above is one reason: in a very specialized codebase it may simply be impractical to transliterate everything into ASCII.

Another reason is that something may not have a good translation into English. If there is an enum type listing city names, it is IMO better to write them as they are normally spelled, using Unicode: CityName.seinäjoki, not CityName.seinaejoki.
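For illustration, a sketch I just made up (not code from any real project - ä is a letter under D's current identifier rules, so this already compiles):

    enum CityName
    {
        helsinki,
        tampere,
        seinäjoki,
    }

    void main()
    {
        import std.stdio : writeln;
        auto home = CityName.seinäjoki;
        writeln(home); // prints "seinäjoki"
    }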

September 25, 2018
On 2018-09-21 18:27, Neia Neutuladh wrote:
> D's currently accepted identifier characters are based on Unicode 2.0:
> 
> * ASCII range values are handled specially.
> * Letters and combining marks from Unicode 2.0 are accepted.
> * Numbers outside the ASCII range are accepted.
> * Eight random punctuation marks are accepted.
> 
> This follows the C99 standard.
> 
> Many languages use the Unicode standard explicitly: C#, Go, Java, Python, ECMAScript, just to name a few. A small number of languages reject non-ASCII characters: Dart, Perl. Some languages are weirdly generous: Swift and C11 allow everything outside the Basic Multilingual Plane.
> 
> I'd like to update that so that D accepts something as a valid identifier character if it's a letter or combining mark or modifier symbol that's present in Unicode 11, or a non-ASCII number. This allows the 146 most popular writing systems and a lot more characters from those writing systems. This *would* reject those eight random punctuation marks, so I'll keep them in as legacy characters.
> 
> It would mean we don't have to reference the C99 standard when enumerating the allowed characters; we just have to refer to the Unicode standard, which we already need to talk about in the lexical part of the spec.
> 
> It might also make the lexer a tiny bit faster; it reduces the number of valid-ident-char segments to search from 245 to 134. On the other hand, it will change the ident char ranges from wchar to dchar, which means the table takes up marginally more memory.
> 
> And, of course, it lets you write programs entirely in Linear B, and that's a marketing ploy not to be missed.
> 
> I've got this coded up and can submit a PR, but I thought I'd get feedback here first.
> 
> Does anyone see any horrible potential problems here?
> 
> Or is there an interestingly better option?
> 
> Does this need a DIP?

I'm not a native English speaker, but I write all my public and private code in English. I expect anyone I work with to write their code in English as well, and I make sure they do. English is not enough either; it has to be American English.

Despite this, I think D should support as much of Unicode as possible (including Unicode in identifiers). It should not be up to the programming language to decide which language the developer writes the code in.

-- 
/Jacob Carlborg
September 25, 2018
On 09/24/2018 08:17 AM, 0xEAB wrote:

> - Non-idiomatic translations of tech terms [2]

This is something I heard from a Digital Research programmer in the early 90s:

The English message was something like "No memory left" and the German translation was "No memory on the left-hand side" :)

Ali

September 26, 2018
On Wednesday, 26 September 2018 at 02:12:07 UTC, Ali Çehreli wrote:
> On 09/24/2018 08:17 AM, 0xEAB wrote:
>
> > - Non-idiomatic translations of tech terms [2]
>
> This is something I heard from a Digital Research programmer in the early 90s:
>
> The English message was something like "No memory left" and the German translation was "No memory on the left-hand side" :)

My ex-girlfriend tried to learn SQL from a book that had won a prize for its use of Norwegian. As a result, every single concept used a different name from what everybody else uses, and while it may be possible to learn some SQL from it, it made googling an absolute nightmare. Just imagine a whole book saying CHOOSE for SELECT, IF for WHERE, and USING instead of FROM - only worse, since it's a different language. It even used SQL pseudo-code with these made-up names, and showed how to translate it to proper SQL only as an afterthought.

--
  Simen
September 26, 2018
On Wednesday, 26 September 2018 at 02:12:07 UTC, Ali Çehreli wrote:
> On 09/24/2018 08:17 AM, 0xEAB wrote:
>
> > - Non-idiomatic translations of tech terms [2]
>
> This is something I heard from a Digital Research programmer in the early 90s:
>
> The English message was something like "No memory left" and the German translation was "No memory on the left-hand side" :)
>
The K&R in German was of the same "quality". That's what happens when the translator is not an IT person himself.

September 26, 2018
On 25/09/18 15:35, Dukc wrote:
> Another reason is that something may not have a good translation into English. If there is an enum type listing city names, it is IMO better to write them as they are normally spelled, using Unicode: CityName.seinäjoki, not CityName.seinaejoki.

This sounded like a very compelling example, until I gave it a second thought. I now fail to see how this example translates to a real-life scenario.

City names (data, which change over time) as enums (a compile-time set) seem like a horrible idea.

That may sound like a very technical objection to an otherwise valid point, but I really think that's not the case. The properties that cause city names to be poor candidates for enum values are the same ones that make them Unicode candidates.
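To make the distinction concrete, here is a minimal sketch of the data-driven alternative (my own illustration, the names are made up): the city names live in ordinary data, where Unicode is unproblematic, rather than in the identifier space.

    // City names as data rather than as identifiers.
    immutable string[] cityNames = ["Helsinki", "Tampere", "Seinäjoki"];

    void main()
    {
        import std.algorithm.searching : canFind;
        assert(cityNames.canFind("Seinäjoki"));
    }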

Shachar