string\utf question - D Programming Language Discussion Forum

August 05, 2004

string\utf question

Posted by Lars Ivar Igesund

I don't really know much about UTF strings, but since I'm doing some processing of D files (e.g. ddepcheck), I suspect that I should know at least something before I proceed. lex.html states that identifiers can contain Universal Alphas. What is a universal alpha, and is it possible to check whether a character is a universal alpha using some function currently in Phobos (or will it come with std.utype)?

Also, I use File : Stream's toString to get the content of the file (I don't care whether this is the most efficient way to do it or not, since it makes the processing itself much simpler compared to reading line by line). File's toString returns a char [] no matter what, whereas the std.ctype functions all take dchars as inputs.
What's the recommended type to use (char, wchar, dchar)?
What's the recommended way to convert the char [] to the best type?

Lars Ivar Igesund

August 05, 2004

Re: string\utf question

Posted by Arcane Jill
in reply to Lars Ivar Igesund

In article <ceso29$18p3$1@digitaldaemon.com>, Lars Ivar Igesund says...
>
>I don't really know much about utf strings, but doing some processing of D files (e.g. ddepcheck), I suspect that I should know at least something before I proceed. lex.html states that identifiers can contain Universal Alphas. What is an universal alpha,

Google "ISO/IEC 9899:1999 (E)" or "ISO-C-FDIS.1999-04.pdf" and then head to
Annex D on page 438. (Obvious really! :))



>and is it possible to check if a character is an universal alpha using some function currently in Phobos (or will it come with std.utype)?

No and no. But it would be easy for me to add such a function to etc.unicode if you want. It would end up being called isUniversalAlpha(dchar).



>Also, I use File : Stream's toString to get the content of the file (I
>don't care whether this is the most efficient way to do it or not, since
>it makes the processing itself much simpler compared to reading line by
>line). File's toString returns a char [] no matter what, whereas the
>std.ctype functions all take dchars as inputs.
>What's the recommended type to use (char, wchar, dchar)?
>What's the recommended way to convert the char [] to the best type?

That's an application-dependent question, but personally I'd just do
std.utf.toUTF32(char[]).

Arcane Jill
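
A minimal sketch of that conversion, assuming a current Phobos in which std.utf.toUTF32 returns a dstring and std.uni.isAlpha stands in for the per-character classification step. The helper name countAlphabetic and the sample text are hypothetical, chosen only for illustration:

```d
import std.stdio : writeln;
import std.uni : isAlpha;
import std.utf : toUTF32;

/// Hypothetical helper: counts alphabetic code points in UTF-8 text.
size_t countAlphabetic(const(char)[] text)
{
    // Decode UTF-8 into UTF-32 so each element is one code point.
    dstring decoded = toUTF32(text);

    size_t count = 0;
    foreach (dchar c; decoded)
    {
        if (isAlpha(c))
            ++count;
    }
    return count;
}

void main()
{
    // Stand-in for the char[] that File's toString would return.
    string source = "int ñ = 1;";

    writeln(source.length);          // number of UTF-8 code units
    writeln(toUTF32(source).length); // number of code points
    writeln(countAlphabetic(source));
}
```

If a separate dchar array is not needed, `foreach (dchar c; text)` over the char[] decodes one code point at a time without the up-front conversion.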

August 05, 2004

Re: string\utf question

Posted by Arcane Jill
in reply to Arcane Jill

In article <cesppu$19jq$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <ceso29$18p3$1@digitaldaemon.com>, Lars Ivar Igesund says...

Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours says 09:41:38 + 1 (which must be wrong). Curious. Anyway...


>No and no. But it would be easy for me to add such a function to etc.unicode if you want. It would end up being called isUniversalAlpha(dchar).

Thinking about this logically, isUniversalAlpha() would be a custom property, not actually part of the Unicode standard, so it probably doesn't really belong in etc.unicode. In fact, it probably belongs in std.compiler. I could invent etc.compiler and put it there, where it could stay until (if) Walter moves it. Would that make more sense?

Jill (trying to stay organized)
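
A rough sketch of what such a function could look like, wherever it ends up living. The range table here is an assumption: it is only a three-entry excerpt, not the full Annex D list, and it would have to be completed from the standard before real use. The struct and table names are hypothetical:

```d
import std.stdio : writeln;

/// One inclusive range of code points.
struct CodePointRange
{
    uint first;
    uint last;
}

// ASSUMPTION: a small excerpt for illustration only, NOT the full
// Annex D table. Complete it from the standard before relying on it.
immutable CodePointRange[] universalAlphaExcerpt = [
    CodePointRange(0x00C0, 0x00D6), // Latin
    CodePointRange(0x00D8, 0x00F6), // Latin
    CodePointRange(0x4E00, 0x9FA5), // CJK unified ideographs
];

/// Hypothetical helper: true if c falls inside one of the listed ranges.
bool isUniversalAlpha(dchar c)
{
    foreach (r; universalAlphaExcerpt)
    {
        if (c >= r.first && c <= r.last)
            return true;
    }
    return false;
}

void main()
{
    writeln(isUniversalAlpha('\u00F1')); // ñ
    writeln(isUniversalAlpha('\u7FA9')); // a CJK ideograph
    writeln(isUniversalAlpha('1'));      // ASCII digit, not in the table
}
```

A linear scan is enough for a handful of ranges; with the complete table, keeping the ranges sorted and using a binary search would be the more usual choice.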

August 05, 2004

Re: string\utf question

Posted by Lars Ivar Igesund
in reply to Arcane Jill

Arcane Jill wrote:

> In article <cesppu$19jq$1@digitaldaemon.com>, Arcane Jill says...
> 
>>In article <ceso29$18p3$1@digitaldaemon.com>, Lars Ivar Igesund says...
> 
> 
> Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My
> post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours
> says 09:41:38 + 1 (which must be wrong). Curious. Anyway...

That is indeed curious. My system clock is only 5 minutes fast, but the post is in my sent folder with the 9:41 time. OK, this post should get the time 16:50 (GMT+1) if everything's correct.

> 
>>No and no. But it would be easy for me to add such a function to etc.unicode if
>>you want. It would end up being called isUniversalAlpha(dchar).
> 
> 
> Thinking about this logically, isUniversalAlpha() would be a custom property,
> not actually part of the Unicode standard, so it probably doesn't really belong
> in etc.unicode. In fact, it probably belongs in std.compiler. I could invent
> etc.compiler and put it there, where it could stay until (if) Walter moves it.
> Would that make more sense?

Well, maybe I can write that one myself and just keep it in my project until it is included in Phobos (unless it is very complicated). Currently I have no need for (other) external libraries. Anyway, thanks for the answers.

Lars Ivar Igesund

August 05, 2004

Re: string\utf question

Posted by Lars Ivar Igesund
in reply to Lars Ivar Igesund

Lars Ivar Igesund wrote:

> Arcane Jill wrote:
> 
>> In article <cesppu$19jq$1@digitaldaemon.com>, Arcane Jill says...
>>
>>> In article <ceso29$18p3$1@digitaldaemon.com>, Lars Ivar Igesund says...
>>
>>
>>
>> Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My
>> post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours
>> says 09:41:38 + 1 (which must be wrong). Curious. Anyway...
> 
> 
> That is indeed curious, my system clock is only 5 minutes quick, but the post is in my sent folder with the 9:41 time. Ok, this post should get the time 16:50 (GMT+1) if everythings correct.

It shows up correctly here, so something is strange somewhere (or was when I posted that message this morning). Many posts on the newsgroup have strange times (IMO), possibly because time zones are handled differently in different clients (and maybe on the server). Also, your message shows up with the time 9:06 (GMT+1, that is) in my Thunderbird. Maybe discrepancies pop up when some time passes between a message being sent and being accepted at the server, combined with time zone differences.

Well, I don't know; the *real* answer is probably that you have a time machine and went back in time to answer my questions.

Lars Ivar Igesund

August 05, 2004

Re: string\utf question

Posted by Lars Ivar Igesund
in reply to Arcane Jill

Arcane Jill wrote:

> 
> Thinking about this logically, isUniversalAlpha() would be a custom property,
> not actually part of the Unicode standard, so it probably doesn't really belong
> in etc.unicode. In fact, it probably belongs in std.compiler. I could invent
> etc.compiler and put it there, where it could stay until (if) Walter moves it.
> Would that make more sense?
> 
> Jill (trying to stay organized)

I looked at the document you pointed out (together with the docs...), but the ranges there include Digits, and digits aren't among the Universal Alphas allowed as an IdentifierStart. Sorry for acting stupid here, but are the Digits and Special Characters from Annex D part of the Universal Alphas, or are there unmentioned exceptions? Also, trying to add any of these characters to my identifier names using Vim's hexadecimal mode fails miserably, but that's probably an error on my side.

Lars Ivar Igesund

August 05, 2004

Re: string\utf question

Posted by Arcane Jill
in reply to Lars Ivar Igesund

In article <cetmo6$1qpe$1@digitaldaemon.com>, Lars Ivar Igesund says...

>I looked at the document you pointed out (together with the docs...), but the ranges there include Digits, and digits aren't part of the Universal Alphas allowed to use as an IdentifierStart.

Well, digits are allowed in identifiers - just not at the start.

>Sorry for acting stupid here, but are Digits and Special Characters from Annex D part of the Universal Alphas,

Yes, they are.

>or are there unmentioned exceptions?

Not so far as I am aware.

In http://www.digitalmars.com/d/lex.html, it says: "Identifiers start with a letter, _, or unicode alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.) Identifiers can be arbitrarily long, and are case sensitive. Identifiers starting with __ (two underscores) are reserved."

So it looks like "universal alphas" are not actually permitted as the /first/ character of a D identifier, only as the second or subsequent char. The /first/ character is apparently allowed to be a "unicode alpha", not a "universal alpha".

Of course, this raises the question "what is a unicode alpha"? The docs don't define this. Almost certainly, Walter means "a Unicode character which has the Alphabetic property", but I don't know that for sure. I would suggest this needs to be clarified in the documentation. The definition should also state /which version/ of Unicode the D compiler uses, since new "unicode alphas" will be added with each new version of Unicode. (Otherwise you could end up in the curious state whereby new Unicode letters would be allowed at the start of an identifier but not in the middle or end!)

Moreover, you probably wouldn't /want/ the definition of an identifier to change with each new release of Unicode.

Suggestion to Walter: you could redefine an identifier start to be: "an ASCII letter, underscore, or any universal alpha which has the Unicode Alphabetic property" (and, obviously, ensure that this definition is met, which it probably is already). Now you don't need to state a Unicode version number, because you're dealing only with a fixed and stable subset of Unicode.
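
The suggested rule can be sketched roughly as follows. This is only an illustration under stated assumptions: std.uni.isAlpha from current Phobos stands in for "has the Unicode Alphabetic property", the universal-alpha test is passed in as a parameter rather than implemented, and demoUniversalAlpha is a made-up stand-in, not the Annex D table. All function names are hypothetical:

```d
import std.ascii : isAsciiAlpha = isAlpha, isDigit;
import std.uni : isUnicodeAlpha = isAlpha;

alias Predicate = bool function(dchar);

/// Suggested start rule: ASCII letter, underscore, or a universal
/// alpha that also has the Unicode Alphabetic property.
bool isIdentifierStart(dchar c, Predicate isUniversalAlpha)
{
    return isAsciiAlpha(c) || c == '_'
        || (isUniversalAlpha(c) && isUnicodeAlpha(c));
}

/// Later characters: letters, _, digits, or universal alphas.
bool isIdentifierChar(dchar c, Predicate isUniversalAlpha)
{
    return isAsciiAlpha(c) || isDigit(c) || c == '_'
        || isUniversalAlpha(c);
}

/// Checks a whole candidate identifier given as UTF-8 text.
bool isIdentifier(const(char)[] text, Predicate isUniversalAlpha)
{
    if (text.length == 0)
        return false;

    bool first = true;
    foreach (dchar c; text) // decodes one code point per iteration
    {
        if (first ? !isIdentifierStart(c, isUniversalAlpha)
                  : !isIdentifierChar(c, isUniversalAlpha))
            return false;
        first = false;
    }
    return true;
}

// ASSUMPTION: demo-only stand-in, NOT the real Annex D table.
bool demoUniversalAlpha(dchar c)
{
    return c == '\u00F1' || c == '\u7FA9';
}

void main()
{
    import std.stdio : writeln;

    writeln(isIdentifier("a\u00F1o", &demoUniversalAlpha));
    writeln(isIdentifier("_x1", &demoUniversalAlpha));
    writeln(isIdentifier("1x", &demoUniversalAlpha));
}
```

Passing the predicate in keeps the open question (what exactly counts as a universal alpha) separate from the start/continue logic, so the table can be swapped without touching the rule.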


>Also, trying to add any of these characters to my identifier names using Vim's hexadecimal mode fails miserably, but that's probably an error on my side.

I know nothing about vim.

August 05, 2004

Re: string\utf question

Posted by Lars Ivar Igesund
in reply to Arcane Jill

Arcane Jill wrote:

> In article <cetmo6$1qpe$1@digitaldaemon.com>, Lars Ivar Igesund says...
> 
> 
>>I looked at the document you pointed out (together with the docs...), but the ranges there include Digits, and digits aren't part of the Universal Alphas allowed to use as an IdentifierStart.
> 
> 
> Well, digits are allowed in identifiers - just not at the start.
> 
> 
>>Sorry for acting stupid here, but are Digits and Special Characters from Annex D part of the Universal Alphas,
> 
> 
> Yes, they are.
> 
> 
>>or are there unmentioned exceptions?
> 
> 
> Not so far as I am aware.
> 
> In http://www.digitalmars.com/d/lex.html, it says: "Identifiers start with a
> letter, _, or unicode alpha, and are followed by any number of letters, _,
> digits, or universal alphas. Universal alphas are as defined in ISO/IEC
> 9899:1999(E) Appendix D. (This is the C99 Standard.) Identifiers can be
> arbitrarily long, and are case sensitive. Identifiers starting with __ (two
> underscores) are reserved."
> 
> So it looks like "universal alphas" are not actually permitted as the /first/
> character of a D identifier, only as the second or subsequent char. The /first/
> character is apparently allowed to be a "unicode alpha", not a "universal
> alpha".
> 
> Of course, this begs the question "what is a unicode alpha"? The docs don't
> define this. Almost certainly, Walter means "a Unicode character which has the
> Alphabetic property", but I don't know that for sure. I would suggest this needs
> to be clarified in the documentation. The definition should also state /which
> version/ of Unicode the D compiler uses, since new "unicode alphas" will be
> added with each new version of Unicode. (Otherwise you could end up in the
> curious state whereby new Unicode letters would be allowed at the start of an
> identifier but not in the middle or end!)
> 
> Moreover, you probably wouldn't /want/ the definition of an identifier to change
> with each new release of Unicode. 
> 
> Suggestion to Walter: you could redefine an identifier start to be: "an ASCII
> letter, underscore, or any universal alpha which has the Unicode Alphabetic
> property" (and, obviously, ensure that this definition is met, which it probably
> is already). Now you don't need to state a Unicode version number, because
> you're dealing only with a fixed and stable subset of Unicode.

Hmm, that didn't answer anything; you just came to the same conclusion as me regarding the somewhat lacking documentation :) Another point: above the text you quoted, the grammar reads
"
IdentifierStart:
	_
	Letter
	UniversalAlpha
"

Note that there is no mention of Unicode Alpha, neither there nor elsewhere, except in the excerpt you mentioned.

Walter, an obvious bug in the documentation has been found. A fix would be received with joyous celebrations across the globe (or at least along an axis between me and Jill). So would a clarification in this thread :)

>>Also, trying to add any of these characters to my identifier names using Vim's hexadecimal mode fails miserably, but that's probably an error on my side.
> 
> 
> I know nothing about vim.

Well, what I did was add characters from the list to identifiers, but dmd didn't accept them. It is possible that the file wasn't saved in the correct encoding, but that didn't look like the problem. I might look at it again later when I have something to test.

Lars Ivar Igesund
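
One way to rule the encoding in or out is to check whether the file's bytes are well-formed UTF-8. A minimal sketch, assuming current Phobos, where std.utf.validate throws a UTFException on malformed input. The helper name isValidUtf8 is hypothetical, and the byte arrays are hand-written stand-ins for file contents (which could be obtained with std.file.read):

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if the bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // Reinterpret the raw bytes as UTF-8 code units and check them.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    import std.stdio : writeln;

    // "int ñ;" saved as Latin-1: ñ is the single byte 0xF1.
    ubyte[] latin1 = [0x69, 0x6E, 0x74, 0x20, 0xF1, 0x3B];
    // "int ñ;" saved as UTF-8: ñ is the two bytes 0xC3 0xB1.
    ubyte[] utf8 = [0x69, 0x6E, 0x74, 0x20, 0xC3, 0xB1, 0x3B];

    writeln(isValidUtf8(latin1));
    writeln(isValidUtf8(utf8));
}
```

A file saved in a single-byte encoding such as Latin-1 fails this check as soon as it contains a non-ASCII letter, which would be consistent with the compiler rejecting it.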

August 06, 2004

Re: string\utf question

Posted by J C Calvarese
in reply to Lars Ivar Igesund

Attachments:

utf8.zip

Lars Ivar Igesund wrote:
> Arcane Jill wrote:
> 
>> In article <cetmo6$1qpe$1@digitaldaemon.com>, Lars Ivar Igesund says...
...

> Hmm, that didn't answer anything, you just came to the same conclusion
> as me regarding the somewhat lacking documentation :) Another point,
> above the text you quoted,
> "
> IdentifierStart:
>     _
>     Letter
>     UniversalAlpha
> "
> 
> Note that there is no mention of Unicode Alpha, neither there or elsewhere except in the excerpt you mentioned.

I think when he mentioned "Unicode Alpha", he meant "UniversalAlpha".

> 
> Walter, obvious bug in documentation has been found. A fix would be received with joyous celebrations across the globe (or at least in an axis between me and Jill). And a clarification in this thread :)
> 
>>> Also, trying to add any of these characters to my identifier names using Vim's hexadecimal mode fails miserably, but that's probably an error on my side.
>>
>>
>>
>> I know nothing about vim.
> 
> 
> Well, what I did, was to add characters from the list as part of identifiers, but dmd didn't accept them. It is possible that the file wasn't saved in the correct format, but it didn't look like the problem. I might look at again later when I have something to test.
> 
> Lars Ivar Igesund

I don't know about your test, but I got D to accept a Spanish letter (ñ) and a Chinese character (義) as identifiers.

In case this gets garbled in transmission, I attached a .zip.


const char[] ñ = "eñe";
const char[] 義 = "justice";

import std.stdio;

void main()
{
     writefln("Feliz Cumpleaños.");
     writefln(ñ);
     writefln(義);

     /* It doesn't print right, but that's probably DOS's fault. */
}


-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/

August 06, 2004

Re: [OT] reply before post? (was string\utf question)

Posted by J C Calvarese
in reply to Arcane Jill

Arcane Jill wrote:
> In article <cesppu$19jq$1@digitaldaemon.com>, Arcane Jill says...
> 
>>In article <ceso29$18p3$1@digitaldaemon.com>, Lars Ivar Igesund says...
> 
> 
> Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My
> post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours
> says 09:41:38 + 1 (which must be wrong). Curious. Anyway...

The web interface runs on tachyons. ;)

Actually, I've seen this happen before. I think it's related to the delay in time before a message appears on the web when it's posted through the web interface. The order looks normal if you're viewing through Thunderbird.

Go figure.

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
