Thread overview
Chars ASCII 128+ in Literals?
Mar 29, 2005 - AEon
Mar 29, 2005 - Regan Heath
Mar 29, 2005 - AEon
Mar 30, 2005 - Regan Heath
Mar 30, 2005 - AEon
Mar 30, 2005 - Chris Sauls
Apr 04, 2005 - psychotic
March 29, 2005
For a short moment I had feared all my char[] based code would not work with ASCII characters > 127. But that seems to work...

But when trying to test my function, via:

char[]  t = "Æmoo";   ->	aepar.d(69): invalid UTF-8 sequence


I then tried this, since it may be a mapping issue:

ubyte[] t = "Æmoo";
char[] tc = cast(char[]) t;
writefln( "\""~tc~"\" -> \""~remove_q1_Color_Names(tc)~"\"");

But that does not help either.

It seems that string literals cannot contain characters that are > ASCII 127?

Hmmm...

AEon
March 29, 2005
On Tue, 29 Mar 2005 23:06:29 +0200, AEon <aeon2001@lycos.de> wrote:
> For a short moment I had feared all my char[] based code would not work with ASCII characters > 127. But that seems to work...
>
> But when trying to test my function, via:
>
> char[]  t = "Æmoo";   ->	aepar.d(69): invalid UTF-8 sequence
>
>
> I then tried this, since it may be a mapping issue:
>
> ubyte[] t = "Æmoo";
> char[] tc = cast(char[]) t;
> writefln( "\""~tc~"\" -> \""~remove_q1_Color_Names(tc)~"\"");
>
> But that does not help either.
>
> It seems that string literals cannot contain characters that are > ASCII 127?

Are you saving this source file as UTF-8?
D source files *must* be saved as one of the UTF variants.

Regan

p.s. I would use , instead of ~ in the writefln above: ~ appends the strings together, creating extra temporary strings, whereas , simply prints each piece in turn, with no extra temporaries.
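
For example, a rough, untested sketch of the two variants (I've dropped the remove_q1_Color_Names call, since that function isn't shown here):

import std.stdio;

void main()
{
    char[] tc = "moo";

    // ~ concatenates everything into one temporary string before printing:
    writefln("\"" ~ tc ~ "\" -> \"" ~ tc ~ "\"");

    // , passes the pieces as separate arguments, so writefln prints them
    // in order without building any temporary strings:
    writefln("\"", tc, "\" -> \"", tc, "\"");
}

Both lines print the same text; only the second avoids the intermediate allocations.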
March 29, 2005
Regan Heath wrote:

> Are you saving this source file as UTF-8?
> D source files *must* be saved as one of the UTF variants.

Hmmm...UltraEdit has:

Auto detect UTF-8 files (On)
Write UTF-8 BOM header to all UTF-8 files when saved (On)
Write UTF-8 on new files created with UltraEdit (if above is not set) (On)

But since I don't even know what UTF-8 is supposed to be, or why it should matter... hmmm

AEon


March 30, 2005
AEon wrote:
> For a short moment I had feared all my char[] based code would not work with ASCII characters > 127. But that seems to work...
> 
> But when trying to test my function, via:
> 
> char[]  t = "Æmoo";   ->    aepar.d(69): invalid UTF-8 sequence

Try this:
# char[]  t = "\&AElig;moo";

This is an example of "Named Character Entities," a new feature as of DMD 0.116 -- Read more here:
http://www.digitalmars.com/d/lex.html#EscapeSequence
http://www.digitalmars.com/d/entity.html
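
A complete (untested) sketch along the same lines, assuming DMD 0.116 or newer:

# import std.stdio;
#
# void main()
# {
#     // \&AElig; is the named entity for 'Æ', so the source file can
#     // stay plain ASCII and still produce the character:
#     char[] t = "\&AElig;moo";
#     writefln(t);
# }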

-- Chris Sauls
March 30, 2005
On Wed, 30 Mar 2005 00:04:01 +0200, AEon <aeon2001@lycos.de> wrote:
> Regan Heath wrote:
>
>> Are you saving this source file as UTF-8?
>> D source files *must* be saved as one of the UTF variants.
>
> Hmmm...UltraEdit has:
>
> Auto detect UTF-8 files (On)
> Write UTF-8 BOM header to all UTF-8 files when saved (On)
> Write UTF-8 on new files created with UltraEdit (if above is not set) (On)
>
> But since I don't even know what UTF-8 is supposed to be, or why it should matter... hmmm

UTF-8, UTF-16, and UTF-32 are encodings which can encode any Unicode character. Unicode aims to assign a code to every known existing character in every language in the world (or at least that's the idea).

UTF-8 uses 8-bit code units; one or more code units are used to represent a single character.
UTF-16 uses 16-bit code units; one or more code units are used to represent a single character.
UTF-32 uses 32-bit code units; one code unit is used to represent a single character.

UTF-16 and UTF-32 can be in BE (Big Endian) or LE (Little Endian) form. In short, BE means the MSB (most significant byte) of the code unit appears first, followed by the LSB (least significant byte). LE is the opposite of BE.
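
For instance, something like this (a rough, untested sketch; it assumes the source file is saved as UTF-8, as discussed below) shows how the same literal takes a different number of code units in each encoding:

import std.stdio;

void main()
{
    // The same text stored in D's three string types:
    char[]  u8  = "Æmoo";  // UTF-8:  'Æ' needs two 8-bit code units
    wchar[] u16 = "Æmoo";  // UTF-16: each of these characters fits in one 16-bit code unit
    dchar[] u32 = "Æmoo";  // UTF-32: one 32-bit code unit per character

    // .length counts code units, not characters:
    writefln(u8.length, " ", u16.length, " ", u32.length);  // prints: 5 4 4
}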

It's my understanding that if you save your source file as UTF-8, a literal like "Æmoo" will be stored as a valid UTF-8 sequence, so you should be able to say:

char[] t = "Æmoo";

and it should compile and run.

The next trick/problem is that the console you write it to must support UTF-8 and be in UTF-8 mode. By default the windows and I believe unix consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I am not sure how this is achieved, hopefully someone else will fill us both in :)

Regan
March 30, 2005
Regan Heath wrote:
> On Wed, 30 Mar 2005 00:04:01 +0200, AEon <aeon2001@lycos.de> wrote:
> 
>> Regan Heath wrote:
>>
>>> Are you saving this source file as UTF-8?
>>> D source files *must* be saved as one of the UTF variants.
>>
>>
>> Hmmm...UltraEdit has:
>>
>> Auto detect UTF-8 files (On)
>> Write UTF-8 BOM header to all UTF-8 files when saved (On)
>> Write UTF-8 on new files created with UltraEdit (if above is not set)  (On)
>>
>> But since I don't even know what UTF-8 is supposed to be, or why it should matter... hmmm
> 
> UTF-8, UTF-16, and UTF-32 are encodings which can encode any Unicode character. Unicode aims to assign a code to every known existing character in every language in the world (or at least that's the idea).
> 
> UTF-8 uses 8-bit code units; one or more code units are used to represent a single character.
> UTF-16 uses 16-bit code units; one or more code units are used to represent a single character.
> UTF-32 uses 32-bit code units; one code unit is used to represent a single character.
> 
> UTF-16 and UTF-32 can be in BE (Big Endian) or LE (Little Endian) form. In short, BE means the MSB (most significant byte) of the code unit appears first, followed by the LSB (least significant byte). LE is the opposite of BE.

Thanx for explaining that. I had skimmed over it in the manual and did not really find it relevant. I am probably too much of an old-timer who thinks 8-bit is good enough ;)

> It's my understanding that if you save your source file as UTF-8, a literal like "Æmoo" will be stored as a valid UTF-8 sequence, so you should be able to say:
> 
> char[] t = "Æmoo";
> 
> and it should compile and run.

I will do some tests. Possibly I did not create a "new" file and thus inherited some DOS or default Windows text format that is not UTF-8. In that case UltraEdit would not change it to UTF-8.

> The next trick/problem is that the console you write it to must support  UTF-8 and be in UTF-8 mode. By default the windows and I believe unix  consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I am  not sure how this is achieved, hopefully someone else will fill us both in  :)

I noticed this when running my ASCII 0-255 test string through my conversion function. The DOS console will not properly show the Æ (possibly a font issue); it shows some weird-looking "F" character instead. When you copy/paste the console output into UltraEdit, the Æ appears as |, but if you redirect the exe's output to a text file (e.g. aepar -q3a > .tmp) and then load .tmp in UltraEdit, the Æ is shown properly.

So, as you pointed out, there are a few issues between the DOS console and chars > ASCII 127.

AEon
March 30, 2005
Regan Heath wrote:
> The next trick/problem is that the console you write it to must support  UTF-8 and be in UTF-8 mode. By default the windows and I believe unix  consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I am  not sure how this is achieved, hopefully someone else will fill us both in  :)
> 

I think Linux supports it natively (at least I haven't had any problems with UTF-8 output, though I did have to configure it on Ubuntu).

On Windows (I don't know whether this holds for all Windows versions), you can do it like this: "chcp 65001". It isn't set by default, and I don't know how to make it the default code page.
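
For instance, a rough (untested) sketch that does the same thing from inside a D program, declaring the Win32 SetConsoleOutputCP function by hand rather than relying on a particular import module:

import std.stdio;

// BOOL SetConsoleOutputCP(UINT wCodePageID) from kernel32:
extern (Windows) int SetConsoleOutputCP(uint codePageID);

void main()
{
    // 65001 is the UTF-8 code page, the same one "chcp 65001" selects:
    SetConsoleOutputCP(65001);
    writefln("Æmoo");
}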

> Regan

_______________________
Carlos Santander Bernal
April 04, 2005
Maybe you will find this example I posted on dsource.org interesting, although it's not cross-platform: [ http://www.dsource.org/tutorials/index.php?show_example=147 ]. To quote the description: "This Windows specific code, allows you to print international characters (Greek for instance) on the non UTF-8 Windows console".

Best Regards
~psychotic