March 29, 2005
Chars ASCII 128+ in Literals?
For a short moment I had feared all my char[] based code would not work 
with ASCII characters > 127. But that seems to work...

But when trying to test my function, via:

char[]  t = "Æmoo";   ->	aepar.d(69): invalid UTF-8 sequence


I then tried this, since it may be a mapping issue:

ubyte[] t = "Æmoo";
char[] tc = cast(char[]) t;
writefln( "\""~tc~"\" -> \""~remove_q1_Color_Names(tc)~"\"");

But that does not help either.

It seems that string literals cannot contain characters that are > ASCII 127?

Hmmm...

AEon
March 29, 2005
Re: Chars ASCII 128+ in Literals?
On Tue, 29 Mar 2005 23:06:29 +0200, AEon <aeon2001@lycos.de> wrote:
> For a short moment I had feared all my char[] based code would not work  
> with ASCII characters > 127. But that seems to work...
>
> But when trying to test my function, via:
>
> char[]  t = "Æmoo";   ->	aepar.d(69): invalid UTF-8 sequence
>
>
> I then tried this, since it may be a mapping issue:
>
> ubyte[] t = "Æmoo";
> char[] tc = cast(char[]) t;
> writefln( "\""~tc~"\" -> \""~remove_q1_Color_Names(tc)~"\"");
>
> But that does not help either.
>
> It seems that string literals cannot contain characters that are > ASCII  
> 127?

Are you saving this source file as UTF-8?
D source files *must* be saved as one of the UTF variants.

Regan

p.s. I would use , instead of ~ in the writefln above: ~ concatenates the  
strings, creating extra temporary strings, whereas , simply prints the  
arguments one at a time with no temporaries.
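
Regan's tip can be sketched as follows. This is only an illustration, assuming a current D compiler and Phobos, where writeln prints its arguments one after another and writefln takes a single format string first. removeColorNames is a hypothetical stand-in for the remove_q1_Color_Names function from the original post, which is not shown in the thread.

```d
import std.stdio : writeln, writefln;

// Hypothetical stand-in for remove_q1_Color_Names; it returns its input unchanged.
string removeColorNames(string s)
{
    return s;
}

void main()
{
    string tc = "\u00C6moo"; // "Æmoo", written with an escape

    // Concatenation: each ~ builds a new temporary string before printing.
    writeln("\"" ~ tc ~ "\" -> \"" ~ removeColorNames(tc) ~ "\"");

    // Separate arguments: each piece is printed in turn, nothing is concatenated.
    writeln("\"", tc, "\" -> \"", removeColorNames(tc), "\"");

    // Format string: the same output using %s placeholders.
    writefln("\"%s\" -> \"%s\"", tc, removeColorNames(tc));
}
```

All three calls print the same line; only the first one allocates intermediate strings.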
March 29, 2005
Re: Chars ASCII 128+ in Literals?
Regan Heath wrote:

> Are you saving this source file as UTF-8?
> D source files *must* be saved as one of the UTF variants.

Hmmm...UltraEdit has:

Auto detect UTF-8 files (On)
Write UTF-8 BOM header to all UTF-8 files when saved (On)
Write UTF-8 on new files created with UltraEdit (if above is not set) (On)

But since I don't even know what UTF-8 is supposed to be, and why that 
should matter... hmmm

AEon
March 30, 2005
Re: Chars ASCII 128+ in Literals?
AEon wrote:
> For a short moment I had feared all my char[] based code would not work 
> with ASCII characters > 127. But that seems to work...
> 
> But when trying to test my function, via:
> 
> char[]  t = "Æmoo";   ->    aepar.d(69): invalid UTF-8 sequence

Try this:
# char[]  t = "\&AElig;moo";

This is an example of "Named Character Entities," a new feature as of 
DMD 0.116 -- Read more here:
http://www.digitalmars.com/d/lex.html#EscapeSequence
http://www.digitalmars.com/d/entity.html
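
A minimal sketch of how the entity escape compares with a plain Unicode escape, assuming a current D compiler that still accepts named character entities (the string type is used here; in the 2005 compiler the declaration would be char[] as above). Both forms keep the source file pure ASCII, so the editor's encoding setting no longer matters for this literal.

```d
void main()
{
    string viaEntity = "\&AElig;moo"; // named character entity for Æ (U+00C6)
    string viaEscape = "\u00C6moo";   // the same character as a \u escape

    // Both literals produce the same UTF-8 bytes.
    assert(viaEntity == viaEscape);

    // Æ takes two UTF-8 code units, so the array holds 5 chars, not 4.
    assert(viaEntity.length == 5);
}
```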

-- Chris Sauls
March 30, 2005
Re: Chars ASCII 128+ in Literals?
On Wed, 30 Mar 2005 00:04:01 +0200, AEon <aeon2001@lycos.de> wrote:
> Regan Heath wrote:
>
>> Are you saving this source file as UTF-8?
>> D source files *must* be saved as one of the UTF variants.
>
> Hmmm...UltraEdit has:
>
> Auto detect UTF-8 files (On)
> Write UTF-8 BOM header to all UTF-8 files when saved (On)
> Write UTF-8 on new files created with UltraEdit (if above is not set)  
> (On)
>
> But since I don't even know what UTF-8 is supposed to be, and why that  
> should matter... hmmm

UTF-8, UTF-16, and UTF-32 are encodings that can each represent any Unicode  
character. Unicode assigns a number (a code point) to every known  
character in every language in the world (or at least that's the  
idea).

UTF-8 uses 8-bit code units; 1 to 4 code units represent a  
single character.
UTF-16 uses 16-bit code units; 1 or 2 code units represent a  
single character.
UTF-32 uses 32-bit code units; exactly 1 code unit represents a  
single character.

UTF-16 and UTF-32 can be in BE (Big Endian) or LE (Little Endian) form. In  
short, BE means the most significant byte of each code unit appears  
first, followed by the less significant bytes. LE is the opposite of  
BE.
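
To make the code unit counts concrete, here is a small sketch, assuming a current D compiler. The literals use \u and \U escapes so the result does not depend on how the source file is saved. The musical G clef (U+1D11E) is only an arbitrary example of a character outside the 16-bit range.

```d
void main()
{
    // "Æmoo" in the three encodings D supports for string literals.
    string  s8  = "\u00C6moo";  // UTF-8,  char code units
    wstring s16 = "\u00C6moo"w; // UTF-16, wchar code units
    dstring s32 = "\u00C6moo"d; // UTF-32, dchar code units

    // .length counts code units, not characters.
    assert(s8.length  == 5); // Æ needs 2 UTF-8 code units, m/o/o need 1 each
    assert(s16.length == 4); // every character here fits in 1 UTF-16 code unit
    assert(s32.length == 4); // UTF-32 always uses 1 code unit per character

    // A character above U+FFFF shows the difference more clearly.
    assert("\U0001D11E".length  == 4); // 4 UTF-8 code units
    assert("\U0001D11E"w.length == 2); // a UTF-16 surrogate pair
    assert("\U0001D11E"d.length == 1); // 1 UTF-32 code unit
}
```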

It's my understanding that if you save your source file as UTF-8 then if  
it contains "Æmoo" the literal will be saved as a valid UTF-8 sequence so  
you should be able to say:

char[] t = "Æmoo";

and it should compile and run.
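
The following sketch shows why the file encoding matters. It assumes the original file was saved in a single-byte encoding such as Latin-1 or Windows-1252, where Æ is the single byte 0xC6; the thread does not confirm which encoding was actually used. isValidUtf8 is a hypothetical helper built on std.utf.validate from current Phobos.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if the bytes form a valid UTF-8 sequence.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // "Æmoo" as a UTF-8 editor would write it: Æ becomes 0xC3 0x86.
    ubyte[] savedAsUtf8 = [0xC3, 0x86, 0x6D, 0x6F, 0x6F];

    // "Æmoo" as a Latin-1 editor would write it: Æ is the lone byte 0xC6.
    ubyte[] savedAsLatin1 = [0xC6, 0x6D, 0x6F, 0x6F];

    assert(isValidUtf8(savedAsUtf8));

    // 0xC6 announces a two-byte sequence, but 'm' (0x6D) is not a
    // continuation byte, so the sequence is rejected.
    assert(!isValidUtf8(savedAsLatin1));
}
```

The second case is the kind of byte sequence that would lead a compiler expecting UTF-8 to report an invalid sequence.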

The next trick/problem is that the console you write it to must support  
UTF-8 and be in UTF-8 mode. By default the windows and I believe unix  
consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I am  
not sure how this is achieved, hopefully someone else will fill us both in  
:)

Regan
March 30, 2005
Re: Chars ASCII 128+ in Literals?
Regan Heath wrote:
> On Wed, 30 Mar 2005 00:04:01 +0200, AEon <aeon2001@lycos.de> wrote:
> 
>> Regan Heath wrote:
>>
>>> Are you saving this source file as UTF-8?
>>> D source files *must* be saved as one of the UTF variants.
>>
>>
>> Hmmm...UltraEdit has:
>>
>> Auto detect UTF-8 files (On)
>> Write UTF-8 BOM header to all UTF-8 files when saved (On)
>> Write UTF-8 on new files created with UltraEdit (if above is not set)  
>> (On)
>>
>> But since I don't even know what UTF-8 is supposed to be, and why that  
>> should matter... hmmm
> 
> UTF-8, UTF-16, and UTF-32 are encodings that can each represent any Unicode  
> character. Unicode assigns a number (a code point) to every known  
> character in every language in the world (or at least that's the  
> idea).
> 
> UTF-8 uses 8-bit code units; 1 to 4 code units represent a  
> single character.
> UTF-16 uses 16-bit code units; 1 or 2 code units represent a  
> single character.
> UTF-32 uses 32-bit code units; exactly 1 code unit represents a  
> single character.
> 
> UTF-16 and UTF-32 can be in BE (Big Endian) or LE (Little Endian) form.  
> In short, BE means the most significant byte of each code unit appears  
> first, followed by the less significant bytes. LE is the opposite of  
> BE.

Thanx for explaining that. I had skimmed over it in the manual and did 
not really find it relevant. I am probably too much of an old-timer, 
who thinks 8bit is good enough ;)

> It's my understanding that if you save your source file as UTF-8 then 
> if  it contains "Æmoo" the literal will be saved as a valid UTF-8 
> sequence so  you should be able to say:
> 
> char[] t = "Æmoo";
> 
> and it should compile and run.

I will do some tests. Possibly I did not create a "new" file, and thus 
inherited some DOS or default windows txt format, that is not UTF-8. In 
that case UltraEdit would not change it to UTF-8.

> The next trick/problem is that the console you write it to must support  
> UTF-8 and be in UTF-8 mode. By default the windows and I believe unix  
> consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I 
> am  not sure how this is achieved, hopefully someone else will fill us 
> both in  :)

I noticed this when running my ASCII 0-255 test string through my conversion 
function. The DOS console will not show the Æ properly (possibly a font 
issue); it shows some weird-looking "F" character. When you 
copy/paste the console output into UltraEdit, the Æ appears as |. But 
if you redirect the exe's output to a text file (e.g. aepar -q3a > .tmp) 
and then load .tmp in UltraEdit, the Æ is shown properly.

So, as you pointed out, there are a few issues between the DOS console 
and chars > ASCII 127.

AEon
March 30, 2005
Re: Chars ASCII 128+ in Literals?
Regan Heath wrote:
> The next trick/problem is that the console you write it to must support  
> UTF-8 and be in UTF-8 mode. By default the windows and I believe unix  
> consoles are not in UTF-8 mode and you need to switch to UTF-8 mode. I 
> am  not sure how this is achieved, hopefully someone else will fill us 
> both in  :)
> 

I think linux supports it natively (at least I haven't had any problems 
with UTF-8 output, even if I had to configure it on Ubuntu).

On Windows (I don't know if all Windows), you can do it like this: "chcp 
65001". By default, it isn't set. I don't know how to make it the 
default codepage.
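
The same switch can also be made from inside a program. A minimal sketch, assuming a current D compiler and the Win32 bindings shipped in druntime (core.sys.windows.windows); SetConsoleOutputCP and GetConsoleOutputCP are regular Win32 console functions, and 65001 is the UTF-8 code page mentioned above. enableUtf8Console is a hypothetical helper name. On other platforms the sketch simply assumes the terminal already expects UTF-8.

```d
import std.stdio : writeln;

version (Windows)
{
    import core.sys.windows.windows;
}

// Hypothetical helper: asks the console to interpret output as UTF-8.
// Returns true on success, or when no switch is attempted.
bool enableUtf8Console()
{
    version (Windows)
    {
        // Equivalent to running "chcp 65001" for this process's console.
        return SetConsoleOutputCP(65001) != 0;
    }
    else
    {
        // Assumption: the terminal is already in UTF-8 mode.
        return true;
    }
}

void main()
{
    version (Windows)
    {
        // Remember the previous code page so it can be restored on exit.
        immutable previous = GetConsoleOutputCP();
        scope (exit) SetConsoleOutputCP(previous);
    }

    if (enableUtf8Console())
        writeln("\u00C6moo");
}
```

Whether the character is then drawn correctly still depends on the console font, so this only covers the code page part of the problem.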

> Regan

_______________________
Carlos Santander Bernal
April 04, 2005
Re: Chars ASCII 128+ in Literals?
Maybe you will find this example I posted on dsource.org interesting, although
it is not cross-platform: [
http://www.dsource.org/tutorials/index.php?show_example=147 ]. To quote the
description: "This Windows specific code, allows you to print international
characters (Greek for instance) on the non UTF-8 Windows console".

Best Regards
~psychotic