UTF-8 problems - D Programming Language Discussion Forum


June 12, 2006

Posted by Deewiant


import std.stream, std.cstream;

// åäöΔ

void main() {
	Stream file = new File(__FILE__, FileMode.In);
	// alternatively:
	//Stream file = din;

	while (!file.eof)
		dout.writef("%s", file.getc);
}
--

With the above UTF-8 code, I expect the program's source to be output, also in UTF-8. However, I get ASCII output, and on line three appears everyone's favourite "Error: 4invalid UTF-8 sequence".

Furthermore, unless I use the "alternative" where std.cstream.din is used, the two line breaks after "std.cstream;" are not \r\n as they should be in the DOS encoding I use, they are \r\r\n. Converting the line breaks to just \n causes them to become \r\n in the output. Whence the extra \r?

What's strange is if I use e.g. readLine instead of getc, everything is fine. Since readLine seems to use getc internally, I'm having trouble understanding why this is the case.

A bug or two, or where am I going wrong?

June 12, 2006

Re: UTF-8 problems

Posted by Oskar Linde
in reply to Deewiant


Deewiant wrote:
> import std.stream, std.cstream;
> 
> // åäöΔ
> 
> void main() {
> 	Stream file = new File(__FILE__, FileMode.In);
> 	// alternatively:
> 	//Stream file = din;
> 
> 	while (!file.eof)
> 		dout.writef("%s", file.getc);
> }
> --
> 
> With the above UTF-8 code, I expect the program's source to be output, also in
> UTF-8. However, I get ASCII output, and on line three appears everyone's
> favourite "Error: 4invalid UTF-8 sequence".
> 
> Furthermore, unless I use the "alternative" where std.cstream.din is used, the
> two line breaks after "std.cstream;" are not \r\n as they should be in the DOS
> encoding I use, they are \r\r\n. Converting the line breaks to just \n causes
> them to become \r\n in the output. Whence the extra \r?
> 
> What's strange is if I use e.g. readLine instead of getc, everything is fine.
> Since readLine seems to use getc internally, I'm having trouble understanding
> why this is the case.
> 
> A bug or two, or where am I going wrong?

I had a quick look at the std.stream sources, and it seems std.stream isn't really Unicode-aware. getc() assumes the stream is in UTF-8 and returns a char, which means it returns a single UTF-8 code unit, not a full character. getcw(), on the other hand, assumes the stream is in UTF-16 and returns a UTF-16 code unit as a wchar.

You are printing individual UTF-8 code units as if they were characters, which triggers your error.

If D claims to have full Unicode support, std.stream ought to either have decoding routines that return a dchar, or have a UTF-decoding wrapper stream, in which case std.stream.getc() ought to return a ubyte, not a char...

/Oskar
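
[Editor's sketch] The difference between code units and characters described above can be illustrated with a few lines of D. This is a minimal sketch that assumes a current Phobos (std.utf.decode taking the index by reference), not the 2006 library discussed in the thread; the helper names codeUnitCount and codePointCount are made up for the illustration.

```d
import std.utf : decode;

/// Number of UTF-8 code units (bytes) in `s`. This is what a
/// unit-by-unit reader such as getc() steps through.
size_t codeUnitCount(string s)
{
    return s.length;
}

/// Number of code points in `s`, found by decoding it front to back.
/// Throws a UTFException if `s` is not valid UTF-8.
size_t codePointCount(string s)
{
    size_t count = 0;
    size_t index = 0;
    while (index < s.length)
    {
        // decode advances index past one whole code point.
        cast(void) decode(s, index);
        ++count;
    }
    return count;
}
```

For the comment line in the original program, "åäöΔ", the first function reports 8 and the second 4, because each of those four characters occupies two code units in UTF-8.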

June 12, 2006

Re: UTF-8 problems

Posted by Deewiant
in reply to Oskar Linde


Oskar Linde wrote:
> I had a quick look at the std.stream sources, and it seems std.stream isn't really Unicode-aware. getc() assumes the stream is in UTF-8 and returns a char, which means it returns a single UTF-8 code unit, not a full character. getcw(), on the other hand, assumes the stream is in UTF-16 and returns a UTF-16 code unit as a wchar.
> 
> You are printing individual UTF-8 code units as if they were characters, which triggers your error.
> 
> /Oskar

Thanks for the explanation. Unfortunately, I'm not knowledgeable enough in these matters to correct the problem.

So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I combine the former two into a single "char"?

Say I check if the char received from getc() is greater than 127 (outside ASCII)
and if it is, I store it and the following char in two ubytes. Now what? How do
I get a char?

June 12, 2006

Re: UTF-8 problems

Posted by Oskar Linde
in reply to Deewiant


Deewiant wrote:
> Thanks for the explanation. Unfortunately, I'm not knowledgeable enough in these
> matters to correct the problem.
> 
> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I
> combine the former two into a single "char"?
> 
> Say I check if the char received from getc() is greater than 127 (outside ASCII)
> and if it is, I store it and the following char in two ubytes. Now what? How do
> I get a char?

Use

dchar std.utf.decode(char[], inout size_t)

even if it can be quite clumsy. A hint: std.utf.UTF8stride[c] gives the total number of bytes in the sequence that begins with the code unit c.

/Oskar
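
[Editor's sketch] A small illustration of the two building blocks mentioned above, written against a current Phobos rather than the 2006 one. The assumption here is that the UTF8stride lookup table is replaced by the function std.utf.stride; the wrapper names decodeFirst and firstSequenceLength are hypothetical.

```d
import std.utf : decode, stride;

/// Decodes the code point that starts at the beginning of `units`.
/// Throws a UTFException if the sequence is malformed.
dchar decodeFirst(const(char)[] units)
{
    size_t index = 0; // decode takes the index by reference and advances it
    return decode(units, index);
}

/// Length in code units of the sequence that starts at units[0].
/// `units` must not be empty.
uint firstSequenceLength(const(char)[] units)
{
    return stride(units, 0);
}
```

Given the two bytes c3 a4 from the question, firstSequenceLength reports 2 and decodeFirst yields U+00E4.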

June 12, 2006

Re: UTF-8 problems

Posted by Deewiant
in reply to Oskar Linde


Oskar Linde wrote:
> Deewiant wrote:
>> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I combine the former two into a single "char"?
>> 
>> Say I check if the char received from getc() is greater than 127 (outside
>> ASCII) and if it is, I store it and the following char in two ubytes. Now
>> what? How do I get a char?
> 
> dchar std.utf.decode(char[],int)
> 
> even if it can be quite clumsy. A hint is to use:
> 
> std.utf.UTF8stride[c] to get the total number of bytes that are part of the starting token c.
> 
> /Oskar

Thanks, that works. What I did was write a short function looking like this:

dchar myGetchar(Stream s) {
	char c = s.getc;

	// ASCII
	if (c <= 127)
		return c;
	else {
		// UTF-8
		char[] str = new char[2];
		str[0] = c;
		str[1] = s.getc;

		// dummy var, needed by decode
		size_t i = 0;
		return decode(str, i);
	}
}

Using that in place of getc() pretty much does the trick.

Unfortunately, when reading from files instead of stdin, I still run into the problem of \r\n being converted to \r\r\n. I think I know why, too: '\n' is being converted into \r\n because I'm on a Windows platform. I use the following workaround:

if (c == '\r') {
	char d = s.getc;
	if (d == '\n')
		return '\n';
	else {
		s.ungetc(d);
		return c;
	}
}
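
[Editor's sketch] The same folding of "\r\n" into "\n" can also be done on a buffer that has already been read, instead of inside the character reader. The following is only a sketch of that idea; the function name foldCrLf is hypothetical. It works on code units, which is safe because '\r' and '\n' are ASCII and never occur inside a multi-byte UTF-8 sequence.

```d
/// Returns a copy of `text` in which every "\r\n" pair is replaced by "\n".
/// A lone '\r' is kept unchanged, as in the workaround above.
string foldCrLf(string text)
{
    char[] result;
    result.reserve(text.length);
    for (size_t i = 0; i < text.length; ++i)
    {
        // Skip a '\r' that is directly followed by '\n';
        // the '\n' itself is appended on the next iteration.
        if (text[i] == '\r' && i + 1 < text.length && text[i + 1] == '\n')
            continue;
        result ~= text[i];
    }
    return result.idup;
}
```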

June 12, 2006

Re: UTF-8 problems

Posted by Oskar Linde
in reply to Deewiant


Deewiant wrote:
> Oskar Linde wrote:
>> Deewiant wrote:
>>> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I combine the former two into a single "char"?
>>>
>>> Say I check if the char received from getc() is greater than 127 (outside
>>> ASCII) and if it is, I store it and the following char in two ubytes. Now what? How do I get a char?
>> dchar std.utf.decode(char[],int)
>>
>> even if it can be quite clumsy. A hint is to use:
>>
>> std.utf.UTF8stride[c] to get the total number of bytes that are part of the
>> starting token c.
>>
>> /Oskar
> 
> Thanks, that works. What I did was write a short function looking like this:

This only works for a small subset of Unicode...

> dchar myGetchar(Stream s) {
> 	char c = s.getc;
> 
> 	// ASCII
> 	if (c <= 127)
> 		return c;
> 	else {
> 		// UTF-8
> 		char[] str = new char[2];
> 		str[0] = c;
> 		str[1] = s.getc;

For a more general implementation, change the last 3 lines to:

		char[6] str;
		str[0] = c;
		int n = std.utf.UTF8stride[c];
		if (n == 0xff)
			return cast(dchar)-1; // corrupt string
		for (int i = 1; i < n; i++)
			str[i] = s.getc;

> 
> 		// dummy var, needed by decode
> 		size_t i = 0;
> 		return decode(str, i);
> 	}
> }
> 
> Using that in place of getc() pretty much does the trick.
> 
> Unfortunately, when reading from files instead of stdin, I still run into the
> problem of \r\n being converted to \r\r\n. I think I know why, too: '\n' is
> being converted into \r\n because I'm on a Windows platform. I use the following
> workaround:

Yes. This is further evidence that std.stream is lacking functionality. Because of this conversion, it is clear that std.stream isn't a binary stream, and as such it ought to be a UTF-8, UTF-16 or UTF-32 encoded text stream. In that case std.stream should have a function returning a dchar, just like the code above.

/Oskar
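
[Editor's sketch] Putting the pieces of this exchange together, a reader that returns one whole code point per call might look roughly like the following in current D. Several things here are assumptions: UnitSource is a hypothetical stand-in for a stream that hands out one code unit at a time, getDchar is a made-up name, and std.utf.stride is used in place of the UTF8stride table. Unlike the snippet above, malformed input is reported by an exception instead of a sentinel value.

```d
import std.utf : decode, stride;

/// Hypothetical stand-in for a stream: hands out one code unit at a time.
struct UnitSource
{
    const(char)[] data;
    size_t pos;

    bool eof() const { return pos >= data.length; }
    char getc() { return data[pos++]; }
}

/// Reads one whole code point from `src`, which must not be at eof.
/// Throws a UTFException on a malformed sequence and a plain Exception
/// if the input ends in the middle of a sequence.
dchar getDchar(ref UnitSource src)
{
    char[4] buf;
    buf[0] = src.getc();
    // stride looks at the lead unit and reports the sequence length (1 to 4).
    immutable n = stride(buf[], 0);
    foreach (i; 1 .. n)
    {
        if (src.eof)
            throw new Exception("truncated UTF-8 sequence");
        buf[i] = src.getc();
    }
    size_t index = 0;
    return decode(buf[0 .. n], index);
}
```

A caller would loop with while (!src.eof) and call getDchar(src) once per character.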

June 12, 2006

Re: UTF-8 problems

Posted by Carlos Santander
in reply to Deewiant


Deewiant wrote:
> 
> Thanks for the explanation. Unfortunately, I'm not knowledgeable enough in these
> matters to correct the problem.
> 
> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I
> combine the former two into a single "char"?
> 
> Say I check if the char received from getc() is greater than 127 (outside ASCII)
> and if it is, I store it and the following char in two ubytes. Now what? How do
> I get a char?

Keep using readLine. The entire line should be made of valid UTF-8 characters.

One possible fix would be to add getUTF8char, getUTF16char and getUTF32char, which would return char[], wchar[] and dchar, respectively. The first would return an array of 1 to 4 elements, and the second an array of 1 or 2.

-- 
Carlos Santander Bernal
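
[Editor's sketch] The "keep using readLine" advice works because a complete line contains only whole UTF-8 sequences, so it can be decoded in one pass. A minimal illustration, assuming the line has already been read into a char array; the function name codePointsOf is hypothetical.

```d
/// Collects the code points of one line that has already been read.
/// Throws a UTFException if the line is not valid UTF-8.
dchar[] codePointsOf(const(char)[] line)
{
    dchar[] result;
    // A foreach with a dchar loop variable decodes the UTF-8 automatically.
    foreach (dchar c; line)
        result ~= c;
    return result;
}
```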

June 12, 2006

Re: UTF-8 problems

Posted by Deewiant
in reply to Oskar Linde


Oskar Linde wrote:
> Deewiant wrote:
>> Oskar Linde wrote:
>>> Deewiant wrote:
>>>> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I combine the former two into a single "char"?
>>>>
>>>> Say I check if the char received from getc() is greater than 127
>>>> (outside
>>>> ASCII) and if it is, I store it and the following char in two
>>>> ubytes. Now what? How do I get a char?
>>> dchar std.utf.decode(char[],int)
>>>
>>> even if it can be quite clumsy. A hint is to use:
>>>
>>> std.utf.UTF8stride[c] to get the total number of bytes that are part
>>> of the
>>> starting token c.
>>>
>>> /Oskar
>>
>> Thanks, that works. What I did was write a short function looking like this:
> 
> This only works for a small subset of Unicode...

Thanks for correcting it, I was unsure myself.

>> dchar myGetchar(Stream s) {
>>     char c = s.getc;
>>
>>     // ASCII
>>     if (c <= 127)
>>         return c;
>>     else {
>>         // UTF-8
>>         char[] str = new char[2];
>>         str[0] = c;
>>         str[1] = s.getc;
> 
> For a more general implementation, change the last 3 lines to:
> 
>         char[6] str;

6? Aren't 4 UTF-8 units enough for all of Unicode? I see that UTF8stride also has 5 or 6 as some of its elements; why is that?

>                 str[0] = c;
>                 int n = std.utf.UTF8stride[c];
>                 if (n == 0xff)
>                         return cast(dchar)-1; // corrupt string
>                 for (int i = 1; i < n; i++)
>                         str[i] = s.getc;
> 
>>
>>         // dummy var, needed by decode
>>         size_t i = 0;
>>         return decode(str, i);
>>     }
>> }
>>
>> Using that in place of getc() pretty much does the trick.
>>
>> Unfortunately, when reading from files instead of stdin, I still run
>> into the
>> problem of \r\n being converted to \r\r\n. I think I know why, too:
>> '\n' is
>> being converted into \r\n because I'm on a Windows platform. I use the
>> following
>> workaround:
> 
> Yes. This is further evidence that std.stream is lacking functionality. Because of this conversion, it is clear that std.stream isn't a binary stream, and as such it ought to be a UTF-8, UTF-16 or UTF-32 encoded text stream. In that case std.stream should have a function returning a dchar, just like the code above.
> 
> /Oskar

Yes, I agree wholeheartedly. It would appear that the std.stream classes are for textual input, but currently some of the methods choke on UTF-x input.

In addition to a getcd() method to complement getc() and getcw(), a getb() method returning a ubyte might also be handy, for when one really wants byte-by-byte input. Perhaps getc()'s signature should actually be changed to that, since that is all it currently seems to be doing.
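
[Editor's sketch] On the question above about 4 versus 6 code units: the original UTF-8 definition (RFC 2279) allowed sequences of up to 6 bytes, while the current one (RFC 3629) stops at 4 bytes, which is enough for everything up to U+10FFFF. That is the likely reason a table such as UTF8stride contains 5 and 6. The sketch below shows how the length is read off the lead byte. The function name sequenceLength and its allowLegacy parameter are hypothetical, and the function checks only the length prefix, not whether the lead byte is otherwise acceptable (0xC0 and 0xC1, for example, can only start overlong forms).

```d
/// Returns the sequence length announced by a UTF-8 lead byte, or 0 if
/// `lead` cannot start a sequence (a continuation byte, 0xFE or 0xFF).
/// Lengths 5 and 6 are reported only when `allowLegacy` is true.
uint sequenceLength(ubyte lead, bool allowLegacy = false)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    if (allowLegacy)
    {
        if ((lead & 0xFC) == 0xF8) return 5; // 111110xx
        if ((lead & 0xFE) == 0xFC) return 6; // 1111110x
    }
    return 0;
}
```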

June 12, 2006

Re: UTF-8 problems

Posted by Deewiant
in reply to Carlos Santander


Carlos Santander wrote:
> Deewiant wrote:
>> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I combine the former two into a single "char"?
>> 
>> Say I check if the char received from getc() is greater than 127 (outside
>> ASCII) and if it is, I store it and the following char in two ubytes. Now
>> what? How do I get a char?
> 
> Keep using readLine. The entire line should be made of valid UTF-8 characters.

That would work, but I was originally using only getc() so it's easier for me to replace that than to change half of my input paradigm. <g>

> One possible fix would be to add getUTF8char, getUTF16char and getUTF32char, which would return char[], wchar[] and dchar, respectively. The first would return an array of 1 to 4 elements, and the second an array of 1 or 2.
> 

Something like that would indeed be handy. It's too bad std.stream is lacking in some respects, such as this.


Copyright © 1999-2021 by the D Language Foundation