Only support for UTF-8? - D Programming Language Discussion Forum

August 13, 2004

Only support for UTF-8?

Posted by Nick

The D text formatting system seems to only support Unicode, where most "international" characters must be coded in two or more bytes. However, many systems are not set up with UTF-8 or similar by default. For example, I'm currently on a linux system set up with (non-Unicode) ISO-8859-Latin1 charset, and the following fails:

# import std.stdio;
# import std.stream;
#
# alias std.stream.stdin stdin;
# alias std.stream.stdout stdout;
#
# void main()
# {
#   // Type some characters with byte values above 127, e.g. æøå
#   char[] test = stdin.readLine();
#
#   // This works like wanted
#   stdout.writeLine("stream: " ~ test);
#
#   // "Error: invalid UTF-8 sequence"
#   writefln("writefln: ", test);
# }

The streams only work because they are made for raw binary data. Also format(...) fails. Most text files anywhere will not be Unicode, so this is a BIG problem. Is there a plan to fix this? Or is it supposed to work already?
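
A minimal sketch of what goes wrong, assuming present-day D and Phobos rather than the 2004 libraries: `std.utf.validate` rejects the Latin-1 bytes for "æøå" because bytes above 0x7F only form valid UTF-8 as part of multi-byte sequences. The helper name `isValidUtf8` is hypothetical.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `raw` is well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] raw)
{
    try
    {
        // Reinterpret the bytes as UTF-8 code units and let std.utf check them.
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "æøå" as ISO-8859-1: one byte per character, all above 0x7F.
    immutable(ubyte)[] latin1 = [0xE6, 0xF8, 0xE5];
    // The same text as UTF-8: two bytes per character.
    immutable(ubyte)[] utf8 = [0xC3, 0xA6, 0xC3, 0xB8, 0xC3, 0xA5];

    writeln("Latin-1 bytes valid as UTF-8? ", isValidUtf8(latin1));
    writeln("UTF-8 bytes valid as UTF-8?   ", isValidUtf8(utf8));
}
```

The Latin-1 input should be reported as invalid and the UTF-8 input as valid, which is consistent with the "invalid UTF-8 sequence" error from writefln.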

Also, the compiler itself won't accept my poor Scandinavian characters in string literals...

Nick

August 13, 2004

Re: Only support for UTF-8?

Posted by J C Calvarese
in reply to Nick

Nick wrote:
> The D text formatting system seems to only support Unicode, where most
> "international" characters must be coded in two or more bytes. However, many
> systems are not set up with UTF-8 or similar by default. For example, I'm
> currently on a linux system set up with (non-Unicode) ISO-8859-Latin1 charset,
> and the following fails:
> 
> # import std.stdio;
> # import std.stream;
> #
> # alias std.stream.stdin stdin;
> # alias std.stream.stdout stdout;
> #
> # void main()
> # {
> #   // Type some characters with byte values above 127, e.g. æøå
> #   char[] test = stdin.readLine();
> #
> #   // This works like wanted
> #   stdout.writeLine("stream: " ~ test);
> #
> #   // "Error: invalid UTF-8 sequence"
> #   writefln("writefln: ", test);
> # }
> 
> The streams only work because they are made for raw binary data. Also
> format(...) fails. Most text files anywhere will not be Unicode, so this is a
> BIG problem. Is there a plan to fix this? Or is it supposed to work already?

I think you'll either need to program this yourself or wait for someone
to write a library that does this. I can think of someone who might
write such a library, but I'm not going to name names. (If you can't
guess who I'm thinking of, you need to read some of the recent posts
about Unicode in the main newsgroup.)

I've heard the reason why D is Unicode-based (when it's not ASCII-based)
is because there are so many different charsets out there that it'd be
hard to cover every one of them.
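
For ISO-8859-1 specifically, "programming it yourself" can be very small, because each Latin-1 byte value is numerically equal to its Unicode code point. The following is only a sketch under that assumption, written for present-day D; the function name `latin1ToUtf8` is hypothetical, and other 8-bit charsets would need a lookup table instead of a plain cast.

```d
import std.stdio : writefln;

/// Hypothetical helper: decode ISO-8859-1 bytes into a UTF-8 string.
/// Every Latin-1 byte value equals its Unicode code point, so no table is needed.
string latin1ToUtf8(const(ubyte)[] raw)
{
    string result;
    foreach (b; raw)
        result ~= cast(dchar) b; // appending a dchar encodes it as UTF-8
    return result;
}

void main()
{
    // "æøå" as it would arrive from a Latin-1 terminal or file.
    immutable(ubyte)[] input = [0xE6, 0xF8, 0xE5];
    writefln("writefln: %s", latin1ToUtf8(input));
}
```

Converting once at the input boundary keeps everything downstream (writefln, format, string functions) working on valid UTF-8.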

> 
> Also, the compiler itself won't accept my poor Scandinavian characters in string
> literals...

Look at http://www.digitalmars.com/d/lex.html

D source text can be in one of the following formats:
  * ASCII
  * UTF-8
  * UTF-16BE
  * UTF-16LE
  * UTF-32BE
  * UTF-32LE

UTF-8 is a superset of traditional 7-bit ASCII. One of the following UTF
BOMs (Byte Order Marks) can be present at the beginning of the source
text ...
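
As an illustration of how a tool might tell these formats apart, here is a rough sketch that inspects the first bytes of a file for a BOM. It uses the standard Unicode BOM byte values; the helper names `hasPrefix` and `detectBom` are hypothetical, and this is not a description of how the D compiler itself does it.

```d
import std.stdio : writeln;

/// True if `data` begins with the bytes in `prefix`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Hypothetical helper: name the encoding implied by a leading BOM,
/// or return null when no BOM is present (plain ASCII or BOM-less UTF-8).
string detectBom(const(ubyte)[] head)
{
    // Test the 4-byte UTF-32 marks first: the UTF-32LE mark starts with
    // the same two bytes (FF FE) as the UTF-16LE mark.
    if (hasPrefix(head, [0x00, 0x00, 0xFE, 0xFF])) return "UTF-32BE";
    if (hasPrefix(head, [0xFF, 0xFE, 0x00, 0x00])) return "UTF-32LE";
    if (hasPrefix(head, [0xEF, 0xBB, 0xBF])) return "UTF-8";
    if (hasPrefix(head, [0xFE, 0xFF])) return "UTF-16BE";
    if (hasPrefix(head, [0xFF, 0xFE])) return "UTF-16LE";
    return null;
}

void main()
{
    // A UTF-8 BOM followed by the letter 'a'.
    writeln(detectBom([0xEF, 0xBB, 0xBF, 0x61]));
}
```

A file saved as ISO-8859-1 carries no BOM, so a check like this cannot identify it; that is part of why a Latin-1 source file ends up being read as (invalid) UTF-8.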

If you're saving files as ISO-8859-Latin1, I'd wager that you're out of luck.

Even if UTF-8/UTF-16BE/UTF-16LE/UTF-32BE/UTF-32LE isn't the default, it
is an option, right?

If you want to see why Unicode is so popular, you might want to review
some of the Unicode threads:
http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues

>
> Nick
>
>

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/

August 14, 2004

Re: Only support for UTF-8?

Posted by Sean Kelly
in reply to Nick

In article <cfjgul$1a79$1@digitaldaemon.com>, Nick says...
>
>The streams only work because they are made for raw binary data. Also format(...) fails. Most text files anywhere will not be Unicode, so this is a BIG problem. Is there a plan to fix this? Or is it supposed to work already?

I've already got a fix for this (as far as it's been tested anyway).  readf is meant to be a complement to writef and is available as a part of unformat:

http://home.f4.ca/sean/d/unformat.d http://home.f4.ca/sean/d/utf.d

utf.d is a drop-in replacement for std.utf.  The readf calls pretty much follow the spec for scanf.  You may also have to compile std.format into your application as the compiler seems to get hung up on multiply defined stdarg symbols otherwise.

If you've got any comments please feel free.  There are some differences between readf and writef that I'm not sure if I should fix (it doesn't throw an exception on an argument mismatch, for example).


Sean

August 14, 2004

Re: Only support for UTF-8?

Posted by Nick
in reply to Sean Kelly

In article <cfld0n$2ga6$1@digitaldaemon.com>, Sean Kelly says...
>
>I've already got a fix for this (as far as it's been tested anyway).  readf is meant to be a complement to writef and is available as a part of unformat:
>
>http://home.f4.ca/sean/d/unformat.d http://home.f4.ca/sean/d/utf.d
>
>utf.d is a drop-in replacement for std.utf.  The readf calls pretty much follow the spec for scanf.  You may also have to compile std.format into your application as the compiler seems to get hung up on multiply defined stdarg symbols otherwise.

Thank you, I'll make sure to play around with them later.

Nick

August 15, 2004

Re: Only support for UTF-8?

Posted by Arcane Jill
in reply to Nick

In article <cfjgul$1a79$1@digitaldaemon.com>, Nick says...
>
>The D text formatting system seems to only support Unicode,

When I read that D "only" supports Unicode, I had to smile, because of course, many other systems only support a subset of Unicode, whereas D supports all of it. But I do know what you mean - your problem in fact is that D only supports UTF character encodings.


>where most
>"international" characters must be coded in two or more bytes.

Right, so your problem is the encoding, not the character set.


>However, many
>systems are not set up with UTF-8 or similar by default. For example, I'm
>currently on a linux system set up with (non-Unicode) ISO-8859-Latin1 charset,

I assume you mean ISO-8859-1, colloquially known as "Latin 1". Incidentally, these days, ISO-8859-1 is considered to be an encoding, not a character set (though the word "charset" continues to exist in HTML and other suchlike, for historical reasons).


>and the following fails:
><snip>
>Is there a plan to fix this? Or is it supposed to work already?

There is a plan to fix this (I think), but I'm not clear on the details. Sean is doing some stream stuff, and Hauke, last I heard, was doing some string stuff involving the handling of transcoding issues (and conversion to/from ISO-8859-1 is certainly a transcoding issue). We could probably do with an update on who's doing what.



>Also, the compiler itself won't accept my poor Scandinavian characters in string literals...

Well, /this/ one, at least, we can solve for you, right now. Re-save your source file in UTF-8, and then recompile it. Then your characters will be accepted.

Arcane Jill

August 15, 2004

Re: Only support for UTF-8?

Posted by Nick
in reply to Arcane Jill

In article <cfnuq3$nq5$1@digitaldaemon.com>, Arcane Jill says...
>When I read that D "only" supports Unicode, I had to smile, because of course, many other systems only support a subset of Unicode, whereas D supports all of it. But I do know what you mean - your problem in fact is that D only supports UTF character encodings.

That is correct. I've read up a bit on Unicode the last couple of days, and I'm starting to appreciate the fact that D supports it fully and natively. Seeing how many character encodings there are in existence and all the problems related, it's easy to see that ubiquitous support for Unicode would be a blessing, and I think D is now a part of setting that standard. But D also needs some support for other encodings, since these are still used everywhere. More on this below.

>There is a plan to fix this (I think), but I'm not clear on the details. Sean is doing some stream stuff, and Hauke, last I heard, was doing some string stuff involving the handling of transcoding issues (and conversion to/from ISO-8859-1 is certainly a transcoding issue). We could probably do with an update on who's doing what.

I will add to that and say I'm currently writing some simple wrappers for the C iconv functions, hopefully to be posted soon (hence the errno posts in the other NG.) These support a large set of encodings but are native to Unix, I think. I'm not sure what to do for Win32.
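
To give an idea of the shape such a wrapper could take, here is a rough sketch in present-day D. It is not the wrapper described in this post: the function name `toUtf8` is hypothetical, the `extern (C)` declarations are hand-written bindings to the POSIX `iconv_open`/`iconv`/`iconv_close` functions, and it assumes a C library that exports them under those names (as glibc on Linux does).

```d
import std.stdio : writeln;
import std.string : toStringz;

// Hand-written bindings to the POSIX iconv API. Assumption: the C library
// exports these symbols under these exact names (true for glibc on Linux).
extern (C) nothrow @nogc
{
    alias iconv_t = void*;
    iconv_t iconv_open(const(char)* tocode, const(char)* fromcode);
    size_t iconv(iconv_t cd, char** inbuf, size_t* inbytesleft,
                 char** outbuf, size_t* outbytesleft);
    int iconv_close(iconv_t cd);
}

/// Hypothetical wrapper: convert `input` from the encoding `fromCode` to UTF-8.
string toUtf8(const(ubyte)[] input, string fromCode)
{
    iconv_t cd = iconv_open("UTF-8", fromCode.toStringz);
    if (cd == cast(iconv_t) -1)
        throw new Exception("iconv_open failed for " ~ fromCode);
    scope (exit) iconv_close(cd);

    // Assumed worst case: every input byte grows to four UTF-8 bytes.
    auto output = new char[](input.length * 4 + 4);
    char* src = cast(char*) input.ptr; // iconv's signature wants a mutable pointer
    size_t srcLeft = input.length;
    char* dst = output.ptr;
    size_t dstLeft = output.length;

    if (iconv(cd, &src, &srcLeft, &dst, &dstLeft) == cast(size_t) -1)
        throw new Exception("iconv conversion failed");

    // Keep only the bytes iconv actually wrote.
    return output[0 .. output.length - dstLeft].idup;
}

void main()
{
    immutable(ubyte)[] latin1 = [0xE6, 0xF8, 0xE5]; // "æøå" in ISO-8859-1
    writeln(toUtf8(latin1, "ISO-8859-1"));
}
```

Converting in one call with a generously sized buffer keeps the sketch short; a fuller wrapper would loop on a fixed buffer and inspect errno to distinguish a full buffer from invalid input.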

Of course, it would also be nice to have some way to autodetect locale settings and convert all stdin/stdout traffic automatically, but I really have other things to do :-)

>Well, /this/ one, at least, we can solve for you, right now. Re-save your source file in UTF-8, and then recompile it. Then your characters will be accepted.

Yes, I figured that out. Emacs cooperated after I threatened to kick it in the pants, as is the usual practice.

I also recently found out how ridiculously easy it was to switch to UTF-8 in Linux (which involved setting an entire environment variable!) Legacy apps are still a problem, though.

Nick
