Error: invalid UTF-8 sequence - D Programming Language Discussion Forum



November 28, 2004

Error: invalid UTF-8 sequence

Posted by Carotinho

Hi all!
I'm new here and to D. I wrote a simple program:

import std.stdio;
import std.stream;

int main() {
  char[] stringa;
  stringa = std.stream.stdin.readLine();
  writefln("%s",stringa);
  return 0;
}

If I type normal characters, like a, b, c, etc., everything is OK.
But when I try to type special characters like è, ò, ù, I get
  Error: invalid UTF-8 sequence
when the program tries to write back the string it read.
What is this?

Thanks in advance!

Carotinho

November 28, 2004

Re: Error: invalid UTF-8 sequence

Posted by Anders F Björklund
in reply to Carotinho

Carotinho wrote:

> If I type normal characters, like a, b, c, etc., everything is OK. But when I try to type special characters like è, ò, ù, I get
>   Error: invalid UTF-8 sequence
> when the program tries to write back the string it read.
> What is this?

D only works with Unicode: a char[] is expected to hold valid UTF-8.
You need to set your shell to UTF-8.
(Or, the tricky version: cast the input to ubyte[] and convert it to UTF-8 yourself.)
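
A minimal sketch of the "cast and convert" route in current D, assuming the terminal sends ISO-8859-1 (Latin-1), where every byte value is also the Unicode code point of the same number. The name latin1ToUtf8 is made up for this illustration, and other legacy encodings (CP-437, CP-1252, MacRoman) would need a real mapping table instead:

```d
import std.stdio : writeln;

/// Convert ISO-8859-1 bytes to a UTF-8 string.
/// Assumption: the input really is Latin-1, so each byte value
/// is also the Unicode code point of the character it stands for.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
        result ~= cast(dchar) b; // appending a dchar to char[] encodes it as UTF-8
    return result.idup;
}

void main()
{
    // "città" as a Latin-1 terminal would send it: the final 0xE0 is 'à'
    // and is not valid UTF-8 on its own.
    ubyte[] raw = [0x63, 0x69, 0x74, 0x74, 0xE0];
    writeln(latin1ToUtf8(raw));
}
```

Bytes below 0x80 pass through unchanged, and each byte from 0x80 upward becomes a two-byte UTF-8 sequence.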

--anders

November 29, 2004

Re: Error: invalid UTF-8 sequence

Posted by Simon Buchan
in reply to Anders F Björklund

On Mon, 29 Nov 2004 00:39:29 +0100, Anders F Björklund <afb@algonet.se> wrote:

> Carotinho wrote:
>
>> If I type normal characters, like a, b, c, etc., everything is OK. But when I try to type special characters like è, ò, ù, I get
>>   Error: invalid UTF-8 sequence
>> when the program tries to write back the string it read.
>> What is this?
>
> D only works with Unicode. You need to set your shell to UTF-8.
> (Or, the tricky version, you can cast(ubyte[]) and convert it ?)
>
> --anders

I don't think cast works. Unfortunately, the Windows shell can't use
UTF. This discussion was referenced somewhere else (maybe
digitalmars.D.bugs?)

I have a project that tries to write a file with funky punctuation to
the screen... the closest I got was to use read/writeString exclusively
which gives you rubbish for special characters.

There was something mentioned about a Win32 API that converted UTF to
codepages and vice-versa... sounded promising, but I don't know if
it is currently available to D. Look around, you may get lucky.
(and if you do, tell the rest of us :D)

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/

November 29, 2004

Re: Error: invalid UTF-8 sequence

Posted by Anders F Björklund
in reply to Simon Buchan

Simon Buchan wrote:

>> (Or, the tricky version, you can cast(ubyte[]) and convert it ?)
> 
> I don't think cast works. Unfortunately, the Windows shell can't use
> UTF. This discussion was referenced somewhere else (maybe
> digitalmars.D.bugs?)

It does. The problem is that D just assumes that the shell is UTF-8,
and feeds you char[] that are *invalid* (as they are native-encoded).
If you translate them yourself, I've found it to work just fine...

I don't have a DOS console (@echo off allergies), but it does work
with a zsh console set to the ISO-8859-1 encoding (instead of UTF-8).
Of course, if the console *is* Unicode, then this doesn't work...
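
To see why the native-encoded bytes count as invalid, here is a small sketch in current D that checks a buffer before treating it as text. It relies on std.utf.validate from Phobos; the helper name isValidUtf8 is only an illustration:

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Returns true if s holds well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on a malformed sequence
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // 'è' as a Latin-1 terminal sends it: the single byte 0xE8.
    ubyte[] nativeBytes = [0xE8];
    // Reinterpret the bytes as char[] without converting them.
    auto asText = cast(const(char)[]) nativeBytes;

    writeln(isValidUtf8(asText)); // the lone 0xE8 is a truncated sequence
    writeln(isValidUtf8("è"));    // the literal is stored as UTF-8 (0xC3 0xA8)
}
```

A check like this lets a program fall back to a legacy-encoding conversion when the input turns out not to be UTF-8.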

Anyway, my test code looked like:
> void main(char[][] args)
> {
> 	wchar[256] mapping = iso88591.mapping;
> 
> 	char[] test = cast(char[]) decode_string(cast(ubyte[]) args[1], mapping);
> 	writefln("%s",test);
> 	
> 	static ubyte[1] z = [ 0 ];
> 	printf("%s\n", cast(char*) (encode_string(test, mapping) ~ z) );
> }

Usually when you call old C functions, you want ubyte[] and not char[]
since they don't handle UTF-8? The D tradition is to pretend that they
have the D definition (char *) anyway, since "it is the same bit size".

I use ubyte[] for legacy 8-bit encodings, and char[] for Unicode only.

> There was something mentioned about a Win32 API that converted UTF to
> codepages and vice-versa... sounded promising, but I don't know if
> it is currently available to D. Look around, you may get lucky.
> (and if you do, tell the rest of us :D)

There is a Win32-only API, and some open source libraries (iconv, ICU):
http://msdn.microsoft.com/library/en-us/intl/unicode_19mb.asp
http://www.gnu.org/software/libiconv/
http://oss.software.ibm.com/icu/

I might share my own little hack later on too, when I've packaged it up.
(it just does the 4 main mappings, not the other 200* that the above do,
ISO-8859-1 [Latin-1], CP-437 [DOS], CP-1252 [Win], MacRoman [Mac OS 9] )

It's a lot smaller than the real McCoy, and will be under the zlib license.
http://www.opensource.org/licenses/zlib-license.php (my usual license)
If you need the full functionality, look at Mango/ICU or iconv instead?

--anders

PS. I'm not kidding, it really has hundreds (!) of different encodings:
    http://oss.software.ibm.com/icu/charset/

November 29, 2004

Re: Error: invalid UTF-8 sequence

Posted by Ben Hinkle
in reply to Simon Buchan

[snip]

> There was something mentioned about a Win32 API that converted UTF to
> codepages and vice-versa... sounded promising, but I don't know if
> it is currently available to D. Look around, you may get lucky.
> (and if you do, tell the rest of us :D)

The following solution doesn't handle errors well due to some errno
confusion I'm trying to figure out, but it is a start. Here's what you can
do. Get iconv.dll from the zip file at
http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download
and put it in the same directory as your executable. The attached libiconv.d
will load the three functions you need. The attached iconv_example.d shows
how to call iconv to convert utf-8 to utf-16 little endian.
I'm looking into the errno issues and will probably have to recompile
libiconv with DMC or something. But for typical usage the above instructions
should work. Also I'd like to put a small wrapper around the low-level API
to make it easier to use for the simple cases when the input is complete.

-Ben

November 29, 2004

Re: Error: invalid UTF-8 sequence

Posted by Carotinho
in reply to Ben Hinkle

Thank you all, I'll start experimenting!
For information, I'm running Linux, and even here I'm quite a newbie :)

Byez!

November 30, 2004

Re: Error: invalid UTF-8 sequence

Posted by Ben Hinkle
in reply to Carotinho

"Carotinho" <carotinobg@yahoo.it> wrote in message news:cog796$1rjr$1@digitaldaemon.com...
> Thank you all, I'll start experimenting!
> For information, I'm running Linux, and even here I'm quite a newbie :)
>
> Byez!

Oh, even better. You don't need the DLL then: just get the .d file that declares the iconv functions and you're all set (well, except for figuring out the API and getting the right encodings).
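
For readers without the attached files, a rough sketch of what calling iconv from D can look like on Linux, written in current D syntax. It assumes glibc, where iconv_open, iconv and iconv_close live in libc and need no extra linker flag. The prototypes are declared by hand here, and the wrapper name toUtf8 is invented for this example, not taken from the attached libiconv.d:

```d
import std.stdio : writeln;
import std.string : toStringz;

// Hand-written prototypes for the three C functions (assumed to be in libc).
alias iconv_t = void*;
extern (C) nothrow @nogc
{
    iconv_t iconv_open(const(char)* tocode, const(char)* fromcode);
    size_t iconv(iconv_t cd, char** inbuf, size_t* inbytesleft,
                 char** outbuf, size_t* outbytesleft);
    int iconv_close(iconv_t cd);
}

/// Convert bytes in the named charset to UTF-8 (input must be complete).
string toUtf8(const(ubyte)[] input, string fromCharset)
{
    iconv_t cd = iconv_open("UTF-8", fromCharset.toStringz);
    if (cd == cast(iconv_t) size_t.max) // iconv_open returns (iconv_t) -1 on failure
        throw new Exception("iconv_open failed for " ~ fromCharset);
    scope (exit) iconv_close(cd);

    // Generous bound: up to four output bytes per input byte.
    auto output = new char[input.length * 4 + 4];
    char* inPtr = cast(char*) input.ptr;
    size_t inLeft = input.length;
    char* outPtr = output.ptr;
    size_t outLeft = output.length;

    if (iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft) == size_t.max)
        throw new Exception("iconv conversion failed");

    // iconv decrements outLeft by the number of bytes it wrote.
    return output[0 .. output.length - outLeft].idup;
}

void main()
{
    ubyte[] raw = [0xE8, 0xF2, 0xF9]; // "èòù" in ISO-8859-1
    writeln(toUtf8(raw, "ISO-8859-1"));
}
```

The charset names accepted by iconv_open depend on the installed iconv; `iconv -l` on the command line lists them.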

November 30, 2004

Re: Error: invalid UTF-8 sequence

Posted by Simon Buchan
in reply to Ben Hinkle

On Mon, 29 Nov 2004 14:14:33 -0500, Ben Hinkle <bhinkle@mathworks.com> wrote:

> [snip]
>
>> There was something mentioned about a Win32 API that converted UTF to
>> codepages and vice-versa... sounded promising, but I don't know if
>> it is currently available to D. Look around, you may get lucky.
>> (and if you do, tell the rest of us :D)
>
> The following solution doesn't handle errors well due to some errno
> confusion I'm trying to figure out, but it is a start. Here's what you can
> do. Get iconv.dll from the zip file at
> http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download
> and put it in the same directory as your executable. The attached libiconv.d
> will load the three functions you need. The attached iconv_example.d shows
> how to call iconv to convert utf-8 to utf-16 little endian.
> I'm looking into the errno issues and will probably have to recompile
> libiconv with DMC or something. But for typical usage the above instructions
> should work. Also I'd like to put a small wrapper around the low-level API
> to make it easier to use for the simple cases when the input is complete.
>
> -Ben
>

This doesn't let you convert UTF-8 into an OEM codepage, though, does it?
Linux users should be fine if they set their console to a UTF encoding, but
poor Windows users are stuck with weird codepages. (I do have the UTF
codepages installed, they have to be, but I don't know how you can tell the
console to use them.)
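
One possible way to ask the console for UTF-8, sketched in current D: the Win32 functions SetConsoleOutputCP and SetConsoleCP take a code page number, and 65001 is the UTF-8 code page (the command-line equivalent is `chcp 65001`). The helper name enableUtf8Console is made up here, and whether accented characters then display correctly also depends on the Windows version and the console font, so treat this as something to try rather than a guaranteed fix:

```d
import std.stdio : writeln;

/// Try to switch the Windows console to the UTF-8 code page (65001).
/// Returns true on success. On other systems there is nothing to
/// switch, so the function does nothing and returns false.
bool enableUtf8Console()
{
    version (Windows)
    {
        import core.sys.windows.wincon : SetConsoleCP, SetConsoleOutputCP;

        enum uint utf8CodePage = 65001; // CP_UTF8
        return SetConsoleOutputCP(utf8CodePage) != 0
            && SetConsoleCP(utf8CodePage) != 0;
    }
    else
    {
        return false;
    }
}

void main()
{
    enableUtf8Console();
    writeln("èòù"); // the string literal itself is UTF-8
}
```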

-- 
"Unhappy Microsoft customers have a funny way of becoming Linux,
Salesforce.com and Oracle customers." - www.microsoft-watch.com:
"The Year in Review: Microsoft Opens Up"

November 30, 2004

Re: Error: invalid UTF-8 sequence (libiconv)

Posted by Anders F Björklund
in reply to Ben Hinkle

Ben Hinkle wrote:

> The following solution doesn't handle errors well due to some errno
> confusion I'm trying to figure out, but it is a start. Here's what you can
> do. Get iconv.dll from the zip file at
> http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download
> and put it in the same directory as your executable. The attached libiconv.d
> will load the three functions you need. The attached iconv_example.d shows
> how to call iconv to convert utf-8 to utf-16 little endian.
> I'm looking into the errno issues and will probably have to recompile
> libiconv with DMC or something. But for typical usage the above instructions
> should work. Also I'd like to put a small wrapper around the low-level API
> to make it easier to use for the simple cases when the input is complete.

This code doesn't work everywhere... (POSIX?)
At least not without some more modifications.

>   // on POSIX systems iconv is built into libc so loading is automatic

It doesn't work on Mac OS X, unfortunately.

# gdc -o iconv_example iconv_example.d libiconv.d -liconv 

> /usr/bin/ld: Undefined symbols:
> _iconv
> _iconv_close
> _iconv_open
> collect2: ld returned 1 exit status

(It's being loaded from System's  /usr/lib/libiconv.dylib)

in /usr/include/iconv.h:
> #define iconv_t libiconv_t
> #ifndef LIBICONV_PLUG
> #define iconv_open libiconv_open
> #define iconv libiconv
> #define iconv_close libiconv_close
> #endif

Annoying, isn't it ? So one needs to declare
the C functions with the "lib" prefix, and
then do wrappers in D for the usual names...

> } else version (darwin) { 
> 
>   // On Mac OS X, link with -liconv (/usr/lib/libiconv.dylib)
>   typedef void *libiconv_t;
> 
>   // allocate a converter between charsets fromcode and tocode
>   extern (C) libiconv_t libiconv_open (char *tocode, char *fromcode);
>   iconv_t iconv_open (char *tocode, char *fromcode)
>   { return cast(iconv_t) libiconv_open(tocode, fromcode); }
> 
>   // convert inbuf to outbuf and set inbytesleft to unused input and
>   // outbuf to unused output and return number of non-reversable
>   // conversions or -1 on error.
>   extern (C) size_t libiconv (libiconv_t cd, void **inbuf,
> 			   size_t *inbytesleft,
> 			   void **outbuf,
> 			   size_t *outbytesleft);
>   size_t iconv (iconv_t cd, void **inbuf, size_t *inbytesleft,
> 			   void **outbuf, size_t *outbytesleft)
>   { return libiconv(cast(libiconv_t) cd, inbuf, inbytesleft, outbuf, outbytesleft); }
> 
>   // close converter
>   extern (C) int libiconv_close (libiconv_t cd);
>   int iconv_close (iconv_t cd)
>   { return libiconv_close(cast(libiconv_t) cd); }
> 
> } else { 


And the test code assumed that everything is X86:

> version (LittleEndian)
>   // convert from utf-8 to utf-16 little endian
>   iconv_t cd = iconv_open("UTF-16LE","UTF-8");
> else version (BigEndian)
>   // convert from utf-8 to utf-16 big endian
>   iconv_t cd = iconv_open("UTF-16BE","UTF-8");

That's actually one of the biggest drawbacks of UTF-16...
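
To make the byte-order point concrete, a small sketch in current D that serializes the same UTF-16 code units both ways. The helper utf16Bytes is hypothetical and only splits each 16-bit unit by hand; it is not part of the iconv example:

```d
import std.stdio : writefln;

/// Serialize UTF-16 code units to bytes in the requested byte order.
ubyte[] utf16Bytes(const(wchar)[] text, bool littleEndian)
{
    ubyte[] bytes;
    foreach (wchar unit; text)
    {
        ubyte lo = cast(ubyte)(unit & 0xFF);
        ubyte hi = cast(ubyte)(unit >> 8);
        if (littleEndian)
        {
            bytes ~= lo;
            bytes ~= hi;
        }
        else
        {
            bytes ~= hi;
            bytes ~= lo;
        }
    }
    return bytes;
}

void main()
{
    // 'è' is U+00E8, a single UTF-16 code unit.
    writefln("UTF-16LE: %(%02X %)", utf16Bytes("è"w, true));
    writefln("UTF-16BE: %(%02X %)", utf16Bytes("è"w, false));
}
```

The same text yields different byte sequences, which is why the converter name has to match the byte order the rest of the program expects.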


Besides those little flaws, the code works just fine :-)
--anders

November 30, 2004

Re: Error: invalid UTF-8 sequence (libiconv)

Posted by Ben Hinkle
in reply to Anders F Björklund

"Anders F Björklund" <afb@algonet.se> wrote in message news:coi25c$1ijr$1@digitaldaemon.com...
> Ben Hinkle wrote:
>
> > The following solution doesn't handle errors well due to some errno
> > confusion I'm trying to figure out, but it is a start. Here's what you can
> > do. Get iconv.dll from the zip file at
> > http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download
> > and put it in the same directory as your executable. The attached libiconv.d
> > will load the three functions you need. The attached iconv_example.d shows
> > how to call iconv to convert utf-8 to utf-16 little endian.
> > I'm looking into the errno issues and will probably have to recompile
> > libiconv with DMC or something. But for typical usage the above instructions
> > should work. Also I'd like to put a small wrapper around the low-level API
> > to make it easier to use for the simple cases when the input is complete.
>
> This code doesn't work everywhere... (POSIX?)
> At least not without some more modifications.
>
> >   // on POSIX systems iconv is built into libc so loading is automatic
>
> It doesn't work on Mac OS X, unfortunately.
>
> # gdc -o iconv_example iconv_example.d libiconv.d -liconv
>
> > /usr/bin/ld: Undefined symbols:
> > _iconv
> > _iconv_close
> > _iconv_open
> > collect2: ld returned 1 exit status
>
> (It's being loaded from System's  /usr/lib/libiconv.dylib)
>
> in /usr/include/iconv.h:
> > #define iconv_t libiconv_t
> > #ifndef LIBICONV_PLUG
> > #define iconv_open libiconv_open
> > #define iconv libiconv
> > #define iconv_close libiconv_close
> > #endif
>
> Annoying, isn't it ? So one needs to declare
> the C functions with the "lib" prefix, and
> then do wrappers in D for the usual names...

That is a bummer. Love those #defines!

> > } else version (darwin) {
> >
> >   // On Mac OS X, link with -liconv (/usr/lib/libiconv.dylib)
> >   typedef void *libiconv_t;
> >
> >   // allocate a converter between charsets fromcode and tocode
> >   extern (C) libiconv_t libiconv_open (char *tocode, char *fromcode);
> >   iconv_t iconv_open (char *tocode, char *fromcode)
> >   { return cast(iconv_t) libiconv_open(tocode, fromcode); }
> >
> >   // convert inbuf to outbuf and set inbytesleft to unused input and
> >   // outbuf to unused output and return number of non-reversable
> >   // conversions or -1 on error.
> >   extern (C) size_t libiconv (libiconv_t cd, void **inbuf,
> >    size_t *inbytesleft,
> >    void **outbuf,
> >    size_t *outbytesleft);
> >   size_t iconv (iconv_t cd, void **inbuf, size_t *inbytesleft,
> >    void **outbuf, size_t *outbytesleft)
> >   { return libiconv(cast(libiconv_t) cd, inbuf, inbytesleft, outbuf, outbytesleft); }
> >
> >   // close converter
> >   extern (C) int libiconv_close (libiconv_t cd);
> >   int iconv_close (iconv_t cd)
> >   { return libiconv_close(cast(libiconv_t) cd); }
> >
> > } else {

Maybe I'll try using std.loader for this case, too, and have iconv be a function pointer. Hmm...

>
> And the test code assumed that everything is X86:
>
> > version (LittleEndian)
> >   // convert from utf-8 to utf-16 little endian
> >   iconv_t cd = iconv_open("UTF-16LE","UTF-8");
> > else version (BigEndian)
> >   // convert from utf-8 to utf-16 big endian
> >   iconv_t cd = iconv_open("UTF-16BE","UTF-8");
>
> That's actually one of the biggest drawbacks of UTF-16...

That's true. I was being lazy with the example. When I tried just plain-old "UTF-16" I think it used big-endian.

>
> Besides those little flaws, the code works just fine :-) --anders

Thanks for the update. I obviously hadn't tried on the Mac.

Copyright © 1999-2021 by the D Language Foundation