Error: 4invalid UTF-8 sequence - D Programming Language Discussion Forum

March 01, 2005

Error: 4invalid UTF-8 sequence

Posted by jicman

Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8 sequence."

Let's say that I am working with data that contains names with accented characters from all over the world, and they are giving me problems, e.g.:

...
...
0 forms took 0.397589 sec || Avg forms/sec = 5.34942
----------------------------------------------------------------
--  725
Counting forms for yrajau (Rajau, Yannis)               --
Application :       Qty   Deleted      Left     Total
Distribute :         1         0         1       840
Total Forms :         1         0         0      2461
1 forms took 0.327413 sec || Avg forms/sec = 5.34778
----------------------------------------------------------------
--  726
Counting forms for CGiunta (Giunta, Cosmo A)            --
Application :       Qty   Deleted      Left     Total
Distribute :         6         0         6       846
Total Forms :         6         0         0      2467
6 forms took 0.589351 sec || Avg forms/sec = 5.35397
----------------------------------------------------------------
--  727
Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence

...
...

So, I need to be able to change that character in order to print it.  The character causing the problem is an "é", which we already have figured out how to save.  But, I have lots of data that has some of these characters and it's causing problems for writefln.  Any ideas how to change a non-UTF-8 string to a UTF-8 string?

thanks.

Going to bed.  Worked on this for too long.

josé

March 01, 2005

Re: Error: 4invalid UTF-8 sequence

Posted by Regan Heath
in reply to jicman

On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
> Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8 sequence."
>
> Let's say that I am working with data that contains names with accented
> characters from all over the world, and they are giving me problems, e.g.:
>
> ...
> ...
> 0 forms took 0.397589 sec || Avg forms/sec = 5.34942
> ----------------------------------------------------------------
> --  725
> Counting forms for yrajau (Rajau, Yannis)               --
> Application :       Qty   Deleted      Left     Total
> Distribute :         1         0         1       840
> Total Forms :         1         0         0      2461
> 1 forms took 0.327413 sec || Avg forms/sec = 5.34778
> ----------------------------------------------------------------
> --  726
> Counting forms for CGiunta (Giunta, Cosmo A)            --
> Application :       Qty   Deleted      Left     Total
> Distribute :         6         0         6       846
> Total Forms :         6         0         0      2467
> 6 forms took 0.589351 sec || Avg forms/sec = 5.35397
> ----------------------------------------------------------------
> --  727
> Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence
>
> ...
> ...
>
> So, I need to be able to change that character in order to print it.  The
> character causing the problem is a "é" which we already have figured out how to
> save.

How are you saving it? in what format/encoding?

> But, I have lots of data that has some of these characters and it's
> causing problems for writefln.  Any ideas how to change a non-UTF-8 string to a
> UTF-8 string?

If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.

So, to do this you load the data you've saved into a byte[] or ubyte[] then write (or find) a function that converts from your encoding into utf-8, utf-16 or utf-32, call that, and print the result.

If you cannot write/find a function, ask here, someone will either have one, or write one, most likely.
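The load, convert, print steps described above could be sketched roughly as follows. This is only an illustration, assuming a current D2 compiler and Phobos: `toUtf8` and the `Converter` alias are hypothetical names, not anything from the thread, and the converter itself still has to match whatever encoding the data really uses.

```d
import std.utf : validate, UTFException;

/// Hypothetical converter signature: legacy-encoded bytes in, UTF-8 out.
alias Converter = string function(const(ubyte)[] raw);

/// Returns `raw` unchanged if it is already well-formed UTF-8;
/// otherwise hands the bytes to `convert`.
string toUtf8(const(ubyte)[] raw, Converter convert)
{
    auto text = cast(const(char)[]) raw;
    try
    {
        validate(text);      // throws UTFException on malformed input
        return text.idup;
    }
    catch (UTFException)
    {
        return convert(raw); // not UTF-8: transcode from the legacy encoding
    }
}
```

The raw bytes could come from something like `cast(ubyte[]) std.file.read(path)`. Note that the validity check is only a heuristic: legacy-encoded text can occasionally happen to be well-formed UTF-8, so knowing the real source encoding is always better than guessing.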

Where's Arcane Jill when we need her?

Regan

March 01, 2005

Re: Error: 4invalid UTF-8 sequence

Posted by Anders F Björklund
in reply to Regan Heath

Regan Heath wrote:

>> But, I have lots of data that has some of these characters and it's
>> causing problems for writefln.  Any ideas how to change a non-UTF-8 string to a UTF-8 string?
> 
> If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.
> 
> So, to do this you load the data you've saved into a byte[] or ubyte[] then write (or find) a function that converts from your encoding into utf-8, utf-16 or utf-32, call that, and print the result.
> 
> If you cannot write/find a function, ask here, someone will either have one, or write one, most likely.

There is no support in the D language or libraries for legacy encodings,
but I provided three different methods: Latin-1 cast, lookup tables, or libiconv:

1)
http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
(see the "8-bit encodings" section for sample code)

2)
http://www.algonet.se/~afb/d/mapping.d (wchar[256] lookup tables)
http://www.algonet.se/~afb/d/mapping.zip

3)
http://www.algonet.se/~afb/d/libiconv.d
http://www.gnu.org/software/libiconv/ (has a lot of different encodings)

I suggest "ubyte[]", to avoid any sign issues when converting.
Got my tables from http://www.unicode.org/Public/MAPPINGS/, by the way
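For readers who cannot reach the linked files, the lookup-table method (2) might look something like the sketch below. It is not the code from mapping.d. It assumes a current D2 compiler and Phobos, the function names are made up for illustration, and the table is deliberately incomplete: it starts from the identity mapping (which is exactly ISO-8859-1) and overrides only three Windows-1252 entries. A real table would be filled in from the unicode.org mapping files.

```d
import std.utf : toUTF8;

/// Builds a 256-entry byte-to-UTF-16 table. Identity mapping equals
/// ISO-8859-1; the three overrides below are a partial Windows-1252 sample.
wchar[256] makeTable()
{
    wchar[256] table;
    foreach (i, ref w; table)
        w = cast(wchar) i;
    table[0x80] = '\u20AC'; // EURO SIGN
    table[0x85] = '\u2026'; // HORIZONTAL ELLIPSIS
    table[0x99] = '\u2122'; // TRADE MARK SIGN
    return table;
}

/// Maps every input byte through the table, then re-encodes as UTF-8.
string decodeWithTable(const(ubyte)[] bytes, const wchar[256] table)
{
    auto units = new wchar[bytes.length];
    foreach (i, b; bytes)
        units[i] = table[b]; // ubyte index, so no sign problems
    return toUTF8(units);
}
```

Indexing with a `ubyte` is what makes the "ubyte[]" suggestion matter: a signed `byte` of 0xE9 would be negative and could not be used as a table index directly.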

--anders

March 01, 2005

Re: Error: 4invalid UTF-8 sequence

Posted by jicman
in reply to Anders F Björklund

Thanks.


March 01, 2005

Re: Error: 4invalid UTF-8 sequence

Posted by jicman
in reply to Regan Heath

In article <opsmy79sem23k2f5@ally>, Regan Heath says...
>
>On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
>> Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8 sequence."
>>
>> Let's say that I am working with data that contains names with accented characters from all over the world, and they are giving me problems, e.g.:
>>
>> ...
>> ...
>> 0 forms took 0.397589 sec || Avg forms/sec = 5.34942
>> ----------------------------------------------------------------
>> --  725
>> Counting forms for yrajau (Rajau, Yannis)               --
>> Application :       Qty   Deleted      Left     Total
>> Distribute :         1         0         1       840
>> Total Forms :         1         0         0      2461
>> 1 forms took 0.327413 sec || Avg forms/sec = 5.34778
>> ----------------------------------------------------------------
>> --  726
>> Counting forms for CGiunta (Giunta, Cosmo A)            --
>> Application :       Qty   Deleted      Left     Total
>> Distribute :         6         0         6       846
>> Total Forms :         6         0         0      2467
>> 6 forms took 0.589351 sec || Avg forms/sec = 5.35397
>> ----------------------------------------------------------------
>> --  727
>> Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence
>>
>> ...
>> ...
>>
>> So, I need to be able to change that character in order to print it.  The
>> character causing the problem is a "é" which we already have figured out
>> how to
>> save.
>
>How are you saving it? in what format/encoding?

I don't save it.  A software using IE as client allows for data entry and that's how josé was entered.  I am just dumping lots of XML from that server and it always breaks on josé.

>
>> But, I have lots of data that has some of these characters and it's
>> causing problems for writefln.  Any ideas how to change a non-UTF-8
>> string to a
>> UTF-8 string?
>
>If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.

But I didn't.  It must be WindoZE, or Windows, as others call it.  There are two
ways of entering an é on the computer:
1. Using the ALT key + 130 on the numeric keypad on the right side of the keyboard, or
2. having two keyboards set up on your system and switching between them when needed.

>So, to do this you load the data you've saved into a byte[] or ubyte[] then write (or find) a function that converts from your encoding into utf-8, utf-16 or utf-32, call that, and print the result.

Yeah, I was thinking that I may have to do this, or something... :-)

>
>If you cannot write/find a function, ask here, someone will either have one, or write one, most likely.
>
>Where's Arcane Jill when we need her?

Yeah, where is she?

thanks.

jic

March 01, 2005

Re: Error: 4invalid UTF-8 sequence

Posted by Regan Heath
in reply to jicman

On Tue, 1 Mar 2005 21:35:27 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
> In article <opsmy79sem23k2f5@ally>, Regan Heath says...
>>
>> On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman
>> <jicman_member@pathlink.com> wrote:
>>> Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8
>>> sequence."
>>>
>>> Let's say that I am working with data that contains names with accented
>>> characters from all over the world, and they are giving me problems, e.g.:
>>>
>>> ...
>>> ...
>>> 0 forms took 0.397589 sec || Avg forms/sec = 5.34942
>>> ----------------------------------------------------------------
>>> --  725
>>> Counting forms for yrajau (Rajau, Yannis)               --
>>> Application :       Qty   Deleted      Left     Total
>>> Distribute :         1         0         1       840
>>> Total Forms :         1         0         0      2461
>>> 1 forms took 0.327413 sec || Avg forms/sec = 5.34778
>>> ----------------------------------------------------------------
>>> --  726
>>> Counting forms for CGiunta (Giunta, Cosmo A)            --
>>> Application :       Qty   Deleted      Left     Total
>>> Distribute :         6         0         6       846
>>> Total Forms :         6         0         0      2467
>>> 6 forms took 0.589351 sec || Avg forms/sec = 5.35397
>>> ----------------------------------------------------------------
>>> --  727
>>> Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence
>>>
>>> ...
>>> ...
>>>
>>> So, I need to be able to change that character in order to print it.  The
>>> character causing the problem is a "é" which we already have figured out
>>> how to
>>> save.
>>
>> How are you saving it? in what format/encoding?
>
> I don't save it.  A software using IE as client allows for data entry and that's
> how josé was entered.  I am just dumping lots of XML from that server and it
> always breaks on josé.

Then the question is "What encoding does it save the character data in?"

>>> But, I have lots of data that has some of these characters and it's
>>> causing problems for writefln.  Any ideas how to change a non-UTF-8
>>> string to a
>>> UTF-8 string?
>>
>> If you had saved it in utf-8, you could simply load it and print it. As
>> this isn't working, I assume you've saved it in another encoding.
>
> But I didn't.  It must be WindoZE or Windows, as others call it.

Windows has nothing to do with the problem AFAICS.

A program ("A software using IE as client") has saved the data in a certain encoding.

You're reading that data into a char[] and then printing it with writef, which finds an invalid UTF-8 sequence because the data isn't UTF-8 encoded; it's something else.
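To see where such data stops being valid UTF-8, a small diagnostic along these lines may help. It is a sketch for a current D2 compiler and Phobos, and `firstInvalidOffset` is a hypothetical helper name rather than a library function.

```d
import std.utf : decode, UTFException;

/// Returns the byte offset of the first malformed UTF-8 sequence,
/// or -1 if the whole text decodes cleanly.
ptrdiff_t firstInvalidOffset(const(char)[] text)
{
    size_t i = 0;
    while (i < text.length)
    {
        immutable start = i;
        try
        {
            decode(text, i); // advances i past one code point
        }
        catch (UTFException)
        {
            return start;    // this is where printing would fail
        }
    }
    return -1;
}
```

For the bytes of "jos" followed by a lone 0xE9, the offset reported is 3, which matches the point where the output in the thread is cut off.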

> There are two
> ways of entering an é on the computer.
> 1. Using the ALT key + 130 on the number keys on the right side of the keyboard
> or having two keyboards on your system and changing keyboards when needed.

Sure, and when you enter that 'é' the program you enter it into has _lots_ of different options as to how to encode it. UTF-8 is the option you need it to take, or, you need to transcode from the option it uses, to UTF-8.
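As a concrete illustration of those "lots of different options", here are three byte-level representations of the same "é". The constant and function names are invented for this sketch, and the Alt+130 value assumes the OEM code page is 437 or 850.

```d
import std.utf : encode;

enum dchar eAcute = '\u00E9';   // "é" as a Unicode code point (U+00E9)
enum ubyte eAcuteLatin1 = 0xE9; // single byte in ISO-8859-1
enum ubyte eAcuteCp437 = 0x82;  // decimal 130 in code page 437/850 (assumed OEM page)

/// Returns the UTF-8 encoding of one code point as raw bytes.
ubyte[] utf8Bytes(dchar c)
{
    char[4] buf;
    immutable n = encode(buf, c); // number of code units written
    return cast(ubyte[]) buf[0 .. n].dup;
}
```

In UTF-8 the character takes the two bytes 0xC3 0xA9. Neither 0xE9 nor 0x82 on its own is a well-formed UTF-8 sequence, which is why writef rejects data stored in either of the single-byte encodings.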

Regan

March 01, 2005

Re: Error: 4invalid UTF-8 sequence

Posted by jicman
in reply to Regan Heath

In article <opsmzb2xhn23k2f5@ally>, Regan Heath says...
>
>On Tue, 1 Mar 2005 21:35:27 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
>> In article <opsmy79sem23k2f5@ally>, Regan Heath says...
>>>
>>> On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman
>>> <jicman_member@pathlink.com> wrote:
>>>> Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8 sequence."
>>>>
>>>> Let's say that I am working with data that contains names with
>>>> accented characters from all over the world, and they are giving me problems, e.g.:
>>>>
>>>> ...
>>>> ...
>>>> 0 forms took 0.397589 sec || Avg forms/sec = 5.34942
>>>> ----------------------------------------------------------------
>>>> --  725
>>>> Counting forms for yrajau (Rajau, Yannis)               --
>>>> Application :       Qty   Deleted      Left     Total
>>>> Distribute :         1         0         1       840
>>>> Total Forms :         1         0         0      2461
>>>> 1 forms took 0.327413 sec || Avg forms/sec = 5.34778
>>>> ----------------------------------------------------------------
>>>> --  726
>>>> Counting forms for CGiunta (Giunta, Cosmo A)            --
>>>> Application :       Qty   Deleted      Left     Total
>>>> Distribute :         6         0         6       846
>>>> Total Forms :         6         0         0      2467
>>>> 6 forms took 0.589351 sec || Avg forms/sec = 5.35397
>>>> ----------------------------------------------------------------
>>>> --  727
>>>> Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence
>>>>
>>>> ...
>>>> ...
>>>>
>>>> So, I need to be able to change that character in order to print it.
>>>> The
>>>> character causing the problem is a "é" which we already have figured
>>>> out
>>>> how to
>>>> save.
>>>
>>> How are you saving it? in what format/encoding?
>>
>> I don't save it.  A software using IE as client allows for data entry
>> and that's
>> how josé was entered.  I am just dumping lots of XML from that server
>> and it always breaks on josé.
>
>Then the question is "What encoding does it save the character data in?"

Here is a response from the server:

HTTP/1.1 200 OK
Date: Tue, 01 Mar 2005 22:19:06 GMT
Server: FlowPort Web Server/FlowPort 2.2.1.88 created 6/3/03 4:07 AM
MIME-version: 1.0
Content-Type: application/xml

<?xml version="1.0" encoding="iso-8859-1"?>

[blah- clip -blah]

<UserInfo>
<UserName>jcabrera</UserName>
<LastName>cabrera</LastName>
<FirstName>josError: 4invalid UTF-8 sequence


So, it's iso-8859-1.  Maybe I could do my post and accept only UTF-8.  That could work.
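Reading the declared encoding out of the response, as was done by eye above, could be automated with something like this sketch (current D2 and Phobos assumed; `declaredEncoding` is a hypothetical name). It works on raw bytes so that it never trips over the non-UTF-8 data further down, but it is simplistic: it only handles double quotes, does not check that the match sits inside the XML declaration, and ignores any charset given in the HTTP Content-Type header.

```d
import std.algorithm.searching : countUntil, find;

/// Returns the value of the first encoding="..." attribute found in `raw`,
/// or null if there is none.
string declaredEncoding(const(ubyte)[] raw)
{
    auto key = cast(const(ubyte)[]) "encoding=\"";
    auto rest = raw.find(key);          // byte-wise search, no UTF decoding
    if (rest.length == 0)
        return null;
    rest = rest[key.length .. $];
    immutable ptrdiff_t end = rest.countUntil(cast(ubyte) '"');
    if (end < 0)
        return null;                    // no closing quote
    return (cast(const(char)[]) rest[0 .. end]).idup;
}
```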


>
>>>> But, I have lots of data that has some of these charaters and it's
>>>> causing problems for writefln.  Any ideas how to change a non-UTF-8
>>>> string to a
>>>> UTF-8 string?
>>>
>>> If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.
>>
>> But I didn't.  It must be WindoZE or Windows, as others call it.
>
>Windows has nothing to do with the problem AFAICS.
>
>A program "A software using IE as client" has saved the data in a certain encoding.
>
>You're reading that data, into a char[], and then printing it with writef, which finds an invalid UTF-8 character, because the data isn't UTF-8 encoded, it's something else.
>
>> There are two
>> ways of entering an é on the computer.
>> 1. Using the ALT key + 130 on the number keys on the right side of the
>> keyboard
>> or having two keyboards on your system and changing keyboards when
>> needed.
>
>Sure, and when you enter that 'é' the program you enter it into has _lots_ of different options as to how to encode it. UTF-8 is the option you need it to take, or, you need to transcode from the option it uses, to UTF-8.
>
>Regan

again, thanks.

March 01, 2005

Re: Error: 4invalid UTF-8 sequence

Posted by Anders F Björklund
in reply to jicman

jicman wrote:

> So, it's iso-8859-1.  Maybe I could do my post and accept only UTF-8.  That
> could work.

You're in luck then. It's by far the simplest to convert to UTF...
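The reason ISO-8859-1 is the simple case is that each byte value is numerically equal to the Unicode code point it stands for, so no table is needed. A minimal sketch, assuming a current D2 compiler (`latin1ToUtf8` is just an illustrative name):

```d
/// Converts ISO-8859-1 bytes to a UTF-8 string.
string latin1ToUtf8(const(ubyte)[] bytes)
{
    string result;
    foreach (b; bytes)
        result ~= cast(dchar) b; // appending a dchar encodes it as UTF-8
    return result;
}
```

One caveat: data labelled iso-8859-1 is sometimes really Windows-1252, in which case bytes 0x80 to 0x9F would need a lookup table instead of this direct cast.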

--anders

March 01, 2005

Re: Error: 4invalid UTF-8 sequence

Posted by jicman
in reply to Anders F Björklund

Anders F Björklund says...
>
>jicman wrote:
>
>> So, it's iso-8859-1.  Maybe I could do my post and accept only UTF-8.  That could work.
>
>You're in luck then. It's by far the simplest to convert to UTF...
>
>--anders

I don't have time right now... (time constraint!), but did come up with this little function for anyone out there to use, for a quick "print patching":

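import std.ctype;

// Quick "print patching": replace every non-ASCII byte with '+'
// so that writefln never sees an invalid UTF-8 sequence.
char[] CheckForUTF8(char[] name)
{
    char[] outStr = null;
    foreach (char c; name)
    {
        if (std.ctype.isascii(c) > 0)
            outStr ~= c;    // plain ASCII: keep as-is
        else
            outStr ~= "+";  // non-ASCII byte: replace
    }
    return outStr;
}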

It replaces each offending character with a + and allows printing. :-)  Hey, I didn't say it was pretty. :-)  It just allows me to print.  So, now the output looks like:


----------------------------------------------------------------
--    6
Counting forms for jcabrera (cabrera, jos+ isa+as)      --
Application :       Qty   Deleted      Left     Total
DocumentToken :      2589         0      2589      2596
Distribute :         7         0         7        19
Total Forms :      2596         0         0      2615
2596 forms took 29.1392 sec || Avg forms/sec =  87.989

----------------------------------------------------------------

Pretty, uh? :-)
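The function above replaces every non-ASCII byte, so it would also mangle text that is already correct UTF-8. A slightly gentler variant, sketched here for a current D2 compiler and Phobos (where std.ctype no longer exists), keeps well-formed sequences and replaces only the malformed bytes. `replaceInvalidUtf8` is a hypothetical name, and this is still a lossy patch rather than a real conversion.

```d
import std.utf : decode, UTFException;

/// Copies `input`, substituting `placeholder` for each byte that does not
/// start a well-formed UTF-8 sequence.
string replaceInvalidUtf8(const(char)[] input, char placeholder = '+')
{
    string result;
    size_t i = 0;
    while (i < input.length)
    {
        immutable start = i;
        try
        {
            decode(input, i);            // advances i past one code point
            result ~= input[start .. i]; // valid sequence: keep it
        }
        catch (UTFException)
        {
            result ~= placeholder;       // malformed byte: patch it
            i = start + 1;               // resume at the next byte
        }
    }
    return result;
}
```

On the ISO-8859-1 bytes for the name in the output above this gives the same "jos+ isa+as", while a properly UTF-8 encoded "josé" passes through untouched.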

thanks for all the help and info.

jic

Copyright © 1999-2021 by the D Language Foundation