Error: 4invalid UTF-8 sequence
March 01, 2005
Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8 sequence."

Let's say that I am working with data that contains names with accented characters from all over the world, and they are giving me problems, e.g.:

...
...
0 forms took 0.397589 sec || Avg forms/sec = 5.34942
----------------------------------------------------------------
--  725
Counting forms for yrajau (Rajau, Yannis)               --
Application :       Qty   Deleted      Left     Total
Distribute :         1         0         1       840
Total Forms :         1         0         0      2461
1 forms took 0.327413 sec || Avg forms/sec = 5.34778
----------------------------------------------------------------
--  726
Counting forms for CGiunta (Giunta, Cosmo A)            --
Application :       Qty   Deleted      Left     Total
Distribute :         6         0         6       846
Total Forms :         6         0         0      2467
6 forms took 0.589351 sec || Avg forms/sec = 5.35397
----------------------------------------------------------------
--  727
Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence

...
...

So, I need to be able to change that character in order to print it.  The character causing the problem is an "é", which we have already figured out how to save.  But I have lots of data that has some of these characters, and they are causing problems for writefln.  Any ideas how to change a non-UTF-8 string to a UTF-8 string?

thanks.

Going to bed.  Worked on this for too long.

josé


March 01, 2005
On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
> Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8 sequence."
>
> Let's say that I am working with data that contains names with accented
> characters from all over the world, and they are giving me problems, e.g.:
>
> ...
> ...
> 0 forms took 0.397589 sec || Avg forms/sec = 5.34942
> ----------------------------------------------------------------
> --  725
> Counting forms for yrajau (Rajau, Yannis)               --
> Application :       Qty   Deleted      Left     Total
> Distribute :         1         0         1       840
> Total Forms :         1         0         0      2461
> 1 forms took 0.327413 sec || Avg forms/sec = 5.34778
> ----------------------------------------------------------------
> --  726
> Counting forms for CGiunta (Giunta, Cosmo A)            --
> Application :       Qty   Deleted      Left     Total
> Distribute :         6         0         6       846
> Total Forms :         6         0         0      2467
> 6 forms took 0.589351 sec || Avg forms/sec = 5.35397
> ----------------------------------------------------------------
> --  727
> Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence
>
> ...
> ...
>
> So, I need to be able to change that character in order to print it.  The
> character causing the problem is an "é", which we have already figured out how to
> save.

How are you saving it? in what format/encoding?

> But I have lots of data that has some of these characters, and they are
> causing problems for writefln.  Any ideas how to change a non-UTF-8 string to a
> UTF-8 string?

If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.

So, to do this you load the data you've saved into a byte[] or ubyte[] then write (or find) a function that converts from your encoding into utf-8, utf-16 or utf-32, call that, and print the result.

If you cannot write/find a function, ask here, someone will either have one, or write one, most likely.

Where's Arcane Jill when we need her?

Regan
March 01, 2005
Regan Heath wrote:

>> But I have lots of data that has some of these characters, and they are
>> causing problems for writefln.  Any ideas how to change a non-UTF-8 string to a UTF-8 string?
> 
> If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.
> 
> So, to do this you load the data you've saved into a byte[] or ubyte[] then write (or find) a function that converts from your encoding into utf-8, utf-16 or utf-32, call that, and print the result.
> 
> If you cannot write/find a function, ask here, someone will either have one, or write one, most likely.

There is no support in the D language or libraries for legacy encodings,
but I provided three different methods: Latin-1 cast, lookup tables, or libiconv.

1)
http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
(see the "8-bit encodings" section for sample code)

2)
http://www.algonet.se/~afb/d/mapping.d (wchar[256] lookup tables)
http://www.algonet.se/~afb/d/mapping.zip

3)
http://www.algonet.se/~afb/d/libiconv.d
http://www.gnu.org/software/libiconv/ (has a lot of different encodings)

I suggest "ubyte[]", to avoid any issues with signs when converting.
I got my tables from http://www.unicode.org/Public/MAPPINGS/, by the way.

--anders
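
The lookup-table method (2) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from mapping.d: it is written for present-day D (D2) rather than the 2005 compiler, the names makeLatin9Table and decodeWithTable are made up for the example, and ISO-8859-15 is used as the sample encoding only because it differs from Latin-1 in just eight positions, so the whole table fits in a few lines.

```d
import std.utf : toUTF8;

/// Builds a 256-entry byte-to-code-point table for ISO-8859-15.
/// (Hypothetical helper: Latin-1 identity mapping plus eight overrides.)
wchar[256] makeLatin9Table()
{
    wchar[256] table;
    foreach (i; 0 .. 256)
        table[i] = cast(wchar) i;   // Latin-1 bytes equal their code points
    table[0xA4] = '\u20AC';         // euro sign
    table[0xA6] = '\u0160';         // S with caron
    table[0xA8] = '\u0161';         // s with caron
    table[0xB4] = '\u017D';         // Z with caron
    table[0xB8] = '\u017E';         // z with caron
    table[0xBC] = '\u0152';         // OE ligature
    table[0xBD] = '\u0153';         // oe ligature
    table[0xBE] = '\u0178';         // Y with diaeresis
    return table;
}

/// Maps every input byte through the table, then re-encodes as UTF-8.
string decodeWithTable(const(ubyte)[] input, const(wchar)[] table)
{
    assert(table.length == 256);
    wchar[] wide;
    wide.reserve(input.length);
    foreach (b; input)
        wide ~= table[b];           // ubyte index, so no sign problems
    return toUTF8(wide);
}
```

Other 8-bit encodings would need their own 256 entries, taken from the Unicode mapping files; only the table changes, the decoding loop stays the same.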
March 01, 2005
Thanks.


In article <d02lop$qlu$1@digitaldaemon.com>, Anders F Björklund says...
>
>Regan Heath wrote:
>
>>> But I have lots of data that has some of these characters, and they are causing problems for writefln.  Any ideas how to change a non-UTF-8 string to a UTF-8 string?
>> 
>> If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.
>> 
>> So, to do this you load the data you've saved into a byte[] or ubyte[] then write (or find) a function that converts from your encoding into utf-8, utf-16 or utf-32, call that, and print the result.
>> 
>> If you cannot write/find a function, ask here, someone will either have one, or write one, most likely.
>
>There is no support in the D language or libraries for legacy encodings, but I provided three different methods: Latin-1 cast, lookup tables, or libiconv.
>
>1)
>http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
>(see the "8-bit encodings" section for sample code)
>
>2)
>http://www.algonet.se/~afb/d/mapping.d (wchar[256] lookup tables)
>http://www.algonet.se/~afb/d/mapping.zip
>
>3)
>http://www.algonet.se/~afb/d/libiconv.d
>http://www.gnu.org/software/libiconv/ (has a lot of different encodings)
>
>I suggest "ubyte[]", to avoid any issues with signs when converting. I got my tables from http://www.unicode.org/Public/MAPPINGS/, by the way.
>
>--anders


March 01, 2005
In article <opsmy79sem23k2f5@ally>, Regan Heath says...
>
>On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
>> Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8 sequence."
>>
>> Let's say that I am working with data that contains names with accented characters from all over the world, and they are giving me problems, e.g.:
>>
>> ...
>> ...
>> 0 forms took 0.397589 sec || Avg forms/sec = 5.34942
>> ----------------------------------------------------------------
>> --  725
>> Counting forms for yrajau (Rajau, Yannis)               --
>> Application :       Qty   Deleted      Left     Total
>> Distribute :         1         0         1       840
>> Total Forms :         1         0         0      2461
>> 1 forms took 0.327413 sec || Avg forms/sec = 5.34778
>> ----------------------------------------------------------------
>> --  726
>> Counting forms for CGiunta (Giunta, Cosmo A)            --
>> Application :       Qty   Deleted      Left     Total
>> Distribute :         6         0         6       846
>> Total Forms :         6         0         0      2467
>> 6 forms took 0.589351 sec || Avg forms/sec = 5.35397
>> ----------------------------------------------------------------
>> --  727
>> Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence
>>
>> ...
>> ...
>>
>> So, I need to be able to change that character in order to print it.  The
>> character causing the problem is an "é", which we have already figured out
>> how to save.
>
>How are you saving it? in what format/encoding?

I don't save it.  A piece of software that uses IE as its client allows for data entry, and that's how josé was entered.  I am just dumping lots of XML from that server, and it always breaks on josé.

>
>> But I have lots of data that has some of these characters, and they are
>> causing problems for writefln.  Any ideas how to change a non-UTF-8
>> string to a UTF-8 string?
>
>If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.

But I didn't.  It must be WindoZE or Windows, as others call it.  There are two ways of entering an é on the computer:
1. Using the ALT key + 130 on the number keys on the right side of the keyboard, or
2. Having two keyboards on your system and changing keyboards when needed.

>So, to do this you load the data you've saved into a byte[] or ubyte[] then write (or find) a function that converts from your encoding into utf-8, utf-16 or utf-32, call that, and print the result.

Yeah, I was thinking that I may have to do this, or something... :-)

>
>If you cannot write/find a function, ask here, someone will either have one, or write one, most likely.
>
>Where's Arcane Jill when we need her?

Yeah, where is she?

thanks.

jic


March 01, 2005
On Tue, 1 Mar 2005 21:35:27 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
> In article <opsmy79sem23k2f5@ally>, Regan Heath says...
>>
>> On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman
>> <jicman_member@pathlink.com> wrote:
>>> Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8
>>> sequence."
>>>
>>> Let's say that I am working with data that contains names with accented
>>> characters from all over the world, and they are giving me problems, e.g.:
>>>
>>> ...
>>> ...
>>> 0 forms took 0.397589 sec || Avg forms/sec = 5.34942
>>> ----------------------------------------------------------------
>>> --  725
>>> Counting forms for yrajau (Rajau, Yannis)               --
>>> Application :       Qty   Deleted      Left     Total
>>> Distribute :         1         0         1       840
>>> Total Forms :         1         0         0      2461
>>> 1 forms took 0.327413 sec || Avg forms/sec = 5.34778
>>> ----------------------------------------------------------------
>>> --  726
>>> Counting forms for CGiunta (Giunta, Cosmo A)            --
>>> Application :       Qty   Deleted      Left     Total
>>> Distribute :         6         0         6       846
>>> Total Forms :         6         0         0      2467
>>> 6 forms took 0.589351 sec || Avg forms/sec = 5.35397
>>> ----------------------------------------------------------------
>>> --  727
>>> Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence
>>>
>>> ...
>>> ...
>>>
>>> So, I need to be able to change that character in order to print it.  The
>>> character causing the problem is an "é", which we have already figured out
>>> how to save.
>>
>> How are you saving it? in what format/encoding?
>
> I don't save it.  A piece of software that uses IE as its client allows for data entry, and that's
> how josé was entered.  I am just dumping lots of XML from that server, and it
> always breaks on josé.

Then the question is "What encoding does it save the character data in?"

>>> But I have lots of data that has some of these characters, and they are
>>> causing problems for writefln.  Any ideas how to change a non-UTF-8
>>> string to a UTF-8 string?
>>
>> If you had saved it in utf-8, you could simply load it and print it. As
>> this isn't working, I assume you've saved it in another encoding.
>
> But I didn't.  It must be WindoZE or Windows, as others call it.

Windows has nothing to do with the problem AFAICS.

A program (the "software using IE as client") has saved the data in a certain encoding.

You're reading that data into a char[] and then printing it with writef, which finds an invalid UTF-8 sequence because the data isn't UTF-8 encoded; it's something else.
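
To see why the output stops right after "Jos": in ISO-8859-1 the é is the single byte 0xE9, but in UTF-8 a byte of the form 1110xxxx announces a three-byte sequence and must be followed by two continuation bytes. A space or letter follows instead, so the decoder gives up. A rough sketch of checking data before printing it is below; it assumes present-day D (D2) and its std.utf module, and isValidUtf8 is a made-up helper name, not something from Phobos.

```d
import std.utf : validate, UTFException;

/// Returns true when the bytes form well-formed UTF-8, false otherwise.
/// (Hypothetical helper; std.utf.validate does the actual checking.)
bool isValidUtf8(const(ubyte)[] raw)
{
    try
    {
        // Reinterpret the bytes as UTF-8 code units and let Phobos check them.
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException e)
    {
        return false;   // at least one ill-formed sequence was found
    }
}
```

With such a check, "jos" followed by 0xE9 and a space is rejected, while "jos" followed by 0xC3 0xA9 (the UTF-8 form of é) is accepted.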

> There are two ways of entering an é on the computer:
> 1. Using the ALT key + 130 on the number keys on the right side of the keyboard, or
> 2. Having two keyboards on your system and changing keyboards when needed.

Sure, and when you enter that 'é', the program you enter it into has _lots_ of different options as to how to encode it. UTF-8 is the option you need it to take; otherwise, you need to transcode from whatever encoding it uses to UTF-8.

Regan
March 01, 2005
In article <opsmzb2xhn23k2f5@ally>, Regan Heath says...
>
>On Tue, 1 Mar 2005 21:35:27 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
>> In article <opsmy79sem23k2f5@ally>, Regan Heath says...
>>>
>>> On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman
>>> <jicman_member@pathlink.com> wrote:
>>>> Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8 sequence."
>>>>
>>>> Let's say that I am working with data that contains names with
>>>> accented characters from all over the world, and they are giving me
>>>> problems, e.g.:
>>>>
>>>> ...
>>>> ...
>>>> 0 forms took 0.397589 sec || Avg forms/sec = 5.34942
>>>> ----------------------------------------------------------------
>>>> --  725
>>>> Counting forms for yrajau (Rajau, Yannis)               --
>>>> Application :       Qty   Deleted      Left     Total
>>>> Distribute :         1         0         1       840
>>>> Total Forms :         1         0         0      2461
>>>> 1 forms took 0.327413 sec || Avg forms/sec = 5.34778
>>>> ----------------------------------------------------------------
>>>> --  726
>>>> Counting forms for CGiunta (Giunta, Cosmo A)            --
>>>> Application :       Qty   Deleted      Left     Total
>>>> Distribute :         6         0         6       846
>>>> Total Forms :         6         0         0      2467
>>>> 6 forms took 0.589351 sec || Avg forms/sec = 5.35397
>>>> ----------------------------------------------------------------
>>>> --  727
>>>> Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence
>>>>
>>>> ...
>>>> ...
>>>>
>>>> So, I need to be able to change that character in order to print it.
>>>> The character causing the problem is an "é", which we have already
>>>> figured out how to save.
>>>
>>> How are you saving it? in what format/encoding?
>>
>> I don't save it.  A piece of software that uses IE as its client allows
>> for data entry, and that's how josé was entered.  I am just dumping lots
>> of XML from that server, and it always breaks on josé.
>
>Then the question is "What encoding does it save the character data in?"

Here is a response from the server:

HTTP/1.1 200 OK
Date: Tue, 01 Mar 2005 22:19:06 GMT
Server: FlowPort Web Server/FlowPort 2.2.1.88 created 6/3/03 4:07 AM
MIME-version: 1.0
Content-Type: application/xml

<?xml version="1.0" encoding="iso-8859-1"?>

[blah- clip -blah]

<UserInfo>
<UserName>jcabrera</UserName>
<LastName>cabrera</LastName>
<FirstName>josError: 4invalid UTF-8 sequence


So, it's iso-8859-1.  Maybe I could do my post and accept only UTF-8.  That could work.


>
>>>> But I have lots of data that has some of these characters, and they are
>>>> causing problems for writefln.  Any ideas how to change a non-UTF-8
>>>> string to a UTF-8 string?
>>>
>>> If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.
>>
>> But I didn't.  It must be WindoZE or Windows, as others call it.
>
>Windows has nothing to do with the problem AFAICS.
>
>A program (the "software using IE as client") has saved the data in a certain encoding.
>
>You're reading that data into a char[] and then printing it with writef, which finds an invalid UTF-8 sequence because the data isn't UTF-8 encoded; it's something else.
>
>> There are two ways of entering an é on the computer:
>> 1. Using the ALT key + 130 on the number keys on the right side of the keyboard, or
>> 2. Having two keyboards on your system and changing keyboards when needed.
>
>Sure, and when you enter that 'é', the program you enter it into has _lots_ of different options as to how to encode it. UTF-8 is the option you need it to take; otherwise, you need to transcode from whatever encoding it uses to UTF-8.
>
>Regan

again, thanks.


March 01, 2005
jicman wrote:

> So, it's iso-8859-1.  Maybe I could do my post and accept only UTF-8.  That
> could work.

You're in luck then. It's by far the simplest to convert to UTF...

--anders
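
The reason ISO-8859-1 is the simple case: every byte value is already the Unicode code point of the character it stands for, so no table is needed. Bytes below 0x80 pass through unchanged, and each byte from 0x80 to 0xFF becomes exactly two UTF-8 bytes. A minimal sketch, assuming present-day D (D2); latin1ToUtf8 is a made-up name for the example, not a library function.

```d
/// Converts ISO-8859-1 bytes to a UTF-8 string.
/// (Hypothetical helper name; a sketch, not a library routine.)
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    result.reserve(input.length);
    foreach (b; input)
    {
        if (b < 0x80)
        {
            result ~= cast(char) b;                   // ASCII: copied as-is
        }
        else
        {
            // U+0080..U+00FF encode as two bytes: 110000xx 10xxxxxx
            result ~= cast(char)(0xC0 | (b >> 6));    // top two bits of the byte
            result ~= cast(char)(0x80 | (b & 0x3F));  // low six bits of the byte
        }
    }
    return result.idup;
}
```

For the é in this thread, byte 0xE9 becomes the pair 0xC3 0xA9. Running the server's response through a function like this before handing it to writefln should avoid the invalid UTF-8 error, provided the data really is ISO-8859-1 as the XML declaration claims.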
March 01, 2005
Anders F Björklund says...
>
>jicman wrote:
>
>> So, it's iso-8859-1.  Maybe I could do my post and accept only UTF-8.  That could work.
>
>You're in luck then. It's by far the simplest to convert to UTF...
>
>--anders

I don't have time right now... (time constraint!), but I did come up with this little function for anyone out there to use, for a quick "print patching":

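import std.ctype;  // isascii

// Returns a copy of name in which every non-ASCII byte is replaced by '+',
// so the result always prints. The original characters are lost.
char[] CheckForUTF8(char[] name)
{
    char[] outStr = null;
    foreach (char c; name)
    {
        if (std.ctype.isascii(c) > 0)
            outStr ~= c;    // plain ASCII: keep it
        else
            outStr ~= "+";  // anything else: substitute a placeholder
    }
    return outStr;
}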

It will replace each offending character with a + and allow printing. :-)  Hey, I didn't say it was pretty. :-)  It just allows me to print.  So, now the output looks like:


----------------------------------------------------------------
--    6
Counting forms for jcabrera (cabrera, jos+ isa+as)      --
Application :       Qty   Deleted      Left     Total
DocumentToken :      2589         0      2589      2596
Distribute :         7         0         7        19
Total Forms :      2596         0         0      2615
2596 forms took 29.1392 sec || Avg forms/sec =  87.989

----------------------------------------------------------------

Pretty, uh? :-)

thanks for all the help and info.

jic