March 01, 2005 Error: 4invalid UTF-8 sequence | ||||
---|---|---|---|---|
| ||||
Greetings! And sorry about revisiting "Error: 4invalid UTF-8 sequence."

Let's say that I am working with data that contains names with accented characters from all over the world, and they are giving me problems, e.g.:

...
...
0 forms took 0.397589 sec || Avg forms/sec = 5.34942
----------------------------------------------------------------
-- 725 Counting forms for yrajau (Rajau, Yannis) --
Application : Qty Deleted Left Total
Distribute : 1 0 1 840
Total Forms : 1 0 0 2461
1 forms took 0.327413 sec || Avg forms/sec = 5.34778
----------------------------------------------------------------
-- 726 Counting forms for CGiunta (Giunta, Cosmo A) --
Application : Qty Deleted Left Total
Distribute : 6 0 6 846
Total Forms : 6 0 0 2467
6 forms took 0.589351 sec || Avg forms/sec = 5.35397
----------------------------------------------------------------
-- 727 Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence
...
...

So, I need to be able to change that character in order to print it. The character causing the problem is an "é", which we have already figured out how to save. But I have lots of data containing such characters, and it's causing problems for writefln. Any ideas on how to change a non-UTF-8 string to a UTF-8 string?

Thanks. Going to bed. Worked on this for too long.

josé |
March 01, 2005 Re: Error: 4invalid UTF-8 sequence | ||||
---|---|---|---|---|
| ||||
Posted in reply to jicman | On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
> Greetings! And sorry about the revisit of "Error: 4invalid UTF-8 sequence."
>
> Let's say that I am working with a data that contains names with accented
> characters from all over the world and they are giving me problems. ie.
>
> [...]
> -- 727 Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence
> [...]
>
> So, I need to be able to change that character in order to print it. The
> character causing the problem is a "é" which we already have figured out how to
> save.

How are you saving it? In what format/encoding?

> But, I have lots of data that has some of these characters and it's
> causing problems for writefln. Any ideas how to change a non-UTF-8 string to a
> UTF-8 string?

If you had saved it in UTF-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.

So, to do this, load the data you've saved into a byte[] or ubyte[], then write (or find) a function that converts from your encoding into UTF-8, UTF-16 or UTF-32, call that, and print the result.

If you cannot write or find such a function, ask here; someone will most likely either have one or write one.

Where's Arcane Jill when we need her?

Regan |
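The load, convert, print steps described above can be sketched as follows. This is only a sketch under stated assumptions: it targets a current D2 compiler whose Phobos ships `std.encoding` (a module that postdates this thread), it assumes the data is ISO-8859-1, and `latin1BytesToUtf8` is a hypothetical helper name. The bytes are hard-coded where a real program would read them from a file or from the server.

```d
import std.encoding : Latin1String, transcode;

/// Hypothetical helper: converts raw ISO-8859-1 bytes to a UTF-8 string.
string latin1BytesToUtf8(const(ubyte)[] raw)
{
    // Relabel the bytes as Latin-1 text; the cast does not change them.
    auto latin1 = cast(Latin1String) raw.idup;
    string utf8;
    transcode(latin1, utf8); // std.encoding performs the conversion
    return utf8;
}

unittest
{
    // "José" as ISO-8859-1: the 'é' is the single byte 0xE9.
    ubyte[] raw = [0x4A, 0x6F, 0x73, 0xE9];
    // The result is valid UTF-8 and can be handed to writefln safely.
    assert(latin1BytesToUtf8(raw) == "Jos\u00E9");
}
```

Compiling with `dmd -unittest -main -run example.d` runs the check. Keeping the undecoded input as `ubyte[]` rather than `char[]` makes it explicit that the bytes are not yet known to be UTF-8.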
March 01, 2005 Re: Error: 4invalid UTF-8 sequence | ||||
---|---|---|---|---|
| ||||
Posted in reply to Regan Heath | Regan Heath wrote:
>> But, I have lots of data that has some of these characters and it's
>> causing problems for writefln. Any ideas how to change a non-UTF-8 string to a UTF-8 string?
>
> If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.
>
> So, to do this you load the data you've saved into a byte[] or ubyte[] then write (or find) a function that converts from your encoding into utf-8, utf-16 or utf-32, call that, and print the result.
>
> If you cannot write/find a function, ask here, someone will either have one, or write one, most likely.

There is no support in the D language or libraries for legacy encodings, but I have provided three different methods: Latin-1 cast, lookup tables, or libiconv.

1) http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
   (see the "8-bit encodings" section for sample code)

2) http://www.algonet.se/~afb/d/mapping.d (wchar[256] lookup tables)
   http://www.algonet.se/~afb/d/mapping.zip

3) http://www.algonet.se/~afb/d/libiconv.d
   http://www.gnu.org/software/libiconv/ (has a lot of different encodings)

I suggest "ubyte[]", to avoid any sign issues when converting.

I got my tables from http://www.unicode.org/Public/MAPPINGS/, by the way.

--anders |
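The lookup-table method (option 2 above) might look roughly like the following. This is a sketch in current D2, not the contents of mapping.d: `makeLatin1BasedTable` and `decodeWithTable` are hypothetical names, and the table is only the ISO-8859-1 identity mapping plus a single illustrative override (byte 0x80 is the euro sign in Windows-1252). A real table would be filled in from the unicode.org mapping files.

```d
import std.utf : toUTF8;

/// Hypothetical: builds a 256-entry byte-to-code-point table.
/// It starts as ISO-8859-1, where every byte equals its code point.
wchar[256] makeLatin1BasedTable()
{
    wchar[256] table;
    foreach (i, ref entry; table)
        entry = cast(wchar) i;
    // Illustrative override only: Windows-1252 maps 0x80 to U+20AC.
    table[0x80] = '\u20AC';
    return table;
}

/// Hypothetical: decodes 8-bit text through the table, then encodes as UTF-8.
string decodeWithTable(const(ubyte)[] raw, const wchar[256] table)
{
    auto utf16 = new wchar[raw.length];
    foreach (i, b; raw)
        utf16[i] = table[b]; // ubyte indexes cleanly; no sign issues
    return toUTF8(utf16);
}

unittest
{
    auto table = makeLatin1BasedTable();
    ubyte[] name = [0x4A, 0x6F, 0x73, 0xE9]; // "José" in ISO-8859-1
    assert(decodeWithTable(name, table) == "Jos\u00E9");
    ubyte[] euro = [0x80];
    assert(decodeWithTable(euro, table) == "\u20AC");
}
```

Indexing with a `ubyte` is what makes the "ubyte[]" suggestion matter here: bytes at 0x80 and above would be negative as signed `byte` values and could not be used directly as table indices.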
March 01, 2005 Re: Error: 4invalid UTF-8 sequence | ||||
---|---|---|---|---|
| ||||
Posted in reply to Anders F Björklund | Thanks. |
March 01, 2005 Re: Error: 4invalid UTF-8 sequence | ||||
---|---|---|---|---|
| ||||
Posted in reply to Regan Heath | In article <opsmy79sem23k2f5@ally>, Regan Heath says...
>
>On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
>> [...]
>> So, I need to be able to change that character in order to print it. The
>> character causing the problem is a "é" which we already have figured out
>> how to save.
>
>How are you saving it? in what format/encoding?

I don't save it. A software using IE as client allows for data entry, and that's how josé was entered. I am just dumping lots of XML from that server, and it always breaks on josé.

>> But, I have lots of data that has some of these characters and it's
>> causing problems for writefln. Any ideas how to change a non-UTF-8
>> string to a UTF-8 string?
>
>If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.

But I didn't. It must be WindoZE, or Windows, as others call it. There are two ways of entering an é on the computer:
1. Using the ALT key + 130 on the number keys on the right side of the keyboard, or
2. having two keyboards on your system and changing keyboards when needed.

>So, to do this you load the data you've saved into a byte[] or ubyte[] then write (or find) a function that converts from your encoding into utf-8, utf-16 or utf-32, call that, and print the result.

Yeah, I was thinking that I may have to do this, or something... :-)

>If you cannot write/find a function, ask here, someone will either have one, or write one, most likely.
>
>Where's Arcane Jill when we need her?

Yeah, where is she?

Thanks.

jic |
March 01, 2005 Re: Error: 4invalid UTF-8 sequence | ||||
---|---|---|---|---|
| ||||
Posted in reply to jicman | On Tue, 1 Mar 2005 21:35:27 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
> In article <opsmy79sem23k2f5@ally>, Regan Heath says...
>> How are you saving it? in what format/encoding?
>
> I don't save it. A software using IE as client allows for data entry and that's
> how josé was entered. I am just dumping lots of xml from that server and it
> always breaks on josé.

Then the question is: "What encoding does it save the character data in?"

>> If you had saved it in utf-8, you could simply load it and print it. As
>> this isn't working, I assume you've saved it in another encoding.
>
> But I didn't. It must be WindoZE or Windows, as others call it.

Windows has nothing to do with the problem, AFAICS.

A program (the "software using IE as client") has saved the data in a certain encoding.

You're reading that data into a char[] and then printing it with writef, which finds an invalid UTF-8 sequence because the data isn't UTF-8 encoded; it's something else.

> There are two
> ways of entering an é on the computer.
> 1. Using the ALT key + 130 on the number keys on the right side of the keyboard
> or having two keyboards on your system and changing keyboards when needed.

Sure, and when you enter that 'é', the program you enter it into has _lots_ of different options for how to encode it. Either the program needs to use UTF-8, or you need to transcode from the encoding it uses to UTF-8.

Regan |
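To see why printing fails, the bytes can be checked for UTF-8 validity before they are treated as text. The following is a minimal sketch in current D2, assuming Phobos's `std.utf.validate`; `isValidUtf8` is a hypothetical helper name. In ISO-8859-1 the 'é' is the lone byte 0xE9, which in UTF-8 would announce a three-byte sequence that never arrives.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper: reports whether raw bytes form valid UTF-8.
bool isValidUtf8(const(ubyte)[] raw)
{
    try
    {
        // validate throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    ubyte[] latin1 = [0x4A, 0x6F, 0x73, 0xE9];     // "José", ISO-8859-1
    ubyte[] utf8 = [0x4A, 0x6F, 0x73, 0xC3, 0xA9]; // "José", UTF-8
    assert(!isValidUtf8(latin1)); // this is the case writefln chokes on
    assert(isValidUtf8(utf8));
}
```

A check like this only says that the data is not UTF-8; it cannot say which encoding the data is in, which is why the source of the data has to be asked.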
March 01, 2005 Re: Error: 4invalid UTF-8 sequence | ||||
---|---|---|---|---|
| ||||
Posted in reply to Regan Heath | In article <opsmzb2xhn23k2f5@ally>, Regan Heath says...
>
>On Tue, 1 Mar 2005 21:35:27 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
>> I don't save it. A software using IE as client allows for data entry
>> and that's how josé was entered. I am just dumping lots of xml from
>> that server and it always breaks on josé.
>
>Then the question is "What encoding does it save the character data in?"

Here is a response from the server:

HTTP/1.1 200 OK
Date: Tue, 01 Mar 2005 22:19:06 GMT
Server: FlowPort Web Server/FlowPort 2.2.1.88 created 6/3/03 4:07 AM
MIME-version: 1.0
Content-Type: application/xml

<?xml version="1.0" encoding="iso-8859-1"?>
[blah- clip -blah]
<UserInfo>
<UserName>jcabrera</UserName>
<LastName>cabrera</LastName>
<FirstName>josError: 4invalid UTF-8 sequence

So, it's iso-8859-1. Maybe I could make my POST request accept only UTF-8. That could work.

>You're reading that data, into a char[], and then printing it with writef, which finds an invalid UTF-8 character, because the data isn't UTF-8 encoded, it's something else.
>
>Sure, and when you enter that 'é' the program you enter it into has _lots_ of different options as to how to encode it. UTF-8 is the option you need it to take, or, you need to transcode from the option it uses, to UTF-8.
>
>Regan

Again, thanks. |
March 01, 2005 Re: Error: 4invalid UTF-8 sequence | ||||
---|---|---|---|---|
| ||||
Posted in reply to jicman | jicman wrote:
> So, it's iso-8859-1. Maybe I could do my post and accept only UTF-8. That
> could work.
You're in luck then. It's by far the simplest to convert to UTF...
--anders
|
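ISO-8859-1 is the simplest case because each byte value is also the Unicode code point of the character it stands for. A hand-written conversion, sketched here in current D2 with the hypothetical name `latin1ToUtf8`, therefore needs no tables: bytes below 0x80 are copied unchanged, and every other byte becomes a two-byte UTF-8 sequence.

```d
/// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8 by hand.
string latin1ToUtf8(const(ubyte)[] raw)
{
    char[] result;
    result.reserve(raw.length);
    foreach (b; raw)
    {
        if (b < 0x80)
        {
            // ASCII range: identical in ISO-8859-1 and UTF-8.
            result ~= cast(char) b;
        }
        else
        {
            // Code points 0x80..0xFF encode as 110000xx 10xxxxxx.
            result ~= cast(char) (0xC0 | (b >> 6));
            result ~= cast(char) (0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}

unittest
{
    ubyte[] raw = [0x6A, 0x6F, 0x73, 0xE9]; // "josé" in ISO-8859-1
    // 0xE9 becomes the two bytes 0xC3 0xA9.
    assert(latin1ToUtf8(raw) == "jos\u00E9");
}
```

This shortcut is specific to ISO-8859-1. Other 8-bit encodings, including Windows-1252 in the 0x80 to 0x9F range, need a lookup table or a library such as libiconv.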
March 01, 2005 Re: Error: 4invalid UTF-8 sequence | ||||
---|---|---|---|---|
| ||||
Posted in reply to Anders F Björklund | Anders F Björklund says...
>
>jicman wrote:
>
>> So, it's iso-8859-1. Maybe I could do my post and accept only UTF-8. That could work.
>
>You're in luck then. It's by far the simplest to convert to UTF...
>
>--anders

I don't have time right now (time constraint!), but I did come up with this little function, for anyone out there to use, as a quick "print patch":

    import std.ctype;  // provides isascii

    // Returns a copy of name in which every non-ASCII byte is replaced
    // by '+', so the result can always be printed. Nothing is converted:
    // each byte of a multi-byte sequence becomes its own '+'.
    char[] CheckForUTF8(char[] name)
    {
        char[] outStr = null;
        foreach (char c; name)
        {
            if (std.ctype.isascii(c) > 0)
                outStr ~= c;
            else
                outStr ~= "+";
        }
        return outStr;
    }

It replaces each offending character with a + and allows printing. :-) Hey, I didn't say it was pretty. :-) It just allows me to print. So, now the output looks like:

----------------------------------------------------------------
-- 6 Counting forms for jcabrera (cabrera, jos+ isa+as) --
Application : Qty Deleted Left Total
DocumentToken : 2589 0 2589 2596
Distribute : 7 0 7 19
Total Forms : 2596 0 0 2615
2596 forms took 29.1392 sec || Avg forms/sec = 87.989
----------------------------------------------------------------

Pretty, huh? :-)

Thanks for all the help and info.

jic |