invalid utf-8 sequence

Jan 07, 2009

james

Jan 07, 2009

Jarrett Billingsley

On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4@gmail.com> wrote:
> im writing an indexer, but im having a problem because on some file, when i read gives this error
>
> Error 4: invalid UTF-8 sequence
>
> is there a way to fix it.
>

You're probably reading a file that's encoded in some non-Unicode encoding, like Latin-1. You could read in the file data as byte[] instead of as char[], but that still doesn't deal with the problem that you have characters in your file that are outside the ASCII range. If you know what encoding your file uses, you could do some transformations on it to turn it into valid Unicode, or you could just ignore characters outside the ASCII range :P
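As a rough illustration of the "transformations" mentioned above, here is a minimal sketch for the one case where the file is known to be Latin-1 (ISO-8859-1). Latin-1 byte values map directly onto Unicode code points U+0000 to U+00FF, so every byte below 0x80 is copied unchanged and every byte from 0x80 upward becomes a two-byte UTF-8 sequence. The function name `latin1ToUtf8` is hypothetical, and the code deliberately uses no library calls; it is assumed, not verified, to compile under D1/Phobos as well as D2.

```d
/// Hypothetical helper: convert raw Latin-1 (ISO-8859-1) bytes to UTF-8.
/// Assumes the input really is Latin-1; any other encoding yields wrong text.
char[] latin1ToUtf8(ubyte[] data)
{
    char[] result;
    foreach (ubyte b; data)
    {
        if (b < 0x80)
        {
            // ASCII range: identical in Latin-1 and UTF-8.
            result ~= cast(char) b;
        }
        else
        {
            // U+0080 to U+00FF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result;
}
```

The input would come from reading the file as raw bytes rather than as char[], as suggested above. One caveat: many HTML pages labelled Latin-1 are actually Windows-1252, which differs in the 0x80 to 0x9F range, so this sketch would mis-convert characters such as curly quotes in those files.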

james wrote:
> im writing an indexer, but im having a problem because on some file, when i read gives this error
>
> Error 4: invalid UTF-8 sequence
>
> is there a way to fix it.

Probably, but since you've decided not to post your code, nobody can tell you for sure what that way is.

Moreover, what is giving this error - the compiler, or your compiled program?

Stewart.

Jarrett Billingsley Wrote:

> On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4@gmail.com> wrote:
> > im writing an indexer, but im having a problem because on some file, when i read gives this error
> >
> > Error 4: invalid UTF-8 sequence
> >
> > is there a way to fix it.
> >
>
> You're probably reading a file that's encoded in some non-Unicode encoding, like Latin-1. You could read in the file data as byte[] instead of as char[], but that still doesn't deal with the problem that you have characters in your file that are outside the ASCII range. If you know what encoding your file uses, you could do some transformations on it to turn it into valid Unicode, or you could just ignore characters outside the ASCII range :P

is there any library or function that can automatically convert these unknown html charset into UTF-8

On Tue, Jan 6, 2009 at 9:20 PM, james <Jamesg4@gmail.com> wrote:
> Jarrett Billingsley Wrote:
>
>> On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4@gmail.com> wrote:
>> > im writing an indexer, but im having a problem because on some file, when i read gives this error
>> >
>> > Error 4: invalid UTF-8 sequence
>> >
>> > is there a way to fix it.
>> >
>>
>> You're probably reading a file that's encoded in some non-Unicode encoding, like Latin-1. You could read in the file data as byte[] instead of as char[], but that still doesn't deal with the problem that you have characters in your file that are outside the ASCII range. If you know what encoding your file uses, you could do some transformations on it to turn it into valid Unicode, or you could just ignore characters outside the ASCII range :P
>
> is there any library or function that can automatically convert these unknown html charset into UTF-8

Not that I know of, for D anyway.

January 07, 2009

Re: invalid utf-8 sequence

Posted by james
in reply to Jarrett Billingsley


Jarrett Billingsley Wrote:

> On Tue, Jan 6, 2009 at 9:20 PM, james <Jamesg4@gmail.com> wrote:
> > Jarrett Billingsley Wrote:
> >
> >> On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4@gmail.com> wrote:
> >> > im writing an indexer, but im having a problem because on some file, when i read gives this error
> >> >
> >> > Error 4: invalid UTF-8 sequence
> >> >
> >> > is there a way to fix it.
> >> >
> >>
> >> You're probably reading a file that's encoded in some non-Unicode encoding, like Latin-1.  You could read in the file data as byte[] instead of as char[], but that still doesn't deal with the problem that you have characters in your file that are outside the ASCII range.  If you know what encoding your file uses, you could do some transformations on it to turn it into valid Unicode, or you could just ignore characters outside the ASCII range :P
> >
> > is there any library or function that can automatically convert these unknown html charset into UTF-8
> 
> Not that I know of, for D anyway.

i just found out about a function 'UnicodeFile' in tango, but im using D1.0 and phobos, maybe i should write one of my own.

On Tue, Jan 6, 2009 at 10:34 PM, james <Jamesg4@gmail.com> wrote:
>> Not that I know of, for D anyway.
>
> i just found out about a function 'UnicodeFile' in tango, but im using D1.0 and phobos, maybe i should write one of my own.
>

It wouldn't help you anyway. UnicodeFile reads.. uh, Unicode files. Your file is _not_ Unicode.

james wrote:
> Jarrett Billingsley Wrote:
>
>> On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4@gmail.com> wrote:
>>> im writing an indexer, but im having a problem because on some file, when i read gives this error
>>>
>>> Error 4: invalid UTF-8 sequence
>>>
>>> is there a way to fix it.
>>
>> You're probably reading a file that's encoded in some non-Unicode
>> encoding, like Latin-1. You could read in the file data as byte[]
>> instead of as char[], but that still doesn't deal with the problem
>> that you have characters in your file that are outside the ASCII
>> range. If you know what encoding your file uses, you could do some
>> transformations on it to turn it into valid Unicode, or you could just
>> ignore characters outside the ASCII range :P
>
> is there any library or function that can automatically convert these unknown html charset into UTF-8

You mean that tries to work out what character set a file is in and then translates it?

(What is the current state of the art of character set detection heuristics?)

Stewart.
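Full character set detection is a statistical problem and is not attempted here. For an indexer, though, one crude first step is to test whether the bytes are structurally valid UTF-8 and fall back to a single assumed legacy encoding (for example Latin-1) when they are not. The sketch below shows only that first step. The function name `looksLikeUtf8` is hypothetical, the check is simplified, and it is assumed rather than verified to compile under D1/Phobos as well as D2.

```d
/// Hypothetical helper: returns true if `data` has the byte structure of UTF-8.
/// Simplified: it checks lead bytes and continuation bytes only, so it does not
/// reject every overlong three- or four-byte form or encoded surrogates.
bool looksLikeUtf8(ubyte[] data)
{
    size_t i = 0;
    while (i < data.length)
    {
        ubyte b = data[i];
        size_t extra; // number of continuation bytes this lead byte requires

        if (b < 0x80)
            extra = 0;                      // ASCII
        else if (b >= 0xC2 && b <= 0xDF)
            extra = 1;                      // two-byte sequence
        else if (b >= 0xE0 && b <= 0xEF)
            extra = 2;                      // three-byte sequence
        else if (b >= 0xF0 && b <= 0xF4)
            extra = 3;                      // four-byte sequence
        else
            return false;                   // stray continuation or invalid lead byte

        // A sequence cut off by the end of the data is invalid.
        if (i + extra >= data.length)
            return false;

        // Every continuation byte must have the form 10xxxxxx.
        for (size_t k = 1; k <= extra; k++)
        {
            if ((data[i + k] & 0xC0) != 0x80)
                return false;
        }
        i += extra + 1;
    }
    return true;
}
```

A pure-ASCII file passes this check, which is harmless because ASCII is already valid UTF-8. Text in a single-byte encoding that contains accented characters will usually fail it, since a lone high byte followed by an ASCII letter is not a valid sequence. For HTML specifically, a charset declared in the HTTP Content-Type header or in a meta tag is normally a better signal than any guess from the bytes.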
