Thread overview
invalid utf-8 sequence
Jan 07, 2009
james
Jan 07, 2009
james
Jan 07, 2009
james
Jan 07, 2009
Stewart Gordon
Jan 07, 2009
Stewart Gordon
January 07, 2009
I'm writing an indexer, but I'm having a problem because on some files, reading them gives this error:

Error 4: invalid UTF-8 sequence

Is there a way to fix it?
January 07, 2009
On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4@gmail.com> wrote:
> I'm writing an indexer, but I'm having a problem because on some files, reading them gives this error:
>
> Error 4: invalid UTF-8 sequence
>
> Is there a way to fix it?
>

You're probably reading a file that's encoded in some non-Unicode encoding, like Latin-1.  You could read in the file data as byte[] instead of as char[], but that still doesn't deal with the problem that you have characters in your file that are outside the ASCII range.  If you know what encoding your file uses, you could do some transformations on it to turn it into valid Unicode, or you could just ignore characters outside the ASCII range :P
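
A minimal sketch of the two options mentioned above, written in Python rather than D since no code was posted in the thread. The function names are made up for illustration, and the first one assumes the file really is Latin-1, which nothing in the thread confirms.

```python
def latin1_bytes_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes that are assumed to be Latin-1 (ISO-8859-1) as UTF-8."""
    # Every byte value 0x00-0xFF maps to a Latin-1 character, so this decode
    # never fails. Whether the result is meaningful depends entirely on the
    # assumption that the input is Latin-1.
    return raw.decode("latin-1").encode("utf-8")


def drop_non_ascii(raw: bytes) -> bytes:
    """Keep only bytes in the ASCII range (0x00-0x7F), discarding the rest."""
    return bytes(b for b in raw if b < 0x80)


# "café" in Latin-1: the lone 0xE9 byte is what a strict UTF-8 reader rejects.
sample = b"caf\xe9"
converted = latin1_bytes_to_utf8(sample)
stripped = drop_non_ascii(sample)
```

Dropping bytes loses information (the accented letter disappears), so for an indexer the conversion route is usually the better one when the encoding is known.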
January 07, 2009
james wrote:
> I'm writing an indexer, but I'm having a problem because on some files, reading them gives this error:
>
> Error 4: invalid UTF-8 sequence
>
> Is there a way to fix it?

Probably, but since you've decided not to post your code, nobody can tell you for sure what that way is.

Moreover, what is giving this error - the compiler, or your compiled program?

Stewart.
January 07, 2009
Jarrett Billingsley Wrote:

> On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4@gmail.com> wrote:
> > I'm writing an indexer, but I'm having a problem because on some files, reading them gives this error:
> >
> > Error 4: invalid UTF-8 sequence
> >
> > Is there a way to fix it?
> >
> 
> You're probably reading a file that's encoded in some non-Unicode encoding, like Latin-1.  You could read in the file data as byte[] instead of as char[], but that still doesn't deal with the problem that you have characters in your file that are outside the ASCII range.  If you know what encoding your file uses, you could do some transformations on it to turn it into valid Unicode, or you could just ignore characters outside the ASCII range :P

Is there any library or function that can automatically convert these unknown HTML charsets into UTF-8?

January 07, 2009
On Tue, Jan 6, 2009 at 9:20 PM, james <Jamesg4@gmail.com> wrote:
> Jarrett Billingsley Wrote:
>
>> On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4@gmail.com> wrote:
>> > I'm writing an indexer, but I'm having a problem because on some files, reading them gives this error:
>> >
>> > Error 4: invalid UTF-8 sequence
>> >
>> > Is there a way to fix it?
>> >
>>
>> You're probably reading a file that's encoded in some non-Unicode encoding, like Latin-1.  You could read in the file data as byte[] instead of as char[], but that still doesn't deal with the problem that you have characters in your file that are outside the ASCII range.  If you know what encoding your file uses, you could do some transformations on it to turn it into valid Unicode, or you could just ignore characters outside the ASCII range :P
>
> Is there any library or function that can automatically convert these unknown HTML charsets into UTF-8?

Not that I know of, for D anyway.
January 07, 2009
Jarrett Billingsley Wrote:

> On Tue, Jan 6, 2009 at 9:20 PM, james <Jamesg4@gmail.com> wrote:
> > Jarrett Billingsley Wrote:
> >
> >> On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4@gmail.com> wrote:
> >> > I'm writing an indexer, but I'm having a problem because on some files, reading them gives this error:
> >> >
> >> > Error 4: invalid UTF-8 sequence
> >> >
> >> > Is there a way to fix it?
> >> >
> >>
> >> You're probably reading a file that's encoded in some non-Unicode encoding, like Latin-1.  You could read in the file data as byte[] instead of as char[], but that still doesn't deal with the problem that you have characters in your file that are outside the ASCII range.  If you know what encoding your file uses, you could do some transformations on it to turn it into valid Unicode, or you could just ignore characters outside the ASCII range :P
> >
> > Is there any library or function that can automatically convert these unknown HTML charsets into UTF-8?
> 
> Not that I know of, for D anyway.

I just found out about a function 'UnicodeFile' in Tango, but I'm using D 1.0 and Phobos, so maybe I should write one of my own.
January 07, 2009
On Tue, Jan 6, 2009 at 10:34 PM, james <Jamesg4@gmail.com> wrote:
>> Not that I know of, for D anyway.
>
> I just found out about a function 'UnicodeFile' in Tango, but I'm using D 1.0 and Phobos, so maybe I should write one of my own.
>

It wouldn't help you anyway.  UnicodeFile reads.. uh, Unicode files. Your file is _not_ Unicode.
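
To check whether a given file is Unicode at all, something like the following could be used. This is only a sketch in Python, not a description of how Tango's UnicodeFile works: it looks for a byte order mark and otherwise tries a strict UTF-8 decode. The name `sniff_unicode` is hypothetical.

```python
import codecs

# The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 ones,
# because the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_unicode(raw: bytes):
    """Return a codec name if raw looks like Unicode text, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    try:
        raw.decode("utf-8")  # strict error handling is the default
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

A `None` result matches the situation in this thread: the bytes are in some legacy encoding, so a Unicode-only reader cannot help.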
January 07, 2009
james wrote:
> Jarrett Billingsley Wrote:
> 
>> On Tue, Jan 6, 2009 at 8:04 PM, james <Jamesg4@gmail.com> wrote:
>>> I'm writing an indexer, but I'm having a problem because on some files, reading them gives this error:
>>>
>>> Error 4: invalid UTF-8 sequence
>>>
>>> Is there a way to fix it?
>>
>> You're probably reading a file that's encoded in some non-Unicode
>> encoding, like Latin-1.  You could read in the file data as byte[]
>> instead of as char[], but that still doesn't deal with the problem
>> that you have characters in your file that are outside the ASCII
>> range.  If you know what encoding your file uses, you could do some
>> transformations on it to turn it into valid Unicode, or you could just
>> ignore characters outside the ASCII range :P
> 
> Is there any library or function that can automatically convert these unknown HTML charsets into UTF-8?

You mean that tries to work out what character set a file is in and then translates it?

(What is the current state of the art of character set detection heuristics?)

Stewart.
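
For HTML input specifically, a very simple heuristic can be sketched as follows (in Python, as an illustration only). It assumes the page may declare its charset in a `<meta>` tag, tries that first, then strict UTF-8, and finally falls back to Windows-1252. The function name and the choice of fallback are assumptions, not anything proposed in the thread, and real detectors go further by using byte-frequency statistics, which this does not attempt.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def guess_and_decode(raw: bytes, fallback: str = "cp1252"):
    """Return (text, encoding_used), chosen by a few simple heuristics."""
    candidates = []
    match = _META_CHARSET.search(raw[:4096])  # declarations sit near the top
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")

    for name in candidates:
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or the bytes do not fit it: try the next one.
            continue

    # Last resort: never fails, but the result may contain wrong characters.
    return raw.decode(fallback, errors="replace"), fallback
```

The declared charset is only a hint, since pages are sometimes mislabeled, which is why a decode failure moves on to the next candidate instead of raising.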