Thread overview
Error: 4invalid UTF-8 sequence :: How can I catch this?? (or otherwise handle it)
Oct 22, 2009
Charles Hixson
Oct 22, 2009
Daniel Keep
Oct 22, 2009
Charles Hixson
Nov 01, 2009
Charles Hixson
Oct 22, 2009
Charles Hixson
October 22, 2009
I want to read a bunch of files, and if they aren't UTF, then I want to list their names for conversion or other processing.  How should this be handled?

try..catch..finally blocks just ignore this error.
October 22, 2009
Charles Hixson wrote:
> I want to read a bunch of files, and if they aren't UTF, then I want to list their names for conversion or other processing.  How should this be handled?
> 
> try..catch..finally blocks just ignore this error.

> type stuff.d
import std.stdio;
import std.utf;

void main()
{
    try
    {
        writefln("A B \xfe C");
    }
    catch( UtfException e )
    {
        writefln("I caught a %s!", e);
    }
}

> dmd stuff && stuff
A B I caught a 4invalid UTF-8 sequence!

Works for me.
October 22, 2009
Daniel Keep wrote:
> Charles Hixson wrote:
>> I want to read a bunch of files, and if they aren't UTF, then I want to
>> list their names for conversion, or other processing.  How should this
>> be handled??
>>
>> try..catch..finally blocks just ignore this error.
> 
>> type stuff.d
> import std.stdio;
> import std.utf;
> 
> void main()
> {
>     try
>     {
>         writefln("A B \xfe C");
>     }
>     catch( UtfException e )
>     {
>         writefln("I caught a %s!", e);
>     }
> }
> 
>> dmd stuff && stuff
> A B I caught a 4invalid UTF-8 sequence!
> 
> Works for me.

Sorry, the error is on the read.  The code I tried to use was:

try
{  lin = fil.readLine;  }
catch
{  writefln ("File <<" ~ filIter [curFilNdx] ~ ">> is not a valid UTF file.");
   fil.close;
   getLine;
   return;
}
finally
{  }
debug (9) writefln ("lin = <<" ~ lin ~ ">>");
try
{  validate (lin);  }
catch (UtfException ue)
{  writefln ("File <<" ~ filIter [curFilNdx] ~ ">> is not a valid UTF file.");
   fil.close;
   getLine;
   return;
}

where fil is a File and getLine is one of my routines that automatically switches to the next file if the current file has been closed.
October 22, 2009
Charles Hixson wrote:
> I want to read a bunch of files, and if they aren't UTF, then I want to list their names for conversion or other processing.  How should this be handled?
> 
> try..catch..finally blocks just ignore this error.
OK.
One approach that occurs to me is to read the data in as a byte stream, break it into lines, and validate each line.  But validate requires an array of chars, which seems to put me right back where I was.  Unless, perhaps, I can cast an array of bytes to an array of chars without it throwing an "Error: 4invalid UTF-8 sequence", and then validate the entire array.  But if I do that, I won't know where the breaks should be, so I might get only half of a legitimate UTF-8 character, and validate would then throw a UtfException even though the file was good.

I'm sure there are ways around that, but it really seems a roundabout way to proceed for something that should be easy.
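
The whole-file idea above can be sketched as follows.  This is only a sketch, and it assumes a current D2 compiler and Phobos, where the exception is spelled UTFException rather than the UtfException used elsewhere in this thread.  The names isValidUtf8 and nonUtfFiles are hypothetical helpers, not library functions.  Casting raw bytes to chars performs no check by itself; only validate inspects the data.  Because the whole file is validated in one piece, no character can be cut in half at a chunk boundary.

```d
import std.file : read;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if the raw bytes form well-formed UTF-8.
bool isValidUtf8(const(void)[] raw)
{
    bool ok = true;
    try
    {
        // The cast only reinterprets the bytes; validate does the checking.
        validate(cast(const(char)[]) raw);
    }
    catch (UTFException e)
    {
        ok = false;
    }
    return ok;
}

/// Hypothetical helper: names of the files that are not valid UTF-8.
/// std.file.read may throw a FileException for unreadable files;
/// that case is deliberately left to the caller here.
string[] nonUtfFiles(string[] paths)
{
    string[] bad;
    foreach (path; paths)
    {
        if (!isValidUtf8(read(path)))
            bad ~= path;
    }
    return bad;
}
```

Reading each file completely is the simplest way to avoid the half-character problem, at the cost of holding the whole file in memory.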

P.S.:  As before, the actual code that throws the error is:

try
{  lin = fil.readLine;  }
catch
{  writefln ("File <<" ~ filIter [curFilNdx] ~ ">> is not a valid UTF file.");
   fil.close;
   getLine;
   return;
}
finally
{  }
debug (9) writefln ("lin = <<" ~ lin ~ ">>");
try
{  validate (lin);  }
catch (UtfException ue)
{  writefln ("File <<" ~ filIter [curFilNdx] ~ ">> is not a valid UTF file.");
   fil.close;
   getLine;
   return;
}

where fil is a File and getLine is one of my routines that automatically switches to the next file if the current file has been closed.
November 01, 2009
Charles Hixson wrote:
> Daniel Keep wrote:
>> Charles Hixson wrote:
>>> I want to read a bunch of files, and if they aren't UTF, then I want to
>>> list their names for conversion, or other processing. How should this
>>> be handled??
>>>
>>> try..catch..finally blocks just ignore this error.
>>
>>> type stuff.d
>> import std.stdio;
>> import std.utf;
>>
>> void main()
>> {
>>     try
>>     {
>>         writefln("A B \xfe C");
>>     }
>>     catch( UtfException e )
>>     {
>>         writefln("I caught a %s!", e);
>>     }
>> }
>>
>>> dmd stuff && stuff
>> A B I caught a 4invalid UTF-8 sequence!
>>
>> Works for me.
>
> Sorry, the error is on the read. The code I tried to use was:
>
> try
> {  lin = fil.readLine;  }
> catch
> {  writefln ("File <<" ~ filIter [curFilNdx] ~ ">> is not a valid UTF file.");
>    fil.close;
>    getLine;
>    return;
> }
> finally
> {  }
> debug (9) writefln ("lin = <<" ~ lin ~ ">>");
> try
> {  validate (lin);  }
> catch (UtfException ue)
> {  writefln ("File <<" ~ filIter [curFilNdx] ~ ">> is not a valid UTF file.");
>    fil.close;
>    getLine;
>    return;
> }
>
> where fil is a File and getLine is one of my routines that automatically
> switches to the next file if the current file has been closed.

For some reason, when I explicitly put (UtfException ue) on the catch statement that I'd been using to try to catch everything (i.e., just a blank catch), it works.

I'm not sure whether I misunderstand how the unlabeled catch works in D, or whether something really strange is going on.  The documentation seems to say that an unlabeled catch statement catches everything, but it doesn't catch the UtfException.  When the UtfException is explicitly listed, it works.  (Admittedly, I altered the code a lot, trying lots of different things, before I tried just using an explicit catch (UtfException ue).)
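
For comparison, here is a small sketch of naming the exception type on every catch clause, assuming a current D2 compiler and Phobos (where the type is spelled UTFException and derives from Exception).  The function checkText is a hypothetical name used only for illustration.  It does not explain why the blank catch failed in the code above; it only shows the explicit form, with the most specific type listed first.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: reports which catch clause handled the input.
string checkText(const(char)[] text)
{
    string result = "valid";
    try
    {
        validate(text);            // throws UTFException on malformed input
    }
    catch (UTFException e)         // most specific type first
    {
        result = "invalid UTF";
    }
    catch (Exception e)            // any other recoverable exception
    {
        result = "other error";
    }
    return result;
}
```

Listing the specific type before the general one keeps the UTF case distinguishable from unrelated failures such as I/O errors.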

What I finally ended up with, which worked, was:
  while (!curFile.eof)
  {  ...
     try
     {  s  =  curFile.readLine;
        std.utf.validate (s);
     }
     catch  (UtfException ue)
     {  writef ("\n  err at <<" ~ fileName ~ ">>line "
                 ~ std.string.toString (line));
        if (++errs > 3)
        {  writefln ("\ntoo many errs");
           break;
        }
        }
     }
  }
with curFile a std.stream.File.  I don't know whether a BufferedFile would have worked.
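
The same line-by-line check can also be sketched over raw bytes, assuming a current D2 compiler and Phobos (UTFException instead of UtfException).  The name invalidLines is hypothetical.  Splitting on the newline byte cannot cut a character in half, because every byte of a multi-byte UTF-8 sequence is 0x80 or higher, while the newline is 0x0A.  The split is done on ubyte data so that nothing tries to decode the text before validate sees it.

```d
import std.algorithm.iteration : splitter;
import std.utf : validate, UTFException;

/// Hypothetical helper: 1-based numbers of the lines that are not
/// well-formed UTF-8.
size_t[] invalidLines(const(ubyte)[] data)
{
    size_t[] bad;
    size_t lineNo = 0;
    // Split on the raw newline byte; no UTF decoding happens here.
    foreach (line; data.splitter(cast(ubyte) '\n'))
    {
        ++lineNo;
        try
        {
            validate(cast(const(char)[]) line);
        }
        catch (UTFException e)
        {
            bad ~= lineNo;
        }
    }
    return bad;
}
```

The bytes could come from std.file.read, cast to const(ubyte)[]; stopping after a few bad lines, as in the loop above, is a small change to the foreach body.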