read files ... continued
May 13, 2007
Jan Hanselaer
Woops ... sent before done writing ... sorry

Hi

I'm writing an application that reads all kinds of text files.
I'm not really familiar with the file types.
For the moment I read them with a BufferedFile.
I read the lines with readLine()

Stream br = new BufferedFile(fileName);
char[] line = br.readLine();

But that causes a lot of trouble. I managed to figure out how to read a file's
BOM, so I presume it will also be possible to convert the files to an encoding
(UTF-8, for example) that I always use. (I'll check that later.)
But for a lot of files, when I check the BOM I get the result -1 (meaning the
encoding is not known).
http://www.digitalmars.com/d/phobos/std_stream.html
The only known BOM types are listed there (UTF8,UTF16,UTF32 LE or BE)

A lot of the text files on my system (Windows) are ANSI, and there's
no problem reading them with BufferedFile as long as they contain no special
characters.
But if there's an accented character (for example 'é'), then it's an
invalid UTF sequence. I cannot convert the text because the BOM for these
files is also unknown.

Does anyone have an idea how to detect these kinds of files (and convert them)? Or is there a stream that takes the file type into account by itself? That would be very handy ...

It's an application I wrote in Java that I'm now porting to D. In Java I used a BufferedReader on a FileReader and everything went well. Sometimes files were not read correctly, but there were no errors like this invalid UTF sequence in D.

If someone understands my problem out of all this confusing talk (that's because I'm rather confused myself) ... I'd be glad :p

Thanks!



May 13, 2007

Jan Hanselaer wrote:
> A lot of the text files on my system (Windows) are ANSI, and there's
> no problem reading them with BufferedFile as long as they contain no special
> characters.
> But if there's an accented character (for example '�'), then it's an
> invalid UTF sequence. I cannot convert the text because the BOM for these
> files is also unknown.
> 
> Does anyone have an idea how to detect these kinds of files (and convert them)? Or is there a stream that takes the file type into account by itself? That would be very handy ...

Basically, the problem is that if the file is in something other than ASCII, UTF-8, UTF-16 or UTF-32, then there's no way for D to work out what it's supposed to be.
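
To make that concrete, here is a minimal sketch of BOM sniffing on the first bytes of a file. The names `Bom`, `hasPrefix` and `detectBom` are made up for this illustration and are not the std.stream API. A file without a BOM (an ANSI file, for instance) falls through to `Bom.none`, which corresponds to the -1 result described above.

```d
// Hypothetical names throughout; this is a sketch, not the std.stream API.
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

/// Returns true if `data` begins with the bytes in `prefix`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Guesses the encoding from a leading byte order mark, if there is one.
Bom detectBom(const(ubyte)[] data)
{
    static immutable ubyte[] markUtf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] markUtf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] markUtf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] markUtf16le = [0xFF, 0xFE];
    static immutable ubyte[] markUtf16be = [0xFE, 0xFF];

    if (hasPrefix(data, markUtf8))    return Bom.utf8;
    // UTF-32 LE must be tested before UTF-16 LE: both start with FF FE.
    if (hasPrefix(data, markUtf32le)) return Bom.utf32le;
    if (hasPrefix(data, markUtf32be)) return Bom.utf32be;
    if (hasPrefix(data, markUtf16le)) return Bom.utf16le;
    if (hasPrefix(data, markUtf16be)) return Bom.utf16be;
    return Bom.none; // no BOM: the bytes alone do not name the encoding
}

unittest
{
    ubyte[] withBom = [0xEF, 0xBB, 0xBF, 0x68, 0x69]; // UTF-8 BOM + "hi"
    ubyte[] ansi    = [0x63, 0x61, 0x66, 0xE9];       // no BOM at all
    assert(detectBom(withBom) == Bom.utf8);
    assert(detectBom(ansi) == Bom.none);
}
```

Note that `Bom.none` only says that no mark was found. Plain ASCII and BOM-less UTF-8 land there too, so it cannot be read as "this file is ANSI".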

There are various methods for autodetecting the codepage of a piece of text, but none of them are foolproof.  Hence why there isn't a stream to do this for you; it's a nasty, horrible problem that no one wants to solve... at least, I know I don't. :)

Incidentally, notice that the example accent you provided didn't show up; presumably because your mail reader doesn't know how to use codepages properly. :3

(Checks headers) Outlook Express—why am I not surprised :P

Anyway, if you need to open files that aren't in a usable encoding, there are a few things you can do:

1. Read the text as ASCII, and discard all characters that lie outside
of the 7-bit range.
2. Add an option somewhere, or perhaps a tag to the file, to indicate
what the code page is.
3. Find and use one of those auto-detection algorithms.

In any case, you'll need a library for converting between codepages.  I *think* that either Tango or Mango has one, but I'm not sure.
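
A rough sketch of options 1 and 2 follows, under some loud assumptions. It uses `std.encoding` and `std.utf` from current Phobos, not the std.stream API from the original question, and it simply assumes that "ANSI" means Windows-1252. That holds for many Western European Windows setups but is not true in general. The function names are invented for the example.

```d
import std.encoding : Windows1252String, transcode;
import std.utf : UTFException, validate;

/// Option 1: keep only the 7-bit ASCII bytes and drop everything else.
string stripNonAscii(const(ubyte)[] raw)
{
    char[] kept;
    foreach (b; raw)
    {
        if (b < 0x80)
            kept ~= cast(char) b;
    }
    return kept.idup;
}

/// Option 2 with a hard-coded guess: accept the bytes if they are valid
/// UTF-8, otherwise treat them as Windows-1252 (an assumption!) and convert.
string decodeAsUtf8OrWindows1252(immutable(ubyte)[] raw)
{
    try
    {
        auto text = cast(string) raw;
        validate(text); // throws UTFException on an invalid sequence
        return text;
    }
    catch (UTFException)
    {
        string converted;
        transcode(cast(Windows1252String) raw, converted);
        return converted;
    }
}

unittest
{
    // "café" with the accented letter stored as the single byte 0xE9
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    assert(stripNonAscii(raw) == "caf");
    assert(decodeAsUtf8OrWindows1252(raw) == "caf\u00E9");
}
```

The fallback is only as good as the guess: text in some other code page will convert without any error and come out as the wrong characters, which is why an explicit option or tag is the safer route.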

<shameless-plug>
Also, if you need more clarification on how text in D works, you can
give this a read: http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
</shameless-plug>

Hope this has been of at least some help.

	-- Daniel

-- 
int getRandomNumber()
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}

http://xkcd.com/

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP  http://hackerkey.com/
May 13, 2007
"Daniel Keep" <daniel.keep.lists@gmail.com> wrote in message news:f26qq0$1d16$1@digitalmars.com...
> Hope this has been of at least some help.

Yes, thanks a lot. At least now I understand it more. But it's a pity there
isn't a stream doing all the work.
It's going to be very difficult to read the different files in a proper way.


May 13, 2007
Jan Hanselaer wrote:

> Yes, thanks a lot. At least now I understand it more. But it's a pity
> there isn't a stream doing all the work.
> It's going to be very difficult to read the different files in a proper
> way.

Currently Mango has bindings for IBM's ICU library, which may be the most comprehensive solution for this type of text handling.

--- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
May 13, 2007
Jan Hanselaer wrote:
> 
> Yes, thanks a lot. At least now I understand it more. But it's a pity there isn't a stream doing all the work.
> It's going to be very difficult to read the different files in a proper way.
> 

Any particular reason why you can't use EndianStream?

-- 
Carlos Santander Bernal