Reading ASCII file with some codes above 127 (exten ascii) (page 2)

May 23, 2012

Re: Reading ASCII file with some codes above 127 (exten ascii)

Posted by Graham Fawcett
in reply to Paul

On Wednesday, 23 May 2012 at 19:09:29 UTC, Paul wrote:
> On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
>> On Wednesday, 23 May 2012 at 18:43:04 UTC, Paul wrote:
>>> On Wednesday, 23 May 2012 at 18:04:56 UTC, Graham Fawcett wrote:
>>>> On Wednesday, 23 May 2012 at 15:48:20 UTC, Paul wrote:
>>>>> On Monday, 14 May 2012 at 12:58:20 UTC, Graham Fawcett wrote:
>>>>>> On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
>>>>>>> I am reading a file that has a few extended ASCII codes (e.g. degree symbol). Depending on how I read the file in and what I do with it, the error shows up at different points.  I'm pretty sure it all boils down to these extended ASCII codes.
>>>>>>>
>>>>>>> Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file?
>>>>>>> I've messed with the std.encoding module but really can't figure out what I need to do.
>>>>>>>
>>>>>>> There must be a simple solution to this.
>>>>>>
>>>>>> This seems to work:
>>>>>>
>>>>>>
>>>>>> import std.stdio, std.file, std.encoding;
>>>>>>
>>>>>> void main()
>>>>>> {
>>>>>>     auto latin = cast(Latin1String) read("/tmp/hi.8859");
>>>>>>     string s;
>>>>>>     transcode(latin, s);
>>>>>>     writeln(s);
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Graham
>>>>>
>>>>> I thought I was in good shape with your above suggestion.  It does help me read and process text.  But when I go to print it out I have problems.
>>>>>
>>>>> Here is my input file:
>>>>> °F
>>>>>
>>>>> Here is my code:
>>>>> import std.stdio;
>>>>> import std.string;
>>>>> import std.file;
>>>>> import std.encoding;
>>>>>
>>>>> // Main function
>>>>> void main(){
>>>>>     auto fout = File("out.txt","w");
>>>>>     auto latinS = cast(Latin1String) read("in.txt");
>>>>>     string uniS;
>>>>>     transcode(latinS, uniS);
>>>>>     foreach(line; uniS.splitLines()){
>>>>>         transcode(line, latinS);
>>>>>         fout.writeln(line);
>>>>>         fout.writeln(latinS);
>>>>>     }
>>>>> }
>>>>>
>>>>> Here is the output:
>>>>> Â°F
>>>>> [cast(immutable(Latin1Char))176, cast(immutable(Latin1Char))70]
>>>>>
>>>>> If I print the Unicode string I get an extra weird character.
>>>>> If I print the Unicode string retranslated to Latin1, I get weird pseudo-code.
>>>>> Can you help?
>>>>
>>>> I tried the program and it seemed to work for me.
>>>>
>>>> What program are you using to read "out.txt"? Are you sure it supports UTF-8, and knows to open the file as UTF-8? (This looks suspiciously like a tool's attempt to misinterpret a UTF-8 string as Latin-1.)
>>>>
>>>> If you're on a Unix system, what does "file in.txt out.txt" report?
>>>>
>>>> Graham
>>>
>>> Hmmm.  I'm not communicating well.
>>> I want to read and write ASCII.  The only reason I'm converting to Unicode is because D needs it (as I understand).
>>>
>>> Yes if I open Â°F in notepad++ and tell notepad++ that it is UTF-8, it shows °F.
>>>
>>> I want to:
>>> 1) Read an ascii file that may have codes above 127.
>>> 2) Convert to unicode so D funcs like .splitLines() can work with it.
>>> 3) Convert back to ascii so that stuff like °F writes out as it was read in.
>>>
>>> If I open in.txt and out.txt in an ascii editor, °F should look the same in both files with the editor encoding the files as ANSI/ASCII.  I thought my program was doing just that.
>>> Thanks for your assistance.
>>
>> To make sure we're on the same page -- ASCII is a 7-bit encoding, and any character above 127 is by definition not an ASCII character. At that point we're talking about an encoding other than ASCII, such as UTF-8 or Latin-1.
>>
>> If you're reading a file that has bytes > 127, you really have no choice but to specify (assume?) an encoding, Latin-1 for example. There's no guarantee your input file is Latin-1, though, and garbage-in will result in garbage-out.
>>
>> So I think what you're trying to do is
>>
>> 1. read a Latin-1 file, into unicode (internally in D)
>> 2. do splitLines(), etc., generating some result
>> 3. Convert the result back to latin-1, and output it.
>>
>> Is that right?
>> Graham
>
> Exactly.

This works, though it's ugly:


    foreach(line; uniS.splitLines()) {
       transcode(line, latinS);
       fout.writeln((cast(char[]) latinS));
    }

The Latin1String type, at the storage level, is a ubyte[]. By casting to char[], you can get a similar-to-string thing that writeln() can handle.

Graham
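To illustrate the storage-level point, here is a minimal sketch of the round trip at the byte level. The helper names fromLatin1 and toLatin1 are made up for this example; only transcode and Latin1String come from std.encoding. It also shows why out.txt looked like "Â°F" when viewed as Latin-1: the degree sign is one byte (0xB0) in Latin-1 but two bytes (0xC2 0xB0) in UTF-8.

```d
import std.encoding : Latin1String, transcode;
import std.stdio : writefln;

// Hypothetical helper: decode Latin-1 bytes into a UTF-8 string.
string fromLatin1(immutable(ubyte)[] raw)
{
    string utf8;
    transcode(cast(Latin1String) raw, utf8);
    return utf8;
}

// Hypothetical helper: encode a UTF-8 string as Latin-1 bytes.
immutable(ubyte)[] toLatin1(string utf8)
{
    Latin1String latin;
    transcode(utf8, latin);
    return cast(immutable(ubyte)[]) latin;
}

void main()
{
    // "°F" as stored in a Latin-1 file: 0xB0 is the degree sign, 0x46 is 'F'.
    immutable(ubyte)[] raw = [0xB0, 0x46];

    // After decoding, the string holds three UTF-8 code units, not two.
    string utf8 = fromLatin1(raw);
    writefln("UTF-8 bytes:   %(%02X %)", cast(immutable(ubyte)[]) utf8);

    // Encoding back to Latin-1 restores the original two bytes.
    writefln("Latin-1 bytes: %(%02X %)", toLatin1(utf8));
}
```

Writing the UTF-8 string directly to a file and then opening that file as Latin-1/ANSI shows the 0xC2 byte as "Â", which matches the output reported earlier in the thread.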

> The safest way is probably to read it as binary data (i.e. byte[]), then
> do the conversion into UTF8, then process it, and finally convert it
> back to latin-1 (in binary form) and output it.
>
> D assumes Unicode internally; if you try to read a Latin-1 file as
> char[], you may be running into some implicit UTF conversions that are
> corrupting the data. Best use byte[] for reading/writing, and do
> conversions to/from UTF-8 internally for processing.
>
>
> T

You mean something like Era has done in the first reply? If that is so, I have to say I'm really surprised. Writing D so that it natively expects and outputs Unicode is one thing, but not providing a clean, simple way to read extended ASCII chars (i.e. Latin1) and write them back out seems like an oversight. I think I (actually Graham) am close. Thanks for your feedback, HS.

> This works, though it's ugly:
>
>
>     foreach(line; uniS.splitLines()) {
>        transcode(line, latinS);
>        fout.writeln((cast(char[]) latinS));
>     }
>
> The Latin1String type, at the storage level, is a ubyte[]. By casting to char[], you can get a similar-to-string thing that writeln() can handle.
>
> Graham

Awesome! What a lesson! Thank you!

So if anyone is following this thread, here's my code now. This reads a text file (encoded in Latin1, which is basic ASCII plus the extended ASCII codes), allows D to work with it in Unicode, and then spits it back out as Latin1.

I wonder about the speed between this method and Era's home-spun solution?

    import std.stdio;
    import std.string;
    import std.file;
    import std.encoding;

    // Main function
    void main(){
        auto fout = File("out.txt","w");
        auto latinS = cast(Latin1String) read("in.txt");
        string uniS;
        transcode(latinS, uniS);
        foreach(line; uniS.splitLines()){
            transcode(line, latinS);
            fout.writeln((cast(char[]) latinS));
        }
    }

On Wednesday, 23 May 2012 at 21:02:27 UTC, Paul wrote:
> I wonder about the speed between this method and Era's home-spun solution?

My solution may have a flaw in its lookup table, namely if I got one of the codes wrong. I used regex and a site to reference them all, so I hope it's right. I can't remember for sure, but I think it was from http://www.alanwood.net/demos/ansi.html

The main reason I wrote it was that there was no good explanation, in the documentation or anywhere else, of how to use std.encoding and transcode. This meant I was stuck and needed some simple solution.

I'm not sure if my solution is going to be faster, but it does do minimal object allocation/resizing/abstraction, and tries not to make a new string if it doesn't have to.

Who knows? Perhaps it will be added to phobos once the table is verified.

On Wed, 23 May 2012 22:02:25 +0100, Paul <phshaffer@gmail.com> wrote:
>> This works, though it's ugly:
>>
>>
>>     foreach(line; uniS.splitLines()) {
>>        transcode(line, latinS);
>>        fout.writeln((cast(char[]) latinS));
>>     }
>>
>> The Latin1String type, at the storage level, is a ubyte[]. By casting to char[], you can get a similar-to-string thing that writeln() can handle.
>>
>> Graham
>
> Awesome! What a lesson! Thank you!
>
> So if anyone is following this thread, here's my code now. This reads a text file (encoded in Latin1, which is basic ASCII plus the extended ASCII codes), allows D to work with it in Unicode, and then spits it back out as Latin1.
>
> I wonder about the speed between this method and Era's home-spun solution?
>
>     import std.stdio;
>     import std.string;
>     import std.file;
>     import std.encoding;
>
>     // Main function
>     void main(){
>         auto fout = File("out.txt","w");
>         auto latinS = cast(Latin1String) read("in.txt");
>         string uniS;
>         transcode(latinS, uniS);
>         foreach(line; uniS.splitLines()){
>             transcode(line, latinS);
>             fout.writeln((cast(char[]) latinS));
>         }
>     }

The only thing which would worry me about this code is the cast(char[]) in the final writeln. I know some parts of phobos verify that char data is correct UTF-8, and this line casts latin-1 to char[], which can potentially create invalid UTF-8 data.

That said, I had a really quick look at the phobos code for File.writeln and I'm not sure whether this function does any UTF-8 validation.

I would be happier if the latin-1 was written as a stream of bytes with no assumed interpretation, IMO.

R
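One way to address the concern about cast(char[]) is sketched below: write the Latin-1 data with File.rawWrite, which takes a buffer of bytes and does not treat it as text. This is only a sketch, not the code anyone in the thread posted. The function name copyLatin1Lines and the sample file contents are made up for the example, and every line ending is written as a single "\n", which may differ from the input file.

```d
import std.encoding : Latin1String, transcode;
import std.file : read, write;
import std.stdio : File;
import std.string : splitLines;

// Hypothetical helper: read a Latin-1 file, process it as UTF-8 line by
// line, and write each line back out as raw Latin-1 bytes.
void copyLatin1Lines(string inPath, string outPath)
{
    auto fout = File(outPath, "wb");

    // Latin-1 bytes -> UTF-8 string, so splitLines() and friends work.
    string uniS;
    transcode(cast(Latin1String) read(inPath), uniS);

    foreach (line; uniS.splitLines())
    {
        // UTF-8 -> Latin-1 for this line.
        Latin1String latinS;
        transcode(line, latinS);

        // Write the bytes as-is; no char[] cast, no UTF-8 assumption.
        fout.rawWrite(cast(const(ubyte)[]) latinS);
        fout.rawWrite("\n");
    }
}

void main()
{
    // Sample input (an assumption for this demo): "°F\n" in Latin-1.
    immutable(ubyte)[] sample = [0xB0, 0x46, 0x0A];
    write("in.txt", sample);

    copyLatin1Lines("in.txt", "out.txt");
}
```

With this approach out.txt should contain the same bytes as in.txt for the sample above, so an editor set to ANSI/Latin-1 shows °F in both files.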

On Thursday, 24 May 2012 at 19:47:06 UTC, era scarecrow wrote:
> On Wednesday, 23 May 2012 at 21:02:27 UTC, Paul wrote:
>> I wonder about the speed between this method and Era's home-spun solution?
> Who knows? Perhaps it will be added to phobos once the table is verified.

Well, after taking to heart the suggestion of a GC-less solution and of doing an inputRange, I rewrote the entire thing. Of course, to make it even faster and simpler, a full lookup table conversion is used instead. Further reduction has made a very tiny, simple filter. Curiously, looking at it again, there are actually very few codes that really require special attention.

If there's still any interest in this I can release it.
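Era's code is not shown in this thread, so the following is not that solution. It is only a sketch of how small a hand-written converter can be when the input really is ISO 8859-1: in that encoding every byte value equals its Unicode code point, so no table is needed at all. The function name latin1ToUtf8 is made up for this example. If the file is actually Windows-1252 ("ANSI"), the bytes 0x80 to 0x9F map to different code points and would need a small table instead.

```d
import std.stdio : writeln;

// Hypothetical helper: convert ISO 8859-1 bytes to a UTF-8 string by hand.
// Assumes the input is Latin-1, where byte value == Unicode code point.
string latin1ToUtf8(const(ubyte)[] raw)
{
    char[] result;
    result.reserve(raw.length);

    foreach (b; raw)
    {
        if (b < 0x80)
        {
            // Plain ASCII is identical in Latin-1 and UTF-8.
            result ~= cast(char) b;
        }
        else
        {
            // Code points 0x80..0xFF take exactly two bytes in UTF-8:
            // 110000xx 10xxxxxx
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}

void main()
{
    // 0xB0 0x46 is "°F" in Latin-1; the result is printed as UTF-8.
    writeln(latin1ToUtf8([0xB0, 0x46]));
}
```

This version allocates a new string for every call; a range-based variant like the one Era describes could avoid that, at the cost of a little more code.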
