May 23, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Paul | On Wednesday, 23 May 2012 at 19:09:29 UTC, Paul wrote:
> On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
>> On Wednesday, 23 May 2012 at 18:43:04 UTC, Paul wrote:
>>> On Wednesday, 23 May 2012 at 18:04:56 UTC, Graham Fawcett wrote:
>>>> On Wednesday, 23 May 2012 at 15:48:20 UTC, Paul wrote:
>>>>> On Monday, 14 May 2012 at 12:58:20 UTC, Graham Fawcett wrote:
>>>>>> On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
>>>>>>> I am reading a file that has a few extended ASCII codes (e.g. the degree symbol). Depending on how I read the file in and what I do with it, the error shows up at different points. I'm pretty sure it all boils down to these extended ASCII codes.
>>>>>>>
>>>>>>> Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file?
>>>>>>> I've messed with the std.encoding module but really can't figure out what I need to do.
>>>>>>>
>>>>>>> There must be a simple solution to this.
>>>>>>
>>>>>> This seems to work:
>>>>>>
>>>>>>
>>>>>> import std.stdio, std.file, std.encoding;
>>>>>>
>>>>>> void main()
>>>>>> {
>>>>>>     auto latin = cast(Latin1String) read("/tmp/hi.8859");
>>>>>>     string s;
>>>>>>     transcode(latin, s);
>>>>>>     writeln(s);
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Graham
>>>>>
>>>>> I thought I was in good shape with your above suggestion. It does help me read and process text. But when I go to print it out I have problems.
>>>>>
>>>>> Here is my input file:
>>>>> °F
>>>>>
>>>>> Here is my code:
>>>>> import std.stdio;
>>>>> import std.string;
>>>>> import std.file;
>>>>> import std.encoding;
>>>>>
>>>>> // Main function
>>>>> void main(){
>>>>>     auto fout = File("out.txt","w");
>>>>>     auto latinS = cast(Latin1String) read("in.txt");
>>>>>     string uniS;
>>>>>     transcode(latinS, uniS);
>>>>>     foreach(line; uniS.splitLines()){
>>>>>         transcode(line, latinS);
>>>>>         fout.writeln(line);
>>>>>         fout.writeln(latinS);
>>>>>     }
>>>>> }
>>>>>
>>>>> Here is the output:
>>>>> °F
>>>>> [cast(immutable(Latin1Char))176, cast(immutable(Latin1Char))70]
>>>>>
>>>>> If I print the Unicode string I get an extra weird character.
>>>>> If I print the Unicode string retranslated to Latin1, I get weird pseudo-code.
>>>>> Can you help?
>>>>
>>>> I tried the program and it seemed to work for me.
>>>>
>>>> What program are you using to read "out.txt"? Are you sure it supports UTF-8, and knows to open the file as UTF-8? (This looks suspiciously like a tool's attempt to misinterpret a UTF-8 string as Latin-1.)
>>>>
>>>> If you're on a Unix system, what does "file in.txt out.txt" report?
>>>>
>>>> Graham
>>>
>>> Hmmm. I'm not communicating well.
>>> I want to read and write ASCII. The only reason I'm converting to Unicode is because D needs it (as I understand).
>>>
>>> Yes if I open °F in notepad++ and tell notepad++ that it is UTF-8, it shows °F.
>>>
>>> I want to:
>>> 1) Read an ascii file that may have codes above 127.
>>> 2) Convert to unicode so D funcs like .splitLines() can work with it.
>>> 3) Convert back to ascii so that stuff like °F writes out as it was read in.
>>>
>>> If I open in.txt and out.txt in an ascii editor, °F should look the same in both files with the editor encoding the files as ANSI/ASCII. I thought my program was doing just that.
>>> Thanks for your assistance.
>>
>> To make sure we're on the same page -- ASCII is a 7-bit encoding, and any character above 127 is by definition not an ASCII character. At that point we're talking about an encoding other than ASCII, such as UTF-8 or Latin-1.
>>
>> If you're reading a file that has bytes > 127, you really have no choice but to specify (assume?) an encoding, Latin-1 for example. There's no guarantee your input file is Latin-1, though, and garbage-in will result in garbage-out.
>>
>> So I think what you're trying to do is
>>
>> 1. read a Latin-1 file, into unicode (internally in D)
>> 2. do splitLines(), etc., generating some result
>> 3. Convert the result back to latin-1, and output it.
>>
>> Is that right?
>> Graham
>
> Exactly.
This works, though it's ugly:
foreach(line; uniS.splitLines()) {
    transcode(line, latinS);
    fout.writeln((cast(char[]) latinS));
}
The Latin1String type, at the storage level, is a ubyte[]. By casting to char[], you can get a similar-to-string thing that writeln() can handle.
Graham
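For anyone puzzled by the "extra weird character" earlier in the thread, a small sketch of what happens at the byte level may help. It assumes the input really is Latin-1, and `latin1ToUtf8` and `utf8ToLatin1` are hypothetical helper names, not part of Phobos. The degree sign is one byte (0xB0) in Latin-1 but two bytes (0xC2 0xB0) in UTF-8, and an editor that reads those two bytes as Latin-1 shows them as "Â°".

```d
import std.encoding : Latin1String, transcode;
import std.string : representation;

// Hypothetical helper: decode Latin-1 bytes into a UTF-8 string.
string latin1ToUtf8(immutable(ubyte)[] raw)
{
    string result;
    transcode(cast(Latin1String) raw, result);
    return result;
}

// Hypothetical helper: encode a UTF-8 string as Latin-1 bytes.
immutable(ubyte)[] utf8ToLatin1(string text)
{
    Latin1String latin;
    transcode(text, latin);
    return cast(immutable(ubyte)[]) latin;
}

void main()
{
    // Degree sign followed by 'F', as stored in a Latin-1 file.
    immutable(ubyte)[] raw = [0xB0, 0x46];

    // In UTF-8 the degree sign takes two bytes.
    immutable(ubyte)[] expectedUtf8 = [0xC2, 0xB0, 0x46];
    string text = latin1ToUtf8(raw);
    assert(text.representation == expectedUtf8);

    // Converting back restores the original two bytes.
    assert(utf8ToLatin1(text) == raw);
}
```

So the UTF-8 string inside the program is fine. The trouble only starts when those UTF-8 bytes are written to a file that is later viewed as Latin-1.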
|
May 23, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii) | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh |
> The safest way is probably to read it as binary data (i.e. byte[]), then
> do the conversion into UTF8, then process it, and finally convert it
> back to latin-1 (in binary form) and output it.
>
> D assumes Unicode internally; if you try to read a Latin-1 file as
> char[], you may be running into some implicit UTF conversions that are
> corrupting the data. Best use byte[] for reading/writing, and do
> conversions to/from UTF-8 internally for processing.
>
>
> T
You mean something like Era has done in the first reply?
If that is so, I have to say I'm really surprised. Writing D so it natively expects and outputs Unicode is one thing, but not providing a clean, simple way to read extended ASCII chars (i.e. Latin1) and write them back out seems like an oversight.
I think I am (actually, Graham is) close.
Thanks for your feedback HS.
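A rough sketch of the approach quoted above (read bytes, convert to UTF-8, process, convert back, write bytes) might look like the following. It assumes the input is Latin-1. `process` and `convertFile` are hypothetical names for illustration, and the file paths come from the command line.

```d
import std.array : join;
import std.encoding : Latin1String, transcode;
import std.file : read, write;
import std.string : splitLines;

// Hypothetical processing step, working on an ordinary UTF-8 string.
// Here it only normalises line endings to "\n".
string process(string text)
{
    return text.splitLines().join("\n") ~ "\n";
}

// Hypothetical helper: Latin-1 file in, Latin-1 file out.
void convertFile(string inPath, string outPath)
{
    // Raw bytes, interpreted as Latin-1 and converted to UTF-8.
    auto latinIn = cast(Latin1String) read(inPath);
    string text;
    transcode(latinIn, text);

    // Back to Latin-1, written as plain bytes with no text-mode handling.
    Latin1String latinOut;
    transcode(process(text), latinOut);
    write(outPath, latinOut);
}

void main(string[] args)
{
    if (args.length == 3)
        convertFile(args[1], args[2]);
}
```

Because std.file.write takes a block of bytes, nothing in the output path treats the Latin-1 data as UTF-8.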
|
May 23, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Graham Fawcett | >
> This works, though it's ugly:
>
>
> foreach(line; uniS.splitLines()) {
>     transcode(line, latinS);
>     fout.writeln((cast(char[]) latinS));
> }
>
> The Latin1String type, at the storage level, is a ubyte[]. By casting to char[], you can get a similar-to-string thing that writeln() can handle.
>
> Graham
Awesome! What a lesson! Thank you!
So if anyone is following this thread, here's my code now. This reads a text file (encoded in Latin1, which is basic ASCII plus the extended codes above 127), allows D to work with it in Unicode, and then spits it back out as Latin1.
I wonder how the speed of this method compares with Era's home-spun solution?
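import std.stdio;
import std.string;
import std.file;
import std.encoding;

// Main function
void main(){
    auto fout = File("out.txt","w");
    // Read the raw bytes and treat them as Latin1.
    auto latinS = cast(Latin1String) read("in.txt");
    // Convert to UTF-8 so D's string functions can work on the text.
    string uniS;
    transcode(latinS, uniS);
    foreach(line; uniS.splitLines()){
        // Convert each line back to Latin1 before writing it out.
        transcode(line, latinS);
        fout.writeln((cast(char[]) latinS));
    }
}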
|
May 24, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Paul | On Wednesday, 23 May 2012 at 21:02:27 UTC, Paul wrote:
> I wonder about the speed between this method and Era's home-spun solution?
My solution may have a flaw in its lookup table, namely if I got one of the codes wrong. I used regex and a site to reference them all, so I hope it's right. I can't remember, but I think it was from http://www.alanwood.net/demos/ansi.html
The main reason I wrote it was that there was no good explanation, in the documentation or anywhere else, of how to use std.encoding and transcode. This meant I was stuck and needed some simple solution.
I'm not sure if my solution is going to be faster, but it does do minimal object allocation/resizing/abstraction, and tries not to make a new string if it doesn't have to.
Who knows? Perhaps it will be added to phobos once the table is verified.
|
May 25, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Paul | On Wed, 23 May 2012 22:02:25 +0100, Paul <phshaffer@gmail.com> wrote:
>> This works, though it's ugly:
>>
>> foreach(line; uniS.splitLines()) {
>>     transcode(line, latinS);
>>     fout.writeln((cast(char[]) latinS));
>> }
>>
>> The Latin1String type, at the storage level, is a ubyte[]. By casting to char[], you can get a similar-to-string thing that writeln() can handle.
>>
>> Graham
>
> Awesome! What a lesson! Thank you!
>
> So if anyone is following this thread, here's my code now. This reads a text file (encoded in Latin1, which is basic ASCII plus the extended codes above 127), allows D to work with it in Unicode, and then spits it back out as Latin1.
>
> I wonder about the speed between this method and Era's home-spun solution?
>
> import std.stdio;
> import std.string;
> import std.file;
> import std.encoding;
>
> // Main function
> void main(){
>     auto fout = File("out.txt","w");
>     auto latinS = cast(Latin1String) read("in.txt");
>     string uniS;
>     transcode(latinS, uniS);
>     foreach(line; uniS.splitLines()){
>         transcode(line, latinS);
>         fout.writeln((cast(char[]) latinS));
>     }
> }
The only thing which would worry me about this code is the cast(char[]) in the final writeln. I know some parts of Phobos verify that char data is correct UTF-8, and this line casts Latin-1 to char[], which can potentially create invalid UTF-8 data.
That said, I had a really quick look at the Phobos code for File.writeln and I'm not sure whether this function does any UTF-8 validation.
I would be happier if the Latin-1 was written as a stream of bytes with no assumed interpretation, IMO.
R
--
Using Opera's revolutionary email client: http://www.opera.com/mail/
|
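One way to address the concern about cast(char[]) is to keep the per-line loop but hand the Latin-1 bytes to File.rawWrite, which writes a buffer as-is. This is only a sketch under the assumption that the input is Latin-1 and that "\n" line endings are acceptable in the output. `latin1Line` is a hypothetical helper name.

```d
import std.encoding : Latin1String, transcode;
import std.file : read;
import std.stdio : File;
import std.string : splitLines;

// Hypothetical helper: one output line as Latin-1 bytes, newline included.
immutable(ubyte)[] latin1Line(string line)
{
    Latin1String latin;
    transcode(line ~ "\n", latin);
    return cast(immutable(ubyte)[]) latin;
}

void main(string[] args)
{
    if (args.length != 3)
        return;

    // Read Latin-1 bytes and convert them to UTF-8 for processing.
    auto latinIn = cast(Latin1String) read(args[1]);
    string text;
    transcode(latinIn, text);

    // Write each line back as raw bytes; no char[] cast is involved.
    auto fout = File(args[2], "wb");
    foreach (line; text.splitLines())
        fout.rawWrite(latin1Line(line));
}
```

The data never passes through a char[] that claims to be UTF-8, so any validation writeln might or might not do is no longer a factor.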
May 16, 2016 Re: Reading ASCII file with some codes above 127 (exten ascii) | ||||
---|---|---|---|---|
| ||||
Posted in reply to era scarecrow | On Thursday, 24 May 2012 at 19:47:06 UTC, era scarecrow wrote:
> On Wednesday, 23 May 2012 at 21:02:27 UTC, Paul wrote:
>> I wonder about the speed between this method and Era's home-spun solution?
> Who knows? Perhaps it will be added to phobos once the table is verified.
Well, after taking to heart the idea of a GC-less solution and making it an input range, I rewrote the entire thing. Of course, to make it even faster and simpler, a full lookup table conversion is used instead. Further reduction has turned it into a very tiny, simple filter.
Curiously, looking at it again, there are actually very few codes that really require special attention. If there's still any interest in this, I can release it.
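The rewritten code was not posted, so the following is not it, only a guess at the general shape of a lazy filter for the plain Latin-1 case. Latin-1 byte values coincide with the first 256 Unicode code points, so decoding is just a widening cast and needs no table at all. `latin1Chars` is a hypothetical name. A Windows "ANSI" code page would need a small table for some of the codes, which this sketch does not attempt.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : map;

// Hypothetical helper: lazily view Latin-1 bytes as Unicode code points.
// Nothing is copied; each byte is widened to a dchar on demand.
auto latin1Chars(const(ubyte)[] bytes)
{
    return bytes.map!(b => cast(dchar) b);
}

void main()
{
    // Degree sign followed by 'F', as stored in a Latin-1 file.
    immutable(ubyte)[] raw = [0xB0, 0x46];
    assert(latin1Chars(raw).equal("\u00B0F"));
}
```

The result is an ordinary range of dchar, so it can be fed to other range-based algorithms without building an intermediate string first.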
|