Thread overview
Reading ASCII file with some codes above 127 (exten ascii)
May 13, 2012
Paul
May 13, 2012
Era Scarecrow
May 14, 2012
Graham Fawcett
May 17, 2012
Paul
May 23, 2012
Paul
May 23, 2012
Graham Fawcett
May 23, 2012
Paul
May 23, 2012
Graham Fawcett
May 23, 2012
Paul
May 23, 2012
H. S. Teoh
May 23, 2012
Paul
May 23, 2012
Graham Fawcett
May 23, 2012
Paul
May 24, 2012
era scarecrow
May 25, 2012
Regan Heath
May 13, 2012
I am reading a file that has a few extended ASCII codes (e.g. the degree symbol). Depending on how I read the file in and what I do with it, the error shows up at different points.  I'm pretty sure it all boils down to these extended ASCII codes.

Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file? I've messed with the std.encoding module but really can't figure out what I need to do.

There must be a simple solution to this.
May 13, 2012
On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
> [...]
> Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file?
> [...]

Same here. I ended up writing a custom array converter: if the input contains any codes of 128 or above, it converts them and returns a new array. Maybe this is wrong, but it works for me.

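import std.utf;

// Conversion table from extended ASCII to Unicode for text compares.
// Covers only 128-255. Note: 0x80-0x9F mostly follow Windows-1252, not
// ISO 8859-1 (which has control codes there); 0xA0-0xFF are the same in both.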
private immutable wchar[] extAscii = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
  0x00A0, 0x00A1, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7,
  0x00A8, 0x00A9, 0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
  0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7,
  0x00B8, 0x00B9, 0x00BA, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x00BF,
  0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x00C7,
  0x00C8, 0x00C9, 0x00CA, 0x00CB, 0x00CC, 0x00CD, 0x00CE, 0x00CF,
  0x00D0, 0x00D1, 0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
  0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
  0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x00E7,
  0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
  0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x00F7,
  0x00F8, 0x00F9, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x00FF];

/** Custom-made, since I can't find a good explanation of conversion.
    If nothing needs converting, it returns the original buffer. */
char[] ascii2char(ubyte[] input) {
  char[] o;

  foreach(i, b; input) {
    if (b & 0x80) {
      // first byte above 127: start the output from the plain-ASCII prefix seen so far
      if (!o.length)
        o = cast(char[]) input[0 .. i];

      // append the mapped code point as UTF-8
      encode(o, extAscii[b - 0x80]);
    } else if (o.length)
      o ~= b;  // plain ASCII byte, copied only once a conversion has started
  }

  return o.length ? o : cast(char[]) input;
}
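
For comparison, here is a minimal sketch of the same idea for input that really is ISO 8859-1 rather than Windows-1252. In that case every byte value equals its Unicode code point, so the lookup table is not needed. The helper name `latin1ToUtf8` is hypothetical and not from the thread; `std.utf.encode` is the same call used in the post above.

```d
import std.utf : encode;

/// Hypothetical helper (not from the thread): ISO 8859-1 bytes to UTF-8.
char[] latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
    {
        // In ISO 8859-1 the byte value is the Unicode code point,
        // so each byte can be encoded directly.
        encode(result, cast(dchar) b);
    }
    return result;
}

void main()
{
    import std.stdio : writeln;

    // 0xB0 is the degree sign in Latin-1; 0x46 is 'F'.
    ubyte[] raw = [0xB0, 0x46];
    writeln(latin1ToUtf8(raw)); // UTF-8 output; needs a UTF-8 aware terminal
}
```

Unlike the table-based version, this always allocates a new array, and it would map bytes 0x80-0x9F to control codes instead of characters such as the euro sign.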
May 14, 2012
On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
> [...]
> Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file?
> [...]

This seems to work:


import std.stdio, std.file, std.encoding;

void main()
{
    auto latin = cast(Latin1String) read("/tmp/hi.8859");
    string s;
    transcode(latin, s);
    writeln(s);
}


Graham
May 17, 2012
On Monday, 14 May 2012 at 12:58:20 UTC, Graham Fawcett wrote:
> [...]
> This seems to work:
> [...]

Awesome! Thanks a million!
May 23, 2012
On Monday, 14 May 2012 at 12:58:20 UTC, Graham Fawcett wrote:
> [...]
>     auto latin = cast(Latin1String) read("/tmp/hi.8859");
>     string s;
>     transcode(latin, s);
> [...]

I thought I was in good shape with your above suggestion.  It does help me read and process text.  But when I go to print it out I have problems.

Here is my input file:
°F

Here is my code:
import std.stdio;
import std.string;
import std.file;
import std.encoding;

// Main function
void main(){
    auto fout = File("out.txt","w");
    auto latinS = cast(Latin1String) read("in.txt");
    string uniS;
    transcode(latinS, uniS);
    foreach(line; uniS.splitLines()){
       transcode(line, latinS);
       fout.writeln(line);
       fout.writeln(latinS);
    }
}

Here is the output:
°F
[cast(immutable(Latin1Char))176, cast(immutable(Latin1Char))70]

If I print the Unicode string I get an extra weird character.  If I print the Unicode string retranslated to Latin1, I get weird pseudo-code.
Can you help?
May 23, 2012
On Wednesday, 23 May 2012 at 15:48:20 UTC, Paul wrote:
> [...]
> If I print the Unicode string I get an extra weird character.  If I print the Unicode string retranslated to Latin1, I get weird pseudo-code.
> Can you help?

I tried the program and it seemed to work for me.

What program are you using to read "out.txt"? Are you sure it supports UTF-8, and knows to open the file as UTF-8? (This looks suspiciously like a tool's attempt to misinterpret a UTF-8 string as Latin-1.)

If you're on a Unix system, what does "file in.txt out.txt" report?

Graham
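
The misinterpretation described above can be reproduced in a few lines. This is only a sketch: `misreadAsLatin1` is a hypothetical helper that decodes UTF-8 bytes as if they were Latin-1, which is roughly what a viewer does when it guesses the wrong encoding.

```d
import std.encoding : Latin1String, transcode;
import std.stdio : writefln, writeln;

/// Hypothetical helper: treat the bytes of a UTF-8 string as Latin-1
/// and convert the result to UTF-8 again for display.
string misreadAsLatin1(string utf8)
{
    string shown;
    transcode(cast(Latin1String) utf8, shown);
    return shown;
}

void main()
{
    string correct = "\u00B0F"; // degree sign followed by 'F'

    // The degree sign takes two bytes in UTF-8: C2 B0, then 46 for 'F'.
    writefln("UTF-8 bytes: %(%02X %)", cast(immutable(ubyte)[]) correct);

    // Each of those three bytes becomes its own character when read as
    // Latin-1, so an extra character appears before the degree sign.
    writeln(misreadAsLatin1(correct));
}
```

The extra character is U+00C2, the Latin-1 reading of the lead byte 0xC2 that UTF-8 puts in front of 0xB0.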

May 23, 2012
On Wednesday, 23 May 2012 at 18:04:56 UTC, Graham Fawcett wrote:
> [...]
> What program are you using to read "out.txt"? Are you sure it supports UTF-8, and knows to open the file as UTF-8? (This looks suspiciously like a tool's attempt to misinterpret a UTF-8 string as Latin-1.)
>
> If you're on a Unix system, what does "file in.txt out.txt" report?

Hmmm.  I'm not communicating well.
I want to read and write ASCII.  The only reason I'm converting to Unicode is because D needs it (as I understand).

Yes, if I open out.txt in Notepad++ and tell Notepad++ that it is UTF-8, it shows °F.

I want to:
1) Read an ascii file that may have codes above 127.
2) Convert to unicode so D funcs like .splitLines() can work with it.
3) Convert back to ascii so that stuff like °F writes out as it was read in.

If I open in.txt and out.txt in an ascii editor, °F should look the same in both files with the editor encoding the files as ANSI/ASCII.  I thought my program was doing just that.
Thanks for your assistance.
May 23, 2012
On Wednesday, 23 May 2012 at 18:43:04 UTC, Paul wrote:
> [...]
> I want to read and write ASCII.  The only reason I'm converting to Unicode is because D needs it (as I understand).
> [...]
> I want to:
> 1) Read an ascii file that may have codes above 127.
> 2) Convert to unicode so D funcs like .splitLines() can work with it.
> 3) Convert back to ascii so that stuff like °F writes out as it was read in.
> [...]

To make sure we're on the same page -- ASCII is a 7-bit encoding, and any character above 127 is by definition not an ASCII character. At that point we're talking about an encoding other than ASCII, such as UTF-8 or Latin-1.

If you're reading a file that has bytes > 127, you really have no choice but to specify (assume?) an encoding, Latin-1 for example. There's no guarantee your input file is Latin-1, though, and garbage-in will result in garbage-out.

So I think what you're trying to do is

1. read a Latin-1 file, into unicode (internally in D)
2. do splitLines(), etc., generating some result
3. Convert the result back to latin-1, and output it.

Is that right?
Graham
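
The three steps above might look roughly like this when done in memory. It is a sketch under the assumption that the input really is Latin-1; `processLatin1` is a hypothetical name, and the "processing" is limited to `splitLines()`.

```d
import std.encoding : Latin1String, transcode;
import std.string : splitLines;

/// Hypothetical helper: Latin-1 bytes in, Latin-1 bytes out, UTF-8 in between.
immutable(ubyte)[] processLatin1(immutable(ubyte)[] input)
{
    // 1. Label the bytes as Latin-1 and convert them to UTF-8.
    string text;
    transcode(cast(Latin1String) input, text);

    immutable(ubyte)[] output;
    // 2. Work on the UTF-8 text with the usual string functions.
    foreach (line; text.splitLines())
    {
        // 3. Convert each result back to Latin-1 and collect the raw bytes.
        Latin1String latinLine;
        transcode(line, latinLine);
        output ~= cast(immutable(ubyte)[]) latinLine;
        output ~= cast(ubyte) '\n';
    }
    return output;
}

void main()
{
    // "°F" plus a newline as Latin-1 bytes.
    immutable(ubyte)[] sample = [0xB0, 0x46, 0x0A];
    assert(processLatin1(sample) == sample);
}
```

The input bytes could come from `cast(immutable(ubyte)[]) std.file.read(...)`. Note that this sketch ends every line with a single `\n`, so a file without a trailing newline, or with `\r\n` line endings, would not round-trip byte for byte.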


May 23, 2012
On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
> [...]
> So I think what you're trying to do is
>
> 1. read a Latin-1 file, into unicode (internally in D)
> 2. do splitLines(), etc., generating some result
> 3. Convert the result back to latin-1, and output it.
>
> Is that right?

Exactly.
May 23, 2012
On Wed, May 23, 2012 at 09:09:27PM +0200, Paul wrote:
> On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
[...]
> >So I think what you're trying to do is
> >
> >1. read a Latin-1 file, into unicode (internally in D)
> >2. do splitLines(), etc., generating some result
> >3. Convert the result back to latin-1, and output it.
> >
> >Is that right?
> >Graham
> 
> Exactly.

The safest way is probably to read it as binary data (i.e. byte[]), then
do the conversion into UTF8, then process it, and finally convert it
back to latin-1 (in binary form) and output it.

D assumes Unicode internally; if you try to read a Latin-1 file as char[], you may be running into some implicit UTF conversions that are corrupting the data. Best use byte[] for reading/writing, and do conversions to/from UTF-8 internally for processing.


T

-- 
Doubt is a self-fulfilling prophecy.
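
A file-based sketch of that advice follows, assuming the file names `in.txt` and `out.txt` from earlier in the thread. `copyLatin1Lines` is a hypothetical helper. The main difference from the earlier program is that the Latin-1 result is written with `rawWrite` instead of `writeln`, which formats a `Latin1String` as an array of `Latin1Char` values rather than emitting its bytes.

```d
import std.encoding : Latin1String, transcode;
import std.file : read, write;
import std.stdio : File;
import std.string : splitLines;

/// Hypothetical helper: copy a Latin-1 file line by line via UTF-8.
void copyLatin1Lines(string inName, string outName)
{
    // read() returns untyped bytes, so nothing is decoded at this point.
    auto latinIn = cast(Latin1String) read(inName);
    string text;
    transcode(latinIn, text);

    auto fout = File(outName, "wb");
    foreach (line; text.splitLines())
    {
        Latin1String latinLine;
        transcode(line, latinLine);
        // rawWrite emits the bytes unchanged.
        fout.rawWrite(cast(const(ubyte)[]) latinLine);
        fout.rawWrite("\n");
    }
}

void main()
{
    // Demo input: "°F" plus a newline, stored as Latin-1 bytes.
    ubyte[] sample = [0xB0, 0x46, 0x0A];
    write("in.txt", sample);

    copyLatin1Lines("in.txt", "out.txt");
    assert(cast(ubyte[]) read("out.txt") == sample);
}
```

Opening the output in binary mode and writing the newline explicitly keeps the runtime from translating line endings, at the cost of always producing `\n`.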