May 13, 2012
Reading ASCII file with some codes above 127 (exten ascii)
I am reading a file that has a few extended ASCII codes (e.g. the 
degree symbol). Depending on how I read the file in and what I do 
with it, the error shows up at different points. I'm pretty sure 
it all boils down to these extended ASCII codes.

Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file? 
I've messed with the std.encoding module but really can't figure 
out what I need to do.

There must be a simple solution to this.
May 13, 2012
Re: Reading ASCII file with some codes above 127 (exten ascii)
On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
> [...]

Same here. I ended up writing a custom array converter: if the 
input contains any codes of 128 or above, it converts them and 
returns a new array; otherwise it returns the original buffer. 
Maybe this is wrong, but it works for me.

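import std.utf : encode;

// Conversion table from extended ASCII to Unicode, for text comparisons.
// Covers bytes 128-255 only. Note that the entries for 0x80-0x9F are the
// Windows-1252 mappings, not the ISO 8859-1 (Latin-1) control codes.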
private immutable wchar[] extAscii = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
  0x00A0, 0x00A1, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7,
  0x00A8, 0x00A9, 0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
  0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7,
  0x00B8, 0x00B9, 0x00BA, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x00BF,
  0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x00C7,
  0x00C8, 0x00C9, 0x00CA, 0x00CB, 0x00CC, 0x00CD, 0x00CE, 0x00CF,
  0x00D0, 0x00D1, 0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
  0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
  0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x00E7,
  0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
  0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x00F7,
  0x00F8, 0x00F9, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x00FF];

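/**
 * Since I can't find a good explanation of conversion, this is custom made.
 * If nothing needs to be converted, it returns the original buffer.
 */
char[] ascii2char(ubyte[] input) {
  char[] o;

  foreach (i, b; input) {
    if (b & 0x80) {
      // First high byte seen: start a new buffer from a copy of the
      // ASCII prefix, so appending can never touch the input array.
      if (!o.length)
        o = cast(char[]) input[0 .. i].dup;

      // Append the mapped code point, encoded as UTF-8.
      encode(o, extAscii[b - 0x80]);
    } else if (o.length)
      o ~= cast(char) b;
  }

  return o.length ? o : cast(char[]) input;
}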
May 14, 2012
Re: Reading ASCII file with some codes above 127 (exten ascii)
On Sunday, 13 May 2012 at 21:03:45 UTC, Paul wrote:
> [...]

This seems to work:


import std.stdio, std.file, std.encoding;

void main()
{
    auto latin = cast(Latin1String) read("/tmp/hi.8859");
    string s;
    transcode(latin, s);
    writeln(s);
}


Graham
May 17, 2012
Re: Reading ASCII file with some codes above 127 (exten ascii)
On Monday, 14 May 2012 at 12:58:20 UTC, Graham Fawcett wrote:
> [...]

Awesome! Thanks a million!
May 23, 2012
Re: Reading ASCII file with some codes above 127 (exten ascii)
On Monday, 14 May 2012 at 12:58:20 UTC, Graham Fawcett wrote:
> [...]

I thought I was in good shape with your above suggestion. It does 
help me read and process text, but when I go to print it out I 
have problems.

Here is my input file:
°F

Here is my code:
import std.stdio;
import std.string;
import std.file;
import std.encoding;

// Main function
void main(){
    auto fout = File("out.txt","w");
    auto latinS = cast(Latin1String) read("in.txt");
    string uniS;
    transcode(latinS, uniS);
    foreach(line; uniS.splitLines()){
       transcode(line, latinS);
       fout.writeln(line);
       fout.writeln(latinS);
    }
}

Here is the output:
Â°F
[cast(immutable(Latin1Char))176, cast(immutable(Latin1Char))70]

If I print the Unicode string, I get an extra weird character. If 
I print the Unicode string retranslated to Latin1, I get weird 
pseudo-code.
Can you help?
May 23, 2012
Re: Reading ASCII file with some codes above 127 (exten ascii)
On Wednesday, 23 May 2012 at 15:48:20 UTC, Paul wrote:
> [...]

I tried the program and it seemed to work for me.

What program are you using to read "out.txt"? Are you sure it 
supports UTF-8, and knows to open the file as UTF-8? (This looks 
suspiciously like a tool's attempt to misinterpret a UTF-8 string 
as Latin-1.)
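
The effect described above can be reproduced at the byte level. This is only a minimal sketch, assuming the input holds the two Latin-1 bytes for °F; the helper name latin1ToUtf8Bytes is made up for illustration. It shows that the degree sign occupies one byte in Latin-1 but two in UTF-8, and a Latin-1 viewer would render the extra leading byte (0xC2) as a stray character.

```d
import std.encoding : Latin1String, transcode;
import std.stdio : writefln;

/// Hypothetical helper: transcodes Latin-1 bytes and returns the UTF-8 bytes.
immutable(ubyte)[] latin1ToUtf8Bytes(const(ubyte)[] raw)
{
    // Treat the raw bytes as Latin-1 text (assumption: the file is Latin-1).
    auto latin = cast(Latin1String) raw.idup;

    string utf8;
    transcode(latin, utf8);

    // Expose the UTF-8 code units as plain bytes for inspection.
    return cast(immutable(ubyte)[]) utf8;
}

void main()
{
    // "°F" as stored in a Latin-1 file: 0xB0 is the degree sign, 0x46 is 'F'.
    immutable(ubyte)[] raw = [0xB0, 0x46];

    // Print the UTF-8 bytes in hex, separated by spaces.
    writefln("%(%02X %)", latin1ToUtf8Bytes(raw));
}
```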

If you're on a Unix system, what does "file in.txt out.txt" 
report?

Graham
May 23, 2012
Re: Reading ASCII file with some codes above 127 (exten ascii)
On Wednesday, 23 May 2012 at 18:04:56 UTC, Graham Fawcett wrote:
> [...]

Hmmm.  I'm not communicating well.
I want to read and write ASCII.  The only reason I'm converting 
to Unicode is because D needs it (as I understand).

Yes, if I open Â°F in Notepad++ and tell Notepad++ that it is 
UTF-8, it shows °F.

I want to:
1) Read an ASCII file that may have codes above 127.
2) Convert to Unicode so D funcs like .splitLines() can work with 
it.
3) Convert back to ASCII so that stuff like °F writes out as it 
was read in.

If I open in.txt and out.txt in an ASCII editor, °F should look 
the same in both files, with the editor encoding the files as 
ANSI/ASCII.  I thought my program was doing just that.
Thanks for your assistance.
May 23, 2012
Re: Reading ASCII file with some codes above 127 (exten ascii)
On Wednesday, 23 May 2012 at 18:43:04 UTC, Paul wrote:
> [...]

To make sure we're on the same page -- ASCII is a 7-bit encoding, 
and any character above 127 is by definition not an ASCII 
character. At that point we're talking about an encoding other 
than ASCII, such as UTF-8 or Latin-1.

If you're reading a file that has bytes > 127, you really have no 
choice but to specify (assume?) an encoding, Latin-1 for example. 
There's no guarantee your input file is Latin-1, though, and 
garbage-in will result in garbage-out.
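
Whether a buffer is plain ASCII at all can be tested before any encoding is assumed, by comparing every byte against the 7-bit limit. A small sketch of that check; isPureAscii is a hypothetical helper, not something Phobos provides under that name:

```d
import std.algorithm.searching : all;
import std.stdio : writeln;

/// Hypothetical helper: true when every byte fits in 7 bits.
bool isPureAscii(const(ubyte)[] data)
{
    return data.all!(b => b < 0x80);
}

void main()
{
    immutable(ubyte)[] plain = [0x46];          // "F"
    immutable(ubyte)[] degrees = [0xB0, 0x46];  // "°F" in Latin-1

    writeln(isPureAscii(plain));
    writeln(isPureAscii(degrees));
}
```

If the check fails, the bytes still say nothing about which encoding was used; that has to come from knowledge of where the file came from.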

So I think what you're trying to do is

1. Read a Latin-1 file into Unicode (internally in D).
2. Do splitLines(), etc., generating some result.
3. Convert the result back to Latin-1, and output it.

Is that right?
Graham
May 23, 2012
Re: Reading ASCII file with some codes above 127 (exten ascii)
On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
> [...]

Exactly.
May 23, 2012
Re: Reading ASCII file with some codes above 127 (exten ascii)
On Wed, May 23, 2012 at 09:09:27PM +0200, Paul wrote:
> On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
[...]
> >So I think what you're trying to do is
> >
> >1. read a Latin-1 file, into unicode (internally in D)
> >2. do splitLines(), etc., generating some result
> >3. Convert the result back to latin-1, and output it.
> >
> >Is that right?
> >Graham
> 
> Exactly.

The safest way is probably to read it as binary data (i.e. byte[]), then
do the conversion into UTF-8, then process it, and finally convert it
back to Latin-1 (in binary form) and output it.

D assumes Unicode internally; if you try to read a Latin-1 file as
char[], you may be running into some implicit UTF conversions that are
corrupting the data. Best use byte[] for reading/writing, and do
conversions to/from UTF-8 internally for processing.
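
A rough sketch of that round trip, assuming the input really is Latin-1. The function name roundTrip and the command-line handling are invented for illustration, and the per-line "processing" is only a placeholder that copies each line. The result is handed back as raw bytes on purpose: passing a Latin1String to writeln appears to format it as an array of enum values, which would explain the cast(immutable(Latin1Char)) output seen earlier in the thread.

```d
import std.encoding : Latin1String, transcode;
import std.file : read, write;
import std.string : splitLines;

/// Hypothetical helper: Latin-1 bytes in, Latin-1 bytes out,
/// with the processing done on a UTF-8 string in between.
immutable(ubyte)[] roundTrip(const(ubyte)[] raw)
{
    // 1. Latin-1 bytes -> UTF-8 string.
    string text;
    transcode(cast(Latin1String) raw.idup, text);

    // 2. Process the text; this sketch just copies each line and
    //    normalizes the line ending to "\n".
    string result;
    foreach (line; text.splitLines())
        result ~= line ~ "\n";

    // 3. UTF-8 string -> Latin-1, returned as plain bytes.
    Latin1String latinOut;
    transcode(result, latinOut);
    return cast(immutable(ubyte)[]) latinOut;
}

void main(string[] args)
{
    // Assumed usage: prog in.txt out.txt
    if (args.length == 3)
    {
        auto raw = cast(const(ubyte)[]) read(args[1]);
        // std.file.write stores the bytes unchanged, with no text formatting.
        write(args[2], roundTrip(raw));
    }
}
```

Characters that have no Latin-1 equivalent cannot survive step 3, so this only round-trips cleanly when the processing stays within the Latin-1 repertoire.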


T

-- 
Doubt is a self-fulfilling prophecy.