May 13, 2012 Reading ASCII file with some codes above 127 (exten ascii)
I am reading a file that has a few extended ASCII codes (e.g. the degree symbol). Depending on how I read the file in and what I do with it, the error shows up at different points. I'm pretty sure it all boils down to these extended ASCII codes. Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file? I've messed with the std.encoding module but really can't figure out what I need to do. There must be a simple solution to this.

May 13, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii)
Posted in reply to Paul's post of Sunday, 13 May 2012 at 21:03:45 UTC
Same here. I ended up writing a custom array converter: if the input contains any codes of 128 or above, it converts them and returns a new array. Maybe this is wrong, but it works for me.
import std.utf;
import std.ascii;
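// Conversion table from single-byte codes 128-255 to Unicode, for text comparisons.
// Note: the 0x80-0x9F rows match Windows-1252 rather than ISO 8859-1 (Latin-1),
// where those bytes map to the control characters U+0080-U+009F.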
private immutable wchar[] extAscii = [
0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
0x00A0, 0x00A1, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7,
0x00A8, 0x00A9, 0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7,
0x00B8, 0x00B9, 0x00BA, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x00BF,
0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x00C7,
0x00C8, 0x00C9, 0x00CA, 0x00CB, 0x00CC, 0x00CD, 0x00CE, 0x00CF,
0x00D0, 0x00D1, 0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x00E7,
0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x00F7,
0x00F8, 0x00F9, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x00FF];
/**
 * Since I can't find a good explanation of the conversion, this is custom made.
 * If nothing needs to be converted, it returns the original buffer.
 */
char[] ascii2char(ubyte[] input) {
    char[] o;
    foreach (i, b; input) {
        if (b & 0x80) {
            // First high byte: start the output from the ASCII prefix seen so far.
            if (!o.length)
                o = cast(char[]) input[0 .. i];
            encode(o, extAscii[b - 0x80]); // append the code point as UTF-8
        } else if (o.length)
            o ~= cast(char) b;
    }
    return o.length ? o : cast(char[]) input;
}
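For a file that really is ISO 8859-1, a lookup table should not be needed, because each Latin-1 byte value is also its Unicode code point. A minimal sketch under that assumption (latin1ToUtf8 is a hypothetical name, not a Phobos function):

```d
import std.utf : encode;

/// Hypothetical helper: decodes ISO 8859-1 bytes into a UTF-8 string.
/// Assumes the input is Latin-1, where byte value == Unicode code point.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    result.reserve(input.length);
    foreach (b; input)
    {
        if (b < 0x80)
            result ~= cast(char) b;        // plain ASCII passes through unchanged
        else
            encode(result, cast(dchar) b); // appends the two-byte UTF-8 sequence
    }
    return result.idup;
}
```

Calling latin1ToUtf8 on the bytes 0xB0 0x46 should give the three-byte UTF-8 string for the degree sign followed by F. Unlike the table above, this does not handle the Windows-1252 punctuation in the 0x80-0x9F range.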

May 14, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii)
Posted in reply to Paul's post of Sunday, 13 May 2012 at 21:03:45 UTC
This seems to work:
import std.stdio, std.file, std.encoding;
void main()
{
    auto latin = cast(Latin1String) read("/tmp/hi.8859");
    string s;
    transcode(latin, s);
    writeln(s);
}
Graham

May 17, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii)
Posted in reply to Graham Fawcett's post of Monday, 14 May 2012 at 12:58:20 UTC
Awesome! Thanks a million!

May 23, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii)
Posted in reply to Graham Fawcett's post of Monday, 14 May 2012 at 12:58:20 UTC
I thought I was in good shape with your above suggestion. It does help me read and process text. But when I go to print it out I have problems.
Here is my input file:
°F
Here is my code:
import std.stdio;
import std.string;
import std.file;
import std.encoding;
// Main function
void main(){
    auto fout = File("out.txt","w");
    auto latinS = cast(Latin1String) read("in.txt");
    string uniS;
    transcode(latinS, uniS);
    foreach(line; uniS.splitLines()){
        transcode(line, latinS);
        fout.writeln(line);
        fout.writeln(latinS);
    }
}
Here is the output:
Â°F
[cast(immutable(Latin1Char))176, cast(immutable(Latin1Char))70]
If I print the Unicode string I get an extra weird character. If I print the Unicode string retranslated to Latin1, I get weird pseudo-code.
Can you help?

May 23, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii)
Posted in reply to Paul's post of Wednesday, 23 May 2012 at 15:48:20 UTC
I tried the program and it seemed to work for me.
What program are you using to read "out.txt"? Are you sure it supports UTF-8, and knows to open the file as UTF-8? (This looks suspiciously like a tool's attempt to misinterpret a UTF-8 string as Latin-1.)
If you're on a Unix system, what does "file in.txt out.txt" report?
Graham
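One way to check from D itself whether a file is already UTF-8 is to validate its raw bytes. A rough sketch: isValidUtf8 is a hypothetical helper built on std.utf.validate, and it is only a heuristic, since pure ASCII passes and a Latin-1 file could in principle also happen to be valid UTF-8.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: returns true if the bytes are well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] data)
{
    try
    {
        // validate throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

A Latin-1 degree sign is the single byte 0xB0, which is not a valid UTF-8 start byte, so a Latin-1 "in.txt" containing it should fail this check, while a UTF-8 "out.txt" (0xC2 0xB0 for the degree sign) should pass.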

May 23, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii)
Posted in reply to Graham Fawcett's post of Wednesday, 23 May 2012 at 18:04:56 UTC
Hmmm. I'm not communicating well.
I want to read and write ASCII. The only reason I'm converting to Unicode is because D needs it (as I understand).
Yes, if I open the Â°F output in Notepad++ and tell Notepad++ that it is UTF-8, it shows °F.
I want to:
1) Read an ascii file that may have codes above 127.
2) Convert to unicode so D funcs like .splitLines() can work with it.
3) Convert back to ascii so that stuff like °F writes out as it was read in.
If I open in.txt and out.txt in an ascii editor, °F should look the same in both files with the editor encoding the files as ANSI/ASCII. I thought my program was doing just that.
Thanks for your assistance.

May 23, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii)
Posted in reply to Paul's post of Wednesday, 23 May 2012 at 18:43:04 UTC
To make sure we're on the same page -- ASCII is a 7-bit encoding, and any character above 127 is by definition not an ASCII character. At that point we're talking about an encoding other than ASCII, such as UTF-8 or Latin-1.
If you're reading a file that has bytes > 127, you really have no choice but to specify (assume?) an encoding, Latin-1 for example. There's no guarantee your input file is Latin-1, though, and garbage-in will result in garbage-out.
So I think what you're trying to do is
1. Read a Latin-1 file into Unicode (internally in D).
2. Do splitLines(), etc., generating some result.
3. Convert the result back to Latin-1, and output it.
Is that right?
Graham

May 23, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii)
Posted in reply to Graham Fawcett's post of Wednesday, 23 May 2012 at 19:01:53 UTC
Exactly.

May 23, 2012 Re: Reading ASCII file with some codes above 127 (exten ascii)
Posted in reply to Paul's post of Wed, May 23, 2012 at 09:09:27PM +0200
The safest way is probably to read it as binary data (i.e. byte[]), then do the conversion into UTF8, then process it, and finally convert it back to latin-1 (in binary form) and output it.
D assumes Unicode internally; if you try to read a Latin-1 file as char[], you may be running into some implicit UTF conversions that are corrupting the data. Best use byte[] for reading/writing, and do conversions to/from UTF-8 internally for processing.
T
--
Doubt is a self-fulfilling prophecy.
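
The read-as-binary, convert, process, convert-back approach might look like the sketch below. copyLatin1Lines and its path parameters are hypothetical; the sketch assumes the input really is Latin-1 and that normalizing every line ending to "\n" is acceptable. The Latin-1 result is written with rawWrite, because writeln formats a Latin1String as an array of enum values (the [cast(immutable(Latin1Char))176, ...] output seen earlier in the thread) instead of emitting the bytes.

```d
import std.encoding : Latin1String, transcode;
import std.file : read;
import std.stdio : File;
import std.string : splitLines;

/// Hypothetical helper: copies a Latin-1 file line by line,
/// processing the text as UTF-8 and writing it back as Latin-1.
void copyLatin1Lines(string inPath, string outPath)
{
    // 1. Read raw bytes and decode them from Latin-1 into a UTF-8 string.
    auto latinIn = cast(Latin1String) read(inPath);
    string text;
    transcode(latinIn, text);

    auto fout = File(outPath, "wb");
    // 2. Process the text with ordinary string functions.
    foreach (line; text.splitLines())
    {
        // 3. Re-encode each line as Latin-1 and write the bytes untouched.
        Latin1String latinOut;
        transcode(line ~ "\n", latinOut);
        fout.rawWrite(cast(const(ubyte)[]) latinOut);
    }
}
```

With an "in.txt" holding the bytes 0xB0 0x46 0x0A, calling copyLatin1Lines("in.txt", "out.txt") should produce an "out.txt" with the same three bytes, so both files look identical in an editor set to ANSI/Latin-1.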
Copyright © 1999-2021 by the D Language Foundation