February 13, 2011 Read non-UTF8 file | ||||
---|---|---|---|---|
| ||||
Hey guys,

I have the following source:

module filereader;
import std.file;
import std.stdio : writeln;

void main(string[] args) {
    File f = new File("myFile.ext", FileMode.In);
    while(!f.eof()) {
        writeln(convertToUTF8(f.readLine()));
    }
    f.close();
}

string convertToUTF8(char[] text) {
    string result;
    for (uint i=0; i<text.length; i++) {
        wchar ch = text[i];
        if (ch < 0x80) {
            result ~= ch;
        } else {
            result ~= 0xC0 | (ch >> 6);
            result ~= 0x80 | (ch & 0x3F);
        }
    }
    return result;
}

It compiles and works as long as the char array/string returned by f.readLine() doesn't contain non-UTF8 character(s). If it contains such chars, writeln() doesn't write anything to the console. Is there any chance to read such files?

Thanks a lot!
|
February 19, 2011 Re: Read non-UTF8 file | ||||
---|---|---|---|---|
| ||||
Posted in reply to Nrgyzer | On 13/02/2011 21:49, Nrgyzer wrote:
<snip>
> It compiles and works as long as the returned char-array/string of f.readLine() doesn't
> contain non-UTF8 character(s). If it contains such chars, writeln() doesn't write
> anything to the console. Is there any chance to read such files?
Please post sample input that shows the problem, and the output generated by replacing the writeln call with
writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine()));
so that we can see what it is actually reading in.
Stewart.
|
February 19, 2011 Re: Read non-UTF8 file | ||||
---|---|---|---|---|
| ||||
Posted in reply to Stewart Gordon | == Quote from Stewart Gordon (smjg_1998@yahoo.com)'s article
> On 13/02/2011 21:49, Nrgyzer wrote:
> <snip>
> > It compiles and works as long as the returned char-array/string of f.readLine() doesn't
> > contain non-UTF8 character(s). If it contains such chars, writeln() doesn't write
> > anything to the console. Is there any chance to read such files?
> Please post sample input that shows the problem, and the output generated by replacing the
> writeln call with
> writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine()));
> so that we can see what it is actually reading in.
> Stewart.
My file contains the following:
ä
ö
ü
Now... and with writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the following:
[195, 131, 164]
[195, 131, 182]
[195, 131, 188]
|
February 19, 2011 Re: Read non-UTF8 file | ||||
---|---|---|---|---|
| ||||
Posted in reply to Nrgyzer | On 02/19/2011 02:42 PM, Nrgyzer wrote:
> == Quote from Stewart Gordon (smjg_1998@yahoo.com)'s article
>> On 13/02/2011 21:49, Nrgyzer wrote:
>> <snip>
>>> It compiles and works as long as the returned char-array/string of f.readLine() doesn't
>>> contain non-UTF8 character(s). If it contains such chars, writeln() doesn't write
>>> anything to the console. Is there any chance to read such files?
>> Please post sample input that shows the problem, and the output generated by replacing the
>> writeln call with
>> writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine()));
>> so that we can see what it is actually reading in.
>> Stewart.
>
> My file contains the following:
>
> �
> �
> �
>
> Now... and with writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the following:
>
> [195, 131, 164]
> [195, 131, 182]
> [195, 131, 188]
At first sight, I find your input strange. Actually, it looks like UTF-8 (195 is common when representing converted Latin text). But having (195, 131), which is the code for 'Ã', three times is weird.
What is your source text, what is its encoding, and where does it come from? Why don't you /start/ and tell us about that?
Denis
--
_________________
vita es estrany
spir.wikidot.com
|
February 20, 2011 Re: Read non-UTF8 file | ||||
---|---|---|---|---|
| ||||
Posted in reply to spir | == Quote from spir (denis.spir@gmail.com)'s article
> On 02/19/2011 02:42 PM, Nrgyzer wrote:
> > == Quote from Stewart Gordon (smjg_1998@yahoo.com)'s article
> >> On 13/02/2011 21:49, Nrgyzer wrote:
> >> <snip>
> >>> It compiles and works as long as the returned char-array/string of f.readLine() doesn't
> >>> contain non-UTF8 character(s). If it contains such chars, writeln() doesn't write
> >>> anything to the console. Is there any chance to read such files?
> >> Please post sample input that shows the problem, and the output generated by replacing the
> >> writeln call with
> >> writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine()));
> >> so that we can see what it is actually reading in.
> >> Stewart.
> >
> > My file contains the following:
> >
> > �
> > �
> > �
> >
> > Now... and with writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the following:
> >
> > [195, 131, 164]
> > [195, 131, 182]
> > [195, 131, 188]
> At first sight, I find your input strange. Actually, it looks like utf-8 (195
> is common when representing converted latin text). But having 3 times (195,
> 131) which is the code for 'Ã' is weird.
> What is your source text, what is its encoding, and where does it come from?
> What don't you /start/ and tell us about that?
> Denis
It seems that my input chars don't show correctly above... the file contains the following chars:
0xE4 (or 228), 0xF6 (or 246) and 0xFC (or 252)
I used Notepad to create the file and saved it with ANSI encoding. The file is for testing purposes only.
|
February 21, 2011 Re: Read non-UTF8 file | ||||
---|---|---|---|---|
| ||||
Posted in reply to Nrgyzer | What compiler version/platform are you using? I had to fix some errors before it would compile on mine (1.066/2.051 Windows).
On 19/02/2011 13:42, Nrgyzer wrote:
<snip>
> Now... and with writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the following:
>
> [195, 131, 164]
> [195, 131, 182]
> [195, 131, 188]
It took a while for me to make sense of what's going on!
The expressions (0xC0 | (ch >> 6)) and (0x80 | (ch & 0x3F)) both have type int. It appears that, in D2, if you append an int to a string then it treats the int as a Unicode codepoint and automagically converts it to UTF-8. But why is it doing it on the first byte and not the second? This looks like a bug.
Casting each UTF-8 byte value to a char
if (ch < 0x80) {
result ~= cast(char) ch;
} else {
result ~= cast(char) (0xC0 | (ch >> 6));
result ~= cast(char) (0x80 | (ch & 0x3F));
}
gives the expected output
[195, 164]
[195, 182]
[195, 188]
HTH
Stewart.
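
The difference between the two outputs can be reproduced in isolation. The sketch below is an illustration only, not code from the thread: the helper names appendCodePoint and appendCodeUnit are hypothetical, and the explicit casts stand in for whatever implicit conversion the compiler picked in the original code. It shows that appending a dchar to a string stores the UTF-8 encoding of that code point, while appending a char stores a single byte.

```d
import std.stdio : writeln;

// Hypothetical helper: appends the value as a Unicode code point.
// D encodes a dchar as UTF-8 when it is appended to a string.
string appendCodePoint(uint value)
{
    string s;
    s ~= cast(dchar) value;
    return s;
}

// Hypothetical helper: appends the value as one raw UTF-8 code unit (byte).
string appendCodeUnit(uint value)
{
    string s;
    s ~= cast(char) value;
    return s;
}

void main()
{
    ubyte ch = 0xE4;                 // 'ä' in Latin-1
    uint lead  = 0xC0 | (ch >> 6);   // 0xC3 = 195
    uint trail = 0x80 | (ch & 0x3F); // 0xA4 = 164

    // U+00C3 ('Ã') encoded as UTF-8 takes two bytes.
    writeln(cast(immutable(ubyte)[]) appendCodePoint(lead)); // [195, 131]
    // The same value appended as a char stays a single byte.
    writeln(cast(immutable(ubyte)[]) appendCodeUnit(lead));  // [195]
    writeln(cast(immutable(ubyte)[]) appendCodeUnit(trail)); // [164]
}
```

The first line of output matches the unexpected leading pair (195, 131) reported above, which is consistent with the first appended value having been treated as a code point rather than as a byte.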
|
February 22, 2011 Re: Read non-UTF8 file | ||||
---|---|---|---|---|
| ||||
Posted in reply to Stewart Gordon | == Quote from Stewart Gordon (smjg_1998@yahoo.com)'s article
> What compiler version/platform are you using? I had to fix some errors before it would
> compile on mine (1.066/2.051 Windows).
> On 19/02/2011 13:42, Nrgyzer wrote:
> <snip>
> > Now... and with writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the following:
> >
> > [195, 131, 164]
> > [195, 131, 182]
> > [195, 131, 188]
> It took a while for me to make sense of what's going on!
> The expressions (0xC0 | (ch >> 6)) and (0x80 | (ch & 0x3F)) both have type int. It
> appears that, in D2, if you append an int to a string then it treats the int as a Unicode
> codepoint and automagically converts it to UTF-8. But why is it doing it on the first
> byte and not the second? This looks like a bug.
> Casting each UTF-8 byte value to a char
> if (ch < 0x80) {
> result ~= cast(char) ch;
> } else {
> result ~= cast(char) (0xC0 | (ch >> 6));
> result ~= cast(char) (0x80 | (ch & 0x3F));
> }
> gives the expected output
> [195, 164]
> [195, 182]
> [195, 188]
> HTH
> Stewart.
I also wondered because I've used the same code in D1 and it worked without any problems. Anyway... thanks :)
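
For reference, here is a minimal sketch of the whole conversion done on raw bytes instead of on char[]. It assumes the input is ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point; Notepad's "ANSI" is usually Windows-1252 on Western European systems, which agrees with Latin-1 for ä, ö and ü but differs in the range 0x80 to 0x9F. The function name latin1ToUTF8 is hypothetical, and the bytes are hard-coded here rather than read from the file.

```d
import std.stdio : writeln;

// Hypothetical helper, assuming Latin-1 input: each byte value is the
// Unicode code point, so appending it as a dchar yields valid UTF-8.
string latin1ToUTF8(const(ubyte)[] bytes)
{
    string result;
    foreach (b; bytes)
        result ~= cast(dchar) b;
    return result;
}

void main()
{
    // The three bytes reported in the thread: 0xE4, 0xF6, 0xFC.
    // For a real file, the bytes could come from
    // cast(const(ubyte)[]) std.file.read("myFile.ext").
    const(ubyte)[] bytes = [0xE4, 0xF6, 0xFC];
    string utf8 = latin1ToUTF8(bytes);

    // Print the encoded bytes so the result does not depend on the console code page.
    writeln(cast(const(ubyte)[]) utf8);
}
```

Working on ubyte values avoids holding non-UTF-8 data in a char[], which D expects to contain UTF-8.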
|