January 11, 2019 UTFException when reading a file

I am using readText to read a file into a string. I am getting a UTFException on the file. It is probably because the file has an extended ANSI character that is not UTF-8. How can I read the file and convert the string into proper UTF-8 in memory without an exception?

January 11, 2019 Re: UTFException when reading a file

Posted in reply to Head Scratcher

On Friday, 11 January 2019 at 19:45:05 UTC, Head Scratcher wrote:
> How can I read the file and convert the string into proper UTF-8 in memory without an exception?

Use regular read() instead of readText, and then convert it using another function.

Phobos has std.encoding, which offers a transcode function: http://dpldocs.info/experimental-docs/std.encoding.transcode.html

You would cast to the input type:

---
import std.encoding;
import std.file;

void main() {
    string s;
    // the read here replaces your readText
    // and the cast tells what encoding it has now
    transcode(cast(Latin1String) read("ooooo.d"), s);

    import std.stdio;
    // and after that, the utf-8 string is in s
    writeln(s);
}
---

Or, since I didn't like the Phobos module for my web scrape needs, I made my own: https://github.com/adamdruppe/arsd/blob/master/characterencodings.d

Just drop that file in your build and call this function: http://dpldocs.info/experimental-docs/arsd.characterencodings.convertToUtf8Lossy.html

---
import arsd.characterencodings;
import std.file;

void main() {
    // read() returns void[]; the cast hands the converter immutable bytes
    string s = convertToUtf8Lossy(cast(immutable(ubyte)[]) read("ooooo.d"), "iso_8859-1");
    // you can now use s
}
---

Just change the encoding string to whatever encoding the file actually has. But it is possible neither my module nor the Phobos one has the encoding you need...

January 11, 2019 Re: UTFException when reading a file

Posted in reply to Head Scratcher

On Fri, Jan 11, 2019 at 07:45:05PM +0000, Head Scratcher via Digitalmars-d-learn wrote:
> I am using readText to read a file into a string. I am getting a
> UTFException on the file. It is probably because the file has an
> extended ANSI character that is not UTF-8.
> How can I read the file and convert the string into proper UTF-8 in
> memory without an exception?

What's the encoding of the file? Without knowing the original encoding, there is no way to get UTF-8 out of it without the risk of some data being lost / garbled.

Take a look at std.encoding to see if your file's encoding is already supported. If not, you may have to read the file in binary and do the conversion into UTF-8 yourself.

Or use an external program to re-encode your file into UTF-8. On Posix systems, the 'recode' utility will help you do this.

T

-- 
To err is human; to forgive is not our policy. -- Samuel Adler

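For the simplest case, the "read the file in binary and do the conversion yourself" route might look like the sketch below. It assumes the file is ISO-8859-1 (Latin-1), where every byte value equals the Unicode code point it stands for. The function name latin1ToUtf8 and the command-line handling are made up for illustration. Windows-1252 differs from Latin-1 in the 0x80-0x9F range, so this would not be correct for such files.

```d
import std.array : appender;
import std.file : read;
import std.stdio : writeln;

/// Converts ISO-8859-1 (Latin-1) bytes to a UTF-8 string.
/// Hypothetical helper: each Latin-1 byte is treated as the code point
/// with the same numeric value.
string latin1ToUtf8(const(ubyte)[] bytes)
{
    auto result = appender!string();
    result.reserve(bytes.length);
    foreach (b; bytes)
        result.put(cast(dchar) b); // the appender encodes each dchar as UTF-8
    return result.data;
}

void main(string[] args)
{
    // Assumed usage: the file name is passed as the first argument.
    if (args.length < 2)
        return;
    auto bytes = cast(const(ubyte)[]) read(args[1]);
    writeln(latin1ToUtf8(bytes));
}
```

Any other single-byte encoding would need a lookup table from byte value to code point in place of the direct cast.
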
January 11, 2019 Re: UTFException when reading a file

Posted in reply to Head Scratcher

On Friday, 11 January 2019 at 19:45:05 UTC, Head Scratcher wrote:
> How can I read the file and convert the string into proper UTF-8 in memory without an exception?

You have multiple options:

```
import std.file: read;
import std.encoding: transcode, Windows1252String;

auto ansiStr = cast(Windows1252String) read(filename);
string utf8string;
transcode(ansiStr, utf8string);
```
If it's ANSI.

```
import std.file: read;
import std.encoding: sanitize;

auto sanitized = (cast(string) read(filename)).sanitize;
```
If it's incorrect UTF8, eager

```
import std.exception: handle, RangePrimitive;
import std.utf: UTFException;
import std.range;

// str holds the file contents, e.g. cast(string) read(filename)
auto handled = str.handle!(UTFException, RangePrimitive.access,
        (e, r) => ' '); // Replace invalid code points with spaces
```
If it's incorrect UTF8, lazy

See:
https://dlang.org/phobos/std_encoding.html#transcode
https://dlang.org/phobos/std_exception.html#handle
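
When it is not known up front whether a file is already UTF-8, the options in this thread can be combined: validate first, and transcode only if validation fails. The following is a rough sketch of that idea, not something from the posts above. The function name toUtf8 is made up, and falling back to Windows-1252 is an assumption about the file. It also assumes every byte is defined in Windows-1252, since std.encoding expects validly encoded input.

```d
import std.encoding : transcode, Windows1252String;
import std.file : read;
import std.stdio : write;
import std.utf : validate, UTFException;

/// Returns `raw` as UTF-8: unchanged if it already validates as UTF-8,
/// otherwise reinterpreted as Windows-1252 (an assumed fallback encoding).
string toUtf8(immutable(ubyte)[] raw)
{
    auto text = cast(string) raw;
    try
    {
        validate(text); // throws UTFException on malformed UTF-8
        return text;
    }
    catch (UTFException)
    {
        string converted;
        transcode(cast(Windows1252String) raw, converted);
        return converted;
    }
}

void main(string[] args)
{
    // Assumed usage: the file name is passed as the first argument.
    if (args.length < 2)
        return;
    write(toUtf8(cast(immutable(ubyte)[]) read(args[1])));
}
```

A file in some other legacy encoding that happens not to be valid UTF-8 would be silently misread as Windows-1252 here, which is the data-loss risk mentioned earlier in the thread.
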