Thread overview
UTFException when reading a file
Jan 11, 2019
Head Scratcher
Jan 11, 2019
Adam D. Ruppe
Jan 11, 2019
H. S. Teoh
Jan 11, 2019
Dennis
January 11, 2019
I am using readText to read a file into a string. I am getting a UTFException on the file. It is probably because the file has an extended ANSI character that is not UTF-8.

How can I read the file and convert the string into proper UTF-8 in memory without an exception?

January 11, 2019
On Friday, 11 January 2019 at 19:45:05 UTC, Head Scratcher wrote:
> How can I read the file and convert the string into proper UTF-8 in memory without an exception?

Use regular read() instead of readText, and then convert the result using another function.

Phobos has std.encoding which offers a transcode function:

http://dpldocs.info/experimental-docs/std.encoding.transcode.html

You would cast the raw data to the string type of the input encoding:

---
import std.encoding;
import std.file;

void main() {
        string s;
        // the read here replaces your readText
        // and the cast tells what encoding it has now
        transcode(cast(Latin1String) read("ooooo.d"), s);
        import std.stdio;
        // and after that, the utf-8 string is in s
        writeln(s);
}
---


Or, since I didn't like the Phobos module for my web scraping needs, I made my own:

https://github.com/adamdruppe/arsd/blob/master/characterencodings.d

Just drop that file in your build and call this function:

http://dpldocs.info/experimental-docs/arsd.characterencodings.convertToUtf8Lossy.html

---
import arsd.characterencodings;
import std.file;

void main() {
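     // read returns void[]; the cast assumes the function takes immutable(ubyte)[]
     string s = convertToUtf8Lossy(cast(immutable(ubyte)[]) read("ooooo.d"), "iso_8859-1");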
     // you can now use s
}
---

Just change the encoding string to whatever encoding the file actually uses.



But it is possible neither my module nor the Phobos one has the encoding you need...
January 11, 2019
On Fri, Jan 11, 2019 at 07:45:05PM +0000, Head Scratcher via Digitalmars-d-learn wrote:
> I am using readText to read a file into a string. I am getting a
> UTFException on the file. It is probably because the file has an
> extended ANSI character that is not UTF-8.
> How can I read the file and convert the string into proper UTF-8 in
> memory without an exception?

What's the encoding of the file?  Without knowing the original encoding, there is no way to get UTF-8 out of it without the risk of some data being lost / garbled.
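
To illustrate why the source encoding matters, here is a small sketch using std.encoding. The helper names `fromLatin1` and `fromWindows1252` are made up for this example. The same byte decodes to a different character depending on which encoding is assumed, so a wrong guess garbles the text without raising any error:

```d
import std.encoding : transcode, Latin1String, Windows1252String;

// Hypothetical helper: interpret the raw bytes as ISO-8859-1.
string fromLatin1(immutable(ubyte)[] raw)
{
    string result;
    transcode(cast(Latin1String) raw, result);
    return result;
}

// Hypothetical helper: interpret the raw bytes as Windows-1252.
string fromWindows1252(immutable(ubyte)[] raw)
{
    string result;
    transcode(cast(Windows1252String) raw, result);
    return result;
}

unittest
{
    immutable(ubyte)[] raw = [0x80];
    // Latin-1 maps 0x80 to the control character U+0080,
    // while Windows-1252 maps it to the euro sign U+20AC.
    assert(fromLatin1(raw) == "\u0080");
    assert(fromWindows1252(raw) == "\u20AC");
}
```

One way to try it is `dmd -unittest -main -run file.d`, which runs the unittest block.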

Take a look at std.encoding to see if your file's encoding is already supported. If not, you may have to read the file in binary and do the conversion into UTF-8 yourself. Or use an external program to re-encode your file into UTF-8.  On Posix systems, the 'recode' utility will help you do this.
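
As a rough sketch of what "doing the conversion yourself" could look like, assume the file is ISO-8859-1, where every byte value equals its Unicode code point. The function name `latin1ToUtf8` is hypothetical; other encodings would need a lookup table instead of the direct cast:

```d
import std.array : appender;
import std.utf : encode;

// Hypothetical helper: assumes the input is ISO-8859-1, so each
// byte value is used directly as a Unicode code point.
string latin1ToUtf8(const(ubyte)[] bytes)
{
    auto result = appender!string();
    foreach (b; bytes)
    {
        char[4] buf;
        // encode writes the UTF-8 form of the code point into buf
        // and returns the number of code units used.
        immutable len = encode(buf, cast(dchar) b);
        result.put(buf[0 .. len]);
    }
    return result.data;
}

unittest
{
    ubyte[] input = [0x63, 0x61, 0x66, 0xE9]; // "café" in Latin-1
    assert(latin1ToUtf8(input) == "caf\u00E9");
}
```

With a file, the call would be something like `latin1ToUtf8(cast(const(ubyte)[]) std.file.read(path))`, where `path` is whatever file you are reading.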


T

-- 
To err is human; to forgive is not our policy. -- Samuel Adler
January 11, 2019
On Friday, 11 January 2019 at 19:45:05 UTC, Head Scratcher wrote:
> How can I read the file and convert the string into proper UTF-8 in memory without an exception?

You have multiple options:

```
import std.file: read;
import std.encoding: transcode, Windows1252String;
auto ansiStr = cast(Windows1252String) read(filename);
string utf8string;
transcode(ansiStr, utf8string);
```

If it's ANSI (Windows-1252 in this snippet).

```
import std.file: read;
import std.encoding: sanitize;
auto sanitized = (cast(string) read(filename)).sanitize;
```

If it's incorrect UTF-8 and you want an eager fix (the whole string is sanitized up front).
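
For a concrete picture of the eager option, here is a small sketch with a hard-coded string in place of file contents (the sample string is only an illustration). As far as I know, sanitize replaces each malformed sequence with the replacement character U+FFFD and returns valid input as-is:

```d
import std.encoding : isValid, sanitize;

unittest
{
    // 0xFF can never appear in well-formed UTF-8.
    string broken = "hello\xFFworld";
    assert(!broken.isValid);

    // sanitize replaces the malformed byte with U+FFFD.
    string fixed = broken.sanitize;
    assert(fixed.isValid);
    assert(fixed == "hello\uFFFDworld");

    // Already-valid input comes back unchanged.
    assert("plain ascii".sanitize == "plain ascii");
}
```

Note that this only makes the string valid UTF-8. It does not recover the original characters, so for a file in a known legacy encoding, transcode is the better fit.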

```
import std.exception: handle, RangePrimitive;
import std.file: read;
import std.utf: UTFException;
auto str = cast(string) read(filename);
auto handled = str.handle!(UTFException, RangePrimitive.access,
        (e, r) => ' '); // Replace invalid code points with spaces
```

If it's incorrect UTF-8 and you want a lazy fix (invalid code points are replaced as the range is iterated).

See:
https://dlang.org/phobos/std_encoding.html#transcode
https://dlang.org/phobos/std_exception.html#handle