Files and UTF

In my efforts to learn D I am writing some code to read files in different UTF encodings with the aim of having them end up as UTF-8 internally. As a start I have the following code:

import std.stdio;
import std.file;

void main(string[] args)
{
    if (args.length == 2)
    {
        if (args[1].exists && args[1].isFile)
        {
            auto f = File(args[1]);
            writeln(args[1]);

            for (auto i = 1; i <= 3; ++i)
                write(f.readln);
        }
    }
}

It works well, correctly outputting the file name and the first three lines of the file regardless of the file's encoding. The exception is UTF-16: for both the LE and BE encodings, two characters representing the BOM are printed.

I assume that write detects the encoding of the string returned by readln and prints it correctly rather than readln reading in as a consistent encoding. Is this correct?

Is there a way to remove the BOM from the input buffer and still know the encoding of the file?

Is there a D idiomatic way to do what I want to do?

Mike

August 06, 2020

Re: Files and UTF

Posted by WebFreak001
in reply to Mike Surette

On Wednesday, 5 August 2020 at 17:39:36 UTC, Mike Surette wrote:
> In my efforts to learn D I am writing some code to read files in different UTF encodings with the aim of having them end up as UTF-8 internally. As a start I have the following code:
>
> import std.stdio;
> import std.file;
>
> void main(string[] args)
> {
>     if (args.length == 2)
>     {
>         if (args[1].exists && args[1].isFile)
>         {
>             auto f = File(args[1]);
>             writeln(args[1]);
>
>             for (auto i = 1; i <= 3; ++i)
>                 write(f.readln);
>         }
>     }
> }
>
> It works well, correctly outputting the file name and the first three lines of the file regardless of the file's encoding. The exception is UTF-16: for both the LE and BE encodings, two characters representing the BOM are printed.
>
> I assume that write detects the encoding of the string returned by readln and prints it correctly rather than readln reading in as a consistent encoding. Is this correct?
>
> Is there a way to remove the BOM from the input buffer and still know the encoding of the file?
>
> Is there a D idiomatic way to do what I want to do?
>
> Mike

All strings in D are _assumed_ to be UTF-8, so your I/O reading function needs to check that the data actually is UTF-8. File.readln does not do that, so you are actually getting UTF-16 bytes in your string, not UTF-8 bytes.

What you are seeing through writeln is not fully correct: if you only test English characters, there is a null byte (0) next to each English character byte (after it in UTF-16 LE, before it in UTF-16 BE), and these null bytes aren't rendered in the console.

You can verify this with this simple code:
import std.stdio;
import std.string : representation;
auto s = File("test.txt").readln;
writefln("%(%02x %)", s.representation);

result: ff fe 68 00 65 00 6c 00 6c 00 6f 00 20 00 77 00 6f 00 72 00 6c 00 64 00 21 00 0d 00 0a


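You can see the UTF-16 LE BOM (ff fe) followed by the English characters, each encoded as the character byte followed by a null byte. Note that the final 0a has no trailing 00: readln stops at the first 0a byte, which here is only the first half of the UTF-16 newline.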

Basically, what you want to do is write a function that converts the input encoding to UTF-8 so you can use the result in D strings. If you want to get into this yourself, there is std.encoding, which offers you most of the functionality: https://dlang.org/phobos/std_encoding.html

You can use the getBOM function to try to determine the encoding from the BOM, and then convert from the source encoding to UTF-8 using the `transcode` function or manually using the encoding classes.
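As a rough sketch of the getBOM/transcode route described above: the helper name `stripBOM` and the in-memory sample bytes are made up for illustration, and the byte-order handling is deliberately minimal (it only converts when the file's byte order matches the CPU's).

```d
import std.encoding : BOM, getBOM, transcode;
import std.stdio : writeln;

/// Hypothetical helper: returns the bytes after the BOM and reports
/// the detected schema through the `schema` out parameter.
immutable(ubyte)[] stripBOM(immutable(ubyte)[] bytes, out BOM schema)
{
    auto bom = getBOM(bytes);
    schema = bom.schema;                     // BOM.none if no BOM was found
    return bytes[bom.sequence.length .. $];  // sequence is empty for BOM.none
}

void main()
{
    // Sample data (an assumption): "hi" in UTF-16 LE, preceded by the BOM ff fe
    immutable(ubyte)[] sample = [0xff, 0xfe, 0x68, 0x00, 0x69, 0x00];

    BOM schema;
    auto payload = stripBOM(sample, schema);
    writeln(schema);

    // Only handle the case where no byte swapping is needed.
    version (LittleEndian)
    {
        if (schema == BOM.utf16le)
        {
            auto utf16 = cast(wstring) payload; // reinterpret bytes as UTF-16 code units
            string utf8;
            transcode(utf16, utf8);             // std.encoding conversion to UTF-8
            writeln(utf8);
        }
    }
}
```

This keeps the detected encoding available after the BOM has been removed, which is what the original question asked for.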

If you don't have a BOM, you need some kind of algorithm to determine the encoding of your file. If you don't want to do that and just want to check whether it's UTF-8, use the `std.utf : validate` function, which throws a UTFException if your string is not valid UTF-8 and otherwise does nothing.
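A minimal sketch of the validation check, assuming you already have the raw bytes in memory. The wrapper name `isValidUTF8` and the sample byte arrays are hypothetical; only `validate` and `UTFException` come from std.utf.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

/// Hypothetical helper: true if the bytes are well-formed UTF-8.
bool isValidUTF8(const(ubyte)[] bytes)
{
    try
    {
        // validate throws UTFException on the first malformed sequence
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    const(ubyte)[] plain = [0x68, 0x69];             // "hi" in UTF-8
    const(ubyte)[] utf16 = [0xff, 0xfe, 0x68, 0x00]; // UTF-16 LE bytes with BOM
    writeln(isValidUTF8(plain)); // true
    writeln(isValidUTF8(utf16)); // false: 0xff can never appear in UTF-8
}
```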

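If you only support UTF-8, UTF-16 and possibly UTF-32, you can get by with just the std.utf functions after determining the BOM. These can also convert lazily without allocating memory (e.g. byChar, which is useful if you only go over your string linearly without going back); the example below uses toUTF8 for simplicity, which allocates the resulting string: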



import std;

void main() {
	// readln is actually rather unsafe for this! You should use std.file.read
	// or File.rawRead or read byte chunks instead. For chunking you need to adjust
	// the encode API however and probably make a helper struct with a small buffer.
	string s = File("test.txt").readln;
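	// readln stops at the first 0x0a byte. For UTF-16 LE that is only the first
	// half of the "\n" code unit, so drop it to keep the length a multiple of 2.
	// (For UTF-16 BE this leaves an odd length, one more reason to avoid readln.)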
	if (s.length) s = s[0 .. $ - 1];
	string e = encodeUTF8(s.representation);

	writefln("%s\n(%(%02x %))", s, s.representation);
	writefln("%s\n(%(%02x %))", e, e.representation);
}

string encodeUTF8(immutable(ubyte)[] bytes) {
	auto bom = getBOM(bytes);
	bytes = bytes[bom.sequence.length .. $];

	switch (bom.schema) {
	// optionally we could validate, but we just trust because there is a UTF-8 BOM
	case BOM.utf8: return cast(string)bytes;
	case BOM.utf16le: return convertUTF!wchar(bytes, true);
	case BOM.utf16be: return convertUTF!wchar(bytes, false);
	case BOM.utf32le: return convertUTF!dchar(bytes, true);
	case BOM.utf32be: return convertUTF!dchar(bytes, false);
	default: string input = cast(string)bytes; validate(input); return input;
	}
}

private string convertUTF(T)(scope const(ubyte)[] bytes, bool littleEndian) {
	// T.sizeof expecting 2 or 4 (which divided maps to 0 and 1)
	enum name = ["UTF-16", "UTF-32"][T.sizeof / 4];
	alias Int = AliasSeq!(ushort, uint)[T.sizeof / 4];

	if (bytes.length % T.sizeof != 0)
		throw new Exception("File is " ~ name ~ ", but got "
			~ bytes.length.to!string ~ " bytes, which is not a multiple of "
			~ T.sizeof.to!string ~ "!");

	// copy the code units so the caller's (possibly immutable) bytes are never mutated
	Int[] units = (cast(const(Int)[]) bytes).dup;

	// swap mismatching endianness
	version (LittleEndian) // CPU is little endian, swap if file is big endian
		bool swap = !littleEndian;
	else // CPU is big endian, swap if file is little endian
		bool swap = littleEndian;

	if (swap) swapAllEndian(units);

	// reinterpret the native-endian integers as UTF code units and encode as UTF-8
	return (cast(const(T)[]) units).toUTF8;
}

private void swapAllEndian(T)(T[] data) {
	// TODO: could probably optimize this with SIMD instructions
	foreach (i; 0 .. data.length)
		data[i] = swapEndian(data[i]);
}
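For the lazy, non-allocating conversion mentioned earlier, a small sketch using std.utf.byChar might look like this. It assumes the UTF-16 code units are already in native byte order with the BOM removed; the sample string is made up for illustration.

```d
import std.algorithm.comparison : equal;
import std.range : walkLength;
import std.stdio : writeln;
import std.utf : byChar;

void main()
{
    // Assumed input: UTF-16 code units, native byte order, no BOM
    wstring units = "héllo"w;

    // byChar re-encodes to UTF-8 code units on the fly while iterating;
    // no new string is allocated
    auto utf8 = units.byChar;

    writeln(utf8.walkLength);            // 6: 'é' takes two UTF-8 code units
    writeln(utf8.equal("héllo".byChar)); // true
}
```

This suits a single linear pass over the data; if you need a `string` you can store and index, `toUTF8` as used above is the simpler choice.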



Example library if you want to guess encoding without BOM: https://code.dlang.org/packages/libguess-d

Used API docs:
https://dlang.org/phobos/std_bitmanip.html#swapEndian <- swapping BE/LE for native encoding
https://dlang.org/phobos/std_encoding.html <- BOM detection, transcoding capabilities for encodings other than UTF
https://dlang.org/phobos/std_utf.html <- low level UTF-8 encoding/decoding, lazy decoding, validation
