std.file: read, readText and UTF-8 decoding

Sep 21

Uranuz

Addition:
The current solution I have found for this problem is to check for the BOM manually: I get the length of bom.sequence and remove that many items from the beginning. But I don't think it's a convenient solution, because who knows how many other issues with UTF could come up. And I don't think it's right to handle them on the side of users of the standard D library... I think there should be a solution "out of the box". It might not be very efficient, but it should at least "just work" without extra steps...
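    import std.encoding : getBOM;
    import std.file : read;
    import std.stdio : writeln;

    string[] getGroupsFromFile(string groupFilePath)
    {
        writeln("Parse file " ~ groupFilePath);
        string[] groupNames = [];
        // read() returns void[]; nothing is decoded or validated here.
        auto rawContent = cast(ubyte[]) read(groupFilePath);
        // With no BOM present, bom.sequence is empty and nothing is skipped.
        auto bom = getBOM(rawContent);
        // Drop the BOM bytes; the rest is assumed to be UTF-8.
        string content = cast(string) rawContent[bom.sequence.length .. $];
        writeln("Content:\n" ~ content);

        //... work with XML
        return groupNames;
    }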

September 21

Re: std.file: read, readText and UTF-8 decoding

Posted by Jonathan M Davis
in reply to Uranuz

On Thursday, September 21, 2023 9:29:17 AM MDT Uranuz via Digitalmars-d-learn wrote:
> Hello!
> I have a strange problem. I am trying to parse XML files and
> extract some information from them.
> I use the dxml library by Jonathan M Davis for this. But my
> problem is that I have multiple XML files made by different people
> around the world. Some of these files were created with a Byte Order
> Mark, and some of them without a BOM. dxml expects no BOM at the
> start of the string.
> At first I tried to read the file with std.file.readText. It looks like
> it doesn't decode the file in any way and doesn't remove the BOM, so dxml
> then failed to parse it. This looks strange to me, because I
> expected that a "text" function would decode the data to UTF-8. Then I
> read that this behavior is at least documented:
> """
> ...However, no width or endian conversions are performed. So, if
> the width or endianness of the characters in the given file
> differ from the width or endianness of the element type of S,
> then validation will fail.
> """
> So it's OK. But I understood that the function "readText" is not
> useful for me.
> So I tried to use plain "read", which returns "void[]". The problem
> is that I still don't understand which method I should use to
> convert this to string[] with proper UTF-8 decoding, BOM removal,
> etc.
> Could you please help me clear this up?
> P.S. The function readText looks odd in std.file, because you cannot
> specify any encoding to decode the file with. And the logic of how it
> decodes is unclear...

readText works great as long as you know that you're dealing with files with a specific encoding and without a BOM (which is very often true when dealing with text files on *nix systems where they're using UTF-8), but it's not so great when you're reading files where you have no clue what their encoding is going to be (and it's worse on Windows where they unfortunately are much more likely to be UTF-16). Phobos does give you the tools to solve the problem, but it doesn't currently make it as easy as it arguably should be. std.encoding has the pieces that you're missing here.

https://dlang.org/phobos/std_encoding.html#BOM
https://dlang.org/phobos/std_encoding.html#getBOM

You'll need to do something like

    import std.encoding : BOM, getBOM;
    import std.file : read;

    auto data = read(file);
    immutable bom = getBOM(cast(ubyte[])data).schema;

to get the BOM. Then you can compare the BOM against BOM.utf8, BOM.utf16le, etc. so that you know what type to cast the data array to (string, wstring, etc.). Then you can remove the BOM with something like

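    import std.range.primitives : ElementType, empty, isForwardRange, save;
    import std.traits : isSomeChar;

    R stripBOM(R)(R range)
        if(isForwardRange!R && isSomeChar!(ElementType!R))
    {
        import std.utf : decodeFront, UseReplacementDchar;
        if(range.empty)
            return range;
        // Keep a copy so the range can be returned untouched if there is no BOM.
        auto orig = range.save;
        immutable c = range.decodeFront!(UseReplacementDchar.yes)();
        return c == '\uFEFF' ? range : orig;
    }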

And then you either operate on the array with its current encoding type, convert it to the desired string type (e.g. to!string) or wrap it in a type that converts it as you parse it (e.g. std.utf.byChar).
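As a rough sketch of how these pieces could fit together, the function below takes the bytes returned by read, looks at the BOM, and produces a UTF-8 string with the BOM removed. The name decodeToUTF8 is hypothetical (nothing like it exists in Phobos), it assumes that a file without a BOM is UTF-8, and it only covers UTF-8 and UTF-16; anything else is rejected.

```d
import std.conv : to;
import std.encoding : BOM, getBOM;
import std.exception : enforce;
import std.utf : validate;

// Hypothetical helper, not part of Phobos. Turns raw file bytes into a UTF-8
// string with the BOM removed. Assumption: a file without a BOM is UTF-8.
string decodeToUTF8(const(ubyte)[] data)
{
    immutable bom = getBOM(data);
    // bom.sequence is empty for BOM.none, so this slice is then a no-op.
    const(ubyte)[] payload = data[bom.sequence.length .. $];

    switch (bom.schema)
    {
        case BOM.none:
        case BOM.utf8:
        {
            auto chars = cast(const(char)[]) payload;
            validate(chars); // throws UTFException on malformed UTF-8
            return chars.idup;
        }

        case BOM.utf16le:
        case BOM.utf16be:
        {
            enforce(payload.length % 2 == 0, "truncated UTF-16 data");
            immutable bool little = bom.schema == BOM.utf16le;
            auto units = new wchar[payload.length / 2];
            foreach (i, ref unit; units)
            {
                // Assemble each code unit by hand so the result does not
                // depend on the endianness of the machine running the code.
                immutable uint lo = payload[2 * i + (little ? 0 : 1)];
                immutable uint hi = payload[2 * i + (little ? 1 : 0)];
                unit = cast(wchar) (lo | (hi << 8));
            }
            return units.to!string; // transcodes UTF-16 to UTF-8
        }

        default:
            throw new Exception("unsupported encoding: " ~ bom.schema.to!string);
    }
}
```

Something like `decodeToUTF8(cast(const(ubyte)[]) read(path))` could then be handed to the parser. Converting everything to string up front costs a copy, which is the trade-off against wrapping the data in a lazy range such as std.utf.byChar.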

Alternatively, you can just read the very beginning of the file and grab the BOM that way and then call readText with the correct type after you've figured out the file's encoding.
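A sketch of that alternative might look like the following. The name readTextAnyBOM is hypothetical, a file without a BOM is assumed to be UTF-8, and because readText does no endian conversion, UTF-16 is only accepted here when its byte order matches the machine running the code.

```d
import std.conv : to;
import std.encoding : BOM, getBOM;
import std.file : readText;
import std.stdio : File;

// The UTF-16 flavour that readText!wstring can read without byte swapping.
version (LittleEndian) enum nativeUTF16 = BOM.utf16le;
else enum nativeUTF16 = BOM.utf16be;

// Hypothetical helper, not part of Phobos. Peeks at the first bytes of the
// file to choose the type for readText. Assumption: no BOM means UTF-8.
string readTextAnyBOM(string path)
{
    ubyte[4] buf;
    // rawRead returns the slice it actually filled, which may be shorter
    // than 4 bytes (or empty) for a very small file.
    auto head = File(path, "rb").rawRead(buf[]);
    immutable bom = getBOM(head);

    switch (bom.schema)
    {
        case BOM.none:
            return readText(path);

        case BOM.utf8:
            // readText validates the data but keeps the BOM, so slice it off.
            return readText(path)[bom.sequence.length .. $];

        case BOM.utf16le:
        case BOM.utf16be:
            if (bom.schema != nativeUTF16)
                throw new Exception("UTF-16 with non-native byte order: " ~ path);
            // In UTF-16 the BOM is a single code unit.
            return readText!wstring(path)[1 .. $].to!string;

        default:
            throw new Exception("unsupported encoding: " ~ bom.schema.to!string);
    }
}
```

This opens the file twice, but the first read only touches a few bytes, so the full file is still read and validated once.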

readText should currently handle the BOM correctly insofar as it checks whether you made the correct choice when you told it whether you wanted a string, wstring, etc., but since it reads in the entire file, it's not a great plan to try it with each encoding (catching each exception in turn) until you get the right one, and it doesn't strip the BOM off for you.

So, Phobos probably should get some new functionality to handle this better, but it's at least possible to make it work with what's there.

- Jonathan M Davis

On Friday, September 22, 2023 12:28:39 AM MDT Uranuz via Digitalmars-d-learn wrote:
> OK. Thanks for the response. I wish there were some API to
> handle it "out of the box". Do I need to write an issue or
> something so that this isn't forgotten?

You can open an issue if you want, though I don't know how much that will help it be remembered given how many issues there are to sort through. I'll probably get around to writing something eventually (particularly since this issue is more likely to come up when using dxml than with many other use cases), but I have a variety of items on my todo list.

- Jonathan M Davis
