[Issue 15949] New: Improve readtext handling of byte order mark (BOM)

April 21, 2016

Posted by Jesse.K.Phillips+D@gmail.com

https://issues.dlang.org/show_bug.cgi?id=15949

          Issue ID: 15949
           Summary: Improve readtext handling of byte order mark (BOM)
           Product: D
           Version: D2
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P1
         Component: phobos
          Assignee: nobody@puremagic.com
          Reporter: Jesse.K.Phillips+D@gmail.com

Problem:

I've hit this many times on Windows. I try to read in a file with std.file.readText and get: "Syntax error at line 0"

This is because some Microsoft program has decided to insert a UTF-8 Byte Order Mark (BOM) at the beginning of the file (0xEF 0xBB 0xBF). That said, readText really shouldn't automatically convert a file's content based on the BOM it finds.

Suggested fix:

I think readText should validate and skip the BOM. It should check that the BOM
is not UTF-16LE (0xFF 0xFE), UTF-16BE (0xFE 0xFF), UTF-32LE (0xFF 0xFE 0x00
0x00), or UTF-32BE (0x00 0x00 0xFE 0xFF). If it is one of those, it should
throw an exception stating that the file being read uses one of those
encodings and will not be converted to a UTF-8 string.
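
The check described above could be sketched roughly as follows. This is only an illustration working on raw bytes, not the Phobos implementation: the function name decodeUtf8SkippingBom and the exception messages are made up for this example. Note that the UTF-32LE test has to come before the UTF-16LE test, because the UTF-32LE BOM starts with the same two bytes.

```d
import std.algorithm.searching : startsWith;
import std.utf : validate;

/// Hypothetical helper, not part of Phobos: rejects UTF-16/UTF-32 BOMs,
/// skips a UTF-8 BOM, and returns the remaining bytes as validated UTF-8.
string decodeUtf8SkippingBom(const(ubyte)[] bytes)
{
    static immutable ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] utf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] utf16le = [0xFF, 0xFE];
    static immutable ubyte[] utf16be = [0xFE, 0xFF];
    static immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF];

    // UTF-32LE must be tested before UTF-16LE: its BOM begins with FF FE.
    if (bytes.startsWith(utf32le) || bytes.startsWith(utf32be))
        throw new Exception("File has a UTF-32 BOM; not converting to UTF-8");
    if (bytes.startsWith(utf16le) || bytes.startsWith(utf16be))
        throw new Exception("File has a UTF-16 BOM; not converting to UTF-8");

    // A UTF-8 BOM carries no byte-order information, so just drop it.
    if (bytes.startsWith(utf8))
        bytes = bytes[utf8.length .. $];

    auto text = cast(string) bytes.idup;
    validate(text); // throws UTFException on malformed UTF-8
    return text;
}
```

To try it on a file, the bytes could come from std.file.read, e.g. decodeUtf8SkippingBom(cast(const(ubyte)[]) read(name)).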

The corresponding std.file.readText!wstring and std.file.readText!dstring should perform equivalent validation. If the byte order can be changed at no cost, then that should be done.
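
For the wstring case, a rough sketch of what "equivalent validation" plus byte-order handling might look like is below. Again this is an assumption-laden illustration, not Phobos code: decodeUtf16HandlingBom is a made-up name, data without a BOM is assumed to be in native byte order, and the byte swap here costs one pass over the data rather than being free.

```d
import std.algorithm.searching : startsWith;
import std.utf : validate;

/// Hypothetical helper, not part of Phobos: rejects UTF-8/UTF-32 BOMs,
/// honours a UTF-16 BOM, and returns validated UTF-16 text.
wstring decodeUtf16HandlingBom(const(ubyte)[] bytes)
{
    static immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] utf32be = [0x00, 0x00, 0xFE, 0xFF];

    // Caveat: UTF-16LE text whose first character is U+0000 looks like a
    // UTF-32LE BOM and is rejected by this check.
    if (bytes.startsWith(utf8))
        throw new Exception("File has a UTF-8 BOM; not converting to UTF-16");
    if (bytes.startsWith(utf32le) || bytes.startsWith(utf32be))
        throw new Exception("File has a UTF-32 BOM; not converting to UTF-16");

    bool bigEndian = false;
    version (BigEndian) bigEndian = true; // assumption: no BOM means native order

    if (bytes.length >= 2)
    {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            bigEndian = true;
            bytes = bytes[2 .. $];
        }
        else if (bytes[0] == 0xFF && bytes[1] == 0xFE)
        {
            bigEndian = false;
            bytes = bytes[2 .. $];
        }
    }

    if (bytes.length % 2 != 0)
        throw new Exception("UTF-16 data must have an even number of bytes");

    // Assemble code units explicitly so the result does not depend on the
    // byte order of the machine running this code.
    auto units = new wchar[bytes.length / 2];
    foreach (i, ref u; units)
    {
        immutable int a = bytes[2 * i];
        immutable int b = bytes[2 * i + 1];
        u = cast(wchar)(bigEndian ? ((a << 8) | b) : ((b << 8) | a));
    }

    auto text = cast(wstring) units; // units is not referenced elsewhere
    validate(text); // throws UTFException on malformed UTF-16
    return text;
}
```

A dstring version would follow the same pattern with four-byte code units and the UTF-32 BOMs accepted instead of rejected.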


1. https://en.wikipedia.org/wiki/Byte_order_mark
