Thread overview: Auto-UTF-detection - Feature Request

Jul 25, 2004  Arcane Jill
Jul 25, 2004  Arcane Jill
Jul 25, 2004  J C Calvarese
Jul 26, 2004  Walter
Jul 26, 2004  Walter
Jul 26, 2004  Arcane Jill
Jul 26, 2004  Walter
Jul 26, 2004  Arcane Jill
Jul 26, 2004  Arcane Jill
Jul 26, 2004  Walter
Jul 26, 2004  Arcane Jill
Jul 27, 2004  James McComb
Jul 26, 2004  James McComb
July 25, 2004
In the source text analysis phase, the compiler does this (according to the
manual):

"The source text is assumed to be in UTF-8, unless one of the following BOMs (Byte Order Marks) is present at the beginning of the source text".

However, it is heuristically possible to distinguish between the various UTFs even /without/ a BOM. Okay, so it is /theoretically/ possible for an ambiguity to exist, but those edge cases are going to be almost infinitesimally rare for text files in general, and I'd say zero for D source files (which will consist mostly of ASCII characters). I say, try to auto-detect the difference.

Here's how ya do it:

Since a D source file consists mostly of non-NULL ASCII characters, any 4-byte-aligned fragment of it is likely to look like one of these, where xx stands for any non-zero byte, and ?? stands for any byte at all:

#    xx xx xx xx   likely to be UTF-8
#    xx 00 xx 00   likely to be UTF-16LE
#    00 xx 00 xx   likely to be UTF-16BE
#    xx 00 00 00   likely to be UTF-32LE
#    00 00 00 xx   likely to be UTF-32BE
#
#    00 ?? ?? ??   definitely not UTF-8
#    ?? 00 ?? ??   definitely not UTF-8
#    ?? ?? 00 ??   definitely not UTF-8
#    ?? ?? ?? 00   definitely not UTF-8
#
#    00 00 ?? ??   definitely not UTF-16LE or UTF-16BE
#    ?? ?? 00 00   definitely not UTF-16LE or UTF-16BE
#
#    xx ?? ?? ??   definitely not UTF-32BE
#    ?? ?? ?? xx   definitely not UTF-32LE

Simply by analysing a few such four-byte chunks (say, the first 1024 bytes of the file) and counting how many fit each pattern, you can easily determine the most likely encoding. This is a statistical test, obviously, since not /all/ bytes will be ASCII, but it will catch all but a few extreme edge cases. If you haven't made your mind up within the first 1024 bytes of the file, read the /next/ 1024 bytes and try again, and so on.
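
In code, the counting pass might look something like this (an untested sketch only - the tie-breaking and the read-another-1024-bytes retry are omitted):

#    // Untested sketch of the chunk-counting heuristic described above.
#    // Scores each aligned 4-byte chunk against the patterns and returns
#    // the encoding with the highest score.
#    enum Encoding { UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE }
#
#    Encoding guessEncoding(ubyte[] s)
#    {
#        int n8, n16le, n16be, n32le, n32be;
#        for (size_t i = 0; i + 4 <= s.length; i += 4)
#        {
#            ubyte a = s[i], b = s[i+1], c = s[i+2], d = s[i+3];
#            if (a && b && c && d)    ++n8;     // xx xx xx xx
#            if (a && !b && c && !d)  ++n16le;  // xx 00 xx 00
#            if (!a && b && !c && d)  ++n16be;  // 00 xx 00 xx
#            if (a && !b && !c && !d) ++n32le;  // xx 00 00 00
#            if (!a && !b && !c && d) ++n32be;  // 00 00 00 xx
#        }
#        int best = n8;
#        Encoding e = Encoding.UTF_8;
#        if (n16le > best) { best = n16le; e = Encoding.UTF_16LE; }
#        if (n16be > best) { best = n16be; e = Encoding.UTF_16BE; }
#        if (n32le > best) { best = n32le; e = Encoding.UTF_32LE; }
#        if (n32be > best) { best = n32be; e = Encoding.UTF_32BE; }
#        return e;
#    }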

Alternatively, if that sounds too hard, there's an even easier (but less
efficient) algorithm:

1) Assume UTF-32LE. Validate.
2) Assume UTF-32BE. Validate.
3) Assume UTF-16LE. Validate.
4) Assume UTF-16BE. Validate.
5) Assume UTF-8.    Validate.

If precisely one of these validations succeeds, you've sussed it. If more than one succeeds, it's still ambiguous, but the chances of this happening are microscopic. If none succeed, the source file is not UTF.
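
In code, that could be as simple as the sketch below, where the validUtf*() functions are hypothetical placeholders for real UTF validators, each returning true if and only if the entire buffer decodes without error (reusing the Encoding enum from the sketch above):

#    // Untested sketch. The validUtf*() functions are hypothetical
#    // placeholders for real UTF decoders/validators.
#    // Returns the number of encodings that validated; e is the last match.
#    int detectByValidation(ubyte[] s, out Encoding e)
#    {
#        int matches = 0;
#        if (validUtf32LE(s)) { e = Encoding.UTF_32LE; ++matches; }
#        if (validUtf32BE(s)) { e = Encoding.UTF_32BE; ++matches; }
#        if (validUtf16LE(s)) { e = Encoding.UTF_16LE; ++matches; }
#        if (validUtf16BE(s)) { e = Encoding.UTF_16BE; ++matches; }
#        if (validUtf8(s))    { e = Encoding.UTF_8;    ++matches; }
#        return matches; // 1: sussed it; 0: not UTF; >1: still ambiguous
#    }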

Arcane Jill


July 25, 2004
Actually, it's just occurred to me that it's /really easy/ to tell the encoding of a D source file, because the very first character of a D source file MUST be either a UTF BOM or a non-NULL ASCII character. So all we have to do is test for each of these contingencies. Here's a short function that does just that:

#    // Input: s: the first four bytes of the D source file
#    // Output: the D source file encoding
#
#    enum Encoding { UNKNOWN, UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE }
#
#    Encoding determineDSourceEncoding(ubyte[] s)
#    in
#    {
#        assert(s.length >= 4);
#    }
#    body
#    {
#        ubyte a = s[0];
#        ubyte b = s[1];
#        ubyte c = s[2];
#        ubyte d = s[3];
#
#        if (a==0xFF && b==0xFE && c==0x00 && d==0x00) return Encoding.UTF_32LE; // BOM
#        if (a==0x00 && b==0x00 && c==0xFE && d==0xFF) return Encoding.UTF_32BE; // BOM
#        if (a==0xFF && b==0xFE)                       return Encoding.UTF_16LE; // BOM
#        if (a==0xFE && b==0xFF)                       return Encoding.UTF_16BE; // BOM
#        if (a==0xEF && b==0xBB && c==0xBF)            return Encoding.UTF_8;    // BOM
#        if (b==0x00 && c==0x00 && d==0x00)            return Encoding.UTF_32LE; // ASCII
#        if (a==0x00 && b==0x00 && c==0x00)            return Encoding.UTF_32BE; // ASCII
#        if (b==0x00)                                  return Encoding.UTF_16LE; // ASCII
#        if (a==0x00)                                  return Encoding.UTF_16BE; // ASCII
#        return Encoding.UTF_8;                                                  // ASCII
#    }

This is an important issue, because I just did a quick test. Using my favorite text editor (TextPad), I saved a text file in UTF-16LE. I then examined the saved file with a hex editor. I can confirm that the file was saved in UTF-16LE, but, critically, /without a BOM/. I don't know what other text editors do, but clearly, if there is even a remote chance that source files will get saved without a BOM, then we really ought to be able to compile those source files!

Arcane Jill


July 25, 2004
Arcane Jill wrote:
...
> This is an important issue, because I just did a quick test. Using my favorite
> text editor (TextPad), I saved a text file in UTF-16LE. I then examined the
> saved file with a hex editor. I can confirm that the file was saved in UTF-16LE,
> but, critically, /without a BOM/. I don't know what other text editors do, but
> clearly, if there is even a remote chance that source files will get saved
> without a BOM, then we really ought to be able to compile those source files!
> 
> Arcane Jill

All is not lost. I downloaded and installed TextPad. I observed the problem of no BOMs, but I also found a solution:

* Choose Configure/Preferences from the menu bar.
* Find Document Classes/Default in the tree on the left (I guess you might have to choose something else here if you've set up a "D mode").

Document class options
[x] Write Unicode and UTF-8 BOM

Now when you save again, it'll add the BOMs.

Unicode:               FF FE    (UTF-16 LE)
Unicode (big endian):  FE FF    (UTF-16 BE)
UTF-8:                 EF BB BF (UTF-8)

So there's no problem using TextPad. (I don't know why the BOMs wouldn't be enabled by default, but that's a whole other issue.)

The BOMs are standard, right? If a supposedly Unicode-capable editor won't add BOMs, it might not really be considered Unicode-capable. If a person wants to use one of those editors, fine. But please stick to UTF-8.

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
July 26, 2004
Arcane Jill wrote:
> In the source text analysis phase, the compiler does this (according to the
> manual):
> 
> "The source text is assumed to be in UTF-8, unless one of the following BOMs
> (Byte Order Marks) is present at the beginning of the source text".

I like this rule. It says that D is not going to try to *guess* the character encoding of the file. Okay, maybe Walter can write some code that can guess the encoding correctly 99% of the time, but I don't think it's worth complicating the compiler and slightly increasing compile times just to handle missing BOMs.

Here's why: Almost all the time, code will be written in UTF-8. If someone has gone to the trouble of writing their code in UTF-16 or UTF-32, they can go to the trouble of including a BOM in their file. After all, people are advised to use a BOM in those circumstances anyway.

James McComb
July 26, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:ce1848$2p2j$1@digitaldaemon.com...
> This is an important issue, because I just did a quick test. Using my
favorite
> text editor (TextPad), I saved a text file in UTF-16LE. I then examined
the
> saved file with a hex editor. I can confirm that the file was saved in
UTF-16LE,
> but, critically, /without a BOM/. I don't know what other text editors do,
but
> clearly, if there is even a remote chance that source files will get saved without a BOM, then we really ought to be able to compile those source
files!

Ack. What are the TextPad programmers thinking? They need to fix TextPad to put out the BOM.


July 26, 2004
"J C Calvarese" <jcc7@cox.net> wrote in message news:ce1d18$2sa5$1@digitaldaemon.com...
> So there's no problem using TextPad. (I don't know why the BOMs wouldn't be enabled by default, but that's a whole other issue.)

Send them a bug report!

> The BOMs are standard, right?

Yes.

> If a supposedly Unicode-capable editor won't add BOMs, it might not really
> be considered Unicode-capable. If a person wants to use one of those
> editors, fine. But please stick to UTF-8.


July 26, 2004
In article <ce1848$2p2j$1@digitaldaemon.com>, Arcane Jill says...

Revised source - now handles files which are less than four bytes long:

>#    // Input: s: the first four bytes of the D source file,
>#    //        (or all of them, if the file size is less than four bytes).
>#    // Output: the D source file encoding
>#
>#    enum Encoding { UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE }
>#
>#    Encoding determineDSourceEncoding(ubyte[] s)
>#    {
>#        ubyte a = s.length >= 1 ? s[0] : 1;
>#        ubyte b = s.length >= 2 ? s[1] : 1;
>#        ubyte c = s.length >= 3 ? s[2] : 1;
>#        ubyte d = s.length >= 4 ? s[3] : 1;
>#
>#        if (a==0xFF && b==0xFE && c==0x00 && d==0x00) return Encoding.UTF_32LE;
>#        if (a==0x00 && b==0x00 && c==0xFE && d==0xFF) return Encoding.UTF_32BE;
>#        if (a==0xFF && b==0xFE)                       return Encoding.UTF_16LE;
>#        if (a==0xFE && b==0xFF)                       return Encoding.UTF_16BE;
>#        if (a==0xEF && b==0xBB && c==0xBF)            return Encoding.UTF_8;
>#        if (b==0x00 && c==0x00 && d==0x00)            return Encoding.UTF_32LE;
>#        if (a==0x00 && b==0x00 && c==0x00)            return Encoding.UTF_32BE;
>#        if (b==0x00)                                  return Encoding.UTF_16LE;
>#        if (a==0x00)                                  return Encoding.UTF_16BE;
>#        return Encoding.UTF_8;
>#    }

Come on, that's a /tiny/ function, and I've written it all for you. The "overhead" is to call it /once/ during the source text stage (and that's /instead of/, not as well as, the current detection routine). As its input, it needs only the first four bytes of the source file (fewer if the file size is less than four bytes).
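
A call site could be as simple as this sketch (it slurps the whole file with std.file just for brevity - reading only the first four bytes would obviously do):

#    // Sketch of a call site: take the first four bytes of the file
#    // (or fewer, for tiny files) and ask.
#    import std.file;
#
#    Encoding encodingOf(char[] filename)
#    {
#        ubyte[] s = cast(ubyte[]) std.file.read(filename);
#        if (s.length > 4)
#            s = s[0 .. 4];
#        return determineDSourceEncoding(s);
#    }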

You are correct that many applications out there are buggy when it comes to Unicode, and equally correct that we should complain about that. But I've just given you a new /feature/ which is almost trivially small and which you can boast about freely - and which will save you from receiving a few misdirected bug reports in the future.

Not even worth thinking about?

Arcane Jill


July 26, 2004
In article <ce2due$jos$1@digitaldaemon.com>, Arcane Jill says...

>Revised source - now handles files which are less than four bytes long:

I should have written that in C, shouldn't I? Never mind - I think just replacing "ubyte" with "unsigned char" should do it.

Jill


July 26, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:ce2due$jos$1@digitaldaemon.com...
> Come on, that's a /tiny/ function, and I've written it all for you. The
> "overhead" is to call it /once/ during the source text stage (and that's
> /instead of/, not as well as, the current detection routine). As its input,
> it needs only the first four bytes of the source file (fewer if the file
> size is less than four bytes).
>
> You are correct that many applications out there are buggy when it comes to
> Unicode, and equally correct that we should complain about that. But I've
> just given you a new /feature/ which is almost trivially small and which
> you can boast about freely - and which will save you from receiving a few
> misdirected bug reports in the future.
>
> Not even worth thinking about?

I'd already added it to my todo list, Jill <g>. But these things sometimes have hidden gotchas, so I wanted to let it simmer for a bit. I've put in stuff too quickly before, and had to back it out later :-(


July 26, 2004
In article <ce21g6$93r$1@digitaldaemon.com>, Walter says...

>Ack. What are the TextPad programmers thinking? They need to fix TextPad to put out the BOM.

This is incorrect. UTF-16LE does not require a BOM.

Almost all questions regarding UTFs and BOMs can be answered by heading over to the Unicode web site (www.unicode.org) and clicking on "FAQ". I cite from this FAQ here:

"In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor /permitted/" (italics present in original FAQ).

This doesn't apply to D source code, of course, since D source files are not "marked" in any way. However, read that FAQ: everything it says about BOMs tells you that they are "useful". Nowhere does it say they are "required" - and, as noted, in some cases they are even prohibited.

Fortunately, since D syntax requires that the first character of a D source file /must/ be an ASCII character, detecting the encoding is quick and easy. See my other posts on this thread for source code which works.

You *WILL* encounter BOM-less source files in the wild. Insisting that the BOM be there is in defiance of Unicode rules, and is just going to cripple DMD.

Arcane Jill

