utf.d update
July 28, 2004
I needed some new features for the readf work I've been doing.  I think they will be useful in general as I suspect it will become pretty common to want to encode or decode directly to a stream.  Here are the new prototypes:

// for all char types CharT
bit decode(out dchar val, bit delegate(out CharT) get)
dchar decode(bit delegate(out CharT) get)
void encode(bit delegate(CharT) put, dchar c)

The decode overload returning a bit returns false only if the first call to get fails (i.e. the stream is already at EOF), and throws on any other failure.  The remaining calls throw in all the same circumstances as the original calls.  All decode and encode functions have been rewritten in terms of these new functions.  The old functions retain their weak guarantee, while the new functions necessarily have only the basic guarantee.
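
To make the EOF contract concrete, here is a rough sketch of what a delegate-driven UTF-8 decode might look like. It is not the code from utf.d: it is written in present-day D (bool instead of bit), it throws a plain Exception where the real module presumably uses its own error type, and decodeAll is a hypothetical helper added only to show a caller.

```d
// Sketch only: returns false if the very first get() fails (clean EOF),
// and throws on every other failure, as described above.
bool decode(out dchar val, scope bool delegate(out char) get)
{
    char b;
    if (!get(b))
        return false;               // EOF before any code unit was read

    if (b < 0x80)                   // single-byte (ASCII) case
    {
        val = b;
        return true;
    }

    uint n;                         // number of continuation bytes expected
    uint c;                         // code point accumulator
    if ((b & 0xE0) == 0xC0)      { n = 1; c = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { n = 2; c = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { n = 3; c = b & 0x07; }
    else
        throw new Exception("invalid UTF-8 lead byte");

    foreach (i; 0 .. n)
    {
        // EOF in the middle of a sequence is an error, not a clean stop.
        if (!get(b) || (b & 0xC0) != 0x80)
            throw new Exception("truncated or malformed UTF-8 sequence");
        c = (c << 6) | (b & 0x3F);
    }

    // Reject overlong forms, surrogates and values past U+10FFFF.
    static immutable uint[4] minValue = [0, 0x80, 0x800, 0x10000];
    if (c < minValue[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        throw new Exception("invalid code point in UTF-8 sequence");

    val = cast(dchar) c;
    return true;
}

// Hypothetical caller: pulls code units from a string through a delegate.
dstring decodeAll(string s)
{
    size_t i = 0;
    bool get(out char ch)
    {
        if (i == s.length)
            return false;
        ch = s[i++];
        return true;
    }

    dchar[] result;
    dchar d;
    while (decode(d, &get))
        result ~= d;
    return result.idup;
}
```

Because the caller only sees false at a code point boundary, a loop like the one in decodeAll can tell "nothing left to read" apart from "the stream ended halfway through a character".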

http://home.f4.ca/sean/d/utf.d



July 28, 2004
In article <ce8sac$hhq$1@digitaldaemon.com>, Sean Kelly says...

>http://home.f4.ca/sean/d/utf.d

In your code:

bit isValidDchar(dchar c)
{
    return c < 0xD800 ||
	(c > 0xDFFF && c <= 0x10FFFF && c != 0xFFFE && c != 0xFFFF);
}

should read:

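bit isValidDchar(dchar c)
{
    // The last two code points of every plane (xxFFFE, xxFFFF) are noncharacters.
    dchar d = c & 0xFFFF;
    if (d == 0xFFFE || d == 0xFFFF) return false;
    // Accept everything below the surrogates, then skip the surrogates
    // (D800-DFFF) and the noncharacter block FDD0-FDEF.
    return c < 0xD800 ||
        (c >= 0xE000 && c < 0xFDD0) ||
        (c >= 0xFDF0 && c < 0x110000);
}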

or something functionally equivalent thereto.

Anything I may previously have said about isValidChar() is wrong. The Unicode FAQ (which appears to have changed its wording, since I remember it being ambiguous in the past) now says, unambiguously: "These invalid code points are the 66 noncharacters (including FFFE and FFFF), as well as unpaired surrogates." Ergo, we must exclude all 66 noncharacters, not merely FFFE and FFFF.
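
As a sanity check on the figure of 66, the following sketch counts the code points the corrected test rejects once surrogates are set aside. It assumes present-day D (bool rather than bit), and countNoncharacters is a hypothetical helper, not part of utf.d.

```d
// Same logic as the corrected function, with bool in place of bit.
bool isValidDchar(dchar c)
{
    const uint low = c & 0xFFFF;
    if (low == 0xFFFE || low == 0xFFFF)
        return false;               // last two code points of each plane
    return c < 0xD800 ||
        (c >= 0xE000 && c < 0xFDD0) ||
        (c >= 0xFDF0 && c < 0x110000);
}

// Hypothetical helper: counts rejected code points in U+0000..U+10FFFF,
// ignoring the surrogate range, which is invalid for a different reason.
uint countNoncharacters()
{
    uint n = 0;
    foreach (uint u; 0 .. 0x110000)
    {
        if (u >= 0xD800 && u <= 0xDFFF)
            continue;
        if (!isValidDchar(cast(dchar) u))
            ++n;
    }
    return n;
}
```

The arithmetic agrees: 32 code points in FDD0-FDEF plus 2 at the end of each of the 17 planes gives 32 + 34 = 66.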

Jill


July 28, 2004
Done.  I'll also integrate your other changes this evening and repost.


Sean


July 28, 2004
Well, the isValidDchar change is up, but I'm going to hold off on the rest unless I can rework the code to be optimized for ASCII, as per Walter's comment.  The existing code is already done this way, so there's no loss in the meantime.
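
For readers unsure what "optimized for ASCII" might mean in practice, one common shape is to test for a code unit below 0x80 inline and only call the general decoder otherwise. The sketch below is a guess at that idea, not the code in utf.d: it uses present-day Phobos (std.utf.decode), and toUTF32Fast is a hypothetical name.

```d
import std.utf : decode;

// Hypothetical helper: converts UTF-8 to UTF-32, handling ASCII inline.
dstring toUTF32Fast(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
    {
        const char c = s[i];
        if (c < 0x80)
        {
            // Fast path: an ASCII code unit is its own code point.
            result ~= cast(dchar) c;
            ++i;
        }
        else
        {
            // Slow path: the general decoder advances i past the sequence
            // and throws on malformed input.
            result ~= decode(s, i);
        }
    }
    return result.idup;
}
```

The point of the split is that mostly-ASCII text pays for one comparison per character rather than a full function call.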


Sean


August 06, 2004
In article <ce8sac$hhq$1@digitaldaemon.com>, Sean Kelly says...
>
>http://home.f4.ca/sean/d/utf.d

Just a note that I've incorporated Stewart Gordon's fixes in the file I have online (thanks Stewart!).


Sean