dchar undefined behaviour
October 23, 2015 (tsbockman)
While working on updating and improving Lionello Lunesu's proposed fix for DMD issue #259, I have come across a value range propagation (VRP) issue with the dchar type.

The patch adds VRP-based compile-time evaluation of integer type comparisons, where possible. This caused the following issue:

The compiler will now optimize out attempts to handle invalid, out-of-range dchar values. For example:

import std.stdio : writeln;

void main()
{
    dchar c = cast(dchar) uint.max; // deliberately out of range
    if (c > 0x10FFFF)
        writeln("invalid");
    else
        writeln("OK");
}

With constant folding for integer comparisons, the above prints "OK" when it should print "invalid". The predicate (c > 0x10FFFF) is simply *assumed* to be false, because the current starting range.imax for a dchar expression is dchar.max (0x10FFFF).
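To be clear about the starting point (a minimal check of my own, independent of the patch): the cast stores the raw bit pattern without clamping or validating it, so c really does hold a value above dchar.max.

import std.stdio : writeln;

void main()
{
    static assert(dchar.max == 0x10FFFF);

    // The cast reinterprets the bits; no clamping or validation occurs.
    dchar c = cast(dchar) uint.max;
    writeln(cast(uint) c); // prints 4294967295, far above dchar.max
}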

So, this leads to the question: is making use of dchar values greater than dchar.max considered undefined behaviour, or not?

1. If it is UB, then there is quite a lot of D code (including std.uni) which must be corrected to use uint instead of dchar when dealing with values which could possibly fall outside the officially supported range (a sketch of such a correction follows at the end of this post).

2. If it is not UB, then the compiler needs to be updated to stop assuming that dchar values greater than dchar.max are impossible. This basically just means removing some of dchar's special treatment, and running it through more of the same code paths as uint.

At the moment, I strongly prefer #2, but I suppose #1 could make sense if people think code which might have to deal with invalid code points can be isolated sufficiently from other Unicode processing.
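For illustration, the kind of correction option #1 would force looks something like the following; isValidCodePoint is a hypothetical helper, not an existing Phobos function.

// Hypothetical helper: validate a raw 32-bit value before treating it
// as a dchar. Taking uint sidesteps any VRP assumption that the value
// already fits in 0 .. dchar.max.
bool isValidCodePoint(uint cp)
{
    // Reject values above U+10FFFF and the UTF-16 surrogate range.
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

unittest
{
    assert(isValidCodePoint('A'));
    assert(!isValidCodePoint(uint.max));
    assert(!isValidCodePoint(0xD800)); // surrogate: not a valid scalar value
}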
October 23, 2015 (Walter Bright)
On 10/22/2015 6:31 PM, tsbockman wrote:
> So, this leads to the question: is making use of dchar values greater than
> dchar.max considered undefined behaviour, or not?
>
> 1. If it is UB, then there is quite a lot of D code (including std.uni) which
> must be corrected to use uint instead of dchar when dealing with values which
> could possibly fall outside the officially supported range.
>
> 2. If it is not UB, then the compiler needs to be updated to stop assuming that
> dchar values greater than dchar.max are impossible. This basically just means
> removing some of dchar's special treatment, and running it through more of the
> same code paths as uint.

I think that ship has sailed. Illegal values in a dchar are not UB. Making it UB would result in the surprising behavior you've noted. Also, this segues into what to do about string, wstring, and dstring with invalid sequences in them. Currently, functions define what they do with invalid sequences. Making it UB would be a burden to programmers.
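One existing example of such defined behavior: std.utf.validate throws a UTFException on an invalid sequence rather than invoking UB. A minimal sketch:

import std.stdio : writeln;
import std.utf : validate, UTFException;

void main()
{
    string s = "abc\xFFdef"; // 0xFF can never appear in well-formed UTF-8

    try
    {
        validate(s); // defined behavior: throws on an invalid sequence
        writeln("valid UTF-8");
    }
    catch (UTFException e)
    {
        writeln("invalid UTF-8 detected"); // this branch runs
    }
}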
October 23, 2015 (tsbockman)
On Friday, 23 October 2015 at 12:17:22 UTC, Walter Bright wrote:
> I think that ship has sailed. Illegal values in a dchar are not UB. Making it UB would result in the surprising behavior you've noted. Also, this segues into what to do about string, wstring, and dstring with invalid sequences in them. Currently, functions define what they do with invalid sequences. Making it UB would be a burden to programmers.

That makes sense to me. I think the language would have to work a lot harder to block the creation of invalid dchar values to justify the current VRP assumption.

Fixing the compiler in accordance with option #2 looks like it should be easy, so far.
October 23, 2015 (Vladimir Panteleev)
On Friday, 23 October 2015 at 01:31:47 UTC, tsbockman wrote:
> dchar c = cast(dchar) uint.max;
> if (c > 0x10FFFF)
>     writeln("invalid");
> else
>     writeln("OK");
>
> With constant folding for integer comparisons, the above prints "OK" when it should print "invalid". The predicate (c > 0x10FFFF) is simply *assumed* to be false, because the current starting range.imax for a dchar expression is dchar.max (0x10FFFF).

That doesn't sound right. In fact, it calls into question why dchar.max is set at its current value. It may be the maximum in the current version of Unicode, but this seems like a completely pointless restriction that breaks forward compatibility with future Unicode versions, meaning that D programs compiled today may be unable to work with Unicode text in the future because of a pointless artificial limitation.

October 23, 2015 (Anon)
On Friday, 23 October 2015 at 21:22:38 UTC, Vladimir Panteleev wrote:
> That doesn't sound right. In fact, it calls into question why dchar.max is set at its current value. It may be the maximum in the current version of Unicode, but this seems like a completely pointless restriction that breaks forward compatibility with future Unicode versions, meaning that D programs compiled today may be unable to work with Unicode text in the future because of a pointless artificial limitation.

Unless UTF-16 is deprecated and completely removed from all systems everywhere, there is no way for the Unicode Consortium to increase the limit beyond U+10FFFF. That limit is not arbitrary; it is based on the technical limitations of what UTF-16 can actually represent. UTF-8 and UTF-32 both have room for expansion, but have been defined to match UTF-16's limitations.
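The arithmetic behind that ceiling, as a sketch (standard UTF-16 surrogate-pair decoding, not code from this thread): each surrogate contributes 10 payload bits, offset by 0x10000, so the largest encodable value is exactly 0x10000 + 2^20 - 1 = 0x10FFFF.

// Standard UTF-16 surrogate-pair decoding: 10 payload bits from each
// surrogate, offset by 0x10000, so the ceiling is exactly U+10FFFF.
uint decodeSurrogatePair(wchar hi, wchar lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF); // high (lead) surrogate
    assert(lo >= 0xDC00 && lo <= 0xDFFF); // low (trail) surrogate
    return 0x10000 + (((hi - 0xD800) << 10) | (lo - 0xDC00));
}

// Largest representable value: both 10-bit payloads at their maximum.
static assert(0x10000 + ((0x3FF << 10) | 0x3FF) == 0x10FFFF);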
October 24, 2015 (Dmitry Olshansky)
On 24-Oct-2015 02:45, Anon wrote:
> On Friday, 23 October 2015 at 21:22:38 UTC, Vladimir Panteleev wrote:
>> That doesn't sound right. In fact, it calls into question why
>> dchar.max is set at its current value. It may be the maximum in the
>> current version of Unicode, but this seems like a completely
>> pointless restriction that breaks forward compatibility with future
>> Unicode versions, meaning that D programs compiled today may be
>> unable to work with Unicode text in the future because of a pointless
>> artificial limitation.
>
> Unless UTF-16 is deprecated and completely removed from all systems
> everywhere, there is no way for the Unicode Consortium to increase the
> limit beyond U+10FFFF. That limit is not arbitrary; it is based on the
> technical limitations of what UTF-16 can actually represent. UTF-8 and
> UTF-32 both have room for expansion, but have been defined to match
> UTF-16's limitations.

Exactly. Unicode officially limited UTF-8 to U+10FFFF in Unicode 6.0 or so. Previously it was expected to (maybe) expand beyond that, but it was decided to stay at U+10FFFF pretty much indefinitely because of UTF-16.

Also: only ~114k of the 1,114,112 possible code points (0x110000) have an assigned meaning, so we are looking at 900K+ unassigned values reserved today.
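As a back-of-the-envelope check (the 114k figure is the rough count cited above, not an exact one):

// Code space is U+0000 .. U+10FFFF; surrogates are not scalar values.
enum totalCodePoints = 0x110000;        // 1,114,112
enum surrogates      = 0xE000 - 0xD800; // 2,048
enum assignedApprox  = 114_000;         // rough figure cited above
static assert(totalCodePoints - surrogates - assignedApprox > 900_000);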

-- 
Dmitry Olshansky