[Issue 12455] [uni][reg] Bad lowercase mapping for 'LATIN CAPITAL LETTER I WITH DOT ABOVE'

Apr 19, 2014

monarchdodra@gmail.com

Jul 04, 2014

Dmitry Olshansky

Jul 05, 2014

github-bugzilla@puremagic.com

Jul 05, 2014

github-bugzilla@puremagic.com

Jul 08, 2014

github-bugzilla@puremagic.com

Aug 21, 2014

github-bugzilla@puremagic.com

https://issues.dlang.org/show_bug.cgi?id=12455

--- Comment #2 from monarchdodra@gmail.com ---
I toyed around. The issue (apparently) is that it *can* be converted as:
LATIN CAPITAL LETTER I (U+0049) COMBINING DOT ABOVE (U+0307)

As such, when converted to lower case, it becomes:
LATIN SMALL LETTER I (U+0069) COMBINING DOT ABOVE (U+0307)

EG:
//----
import std.uni, std.stdio, std.string, std.conv;

void main()
{
    auto c = 'İ'; // '\u0130' LATIN CAPITAL LETTER I WITH DOT ABOVE
    auto s = "İ"; // "\u0130" LATIN CAPITAL LETTER I WITH DOT ABOVE
    assert(std.uni.isUpper(c)); //Passes
    auto sl = std.uni.toLower(s).to!dstring;
    assert(sl == "\u0069\u0307"); //PASSES
}
//----

Because uni "thinks" the lowercase doesn't fit in a single dchar, it simply does nothing (as documented).

However, it's still wrong, as the standard (from what I read) is pretty clear on the fact that the lower case is simply 'i'.

Furthermore, "LATIN SMALL LETTER I + COMBINING DOT ABOVE" is pretty redundant...

--

July 04, 2014

Posted by Dmitry Olshansky

https://issues.dlang.org/show_bug.cgi?id=12455

Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh@gmail.com

--- Comment #3 from Dmitry Olshansky <dmitry.olsh@gmail.com> ---
(In reply to monarchdodra from comment #2)
> I toyed around. The issue (apparently) is that it *can* be converted as:
> 

Indeed, the key problem is that the simple case mapping and the full case mapping differ for this character. It turns out there are also about 13 other characters with a similar problem, but they are much less frequently used.

Secondly, the Turkish language makes it more confusing still: under Turkish rules both mappings behave like the simple case mapping (the extra combining dot is dropped).

And last but not least, somebody introduced this bit of Turkish tailoring into the original std.uni, probably Ali :)

> 
> Because uni "thinks" the lowercase doesn't fit in a single dchar, it simply does nothing (as documented).
> 
> However, it's still wrong, as the standard (from what I read), is pretty clear on the fact that the lower case is simply 'i'.

In fact it's 2 code points. See the SpecialCasing.txt file, even though it "looks" like one character in the web CLDR utility.

Here is the first line:

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

Then at the end of file:

# Turkish and Azeri

# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

# When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

# When uppercasing, i turns into a dotted capital I

0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
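
To make the field layout of these entries easier to follow, here is a minimal sketch in D of a parser for one SpecialCasing.txt line (code; lower; title; upper; optional condition; # comment). The names SpecialCasingEntry, parseCodePoints and parseEntry are made up for illustration and are not part of std.uni, which does not read this file at run time.

```d
import std.array : split;
import std.conv : to;
import std.exception : enforce;
import std.string : indexOf, strip;

/// One SpecialCasing.txt entry. Struct and field names are illustrative only.
struct SpecialCasingEntry
{
    dchar code;        // source code point
    dstring lower;     // full lowercase mapping (may be empty or several code points)
    dstring title;     // full titlecase mapping
    dstring upper;     // full uppercase mapping
    string condition;  // e.g. "tr After_I"; empty when the rule is unconditional
}

/// Converts a field such as " 0069 0307" into the code points it lists.
dstring parseCodePoints(string field)
{
    dstring result;
    foreach (hex; field.split())              // whitespace split; a blank field yields nothing
        result ~= cast(dchar) hex.to!uint(16);
    return result;
}

/// Parses a single data line; assumes the line is not a pure comment line.
SpecialCasingEntry parseEntry(string line)
{
    auto hash = line.indexOf('#');
    if (hash >= 0)
        line = line[0 .. hash];               // drop the trailing comment
    auto fields = line.split(";");
    enforce(fields.length >= 4, "expected at least four ';'-separated fields");

    SpecialCasingEntry e;
    e.code = parseCodePoints(fields[0])[0];
    e.lower = parseCodePoints(fields[1]);
    e.title = parseCodePoints(fields[2]);
    e.upper = parseCodePoints(fields[3]);
    e.condition = fields.length > 4 ? fields[4].strip : "";
    return e;
}
```

Feeding it the unconditional entry for 0130 yields a two-code-point lower mapping, while the "tr After_I" entry for 0307 yields an empty lower mapping, i.e. the dot is deleted.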

> 
> Furthermore, "LATIN SMALL LETTER I + COMBINING DOT ABOVE" is pretty redundant...

Can't say much on this, but it is consistent with NFD normalization: U+0130 decomposes to U+0049 U+0307, which lowercases to U+0069 U+0307.

The course of action is clear: toLower with a dchar has to map it to 'i', while the string version keeps the current mapping.
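
As a rough illustration of that intended behaviour, the checks below assume a Phobos release that already contains the fix for this issue; on an older release the first assertion would be expected to fail. The NFD check is only there to show where the two-code-point form comes from.

```d
import std.uni;

unittest
{
    dchar c = '\u0130'; // LATIN CAPITAL LETTER I WITH DOT ABOVE

    // dchar overload: simple (one-to-one) case mapping, so the result is plain 'i'.
    assert(toLower(c) == 'i');

    // String overload: full case mapping, so the combining dot is kept.
    assert(toLower("\u0130"d) == "i\u0307"d);

    // NFD splits U+0130 into I + COMBINING DOT ABOVE; lowercasing that
    // sequence gives the same two code points as the full mapping.
    assert(normalize!NFD("\u0130"d) == "I\u0307"d);
    assert(toLower(normalize!NFD("\u0130"d)) == "i\u0307"d);
}
```

This can be run with something like `rdmd -main -unittest file.d`.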

Then, when processing Turkish text, the dot after I can be removed as a separate step.
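
A minimal sketch of what such a separate Turkish/Azeri step could look like, following the tr/az rules quoted from SpecialCasing.txt. The function turkishLower is hypothetical (Phobos has no such function), and it simplifies the After_I condition by only looking at a COMBINING DOT ABOVE that immediately follows the I, ignoring any other combining marks in between.

```d
import std.uni : toLower;

/// Hypothetical helper, not part of Phobos: lowercases `text` with the
/// Turkish/Azeri tailoring applied on top of the simple dchar mapping.
dstring turkishLower(dstring text)
{
    dstring result;
    for (size_t i = 0; i < text.length; ++i)
    {
        dchar c = text[i];
        if (c == '\u0130')
        {
            // I-dot lowercases to plain i.
            result ~= 'i';
        }
        else if (c == 'I')
        {
            if (i + 1 < text.length && text[i + 1] == '\u0307')
            {
                // After_I (simplified): I + COMBINING DOT ABOVE becomes i.
                result ~= 'i';
                ++i; // skip the combining dot
            }
            else
            {
                // Not_Before_Dot: a bare I becomes dotless i (U+0131).
                result ~= '\u0131';
            }
        }
        else
        {
            // Everything else: simple one-to-one mapping from std.uni.
            result ~= toLower(c);
        }
    }
    return result;
}

unittest
{
    assert(turkishLower("\u0130"d) == "i"d);
    assert(turkishLower("I\u0307"d) == "i"d);
    assert(turkishLower("I"d) == "\u0131"d);
}
```

Keeping this outside toLower means the default, locale-independent mappings stay untouched and only callers that know the text is Turkish or Azeri opt in.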

--

https://issues.dlang.org/show_bug.cgi?id=12455

--- Comment #4 from github-bugzilla@puremagic.com ---
Commits pushed to master at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/c131da58341b5af00feedd3dc535f2915cbdae0e
Fix issue 12455 [reg]Bad lowercase mapping for 'LATIN CAPITAL LETTER I WITH DOT ABOVE'

Also as part of a fix restores a test case in string.d to exactly match older behaviour.
Some extended greek is not upper but title case, yet changes on toUpper.

https://github.com/D-Programming-Language/phobos/commit/ced559888f8d244c13bcd93ef5c8412ce92ece82
Merge pull request #2304 from DmitryOlshansky/issue-12455

[REG]Fix issue 12455 Bad lowercase mapping for 'LATIN CAPITAL LETTER I W...

--

https://issues.dlang.org/show_bug.cgi?id=12455

--- Comment #5 from github-bugzilla@puremagic.com ---
Commit pushed to 2.066 at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/8caefcf93b5a88f0cc4576eeb95fd63c3f66f3fa
Merge pull request #2304 from DmitryOlshansky/issue-12455

[REG]Fix issue 12455 Bad lowercase mapping for 'LATIN CAPITAL LETTER I W...

--

https://issues.dlang.org/show_bug.cgi?id=12455

--- Comment #6 from github-bugzilla@puremagic.com ---
Commit pushed to master at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/8caefcf93b5a88f0cc4576eeb95fd63c3f66f3fa
Merge pull request #2304 from DmitryOlshansky/issue-12455

--
