Formal Review of std.uni (page 2)

On Tuesday, April 30, 2013 15:13:14 Dmitry Olshansky wrote: > Unicode --> can't be done on character by character basis Sure it can. It operates on dchar. I don't know what you've done with std.uni, but the way things have been laid out is that std.ascii: Operates on dchars, doing ASCII operations, generally ignoring anything that's not ASCII. std.uni: Operates on dchars, doing Unicode operations. The operations are generally the same as what you'd do with ASCII only with Unicode instead. std.utf: Doing encoding related stuff with Unicode, so it operates on strings. std.string: Operates on strings (almost always using Unicode) So, with how it's been, std.uni would only be operating on dchars, and putting a function in there which operated on strings wouldn't make any sense. Maybe that doesn't work if you've done a bunch of grapheme stuff, and things will have to be adjusted, but it would be a definite shift to put anything in std.uni which operated on strings, and I think that it would need some definite justification (and there's a good chance that I'd be inclined to argue that it should still go in std.string, possibly using some internal modules if necessary). But clearly I need to take the time to take a look at what you've actually done (I keep meaning to but haven't gotten around to it yet). It had been my impression that what you were doing was primarily a matter of improving the implementation, but it sounds like you're doing something beyond that. - Jonathan M Davis

30-Apr-2013 23:17, Jonathan M Davis пишет: > On Tuesday, April 30, 2013 15:13:14 Dmitry Olshansky wrote: >> Unicode --> can't be done on character by character basis > > Sure it can. It operates on dchar. Getting back to this. Sure it can't - I'd hate to break the illusion but the keyword is e.g. Unicode Case Folding. Another one is Combining Character sequence. > So, with how it's been, std.uni would only be operating on dchars, and putting > a function in there which operated on strings wouldn't make any sense. Maybe > that doesn't work if you've done a bunch of grapheme stuff, and things will > have to be adjusted, but it would be a definite shift to put anything in > std.uni which operated on strings, and I think that it would need some definite > justification (and there's a good chance that I'd be inclined to argue that it > should still go in std.string, possibly using some internal modules if > necessary). Justification is that we'd rather have exactly one module dealing with a bunch of Unicode data arranged into intricate tables. Strictly speaking I'd abolish any Unicode related algorithm in std.string since it's almost definitely doing it wrong anyway (I've checked only 2 - both broken). There is not a single sign of unicode standards used, just the fallacious logic: byte --> dchar and use the same algorithm as with ASCII. It won't work. > > But clearly I need to take the time to take a look at what you've actually > done (I keep meaning to but haven't gotten around to it yet). It had been my > impression that what you were doing was primarily a matter of improving the > implementation, but it sounds like you're doing something beyond that. Take a peek at icmp and sicmp in new std.uni. Current fork of Phobos is here: https://github.com/blackwhale/phobos/tree/new-std-uni Eventually we'd have to do a bit more in the same direction e.g. title casing, split by word boundary etc. (all of these need fixing in std.string). Also all of the core tools are now in the open: CodepointSet, and generating Tries from sets and AA-s. -- Dmitry Olshansky

May 12, 2013

Re: Formal Review of std.uni

Posted by Dmitry Olshansky
in reply to Jesse Phillips

Permalink

Dmitry Olshansky

Posted in reply to Jesse Phillips

Permalink

28-Apr-2013 20:56, Jesse Phillips пишет:
> This is a replacement module for the current std.uni by Dmitry
> Olshansky. The std.uni module provides an implementation of fundamental
> Unicode algorithms and data structures.
>
> To use this module, install 2.63 beta, import uni; and not std.uni,
> compile two files from the source uni.d unicode_tables.d
>
> Docs:
> http://blackwhale.github.io/phobos/uni.html
>
> Source:
> https://github.com/blackwhale/gsoc-bench-2012
>
> DMD Beta:
> http://forum.dlang.org/post/517C8552.7040704@digitalmars.com
>
> It should be noted that inclusion into Phobos may require addressing
> inter-dependencies, see "Reducing the inter-dependencies"
> http://forum.dlang.org/post/kl8hn8$bm3$1@digitalmars.com

We have only one week for review left so I'd like to sort out the last issues before we get to the voting.

First to fill in on latest developments.
With a bunch of ugly hacks I've managed to integrate new std.uni in my Phobos fork and it passes unittests for me now (on win32 at least).

See it hanging there and waiting to be destroyed by the pull tester:
https://github.com/D-Programming-Language/phobos/pull/1289

Remaining issues that I'm aware of:
- proper toLower/toUpper (current one is simplified codepoint-for-codepoint)
- clean up the debris after crush-landing back into Phobos, revert some unrelated changes etc.

Please take time to make that list grow, esp w.r.t interface choices and the code itself.

Plus separately I'd need to remove rudimentary versions of the same data-structures used in std.regex and rewire it to use the new std.uni.

There are few bugs and issues uncovered during integration that I wish to get feedback on.

std.string has a bogus test for toLower:
Of the very few tests being done 2 are very special corner case around \u0130 which is I with dot and is expected to be lowercased to i.
But it's *not* supposed to - this conversion is specific to Turk(?) locale (=tailoring). What should happen is unfolding it to 2-codepoint sequence 'i' and 'dot-above' (this is in works).

I just hope nobody depends on these particular conversions and I am wondering who's put them there in the first place.

std.json is another thing - 0x7F somehow is specifically tested as being accepted as part of string literal. Yet ECMA script docs clearly state that Unicode control characters are to be stripped even before lexing (ignored even in literals).

P.S. Someday I need to track down and file about 2 (or 3?) distinct compiler bugs (fwd-ref hell, private alias hijacking) that I worked around while getting there.
Another one has a fix already (thanks, Kenji):
http://d.puremagic.com/issues/show_bug.cgi?id=10067

-- 
Dmitry Olshansky

On 2013-05-12 22:06, Dmitry Olshansky wrote: > std.json is another thing - 0x7F somehow is specifically tested as being > accepted as part of string literal. Yet ECMA script docs clearly state > that Unicode control characters are to be stripped even before lexing > (ignored even in literals). That is my fault. There's a unit test for std.net.isemail that contains a 0x7F. std.json is somehow used in the auto tester. std.net.isemail broke the auto tester because of this. I haven't seen any JSON parser that doesn't accept 0x7F in a string literal. -- /Jacob Carlborg

I've found a difference from the existing std.uni. Compile-time execution of toUpper/toLower does not work. import uni; void main() { enum a = 'm'; static assert(a.toUpper == 'M'); static assert(a.toLower == 'm'); } This never really was part of the contract in std.uni, but is probably a good idea to continue supporting it.

18-May-2013 21:15, Jesse Phillips пишет: > I've found a difference from the existing std.uni. > > Compile-time execution of toUpper/toLower does not work. > > import uni; > > void main() { > enum a = 'm'; > > static assert(a.toUpper == 'M'); > static assert(a.toLower == 'm'); > } > > This never really was part of the contract in std.uni, but is probably a > good idea to continue supporting it. Will take a stub at it. -- Dmitry Olshansky

19-May-2013 13:08, Dmitry Olshansky пишет: > 18-May-2013 21:15, Jesse Phillips пишет: >> I've found a difference from the existing std.uni. >> >> Compile-time execution of toUpper/toLower does not work. >> >> import uni; >> >> void main() { >> enum a = 'm'; >> >> static assert(a.toUpper == 'M'); >> static assert(a.toLower == 'm'); >> } >> >> This never really was part of the contract in std.uni, but is probably a >> good idea to continue supporting it. > > Will take a stub at it. > Was already fixed in my Phobos fork. I must have hit it during the integration process. Will re-sync the standalone version then. This also works: enum a = 'я'; static assert(a.toUpper == 'Я'); static assert(a.toLower == 'я'); -- Dmitry Olshansky

Forums