January 11, 2013
Ready for review: new std.uni
It's long overdue to present the work I did during the last GSoC; it was a 
summer, not a winter, of code after all. Unfortunately some compiler bugs, 
a new job :) and unrelated events of importance have postponed its release 
beyond measure.

Anyway it's polished and ready for the good old collective destruction 
called peer review. I'm looking for a review manager.

The code, including extra tests and a benchmark is here:
https://github.com/blackwhale/gsoc-bench-2012

And documentation:
http://blackwhale.github.com/phobos/uni.html

And with that I'm asking: is there anything else in the queue ready to be 
reviewed?

Currently I prefer to keep it standalone, so read 'uni' everywhere as 
'std.uni'. On the plus side, this makes it easy to try the new 'uni' and 
compare it with the old one without recompiling Phobos.

To use it in place of std.uni, replace 'std.uni'->'uni' in your programs and 
compare the results. Just make sure both the uni and unicode_tables modules 
are linked in; usually rdmd can take care of this dependency.

The list of new functionality is quite large, so I'll point out the major 
sections and leave the rest to the documentation.

In general there are 3 angles to the new std.uni:

1) The same stuff, but better and faster. For one thing the isXXX 
classification functions are brought up to date with Unicode 6.2 and are 
up to 5-6x faster on non-ASCII text.

2) The commonly expected stuff in any modern Unicode-aware language:
normalization, grapheme decoding, composition/decomposition and 
case-insensitive comparison*.

3) I've taken it as a crucial point to provide the end user with all of the 
tools used to build Unicode algorithms from the ground up.
Thus all generally useful data structures used to implement the library 
internals are accessible to 'mortals' too:
 - a type for manipulating sets of codepoints, with full set algebra
 - a construction for generating fast multi-stage lookup tables (Trie)
 - a ton of predefined sets to construct your own specific ones from
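
To make that concrete, here is a rough sketch of how these pieces are meant to combine, going by the linked documentation (the CodepointSet constructor shape and the toTrie helper name are my reading of the docs, not verified against this exact revision):

```d
import uni; // standalone for now; read 'uni' as 'std.uni'

void main()
{
    // Build a set from half-open intervals of codepoints:
    auto ascii = CodepointSet('a', 'z' + 1, 'A', 'Z' + 1);
    // Combine it with a predefined script set using set algebra:
    auto letters = ascii | unicode.Cyrillic;
    assert(letters['d'] && letters['Я']);
    // Wrap the set into a fast multi-stage lookup table:
    auto trie = toTrie!1(letters);
    assert(trie['d']);
}
```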

There is extra candy for meta-programming text-processing libraries: 
the set type can generate source code of an unrolled, hard-coded binary 
search over a given set of codepoints.
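
As a sketch of that feature (toSourceCode is the method named later in this thread; the set combination and generated function name are illustrative only):

```d
import std.stdio;
import uni; // read 'uni' as 'std.uni'

void main()
{
    auto brackets = unicode.Ps | unicode.Pe; // open + close punctuation
    // Emit D source of an unrolled binary search classifier
    // named 'isBracket' (a hypothetical name):
    writeln(brackets.toSourceCode("isBracket"));
}
```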

Among other things, the entire collection of required data is generated 
automatically by downloading from unicode.org. The tool relies on the 
same foundation (3), and for the most part this version of std.uni should 
be trivially updatable to new versions of the standard (see the gen_uni.d 
script).

* The only missing 'big' thing is the collation algorithm. At this point 
I'm proposing to just move the large chunk of new std.uni in place. This 
way potential contributors would have tools to implement missing bits 
later on.

P.S. CodepointSet could easily be adjusted to serve as a generic integer 
set type, and Trie already supports far more than codepoint->value 
mappings. These should probably be enhanced and later adopted into 
std.container(2).

-- 
Dmitry Olshansky
January 11, 2013
Re: Ready for review: new std.uni
On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
[...]
> Anyway it's polished and ready for the good old collective
> destruction called peer review. I'm looking for a review manager.
> 
> The code, including extra tests and a benchmark is here:
> https://github.com/blackwhale/gsoc-bench-2012
> 
> And documentation:
> http://blackwhale.github.com/phobos/uni.html

Excellent!! It looks pretty impressive. We should definitely try our
best to get this into Phobos.


[...]
> In general there are 3 angles to the new std.uni:
> 
> 1) The same stuff but better and faster. For one thing isXXX
> classification functions are brought up to date with Unicode 6.2 and
> are up to 5-6x faster on non-ASCII text.

And there is a tool for auto-updating the data, correct? So in theory,
barring major structural changes in the standard, we should be able to
recompile Phobos (maybe after running the update tool) and it should be
automatically updated to the latest Unicode release?


> 2) The commonly expected stuff in any modern Unicode-aware language:
> normalization, grapheme decoding, composition/decomposition and
> case-insensitive comparison*.
>
> 3) I've taken it as a crucial point to provide all of the tools used
> to build Unicode algorithms from ground up to the end user.

Are there enough tools to implement line-breaking algorithms (TR14)?


> Thus all generally useful data structures used to implement the
> library internals are accessible for 'mortals' too:
>  - a type for manipulating sets of codepoints, with full set algebra
>  - a construction for generating fast multi-stage lookup tables (Trie)
>  - a ton of predefined sets to construct your own specific ones from

I looked over the examples in the docs. Looks awesome!

One question though: how do I check whether a character is in a specific
Unicode category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the
specific category, and I assume it's overkill to have one function per
category, so I presume it should be possible to do this using the
codepoint sets? Can you add an example of this to the docs, if this is
possible?


[...]
> Among other things the entire collection of data required is
> generated automatically by downloading from unicode.org. The tool
> relies on the same foundation (3) and for the most part this version
> of std.uni should be trivially updated to the new versions of the
> standard (see gen_uni.d script).

Excellent!


> * The only missing 'big' thing is the collation algorithm. At this
> point I'm proposing to just move the large chunk of new std.uni in
> place. This way potential contributors would have tools to implement
> missing bits later on.

But the aforementioned tools should be enough for someone to use as the
basis for implementing the collation algorithm right?


> P.S. CodepointSet could be easily adjusted to serve as a generic
> integer set type and Trie already supports far more than
> codepoint->value mappings. These should probably be enhanced and
> later adopted into std.container(2).
[...]

Yes, having an efficient integer set type and generic Trie type would be
a huge plus to add to std.container. Although, isn't Andrei supposed to
be working on a new std.container, with a better range-based API and
custom allocator scheme? If that's not going to happen for a while yet,
I say we should merge std.uni first, then worry about factoring Trie and
int set later (we can always just alias them in std.uni later when we
move them, that should prevent user code breakage).

Now, some nitpicks:

- InversionList:

  - I don't like using ~ to mean symmetric set difference, because ~
    already means concatenation in D, and it's confusing to overload it
    with an unrelated meaning. I propose using ^ instead, because
    symmetric set difference is analogous to XOR.

  - Why does the table of operators in opBinary's doc appear twice?
    Is the second table supposed to be something else?

- We should at least briefly describe what the various Unicode
 normalization forms mean (yes I know they can be looked up on
 unicode.org and various other online resources, but having them right
 there in the docs makes the docs so much more helpful).

- Why are the deprecated functions still in there? The deprecation
 messages say "It will be removed in August 2012", which is already
 past. So they should be removed by now.

- Is there a reason there's a composeJamo / hangulDecompose, but no
 other writing system specific functions (such as functions to deal
 with Thai consonant/vowel compositions)? Is this a peculiarity of the
 Unicode standard? Or should there be more general functions that
 handle composition/decomposition of compound characters?

- Why are many of the category checking functions @system rather than
 @safe? It would seem rather crippling to me if much of std.uni can't
 be used from @safe code!

- It would also be nice if these functions can be made pure (I'm not
 sure I understand why checking the category of a character should be
 impure). The lack of nothrow I can understand, since the input
 character may be illegal or otherwise malformed. But @nothrow pure
 seems to me to be necessary for all category-checking functions.


T

-- 
The volume of a pizza of thickness a and radius z can be described by
the following formula: pi zz a. -- Wouter Verhelst
January 11, 2013
Re: Ready for review: new std.uni
On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky 
wrote:
> The code, including extra tests and a benchmark is here:
> https://github.com/blackwhale/gsoc-bench-2012

Phew, that repository is quite a hefty checkout with all the 
example data.

Just for the fun of it, a completely irreproducible, unscientific 
benchmark (sorry, Andrei!) run on a Core i7 M 620 under Linux 
x86_64:

DMD 2.061
---
$ rdmd -O -release -noboundscheck -inline char_class.d 
enwiki-latest-all-titles-in-ns0
Creating level 1 of Tries.
Alpha:262144
Mark:262144
Number:262144
Symbol:262144
Creating level 2 of Tries.
Alpha:9472
Mark:6656
Number:6656
Symbol:6400
Creating level 3 of Tries.
Alpha:4000
Mark:2304
Number:2432
Symbol:2848
Creating level 4 of Tries.
Alpha:3376
Mark:1976
Number:1624
Symbol:2096
Time to build dchar array: 2702 ms

Baselines
            noop [enwiki],   217.335, 888.99 M/s
   new-std-alpha [enwiki],   1197.49, 161.34 M/s
    new-std-mark [enwiki],   6374.73, 30.31 M/s
     new-std-num [enwiki],   6290.89, 30.71 M/s
     new-std-sym [enwiki],   6500.93, 29.72 M/s
   new-std-white [enwiki],   822.245, 234.98 M/s
 combining class [enwiki],   5948.63, 32.48 M/s
       inv-alpha [enwiki],     10290, 18.78 M/s
        inv-mark [enwiki],    9229.2, 20.93 M/s
         inv-num [enwiki],   9046.48, 21.36 M/s
         inv-sym [enwiki],   9843.59, 19.63 M/s

Tries of level 1
      trie-alpha [enwiki],   2179.17, 88.66 M/s
       trie-mark [enwiki],   1912.54, 101.02 M/s
        trie-num [enwiki],   1924.92, 100.37 M/s
        trie-sym [enwiki],   1911.22, 101.09 M/s

Tries of level 2
      trie-alpha [enwiki],   4524.33, 42.70 M/s
       trie-mark [enwiki],   4276.92, 45.17 M/s
        trie-num [enwiki],   4308.42, 44.84 M/s
        trie-sym [enwiki],   4298.68, 44.95 M/s

Tries of level 3
      trie-alpha [enwiki],   6407.64, 30.15 M/s
       trie-mark [enwiki],    6150.8, 31.41 M/s
        trie-num [enwiki],   6095.96, 31.69 M/s
        trie-sym [enwiki],    6253.7, 30.89 M/s

Tries of level 4
      trie-alpha [enwiki],   8329.66, 23.20 M/s
       trie-mark [enwiki],   7993.11, 24.17 M/s
        trie-num [enwiki],   8040.94, 24.03 M/s
        trie-sym [enwiki],   8033.07, 24.05 M/s
---

LDC (2.061 merge branch):
---
$ rdmd --compiler=ldmd2 -O3 -release -noboundscheck -inline 
char_class.d enwiki-latest-all-titles-in-ns0
Alpha:262144
Mark:262144
Number:262144
Symbol:262144
Creating level 2 of Tries.
Alpha:9472
Mark:6656
Number:6656
Symbol:6400
Creating level 3 of Tries.
Alpha:4000
Mark:2304
Number:2432
Symbol:2848
Creating level 4 of Tries.
Alpha:3376
Mark:1976
Number:1624
Symbol:2096
Time to build dchar array: 1962 ms

Baselines
            noop [enwiki],   211.516, 913.44 M/s
   new-std-alpha [enwiki],   612.324, 315.53 M/s
    new-std-mark [enwiki],   577.134, 334.77 M/s
     new-std-num [enwiki],   600.996, 321.48 M/s
     new-std-sym [enwiki],   577.433, 334.60 M/s
   new-std-white [enwiki],   381.535, 506.40 M/s
 combining class [enwiki],   517.231, 373.54 M/s
       inv-alpha [enwiki],   8181.25, 23.62 M/s
        inv-mark [enwiki],   6938.09, 27.85 M/s
         inv-num [enwiki],   6347.71, 30.44 M/s
         inv-sym [enwiki],    6930.7, 27.88 M/s

Tries of level 1
      trie-alpha [enwiki],   275.741, 700.69 M/s
       trie-mark [enwiki],   274.972, 702.65 M/s
        trie-num [enwiki],   274.558, 703.70 M/s
        trie-sym [enwiki],   277.255, 696.86 M/s

Tries of level 2
      trie-alpha [enwiki],   453.453, 426.08 M/s
       trie-mark [enwiki],   452.912, 426.59 M/s
        trie-num [enwiki],   455.004, 424.63 M/s
        trie-sym [enwiki],   453.492, 426.04 M/s

Tries of level 3
      trie-alpha [enwiki],   577.442, 334.59 M/s
       trie-mark [enwiki],   579.054, 333.66 M/s
        trie-num [enwiki],    577.95, 334.30 M/s
        trie-sym [enwiki],   577.566, 334.52 M/s

Tries of level 4
      trie-alpha [enwiki],   778.261, 248.26 M/s
       trie-mark [enwiki],   778.943, 248.04 M/s
        trie-num [enwiki],   778.352, 248.23 M/s
        trie-sym [enwiki],   779.442, 247.88 M/s
---

Haven't had a look at what is going on there yet – it could well 
be that some measurement loop is optimized out altogether.

David
January 11, 2013
Re: Ready for review: new std.uni
On Fri, Jan 11, 2013 at 12:35:42PM -0800, H. S. Teoh wrote:
[...]
> - It would also be nice if these functions can be made pure (I'm not
>   sure I understand why checking the category of a character should be
>   impure). The lack of nothrow I can understand, since the input
>   character may be illegal or otherwise malformed. But @nothrow pure
>   seems to me to be necessary for all category-checking functions.

Gah, I meant @safe pure.


T

-- 
Lawyer: (n.) An innocence-vending machine, the effectiveness of which depends on how much money is inserted.
January 11, 2013
Re: Ready for review: new std.uni
On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
> Anyway it's polished and ready for the good old collective destruction called
> peer review. I'm looking for a review manager.

Thank you for doing this. Some notes:


struct InversionList:

   1. It implements length and empty, but otherwise does not seem to be a 
range, though its name implies it is a container.

   2. The description "InversionList is a packed interval-based data structure 
that represents a set of codepoints." completely baffles me.

   3. The table for opBinary appears twice.

   4. toSourceCode says "if the codepoint". If what codepoint? And what 
funcName is ""?

mapTrieIndex:

   I have no idea what this does from the description. Is it a map like in 
std.algorithm? I.e. does it work with ranges? Need an example.

TrieBuilder:

   1. No description. I'm completely baffled.

   2. putRange? What does that do?

General lack of examples.

Trie:

    Should have a wikipedia link to what a trie is.

buildTrie:

    Contains a number of functions that return auto, but no mention of what is 
returned. While I like auto, in these cases it is not helpful, because the user 
needs to know what type is returned.

Grapheme:

    Should provide an InputRange so a user can iterate over its codepoints. It 
already appears to provide part of the range interface.


sicmp and icmp:

    No reason one should have "in" and the other "inout". Both should be "const".

combiningClass

    What is "the canonical combining class"?

General note:

    Needs a section that briefly defines terms, with links to external 
documentation.
January 11, 2013
Re: Ready for review: new std.uni
12-Jan-2013 00:38, David Nadlinger wrote:
> On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky wrote:
>> The code, including extra tests and a benchmark is here:
>> https://github.com/blackwhale/gsoc-bench-2012
>
> Phew, that repository is quite a hefty checkout with all the example data.
>
> Just for the fun of it, a completely irreproducible, unscientific
> benchmark (sorry, Andrei!) run on a Core i7 M 620 under Linux x86_64:
>
> DMD 2.061
> ---
> $ rdmd -O -release -noboundscheck -inline char_class.d
[snip]

I'd rather use dewiki or arwiki; these give more coverage. Otherwise 
you just check the presence of ASCII special casing (only isAlpha and 
isWhite have it).

[snip]



> LDC (2.061 merge branch):
> ---
> $ rdmd --compiler=ldmd2 -O3 -release -noboundscheck -inline char_class.d
> enwiki-latest-all-titles-in-ns0

[snip]

>
> Haven't had a look at what is going on there yet – it could well be that
> some measurement loop is optimized out altogether.
>

You can print total counts after each bench; there is a TLS variable 
written at the end of it. But anyway, I like your numbers! :)

> David


-- 
Dmitry Olshansky
January 11, 2013
Re: Ready for review: new std.uni
12-Jan-2013 00:35, H. S. Teoh wrote:
> On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
> [...]
>> Anyway it's polished and ready for the good old collective
>> destruction called peer review. I'm looking for a review manager.
>>
>> The code, including extra tests and a benchmark is here:
>> https://github.com/blackwhale/gsoc-bench-2012
>>
>> And documentation:
>> http://blackwhale.github.com/phobos/uni.html
>
> Excellent!! It looks pretty impressive. We should definitely try our
> best to get this into Phobos.
>
>
> [...]
>> In general there are 3 angles to the new std.uni:
>>
>> 1) The same stuff but better and faster. For one thing isXXX
>> classification functions are brought up to date with Unicode 6.2 and
>> are up to 5-6x faster on non-ASCII text.
>
> And there is a tool for auto-updating the data, correct? So in theory,
> barring major structural changes in the standard, we should be able to
> recompile Phobos (maybe after running the update tool) and it should be
> automatically updated to the latest Unicode release?

Yup, except that they tweak the algorithms all along. Normalization, for 
instance, is full of errata and minor notes accumulated over time. I'll 
probably do a write-up on how to implement it and where to find all the 
related info.

>
>
>> 2) The commonly expected stuff in any modern Unicode-aware language:
>> normalization, grapheme decoding, composition/decomposition and
>> case-insensitive comparison*.
>>
>> 3) I've taken it as a crucial point to provide all of the tools used
>> to build Unicode algorithms from ground up to the end user.
>
> Are there enough tools to implement line-breaking algorithms (TR14)?
>

Should be. AFAIK _word_ breaking seemed trivial to do; I've got to check 
line breaking.

>
>> Thus all generally useful data structures used to implement the
>> library internals are accessible for 'mortals' too:
>>   - a type for manipulating sets of codepoints, with full set algebra
>>   - a construction for generating fast multi-stage lookup tables (Trie)
>>   - a ton of predefined sets to construct your own specific ones from
>
> I looked over the examples in the docs. Looks awesome!
>
> One question though: how do I check a character in a specific unicode
> category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the
> specific category, and I assume it's overkill to have one function per
> category, so I presume it should be possible to do this using the
> codepoint sets? Can you add an example of this to the docs, if this is
> possible?

auto psSet = unicode.Ps;
assert(psSet['(']); //should pass
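
And since the sets support full set algebra, specific combinations are just as easy (a sketch; operator choice per the opBinary table):

```d
auto ps = unicode.Ps;      // open punctuation only
auto pe = unicode.Pe;      // close punctuation only
auto brackets = ps | pe;   // union of the two categories
assert(brackets['(']);
assert(brackets[')']);
assert(!ps[')']);          // ')' is Pe, not Ps
```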

>
>
> [...]
>> Among other things the entire collection of data required is
>> generated automatically by downloading from unicode.org. The tool
>> relies on the same foundation (3) and for the most part this version
>> of std.uni should be trivially updated to the new versions of the
>> standard (see gen_uni.d script).
>
> Excellent!
>
>
>> * The only missing 'big' thing is the collation algorithm. At this
>> point I'm proposing to just move the large chunk of new std.uni in
>> place. This way potential contributors would have tools to implement
>> missing bits later on.
>
> But the aforementioned tools should be enough for someone to use as the
> basis for implementing the collation algorithm right?
>
I sure hope so. I've been looking through all of the TRs and found nothing 
surprising beyond more and more tricky codepoint sets, and more and more 
weird rules for using them.

>
>> P.S. CodepointSet could be easily adjusted to serve as a generic
>> integer set type and Trie already supports far more than
>> codepoint->value mappings. These should probably be enhanced and
>> later adopted into std.container(2).
> [...]
>
> Yes, having an efficient integer set type and generic Trie type would be
> a huge plus to add to std.container. Although, isn't Andrei supposed to
> be working on a new std.container, with a better range-based API and
> custom allocator scheme? If that's not going to happen for a while yet,
> I say we should merge std.uni first, then worry about factoring Trie and
> int set later (we can always just alias them in std.uni later when we
> move them, that should prevent user code breakage).
>
> Now, some nitpicks:
>

Thanks for early feedback.

> - InversionList:
>
>     - I don't like using ~ to mean symmetric set difference, because ~
>       already means concatenation in D, and it's confusing to overload it
>       with an unrelated meaning. I propose using ^ instead, because
>       symmetric set difference is analogous to XOR.
>

Point taken, arguably '~' was a bad idea. But I don't like ^ either. 
Seems like it's the best candidate though.


>     - Why is the table of operators in opBinary's doc duplicated twice?
>       Is the second table supposed to be something else?
>

Beats me. Must be some glitch with BOOKTABLE macro (or how I use it). I 
think I've hit it before.

> - We should at least briefly describe what the various Unicode
>    normalization forms mean (yes I know they can be looked up on
>    unicode.org and various other online resources, but having them right
>    there in the docs makes the docs so much more helpful).
>

> - Why are the deprecated functions still in there? The deprecation
>    messages say "It will be removed in August 2012", which is already
>    past. So they should be removed by now.
>

Will do before Walter chimes in with his motto of keeping clutter just 
in case.

> - Is there a reason there's a composeJamo / hangulDecompose, but no
>    other writing system specific functions (such as functions to deal
>    with Thai consonant/vowel compositions)? Is this a peculiarity of the
>    Unicode standard? Or should there be more general functions that
>    handle composition/decomposition of compound characters?
>
Hangul is huge (11K+ precomposed syllables) and has a specific algorithmic 
decomposition rule. It's also a hell of a special case in the Unicode 
standard (one that is mentioned in only one chapter!). Everything else is 
table-driven.
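
For reference, the algorithmic rule in question (Unicode standard, chapter 3.12) maps each of the 11172 precomposed syllables to its jamo with pure arithmetic; a self-contained sketch, not the module's own implementation:

```d
// Constants straight from the Unicode standard, chapter 3.12:
enum dchar SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
enum VCount = 21, TCount = 28, NCount = VCount * TCount, SCount = 11_172;

// Decompose a precomposed Hangul syllable into its 2 or 3 jamo.
dchar[] decomposeHangul(dchar s)
{
    assert(s >= SBase && s - SBase < SCount, "not a precomposed Hangul syllable");
    immutable idx = s - SBase;
    immutable dchar l = LBase + idx / NCount;           // leading consonant
    immutable dchar v = VBase + (idx % NCount) / TCount; // vowel
    immutable dchar t = TBase + idx % TCount;            // optional trailing consonant
    return t == TBase ? [l, v] : [l, v, t];
}

unittest
{
    // The classic TR15 example: U+D4DB -> U+1111 U+1171 U+11B6
    assert(decomposeHangul('\uD4DB') == "\u1111\u1171\u11B6"d);
}
```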

> - Why are many of the category checking functions @system rather than
>    @safe? It would seem rather crippling to me if much of std.uni can't
>    be used from @safe code!
>

@system? Well, these got to be either @trusted or @safe.

> - It would also be nice if these functions can be made pure (I'm not
>    sure I understand why checking the category of a character should be
>    impure).

A sophisticated global immutable table. I think this can be pure, but the 
compiler might have disagreed.

>  The lack of nothrow I can understand, since the input
>    character may be illegal or otherwise malformed.

nothrow is probably doable as any codepoint will do. (and if it's > 
dchar.max then it's a problem somewhere else).

TBH I killed these qualifiers early on as they prevented the thing from 
compiling. I can't recall if there is still a problem with any of these.

> But @nothrow pure
>    seems to me to be necessary for all category-checking functions.
>


-- 
Dmitry Olshansky
January 11, 2013
Re: Ready for review: new std.uni
12-Jan-2013 00:50, Walter Bright wrote:
> On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
>> Anyway it's polished and ready for the good old collective destruction
>> called
>> peer review. I'm looking for a review manager.
>
> Thank you for doing this. Some notes:
>

Great bashing, thanks :)

>
> struct InversionList:
>
>     1. It implements length and empty, but otherwise does not seem to be
> a range, though its name implies it is a container.
>

Containers also have length and can be empty. Ranges on top of it are 
.byChar, .byInterval.

>     2. The description "InversionList is a packed interval-based data
> structure that represents a set of codepoints." completely baffles me.
>

OK, probably it was a bad idea to combine implementation detail and 
general description in one statement.

How about: "InversionList is a set of codepoints optimized for size.
It represents the set as an array of intervals, using 6 bytes per interval 
of codepoints."

Better?
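
For what it's worth, the underlying idea is simple to state: store only the sorted codepoints at which membership toggles, and answer queries with a binary search over them. An illustration of the concept (not InversionList's actual layout):

```d
import std.range : assumeSorted;

// Toggle points of the set {10..19, 30..39}, as half-open boundaries:
immutable uint[] toggles = [10, 20, 30, 40];

bool contains(immutable uint[] t, uint cp)
{
    // Count boundaries <= cp; an odd count means cp is inside an interval.
    auto n = t.assumeSorted.lowerBound(cp + 1).length;
    return (n & 1) == 1;
}

unittest
{
    assert(contains(toggles, 15));   // inside the first interval
    assert(!contains(toggles, 25));  // in the gap
    assert(contains(toggles, 30));   // interval start is inclusive
    assert(!contains(toggles, 40));  // interval end is exclusive
}
```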

>     3. The table for opBinary appears twice.
>
>     4. toSourceCode says "if the codepoint". If what codepoint? And what
> funcName is ""?
>
> mapTrieIndex:
>
>     I have no idea what this does from the description. Is it a map like
> in std.algorithms? I.e. does it work with ranges? Need an example.
>
> TrieBuilder:
>
>     1. No description. I'm completely baffled.
>
>     2. putRange? What does that do?
>
> General lack of examples.
>
> Trie:
>
>      Should have a wikipedia link to what a trie is.
>
> buildTrie:
>
>      Contains a number of functions that return auto, but no mention of
> what is returned. While I like auto, in these cases it is not helpful,
> because the user needs to know what type is returned.
>

Trust me, you won't like it when you see it :) That's part of the reason 
it's hidden. But you are correct that the documentation on these artifacts 
is *ehm* ... sketchy. Will fix/update.

> Grapheme:
>
>      Should provide an InputRange so a user can iterate over its
> codepoints. It already appears to provide part of the range interface.

Container != range. Just slice it and a random-access range is in your 
hands. I guess I need to put this info somewhere prominent.

BTW foreach(v; Grapheme("abc")) { ... }
should work because of automatic slicing.

>
> sicmp and icmp:
>
>      No reason one should have "in" and the other "inout". Both should
> be "const".
>
> combiningClass
>
>      What is "the canonical combining class"?
>
> General note:
>
>      Needs a section that briefly defines terms, with links to external
> documentation.

All to the point, thanks.

-- 
Dmitry Olshansky
January 11, 2013
Re: Ready for review: new std.uni
On Sat, Jan 12, 2013 at 01:06:30AM +0400, Dmitry Olshansky wrote:
> 12-Jan-2013 00:35, H. S. Teoh пишет:
> >On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
> >[...]
> >>Anyway it's polished and ready for the good old collective
> >>destruction called peer review. I'm looking for a review manager.
> >>
> >>The code, including extra tests and a benchmark is here:
> >>https://github.com/blackwhale/gsoc-bench-2012
> >>
> >>And documentation:
> >>http://blackwhale.github.com/phobos/uni.html
[...]
> >>2) The commonly expected stuff in any modern Unicode-aware language:
> >>normalization, grapheme decoding, composition/decomposition and
> >>case-insensitive comparison*.
> >>
> >>3) I've taken it as a crucial point to provide all of the tools used
> >>to build Unicode algorithms from ground up to the end user.
> >
> >Are there enough tools to implement line-breaking algorithms (TR14)?
> >
> 
> Should be. AFAIK _word_ breaking seemed trivial to do, got to check
> the line breaking.

Line-breaking is in TR14. I looked over it. It's definitely more
complicated than I expected it to be. Some alphabets may even require
stylistic rules for line-breaking! (Not to mention hyphenation.) So it's
probably too much to expect std.uni to do it. But at least the tools
necessary to implement it should be there.


[...]
> >One question though: how do I check a character in a specific unicode
> >category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the
> >specific category, and I assume it's overkill to have one function per
> >category, so I presume it should be possible to do this using the
> >codepoint sets? Can you add an example of this to the docs, if this is
> >possible?
> 
> auto psSet = unicode.Ps;
> assert(psSet['(']); //should pass

It would be nice to have a list of possible categories in the docs for
unicode.opDispatch/opCall, at least, the common categories, with a
hyperlink to unicode.org (or wherever) that provides the full list.
Currently, it's unclear, beyond the single example given, of what the
possibilities are.


[...]
> >- We should at least briefly describe what the various Unicode
> >   normalization forms mean (yes I know they can be looked up on
> >   unicode.org and various other online resources, but having them right
> >   there in the docs makes the docs so much more helpful).
> >
> 
> >- Why are the deprecated functions still in there? The deprecation
> >   messages say "It will be removed in August 2012", which is already
> >   past. So they should be removed by now.
> >
> 
> Will do before Walter chimes in with his motto of keeping clutter
> just in case.

In that case, I would keep the functions (aliases?) but remove them from
DDoc. So old code will still silently work, but the functions will no
longer be officially supported, there will be no documentation for them
and they can disappear anytime.


> >- Is there a reason there's a composeJamo / hangulDecompose, but no
> >   other writing system specific functions (such as functions to deal
> >   with Thai consonant/vowel compositions)? Is this a peculiarity of
> >   the Unicode standard? Or should there be more general functions
> >   that handle composition/decomposition of compound characters?
> >
> Hangul are huge (11K+) and they have a specific algorithmic
> decomposition rule. Also they are a hell of a special case in the
> Unicode standard (one that is mentioned in a single chapter only!).
> Everything else is table-driven.

OK. Makes sense, I guess. Though it gave me the impression that std.uni
has some undue bias for Korean text. ;-)
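For reference, the algorithmic rule in question (Unicode standard, section 3.12, "Conjoining Jamo Behavior") is compact enough to sketch, which shows why Hangul gets arithmetic instead of tables:

```d
// Hangul syllable decomposition per Unicode §3.12.
// Constants are taken from the standard.
enum dchar SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
enum TCount = 28, NCount = 21 * TCount /* VCount * TCount */,
     SCount = 19 * NCount;

dchar[] decomposeHangul(dchar s)
{
    if (s < SBase || s >= SBase + SCount)
        return [s]; // not a precomposed Hangul syllable
    auto i = s - SBase;
    dchar L = LBase + i / NCount;            // leading consonant
    dchar V = VBase + (i % NCount) / TCount; // vowel
    dchar T = TBase + i % TCount;            // optional trailing consonant
    return T == TBase ? [L, V] : [L, V, T];
}
```

So the 11K+ syllables cost no table space at all, which is exactly why they are carved out as a special case.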

[...]
> >- It would also be nice if these functions can be made pure (I'm not
> >   sure I understand why checking the category of a character should
> >   be impure).
> 
> Global sophisticated immutable table. I think this can be pure but
> the compiler might have disagreed.

Hmm. This sounds like a bug in purity, maybe? It seems clear to me that
immutable values might as well be constants, and therefore using said
values should not make the code impure, even if they are implemented as
a global table. I think a bug should be filed for this.
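A reduced example of the pattern in question; as I understand D's purity rules, a `pure` function is allowed to read immutable module-level data, since such data can never change and the result therefore depends only on the arguments:

```d
// Immutable module-level lookup table, filled at compile time via CTFE.
immutable bool[128] asciiAlphaTable = () {
    bool[128] t;
    foreach (c; 0 .. 128)
        t[c] = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    return t;
}();

// Reading the immutable table does not break purity.
bool isAsciiAlpha(dchar c) pure nothrow @nogc @safe
{
    return c < 128 && asciiAlphaTable[c];
}
```

If the real multi-stage tables can't be qualified like this, that does smell like a compiler limitation worth a bug report.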


> > The lack of nothrow I can understand, since the input
> >   character may be illegal or otherwise malformed.
> 
> nothrow is probably doable as any codepoint will do. (and if it's >
> dchar.max then it's a problem somewhere else).

True, an illegal character by definition does not belong to any
category, so all category functions should return false for it. No
exceptions need to be thrown.
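In other words, a category test can be made total over dchar; a sketch of the idea (the set-indexing syntax follows the proposed API, and whether the lookup itself can be `nothrow` depends on the implementation):

```d
// Anything outside the set, including values past U+10FFFF, is
// simply "not in the category"; no exception is ever needed.
bool inCategory(Set)(Set set, dchar c) nothrow
{
    return c <= 0x10FFFF && set[c];
}
```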


> TBH I've killed these qualifiers early on as they prevented the thing
> from compiling. I can't recall if there is still a problem with any of
> these.
[...]

Before merging into Phobos, I would think we should put (as many of) the
qualifiers back (as possible). Preferably everywhere they make sense,
barring compiler bugs.


T

-- 
The state pretends to pay us a salary, and we pretend to work.
January 11, 2013
Re: Ready for review: new std.uni
On 1/11/2013 1:21 PM, Dmitry Olshansky wrote:
> On 12-Jan-2013 00:50, Walter Bright wrote:
>> On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
>>> Anyway it's polished and ready for the good old collective destruction
>>> called
>>> peer review. I'm looking for a review manager.
>>
>> Thank you for doing this. Some notes:
>>
>
> Great bashing, thanks :)

Anytime!


>> struct InversionList:
>>
>>     1. It implements length and empty, but otherwise does not seem to be
>> a range, though its name implies it is a container.
>>
>
> Containers also have length and can be empty. Ranges on top of it are .byChar,
> .byInterval.

Some more text about accessing it as a range would be helpful. Being a set of 
"codepoints" yet accessing it "by char" makes no sense.

In unicode, the terms "char", "codepoint", "octet", "grapheme", etc., are 
bandied about and are often conflated and confused with each other. I suggest a 
brief section defining those terms, and then scrupulously, consistently, anally, 
and nitpickingly adhering to them in the documentation and names of things.


>>     2. The description "InversionList is a packed interval-based data
>> structure that represents a set of codepoints." completely baffles me.
>>
>
> OK, probably it was a bad idea to combine implementation detail and general
> description in one statement.
>
> How about: InversionList is a set of codepoints, optimized for size.
> It represents a set as an array of intervals using 6 bytes per interval of
> codepoints.
>
> Better?

The name says "list" and yet it is described as a "set". Being a "list"
implies that it is a linked list. BTW, what is an "interval"?
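For what it's worth, an "interval" here seems to be a half-open codepoint range [a, b); a sketch using the pair-of-endpoints constructor from the docs would make the docs much clearer:

```d
// A set built from two half-open intervals: [0x30, 0x3A) and
// [0x41, 0x5B), i.e. the ASCII digits and uppercase letters.
// Stored as 4 interval endpoints rather than ~36 single codepoints,
// which is where the "optimized for size" claim comes from.
auto set = CodepointSet(0x30, 0x3A, 0x41, 0x5B);
assert(set['7']);
assert(set['Q']);
assert(!set['q']); // lowercase letters are outside both intervals
```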


>>     3. The table for opBinary appears twice.
>>
>>     4. toSourceCode says "if the codepoint". If what codepoint? And what
>> funcName is ""?
>>
>> mapTrieIndex:
>>
>>     I have no idea what this does from the description. Is it a map like
>> in std.algorithms? I.e. does it work with ranges? Need an example.
>>
>> TrieBuilder:
>>
>>     1. No description. I'm completely baffled.
>>
>>     2. putRange? What does that do?
>>
>> General lack of examples.
>>
>> Trie:
>>
>>      Should have a wikipedia link to what a trie is.
>>
>> buildTrie:
>>
>>      Contains a number of functions that return auto, but no mention of
>> what is returned. While I like auto, in these cases it is not helpful,
>> because the user needs to know what type is returned.
>>
>
> Trust me you won't like it when you see it :) Part of the reason it's hidden.
> But you are correct that documentation on these artifacts is *ehm* ... sketchy.
> Will fix/update.

If the compiler type is hard to grok, ok, but at least the documentation should 
make it clear what is returned.


>> Grapheme:
>>
>>      Should provide an InputRange so a user can iterate over its
>> codepoints. It already appears to provide part of the range interface.
>
> Container != range. Just slice it and RA range is in your hands. I need to put
> this info somewhere prominent I guess.
>
> BTW foreach(v; Grapheme("abc")) { ... }
> should work because of automatic slicing.

Yes, but is the resulting v an octet, char, codepoint ????
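If I read the docs right, the answer is dchar, i.e. a codepoint; a worked sketch (the `Grapheme` string constructor is assumed from the proposed API):

```d
auto g = Grapheme("a\u0301"); // 'a' plus U+0301 combining acute accent
foreach (dchar cp; g[])       // slicing yields dchar: codepoints,
{                             // not octets or UTF-8 code units
    // first iteration: cp == 'a'; second: cp == '\u0301'
}
```

Stating the element type explicitly in the docs would settle the octet/char/codepoint question once and for all.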


> All to the point, thanks.

Welcs!