Thread overview
January 11, 2013 Ready for review: new std.uni
It's long overdue to present the work I did during the last GSOC; it was a summer, not winter, of code after all. Unfortunately some compiler bugs, a new job :) and unrelated events of importance have postponed its release beyond measure. Anyway, it's polished and ready for the good old collective destruction called peer review. I'm looking for a review manager.

The code, including extra tests and a benchmark, is here:
https://github.com/blackwhale/gsoc-bench-2012

And documentation:
http://blackwhale.github.com/phobos/uni.html

And with that I'm asking if there is anything ahead of it in the review queue.

Currently I prefer to keep it standalone, so read 'uni' everywhere as 'std.uni'. On the plus side this makes it easy to try the new 'uni' and compare it with the old one without re-compiling Phobos. To use it in place of std.uni, replace 'std.uni' -> 'uni' in your programs and compare the results. Just make sure both the uni and unicode_tables modules are linked in; usually rdmd can take care of this dependency.

The list of new functionality is quite large, so I'll point out the major sections and leave the rest to the documentation. In general there are 3 angles to the new std.uni:

1) The same stuff, but better and faster. For one thing, the isXXX classification functions are brought up to date with Unicode 6.2 and are up to 5-6x faster on non-ASCII text.

2) The commonly expected stuff in any modern Unicode-aware language: normalization, grapheme decoding, composition/decomposition and case-insensitive comparison*.

3) I've taken it as a crucial point to provide all of the tools used to build Unicode algorithms from the ground up to the end user.
Thus all generally useful data structures used to implement the library internals are accessible to 'mortals' too:
- a type for manipulating sets of codepoints, with full set algebra
- a construction for generating fast multi-stage lookup tables (Trie)
- a ton of predefined sets to construct your own specific ones from

There is extra candy for meta-programming text processing libraries: a set type can generate the source code of an unrolled, hard-coded binary search for a given set of codepoints.

Among other things, the entire collection of data required is generated automatically by downloading from unicode.org. The tool relies on the same foundation (3), and for the most part this version of std.uni should be trivially updated to new versions of the standard (see the gen_uni.d script).

* The only missing 'big' thing is the collation algorithm. At this point I'm proposing to just move the large chunk of new std.uni in place. This way potential contributors will have the tools to implement the missing bits later on.

P.S. CodepointSet could easily be adjusted to serve as a generic integer set type, and Trie already supports far more than codepoints->values mappings. These should probably be enhanced and later adopted into std.container(2).

-- Dmitry Olshansky
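For a taste of the set algebra described above, a minimal sketch (the `unicode` property-style lookup is shown later in this thread; the specific script names and operators here are taken from the linked documentation, so treat this as illustrative rather than definitive):

```d
// Sketch only: assumes the standalone 'uni' and 'unicode_tables'
// modules are linked in, as described above.
import uni;

void main()
{
    // Predefined sets combine with full set algebra:
    // '|' is union, '&' is intersection, '-' is difference.
    auto letters = unicode.Cyrillic | unicode.Greek;
    assert(letters['\u0410']);   // CYRILLIC CAPITAL LETTER A
    assert(!letters['a']);       // Latin 'a' is in neither set
}
```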
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Dmitry Olshansky

On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
[...]
> Anyway it's polished and ready for the good old collective destruction
> called peer review. I'm looking for a review manager.
>
> The code, including extra tests and a benchmark is here:
> https://github.com/blackwhale/gsoc-bench-2012
>
> And documentation: http://blackwhale.github.com/phobos/uni.html

Excellent!! It looks pretty impressive. We should definitely try our best to get this into Phobos.

[...]
> In general there are 3 angles to the new std.uni:
>
> 1) The same stuff but better and faster. For one thing isXXX
> classification functions are brought up to date with Unicode 6.2 and
> are up to 5-6x times faster on non-ASCII text.

And there is a tool for auto-updating the data, correct? So in theory, barring major structural changes in the standard, we should be able to recompile Phobos (maybe after running the update tool) and it should be automatically updated to the latest Unicode release?

> 2) The commonly expected stuff in any modern Unicode-aware language:
> normalization, grapheme decoding, composition/decomposition and
> case-insensitive comparison*.
>
> 3) I've taken it as a crucial point to provide all of the tools used
> to build Unicode algorithms from ground up to the end user.

Are there enough tools to implement line-breaking algorithms (TR14)?

> Thus all generally useful data structures used to implement the
> library internals are accessible for 'mortals' too:
> - a type for manipulating sets of codepoints, with full set algebra
> - a construction for generating fast multi-stage lookup tables (Trie)
> - a ton of predefined sets to construct your own specific ones from

I looked over the examples in the docs. Looks awesome!

One question though: how do I check a character in a specific unicode category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the specific category, and I assume it's overkill to have one function per category, so I presume it should be possible to do this using the codepoint sets? Can you add an example of this to the docs, if this is possible?

[...]
> Among other things the entire collection of data required is generated
> automatically by downloading from unicode.org. The tool relies on the
> same foundation (3) and for the most part this version of std.uni
> should be trivially updated to the new versions of the standard (see
> gen_uni.d script).

Excellent!

> * The only missing 'big' thing is the collation algorithm. At this
> point I'm proposing to just move the large chunk of new std.uni in
> place. This way potential contributors would have tools to implement
> missing bits later on.

But the aforementioned tools should be enough for someone to use as the basis for implementing the collation algorithm, right?

> P.S. CodepointSet could be easily adjusted to serve as generic integer
> set type and Trie already supports far more the codepoints->values
> mappings. These should probably be enhanced and later adopted to
> std.container(2).
[...]

Yes, having an efficient integer set type and a generic Trie type would be a huge plus to add to std.container. Although, isn't Andrei supposed to be working on a new std.container, with a better range-based API and custom allocator scheme? If that's not going to happen for a while yet, I say we should merge std.uni first, then worry about factoring Trie and the int set out later (we can always just alias them in std.uni when we move them; that should prevent user code breakage).

Now, some nitpicks:

- InversionList:

  - I don't like using ~ to mean symmetric set difference, because ~ already means concatenation in D, and it's confusing to overload it with an unrelated meaning. I propose using ^ instead, because symmetric set difference is analogous to XOR.

  - Why is the table of operators in opBinary's doc duplicated? Is the second table supposed to be something else?

- We should at least briefly describe what the various Unicode normalization forms mean (yes, I know they can be looked up on unicode.org and various other online resources, but having them right there in the docs makes the docs so much more helpful).

- Why are the deprecated functions still in there? The deprecation messages say "It will be removed in August 2012", which is already past. So they should be removed by now.

- Is there a reason there's a composeJamo / hangulDecompose, but no other writing-system-specific functions (such as functions to deal with Thai consonant/vowel compositions)? Is this a peculiarity of the Unicode standard? Or should there be more general functions that handle composition/decomposition of compound characters?

- Why are many of the category checking functions @system rather than @safe? It would seem rather crippling to me if much of std.uni can't be used from @safe code!

- It would also be nice if these functions could be made pure (I'm not sure I understand why checking the category of a character should be impure). The lack of nothrow I can understand, since the input character may be illegal or otherwise malformed. But @nothrow pure seems to me to be necessary for all category-checking functions.

T

-- The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst
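To illustrate the operator under discussion, a hypothetical sketch (the interval-pair constructor is taken from the linked docs; the reviewed module spells symmetric difference '~', and the '^' spelling is only this review's proposal):

```d
import uni;

void main()
{
    // Two overlapping sets; constructor args are [start, end) interval pairs.
    auto a = CodepointSet('a', 'g');
    auto b = CodepointSet('d', 'k');
    // Symmetric set difference: elements in exactly one of the two sets.
    auto sym = a ~ b;    // current spelling; the proposal is to write a ^ b
    assert(sym['b']);    // only in a
    assert(sym['h']);    // only in b
    assert(!sym['e']);   // in both, so excluded
}
```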
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Dmitry Olshansky

On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky wrote:
> The code, including extra tests and a benchmark is here:
> https://github.com/blackwhale/gsoc-bench-2012

Phew, that repository is quite a hefty checkout with all the example data.

Just for the fun of it, a completely irreproducible, unscientific benchmark (sorry, Andrei!) run on a Core i7 M 620 under Linux x86_64:

DMD 2.061
---
$ rdmd -O -release -noboundscheck -inline char_class.d enwiki-latest-all-titles-in-ns0
Creating level 1 of Tries.
Alpha:262144 Mark:262144 Number:262144 Symbol:262144
Creating level 2 of Tries.
Alpha:9472 Mark:6656 Number:6656 Symbol:6400
Creating level 3 of Tries.
Alpha:4000 Mark:2304 Number:2432 Symbol:2848
Creating level 4 of Tries.
Alpha:3376 Mark:1976 Number:1624 Symbol:2096
Time to build dchar array: 2702 ms

Baselines
noop [enwiki], 217.335, 888.99 M/s
new-std-alpha [enwiki], 1197.49, 161.34 M/s
new-std-mark [enwiki], 6374.73, 30.31 M/s
new-std-num [enwiki], 6290.89, 30.71 M/s
new-std-sym [enwiki], 6500.93, 29.72 M/s
new-std-white [enwiki], 822.245, 234.98 M/s
combining class [enwiki], 5948.63, 32.48 M/s
inv-alpha [enwiki], 10290, 18.78 M/s
inv-mark [enwiki], 9229.2, 20.93 M/s
inv-num [enwiki], 9046.48, 21.36 M/s
inv-sym [enwiki], 9843.59, 19.63 M/s

Tries of level 1
trie-alpha [enwiki], 2179.17, 88.66 M/s
trie-mark [enwiki], 1912.54, 101.02 M/s
trie-num [enwiki], 1924.92, 100.37 M/s
trie-sym [enwiki], 1911.22, 101.09 M/s

Tries of level 2
trie-alpha [enwiki], 4524.33, 42.70 M/s
trie-mark [enwiki], 4276.92, 45.17 M/s
trie-num [enwiki], 4308.42, 44.84 M/s
trie-sym [enwiki], 4298.68, 44.95 M/s

Tries of level 3
trie-alpha [enwiki], 6407.64, 30.15 M/s
trie-mark [enwiki], 6150.8, 31.41 M/s
trie-num [enwiki], 6095.96, 31.69 M/s
trie-sym [enwiki], 6253.7, 30.89 M/s

Tries of level 4
trie-alpha [enwiki], 8329.66, 23.20 M/s
trie-mark [enwiki], 7993.11, 24.17 M/s
trie-num [enwiki], 8040.94, 24.03 M/s
trie-sym [enwiki], 8033.07, 24.05 M/s
---
LDC (2.061 merge branch):
---
$ rdmd --compiler=ldmd2 -O3 -release -noboundscheck -inline char_class.d enwiki-latest-all-titles-in-ns0
Alpha:262144 Mark:262144 Number:262144 Symbol:262144
Creating level 2 of Tries.
Alpha:9472 Mark:6656 Number:6656 Symbol:6400
Creating level 3 of Tries.
Alpha:4000 Mark:2304 Number:2432 Symbol:2848
Creating level 4 of Tries.
Alpha:3376 Mark:1976 Number:1624 Symbol:2096
Time to build dchar array: 1962 ms

Baselines
noop [enwiki], 211.516, 913.44 M/s
new-std-alpha [enwiki], 612.324, 315.53 M/s
new-std-mark [enwiki], 577.134, 334.77 M/s
new-std-num [enwiki], 600.996, 321.48 M/s
new-std-sym [enwiki], 577.433, 334.60 M/s
new-std-white [enwiki], 381.535, 506.40 M/s
combining class [enwiki], 517.231, 373.54 M/s
inv-alpha [enwiki], 8181.25, 23.62 M/s
inv-mark [enwiki], 6938.09, 27.85 M/s
inv-num [enwiki], 6347.71, 30.44 M/s
inv-sym [enwiki], 6930.7, 27.88 M/s

Tries of level 1
trie-alpha [enwiki], 275.741, 700.69 M/s
trie-mark [enwiki], 274.972, 702.65 M/s
trie-num [enwiki], 274.558, 703.70 M/s
trie-sym [enwiki], 277.255, 696.86 M/s

Tries of level 2
trie-alpha [enwiki], 453.453, 426.08 M/s
trie-mark [enwiki], 452.912, 426.59 M/s
trie-num [enwiki], 455.004, 424.63 M/s
trie-sym [enwiki], 453.492, 426.04 M/s

Tries of level 3
trie-alpha [enwiki], 577.442, 334.59 M/s
trie-mark [enwiki], 579.054, 333.66 M/s
trie-num [enwiki], 577.95, 334.30 M/s
trie-sym [enwiki], 577.566, 334.52 M/s

Tries of level 4
trie-alpha [enwiki], 778.261, 248.26 M/s
trie-mark [enwiki], 778.943, 248.04 M/s
trie-num [enwiki], 778.352, 248.23 M/s
trie-sym [enwiki], 779.442, 247.88 M/s
---

Haven't had a look at what is going on there yet; it could well be that some measurement loop is optimized out altogether.

David
January 11, 2013 Re: Ready for review: new std.uni
On Fri, Jan 11, 2013 at 12:35:42PM -0800, H. S. Teoh wrote:
[...]
> - It would also be nice if these functions can be made pure (I'm not
>   sure I understand why checking the category of a character should be
>   impure). The lack of nothrow I can understand, since the input
>   character may be illegal or otherwise malformed. But @nothrow pure
>   seems to me to be necessary for all category-checking functions.

Gah, I meant @safe pure.

T

-- Lawyer: (n.) An innocence-vending machine, the effectiveness of which depends on how much money is inserted.
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Dmitry Olshansky

On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
> Anyway it's polished and ready for the good old collective destruction
> called peer review. I'm looking for a review manager.
Thank you for doing this. Some notes:
struct InversionList:
1. It implements length and empty, but otherwise does not seem to be a range, though its name implies it is a container.
2. The description "InversionList is a packed interval-based data structure that represents a set of codepoints." completely baffles me.
3. The table for opBinary appears twice.
4. toSourceCode says "if the codepoint". If what codepoint? And what funcName is ""?
mapTrieIndex:
I have no idea what this does from the description. Is it a map like in std.algorithm? I.e. does it work with ranges? Need an example.
TrieBuilder:
1. No description. I'm completely baffled.
2. putRange? What does that do?
General lack of examples.
Trie:
Should have a wikipedia link to what a trie is.
buildTrie:
Contains a number of functions that return auto, but no mention of what is returned. While I like auto, in these cases it is not helpful, because the user needs to know what type is returned.
Grapheme:
Should provide an InputRange so a user can iterate over its codepoints. It already appears to provide part of the range interface.
sicmp and icmp:
No reason one should have "in" and the other "inout". Both should be "const".
combiningClass
What is "the canonical combining class"?
General note:
Needs a section that briefly defines terms, with links to external documentation.
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to David Nadlinger

12-Jan-2013 00:38, David Nadlinger wrote:
> On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky wrote:
>> The code, including extra tests and a benchmark is here:
>> https://github.com/blackwhale/gsoc-bench-2012
>
> Phew, that repository is quite a hefty checkout with all the example data.
>
> Just for the fun of it, a completely irreproducible, unscientific
> benchmark (sorry, Andrei!) run on a Core i7 M 620 under Linux x86_64:
>
> DMD 2.061
> ---
> $ rdmd -O -release -noboundscheck -inline char_class.d
[snip]

I'd rather use dewiki or arwiki; these give more coverage. Otherwise you mostly check the presence of the ASCII special casing (only isAlpha and isWhite have it).

> Time to build dchar array: 2702 ms
>
> Baselines
> noop [enwiki], 217.335, 888.99 M/s
[snip]
> LDC (2.061 merge branch):
> ---
> $ rdmd --compiler=ldmd2 -O3 -release -noboundscheck -inline char_class.d
> enwiki-latest-all-titles-in-ns0
[snip]
> Haven't had a look at what is going on there yet – it could well be
> that some measurement loop is optimized out altogether.

You can print the total counts after each bench; there is a TLS variable written at the end of it. But anyway, I like your numbers! :)

> David

-- Dmitry Olshansky
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to H. S. Teoh

12-Jan-2013 00:35, H. S. Teoh wrote:
> On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
> [...]
>> Anyway it's polished and ready for the good old collective
>> destruction called peer review. I'm looking for a review manager.
>>
>> The code, including extra tests and a benchmark is here:
>> https://github.com/blackwhale/gsoc-bench-2012
>>
>> And documentation:
>> http://blackwhale.github.com/phobos/uni.html
>
> Excellent!! It looks pretty impressive. We should definitely try our
> best to get this into Phobos.
>
> [...]
>> In general there are 3 angles to the new std.uni:
>>
>> 1) The same stuff but better and faster. For one thing isXXX
>> classification functions are brought up to date with Unicode 6.2 and
>> are up to 5-6x times faster on non-ASCII text.
>
> And there is a tool for auto-updating the data, correct? So in theory,
> barring major structural changes in the standard, we should be able to
> recompile Phobos (maybe after running the update tool) and it should be
> automatically updated to the latest Unicode release?

Yup, except that they tweak the algorithms all along. The normalization algorithm, for instance, is full of erratas and minor notes accumulated over time. I'll probably do a write-up on how to implement it and where to find all the related info.

>> 2) The commonly expected stuff in any modern Unicode-aware language:
>> normalization, grapheme decoding, composition/decomposition and
>> case-insensitive comparison*.
>>
>> 3) I've taken it as a crucial point to provide all of the tools used
>> to build Unicode algorithms from ground up to the end user.
>
> Are there enough tools to implement line-breaking algorithms (TR14)?

Should be. AFAIK _word_ breaking seemed trivial to do; got to check the line breaking.

>> Thus all generally useful data structures used to implement the
>> library internals are accessible for 'mortals' too:
>> - a type for manipulating sets of codepoints, with full set algebra
>> - a construction for generating fast multi-stage lookup tables (Trie)
>> - a ton of predefined sets to construct your own specific ones from
>
> I looked over the examples in the docs. Looks awesome!
>
> One question though: how do I check a character in a specific unicode
> category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the
> specific category, and I assume it's overkill to have one function per
> category, so I presume it should be possible to do this using the
> codepoint sets? Can you add an example of this to the docs, if this is
> possible?

auto psSet = unicode.Ps;
assert(psSet['(']); // should pass

> [...]
>> Among other things the entire collection of data required is
>> generated automatically by downloading from unicode.org. The tool
>> relies on the same foundation (3) and for the most part this version
>> of std.uni should be trivially updated to the new versions of the
>> standard (see gen_uni.d script).
>
> Excellent!
>
>> * The only missing 'big' thing is the collation algorithm. At this
>> point I'm proposing to just move the large chunk of new std.uni in
>> place. This way potential contributors would have tools to implement
>> missing bits later on.
>
> But the aforementioned tools should be enough for someone to use as the
> basis for implementing the collation algorithm right?

I sure hope so. I've been looking through all of the TRs and found nothing surprising beyond more and more tricky codepoint sets and more and more weird rules to use them.

>> P.S. CodepointSet could be easily adjusted to serve as generic
>> integer set type and Trie already supports far more the
>> codepoints->values mappings. These should probably be enhanced and
>> later adopted to std.container(2).
> [...]
>
> Yes, having an efficient integer set type and generic Trie type would be
> a huge plus to add to std.container. Although, isn't Andrei supposed to
> be working on a new std.container, with a better range-based API and
> custom allocator scheme? If that's not going to happen for a while yet,
> I say we should merge std.uni first, then worry about factoring Trie and
> int set later (we can always just alias them in std.uni later when we
> move them, that should prevent user code breakage).
>
> Now, some nitpicks:

Thanks for the early feedback.

> - InversionList:
>
>   - I don't like using ~ to mean symmetric set difference, because ~
>     already means concatenation in D, and it's confusing to overload it
>     with an unrelated meaning. I propose using ^ instead, because
>     symmetric set difference is analogous to XOR.

Point taken; arguably '~' was a bad idea. But I don't like ^ either. Seems like it's the best candidate, though.

>   - Why is the table of operators in opBinary's doc duplicated twice?
>     Is the second table supposed to be something else?

Beats me. Must be some glitch with the BOOKTABLE macro (or how I use it). I think I've hit it before.

> - We should at least briefly describe what the various Unicode
>   normalization forms mean (yes I know they can be looked up on
>   unicode.org and various other online resources, but having them right
>   there in the docs makes the docs so much more helpful).
>
> - Why are the deprecated functions still in there? The deprecation
>   messages say "It will be removed in August 2012", which is already
>   past. So they should be removed by now.

Will do, before Walter chimes in with his motto of keeping clutter just in case.

> - Is there a reason there's a composeJamo / hangulDecompose, but no
>   other writing system specific functions (such as functions to deal
>   with Thai consonant/vowel compositions)? Is this a peculiarity of the
>   Unicode standard? Or should there be more general functions that
>   handle composition/decomposition of compound characters?

Hangul are huge (11K+ codepoints) and they have a specific algorithmic decomposition rule. Also, they are a hell of a special case in the Unicode standard (one that is mentioned in one chapter only!). Everything else is table-driven.

> - Why are many of the category checking functions @system rather than
>   @safe? It would seem rather crippling to me if much of std.uni can't
>   be used from @safe code!

@system? Well, these got to be either @trusted or @safe.

> - It would also be nice if these functions can be made pure (I'm not
>   sure I understand why checking the category of a character should be
>   impure).

Global sophisticated immutable table. I think this can be pure, but the compiler might have disagreed.

> The lack of nothrow I can understand, since the input
> character may be illegal or otherwise malformed.

nothrow is probably doable, as any codepoint will do (and if it's greater than dchar.max then it's a problem somewhere else).

TBH I killed these qualifiers early on as they prevented the thing from compiling. I can't recall if there is still a problem with any of them.

> But @nothrow pure
> seems to me to be necessary for all category-checking functions.

-- Dmitry Olshansky
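Expanding Dmitry's two-liner into a slightly fuller sketch (same assumed API: `unicode` category lookup returning a set with `opIndex`; U+0028 '(' is Ps and U+0029 ')' is Pe in the Unicode general categories):

```d
import uni;

void main()
{
    auto ps = unicode.Ps;   // Open_Punctuation only
    auto pe = unicode.Pe;   // Close_Punctuation only
    assert(ps['(']);        // '(' is Ps
    assert(!ps[')']);       // ')' is Pe, not Ps
    assert(pe[')']);
}
```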
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Walter Bright

12-Jan-2013 00:50, Walter Bright wrote:
> On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
>> Anyway it's polished and ready for the good old collective destruction
>> called peer review. I'm looking for a review manager.
>
> Thank you for doing this. Some notes:

Great bashing, thanks :)

> struct InversionList:
>
> 1. It implements length and empty, but otherwise does not seem to be
> a range, though its name implies it is a container.

Containers also have length and can be empty. The ranges on top of it are .byChar and .byInterval.

> 2. The description "InversionList is a packed interval-based data
> structure that represents a set of codepoints." completely baffles me.

OK, it was probably a bad idea to combine an implementation detail and the general description in one statement.

How about: "InversionList is a set of codepoints, optimized for size. It represents a set as an array of intervals, using 6 bytes per interval of codepoints."

Better?

> 3. The table for opBinary appears twice.
>
> 4. toSourceCode says "if the codepoint". If what codepoint? And what
> funcName is ""?
>
> mapTrieIndex:
>
> I have no idea what this does from the description. Is it a map like
> in std.algorithms? I.e. does it work with ranges? Need an example.
>
> TrieBuilder:
>
> 1. No description. I'm completely baffled.
>
> 2. putRange? What does that do?
>
> General lack of examples.
>
> Trie:
>
> Should have a wikipedia link to what a trie is.
>
> buildTrie:
>
> Contains a number of functions that return auto, but no mention of
> what is returned. While I like auto, in these cases it is not helpful,
> because the user needs to know what type is returned.

Trust me, you won't like it when you see it :) That's part of the reason it's hidden. But you are correct that the documentation on these artifacts is *ehm* ... sketchy. Will fix/update.

> Grapheme:
>
> Should provide an InputRange so a user can iterate over its
> codepoints. It already appears to provide part of the range interface.

Container != range. Just slice it and an RA range is in your hands. I need to put this info somewhere prominent, I guess. BTW

foreach(v; Grapheme("abc")) { ... }

should work because of automatic slicing.

> sicmp and icmp:
>
> No reason one should have "in" and the other "inout". Both should
> be "const".
>
> combiningClass
>
> What is "the canonicial combining class" ?
>
> General note:
>
> Needs a section that briefly defines terms, with links to external
> documentation.

All to the point, thanks.

-- Dmitry Olshansky
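Based on Dmitry's description (Grapheme is a container; slicing yields a random-access range of its codepoints), a hedged sketch; the constructor and `opIndex` shown here are assumed from the linked documentation:

```d
import uni;

void main()
{
    auto g = Grapheme("abc");
    assert(g.length == 3);
    assert(g[0] == 'a');     // random access via opIndex

    // Explicit slice: iterate the stored codepoints as a range.
    foreach (dchar c; g[])
        assert(c == 'a' || c == 'b' || c == 'c');
}
```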
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Dmitry Olshansky

On Sat, Jan 12, 2013 at 01:06:30AM +0400, Dmitry Olshansky wrote:
> 12-Jan-2013 00:35, H. S. Teoh writes:
> >On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
[...]
> >>Anyway it's polished and ready for the good old collective
> >>destruction called peer review. I'm looking for a review manager.
> >>
> >>The code, including extra tests and a benchmark is here:
> >>https://github.com/blackwhale/gsoc-bench-2012
> >>
> >>And documentation:
> >>http://blackwhale.github.com/phobos/uni.html
[...]
> >>2) The commonly expected stuff in any modern Unicode-aware language:
> >>normalization, grapheme decoding, composition/decomposition and
> >>case-insensitive comparison*.
> >>
> >>3) I've taken it as a crucial point to provide all of the tools used
> >>to build Unicode algorithms from ground up to the end user.
> >
> >Are there enough tools to implement line-breaking algorithms (TR14)?
>
> Should be. AFAIK _word_ breaking seemed trivial to do, got to check
> the line breaking. Line-breaking is in TR14.

I looked over it. It's definitely more complicated than I expected it to be. Some alphabets may even require stylistic rules for line-breaking! (Not to mention hyphenation.) So it's probably too much to expect std.uni to do it. But at least the tools necessary to implement it should be there.

[...]
> >One question though: how do I check a character in a specific unicode
> >category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for
> >the specific category, and I assume it's overkill to have one
> >function per category, so I presume it should be possible to do this
> >using the codepoint sets? Can you add an example of this to the docs,
> >if this is possible?
>
> auto psSet = unicode.Ps;
> assert(psSet['(']); //should pass

It would be nice to have a list of possible categories in the docs for unicode.opDispatch/opCall, at least the common categories, with a hyperlink to unicode.org (or wherever) that provides the full list. Currently, it's unclear, beyond the single example given, what the possibilities are.

[...]
> >- We should at least briefly describe what the various Unicode
> >  normalization forms mean (yes I know they can be looked up on
> >  unicode.org and various other online resources, but having them
> >  right there in the docs makes the docs so much more helpful).
> >
> >- Why are the deprecated functions still in there? The deprecation
> >  messages say "It will be removed in August 2012", which is already
> >  past. So they should be removed by now.
>
> Will do before Walter chimes in with his motto of keeping clutter just
> in case.

In that case, I would keep the functions (aliases?) but remove them from DDoc. So old code will still silently work, but the functions will no longer be officially supported, there will be no documentation for them, and they can disappear anytime.

> >- Is there a reason there's a composeJamo / hangulDecompose, but no
> >  other writing-system-specific functions (such as functions to deal
> >  with Thai consonant/vowel compositions)? Is this a peculiarity of
> >  the Unicode standard? Or should there be more general functions
> >  that handle composition/decomposition of compound characters?
>
> Hangul are huge (11K+) and they have a specific algorithmic
> decomposition rule. Also they are a hell of a special case in the
> Unicode standard (that is mentioned in one chapter only!). Everything
> else is table-driven.

OK. Makes sense, I guess. Though it gave me the impression that std.uni has some undue bias for Korean text. ;-)

[...]
> >- It would also be nice if these functions can be made pure (I'm not
> >  sure I understand why checking the category of a character should
> >  be impure).
>
> Global sophisticated immutable table. I think this can be pure but the
> compiler might have disagreed.

Hmm. This sounds like a bug in purity, maybe? It seems clear to me that immutable values might as well be constants, and therefore using said values should not make the code impure, even if they are implemented as a global table. I think a bug should be filed for this.

> >  The lack of nothrow I can understand, since the input character may
> >  be illegal or otherwise malformed.
>
> nothrow is probably doable as any codepoint will do (and if it's
> dchar.max then it's a problem somewhere else).

True, an illegal character by definition does not belong to any category, so all category functions should return false for it. No exceptions need to be thrown.

> TBH I've killed these qualifiers early on as it prevented the thing to
> compile. I can't recall if there is still a problem with any of these.
[...]

Before merging into Phobos, I would think we should put (as many of) the qualifiers back (as possible). Preferably everywhere they make sense, barring compiler bugs.

T

--
The state pretends to pay us a salary, and we pretend to work.
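The "specific algorithmic decomposition rule" for Hangul that Dmitry mentions is pure arithmetic, published in the Unicode standard (chapter 3.12, Conjoining Jamo Behavior); that is why std.uni can special-case it instead of storing 11,172 table entries. A short illustrative Python sketch (not the std.uni implementation; constants are the standard's):

```python
# Hangul syllable decomposition per The Unicode Standard, ch. 3.12.
# Every precomposed syllable maps to jamo by arithmetic alone.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables in total

def decompose_hangul(cp):
    """Return the jamo sequence for a precomposed syllable, else [cp]."""
    s = cp - S_BASE
    if not 0 <= s < S_COUNT:
        return [cp]                       # not a precomposed Hangul syllable
    l = L_BASE + s // N_COUNT             # leading consonant
    v = V_BASE + (s % N_COUNT) // T_COUNT # vowel
    t = T_BASE + s % T_COUNT              # trailing consonant, if any
    return [l, v] if t == T_BASE else [l, v, t]

# U+D55C (HAN) decomposes to HIEUH + A + NIEUN:
assert decompose_hangul(0xD55C) == [0x1112, 0x1161, 0x11AB]
# LV-only syllable U+AC00 (GA) has no trailing consonant:
assert decompose_hangul(0xAC00) == [0x1100, 0x1161]
```

Composition is the same arithmetic run backwards, which is presumably what composeJamo does.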
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Dmitry Olshansky

On 1/11/2013 1:21 PM, Dmitry Olshansky wrote:
> 12-Jan-2013 00:50, Walter Bright writes:
>> On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
>>> Anyway it's polished and ready for the good old collective
>>> destruction called peer review. I'm looking for a review manager.
>>
>> Thank you for doing this. Some notes:
>
> Great bashing, thanks :)

Anytime!

>> struct InversionList:
>>
>> 1. It implements length and empty, but otherwise does not seem to be
>> a range, though its name implies it is a container.
>
> Containers also have length and can be empty. Ranges on top of it are
> .byChar, .byInterval.

Some more text about accessing it as a range would be helpful. Being a set of "codepoints" yet accessing it "by char" makes no sense. In Unicode, the terms "char", "codepoint", "octet", "grapheme", etc., are bandied about and are often conflated and confused with each other. I suggest a brief section defining those terms, and then scrupulously, consistently, anally, and nitpickingly adhering to them in the documentation and names of things.

>> 2. The description "InversionList is a packed interval-based data
>> structure that represents a set of codepoints." completely baffles
>> me.
>
> OK, probably it was a bad idea to combine implementation detail and
> general description in one statement.
>
> How about: InversionList is a set of codepoints, optimized for size.
> It represents a set as an array of intervals, using 6 bytes per
> interval of codepoints.
>
> Better?

The name says "list" and yet it is described as a "set". Being a "list" implies that it is a linked list. BTW, what is an "interval"?

>> 3. The table for opBinary appears twice.
>>
>> 4. toSourceCode says "if the codepoint". If what codepoint? And what
>> funcName is ""?
>>
>> mapTrieIndex:
>>
>> I have no idea what this does from the description. Is it a map like
>> in std.algorithm? I.e. does it work with ranges? Need an example.
>>
>> TrieBuilder:
>>
>> 1. No description. I'm completely baffled.
>>
>> 2. putRange? What does that do?
>>
>> General lack of examples.
>>
>> Trie:
>>
>> Should have a Wikipedia link to what a trie is.
>>
>> buildTrie:
>>
>> Contains a number of functions that return auto, but no mention of
>> what is returned. While I like auto, in these cases it is not
>> helpful, because the user needs to know what type is returned.
>
> Trust me, you won't like it when you see it :) Part of the reason it's
> hidden. But you are correct that documentation on these artifacts is
> *ehm* ... sketchy. Will fix/update.

If the compiler type is hard to grok, OK, but at least the documentation should make it clear what is returned.

>> Grapheme:
>>
>> Should provide an InputRange so a user can iterate over its
>> codepoints. It already appears to provide part of the range
>> interface.
>
> Container != range. Just slice it and an RA range is in your hands. I
> need to put this info somewhere prominent, I guess.
>
> BTW foreach(v; Grapheme("abc")) { ... }
> should work because of automatic slicing.

Yes, but is the resulting v an octet, char, codepoint ????

> All to the point, thanks.

Welcs!
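To make the "what is an 'interval'?" question concrete: an inversion list stores a codepoint set as a flat sorted array of interval boundaries, and membership is decided by the parity of a binary-search position. A toy Python sketch of the idea (hypothetical and illustrative only; std.uni's InversionList additionally packs each interval into 6 bytes, which this sketch does not attempt), mirroring the psSet['('] check from earlier in the thread:

```python
from bisect import bisect_right

class InversionList:
    """Toy codepoint set stored as sorted interval boundaries.

    A codepoint is a member iff an odd number of boundaries lie at or
    below it -- i.e. it falls inside one of the half-open intervals.
    """
    def __init__(self, *intervals):
        # intervals are half-open [start, end) pairs, given in sorted order
        self.bounds = [b for iv in intervals for b in iv]

    def __contains__(self, cp):
        return bisect_right(self.bounds, cp) % 2 == 1

# A tiny slice of the Ps (open punctuation) category -- these three
# assignments are real, but the full category has many more intervals:
ps = InversionList((0x28, 0x29), (0x5B, 0x5C), (0x7B, 0x7C))
assert ord('(') in ps
assert ord(')') not in ps   # ')' is Pe (close punctuation), not Ps
```

The appeal of the representation is that set operations (union, intersection, complement) become linear merges of two boundary arrays, and complement in particular is just adding or removing a boundary at each end.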
Copyright © 1999-2021 by the D Language Foundation