Thread overview
January 11, 2013 Ready for review: new std.uni
It's long overdue to present the work I did during the last GSOC; it was a summer, not winter, of code after all. Unfortunately some compiler bugs, a new job :) and unrelated events of importance have postponed its release beyond measure. Anyway, it's polished and ready for the good old collective destruction called peer review. I'm looking for a review manager.

The code, including extra tests and a benchmark, is here:
https://github.com/blackwhale/gsoc-bench-2012

And documentation:
http://blackwhale.github.com/phobos/uni.html

And with that I'm asking if there is anything ahead of it in the review queue.

Currently I prefer to keep it standalone, so read 'uni' everywhere as 'std.uni'. On the plus side this makes it easy to try the new 'uni' and compare it with the old one without re-compiling Phobos. To use it in place of std.uni, replace 'std.uni' -> 'uni' in your programs and compare the results. Just make sure both the uni and unicode_tables modules are linked in; usually rdmd can take care of this dependency.

The list of new functionality is quite large, so I'll point out the major sections and leave the rest to the documentation. In general there are 3 angles to the new std.uni:

1) The same stuff, but better and faster. For one thing, the isXXX classification functions are brought up to date with Unicode 6.2 and are up to 5-6x faster on non-ASCII text.

2) The commonly expected stuff in any modern Unicode-aware language: normalization, grapheme decoding, composition/decomposition and case-insensitive comparison*.

3) I've taken it as a crucial point to provide all of the tools used to build Unicode algorithms from the ground up to the end user.
Thus all generally useful data structures used to implement the library internals are accessible to 'mortals' too:
- a type for manipulating sets of codepoints, with full set algebra
- a construction for generating fast multi-stage lookup tables (Trie)
- a ton of predefined sets to construct your own specific ones from

There is extra candy for meta-programming text processing libraries: a set type can generate the source code of an unrolled, hard-coded binary search for a given set of codepoints.

Among other things, the entire collection of data required is generated automatically by downloading from unicode.org. The tool relies on the same foundation (3), and for the most part this version of std.uni should be trivially updated to new versions of the standard (see the gen_uni.d script).

* The only missing 'big' thing is the collation algorithm. At this point I'm proposing to just move the large chunk of new std.uni in place. This way potential contributors will have the tools to implement the missing bits later on.

P.S. CodepointSet could easily be adjusted to serve as a generic integer set type, and Trie already supports far more than codepoints->values mappings. These should probably be enhanced and later adopted into std.container(2).

-- Dmitry Olshansky
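For a taste of the set algebra described above, a minimal sketch (the `unicode` property-style lookup is shown later in this thread; the specific script names and operators here are taken from the linked documentation, so treat this as illustrative rather than definitive):

```d
// Sketch only: assumes the standalone 'uni' and 'unicode_tables'
// modules are linked in, as described above.
import uni;

void main()
{
    // Predefined sets combine with full set algebra:
    // '|' is union, '&' is intersection, '-' is difference.
    auto letters = unicode.Cyrillic | unicode.Greek;
    assert(letters['\u0410']);   // CYRILLIC CAPITAL LETTER A
    assert(!letters['a']);       // Latin 'a' is in neither set
}
```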
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Dmitry Olshansky

On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
[...]
> Anyway it's polished and ready for the good old collective destruction
> called peer review. I'm looking for a review manager.
>
> The code, including extra tests and a benchmark is here:
> https://github.com/blackwhale/gsoc-bench-2012
>
> And documentation: http://blackwhale.github.com/phobos/uni.html

Excellent!! It looks pretty impressive. We should definitely try our best to get this into Phobos.

[...]
> In general there are 3 angles to the new std.uni:
>
> 1) The same stuff but better and faster. For one thing isXXX
> classification functions are brought up to date with Unicode 6.2 and
> are up to 5-6x times faster on non-ASCII text.

And there is a tool for auto-updating the data, correct? So in theory, barring major structural changes in the standard, we should be able to recompile Phobos (maybe after running the update tool) and it should be automatically updated to the latest Unicode release?

> 2) The commonly expected stuff in any modern Unicode-aware language:
> normalization, grapheme decoding, composition/decomposition and
> case-insensitive comparison*.
>
> 3) I've taken it as a crucial point to provide all of the tools used
> to build Unicode algorithms from ground up to the end user.

Are there enough tools to implement line-breaking algorithms (TR14)?

> Thus all generally useful data structures used to implement the
> library internals are accessible for 'mortals' too:
> - a type for manipulating sets of codepoints, with full set algebra
> - a construction for generating fast multi-stage lookup tables (Trie)
> - a ton of predefined sets to construct your own specific ones from

I looked over the examples in the docs. Looks awesome!

One question though: how do I check a character in a specific unicode category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the specific category, and I assume it's overkill to have one function per category, so I presume it should be possible to do this using the codepoint sets? Can you add an example of this to the docs, if this is possible?

[...]
> Among other things the entire collection of data required is generated
> automatically by downloading from unicode.org. The tool relies on the
> same foundation (3) and for the most part this version of std.uni
> should be trivially updated to the new versions of the standard (see
> gen_uni.d script).

Excellent!

> * The only missing 'big' thing is the collation algorithm. At this
> point I'm proposing to just move the large chunk of new std.uni in
> place. This way potential contributors would have tools to implement
> missing bits later on.

But the aforementioned tools should be enough for someone to use as the basis for implementing the collation algorithm, right?

> P.S. CodepointSet could be easily adjusted to serve as generic integer
> set type and Trie already supports far more the codepoints->values
> mappings. These should probably be enhanced and later adopted to
> std.container(2).
[...]

Yes, having an efficient integer set type and a generic Trie type would be a huge plus to add to std.container. Although, isn't Andrei supposed to be working on a new std.container, with a better range-based API and custom allocator scheme? If that's not going to happen for a while yet, I say we should merge std.uni first, then worry about factoring Trie and the int set out later (we can always just alias them in std.uni when we move them; that should prevent user code breakage).

Now, some nitpicks:

- InversionList:

  - I don't like using ~ to mean symmetric set difference, because ~ already means concatenation in D, and it's confusing to overload it with an unrelated meaning. I propose using ^ instead, because symmetric set difference is analogous to XOR.

  - Why is the table of operators in opBinary's doc duplicated? Is the second table supposed to be something else?

- We should at least briefly describe what the various Unicode normalization forms mean (yes, I know they can be looked up on unicode.org and various other online resources, but having them right there in the docs makes the docs so much more helpful).

- Why are the deprecated functions still in there? The deprecation messages say "It will be removed in August 2012", which is already past. So they should be removed by now.

- Is there a reason there's a composeJamo / hangulDecompose, but no other writing-system-specific functions (such as functions to deal with Thai consonant/vowel compositions)? Is this a peculiarity of the Unicode standard? Or should there be more general functions that handle composition/decomposition of compound characters?

- Why are many of the category checking functions @system rather than @safe? It would seem rather crippling to me if much of std.uni can't be used from @safe code!

- It would also be nice if these functions could be made pure (I'm not sure I understand why checking the category of a character should be impure). The lack of nothrow I can understand, since the input character may be illegal or otherwise malformed. But @nothrow pure seems to me to be necessary for all category-checking functions.

T

-- The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst
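To illustrate the operator under discussion, a hypothetical sketch (the interval-pair constructor is taken from the linked docs; the reviewed module spells symmetric difference '~', and the '^' spelling is only this review's proposal):

```d
import uni;

void main()
{
    // Two overlapping sets; constructor args are [start, end) interval pairs.
    auto a = CodepointSet('a', 'g');
    auto b = CodepointSet('d', 'k');
    // Symmetric set difference: elements in exactly one of the two sets.
    auto sym = a ~ b;    // current spelling; the proposal is to write a ^ b
    assert(sym['b']);    // only in a
    assert(sym['h']);    // only in b
    assert(!sym['e']);   // in both, so excluded
}
```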
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Dmitry Olshansky

On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky wrote:
> The code, including extra tests and a benchmark is here:
> https://github.com/blackwhale/gsoc-bench-2012

Phew, that repository is quite a hefty checkout with all the example data.

Just for the fun of it, a completely irreproducible, unscientific benchmark (sorry, Andrei!) run on a Core i7 M 620 under Linux x86_64:

DMD 2.061
---
$ rdmd -O -release -noboundscheck -inline char_class.d enwiki-latest-all-titles-in-ns0
Creating level 1 of Tries.
Alpha:262144 Mark:262144 Number:262144 Symbol:262144
Creating level 2 of Tries.
Alpha:9472 Mark:6656 Number:6656 Symbol:6400
Creating level 3 of Tries.
Alpha:4000 Mark:2304 Number:2432 Symbol:2848
Creating level 4 of Tries.
Alpha:3376 Mark:1976 Number:1624 Symbol:2096
Time to build dchar array: 2702 ms

Baselines
noop [enwiki], 217.335, 888.99 M/s
new-std-alpha [enwiki], 1197.49, 161.34 M/s
new-std-mark [enwiki], 6374.73, 30.31 M/s
new-std-num [enwiki], 6290.89, 30.71 M/s
new-std-sym [enwiki], 6500.93, 29.72 M/s
new-std-white [enwiki], 822.245, 234.98 M/s
combining class [enwiki], 5948.63, 32.48 M/s
inv-alpha [enwiki], 10290, 18.78 M/s
inv-mark [enwiki], 9229.2, 20.93 M/s
inv-num [enwiki], 9046.48, 21.36 M/s
inv-sym [enwiki], 9843.59, 19.63 M/s

Tries of level 1
trie-alpha [enwiki], 2179.17, 88.66 M/s
trie-mark [enwiki], 1912.54, 101.02 M/s
trie-num [enwiki], 1924.92, 100.37 M/s
trie-sym [enwiki], 1911.22, 101.09 M/s

Tries of level 2
trie-alpha [enwiki], 4524.33, 42.70 M/s
trie-mark [enwiki], 4276.92, 45.17 M/s
trie-num [enwiki], 4308.42, 44.84 M/s
trie-sym [enwiki], 4298.68, 44.95 M/s

Tries of level 3
trie-alpha [enwiki], 6407.64, 30.15 M/s
trie-mark [enwiki], 6150.8, 31.41 M/s
trie-num [enwiki], 6095.96, 31.69 M/s
trie-sym [enwiki], 6253.7, 30.89 M/s

Tries of level 4
trie-alpha [enwiki], 8329.66, 23.20 M/s
trie-mark [enwiki], 7993.11, 24.17 M/s
trie-num [enwiki], 8040.94, 24.03 M/s
trie-sym [enwiki], 8033.07, 24.05 M/s
---
LDC (2.061 merge branch):
---
$ rdmd --compiler=ldmd2 -O3 -release -noboundscheck -inline char_class.d enwiki-latest-all-titles-in-ns0
Alpha:262144 Mark:262144 Number:262144 Symbol:262144
Creating level 2 of Tries.
Alpha:9472 Mark:6656 Number:6656 Symbol:6400
Creating level 3 of Tries.
Alpha:4000 Mark:2304 Number:2432 Symbol:2848
Creating level 4 of Tries.
Alpha:3376 Mark:1976 Number:1624 Symbol:2096
Time to build dchar array: 1962 ms

Baselines
noop [enwiki], 211.516, 913.44 M/s
new-std-alpha [enwiki], 612.324, 315.53 M/s
new-std-mark [enwiki], 577.134, 334.77 M/s
new-std-num [enwiki], 600.996, 321.48 M/s
new-std-sym [enwiki], 577.433, 334.60 M/s
new-std-white [enwiki], 381.535, 506.40 M/s
combining class [enwiki], 517.231, 373.54 M/s
inv-alpha [enwiki], 8181.25, 23.62 M/s
inv-mark [enwiki], 6938.09, 27.85 M/s
inv-num [enwiki], 6347.71, 30.44 M/s
inv-sym [enwiki], 6930.7, 27.88 M/s

Tries of level 1
trie-alpha [enwiki], 275.741, 700.69 M/s
trie-mark [enwiki], 274.972, 702.65 M/s
trie-num [enwiki], 274.558, 703.70 M/s
trie-sym [enwiki], 277.255, 696.86 M/s

Tries of level 2
trie-alpha [enwiki], 453.453, 426.08 M/s
trie-mark [enwiki], 452.912, 426.59 M/s
trie-num [enwiki], 455.004, 424.63 M/s
trie-sym [enwiki], 453.492, 426.04 M/s

Tries of level 3
trie-alpha [enwiki], 577.442, 334.59 M/s
trie-mark [enwiki], 579.054, 333.66 M/s
trie-num [enwiki], 577.95, 334.30 M/s
trie-sym [enwiki], 577.566, 334.52 M/s

Tries of level 4
trie-alpha [enwiki], 778.261, 248.26 M/s
trie-mark [enwiki], 778.943, 248.04 M/s
trie-num [enwiki], 778.352, 248.23 M/s
trie-sym [enwiki], 779.442, 247.88 M/s
---

Haven't had a look at what is going on there yet; it could well be that some measurement loop is optimized out altogether.

David
January 11, 2013 Re: Ready for review: new std.uni
On Fri, Jan 11, 2013 at 12:35:42PM -0800, H. S. Teoh wrote:
[...]
> - It would also be nice if these functions can be made pure (I'm not
>   sure I understand why checking the category of a character should be
>   impure). The lack of nothrow I can understand, since the input
>   character may be illegal or otherwise malformed. But @nothrow pure
>   seems to me to be necessary for all category-checking functions.

Gah, I meant @safe pure.

T

-- Lawyer: (n.) An innocence-vending machine, the effectiveness of which depends on how much money is inserted.
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Dmitry Olshansky

On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
> Anyway it's polished and ready for the good old collective destruction
> called peer review. I'm looking for a review manager.
Thank you for doing this. Some notes:
struct InversionList:
1. It implements length and empty, but otherwise does not seem to be a range, though its name implies it is a container.
2. The description "InversionList is a packed interval-based data structure that represents a set of codepoints." completely baffles me.
3. The table for opBinary appears twice.
4. toSourceCode says "if the codepoint". If what codepoint? And what funcName is ""?
mapTrieIndex:
I have no idea what this does from the description. Is it a map like in std.algorithm? I.e. does it work with ranges? Need an example.
TrieBuilder:
1. No description. I'm completely baffled.
2. putRange? What does that do?
General lack of examples.
Trie:
Should have a wikipedia link to what a trie is.
buildTrie:
Contains a number of functions that return auto, but no mention of what is returned. While I like auto, in these cases it is not helpful, because the user needs to know what type is returned.
Grapheme:
Should provide an InputRange so a user can iterate over its codepoints. It already appears to provide part of the range interface.
sicmp and icmp:
No reason one should have "in" and the other "inout". Both should be "const".
combiningClass
What is "the canonical combining class"?
General note:
Needs a section that briefly defines terms, with links to external documentation.
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to David Nadlinger

12-Jan-2013 00:38, David Nadlinger wrote:
> On Friday, 11 January 2013 at 19:31:13 UTC, Dmitry Olshansky wrote:
>> The code, including extra tests and a benchmark is here:
>> https://github.com/blackwhale/gsoc-bench-2012
>
> Phew, that repository is quite a hefty checkout with all the example data.
>
> Just for the fun of it, a completely irreproducible, unscientific
> benchmark (sorry, Andrei!) run on a Core i7 M 620 under Linux x86_64:
>
> DMD 2.061
> ---
> $ rdmd -O -release -noboundscheck -inline char_class.d
[snip]

I'd rather use dewiki or arwiki; these give more coverage. Otherwise you mostly check the presence of the ASCII special casing (only isAlpha and isWhite have it).

> Time to build dchar array: 2702 ms
>
> Baselines
> noop [enwiki], 217.335, 888.99 M/s
[snip]
> LDC (2.061 merge branch):
> ---
> $ rdmd --compiler=ldmd2 -O3 -release -noboundscheck -inline char_class.d
> enwiki-latest-all-titles-in-ns0
[snip]
> Haven't had a look at what is going on there yet – it could well be
> that some measurement loop is optimized out altogether.

You can print the total counts after each bench; there is a TLS variable written at the end of it. But anyway, I like your numbers! :)

> David

-- Dmitry Olshansky
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to H. S. Teoh

12-Jan-2013 00:35, H. S. Teoh wrote:
> On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
> [...]
>> Anyway it's polished and ready for the good old collective
>> destruction called peer review. I'm looking for a review manager.
>>
>> The code, including extra tests and a benchmark is here:
>> https://github.com/blackwhale/gsoc-bench-2012
>>
>> And documentation:
>> http://blackwhale.github.com/phobos/uni.html
>
> Excellent!! It looks pretty impressive. We should definitely try our
> best to get this into Phobos.
>
> [...]
>> In general there are 3 angles to the new std.uni:
>>
>> 1) The same stuff but better and faster. For one thing isXXX
>> classification functions are brought up to date with Unicode 6.2 and
>> are up to 5-6x times faster on non-ASCII text.
>
> And there is a tool for auto-updating the data, correct? So in theory,
> barring major structural changes in the standard, we should be able to
> recompile Phobos (maybe after running the update tool) and it should be
> automatically updated to the latest Unicode release?

Yup, except that they tweak the algorithms all along. The normalization algorithm, for instance, is full of erratas and minor notes accumulated over time. I'll probably do a write-up on how to implement it and where to find all the related info.

>> 2) The commonly expected stuff in any modern Unicode-aware language:
>> normalization, grapheme decoding, composition/decomposition and
>> case-insensitive comparison*.
>>
>> 3) I've taken it as a crucial point to provide all of the tools used
>> to build Unicode algorithms from ground up to the end user.
>
> Are there enough tools to implement line-breaking algorithms (TR14)?

Should be. AFAIK _word_ breaking seemed trivial to do; got to check the line breaking.

>> Thus all generally useful data structures used to implement the
>> library internals are accessible for 'mortals' too:
>> - a type for manipulating sets of codepoints, with full set algebra
>> - a construction for generating fast multi-stage lookup tables (Trie)
>> - a ton of predefined sets to construct your own specific ones from
>
> I looked over the examples in the docs. Looks awesome!
>
> One question though: how do I check a character in a specific unicode
> category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the
> specific category, and I assume it's overkill to have one function per
> category, so I presume it should be possible to do this using the
> codepoint sets? Can you add an example of this to the docs, if this is
> possible?

auto psSet = unicode.Ps;
assert(psSet['(']); // should pass

> [...]
>> Among other things the entire collection of data required is
>> generated automatically by downloading from unicode.org. The tool
>> relies on the same foundation (3) and for the most part this version
>> of std.uni should be trivially updated to the new versions of the
>> standard (see gen_uni.d script).
>
> Excellent!
>
>> * The only missing 'big' thing is the collation algorithm. At this
>> point I'm proposing to just move the large chunk of new std.uni in
>> place. This way potential contributors would have tools to implement
>> missing bits later on.
>
> But the aforementioned tools should be enough for someone to use as the
> basis for implementing the collation algorithm right?

I sure hope so. I've been looking through all of the TRs and found nothing surprising beyond more and more tricky codepoint sets and more and more weird rules to use them.

>> P.S. CodepointSet could be easily adjusted to serve as generic
>> integer set type and Trie already supports far more the
>> codepoints->values mappings. These should probably be enhanced and
>> later adopted to std.container(2).
> [...]
>
> Yes, having an efficient integer set type and generic Trie type would be
> a huge plus to add to std.container. Although, isn't Andrei supposed to
> be working on a new std.container, with a better range-based API and
> custom allocator scheme? If that's not going to happen for a while yet,
> I say we should merge std.uni first, then worry about factoring Trie and
> int set later (we can always just alias them in std.uni later when we
> move them, that should prevent user code breakage).
>
> Now, some nitpicks:

Thanks for the early feedback.

> - InversionList:
>
>   - I don't like using ~ to mean symmetric set difference, because ~
>     already means concatenation in D, and it's confusing to overload it
>     with an unrelated meaning. I propose using ^ instead, because
>     symmetric set difference is analogous to XOR.

Point taken; arguably '~' was a bad idea. But I don't like ^ either. Seems like it's the best candidate, though.

>   - Why is the table of operators in opBinary's doc duplicated twice?
>     Is the second table supposed to be something else?

Beats me. Must be some glitch with the BOOKTABLE macro (or how I use it). I think I've hit it before.

> - We should at least briefly describe what the various Unicode
>   normalization forms mean (yes I know they can be looked up on
>   unicode.org and various other online resources, but having them right
>   there in the docs makes the docs so much more helpful).
>
> - Why are the deprecated functions still in there? The deprecation
>   messages say "It will be removed in August 2012", which is already
>   past. So they should be removed by now.

Will do, before Walter chimes in with his motto of keeping clutter just in case.

> - Is there a reason there's a composeJamo / hangulDecompose, but no
>   other writing system specific functions (such as functions to deal
>   with Thai consonant/vowel compositions)? Is this a peculiarity of the
>   Unicode standard? Or should there be more general functions that
>   handle composition/decomposition of compound characters?

Hangul are huge (11K+ codepoints) and they have a specific algorithmic decomposition rule. Also, they are a hell of a special case in the Unicode standard (one that is mentioned in one chapter only!). Everything else is table-driven.

> - Why are many of the category checking functions @system rather than
>   @safe? It would seem rather crippling to me if much of std.uni can't
>   be used from @safe code!

@system? Well, these got to be either @trusted or @safe.

> - It would also be nice if these functions can be made pure (I'm not
>   sure I understand why checking the category of a character should be
>   impure).

Global sophisticated immutable table. I think this can be pure, but the compiler might have disagreed.

> The lack of nothrow I can understand, since the input
> character may be illegal or otherwise malformed.

nothrow is probably doable, as any codepoint will do (and if it's greater than dchar.max then it's a problem somewhere else).

TBH I killed these qualifiers early on as they prevented the thing from compiling. I can't recall if there is still a problem with any of them.

> But @nothrow pure
> seems to me to be necessary for all category-checking functions.

-- Dmitry Olshansky
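Expanding Dmitry's two-liner into a slightly fuller sketch (same assumed API: `unicode` category lookup returning a set with `opIndex`; U+0028 '(' is Ps and U+0029 ')' is Pe in the Unicode general categories):

```d
import uni;

void main()
{
    auto ps = unicode.Ps;   // Open_Punctuation only
    auto pe = unicode.Pe;   // Close_Punctuation only
    assert(ps['(']);        // '(' is Ps
    assert(!ps[')']);       // ')' is Pe, not Ps
    assert(pe[')']);
}
```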
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Walter Bright

12-Jan-2013 00:50, Walter Bright wrote:
> On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
>> Anyway it's polished and ready for the good old collective destruction
>> called peer review. I'm looking for a review manager.
>
> Thank you for doing this. Some notes:

Great bashing, thanks :)

> struct InversionList:
>
> 1. It implements length and empty, but otherwise does not seem to be
> a range, though its name implies it is a container.

Containers also have length and can be empty. The ranges on top of it are .byChar and .byInterval.

> 2. The description "InversionList is a packed interval-based data
> structure that represents a set of codepoints." completely baffles me.

OK, it was probably a bad idea to combine an implementation detail and the general description in one statement.

How about: "InversionList is a set of codepoints, optimized for size. It represents a set as an array of intervals, using 6 bytes per interval of codepoints."

Better?

> 3. The table for opBinary appears twice.
>
> 4. toSourceCode says "if the codepoint". If what codepoint? And what
> funcName is ""?
>
> mapTrieIndex:
>
> I have no idea what this does from the description. Is it a map like
> in std.algorithms? I.e. does it work with ranges? Need an example.
>
> TrieBuilder:
>
> 1. No description. I'm completely baffled.
>
> 2. putRange? What does that do?
>
> General lack of examples.
>
> Trie:
>
> Should have a wikipedia link to what a trie is.
>
> buildTrie:
>
> Contains a number of functions that return auto, but no mention of
> what is returned. While I like auto, in these cases it is not helpful,
> because the user needs to know what type is returned.

Trust me, you won't like it when you see it :) That's part of the reason it's hidden. But you are correct that the documentation on these artifacts is *ehm* ... sketchy. Will fix/update.

> Grapheme:
>
> Should provide an InputRange so a user can iterate over its
> codepoints. It already appears to provide part of the range interface.

Container != range. Just slice it and an RA range is in your hands. I need to put this info somewhere prominent, I guess. BTW

foreach(v; Grapheme("abc")) { ... }

should work because of automatic slicing.

> sicmp and icmp:
>
> No reason one should have "in" and the other "inout". Both should
> be "const".
>
> combiningClass
>
> What is "the canonicial combining class" ?
>
> General note:
>
> Needs a section that briefly defines terms, with links to external
> documentation.

All to the point, thanks.

-- Dmitry Olshansky
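Based on Dmitry's description (Grapheme is a container; slicing yields a random-access range of its codepoints), a hedged sketch; the constructor and `opIndex` shown here are assumed from the linked documentation:

```d
import uni;

void main()
{
    auto g = Grapheme("abc");
    assert(g.length == 3);
    assert(g[0] == 'a');     // random access via opIndex

    // Explicit slice: iterate the stored codepoints as a range.
    foreach (dchar c; g[])
        assert(c == 'a' || c == 'b' || c == 'c');
}
```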
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Dmitry Olshansky

On Sat, Jan 12, 2013 at 01:06:30AM +0400, Dmitry Olshansky wrote:
> 12-Jan-2013 00:35, H. S. Teoh writes:
> >On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
[...]
> >>Anyway it's polished and ready for the good old collective
> >>destruction called peer review. I'm looking for a review manager.
> >>
> >>The code, including extra tests and a benchmark is here:
> >>https://github.com/blackwhale/gsoc-bench-2012
> >>
> >>And documentation:
> >>http://blackwhale.github.com/phobos/uni.html
[...]
> >>2) The commonly expected stuff in any modern Unicode-aware language:
> >>normalization, grapheme decoding, composition/decomposition and
> >>case-insensitive comparison*.
> >>
> >>3) I've taken it as a crucial point to provide all of the tools used
> >>to build Unicode algorithms from ground up to the end user.
> >
> >Are there enough tools to implement line-breaking algorithms (TR14)?
>
> Should be. AFAIK _word_ breaking seemed trivial to do, got to check
> the line breaking. Line-breaking is in TR14.

I looked over it. It's definitely more complicated than I expected it to be. Some alphabets may even require stylistic rules for line-breaking! (Not to mention hyphenation.) So it's probably too much to expect std.uni to do it. But at least the tools necessary to implement it should be there.

[...]
> >One question though: how do I check a character in a specific unicode
> >category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for
> >the specific category, and I assume it's overkill to have one
> >function per category, so I presume it should be possible to do this
> >using the codepoint sets? Can you add an example of this to the docs,
> >if this is possible?
>
> auto psSet = unicode.Ps;
> assert(psSet['(']); //should pass

It would be nice to have a list of possible categories in the docs for unicode.opDispatch/opCall, at least the common categories, with a hyperlink to unicode.org (or wherever) that provides the full list. Currently, it's unclear, beyond the single example given, what the possibilities are.

[...]
> >- We should at least briefly describe what the various Unicode
> >  normalization forms mean (yes I know they can be looked up on
> >  unicode.org and various other online resources, but having them
> >  right there in the docs makes the docs so much more helpful).
> >
> >- Why are the deprecated functions still in there? The deprecation
> >  messages say "It will be removed in August 2012", which is already
> >  past. So they should be removed by now.
>
> Will do before Walter chimes in with his motto of keeping clutter just
> in case.

In that case, I would keep the functions (aliases?) but remove them from DDoc. So old code will still silently work, but the functions will no longer be officially supported, there will be no documentation for them, and they can disappear anytime.

> >- Is there a reason there's a composeJamo / hangulDecompose, but no
> >  other writing-system-specific functions (such as functions to deal
> >  with Thai consonant/vowel compositions)? Is this a peculiarity of
> >  the Unicode standard? Or should there be more general functions
> >  that handle composition/decomposition of compound characters?
>
> Hangul are huge (11K+) and they have a specific algorithmic
> decomposition rule. Also they are a hell of a special case in the
> Unicode standard (that is mentioned in one chapter only!). Everything
> else is table-driven.

OK. Makes sense, I guess. Though it gave me the impression that std.uni has some undue bias for Korean text. ;-)

[...]
> >- It would also be nice if these functions can be made pure (I'm not
> >  sure I understand why checking the category of a character should
> >  be impure).
>
> Global sophisticated immutable table. I think this can be pure but the
> compiler might have disagreed.

Hmm. This sounds like a bug in purity, maybe? It seems clear to me that immutable values might as well be constants, and therefore using said values should not make the code impure, even if they are implemented as a global table. I think a bug should be filed for this.

> >  The lack of nothrow I can understand, since the input character may
> >  be illegal or otherwise malformed.
>
> nothrow is probably doable as any codepoint will do (and if it's
> dchar.max then it's a problem somewhere else).

True, an illegal character by definition does not belong to any category, so all category functions should return false for it. No exceptions need to be thrown.

> TBH I've killed these qualifiers early on as it prevented the thing to
> compile. I can't recall if there is still a problem with any of these.
[...]

Before merging into Phobos, I would think we should put (as many of) the qualifiers back (as possible). Preferably everywhere they make sense, barring compiler bugs.

T

--
The state pretends to pay us a salary, and we pretend to work.
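The "specific algorithmic decomposition rule" for Hangul that Dmitry mentions is pure arithmetic, published in the Unicode standard (chapter 3.12, Conjoining Jamo Behavior); that is why std.uni can special-case it instead of storing 11,172 table entries. A short illustrative Python sketch (not the std.uni implementation; constants are the standard's):

```python
# Hangul syllable decomposition per The Unicode Standard, ch. 3.12.
# Every precomposed syllable maps to jamo by arithmetic alone.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables in total

def decompose_hangul(cp):
    """Return the jamo sequence for a precomposed syllable, else [cp]."""
    s = cp - S_BASE
    if not 0 <= s < S_COUNT:
        return [cp]                       # not a precomposed Hangul syllable
    l = L_BASE + s // N_COUNT             # leading consonant
    v = V_BASE + (s % N_COUNT) // T_COUNT # vowel
    t = T_BASE + s % T_COUNT              # trailing consonant, if any
    return [l, v] if t == T_BASE else [l, v, t]

# U+D55C (HAN) decomposes to HIEUH + A + NIEUN:
assert decompose_hangul(0xD55C) == [0x1112, 0x1161, 0x11AB]
# LV-only syllable U+AC00 (GA) has no trailing consonant:
assert decompose_hangul(0xAC00) == [0x1100, 0x1161]
```

Composition is the same arithmetic run backwards, which is presumably what composeJamo does.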
January 11, 2013 Re: Ready for review: new std.uni
Posted in reply to Dmitry Olshansky

On 1/11/2013 1:21 PM, Dmitry Olshansky wrote:
> 12-Jan-2013 00:50, Walter Bright writes:
>> On 1/11/2013 11:31 AM, Dmitry Olshansky wrote:
>>> Anyway it's polished and ready for the good old collective
>>> destruction called peer review. I'm looking for a review manager.
>>
>> Thank you for doing this. Some notes:
>
> Great bashing, thanks :)

Anytime!

>> struct InversionList:
>>
>> 1. It implements length and empty, but otherwise does not seem to be
>> a range, though its name implies it is a container.
>
> Containers also have length and can be empty. Ranges on top of it are
> .byChar, .byInterval.

Some more text about accessing it as a range would be helpful. Being a set of "codepoints" yet accessing it "by char" makes no sense. In Unicode, the terms "char", "codepoint", "octet", "grapheme", etc., are bandied about and are often conflated and confused with each other. I suggest a brief section defining those terms, and then scrupulously, consistently, anally, and nitpickingly adhering to them in the documentation and names of things.

>> 2. The description "InversionList is a packed interval-based data
>> structure that represents a set of codepoints." completely baffles
>> me.
>
> OK, probably it was a bad idea to combine implementation detail and
> general description in one statement.
>
> How about: InversionList is a set of codepoints, optimized for size.
> It represents a set as an array of intervals, using 6 bytes per
> interval of codepoints.
>
> Better?

The name says "list" and yet it is described as a "set". Being a "list" implies that it is a linked list. BTW, what is an "interval"?

>> 3. The table for opBinary appears twice.
>>
>> 4. toSourceCode says "if the codepoint". If what codepoint? And what
>> funcName is ""?
>>
>> mapTrieIndex:
>>
>> I have no idea what this does from the description. Is it a map like
>> in std.algorithm? I.e. does it work with ranges? Need an example.
>>
>> TrieBuilder:
>>
>> 1. No description. I'm completely baffled.
>>
>> 2. putRange? What does that do?
>>
>> General lack of examples.
>>
>> Trie:
>>
>> Should have a Wikipedia link to what a trie is.
>>
>> buildTrie:
>>
>> Contains a number of functions that return auto, but no mention of
>> what is returned. While I like auto, in these cases it is not
>> helpful, because the user needs to know what type is returned.
>
> Trust me, you won't like it when you see it :) Part of the reason it's
> hidden. But you are correct that documentation on these artifacts is
> *ehm* ... sketchy. Will fix/update.

If the compiler type is hard to grok, OK, but at least the documentation should make it clear what is returned.

>> Grapheme:
>>
>> Should provide an InputRange so a user can iterate over its
>> codepoints. It already appears to provide part of the range
>> interface.
>
> Container != range. Just slice it and an RA range is in your hands. I
> need to put this info somewhere prominent, I guess.
>
> BTW foreach(v; Grapheme("abc")) { ... }
> should work because of automatic slicing.

Yes, but is the resulting v an octet, char, codepoint ????

> All to the point, thanks.

Welcs!
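To make the "what is an 'interval'?" question concrete: an inversion list stores a codepoint set as a flat sorted array of interval boundaries, and membership is decided by the parity of a binary-search position. A toy Python sketch of the idea (hypothetical and illustrative only; std.uni's InversionList additionally packs each interval into 6 bytes, which this sketch does not attempt), mirroring the psSet['('] check from earlier in the thread:

```python
from bisect import bisect_right

class InversionList:
    """Toy codepoint set stored as sorted interval boundaries.

    A codepoint is a member iff an odd number of boundaries lie at or
    below it -- i.e. it falls inside one of the half-open intervals.
    """
    def __init__(self, *intervals):
        # intervals are half-open [start, end) pairs, given in sorted order
        self.bounds = [b for iv in intervals for b in iv]

    def __contains__(self, cp):
        return bisect_right(self.bounds, cp) % 2 == 1

# A tiny slice of the Ps (open punctuation) category -- these three
# assignments are real, but the full category has many more intervals:
ps = InversionList((0x28, 0x29), (0x5B, 0x5C), (0x7B, 0x7C))
assert ord('(') in ps
assert ord(')') not in ps   # ')' is Pe (close punctuation), not Ps
```

The appeal of the representation is that set operations (union, intersection, complement) become linear merges of two boundary arrays, and complement in particular is just adding or removing a boundary at each end.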
Copyright © 1999-2021 by the D Language Foundation