Ready for review: new std.uni (page 3)

12-Jan-2013 19:01, Jacob Carlborg пишет: > On 2013-01-11 20:31, Dmitry Olshansky wrote: >> It's been long over due to present the work I did during the last GSOC >> as it was a summer not winter of code after all. Unfortunately some >> compiler bugs, a new job :) and unrelated events of importance have >> postponed its release beyond measure. >> > > I assume that "sicmp" is faster than "icmp"? In that case you might want > to mention that in the documentation. > Yeah, good point. -- Dmitry Olshansky

On 1/12/2013 8:26 AM, Dmitry Olshansky wrote: > I'll make another pass and add more examples, glossary > and so on. You can also try the ACRONYM macro: ACRONYM=<acronym title="$+">$1</acronym> ($+) If you mouse over an acronym, a little popup text will appear.

On 1/12/2013 11:57 AM, Walter Bright wrote: > On 1/12/2013 8:26 AM, Dmitry Olshansky wrote: >> I'll make another pass and add more examples, glossary >> and so on. > > You can also try the ACRONYM macro: > > ACRONYM=<acronym title="$+">$1</acronym> ($+) > > If you mouse over an acronym, a little popup text will appear. More info: http://html5doctor.com/the-abbr-element/ Although "octet", for example, isn't exactly an abbreviation.

On 1/11/2013 11:31 AM, Dmitry Olshansky wrote: > And documentation: > http://blackwhale.github.com/phobos/uni.html struct InversionList: length() returns "number of characters", while "empty()" returns true if there are no codepoints. Aren't the two redundant? And should it be characters or codepoints?

13-Jan-2013 03:39, Walter Bright пишет: > On 1/11/2013 11:31 AM, Dmitry Olshansky wrote: >> And documentation: >> http://blackwhale.github.com/phobos/uni.html > > struct InversionList: > > > length() returns "number of characters", while "empty()" returns true if > there are no codepoints. > > Aren't the two redundant? Containers commonly have both. But maybe I should kill length on different grounds because it's computed by walking the array of intervals [a, b) and summing up (b-a) lengths. It's O(N).... ouch. The only problem is that then it's not at all obvious that length is better calculated via: reduce!"a + (b[1] - b[0])"(set.byInterval, 0); vs walkLength(set.byChar); where the latter may take enormous time to finish on common sets. (damn it should by DChar at least) > And should it be characters or codepoints? Codepoints, everything works on codepoints, that is dchars. A character is a Grapheme. So all mentions of character that slipped through got to be fixed. -- Dmitry Olshansky

On Sunday, January 13, 2013 11:00:50 Dmitry Olshansky wrote: > 13-Jan-2013 03:39, Walter Bright пишет: > > On 1/11/2013 11:31 AM, Dmitry Olshansky wrote: > >> And documentation: http://blackwhale.github.com/phobos/uni.html > > > > struct InversionList: > > > > > > length() returns "number of characters", while "empty()" returns true if > > there are no codepoints. > > > > Aren't the two redundant? > > Containers commonly have both. But maybe I should kill length on > different grounds because it's computed by walking the array of > intervals [a, b) and summing up (b-a) lengths. It's O(N).... ouch. > > The only problem is that then it's not at all obvious that length is > better calculated via: > reduce!"a + (b[1] - b[0])"(set.byInterval, 0); > vs > walkLength(set.byChar); > > where the latter may take enormous time to finish on common sets. > > (damn it should by DChar at least) According to std.container, length should never be defined unless it's at worst O(log n) (though I would have expected it to require O(1); I don't know of any container that can calculate its lengeth in O(log n) but not O(1) - it's always O(1) or O(n) in my experience). So, don't define length if it's not that efficient, even if it's not in std.container. It _may_ be valuable to add a function which does the same (e.g. calculateLength or calcLength or calcLen or something liket that), especially if walkLength isn't going to be the most efficient. And yes, providing both empty and length is common and expected. And since empty is _required_ to be O(1), it's still guaranteed to be efficient, whereas length isn't. I really should ask Andrei why he made length require O(log n) instead O(1)... - Jonathan M Davis

On 01/13/2013 02:28 AM, Jonathan M Davis wrote: > > I really should ask Andrei why he made length require O(log n) instead O(1)... > > - Jonathan M Davis Are there cases where it can't be O(1)? My current intuition is that length can always be cached. If a container has a long length calculation, then it can be shortened by wrapping it in a proxy container that does nothing but expose the same operations as jackets around the original container. These jackets forward all data in and out with no changes, except for one: they count the number of elements entering and leaving the underlying container and store this count in a length variable.

On Sunday, January 13, 2013 04:58:16 Chad J wrote: > On 01/13/2013 02:28 AM, Jonathan M Davis wrote: > > I really should ask Andrei why he made length require O(log n) instead > > O(1)... > > > > - Jonathan M Davis > > Are there cases where it can't be O(1)? Most definitely. Take a doubly linked list for example. Either length can be O(1) and splicing is then O(n), or length is O(n) and splicing is then O(1). That's because if you want to keep track of the length, you have to count the number of elements being spliced in. For instance, it's a relatively common mistake in C++ to use std::List's size function and compare it with 0 to see whether the list is empty, because size is O(n). The correct thing to do is to call empty and check its result. You _could_ make std::list's size function be O(1), but that would mean that splicing becomes O(n), which C++98 definitely did not do (though I hear that C++11 made the interesting choice of changing how std::list works so that size is O(1) and splicing is O(n); I don't know if that's good or not). std.container.slist and std.container.dlist don't define length precisely because it can't be O(1) given their design. - Jonathan M Davis

January 13, 2013

Re: Ready for review: new std.uni

Posted by Dmitry Olshansky
in reply to H. S. Teoh

Permalink

Dmitry Olshansky

Posted in reply to H. S. Teoh

Permalink

12-Jan-2013 01:33, H. S. Teoh пишет:
> On Sat, Jan 12, 2013 at 01:06:30AM +0400, Dmitry Olshansky wrote:
>> 12-Jan-2013 00:35, H. S. Teoh пишет:
>>> On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
>>> [...]
>>>> Anyway it's polished and ready for the good old collective
>>>> destruction called peer review. I'm looking for a review manager.
>>>>
>>>> The code, including extra tests and a benchmark is here:
>>>> https://github.com/blackwhale/gsoc-bench-2012
>>>>
>>>> And documentation:
>>>> http://blackwhale.github.com/phobos/uni.html
> [...]
>>>> 2) The commonly expected stuff in any modern Unicode-aware language:
>>>> normalization, grapheme decoding, composition/decomposition and
>>>> case-insensitive comparison*.
>>>>
>>>> 3) I've taken it as a crucial point to provide all of the tools used
>>>> to build Unicode algorithms from ground up to the end user.
>>>
>>> Are there enough tools to implement line-breaking algorithms (TR14)?
>>>
>>
>> Should be. AFAIK _word_ breaking seemed trivial to do, got to check
>> the line breaking.
>
> Line-breaking is in TR14. I looked over it. It's definitely more
> complicated than I expected it to be. Some alphabets may even require
> stylistic rules for line-breaking! (Not to mention hyphenation.) So it's
> probably too much to expect std.uni to do it. But at least the tools
> necessary to implement it should be there.
>

I had given it a practiced look of tired Unicode implementor. It has a lot of subtle moments to it but should be doable.

I'm not about to implement it any time soon, but notes on how to do it are the following:

1. Avoid reading all of the reasons behind the algorithm listed there (there are good ones, a ton of them). So at first SKIP chapter 5 except maybe 5.3-5.4. And the fonts problems listed are not important to get it working.

2. Re-read the important chapters that is 2 and 6, 7. Others pretty much can be simply ignored until you start testing your implementation. 7 is actually pretty nice.

3. The first step to implement is to seek out the character classes (=sets) involved in algorithm. These are listed in Table.1 and the accompanying data file.

4. Process the file (see e.g. gen_uni.d) and create all of sets.

5. The core part of the algorithm itself is by the end of day not bad - section 6.1 lists all of the mandatory line breaking rules. Then there is a bunch of tweakable ones that are a bit harder. Plus see chapter 7 for a (logically) pair-wise table driven solution (though the row/cols in the table are "belongs to the set X").

-- 
Dmitry Olshansky

13-Jan-2013 13:58, Chad J пишет: > On 01/13/2013 02:28 AM, Jonathan M Davis wrote: >> >> I really should ask Andrei why he made length require O(log n) instead >> O(1)... >> >> - Jonathan M Davis > > Are there cases where it can't be O(1)? > > My current intuition is that length can always be cached. Caching of length is suitable for adapters on top of base/core containers. > If a > container has a long length calculation, then it can be shortened by > wrapping it in a proxy container that does nothing but expose the same > operations as jackets around the original container. These jackets > forward all data in and out with no changes, except for one: they count > the number of elements entering and leaving the underlying container and > store this count in a length variable. > Yeah, exactly. -- Dmitry Olshansky

Forums