January 16, 2013
On 01/11/2013 08:31 PM, Dmitry Olshansky wrote:
> There is an extra candy for meta-programming text processing libraries: a set type can generate source code of unrolled hard-coded binary search for a given set of codepoints.

Great, I already had to do that myself for a DFA Lexer.
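
For reference, for a small set the generated code is essentially a nest of compares. A hypothetical sketch of the output for the set {0x30..0x39, 0x41..0x5A} (all names made up):

bool isInMySet(dchar ch) @safe pure nothrow
{
    // Unrolled binary search: one compare picks the half of the interval
    // list, then the remaining interval is tested directly.
    if (ch < 0x41)
        return ch >= 0x30 && ch <= 0x39; // '0'..'9'
    return ch <= 0x5A;                   // here ch >= 0x41, so only the upper bound is left
}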
January 16, 2013
11-Jan-2013 23:31, Dmitry Olshansky wrote:
>
> The code, including extra tests and a benchmark is here:
> https://github.com/blackwhale/gsoc-bench-2012
>
> And documentation:
> http://blackwhale.github.com/phobos/uni.html
>

First of all, @safe pure and nothrow is back. Let me know if something is still not.

OK, I've made an extra pass through docs with these things in mind:
- getting the introduction & terminology part right
- more explanations and details where applicable
 (let me know if that's too much / too little / wrong)
- hiding away the truly generic (and not easy to use) Trie from documentation
- old deprecated stuff is hidden from docs to discourage its use

The whole Trie template was hidden because it needs to be in std.container anyway, and I want to extend & tweak it some more before committing it to a public interface. It may very well be worth a short review of its own.

I've spent some hours to get an easy, useful and correct (as far as it gets) terminology throughout the module.

In that vein, the 2 primary definitions are:
A character means an encoded character, that is, a code point that has an *assigned* mapping to an abstract character (i.e. a symbolically representable meaning).
A code point is a value in the range [0, 10FFFF] that does not necessarily have a symbolic meaning.
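
A quick illustration (U+0378 is, as of current Unicode, a valid code point with no assigned character):

void main()
{
    dchar letter = '\u0041'; // U+0041 LATIN CAPITAL LETTER A - an encoded character
    dchar noChar = '\u0378'; // a perfectly valid code point, but not a character
}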

Things I'm aware of that are not there yet:

1. "Moar" examples everywhere!

2. Expand on the description of certain primitives.

3. Table that lists all of the properties, scripts and other sets available through the 'unicode' helper.


-- 
Dmitry Olshansky
January 16, 2013
16-Jan-2013 06:37, Martin Nowak wrote:
> On 01/11/2013 08:31 PM, Dmitry Olshansky wrote:
>> There is an extra candy for meta-programming text processing libraries:
>> a set type can generate source code of unrolled hard-coded binary search
>> for a given set of codepoints.
>
> Great, I already had to do that myself for a DFA Lexer.
>

The only thing left that I'd love to improve is to integrate UTF awareness (for no-decoding processing) into it so that it becomes:
auto set = ...
...
set.toSourceCode("func", Encoding.UTF8);

and the prototype of the generated function is something along the lines of:
size_t func(char[] src);

or:
size_t func(char[] src, size_t offset);

that returns the number of matched code units, or 0 on failure.
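
None of this exists yet, but a generated matcher for, say, the set {0x30..0x39, 0xC0..0xFF} could hypothetically look like:

size_t func(const(char)[] src) @safe pure nothrow
{
    // Matches one code point of the set directly on UTF-8 code units,
    // without decoding; returns the count of matched code units, 0 on failure.
    if (src.length == 0)
        return 0;
    immutable c0 = src[0];
    if (c0 >= 0x30 && c0 <= 0x39)       // '0'..'9' are single code units
        return 1;
    if (c0 == 0xC3 && src.length >= 2)  // U+00C0..U+00FF encode as 0xC3 0x80..0xBF
    {
        immutable c1 = src[1];
        if (c1 >= 0x80 && c1 <= 0xBF)
            return 2;
    }
    return 0;
}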

-- 
Dmitry Olshansky
January 16, 2013
On 1/16/2013 2:48 AM, Dmitry Olshansky wrote:
> I've spent some hours to get an easy, useful and correct (as far as it gets)
> terminology throughout the module.

Thank you. Looking at the Terminology section (the reference to it at the beginning should be a hyperlink):

"Not all code points are assigned to encoded characters.":

	?? I thought that was the whole point?

"Note that UTF-32 code unit (dchar) holds the actual code point value." => "Note that in UTF-32, a code unit is a code point and is represented by the D dchar type."

What happened to "octet", which I thought was the official term?

"Also known as simply character."

	No, please no, at least not in this document. I suspect you need to ban the word "character" from this page. It is so overloaded in meaning that it is useless.

"An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a Grapheme."

	This just makes me cry. Who knows what a user thinks of as a character? "not necessarily" means what? Is "Grapheme" a Unicode term?

Why can't there be precise definitions of these terms? I wonder if even the Unicode standard people have no idea exactly what they are.

Sorry for the rant, but unicode terms always make me mad.
January 16, 2013
16-Jan-2013 23:35, Walter Bright wrote:
> On 1/16/2013 2:48 AM, Dmitry Olshansky wrote:
>> I've spent some hours to get an easy, useful and correct (as far as it
>> gets)
>> terminology throughout the module.
>
> Thank you. Looking at the Terminology section (the reference to it at
> the beginning should be a hyperlink):
>
> "Not all code points are assigned to encoded characters.":
>
>      ?? I thought that was the whole point?
>

Obviously it's not. The simple truth is that only 110K+ code points are currently assigned; everything else is either reserved for specific use (internal use, surrogates, etc.) or for future additions.

Think of it this way - previously they thought a unit of symbolic info was 16 bits. Then a lot of problems cropped up (including adapting 8-bit-only/Latin-1/ASCII systems). Now they treat encoding as a form of data storage, and everything in the Unicode standard is defined as operating on _values_ of the codespace (that is, code points).

This doesn't mean that all these values are real characters.
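
In code terms it's something like this (a sketch; the names are made up):

// The codespace is [0, 0x10FFFF], but only ~110K of these values are
// currently assigned to characters.
bool inCodespace(uint v) @safe pure nothrow { return v <= 0x10FFFF; }

// e.g. the surrogate range is reserved for UTF-16 and never holds characters
bool isSurrogateValue(uint v) @safe pure nothrow { return v >= 0xD800 && v <= 0xDFFF; }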

> "Note that UTF-32 code unit (dchar) holds the actual code point value."
> => "Note that in UTF-32, a code unit is a code point and is represented
> by the D dchar type."

Yeah, that's simpler.

>
> What happened to "octet", which I thought was the official term?

Octet means 8 bits as a unit of data in just about any Internet protocol description. What do you want it here for?

>
> "Also known as simply character."
>
>      No, please no, at least not in this document. I suspect you need to
> ban the word "character" from this page. It is so overloaded in meaning
> that it is useless.
>

It's not me. It's just the way things are, and people have to be aware of this particular meaning of 'character'.

Trust me, I first converted the whole thing to code point(s) (note the space). Then I read it and it was like "meh" even to me, who wrote the stuff in the first place. Then I looked at the documents by the Unicode Consortium: they use 'character' throughout to mean either encoded character or abstract character depending on the time of day.

Utility beats pedantry, and thus 'character' is 'encoded character' everywhere it's used. Probably I'll remove this statement so that the uninitiated don't fixate on it too much.

> "An abstract character does not necessarily correspond to what a user
> thinks of as a “character” and should not be confused with a Grapheme."
>
>      This just makes me cry. Who knows what a user thinks of as a
> character? "not necessarily" means what? Is "Grapheme" a Unicode term?
>

Yes and yes. The user thinks a character is what he usually writes on a piece of paper, obviously. It's a warning that what a typical user thinks of as a character doesn't always match a character in Unicode.

Grapheme is a Unicode term that has a direct counterpart in this library, hence the hyperlink. A grapheme is _visually_ a single entity. In fact it could be a sequence consisting of a base character (the Unicode standard says 'character') plus a sequence of combining marks. This is called a combining character sequence, but there are other kinds of graphemes; see e.g. Hangul syllables and more weird stuff.
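
For example, with a grapheme-aware walk (byGrapheme here stands for the module's grapheme iteration primitive):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' + U+0301 COMBINING ACUTE ACCENT: two code points forming one
    // combining character sequence, displayed as a single "é".
    string s = "e\u0301";
    assert(s.walkLength == 2);            // counts code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // counts graphemes
}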

BTW the text is taken as is from the Unicode standard definitions.
I can litter a couple more pages with this crap, but instead I've taken
the most important ones and merged a couple of definitions to make it simpler (e.g. see glyph).

> Why can't there be precise definitions of these terms?

'cause they are muddy ;)
'Abstract character' is one of the ugliest actually.

>I wonder if even
> the Unicode standard people have no idea exactly what they are.
>
> Sorry for the rant, but unicode terms always make me mad.

Nothing I can cure here. And they know what they're doing, but have a boatload of legacy crap that has to crawl on. Just recall the time when everybody thought that 16-bit codes should be enough for everybody. When they gradually took the whole world of writing systems into account, things got dirty, awfully so.

After learning a lot about Unicode I'd say they managed to do the job just fine given the constraints. I, for one, wouldn't even dare to try to reinvent *these* wheels. A lot of stuff could have been smoothed and simplified but overall - it's complex because the writing systems are utterly irregular plus the backwards compatibility has taken its toll.

-- 
Dmitry Olshansky
January 16, 2013
16-Jan-2013 23:35, Walter Bright wrote:
> On 1/16/2013 2:48 AM, Dmitry Olshansky wrote:
>> I've spent some hours to get an easy, useful and correct (as far as it
>> gets)
>> terminology throughout the module.
>
> Thank you. Looking at the Terminology section (the reference to it at
> the beginning should be a hyperlink):
>
> "Not all code points are assigned to encoded characters.":
>
>      ?? I thought that was the whole point?
>

And a lot of people really want to go this way; see e.g. this:
http://www.parrot.org/content/what-nfg-why-you-want-parrot-have-it

There is value in that, but:
a) it's not portable anywhere except for in-memory usage;
b) everything input/output has to be converted, possibly creating new entries in a global dictionary of graphemes.

-- 
Dmitry Olshansky
January 16, 2013
On 1/16/2013 12:16 PM, Dmitry Olshansky wrote:
> Nothing I can cure here. And they know what they're doing, but have a boatload
> of legacy crap that has to crawl on. Just recall the time when everybody
> thought that 16-bit codes should be enough for everybody. When they gradually
> took the whole world of writing systems into account, things got dirty,
> awfully so.
>
> After learning a lot about Unicode I'd say they managed to do the job just fine
> given the constraints. I, for one, wouldn't even dare to try to reinvent *these*
> wheels. A lot of stuff could have been smoothed and simplified but overall -
> it's complex because the writing systems are utterly irregular plus the
> backwards compatibility has taken its toll.

I guess I have to accept it :-)

And thanks for taking on this pretty much impossible task.

January 17, 2013
On Monday, 14 January 2013 at 07:21:57 UTC, Walter Bright wrote:
> I've done some research on auto-vectorization, i.e. "The Software Vectorization Handbook" by Bik.
>
> My conclusion (Manu independently came to the same one, he's our resident SIMD expert) is that auto-vectorization is a disaster.
>
> What it is is:
>
> 1. Reverse engineer a loop into a higher level construct
> 2. Recompile that construct using vector instructions
>
> It's a disaster because (2) often fails in ways that are utterly mysterious to 99% of programmers. The failure mode is to not auto-vectorize the loop. Hence, the failure is silent and the user just sees poor performance, if he notices it at all.



Have you seen the Visual C++ 2012 compiler? It fixes that problem.

Its auto-vectorizer has two switches:

1. /Qvec-report:1, which reports the auto-vectorized loops
2. /Qvec-report:2, which reports all potential candidates and whether or not they were vectorized (and if not, a reason code)
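
For example (hypothetical file name; only the two /Qvec-report levels above are involved):

cl /O2 /Qvec-report:2 main.cpp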


January 17, 2013
On Wed, Jan 16, 2013 at 02:48:30PM +0400, Dmitry Olshansky wrote:
> 11-Jan-2013 23:31, Dmitry Olshansky wrote:
> >
> >The code, including extra tests and a benchmark is here: https://github.com/blackwhale/gsoc-bench-2012
> >
> >And documentation: http://blackwhale.github.com/phobos/uni.html
> >
> 
> First of all, @safe pure and nothrow is back. Let me know if something is still not.
> 
> OK, I've made an extra pass through docs with these things in mind:
> - getting the introduction & terminology part right
> - more explanations and details where applicable
>  (let me know if that's too much / too little / wrong)
> - hiding away the truly generic (and not easy to use) Trie from
> documentation
> - old deprecated stuff is hidden from docs to discourage its use
[...]

Looks much better now!

Some nitpicks:

- Under Overview:

	[4th paragraph] "It's recognized that an application may need
	further enhancements and extensions. It could be the need for
	less commonly known algorithms or tailoring existing ones for
	regional-specific needs. To help users with building any extra
	functionality beyond the core primitives the module provides:"

  The grammar nazi in me thinks a better wording might be (changes
  delimited by {{}}):

	"It's recognized that an application may need further
	enhancements and extensions{{, such as}} less{{-}}commonly known
	algorithms{{,}} or tailoring existing ones for
	{{region}}-specific needs. To help users with building any extra
	functionality beyond the core primitives{{,}} the module
	provides:"

  The second item in the subsequent list:

	A way to construct optimal packed multi-stage tables also known
	as a special case of Trie. {{The functions}} codepointTrie,
	codepointSetTrie construct custom tries that map dchar to value.
	The end result is {{a}} fast and predictable Ο(1) lookup that
	powers functions like isAlpha {{and}} combiningClass{{,}} but
	for user-defined data sets.

  The last item in the list:

	Access to the commonly{{-}}used predefined sets of code points.
	The commonly{{-}}defined one{{s}} can be observed in the CLDR
	utility, on {{the}} page property index. {{S}}upported ones
	include Script, Block and General Category. See unicode for easy
	{{}} compile-time checked queries.

- Under Terminology:

	[[3rd paragraph]] "The minimal bit combination that can represent
	a unit of encoded text for processing or interchange. Depending
	on the encoding this could be: 8-bit code units in the UTF-8
	(($D char)), [...]"

  I think you transposed the "$(" here. :)

  The last sentence in this section appears to be truncated. Maybe a
  runaway DDoc macro somewhere earlier?

- Under Construction of lookup tables, the grammar nazi says:

	[[1st sentence]] "{{The}} Unicode standard describes a set of
	algorithms that {{}} depend on having {{the}} ability to quickly
	look{{ }}up various properties of a code point. Given the the
	codespace of about 1 million code points, it is not a trivial
	task to providing a space{{-}}efficient solution for the {{}}
	multitude of properties."

	[[2nd paragraph]] "[...] Hash-tables {{have}} enormous memory
	footprint and binary search over intervals is not fast enough
	for some heavy-duty algorithms."

	[[3rd paragraph]] "{{(P }}The recommended solution (see Unicode
	Implementation Guidelines) {{}} is using multi-stage tables{{,}}
	that is{{,}} {{an instance}} of Trie with integer keys and {{a}}
	fixed number of stages. For the {{remainder}} of {{this}}
	section {{it will be}} called {{a}} fixed trie. The following
	describes a particular implementation that is aimed for the
	speed of access at the expense of ideal size savings."

	[[4th paragraph]] "[...] Split {{the}} number of bits in a key
	(code point, 21 bits) {{into}} 2 components (e.g. 15 and 8). The
	first is the number of bits in the index of {{the}} trie and the
	other is {{the}} number of bits {{in each}} page of {{the}}
	trie. The layout of trie is then an array of size
	2^^bits-of-index followed an array of memory chunks of size
	2^^bits-of-page/size-of-element."

	[[5th paragraph]] "[...] {{The}} slots of {{the}} index all have
	to contain {{the same[?] number of pages}}. The lookup is then
	just a couple of operations - slice {{the}} upper bits, {{then}}
	look{{ }}up {{the}} index for these{{.}} The pseudo-code is:"

	[[Following the code example]] "[...] Where if the elemsPerPage
	is a power of 2 the whole process is a handful of simple
	instructions and 2 array reads.  {{Subsequent}} levels of
	{{the}} trie are introduced by recursing {{}} this notion - the
	index array is treated as values. The number of bits in {{the}}
	index is then again split into 2 parts, with pages over
	'current-index' and {{the}} new 'upper-index'."

	[[Next paragraph]] "For completeness the level 1 trie is simply
	an array. {{The}} current implementation takes advantage of
	bit-packing values when the range is known to be limited in {{}}
	advance (such as bool){{.}} {{S}}ee also BitPacked for enforcing
	it manually.  [...]"

	[[Last paragraph]] "The process of construction of a trie is
	more involved and is hidden from the user in a form of
	{{convenience}} functions: codepointTrie, codepointSetTrie and
	even more convenient toTrie. In general a set or built-in AA
	with dchar type can be turned into a trie. The trie object in
	this module is {{}} read-only (immutable){{;}} it's effectively
	frozen after construction."
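
(For my own understanding, the two-level lookup described above seems to boil down to something like this sketch - all names made up:)

// Fixed two-level trie: the key's upper bits select a page via the index,
// the lower bits select an element inside that page.
struct FixedTrie(Value, size_t pageBits)
{
    size_t[] index; // upper bits of key -> page number
    Value[]  pages; // all pages laid out back to back

    Value opIndex(dchar ch) @safe pure nothrow
    {
        enum elemsPerPage = size_t(1) << pageBits;
        immutable key  = cast(size_t) ch;
        immutable page = index[key >> pageBits];                        // 1st array read
        return pages[page * elemsPerPage + (key & (elemsPerPage - 1))]; // 2nd array read
    }
}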


The grammar nazi has run out of steam, so no more grammar nitpicks for now. ;-) But there are still the following questions:

- Why is isControl() not pure nothrow?

- Why are the isX() functions @system? I would have expected they should
  be at least @trusted? (Or are there technical problems / compiler bugs
  preventing this?)

That's all for now. I hope you don't mind me allowing the grammar nazi to take over for a bit. I want Phobos documentation to be professional quality. :)


T

-- 
The trouble with TCP jokes is that it's like hearing the same joke over and over.
January 17, 2013
17-Jan-2013 22:48, H. S. Teoh wrote:
> On Wed, Jan 16, 2013 at 02:48:30PM +0400, Dmitry Olshansky wrote:
>> 11-Jan-2013 23:31, Dmitry Olshansky wrote:
>>>
>>> The code, including extra tests and a benchmark is here:
>>> https://github.com/blackwhale/gsoc-bench-2012
>>>
>>> And documentation:
>>> http://blackwhale.github.com/phobos/uni.html
>>>
>>
>> First of all, @safe pure and nothrow is back. Let me know if
>> something is still not.
>>
>> OK, I've made an extra pass through docs with these things in mind:
>> - getting the introduction & terminology part right
>> - more explanations and details where applicable
>>   (let me know if that's too much / too little / wrong)
>> - hiding away the truly generic (and not easy to use) Trie from
>> documentation
>> - old deprecated stuff is hidden from docs to discourage its use
> [...]
>
> Looks much better now!
>
> Some nitpicks:

Great ;)

> - Under Overview:
>
> 	[4th paragraph] "It's recognized that an application may need
> 	further enhancements and extensions. It could be the need for
> 	less commonly known algorithms or tailoring existing ones for
> 	regional-specific needs. To help users with building any extra
> 	functionality beyond the core primitives the module provides:"
>
>    The grammar nazi in me thinks a better wording might be (changes
>    delimited by {{}}):
>
> 	"It's recognized that an application may need further
> 	enhancements and extensions{{, such as}} less{{-}}commonly known
> 	algorithms{{,}} or tailoring existing ones for
> 	{{region}}-specific needs. To help users with building any extra
> 	functionality beyond the core primitives{{,}} the module
> 	provides:"
>
>    The second item in the subsequent list:
>
> 	A way to construct optimal packed multi-stage tables also known
> 	as a special case of Trie. {{The functions}} codepointTrie,
> 	codepointSetTrie construct custom tries that map dchar to value.
> 	The end result is {{a}} fast and predictable Ο(1) lookup that
> 	powers functions like isAlpha {{and}} combiningClass{{,}} but
> 	for user-defined data sets.
>
>    The last item in the list:
>
> 	Access to the commonly{{-}}used predefined sets of code points.
> 	The commonly{{-}}defined one{{s}} can be observed in the CLDR
> 	utility, on {{the}} page property index. {{S}}upported ones
> 	include Script, Block and General Category. See unicode for easy
> 	{{}} compile-time checked queries.
>
> - Under Terminology:
>
> 	[[3rd paragraph]] "The minimal bit combination that can represent
> 	a unit of encoded text for processing or interchange. Depending
> 	on the encoding this could be: 8-bit code units in the UTF-8
> 	(($D char)), [...]"
>
>    I think you transposed the "$(" here. :)
>

Looks like one commit wasn't pushed :(
I'll peruse your wording though.

>    The last sentence in this section appears to be truncated. Maybe a
>    runaway DDoc macro somewhere earlier?
>
> - Under Construction of lookup tables, the grammar nazi says:
>
> 	[[1st sentence]] "{{The}} Unicode standard describes a set of
> 	algorithms that {{}} depend on having {{the}} ability to quickly
> 	look{{ }}up various properties of a code point. Given the the
> 	codespace of about 1 million code points, it is not a trivial
> 	task to providing a space{{-}}efficient solution for the {{}}
> 	multitude of properties."
>
> 	[[2nd paragraph]] "[...] Hash-tables {{have}} enormous memory
> 	footprint and binary search over intervals is not fast enough
> 	for some heavy-duty algorithms."
>
> 	[[3rd paragraph]] "{{(P }}The recommended solution (see Unicode
> 	Implementation Guidelines) {{}} is using multi-stage tables{{,}}
> 	that is{{,}} {{an instance}} of Trie with integer keys and {{a}}
> 	fixed number of stages. For the {{remainder}} of {{this}}
> 	section {{it will be}} called {{a}} fixed trie. The following
> 	describes a particular implementation that is aimed for the
> 	speed of access at the expense of ideal size savings."
>
> 	[[4th paragraph]] "[...] Split {{the}} number of bits in a key
> 	(code point, 21 bits) {{into}} 2 components (e.g. 15 and 8). The
> 	first is the number of bits in the index of {{the}} trie and the
> 	other is {{the}} number of bits {{in each}} page of {{the}}
> 	trie. The layout of trie is then an array of size
> 	2^^bits-of-index followed an array of memory chunks of size
> 	2^^bits-of-page/size-of-element."
>
> 	[[5th paragraph]] "[...] {{The}} slots of {{the}} index all have
> 	to contain {{the same[?] number of pages}}. The lookup is then
> 	just a couple of operations - slice {{the}} upper bits, {{then}}
> 	look{{ }}up {{the}} index for these{{.}} The pseudo-code is:"
>
> 	[[Following the code example]] "[...] Where if the elemsPerPage
> 	is a power of 2 the whole process is a handful of simple
> 	instructions and 2 array reads.  {{Subsequent}} levels of
> 	{{the}} trie are introduced by recursing {{}} this notion - the
> 	index array is treated as values. The number of bits in {{the}}
> 	index is then again split into 2 parts, with pages over
> 	'current-index' and {{the}} new 'upper-index'."
>
> 	[[Next paragraph]] "For completeness the level 1 trie is simply
> 	an array. {{The}} current implementation takes advantage of
> 	bit-packing values when the range is known to be limited in {{}}
> 	advance (such as bool){{.}} {{S}}ee also BitPacked for enforcing
> 	it manually.  [...]"
>
> 	[[Last paragraph]] "The process of construction of a trie is
> 	more involved and is hidden from the user in a form of
> 	{{convenience}} functions: codepointTrie, codepointSetTrie and
> 	even more convenient toTrie. In general a set or built-in AA
> 	with dchar type can be turned into a trie. The trie object in
> 	this module is {{}} read-only (immutable){{;}} it's effectively
> 	frozen after construction."
>
>
> The grammar nazi has run out of steam, so no more grammar nitpicks for
> now. ;-) But there are still the following questions:
>
> - Why is isControl() not pure nothrow?
>

Missed this one.

> - Why are the isX() functions @system? I would have expected they should
>    be at least @trusted? (Or are there technical problems / compiler bugs
>    preventing this?)
>

M-hm, I'm seeing this in my sources:
bool isAlpha(dchar c) @safe pure nothrow
{...}

The DDoc however shows @system.

A compiler bug?

> That's all for now. I hope you don't mind me allowing the grammar nazi
> to take over for a bit. I want Phobos documentation to be professional
> quality. :)
>

Sure, thanks.

>
> T
>


-- 
Dmitry Olshansky