Thread overview
Re: Proposed Phobos equivalent of wcswidth()
Jan 15, 2018
H. S. Teoh
Jan 16, 2018
Dmitry Olshansky
Jan 16, 2018
H. S. Teoh
Jan 17, 2018
Dmitry Olshansky
Jan 17, 2018
H. S. Teoh
Jan 18, 2018
Dmitry Olshansky
Jan 19, 2018
H. S. Teoh
Jan 20, 2018
Dmitry Olshansky
January 15, 2018
On Sat, Jan 13, 2018 at 09:26:52AM -0800, H. S. Teoh via Digitalmars-d wrote: [...]
> 	https://github.com/quickfur/strwidth
[...]

One thing I'm seeking help with, and this is mainly directed at Dmitry Olshansky but can be anyone here who knows the internal workings of std.uni well enough, is how to transform the Trie generated by the static ctor into compile-time TrieNode declarations.  This is one blocker for my turning this code into a Phobos PR, because I don't want to incur the cost of initializing this trie at runtime.

Also, on a related note, there exist nicer interfaces in std.uni for constructing Tries that map ranges of codepoints to non-boolean values, but none of these are available publicly.  The current implementation in strwidth only uses the public API of std.uni, so the construction of the trie is pretty horrendous (looping over individual codepoints and creating an AA of individual codepoints -- including very large ranges like the entire Unicode plane 2).  I wonder if some of these facilities should be made public so that user code that needs to construct codepoint tries that include large ranges of codepoints can do so more efficiently.
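For readers following along, here is a minimal sketch of the kind of construction described above, using only the public `codepointTrie` overload that takes an associative array plus a default value. The function name `buildWidthTrie` and the single CJK range are illustrative assumptions, not code from strwidth.

```d
import std.uni : codepointTrie;

/// Builds a toy width trie; the single wide range below is illustrative only.
auto buildWidthTrie()
{
    ubyte[dchar] widths;
    // One AA entry per codepoint: this is the costly part for large ranges.
    foreach (dchar ch; '\u4E00' .. '\uA000')
        widths[ch] = 2;

    // Codepoints absent from the AA get the default value 1.
    return codepointTrie!(ubyte, 8, 5, 8)(widths, 1);
}

void main()
{
    auto trie = buildWidthTrie();
    assert(trie['\u4E2D'] == 2);   // inside the mapped range
    assert(trie['a'] == 1);        // falls back to the default
}
```

Even this small range costs about 21,000 AA insertions, which hints at why a range-based builder would be preferable for something the size of plane 2.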


T

-- 
This sentence is false.
January 16, 2018
On Monday, 15 January 2018 at 19:52:07 UTC, H. S. Teoh wrote:
> On Sat, Jan 13, 2018 at 09:26:52AM -0800, H. S. Teoh via Digitalmars-d wrote: [...]
>> 	https://github.com/quickfur/strwidth
> [...]
>
> One thing I'm seeking help with, and this is mainly directed at Dmitry Olshansky but can be anyone here who knows the internal workings of std.uni well enough, is how to transform the Trie generated by the static ctor into compile-time TrieNode declarations.  This is one blocker for my turning this code into a Phobos PR, because I don't want to incur the cost of initializing this trie at runtime.


Check out my horribly named repo gsoc-uni-benchmark:

https://github.com/DmitryOlshansky/gsoc-bench-2012/blob/master/gen_uni.d

This is what generates unicode tables.
Need to revise it, as folks were delicate enough to hand-patch auto-generated code in Phobos.

Maybe make some of that user-accessible.

>
> T

January 16, 2018
On Tue, Jan 16, 2018 at 05:49:11PM +0000, Dmitry Olshansky via Digitalmars-d wrote:
> On Monday, 15 January 2018 at 19:52:07 UTC, H. S. Teoh wrote:
[...]
> > One thing I'm seeking help with, and this is mainly directed at Dmitry Olshansky but can be anyone here who knows the internal workings of std.uni well enough, is how to transform the Trie generated by the static ctor into compile-time TrieNode declarations.  This is one blocker for my turning this code into a Phobos PR, because I don't want to incur the cost of initializing this trie at runtime.
> 
> 
> Check out my horribly named repo gsoc-uni-benchmark:
> 
> https://github.com/DmitryOlshansky/gsoc-bench-2012/blob/master/gen_uni.d
> 
> This is what generates unicode tables.
> Need to revise it, as folks were delicate enough to hand-patch
> auto-generated code in Phobos.
> 
> Maybe make some of that user-accessible.
[...]

Whoa. There's some pretty cool stuff in there!  Thanks, I've started experimenting with pre-generating the width table.  Pretty neat. There's a lot of hidden gems in std.uni that I never knew existed, hidden away under `private`. :-D

One thing, though: I think it would benefit us all if we could import at least gen_uni into Phobos, so that in the future when we need to update std.uni to a new version of Unicode, it can be (mostly) automated.  It's better to have the tools to generate the tables in Phobos itself, than to be dependent on an external repo that may go out-of-sync eventually.

When I get around to making a PR for strwidth AKA displayWidth, the plan is to check-in compileWidth.d in some form into Phobos somewhere, so that somebody else can pick it up and improve the implementation in the future if I'm not around / unavailable.

If we can get gen_uni into Phobos, perhaps we can even include the displayWidth table generation in gen_uni too, so that all the table generation code is in one place.


T

-- 
Having a smoking section in a restaurant is like having a peeing section in a swimming pool. -- Edward Burr
January 17, 2018
On Tuesday, 16 January 2018 at 23:01:19 UTC, H. S. Teoh wrote:
> On Tue, Jan 16, 2018 at 05:49:11PM +0000, Dmitry Olshansky via Digitalmars-d wrote:
>> On Monday, 15 January 2018 at 19:52:07 UTC, H. S. Teoh wrote:
> [...]
>> > One thing I'm seeking help with, and this is mainly directed at Dmitry Olshansky but can be anyone here who knows the internal workings of std.uni well enough, is how to transform the Trie generated by the static ctor into compile-time TrieNode declarations.  This is one blocker for my turning this code into a Phobos PR, because I don't want to incur the cost of initializing this trie at runtime.
>> 
>> 
>> Check out my horribly named repo gsoc-uni-benchmark:
>> 
>> https://github.com/DmitryOlshansky/gsoc-bench-2012/blob/master/gen_uni.d
>> 
>> This is what generates unicode tables.
>> Need to revise it, as folks were delicate enough to hand-patch
>> auto-generated code in Phobos.
>> 
>> Maybe make some of that user-accessible.
> [...]
>
> Whoa. There's some pretty cool stuff in there!  Thanks, I've started experimenting with pre-generating the width table.  Pretty neat. There's a lot of hidden gems in std.uni that I never knew existed, hidden away under `private`. :-D

The intent is to open that up somehow, to allow folks to make their own extended versions of std.uni. Unicode is all about “tailoring”: adjusting an algorithm to your specific regional preferences by messing with the tables.

I think there is at least 1 bug in Bugzilla on this.
>
> One thing, though: I think it would benefit us all if we could import at least gen_uni into Phobos, so that in the future when we need to update std.uni to a new version of Unicode, it can be (mostly) automated.  It's better to have the tools to generate the tables in Phobos itself, than to be dependent on an external repo that may go out-of-sync eventually.

Yes, but it’s non-trivial at the moment: if you take a look at the script that generates the tables, it takes both 32-bit and 64-bit executables to populate them.

I think having it in the tools repo should be fine though. Last time I tried to update to Unicode 10, I found one table in Phobos that is missing from the generator (ooops!).

>
> When I get around to making a PR for strwidth AKA displayWidth, the plan is to check-in compileWidth.d in some form into Phobos somewhere, so that somebody else can pick it up and improve the implementation in the future if I'm not around / unavailable.
>
> If we can get gen_uni into Phobos, perhaps we can even include the displayWidth table generation in gen_uni too, so that all the table generation code is in one place.

Right. A good step would be to move it to tools, then add your code.


>
>
> T


January 17, 2018
On Wed, Jan 17, 2018 at 05:06:05AM +0000, Dmitry Olshansky via Digitalmars-d wrote:
> On Tuesday, 16 January 2018 at 23:01:19 UTC, H. S. Teoh wrote:
[...]
> > One thing, though: I think it would benefit us all if we could import at least gen_uni into Phobos, so that in the future when we need to update std.uni to a new version of Unicode, it can be (mostly) automated.  It's better to have the tools to generate the tables in Phobos itself, than to be dependent on an external repo that may go out-of-sync eventually.
> 
> Yes, but it’s non-trivial at the moment: if you take a look at the script that generates the tables, it takes both 32-bit and 64-bit executables to populate them.
> 
> I think having it in the tools repo should be fine though. Last time I tried to update to Unicode 10, I found one table in Phobos that is missing from the generator (ooops!).

I took a first stab at integrating this into dlang/tools:

	https://github.com/quickfur/tools/tree/unicode_gen

So far, I can get the 64-bit generator to run and produce the generated unicode_*.d files. Unfortunately they are missing the 32-bit data, because I couldn't get a 32-bit dmd toolchain working on my PC.

Maybe you could take a look and submit PRs against that branch for any fixes you'd like to get in?  I'll see if I can somehow get 32-bit working on my PC.

Alternatively, maybe the solution is to hack the Trie code so that it uses explicit int sizes rather than size_t, then we can use it to generate both 32-bit and 64-bit tables without requiring the host platform to support both.  I imagine we may have problems getting the tools repo to build on the autotester once we integrate gen_uni into the makefile, unless we do something like this.
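To make the word-size issue concrete, here is a rough sketch, not std.uni's actual Trie layout: a hypothetical `pack2` helper that packs 2-bit values into storage words of an explicitly chosen type `W`. The same input produces different word arrays for `uint` and `ulong`, which is why tables built in terms of `size_t` differ between 32-bit and 64-bit builds, and why parameterizing on the word type would let one host emit both.

```d
/// Packs 2-bit values into words of an explicitly chosen type W.
/// Hypothetical helper for illustration; not part of std.uni.
W[] pack2(W)(const(ubyte)[] values)
{
    enum perWord = W.sizeof * 8 / 2;   // how many 2-bit values fit in one word
    auto words = new W[(values.length + perWord - 1) / perWord];
    foreach (i, v; values)
        words[i / perWord] |= cast(W)(v & 3) << (2 * (i % perWord));
    return words;
}

void main()
{
    auto values = new ubyte[17];
    values[0] = 1;
    values[1] = 2;
    values[16] = 3;

    auto w32 = pack2!uint(values);    // what a 32-bit size_t layout would look like
    auto w64 = pack2!ulong(values);   // what a 64-bit size_t layout would look like

    assert(w32.length == 2 && w64.length == 1);
    assert(w32[0] == 9 && w32[1] == 3);
    assert(w64[0] == (9UL | (3UL << 32)));
}
```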


> > When I get around to making a PR for strwidth AKA displayWidth, the plan is to check-in compileWidth.d in some form into Phobos somewhere, so that somebody else can pick it up and improve the implementation in the future if I'm not around / unavailable.
> > 
> > If we can get gen_uni into Phobos, perhaps we can even include the displayWidth table generation in gen_uni too, so that all the table generation code is in one place.
> 
> Right. A good step would be to move it to tools, then add your code.
[...]

Good idea.  Well, I started with the branch linked above in my fork of dlang/tools.  If I can get it off the ground, I'll add the displayWidth stuff in as well, then formulate a PR to add displayWidth to std.uni.

Well, technically I don't need to wait for that, since I could just add the precomputed table directly into std/internal/unicode_tables.d. But it's probably better to let the generator do the job instead.  A precomputed table is rather hard to review for correctness when it comes PR review time. :-D


T

-- 
Don't get stuck in a closet---wear yourself out.
January 18, 2018
On Wednesday, 17 January 2018 at 22:59:58 UTC, H. S. Teoh wrote:
> I took a first stab at integrating this into dlang/tools:
>
> 	https://github.com/quickfur/tools/tree/unicode_gen
>
> So far, I can get the 64-bit generator to run and produce the generated unicode_*.d files. Unfortunately they are missing the 32-bit data, because I couldn't get a 32-bit dmd toolchain working on my PC.



>
> Maybe you could take a look and submit PRs against that branch for any fixes you'd like to get in?  I'll see if I can somehow get 32-bit working on my PC.
>
> Alternatively, maybe the solution is to hack the Trie code so that it uses explicit int sizes rather than size_t, then we can use it to generate both 32-bit and 64-bit tables without requiring the host platform to support both.

Yes, I guess we have to allow the word size to be redefined. I just wanted the fastest version by default, without the possibility of screwing up on the user side of things.

Also, I forgot to mention that you can pass BitPacked!(ubyte,2) to the Trie template as the value type to use 2 bits per value. It should reduce your width table 4-fold. Just saying ;)
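The 4-fold figure follows from the arithmetic: display widths of 0, 1 and 2 fit in 2 bits, so four values share one byte instead of taking a byte each. The sketch below shows that arithmetic with a hypothetical `Packed2` struct over a flat array; it does not use `BitPacked` or the Trie itself, whose internals are assumed to differ.

```d
/// Hypothetical flat store holding one 2-bit value per slot (illustration only).
struct Packed2
{
    ubyte[] bytes;

    this(size_t count)
    {
        bytes = new ubyte[(count + 3) / 4];   // four values per byte
    }

    ubyte opIndex(size_t i) const
    {
        return cast(ubyte)((bytes[i / 4] >> (2 * (i % 4))) & 3);
    }

    void opIndexAssign(ubyte v, size_t i)
    {
        immutable shift = 2 * (i % 4);
        // Clear the old 2-bit field, then write the new value into it.
        bytes[i / 4] = cast(ubyte)((bytes[i / 4] & ~(3 << shift)) | ((v & 3) << shift));
    }
}

void main()
{
    enum n = 0x110000;   // number of Unicode codepoints
    auto packed = Packed2(n);
    packed[0x4E2D] = 2;
    packed['a'] = 1;

    assert(packed[0x4E2D] == 2 && packed['a'] == 1 && packed['b'] == 0);
    assert(packed.bytes.length * 4 == n);   // a quarter of one byte per value
}
```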
January 19, 2018
On Thu, Jan 18, 2018 at 06:42:26PM +0000, Dmitry Olshansky via Digitalmars-d wrote: [...]
> Also, I forgot to mention that you can pass BitPacked!(ubyte,2) to the Trie template as the value type to use 2 bits per value. It should reduce your width table 4-fold. Just saying ;)

Thanks for the tip!  Indeed, the table size was reduced 4-fold. Awesome.

However, now I'm finding that it no longer works properly when loaded from the precompiled data.  It appears to have something to do with the default value for the width table being 1 rather than ubyte.init, and so far I couldn't figure out how to get the Trie ctor that takes .offsets, .sizes, .data to specify a default value.  So now the trie is returning the wrong value for certain dchar ranges. :-(


T

-- 
Some ideas are so stupid that only intellectuals could believe them. -- George Orwell
January 20, 2018
On Friday, 19 January 2018 at 19:33:28 UTC, H. S. Teoh wrote:
> On Thu, Jan 18, 2018 at 06:42:26PM +0000, Dmitry Olshansky via Digitalmars-d wrote: [...]
>> Also, I forgot to mention that you can pass BitPacked!(ubyte,2) to the Trie template as the value type to use 2 bits per value. It should reduce your width table 4-fold. Just saying ;)
>
> Thanks for the tip!  Indeed, the table size was reduced 4-fold. Awesome.
>
> However, now I'm finding that it no longer works properly when loaded from the precompiled data.  It appears to have something to do with the default value for the width table being 1 rather than ubyte.init, and so far I couldn't figure out how to get the Trie ctor that takes .offsets, .sizes, .data to specify a default value.

Why would you need a default in a low-level construction? I think it naturally takes the tables with whatever was stored in there. There is no processing.

So the default has to be explicitly stored while the trie is being built.
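A toy illustration of that point, under the assumption of a simple two-level table rather than std.uni's real Trie: the hypothetical `build` function writes the default into every page it allocates, so a table reconstructed from the raw arrays needs no default parameter at all. `TwoLevel` and `build` are made-up names for this sketch.

```d
/// Toy two-level lookup table (illustration only, not std.uni's layout).
struct TwoLevel
{
    ushort[] index;   // page number for each 256-codepoint block
    ubyte[] data;     // pages of 256 values each

    ubyte opIndex(dchar ch) const
    {
        return data[index[ch >> 8] * 256 + (ch & 0xFF)];
    }
}

/// Build step: the default is physically written into the table data.
TwoLevel build(const ubyte[dchar] map, ubyte defValue)
{
    TwoLevel t;
    t.index = new ushort[0x110000 >> 8];   // every block starts on page 0
    t.data = new ubyte[256];
    t.data[] = defValue;                   // page 0 is the shared default page
    foreach (ch, v; map)
    {
        immutable block = ch >> 8;
        if (t.index[block] == 0)
        {
            // Allocate a fresh page, pre-filled with the default.
            t.index[block] = cast(ushort)(t.data.length / 256);
            immutable start = t.data.length;
            t.data.length += 256;
            t.data[start .. $] = defValue;
        }
        t.data[t.index[block] * 256 + (ch & 0xFF)] = v;
    }
    return t;
}

void main()
{
    ubyte[dchar] map;
    map['\u4E2D'] = 2;
    map['\u0301'] = 0;
    auto built = build(map, 1);

    // "Loading" from raw arrays: no default is passed, none is needed.
    auto loaded = TwoLevel(built.index.dup, built.data.dup);
    assert(loaded['\u4E2D'] == 2);
    assert(loaded['\u0301'] == 0);
    assert(loaded['a'] == 1);   // the default came along inside the data
}
```

If the precompiled data returns wrong values, the first thing to check under this model is whether the dumped arrays are exactly the ones the builder produced, since the default lives nowhere else.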

> So now the trie is returning the wrong value for certain dchar ranges. :-(
>