June 13, 2004
In article <cagrbo$1rlb$1@digitaldaemon.com>, Arcane Jill says...
>

>I've got some ideas, but it's too early in the morning for me right now. Will get back to you later when I've woken up a bit.

I'm a bit more awake now. The approach that I took when I had to do this sort of thing for my employer some years ago turned out to be wrong, in hindsight. But I learn from my mistakes. Back then, I had a tool to parse the Unicode database files (no problem there) into a custom format binary file, which then got turned into a byte array and stuck into the source code. It was a bad idea because my format was not extendable, and it included only those parts of the database which I actually needed. For this project, we need all of it.

I have an approach now which will work, and I've got some tests up and running to prove that it works. It's a two-stage process. In stage one, a function I wrote parses the Unicode database files and produces some HUMUNGOUSLY large binary files, containing every scrap of information there is on Unicode, in an easily accessible form. Then a second phase function comes along and uses those large files as input, creating as its output - D source files. These end up quite small, because the function figures out the best way to pack the data. The source files created declare const lookup tables (using your 12-bit/9-bit split with duplicate tables removed).
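
A minimal sketch of the packing step described above, written in present-day D rather than the 2004 dialect. The names `PackedTable` and `pack` are hypothetical and not taken from the actual tool. It assumes one byte of property data per code point and a 9-bit page offset (512 entries per page), and it builds the tables at run time where the real generator would write them out as const data.

```d
enum pageBits = 9;                 // low 9 bits select an entry within a page
enum pageSize = 1 << pageBits;     // 512 entries per page
enum codePoints = 0x110000;        // size of the Unicode code space

/// Hypothetical container for one packed property.
struct PackedTable
{
    ushort[] index;    // one slot per page: which pooled page to use
    ubyte[][] pages;   // pool of distinct pages

    ubyte opIndex(dchar c) const
    {
        return pages[index[c >> pageBits]][c & (pageSize - 1)];
    }
}

/// Splits a flat per-code-point array into pages and keeps each distinct page once.
PackedTable pack(const(ubyte)[] flat)
{
    assert(flat.length % pageSize == 0);
    PackedTable t;
    for (size_t start = 0; start < flat.length; start += pageSize)
    {
        const page = flat[start .. start + pageSize];

        // Linear search of the pool; fine for a build-time tool.
        size_t slot = t.pages.length;
        foreach (i, existing; t.pages)
        {
            if (existing == page) { slot = i; break; }
        }
        if (slot == t.pages.length)
            t.pages ~= page.dup;       // first time this page content is seen
        t.index ~= cast(ushort) slot;
    }
    return t;
}

unittest
{
    auto flat = new ubyte[codePoints];
    flat[0x30 .. 0x3A] = 1;            // pretend property: "is an ASCII digit"
    auto t = pack(flat);
    assert(t.pages.length == 2);       // the digit page plus one shared all-zero page
    assert(t['5'] == 1 && t['A'] == 0 && t['\U0010FFFF'] == 0);
}
```

With this toy property, the 2176 pages of the code space collapse into two stored pages, which is where the size saving comes from.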

This approach leaves each Unicode property having its own independent source file(s). Since each source file will become a single .obj file, when you link with the library, you will only get the data for those properties you actually need. If you never call a function to get the bidi-combining-class, for example, then that function, and the data to support it, won't even get linked in.

And the beauty of this approach is that it is completely extendable to all Unicode properties in all of their files.

Oh, and there's another good thing too. Since the source code writing is automated, it follows that variations of lookup algorithms can just happen automagically. For example, the isASCIIHexDigit() property can be implemented very efficiently with only a tiny amount of data and a slightly modified lookup function. The source-code-writing tool could figure that out and use the smaller data lookup.
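
To illustrate the kind of special case meant here, this is a sketch (modern D) of what such a tool might emit for a property that is only ever true below U+0080. The 128-bit bitmap and the range check are assumptions for illustration, not the output of the real generator.

```d
/// Hypothetical generated lookup for a property confined to ASCII.
bool isASCIIHexDigit(dchar c)
{
    // Bit n is set when code point n has the property:
    // '0'..'9' are bits 48..57 of word 0,
    // 'A'..'F' and 'a'..'f' are bits 1..6 and 33..38 of word 1.
    static immutable ulong[2] bits =
        [0x03FF_0000_0000_0000, 0x0000_007E_0000_007E];

    // The "slightly modified lookup": reject everything outside the bitmap first.
    return c < 128 && ((bits[c >> 6] >> (c & 63)) & 1) != 0;
}

unittest
{
    assert(isASCIIHexDigit('0') && isASCIIHexDigit('F') && isASCIIHexDigit('a'));
    assert(!isASCIIHexDigit('g') && !isASCIIHexDigit('\uFF11'));
}
```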

I haven't started on any actual Unicode /algorithms/ yet - just getting the fast+small property lookups working was quite a challenge.

So it's going well. If you want to email me privately to consult, you can head over to dsource and post me a message there. This is going to be fun!

Jill


June 13, 2004
Arcane Jill wrote:
> In article <cagrbo$1rlb$1@digitaldaemon.com>, Arcane Jill says...
> 
> 
>>I've got some ideas, but it's too early in the morning for me right now. Will
>>get back to you later when I've woken up a bit.
> 
> 
> I'm a bit more awake now. The approach that I took when I had to do this sort of
> thing for my employer some years ago turned out to be wrong, in hindsight. But I
> learn from my mistakes. Back then, I had a tool to parse the Unicode database
> files (no problem there) into a custom format binary file, which then got turned
> into a byte array and stuck into the source code. It was a bad idea because my
> format was not extendable, and it included only those parts of the database
> which I actually needed. For this project, we need all of it.
> 
> I have an approach now which will work, and I've got some tests up and running
> to prove that it works. It's a two-stage process. In stage one, a function I
> wrote parses the Unicode database files and produces some HUMUNGOUSLY large
> binary files, containing every scrap of information there is on Unicode, in an
> easily accessible form. Then a second phase function comes along and uses those
> large files as input, creating as its output - D source files. These end up
> quite small, because the function figures out the best way to pack the data. The
> source files created declare const lookup tables (using your 12-bit/9-bit split
> with duplicate tables removed).
> 
> This approach leaves each Unicode property having its own independent source
> file(s). Since each source file will become a single .obj file, when you link
> with the library, you will only get the data for those properties you actually
> need. If you never call a function to get the bidi-combining-class, for example,
> then that function, and the data to support it, won't even get linked in.
> 
> And the beauty of this approach is that it is completely extendable to all
> Unicode properties in all of their files.
> 
> Oh, and there's another good thing too. Since the source code writing is
> automated, it follows that variations of lookup algorithms can just happen
> automagically. For example, the isASCIIHexDigit() property can be implemented
> very efficiently with only a tiny amount of data and a slightly modified lookup
> function. The source-code-writing tool could figure that out and use the smaller
> data lookup.
> 
> I haven't started on any actual Unicode /algorithms/ yet - just getting the
> fast+small property lookups working was quite a challenge.
> 
> So it's going well. If you want to email me privately to consult, you can head
> over to dsource and post me a message there. This is going to be fun!


It sounds like you're really excited about this one ;). Your ideas sound good as well, but some comments:

- the optimal page size is different for each Unicode property. For example, the new unichar module uses 128 elements per page for the character types and 512 elements per page for the mapping tables.
It would be optimal if the data creation tool would automatically figure out the best size and include the corresponding constants in the source file. A simple brute-force try-them-all approach should suffice, since there really are only about half a dozen realistic page sizes (they need to be a power of 2).

- a simple RLE compression of the final data that is compiled into the executable has proven to be very effective, since Unicode data usually contains lots of big gaps and ranges with the same properties. This dramatically reduces the size of the compiled executable.

- if you have multiple properties of the same type it can be a huge space saver to use the same "page pool" when decomposing the data into small pages. This worked well in unichar, which now has a single page pool for the lower, upper and title mappings. The combined pool data is only slightly larger than the pool for a single mapping.

- I'm not convinced that the lookup tables and algorithms should be created in a completely automatic way. A certain level of automation is obviously necessary, but I think it would pay off if the data can be filtered before it is decomposed into the lookup tables. The algorithms for accessing those tables would have to be adaptable too. Another example from the unichar module: the case mapping data is now stored as offsets relative to the original character index. This has two advantages. Number one is that the biggest offset fits very comfortably into 2 bytes, so we save one byte per element. The second advantage is that this dramatically increases the number of pages with the same contents, so the page pool ends up being a lot smaller.

- and my last concern: it seems that you want to develop a very general tool to implement every aspect of the Unicode standard. That is very commendable and nothing is wrong with it in itself, but I would advise you to reflect on the amount of work that is necessary to implement all that stuff. I have no idea how much time you can put into this project, but I know that my own time is unfortunately very limited. If you are in a similar situation it may be wise to tune the goals down a bit and progress in smaller steps, implementing one Unicode algorithm at a time. There is not much use in a full Unicode library that ends up being vaporware.
Then again if you DO have the time, please do not let my skepticism dampen your enthusiasm ;).
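
The brute-force page size search from the first point could look roughly like this in modern D. The cost model (2-byte index entries, 1-byte data entries) and the function names are assumptions made for the sketch, and the input length is assumed to be a multiple of the largest page size tried.

```d
/// Size in bytes of a two-stage table for the given page size,
/// under the assumed cost model.
size_t packedSize(const(ubyte)[] flat, size_t pageSize)
{
    bool[immutable(ubyte)[]] unique;   // set of distinct page contents
    for (size_t start = 0; start < flat.length; start += pageSize)
        unique[flat[start .. start + pageSize].idup] = true;

    const pageCount = flat.length / pageSize;
    return pageCount * ushort.sizeof + unique.length * pageSize;
}

/// Tries the half dozen realistic power-of-2 page sizes and keeps the cheapest.
size_t bestPageSize(const(ubyte)[] flat)
{
    size_t best = 0;
    size_t bestCost = size_t.max;
    foreach (bits; 5 .. 11)            // 32, 64, ... 1024 entries per page
    {
        const size = size_t(1) << bits;
        const cost = packedSize(flat, size);
        if (cost < bestCost)
        {
            bestCost = cost;
            best = size;
        }
    }
    return best;
}

unittest
{
    auto flat = new ubyte[4096];       // all zero: every page is identical
    assert(packedSize(flat, 512) == 8 * 2 + 512);
    assert(bestPageSize(flat) == 64);  // 64 and 128 tie at 192 bytes; first wins
}
```

The generator would then write the winning size into the source file as a constant next to the tables.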


Hauke

P.S.: I haven't found any Unicode-related project on dsource.org. What were you referring to when you said I can contact you there?



June 13, 2004
In article <caigii$15jj$1@digitaldaemon.com>, Hauke Duden says...

>It would be optimal if the data creation tool would automatically figure out the best size and include the corresponding constants in the source file.

That's what I figured.

>- I'm not convinced that the lookup tables and algorithms should be created in a completely automatic way. A certain level of automation is obviously necessary, but I think it would pay off if the data can be filtered before it is decomposed into the lookup tables.

I'd thought of that.

>The algorithms for accessing those tables would have to be adaptable too. Another example from the unichar module: the case mapping data is now stored as offsets relative to the original character index.

Did that too. And I reduced the titlecase mapping down to almost nothing by subtracting it from the uppercase mapping.
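
A rough sketch of both tricks in modern D, using an ASCII-only sample table. The function names are hypothetical, and the 16-bit offset width follows Hauke's remark that the biggest offset fits into 2 bytes.

```d
/// Stores a mapping as signed offsets from each code point.
short[] toOffsets(const(dchar)[] mapped)
{
    auto offsets = new short[mapped.length];
    foreach (i, m; mapped)
        offsets[i] = cast(short)(cast(int) m - cast(int) i);
    return offsets;
}

/// Stores one mapping as its element-wise difference from another.
short[] difference(const(dchar)[] a, const(dchar)[] b)
{
    assert(a.length == b.length);
    auto delta = new short[a.length];
    foreach (i; 0 .. a.length)
        delta[i] = cast(short)(cast(int) a[i] - cast(int) b[i]);
    return delta;
}

/// Sample data only: the simple uppercase mapping restricted to ASCII.
dchar[] asciiUpper()
{
    auto table = new dchar[128];
    foreach (i, ref m; table)
        m = (i >= 'a' && i <= 'z') ? cast(dchar)(i - 32) : cast(dchar) i;
    return table;
}

unittest
{
    auto upper = asciiUpper();
    auto title = asciiUpper();         // in ASCII, titlecase equals uppercase
    auto offsets = toOffsets(upper);
    assert(offsets['a'] == -32 && offsets['A'] == 0);

    // Lookup adds the stored offset back onto the code point.
    assert(cast(dchar)('q' + offsets['q']) == 'Q');

    // Titlecase minus uppercase is all zeros here, so it packs to one page.
    foreach (d; difference(title, upper))
        assert(d == 0);
}
```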


>- and my last concern: it seems that you want to develop a very general tool to implement every aspect of the Unicode standard. That is very commendable and nothing is wrong with it in itself, but I would advise you to reflect on the amount of work that is necessary to implement all that stuff.

Panick ye not. I'm just thinking ahead. For the moment, it's really just property access I'm doing, then come the normalization algorithms. And then I'll stop, and go and redo Ints a bit better. I do have an idea of how much work is involved, but I've done this before (in C++, and less well) so I know what's involved.


>There is not much use in a full Unicode library that ends up being vaporware.

Well it can't ever be that if it's open source. If you or I get bored with it and drop out, someone else can carry on.


>Then again if you DO have the time, please do not let my skepticism dampen your enthusiasm ;).

Cool.


>P.S.: I haven't found any Unicode-related project on dsource.org. What were you referring to when you said I can contact you there?

Ah, no, there isn't any. But I have a user account there, and the Deimos project. My username is "Arcane Jill". It seems to be possible to send private messages to members. I mentioned that as a possibility because I am reluctant to post my email address on a public forum.

Jill


June 13, 2004
Arcane Jill wrote:
>>- and my last concern: it seems that you want to develop a very general tool to implement every aspect of the Unicode standard. That is very commendable and nothing is wrong with it in itself, but I would advise you to reflect on the amount of work that is necessary to implement all that stuff.
> 
> 
> Panick ye not. I'm just thinking ahead. For the moment, it's really just
> property access I'm doing, then come the normalization algorithms. And then I'll
> stop, and go and redo Ints a bit better. I do have an idea of how much work is
> involved, but I've done this before (in C++, and less well) so I know what's
> involved.

Me, panicking? No chance ;).
Just wanted to make sure you know about the scope of this.


>>There is not much use in a full Unicode library that ends up being vaporware.
> 
> 
> Well it can't ever be that if it's open source. If you or I get bored with it
> and drop out, someone else can carry on.

With luck someone might. But it is an old story in open source projects that there are lots of initial enthusiasts, but very few people who have enough endurance to stay active over a longer period of time. And there's also the question of experience and skill...

I think it is prudent to not count on external help to magically show up. With some projects it does, with many it doesn't. So we should make sure that we don't bite off more than we can chew.

>>Then again if you DO have the time, please do not let my skepticism dampen your enthusiasm ;).
> 
> 
> Cool.
> 
> 
> 
>>P.S.: I haven't found any Unicode-related project on dsource.org. What were you referring to when you said I can contact you there?
> 
> 
> Ah, no, there isn't any. But I have a user account there, and the Deimos
> project. My username is "Arcane Jill". It seems to be possible to send private
> messages to members. I mentioned that as a possibility because I am reluctant to
> post my email address on a public forum.

Ah, ok. So the reply address you use with your NG posts (Arcane_member@...) is invalid? Anyway, mine is not, so if you need to contact me...

Hauke

June 14, 2004
Arcane Jill wrote:

<snip>
> Oky doke - here goes. One thing though - this is merely a workaround for
> existing bugs, it does not really add any new functionality beyond what such
> arrays are supposed to do already. So I don't imagine you will use this code.
> You'd probably prefer to just fix the bugs, then a workaround won't be needed at
> all.
<snip>

Just as I've been writing the bit offset implementation that we've been talking about....

http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/495

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment.  Please keep replies on the 'group where everyone may benefit.

June 14, 2004
In article <cak50j$n70$1@digitaldaemon.com>, Stewart Gordon says...
>
>Just as I've been writing the bit offset implementation that we've been talking about....

Excellent!



June 15, 2004
Walter: It would sure be nice to also have the ireplace() and icount() functions
added to Phobos...if possible. Anyway, I decided to go ahead and write them up
from your replace() and count(), as I had done with ifind() and irfind() per your
advice. I sure hope it's ok to bug you just a little for these kinds of updates.
;)

// Note these functions build off the ifind() defined in a previous post.

#/********************************************
# * Replace occurrences of from[] with to[] in s[].
# *
# * (A case insensitive version of std.string.replace)
# */
#
#char[] ireplace
#(
#    in char[] s,
#    in char[] sfrom,
#    in char[] sto
#)
#{
#    char[] srtn;
#    int    ip;
#    int    ix;
#
#    debug(string) printf("ireplace('%.*s','%.*s','%.*s')\n", s, sfrom, sto);
#
#    // Nothing to do if s or sfrom is empty, or if sfrom is longer than s.
#    // sto may be "" (empty string), which deletes every match.
#    if ( sfrom.length == 0 || s.length == 0 || sfrom.length > s.length )
#        return s;
#
#    // find the position of the string from's 1st char
#    ip = ifind( s, cast(char)sfrom[ 0 ] );
#
#    // if the 1st char was not found, or too little of s is left
#    // for a match to fit, return "s"
#    if ( ip == -1 || ip + sfrom.length > s.length )
#        return s;
#
#    // copy the text before the first candidate (dup, so that the
#    // appends below can never write into s)
#    srtn = s[ 0 .. ip ].dup;
#
#    for ( ix = ip; ix < ( s.length - sfrom.length ) + 1; ix++ )
#    {
#        if ( icmp( s[ ix .. ( ix + sfrom.length ) ], sfrom ) == 0 )
#        {
#            srtn ~= sto;
#            ix += sfrom.length - 1;
#        }
#        else
#            srtn ~= s[ ix .. ( ix + 1 ) ];
#    }
#
#    // append whatever is left after the last position a match could start
#    if ( ix < s.length )
#        srtn ~= s[ ix .. s.length ];
#
#    return srtn;
#
#} // end ireplace( in char[], in char[], in char[] )
#
#unittest
#{
#    debug(string) printf("string.ireplace.unittest\n");
#
#    char[] s = "This is a FOO foO list";
#    char[] from = "fOo";
#    char[] to = "silly";
#    char[] r;
#    int i;
#
#    r = ireplace(s, from, to);
#    i = icmp(r, "This is a silly silly list");
#    assert(i == 0);
#}
#
#/***********************************************
# * Count up all instances of sub[] in s[].
# *
# * (A case insensitive version of std.string.count)
# */
#
#int icount
#(
#    in char[] s,
#    in char[] sub
#)
#{
#    int i;
#    int j;
#    int icounter = 0;
#
#    if ( s.length == 0 || sub.length == 0 || sub.length > s.length ) return 0;
#
#    for ( i = 0; i < s.length; i += j + sub.length )
#    {
#        j = ifind( s[ i .. s.length ], sub);
#
#        if ( j == -1 ) break;
#        icounter++;
#    }
#
#    return icounter;
#
#} // end icount( in char[], in char[] )
#
#unittest
#{
#    debug(string) printf("string.icount.unittest\n");
#
#    char[] s = "This is a foFofof list";
#    char[] sub = "FOf";
#    int i;
#
#    i = icount(s, sub);
#    assert(i == 2);
#}


-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"