June 09, 2004
Walter wrote:

>Another option is to only allow bit slicing on byte boundaries, and only
>allow pointers to bits if they are in bit 0 of a byte.
>  
>
Yeah I've been thinking this is probably the best option.  Users could make their own bit-pointers for the extra 3 bits.  Perhaps you could enable something like (for the boundary approach):

bit [] array;
...
bit * bp = &array[0];

bp[1] = 1; // set bit 1 relative to bp

bp[1000]; // try to access bit 1000

A third option I was thinking about:

Always use a 32-bit offset from the start of the bit array (thus requiring 64 bits per pointer).  I'm sure that would enable some parts of the algorithm to be optimised: when iterating through a loop or slicing, only one of the two values would need to be incremented.

When converting to another pointer type (such as void*), compute the byte boundary and lose the bit location information <- that could also be done with Stewart's suggestion.  Now if the user wrote something like:

byte * bp = cast(byte*) &array[0];

Then the compiler could optimise out the extra 32-bits.

Rationale: We don't have any bit pointers at the moment, so we have no performance to lose.  This way a bit pointer is more functional and you can still slice along the boundary.
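
For illustration, the third option might look roughly like the sketch below. It is written against present-day D with ubyte storage standing in for the bit array; the names BitPtr, advance and toBytePtr are made up for this example and are not part of any proposal or library.

```d
// Sketch only: a "fat" bit pointer made of a byte pointer plus a 32-bit bit offset.
struct BitPtr
{
    ubyte* base;   // start of the underlying storage (assumed byte-addressable)
    uint   offset; // bit offset from base; the only field that moves

    // Read bit i relative to this pointer.
    bool opIndex(size_t i) const
    {
        const size_t n = offset + i;
        return ((base[n >> 3] >> (n & 7)) & 1) != 0;
    }

    // Write bit i relative to this pointer.
    void opIndexAssign(bool value, size_t i)
    {
        const size_t n = offset + i;
        const ubyte mask = cast(ubyte)(1 << (n & 7));
        if (value) base[n >> 3] |= mask;
        else       base[n >> 3] &= cast(ubyte) ~mask;
    }

    // Stepping through the bits only changes the offset, never base.
    void advance(uint n = 1) { offset += n; }

    // Converting to a byte pointer keeps the byte boundary and drops the bit position.
    ubyte* toBytePtr() { return base + (offset >> 3); }
}

unittest
{
    ubyte[4] storage;
    auto bp = BitPtr(storage.ptr, 3); // points at bit 3 of the first byte
    bp[1] = true;                     // sets bit 4 of storage[0]
    assert(storage[0] == 0b0001_0000);
}
```

The point of the layout is that iteration and slicing touch offset alone, and a cast to a plain pointer can discard it.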

-- 
-Anderson: http://badmama.com.au/~anderson/
June 10, 2004
In article <ca7pet$fn7$1@digitaldaemon.com>, Vathix says...
>
>Hello, I just wanted to let you know that I wrote those functions awhile ago for a String class that can be found at www.dprogramming.com/stringclass.d . It contains all the free functions, and a few others such as findany(), endswith(), etc; and case insensitive versions. I haven't said much about it because it's completely based off Walter's code, so it belongs to him. The class can be stripped out to just use the functions. If the code isn't good enough, just ignore me; have fun!
>
>

Vathix: I looked over your stringclass.d code that's based off of Walter's
string.d, and it does look a lot more in line with what he'll accept. Myself,
I'd just like to have the ifind(char[],char[])/ifind(char[], char) and
irfind(char[], char[])/irfind(char[], char) functions in std.string.

Anyways, it would seem my "C" skills have gotten a bit rusty, 'cause I've been relearning a lot of things I had forgotten about in trying to create a good version of the ifind() and irfind() functions for general use.

So please feel free to post those ifind code portions and see what Walter thinks.
Cause currently I'm beginning to think my versions are going to still be a
little bulkier than what Walter will allow in, and I would like to move on and
finish up my propercase() function (which is why I even mentioned that the
ifind()/irfind() were missing in the first place). :) And besides that, Walter is
a very busy man and I don't want to waste his time.

-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
June 10, 2004
In article <ca7lhq$9eo$1@digitaldaemon.com>, Hauke Duden says...
>
>Arcane Jill wrote:
>> Full casing (but not simple casing) has localized exceptions ONLY for Tukish, Lithuanian and Azeri. In principle, other exceptions could be added in the future. Simple casing is completely locale independent.
>
>Ouch. I didn't know that. Makes me feel happy that I stayed away from the special casings up to now ;).
>
>Thanks for clearing up the misunderstanding!
>
>Hauke


Actually, I think I'd quite like to have a bash at writing some of the Unicode algorithms. I've finished the Int class now (Well, almost. I've just got to add a little bit of memory management, but I know how to do that now). After that, I was planning to move onto the next bit of my crypto lib (random numbers). But - the Unicode functions would be relatively quick to write (compared with the crypto stuff), and it would be quite nice to have a break and do something else for a change.

If I do that, I'll need to collaborate with you, Hauke. There's no point in duplicating effort, and we could do with a common format for the compiled unicode data files. In a way, that's YOUR area of expertize, not mine, because you seemed to know that >>9 was more efficient than [n], something I wouldn't have known. Also, we mustn't forget the UPR format I mentioned, which has the benefits of being binary, easily parsable, extendable, publicly available, open source, and easily updateable with each new version of Unicode.

I could do normalization functions first - canonical/compatibility equivalence; finding glyph boundaries, that sort of thing. But I don't want to be treading on your toes, which I would be if I went and invented a new format for the compiled data. So I don't want to do that without collaboration.

My concern is that you probably only compiled in enough information to do simple casing, so I wouldn't be able to extract normalization/boundary information from your compiled format. (But I'm guessing, as I haven't studied your source code in depth).

I think it would be great if a D standard library had FULL Unicode support. Even C++ and Java don't do that. (And that's not even mentioning Java's crippled 16-bit chars). It would effectively turn D into the language of choice for Unicode apps.

Anyway, let me know what you think (and check out UPR, if you have time. The URL is http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/, with the format itself documented at http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/format.html - or we could invent our own, but why re-invent the wheel?).

Arcane Jill


June 10, 2004
Hauke Duden wrote:
<snip>
> The interesting entries are the last three. Their format is UPPER;LOWER;TITLE. So both letters have an upper and title mapping to 03A3 and no lower mapping.

So that's why the uppercase form is given twice.  I couldn't find a key to the columns anywhere.

<snip>
>> So, which characters do the one-to-many mappings bring about?
> 
> An example of a character with special casing is 1FB2 (GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI). Its upper case maps to 1FBA + 0399 (GREEK CAPITAL LETTER ALPHA WITH VARIA + GREEK CAPITAL LETTER IOTA).

Yes, there are two possible meanings of "one-to-many mapping".  A letter  splitting into two letters when the case is changed.  Or a letter that case-converts to different letters depending on context, like Greek sigma.
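
To make the first meaning concrete, here is a small sketch in present-day D. The function name fullUpperSketch and its two-entry table are invented for illustration; only the 1FB2 entry comes from this thread, the 00DF (sharp s) entry is the other well-known case from SpecialCasing.txt, and everything else falls back to the simple one-to-one mapping in std.uni.

```d
import std.uni : toUpper;

// Sketch only: full uppercasing with a tiny hand-written special-casing table.
// A real implementation would load these entries from SpecialCasing.txt.
dstring fullUpperSketch(dstring s)
{
    dstring result;
    foreach (dchar c; s)
    {
        switch (c)
        {
            // One character becomes two: 1FB2 -> 1FBA + 0399.
            case '\u1FB2': result ~= "\u1FBA\u0399"d; break;
            // LATIN SMALL LETTER SHARP S -> "SS".
            case '\u00DF': result ~= "SS"d; break;
            // Everything else: simple (one-to-one) upper mapping.
            default:       result ~= toUpper(c); break;
        }
    }
    return result;
}
```

Note that the result can be longer than the input, which is why this kind of mapping cannot be done in place on a fixed-size buffer.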

How does it handle the title case of Welsh digraphs, for example?  Or is that another localised exception yet to be written into the standard?

>>>> And the mappings that there are aren't language independent.
>>>
>>> Huh? Casing is not effected by locale. Maybe you are thinking about collation?
>>
>> What do you mean by that?
> 
> Collation is a locale dependent comparison of strings. I.e. it defines the "phone book" ordering of strings in a particular language.

I didn't think that had anything to do with the fact that, e.g. in Turkish, the uppercase form of 0069 is 0130 instead of 0049.
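
As a sketch of what such a tailoring amounts to (present-day D; upperTailored and its turkic flag are hypothetical names, not an existing API), the locale only has to intercept the one exceptional letter before the default mapping is applied:

```d
import std.uni : toUpper;

// Sketch only: simple uppercasing with the Turkish/Azeri exception for 'i'.
dchar upperTailored(dchar c, bool turkic)
{
    // In a Turkic locale, 0069 uppercases to 0130 (capital I with dot above).
    if (turkic && c == 'i')
        return '\u0130';
    // Otherwise use the locale-independent simple mapping (0069 -> 0049).
    return toUpper(c);
}
```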

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment.  Please keep replies on the 'group where everyone may benefit.
June 10, 2004
Walter wrote:

> Another option is to only allow bit slicing on byte boundaries, and only allow pointers to bits if they are in bit 0 of a byte.

Yes, that was another suggestion in the debate.  But I'm inclined to believe some of my experiments could be put to practical use.

I'll probably do some more experimenting over the weekend....

Stewart.

June 10, 2004
Arcane Jill wrote:
> Actually, I think I'd quite like to have a bash at writing some of the Unicode
> algorithms. I've finished the Int class now (Well, almost. I've just got to add
> a little bit of memory management, but I know how to do that now). After that, I
> was planning to move onto the next bit of my crypto lib (random numbers). But -
> the Unicode functions would be relatively quick to write (compared with the
> crypto stuff), and it would be quite nice to have a break and do something else
> for a change.
> 
> If I do that, I'll need to collaborate with you, Hauke. There's no point in
> duplicating effort, and we could do with a common format for the compiled
> unicode data files. In a way, that's YOUR area of expertize, not mine, because
> you seemed to know that >>9 was more efficient than [n], something I wouldn't
> have known. Also, we mustn't forget the UPR format I mentioned, which has the
> benefits of being binary, easily parsable, extendable, publicly available, open
> source, and easily updateable with each new version of Unicode.
> 
> I could do normalization functions first - canonical/compatibility equivalence;
> finding glyph boundaries, that sort of thing. But I don't want to be treading on
> your toes, which I would be if I went and invented a new format for the compiled
> data. So I don't want to do that without collaboration.

I think it is a good idea to coordinate our Unicode efforts. I haven't written any code for normalization, so this would be really useful.

What I have worked on lately (when I found a few minutes of spare time) is a string interface that abstracts from the specific encoding plus implementations for some common encodings (UTF-X, Latin-1, ASCII, system code page). It includes the usual string functions like comparing (characters ordered by index - no collation), searching, concatenation, etc. Caseless comparison and searching is also implemented (using the simple lower mapping - no full case folding).

So if you write normalization routines that would be great!

> My concern is that you probably only compiled in enough information to do simple
> casing, so I wouldn't be able to extract normalization/boundary information from
> your compiled format. (But I'm guessing, as I haven't studied your source code
> in depth).

That's true - the unichar data only contains the case mappings and character type info. I think it is important to separate the different Unicode tables, so that using a single Unicode routine won't cause ALL the data to be linked into the program.
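
For readers wondering what a ">>9" lookup might look like, here is a rough sketch in present-day D. It assumes the shift refers to a two-stage (paged) table with 512 entries per page, which is a guess rather than a description of the actual unichar data; PagedTable and buildTable are names invented for this example. Keeping one such table per property is also one way to let the linker pull in only the data a program uses.

```d
enum pageBits = 9;
enum pageSize = 1 << pageBits; // 512 code points per page

// Sketch only: code point >> 9 selects a page, the low 9 bits select the entry.
struct PagedTable
{
    ushort[]         pageIndex; // page number for each block of 512 code points
    uint[pageSize][] pages;     // page 0 is all zeros, meaning "no mapping"

    dchar lookup(dchar c) const
    {
        const size_t hi = c >> pageBits;
        if (hi >= pageIndex.length)
            return c;                          // beyond the table: unmapped
        const uint m = pages[pageIndex[hi]][c & (pageSize - 1)];
        return m == 0 ? c : cast(dchar) m;     // 0 means "maps to itself"
    }
}

// Build a table from an associative array of mappings (hypothetical input format).
PagedTable buildTable(const dchar[dchar] mappings)
{
    PagedTable t;
    t.pages.length = 1; // the shared empty page
    foreach (from, to; mappings)
    {
        const size_t hi = from >> pageBits;
        if (hi >= t.pageIndex.length)
            t.pageIndex.length = hi + 1;
        if (t.pageIndex[hi] == 0)
        {
            // First mapping in this block: give it a page of its own.
            t.pages.length = t.pages.length + 1;
            t.pageIndex[hi] = cast(ushort)(t.pages.length - 1);
        }
        t.pages[t.pageIndex[hi]][from & (pageSize - 1)] = to;
    }
    return t;
}
```

Blocks of code points with no mappings all share page 0, so the sparse parts of the code space cost two bytes per 512 characters.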

> I think it would be great if a D standard library had FULL Unicode support. Even
> C++ and Java don't do that. (And that's not even mentioning Java's crippled
> 16-bit chars). It would effectively turn D into the language of choice for
> Unicode apps.

I agree - that is my goal as well. In fact I see it as an opportunity to influence the language in its early stages so that it will have standardized(!) Unicode support. It prevents every component developer from implementing his own, which can cause lots of unnecessary bloat (Unicode data isn't small...).

> Anyway, let me know what you think (and check out UPR, if you have time. The URL
> is http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/, with the format
> itself documented at
> http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/format.html - or we
> could invent our own, but why re-invent the wheel?).

I haven't had time to look at it yet, but I promise to do so ;).


Hauke
June 10, 2004
Stewart Gordon wrote:
> Hauke Duden wrote:
> <snip>
> 
>> The interesting entries are the last three. Their format is UPPER;LOWER;TITLE. So both letters have an upper and title mapping to 03A3 and no lower mapping.
> 
> 
> So that's why the uppercase form is given twice.  I couldn't find a key to the columns anywhere.

The file format is described here:
http://www.unicode.org/Public/UNIDATA/UCD.html#UCD_Files
(see the section about UnicodeData.txt)

> How does it handle the title case of Welsh digraphs, for example?  Or is that another localised exception yet to be written into the standard?

If you know the index of that character you can look it up in this file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt


Hauke


June 10, 2004
Hauke Duden wrote:

<snip>
> The file format is described here:
> http://www.unicode.org/Public/UNIDATA/UCD.html#UCD_Files
> (see the section about UnicodeData.txt)

The first column has been omitted from that list.

>> How does it handle the title case of Welsh digraphs, for example?  Or is that another localised exception yet to be written into the standard?
> 
> If you know the index of that character you can look it up in this file:
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

No, a Welsh digraph isn't a character.  The digraphs are single letters, each composed of two characters.  CH, DD, FF, LL, NG, RH, TH (have I missed one?) are all single letters in Welsh, but AFAICF none has its own Unicode character, leaving the regular Latin letters in the ranges 0043..0054 and 0063..0074 to be combined to make them.

FWIS these digraphs are title-cased together, e.g. properly LLanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch not
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch

OST, looking at the Welsh Wikipedia, they generally seem to be written in mixed case.  Who is right?  Or is it a matter of preference?

Stewart.

June 10, 2004
In article <ca9gkh$2f2$1@digitaldaemon.com>, Stewart Gordon says...
>
>>> How does it handle the title case of Welsh digraphs, for example?  Or is that another localised exception yet to be written into the standard?

I know nothing about Welsh, but, I do know that Welsh is NOT an exception according to the rules of Unicode. Therefore, according to the rules:

Lowercase:  llan...
Titlecase:  Llan...
Uppercase:  LLAN...

Now, what I'm about to say may possibly make me a little unpopular. I /hope/ not, but I wish to be accurate, and, well, what can I say? Please don't shoot the messenger! The fact is, /if/ Unicode has got it wrong, then the place to complain about it is the Unicode Consortium public forum at http://www.unicode.org/consortium/distlist.html - NOT the D forum. Our job is to implement the Unicode standard as it exists today at revision 4.1 - even if we think that standard is wrong. It would be inappropriate for us to start tweaking it here and there just because we don't like bits of it. Errors and omissions in the standard are certainly possible (and even likely), but a standard is a standard, and such errors will inevitably be fixed in the course of time. If and when the standard changes, that's when we should change with it.

I mean - it's not like C++ or Java can do any better!

My sincere apologies in advance if I've offended anyone. Arcane Jill


June 10, 2004
In article <ca7qp4$hrl$1@digitaldaemon.com>, Walter says...
>
>There's no need to .dup the strings. Just have a loop that looks like this:
>
>for (i = 0; i < string1.length; i++)
>{    char c = toupper(string1[i]);
>    if (c != toupper(string2[i]))
>        goto nomatch;
>}
>
>Note that it compares character by character without needing to allocate memory. In fact, just copy the logic in find() and rfind(), replacing memchr and memcmp with case insensitive loops, write some unit tests, and you'll be there.
>
>

Walter: Ok, per your advice I've copied the original find() / rfind() functions
from std.string, and modified them into ifind() / irfind(). I sure hope these
will make the grade <g>.

Below I've used the "#" to keep the indenting intact, so just replace the "#" with "" and all the code will be ready to copy and paste.

Thxs for giving me this chance to add something to "D!" :)

#// string.d
#// Written by Walter Bright
#// Copyright (c) 2001 Digital Mars
#// All Rights Reserved
#// www.digitalmars.com
#
#// String handling functions.
#debug=string;		// uncomment to turn on debugging printf's
#
#debug(string)
#{
#    import std.c.stdio;	// for printf()
#}
#
#import std.string;
#
#/******************************************
# * Find first occurrence of c in string s.
# * Return index in s where it is found.
# * Return -1 if not found.
# *
# * (A case insensitive version of std.string.find)
# */
#
#int ifind
#(
#    in char[] s,
#    in char   c
#)
#{
#    char c1 = c;
#    char c2;
#
#    c1 = ( c1 >= '\x41' && c1 <= '\x5A' ? c1 + '\x20' : c1 );
#
#    for (int i = 0; i  < s.length; i++)
#    {
#        c2 = s[ i ];
#        c2 = ( c2 >= '\x41' && c2 <= '\x5A' ? c2 + '\x20' : c2 );
#        if ( c1 == c2 ) return i;
#    }
#
#    return -1;
#} // end ifind( in char[], in char )
#
#unittest
#{
#    debug(string) printf("string.ifind(char[],char).unittest\n");
#
#    int i;
#
#    i = ifind( null, cast(char)'a' );
#    assert( i == -1 );
#    i = ifind( "def", cast(char)'a' );
#    assert( i == -1 );
#    i = ifind( "abba", cast(char)'a' );
#    assert( i == 0 );
#    i = ifind( "def", cast(char)'f' );
#    assert( i == 2 );
#}
#
#/******************************************
# * Find last occurrence of c in string s.
# * Return index in s where it is found.
# * Return -1 if not found.
# *
# * (A case insensitive version of std.string.rfind)
# */
#
#int irfind
#(
#    in char[] s,
#    in char   c
#)
#{
#    char c1 = c;
#    char c2;
#
#    c1 = ( c1 >= '\x41' && c1 <= '\x5A' ? c1 + '\x20' : c1 );
#
#    for (int i = s.length - 1; i >= 0; i--)
#    {
#        c2 = s[ i ];
#        c2 = ( c2 >= '\x41' && c2 <= '\x5A' ? c2 + '\x20' : c2 );
#
#        //debug(string) printf("pos=%d, s=%.*s, c=%c, c1=%c, c2=%c\n", i, s, c, c1, c2);
#
#        if ( c1 == c2 ) return i;
#    }
#
#    return -1;
#
#} // irfind( in char[], in char )
#
#unittest
#{
#    debug(string) printf("string.irfind(char[],char).unittest\n");
#
#    int i;
#
#    i = irfind(null, cast(char)'a');
#    assert(i == -1);
#    i = irfind("def", cast(char)'a');
#    assert(i == -1);
#    i = irfind("abba", cast(char)'a');
#    assert(i == 3);
#    i = irfind("def", cast(char)'f');
#    assert(i == 2);
#}
#
#/*************************************
# * Find first occurrence of sub[] in string s[].
# * Return index in s[] where it is found.
# * Return -1 if not found.
# *
# * (A case insensitive version of std.string.find)
# */
#
#int ifind
#(
#    in char[] s,
#    in char[] sub
#)
#    out ( result )
#    {
#	if ( result == -1 )
#	{
#	}
#	else
#	{
#	    assert( 0 <= result && result < s.length - sub.length + 1 );
#	}
#    }
#    body
#    {
#        int  ip = -1;
#
#	// A substring as long as s itself can still match, so only reject longer ones.
#	if ( sub.length == 0 || s.length == 0 || sub.length > s.length )
#            return -1; // was return 0;
#
#	if ( sub.length == 1 )
#	{
#            return ifind( s, cast(char)sub[ 0 ] );
#	}
#	else
#	{
#            ip = ifind( s, cast(char)sub[ 0 ] );
#
#            // Give up only if sub cannot fit between ip and the end of s.
#            if ( ip == -1 || ( ip + sub.length > s.length ) )
#                return -1;
#            else
#            {
#                for ( int x = ip; x < ( s.length - sub.length ) + 1; x++ )
#                    if ( icmp( s[ x .. ( x + sub.length ) ], sub ) == 0 ) return x;
#            }
#	}
#
#	return -1;
#
#    }
#// end ifind( in char[] s, in char[] sub )
#
#unittest
#{
#    debug(string) printf("string.ifind(char[],char[]).unittest\n");
#
#    int i;
#
#    i = ifind(null, "a");
#    assert(i == -1);
#    i = ifind("def", "a");
#    assert(i == -1);
#    i = ifind("abba", "a");
#    assert(i == 0);
#    i = ifind("def", "f");
#    assert(i == 2);
#    i = ifind("dfefffg", "fff");
#    assert(i == 3);
#    i = ifind("dfeffgfff", "fff");
#    assert(i == 6);
#}
#
#/*************************************
# * Find last occurrence of sub in string s.
# * Return index in s where it is found.
# * Return -1 if not found.
# *
# * (A case insensitive version of std.string.rfind)
# */
#
#int irfind(char[] s, char[] sub)
#    out (result)
#    {
#	if (result == -1)
#	{
#	}
#	else
#	{
#	    assert(0 <= result && result < s.length - sub.length + 1);
#	}
#    }
#    body
#    {
#        int  ip = -1;
#
#	// A substring as long as s itself can still match, so only reject longer ones.
#	if ( sub.length == 0 || s.length == 0 || sub.length > s.length )
#            return -1; // was return 0;
#
#	if ( sub.length == 1 )
#	{
#            return irfind( s, cast(char)sub[ 0 ] );
#	}
#	else
#	{
#            ip = irfind( s, cast(char)sub[ 0 ] );
#
#            //debug(string) printf("1) ip=%d\n", ip);
#
#            if ( ip == -1 )
#                return -1;
#            else
#            {
#                //debug(string) printf("2) ip=%d\n", ip);
#
#                // Clamp the start so the slice below never runs past the end of s.
#                if ( ip + sub.length > s.length ) ip = s.length - sub.length;
#
#                for ( int x = ip; x >= 0; x-- )
#                    if ( icmp( s[ x .. ( x + sub.length ) ], sub ) == 0 ) return x;
#            }
#	}
#
#	return -1;
#    }
#// end irfind( in char[], in char[] )
#
#unittest
#{
#    debug(string) printf("string.irfind(char[],char[]).unittest\n");
#
#    int i;
#
#    i = irfind("abcdefcdef", "c");
#    assert(i == 6);
#    i = irfind("abcdefcdef", "cd");
#    assert(i == 6);
#    i = irfind("abcdefcdef", "x");
#    assert(i == -1);
#    i = irfind("abcdefcdef", "xy");
#    assert(i == -1);
#    i = irfind("abcdefcdef", "");
#    assert(i == -1);
#    i = irfind( "abcdefcdef", "def" );
#    assert(i == 7);
#}
#
#// Testing the above functions
#int main()
#{
#    printf("string.ifind(char[],char).unittest\n");
#
#    printf("ifind( null, cast(char)'a') = %d, ans = -1\n", ifind( null, cast(char)'a' ) );
#    printf("ifind( \"def\", cast(char)'a' ) = %d, ans = -1\n", ifind( "def", cast(char)'a' ) );
#    printf("ifind( \"abba\", cast(char)'a' ) = %d, ans = 0\n", ifind( "abba", cast(char)'a' ) );
#    printf("ifind( \"def\", cast(char)'f' ) = %d, ans = 2\n", ifind( "def", cast(char)'f' ) );
#
#    printf("\n\n");
#
#    printf("string.irfind(char[],char).unittest\n");
#    printf("irfind( null, cast(char)'a' ) = %d, ans = -1\n", irfind( null, cast(char)'a' ) );
#    printf("irfind( \"def\", cast(char)'a' ) = %d, ans= -1\n",irfind( "def", cast(char)'a') );
#    printf("irfind( \"abba\", cast(char)'a' ) = %d, ans= 3\n", irfind( "abba", cast(char)'a') );
#    printf("irfind( \"def\", cast(char)'f' ) = %d, ans = 2\n", irfind( "def", cast(char)'f') );
#
#    printf("\n\n");
#
#    printf("string.ifind(char[],char[]).unittest\n");
#    printf("ifind( null, \"a\" ) = %d, ans = -1\n", ifind( null, "a" ) );
#    printf("ifind( \"def\", \"a\" ) = %d, ans = -1\n", ifind( "def", "a" ) );
#    printf("ifind( \"abba\", \"a\" ) = %d, ans= 0\n", ifind( "abba", "a" ) );
#    printf("ifind( \"def\", \"f\") = %d, ans = 2\n", ifind( "def", "f" ) );
#    printf("ifind( \"dfefffg\", \"fff\") = %d, ans = 3\n", ifind( "dfefffg", "fff" ) );
#    printf("ifind( \"dfeffgfff\", \"fff\") = %d, ans = 6\n", ifind( "dfeffgfff", "fff" ) );
#
#    printf("\n\n");
#
#    printf("string.irfind(char[],char[]).unittest\n");
#    printf("irfind( \"abcdefcdef\", \"c\" ) = %d, ans = 6\n", irfind( "abcdefcdef", "c" ) );
#    printf("irfind( \"abcdefcdef\", \"cd\" ) = %d, ans = 6\n", irfind( "abcdefcdef", "cd" ) );
#    printf("irfind( \"abcdefcdef\", \"x\" ) = %d, ans= -1\n", irfind( "abcdefcdef", "x" ) );
#    printf("irfind( \"abcdefcdef\", \"xy\") = %d, ans = -1\n", irfind( "abcdefcdef", "xy" ) );
#    printf("irfind( \"abcdefcdef\", \"\") = %d, ans = -1\n", irfind( "abcdefcdef", "" ) );
#    printf("irfind( \"abcdefcdef\", \"def\") = %d, ans = 7\n", irfind( "abcdefcdef", "def" ) );
#
#    return 0;
#}
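
For comparison, the same idea (compare character by character, allocate nothing) can be sketched more compactly in present-day D. This is only an illustration and not the code proposed for std.string: ifindSketch and irfindSketch are made-up names, std.ascii.toLower handles ASCII only just like the range checks above, and an empty sub returns -1 to match the listing above.

```d
import std.ascii : toLower;

// Sketch only: true if sub matches s at position start, ignoring ASCII case.
private bool matchesAt(const(char)[] s, const(char)[] sub, size_t start)
{
    foreach (k; 0 .. sub.length)
        if (toLower(s[start + k]) != toLower(sub[k]))
            return false;
    return true;
}

// First case-insensitive occurrence of sub in s, or -1.
ptrdiff_t ifindSketch(const(char)[] s, const(char)[] sub)
{
    if (sub.length == 0 || sub.length > s.length)
        return -1;
    foreach (start; 0 .. s.length - sub.length + 1)
        if (matchesAt(s, sub, start))
            return cast(ptrdiff_t) start;
    return -1;
}

// Last case-insensitive occurrence of sub in s, or -1.
ptrdiff_t irfindSketch(const(char)[] s, const(char)[] sub)
{
    if (sub.length == 0 || sub.length > s.length)
        return -1;
    // Count down from the last position where sub still fits.
    for (size_t start = s.length - sub.length + 1; start-- > 0; )
        if (matchesAt(s, sub, start))
            return cast(ptrdiff_t) start;
    return -1;
}
```

Because the search never starts later than s.length - sub.length, no slice or index can run past the end of s, whichever direction it scans.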
