June 09, 2004
Stewart Gordon wrote:
> Hauke Duden wrote:
> 
> <snip>
> 
>> In any case, in Unicode upper and lower case characters do not have a constant offset to each other. That is only true for the ASCII subset.
> 
> 
> Yes, you do have a point there.  What's more, there isn't a 1:1 mapping between uppercase and lowercase characters.

You're wrong. The Unicode standard defines 1:1 case mappings (see http://www.unicode.org/Public/UNIDATA/UCD.html). There is also an additional "special casing" with one-to-many mappings, but only a handful of characters are affected. It would be nice to support that too, but for everyday work the 1:1 mappings are usually sufficient.


> And the mappings that there are aren't language independent.

Huh? Casing is not affected by locale. Maybe you are thinking about collation?

Hauke
June 09, 2004
In article <ca71c4$2b8l$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <ca54is$2h2r$1@digitaldaemon.com>, David L. Davis says...
>
>> sStr[ iStrPos ] + 0x20
>
>Ah! Now these old ASCII habits really should be dropped. Hauke has written this magnificent charToUpper() routine. It should be used.
>
>> I feel like a young Skywalker in training, learning how to best use "The Force!"
>
>Other than that: Impressive - Obi-Wan has taught you well. (Hope I'm not too
>discouraging).  :)
>
>Jill
>
>

Jill: Don't sweat it, all your advice has been encouraging! :) If I wasn't getting any feedback at all from anyone, now that would be "discouraging" in my mind... Again, thanks for your advice.

After all, if these functions meet Walter's and the "D" forum's approval, they just might become part of std.string for everyone to use. After work, I'll check out Hauke's charToLower() function and see what kind of requirements it has. If it looks like a good fix, I'll ask Hauke if I may use it, giving him full credit for his work of course. :)

David

-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
June 09, 2004
Hauke Duden wrote:

> Stewart Gordon wrote:
<snip>
>> Yes, you do have a point there.  What's more, there isn't a 1:1 mapping between uppercase and lowercase characters.
> 
> You're wrong. The Unicode standard defines 1:1 case mappings (see http://www.unicode.org/Public/UNIDATA/UCD.html).

There seems to be a contradiction here.  That file indicates that UnicodeData.txt only contains 1:1 mappings.  But just as I wondered, there's a 2:1 mapping in 03C2 and 03C3.

> There is also an additional "special casing" with one-to-many mappings, but only a handful of characters are affected. It would be nice to support that too, but for everyday work the 1:1 mappings are usually sufficient.

So, which characters do the one-to-many mappings bring about?

>> And the mappings that there are aren't language independent.
> 
> Huh? Casing is not affected by locale. Maybe you are thinking about collation?

What do you mean by that?

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the
unfortunate victim of intensive mail-bombing at the moment.  Please keep
replies on the 'group where everyone may benefit.
June 09, 2004
In article <ca7cgp$2svc$1@digitaldaemon.com>, Stewart Gordon says...
>
>There seems to be a contradiction here.  That file indicates that UnicodeData.txt only contains 1:1 mappings.  But just as I wondered, there's a 2:1 mapping in 03C2 and 03C3.

Look, it's perfectly simple. Everybody's right. And because everybody's right, everybody's accusing everybody else of being wrong. THERE ARE TWO ANSWERS.

"Simple casing" is a one-to-one mapping from character to character, and is locale-independent.

"Full casing" is a one-to-many mapping from string to string, and is ALMOST locale-independent, but not quite.

Hauke's brilliant library supports simple casing, not full casing. That's why both the input and the output are characters, not strings.



>So, which characters do the one-to-many mappings bring about?

For example, the German character 'ß' uppercases to "SS" when using full casing, but it stays as 'ß' using simple casing.
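
To make the difference in shape concrete, here is a minimal sketch in D. The names simpleUpperSketch and fullUpperSketch are made up for illustration and are not the API of Hauke's library; the tables hold only a few hand-picked entries, whereas a real implementation would be generated from the Unicode data files.

```d
// Simple casing: one code point in, one code point out.
// Only a few illustrative entries are handled here.
dchar simpleUpperSketch(dchar c)
{
    if (c >= 'a' && c <= 'z')
        return cast(dchar)(c - 0x20);   // constant offset holds for ASCII only
    if (c == '\u03C2' || c == '\u03C3') // final sigma and sigma
        return '\u03A3';                // both map to capital sigma
    return c;                           // no simple mapping, e.g. 'ß'
}

// Full casing: one code point in, a string out.
dstring fullUpperSketch(dchar c)
{
    if (c == '\u00DF')                  // 'ß'
        return "SS"d;                   // one-to-many mapping
    return [simpleUpperSketch(c)].idup; // otherwise fall back to simple casing
}

unittest
{
    assert(simpleUpperSketch('\u00DF') == '\u00DF'); // unchanged under simple casing
    assert(fullUpperSketch('\u00DF') == "SS"d);      // expands under full casing
}
```

The return types carry the point: a function that returns a single character cannot express 'ß' to "SS", so a character-to-character routine is necessarily simple casing.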



>>> And the mappings that there are aren't language independent.
>> 
>> Huh? Casing is not affected by locale. Maybe you are thinking about collation?
>
>What do you mean by that?

Full casing (but not simple casing) has localized exceptions ONLY for Turkish, Lithuanian and Azeri. In principle, other exceptions could be added in the future. Simple casing is completely locale independent.

Collation is a different kettle of fish, and we currently have no libraries to support it.

Arcane Jill


June 09, 2004
Stewart Gordon wrote:
> Hauke Duden wrote:
> 
>> Stewart Gordon wrote:
> 
> <snip>
> 
>>> Yes, you do have a point there.  What's more, there isn't a 1:1 mapping between uppercase and lowercase characters.
>>
>>
>> You're wrong. The Unicode standard defines 1:1 case mappings (see http://www.unicode.org/Public/UNIDATA/UCD.html).
> 
> 
> There seems to be a contradiction here.  That file indicates that UnicodeData.txt only contains 1:1 mappings.  But just as I wondered, there's a 2:1 mapping in 03C2 and 03C3.

Where did you get that information? From the data file
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt:

03C2;GREEK SMALL LETTER FINAL SIGMA;Ll;0;L;;;;;N;;;03A3;;03A3
03C3;GREEK SMALL LETTER SIGMA;Ll;0;L;;;;;N;;;03A3;;03A3

The interesting entries are the last three. Their format is UPPER;LOWER;TITLE. So both letters have an upper and title mapping to 03A3 and no lower mapping.
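
As a rough illustration of reading those three fields, the sketch below splits one UnicodeData.txt line on ';' and picks out fields 12 to 14 (counting from 0). The struct and function names are hypothetical, chosen only for this example.

```d
import std.array : split;

// The three case-mapping fields of one UnicodeData.txt line.
// An empty string means "no mapping".
struct CaseMappings
{
    string upper;
    string lower;
    string title;
}

// Splits a semicolon-separated line and returns its last three fields.
CaseMappings caseMappingsOf(string line)
{
    auto fields = line.split(";");       // empty fields are kept
    assert(fields.length == 15, "expected 15 fields");
    return CaseMappings(fields[12], fields[13], fields[14]);
}

unittest
{
    auto m = caseMappingsOf(
        "03C2;GREEK SMALL LETTER FINAL SIGMA;Ll;0;L;;;;;N;;;03A3;;03A3");
    assert(m.upper == "03A3"); // upper mapping: capital sigma
    assert(m.lower == "");     // no lower mapping
    assert(m.title == "03A3"); // title mapping: capital sigma
}
```

Both 03C2 and 03C3 map up to 03A3, but each entry still holds at most one code point per field, which is why the file counts as 1:1 per character.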

>> There is also an additional "special casing" with one-to-many mappings,
>> but only a handful of characters are affected. It would be nice to support that too, but for everyday work the 1:1 mappings are usually sufficient.
> 
> 
> So, which characters do the one-to-many mappings bring about?

An example of a character with special casing is 1FB2 (GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI). Its upper case maps to 1FBA + 0399 (GREEK CAPITAL LETTER ALPHA WITH VARIA + GREEK CAPITAL LETTER IOTA).

>>> And the mappings that there are aren't language independent.
>>
>>
>> Huh? Casing is not affected by locale. Maybe you are thinking about collation?
> 
> 
> What do you mean by that?

Collation is a locale dependent comparison of strings. I.e. it defines the "phone book" ordering of strings in a particular language.

Hauke
June 09, 2004
Arcane Jill wrote:
> Full casing (but not simple casing) has localized exceptions ONLY for Turkish,
> Lithuanian and Azeri. In principle, other exceptions could be added in the
> future. Simple casing is completely locale independent.

Ouch. I didn't know that. Makes me feel happy that I stayed away from the special casings up to now ;).

Thanks for clearing up the misunderstanding!

Hauke
June 09, 2004
"David L. Davis" <SpottedTiger@yahoo.com> wrote in message news:ca69ie$1813$1@digitaldaemon.com...
> In article <ca5ct6$2vif$1@digitaldaemon.com>, Walter says...
> >
> >The function uppercases the input string. It shouldn't modify its inputs.
> >
>
> Walter: Third time around is normally the "Charm!" Anywayz, I've been hammering
> away at these two functions ifind() and irfind(), and I believe I've made them
> much better than before, thanks to both you and Jill for the advice.
>

Hello, I just wanted to let you know that I wrote those functions a while ago for a String class that can be found at www.dprogramming.com/stringclass.d . It contains all the free functions and a few others such as findany(), endswith(), etc., plus case-insensitive versions. I haven't said much about it because it's completely based off Walter's code, so it belongs to him. The class can be stripped out to just use the functions. If the code isn't good enough, just ignore me; have fun!


June 09, 2004
There's no need to .dup the strings. Just have a loop that looks like this:

for (i = 0; i < string1.length; i++)
{    char c = toupper(string1[i]);
    if (c != toupper(string2[i]))
        goto nomatch;
}

Note that it compares character by character without needing to allocate memory. In fact, just copy the logic in find() and rfind(), replacing memchr and memcmp with case insensitive loops, write some unit tests, and you'll be there.
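
For illustration only, the allocation-free idea could be fleshed out along these lines. This is a naive scan, not the memchr/memcmp-based logic of find(); the name ifindSketch is hypothetical, and std.ascii.toUpper handles ASCII letters only, so a Unicode-aware version would need a different per-character mapping.

```d
import std.ascii : toUpper;

// Returns the index of the first case-insensitive (ASCII) match of sub in s,
// or -1 if there is none. No memory is allocated and no input is modified.
ptrdiff_t ifindSketch(const(char)[] s, const(char)[] sub)
{
    if (sub.length > s.length)
        return -1;
    foreach (start; 0 .. s.length - sub.length + 1)
    {
        size_t i = 0;
        // Compare character by character, uppercasing both sides on the fly.
        while (i < sub.length && toUpper(s[start + i]) == toUpper(sub[i]))
            i++;
        if (i == sub.length)
            return cast(ptrdiff_t) start;
    }
    return -1;
}

unittest
{
    assert(ifindSketch("Hello World", "WORLD") == 6);
    assert(ifindSketch("Hello", "xyz") == -1);
}
```

An empty sub matches at index 0 in this sketch; whether that is the right convention for ifind() is a design choice left open here.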


June 09, 2004
Another option is to only allow bit slicing on byte boundaries, and only allow pointers to bits if they are in bit 0 of a byte.


June 09, 2004
In article <ca7qp5$hrl$2@digitaldaemon.com>, Walter says...
>
>Another option is to only allow bit slicing on byte boundaries, and only allow pointers to bits if they are in bit 0 of a byte.
>
>

That's EXACTLY what my workaround does. You can have the code for free if you want.

Jill