July 13, 2004
upper case
Overall, D fully integrates Unicode strings, in data structures as well as in
the various functions provided. But a few little things seem to have been
forgotten along the way in std.string:

Everything concerning upper-case and lower-case characters only processes
unaccented Roman letters. This is the behaviour I would expect from functions
processing ANSI strings, but since D strings encode Unicode characters, it might
be a good idea to extend their behaviour to other characters such as accented
Roman letters, Cyrillic letters, and so on. Those also have upper-case and
lower-case forms.

For the sake of efficiency, clarity or something, maybe those could be supplied
as separate functions. Maybe not. But anyway, I think this would have its place
in std.string. Otherwise, include something like "assert(language is english);"
in the preconditions of the functions ;)

Of course, this is not difficult for the programmer who needs it to implement.
But neither would be the current version, which processes only unaccented
Roman letters. So if it is considered worth including for this case,
why not for the other?
July 13, 2004
Re: upper case
In article <ccve36$r2o$1@digitaldaemon.com>, FLorian Rivoal says...
>
>Overall, D fully integrates unicode strings, in data structures as well as in
>the various functions provided. But there seem to be some little things forgoten
>on the way in std.string:
>
>Everything concerning upper-case and lower-case characters only process non
>accentuated roman letters. This is the behaviour I would expect for functions
>processing ANSI strings, but since D string encode unicode characters, it might
>be a good idea to extend their behaviour to other characters like accentuted
>roman letters, cyrilic letters, and so on... those also have upper-case and
>lower-case forms.
>
>for the sake of efficiency, clarity or something, maybe those could be supplied
>as separated functions. maybe not. But anyway, i think this would have its place
>in std.string. Otherwise, include something like "assert(language is english);"
>in the preconditions of the functions ;)
>
>Of course, this is not difficult to be implemented by the programmer who needs
>it. But neither would be the current version which processes only non
>actentuated roman letters. So if it is considered worth including for this case,
>why not for the other?


Panicke ye not. The full Unicode casing algorithms are on their way, complete
with locale-sensitivity as required by Turkish, Azeri and Lithuanian, and
context-sensitivity as required by Greek and a few others. Just wait a little
bit longer.

Right now, the functions getSimpleLowercaseMapping(),
getSimpleUppercaseMapping() and getSimpleTitlecaseMapping() in
etc.unicode.unicode perform "Default Simple Case Mapping" as defined by the
Unicode standard. "Default" means not locale sensitive, and "Simple" means "one
character at a time, as defined in UnicodeData.txt". They perform case mappings
on a character-by-character basis, and work for ALL languages (except Turkish,
Azeri and Lithuanian, which will have to wait for the next version).
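To make "Default Simple Case Mapping" concrete, here is a minimal sketch of a character-by-character mapping. It is written for present-day D and uses Phobos' std.uni.toUpper(dchar) as a stand-in, since the source of etc.unicode.unicode is not shown in this thread. The helper name simpleUpper is hypothetical.

```d
import std.uni : toUpper;

/// Uppercases a UTF-8 string one code point at a time.
/// Assumption: std.uni.toUpper(dchar) stands in for the
/// getSimpleUppercaseMapping() function mentioned above.
string simpleUpper(const(char)[] text)
{
    string result;
    foreach (dchar c; text)     // foreach decodes UTF-8 into code points
        result ~= toUpper(c);   // one code point in, one code point out
    return result;
}

void main()
{
    import std.stdio : writeln;

    // Accented Latin and Cyrillic letters both have simple mappings.
    writeln(simpleUpper("école дом"));

    // "Default" means no locale: 'i' maps to 'I', which is not what
    // Turkish or Azeri text expects (dotted capital I).
    assert(simpleUpper("i") == "I");
}
```

Because each code point is mapped independently, this approach cannot handle mappings that change the length of the text or depend on context, which is what the full algorithms are for.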

The forthcoming version will do everything. Including casefolding and
normalization. It's a few weeks away, unfortunately, so be patient.

It would not have been possible for std.string to do all that you require,
because a Unicode casing algorithm cannot possibly work unless it can first
access all the Unicode properties. std.string does not have that advantage -
hence etc.unicode.unicode. One day in the future, it is my hope that all of this
will be integrated into Phobos.

Arcane Jill.

Oh - PS - must apologize. A pre-linked downloadable version of
etc.unicode.unicode is STILL not available (so it's still just source code). The
reason for this was that it was my birthday last weekend, and I was partying
instead of coding. Since I actually have a day job, it will have to wait until
next weekend now.
July 13, 2004
Re: upper case
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message
news:cd04a3$2280$1@digitaldaemon.com...
> In article <ccve36$r2o$1@digitaldaemon.com>, FLorian Rivoal says...

> The forthcoming version will do everything. Including casefolding and
> normalization. It's a few weeks away, unfortunately, so be patient.

Sounds great. Thank you Jill in advance.
I think D lacks a good and consistent String class like the one Java has.

For example, recently I got stuck with:
Object {
...
char[] toString()
...
}
but I need wchar[] at least, to support non-ASCII languages. DMD
complains about the different return type.

It seems that many good libs are about to reach their first versions very soon.
I am also looking forward to the first DTL.

> Oh - PS - must apologize. A pre-linked downloadable version of
> etc.unicode.unicode is STILL not available (so it's still just source code). The
> reason for this was that it was my birthday last weekend,

Congratulations! That's a good reason for a rest. :))
July 13, 2004
Re: upper case
Blandger wrote:
> "Arcane Jill" <Arcane_member@pathlink.com> wrote in message
> news:cd04a3$2280$1@digitaldaemon.com...
> 
>>In article <ccve36$r2o$1@digitaldaemon.com>, FLorian Rivoal says...
> 
> 
>>The forthcoming version will do everything. Including casefolding and
>>normalization. It's a few weeks away, unfortunately, so be patient.
> 
> 
> Sounds great. Thank you Jill in advance.
> I think D is lack of good and consistent String class as java has.
> 
> For example, recently I stuck with:
> Object {
> ...
> char[] toString()
> ...
> }
> but I need wchar[] at least for supporting non ASCII languages. DMD
> complains about another return type.
> 
> It seems that many good libs are coming out to the first versions very soon.
> I looking forward for first DTL also.


I'm currently working on this. A String interface that abstracts from 
the specific encoding + a bunch of implementations for the most common 
ones (UTF-8, 16, 32, system codepage, etc...). It provides some very 
useful (IMHO) functionality too (like "split", which is so rarely 
implemented in non-script languages).

It is near completion and needs only a few more hours of work on 
documentation and testing. I hope to find the time within the next one 
or two weeks.

Hauke
July 13, 2004
Strings (was Re: upper case)
In article <cd0bgb$2g5g$1@digitaldaemon.com>, Hauke Duden says...

>I'm currently working on this. A String interface that abstracts from 
>the specific encoding + a bunch of implementations for the most common 
>ones (UTF-8, 16, 32, system codepage, etc...). It provides some very 
>useful (IMHO) functionality too (like "split", which is so rarely 
>implemented in non-script languages).

Hauke, dude, did anyone ever tell you you're brilliant? Well, I'll say it anyway
- you're brilliant. We need this.

I've always been annoyed that, while std.string has got some amazing functions
in it, like find() and so forth, they ONLY work on chars! Huh????

I reckon that now that we have templates, find() should be made to work for ANY
kind of array - no need to limit it even to strings. Same for all the other nice
stringy functions.
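As a rough illustration of that idea, a templated search can accept an array of any element type. This is only a sketch in present-day D syntax. The name findSlice is hypothetical and is not the std.string function.

```d
/// Returns the index of the first occurrence of needle in haystack,
/// or -1 if it does not occur. Works for any element type T that
/// supports ==, so char[], wchar[], dchar[] and int[] all qualify.
ptrdiff_t findSlice(T)(const(T)[] haystack, const(T)[] needle)
{
    if (needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        // Slice comparison checks the elements one by one.
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    assert(findSlice("hello world", "world") == 6);   // char
    assert(findSlice("héllo"w, "llo"w) == 2);         // wchar
    assert(findSlice([1, 2, 3, 4], [3, 4]) == 2);     // int
    assert(findSlice("abc", "x") == -1);
}
```

Note that the indices count array elements (code units for strings), not characters, so the result for a char[] and a wchar[] holding the same text can differ.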



>It is near completion and needs only a few more hours of work on 
>documentation and testing. I hope to find the time within the next one 
>or two weeks.
>
>Hauke

Yay. Looking forward to it.

Jill
July 13, 2004
Re: Strings (was Re: upper case)
Arcane Jill wrote:
> In article <cd0bgb$2g5g$1@digitaldaemon.com>, Hauke Duden says...
> 
> 
>>I'm currently working on this. A String interface that abstracts from 
>>the specific encoding + a bunch of implementations for the most common 
>>ones (UTF-8, 16, 32, system codepage, etc...). It provides some very 
>>useful (IMHO) functionality too (like "split", which is so rarely 
>>implemented in non-script languages).
> 
> 
> Hauke, dude, did anyone ever tell you you're brilliant? Well, I'll say it anyway
> - you're brilliant. We need this.

Not recently, so thank you very much ;).

> I've always been annoyed that, while std.string has got some amazing functions
> in it, like find() and so forth, they ONLY work chars! Huh????
> 
> I reckon that now that we have templates, find() should be made to work for ANY
> kind of array - no need to limit it even to strings. Same for all the other nice
> stringy functions.

Yes. I've written a mixin that contains the string algorithms and that 
is used in the String classes. I've also gone to some length to ensure 
that the character decoding stuff can be inlined into the mixed-in 
algorithms. So performance will (hopefully - I haven't done any tests 
yet) be good.

Hauke
July 13, 2004
Re: upper case
In article <cd085g$29tq$1@digitaldaemon.com>, Blandger says...

>but I need wchar[] at least for supporting non ASCII languages.

Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is available to
char[] arrays.

#    char[] s = "&#1041;&#1075;&#1047;&#1049; &#10077;&#13181;&#9283;&#10078;
&#5797;&#5801;&#5804; &#1600;&#1601;&#1602;";

is perfectly legal. (And you can use etc.unicode's getSimpleUppercaseMapping()
to uppercase it too).
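A small sketch of what this means in practice, written for present-day D (where string literals are immutable, so `string` is used rather than `char[]`). The helper name codePoints is hypothetical. std.utf.toUTF16 is the Phobos conversion for cases where wchar data really is required.

```d
import std.utf : toUTF16;

/// Counts Unicode code points in UTF-8 text by decoding it.
size_t codePoints(const(char)[] s)
{
    size_t n;
    foreach (dchar c; s)   // foreach with dchar decodes UTF-8 on the fly
        ++n;
    return n;
}

void main()
{
    string s = "Привет";            // non-ASCII text stored as UTF-8
    assert(s.length == 12);         // code units: 2 bytes per Cyrillic letter
    assert(codePoints(s) == 6);     // code points: the 6 letters

    wstring w = s.toUTF16;          // convert only when an API needs UTF-16
    assert(w.length == 6);          // these letters take one wchar each
}
```

The point is that `.length` on a UTF-8 array counts bytes, not letters, while the text itself is fully representable.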

Arcane Jill
July 13, 2004
Re: upper case
In article <cd0jdn$2sru$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <cd085g$29tq$1@digitaldaemon.com>, Blandger says...
>
>>but I need wchar[] at least for supporting non ASCII languages.
>
>Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is available to
>char[] arrays.
>
>#    char[] s = "&#1041;&#1075;&#1047;&#1049; &#10077;&#13181;&#9283;&#10078;
>&#5797;&#5801;&#5804; &#1600;&#1601;&#1602;";
>
>is perfectly legal. (And you can use etc.unicode's getSimpleUppercaseMapping()
>to uppercase it too).
>
>Arcane Jill


Okay, so it doesn't come out right on this forum!
But it will work in D source.
July 13, 2004
Re: upper case
"Hauke Duden" <H.NS.Duden@gmx.net> wrote in message
news:cd0bgb$2g5g$1@digitaldaemon.com...
> Blandger wrote:

> I'm currently working on this. A String interface that abstracts from
> the specific encoding + a bunch of implementations for the most common
> ones (UTF-8, 16, 32, system codepage, etc...). It provides some very
> useful (IMHO) functionality too (like "split", which is so rarely
> implemented in non-script languages).

Wow! Nice to hear it. :)

> It is near completion and needs only a few more hours of work on
> documentation and testing. I hope to find the time within the next one
> or two weeks.

Good. Don't hurry too much, just make it good, consistent and handy to work
with. Thanks!
July 13, 2004
Re: upper case
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message
news:cd0jdn$2sru$1@digitaldaemon.com...
> In article <cd085g$29tq$1@digitaldaemon.com>, Blandger says...
>
> >but I need wchar[] at least for supporting non ASCII languages.
>
> Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is available to
> char[] arrays.
>
> #    char[] s = "&#1041;&#1075;&#1047;&#1049; &#10077;&#13181;&#9283;&#10078;
> &#5797;&#5801;&#5804; &#1600;&#1601;&#1602;";
>
> is perfectly legal. (And you can use etc.unicode's getSimpleUppercaseMapping()
> to uppercase it too).

Thanks for the addition.

You are right, it's legal, but it looks (and I think works) ugly. It seems to
me there is no 'normal way' within Phobos to do upper/lowercase, sort, search,
collate, replace and code page stuff with non-ASCII letters in this case.
Or have I missed something?