July 13, 2004
Overall, D fully integrates Unicode strings, in data structures as well as in the various functions provided. But a few little things seem to have been forgotten along the way in std.string:

Everything concerning upper-case and lower-case characters processes only unaccented Roman letters. This is the behaviour I would expect from functions processing ANSI strings, but since D strings encode Unicode characters, it might be a good idea to extend their behaviour to other characters such as accented Roman letters, Cyrillic letters, and so on. Those also have upper-case and lower-case forms.

For the sake of efficiency, clarity or something, maybe those could be supplied as separate functions. Maybe not. But anyway, I think this would have its place in std.string. Otherwise, include something like "assert(language is english);" in the preconditions of the functions ;)

Of course, this is not difficult for the programmer who needs it to implement. But neither would be the current version, which processes only unaccented Roman letters. So if it is considered worth including for this case, why not for the other?


July 13, 2004
In article <ccve36$r2o$1@digitaldaemon.com>, FLorian Rivoal says...
>
>Overall, D fully integrates Unicode strings, in data structures as well as in the various functions provided. But a few little things seem to have been forgotten along the way in std.string:
>
>Everything concerning upper-case and lower-case characters processes only unaccented Roman letters. This is the behaviour I would expect from functions processing ANSI strings, but since D strings encode Unicode characters, it might be a good idea to extend their behaviour to other characters such as accented Roman letters, Cyrillic letters, and so on. Those also have upper-case and lower-case forms.
>
>For the sake of efficiency, clarity or something, maybe those could be supplied as separate functions. Maybe not. But anyway, I think this would have its place in std.string. Otherwise, include something like "assert(language is english);" in the preconditions of the functions ;)
>
>Of course, this is not difficult for the programmer who needs it to implement. But neither would be the current version, which processes only unaccented Roman letters. So if it is considered worth including for this case, why not for the other?


Panicke ye not. The full Unicode casing algorithms are on their way, complete with locale-sensitivity as required by Turkish, Azeri and Lithuanian, and context-sensitivity as required by Greek and a few others. Just wait a little bit longer.

Right now, the functions getSimpleLowercaseMapping(), getSimpleUppercaseMapping() and getSimpleTitlecaseMapping() in etc.unicode.unicode perform "Default Simple Case Mapping" as defined by the Unicode standard. "Default" means not locale-sensitive, and "Simple" means "one character at a time, as defined in UnicodeData.txt". They perform case mappings on a character-by-character basis, and work for ALL languages (except Turkish, Azeri and Lithuanian, which will have to wait for the next version).
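To make "one character at a time" concrete, here is a minimal sketch. It is an assumption-laden stand-in, not the etc.unicode.unicode API: the exact signature of getSimpleUppercaseMapping() is not shown in this thread, so the sketch uses `std.uni.toUpper(dchar)` from present-day Phobos instead, and the helper name `simpleUpper` is hypothetical.

```d
import std.uni : toUpper;

// Hypothetical helper (not part of etc.unicode or Phobos):
// uppercase a UTF-8 string one code point at a time.
// std.uni.toUpper(dchar) is used here as a stand-in for
// getSimpleUppercaseMapping(), whose signature is assumed unknown.
string simpleUpper(const(char)[] text)
{
    string result;
    foreach (dchar c; text)       // a dchar loop variable decodes UTF-8
        result ~= toUpper(c);     // appending a dchar re-encodes it as UTF-8
    return result;
}
```

With this sketch, `simpleUpper("été")` gives "ÉTÉ" and `simpleUpper("привет")` gives "ПРИВЕТ". Because each code point maps to exactly one code point, a mapping of this kind cannot change the number of characters, which is one reason the full casing algorithms are a separate piece of work.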

The forthcoming version will do everything. Including casefolding and normalization. It's a few weeks away, unfortunately, so be patient.

It would not have been possible for std.string to do all that you require, because a Unicode casing algorithm cannot possibly work unless it can first access all the Unicode properties. std.string does not have that advantage - hence etc.unicode.unicode. One day in the future, it is my hope that all of this will be integrated into Phobos.

Arcane Jill.

Oh - PS - must apologize. A pre-linked downloadable version of etc.unicode.unicode is STILL not available (so it's still just source code). The reason for this was that it was my birthday last weekend, and I was partying instead of coding. Since I actually have a day job, it will have to wait until next weekend now.



July 13, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cd04a3$2280$1@digitaldaemon.com...
> In article <ccve36$r2o$1@digitaldaemon.com>, FLorian Rivoal says...

> The forthcoming version will do everything. Including casefolding and normalization. It's a few weeks away, unfortunately, so be patient.

Sounds great. Thank you in advance, Jill.
I think D lacks a good, consistent String class like the one Java has.

For example, recently I got stuck with:
Object {
...
char[] toString()
...
}
but I need wchar[] at least for supporting non ASCII languages. DMD
complains about another return type.

It seems that many good libs will reach their first versions very soon. I'm looking forward to the first DTL too.

> Oh - PS - must apologize. A pre-linked downloadable version of etc.unicode.unicode is STILL not available (so it's still just source code). The reason for this was that it was my birthday last weekend,

Congratulations! That's a good reason for a rest. :))


July 13, 2004
Blandger wrote:
> "Arcane Jill" <Arcane_member@pathlink.com> wrote in message
> news:cd04a3$2280$1@digitaldaemon.com...
> 
>>In article <ccve36$r2o$1@digitaldaemon.com>, FLorian Rivoal says...
> 
> 
>>The forthcoming version will do everything. Including casefolding and
>>normalization. It's a few weeks away, unfortunately, so be patient.
> 
> 
> Sounds great. Thank you in advance, Jill.
> I think D lacks a good, consistent String class like the one Java has.
> 
> For example, recently I got stuck with:
> Object {
> ...
> char[] toString()
> ...
> }
> but I need wchar[] at least for supporting non ASCII languages. DMD
> complains about another return type.
> 
> It seems that many good libs will reach their first versions very soon.
> I'm looking forward to the first DTL too.


I'm currently working on this. A String interface that abstracts from the specific encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32, system codepage, etc...). It provides some very useful (IMHO) functionality too (like "split", which is so rarely implemented in non-script languages).

It is near completion and needs only a few more hours of work on documentation and testing. I hope to find the time within the next one or two weeks.

Hauke


July 13, 2004
In article <cd0bgb$2g5g$1@digitaldaemon.com>, Hauke Duden says...

>I'm currently working on this. A String interface that abstracts from the specific encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32, system codepage, etc...). It provides some very useful (IMHO) functionality too (like "split", which is so rarely implemented in non-script languages).

Hauke, dude, did anyone ever tell you you're brilliant? Well, I'll say it anyway - you're brilliant. We need this.

I've always been annoyed that, while std.string has got some amazing functions in it, like find() and so forth, they ONLY work on chars! Huh????

I reckon that now that we have templates, find() should be made to work for ANY kind of array - no need to limit it even to strings. Same for all the other nice stringy functions.
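As a rough illustration of that idea, a templated search could look like the sketch below. It is written in present-day D syntax, and the name `indexOfSlice` is hypothetical; it is not the std.string find() and makes no claim about how Phobos would implement it.

```d
// Hypothetical generic search (illustration only, not a Phobos function):
// returns the index of the first occurrence of `needle` in `haystack`,
// or -1 if there is none. Works for any element type T with ==.
ptrdiff_t indexOfSlice(T)(const(T)[] haystack, const(T)[] needle)
{
    if (needle.length > haystack.length)
        return -1;

    // Try every start position at which the needle could still fit.
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}
```

The same function then serves `indexOfSlice([1, 2, 3, 4], [3, 4])` and `indexOfSlice("héllo"w, "llo"w)`. Note that for char and wchar arrays the index is counted in code units, not in characters.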



>It is near completion and needs only a few more hours of work on documentation and testing. I hope to find the time within the next one or two weeks.
>
>Hauke

Yay. Looking forward to it.

Jill


July 13, 2004
Arcane Jill wrote:
> In article <cd0bgb$2g5g$1@digitaldaemon.com>, Hauke Duden says...
> 
> 
>>I'm currently working on this. A String interface that abstracts from the specific encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32, system codepage, etc...). It provides some very useful (IMHO) functionality too (like "split", which is so rarely implemented in non-script languages).
> 
> 
> Hauke, dude, did anyone ever tell you you're brilliant? Well, I'll say it anyway
> - you're brilliant. We need this.

Not recently, so thank you very much ;).

> I've always been annoyed that, while std.string has got some amazing functions
> in it, like find() and so forth, they ONLY work on chars! Huh????
> 
> I reckon that now that we have templates, find() should be made to work for ANY
> kind of array - no need to limit it even to strings. Same for all the other nice
> stringy functions.

Yes. I've written a mixin that contains the string algorithms and that is used in the String classes. I've also gone to some length to ensure that the character decoding stuff can be inlined into the mixed-in algorithms. So performance will (hopefully - I haven't done any tests yet) be good.
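The general shape of that approach, as a minimal sketch only: the names `StringAlgorithms`, `Utf8String`, `Utf16String` and `countChars` are made up for illustration and are not the actual library, and the syntax is present-day D.

```d
// Hypothetical mixin: algorithms are written once and mixed into each
// string class. It assumes the host class has a member named `data`.
mixin template StringAlgorithms()
{
    // Counts code points; the dchar loop variable makes foreach decode
    // whatever encoding `data` uses (UTF-8 or UTF-16 here).
    size_t countChars()
    {
        size_t n = 0;
        foreach (dchar c; data)
            ++n;
        return n;
    }
}

class Utf8String
{
    string data;                   // UTF-8 code units
    this(string s) { data = s; }
    mixin StringAlgorithms;
}

class Utf16String
{
    wstring data;                  // UTF-16 code units
    this(wstring s) { data = s; }
    mixin StringAlgorithms;
}
```

Since the mixin body is compiled inside each class, the decoding loop is specialised per encoding, which is what gives the compiler a chance to inline it.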

Hauke




July 13, 2004
In article <cd085g$29tq$1@digitaldaemon.com>, Blandger says...

>but I need wchar[] at least for supporting non ASCII languages.

Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is available to char[] arrays.

#    char[] s = "&#1041;&#1075;&#1047;&#1049; &#10077;&#13181;&#9283;&#10078; &#5797;&#5801;&#5804; &#1600;&#1601;&#1602;";

is perfectly legal. (And you can use etc.unicode's getSimpleUppercaseMapping()
to uppercase it too).
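A small sketch of what storing UTF-8 in char arrays means in practice, written in present-day D (where `string` is an alias for immutable(char)[]); the helper name `countCodePoints` is hypothetical:

```d
// Hypothetical helper: count the code points in a UTF-8 char array.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)   // a dchar loop variable decodes the UTF-8
        ++n;
    return n;
}
```

For the four Cyrillic letters in `"БгЗЙ"`, `.length` is 8, because each letter takes two UTF-8 code units, while `countCodePoints("БгЗЙ")` is 4. The array holds the text without loss; only the unit of indexing differs from wchar[] or dchar[].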

Arcane Jill


July 13, 2004
In article <cd0jdn$2sru$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <cd085g$29tq$1@digitaldaemon.com>, Blandger says...
>
>>but I need wchar[] at least for supporting non ASCII languages.
>
>Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is available to char[] arrays.
>
>#    char[] s = "&#1041;&#1075;&#1047;&#1049; &#10077;&#13181;&#9283;&#10078; &#5797;&#5801;&#5804; &#1600;&#1601;&#1602;";
>
>is perfectly legal. (And you can use etc.unicode's getSimpleUppercaseMapping()
>to uppercase it too).
>
>Arcane Jill


Okay, so it doesn't come out right on this forum!
But it will work in D source.


July 13, 2004
"Hauke Duden" <H.NS.Duden@gmx.net> wrote in message news:cd0bgb$2g5g$1@digitaldaemon.com...
> Blandger wrote:

> I'm currently working on this. A String interface that abstracts from the specific encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32, system codepage, etc...). It provides some very useful (IMHO) functionality too (like "split", which is so rarely implemented in non-script languages).

Wow! Nice to hear it. :)

> It is near completion and needs only a few more hours of work on documentation and testing. I hope to find the time within the next one or two weeks.

Good. Don't rush, just make it good, consistent and handy to work with. Thanks!


July 13, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cd0jdn$2sru$1@digitaldaemon.com...
> In article <cd085g$29tq$1@digitaldaemon.com>, Blandger says...
>
> >but I need wchar[] at least for supporting non ASCII languages.
>
> Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is available to char[] arrays.
>
> #    char[] s = "&#1041;&#1075;&#1047;&#1049; &#10077;&#13181;&#9283;&#10078; &#5797;&#5801;&#5804; &#1600;&#1601;&#1602;";
>
> is perfectly legal. (And you can use etc.unicode's getSimpleUppercaseMapping() to uppercase it too).

Thanks for the addition.

You are right, it's legal, but it looks (and I think works) ugly. It seems to me there is no 'normal way' to handle upper/lowercase, sorting, searching, collation, replacement and code page stuff with non-ASCII letters within Phobos in this case. Or have I missed something?

