The Unicode Casing Algorithms (page 4) - D Programming Language Discussion Forum

June 07, 2004

Re: The Unicode Casing Algorithms

Posted by Roberto Mariottini
in reply to Walter

In article <c9qgf3$1ec3$1@digitaldaemon.com>, Walter says...
>
>Right now, the std.ctype functions
>all take an argument of 'dchar'. This means the interface is correct for
>unicode, even if the current implementation fails to work on anything but
>ASCII.

7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)?

Ciao

June 07, 2004

Re: The Unicode Casing Algorithms

Posted by Arcane Jill
in reply to Roberto Mariottini

In article <ca15r8$1uun$1@digitaldaemon.com>, Roberto Mariottini says...
>
>In article <c9qgf3$1ec3$1@digitaldaemon.com>, Walter says...
>>
>>Right now, the std.ctype functions
>>all take an argument of 'dchar'. This means the interface is correct for
>>unicode, even if the current implementation fails to work on anything but
>>ASCII.
>
>7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)?
>
>Ciao

Just ASCII.

WINDOWS-1252 (to give it its official encoding name) is all too often incorrectly declared as ISO-8859-1, thanks to Microsoft.

Okay, so Unicode is a superset of ISO-8859-1, which in turn is a superset of ASCII, so you *COULD* implement the ctype functions according to the ISO-8859-1 locale, but I suspect that would be terribly confusing to those for whom that was not their default locale.

WINDOWS-1252 conflicts with Unicode in the range 0x80 to 0x9F, so I wouldn't recommend that at all. Anyway, Linux users wouldn't like it. Microsoft have taken over enough of the world as it is without their invading D as well. ;-)
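
To make the conflict concrete, here is a small sketch in Python (used purely for illustration, it is not D and not part of any library discussed in this thread; `cp1252_conflicts` is a made-up helper name). It assumes Python's built-in cp1252 codec, which treats a handful of bytes in this range as unassigned, and collects every byte from 0x80 to 0x9F whose WINDOWS-1252 character has a Unicode code point different from the byte value.

```python
def cp1252_conflicts():
    """Map each byte in 0x80..0x9F to its cp1252 character when that
    character's Unicode code point differs from the byte value."""
    conflicts = {}
    for b in range(0x80, 0xA0):
        try:
            ch = bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec treats this byte as unassigned.
            continue
        if ord(ch) != b:
            conflicts[b] = ch
    return conflicts


if __name__ == "__main__":
    for b, ch in sorted(cp1252_conflicts().items()):
        print(f"0x{b:02X} -> U+{ord(ch):04X}")
```

In Unicode (and ISO-8859-1) the code points 0x80 to 0x9F are control characters, whereas WINDOWS-1252 puts printable characters such as the euro sign there, so a ctype built on WINDOWS-1252 would classify those values differently from a Unicode one.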

Jill

June 08, 2004

Re: The Unicode Casing Algorithms

Posted by Roberto Mariottini
in reply to Arcane Jill

In article <ca173t$20v3$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <ca15r8$1uun$1@digitaldaemon.com>, Roberto Mariottini says...
>>
>>In article <c9qgf3$1ec3$1@digitaldaemon.com>, Walter says...
>>>
>>>Right now, the std.ctype functions
>>>all take an argument of 'dchar'. This means the interface is correct for
>>>unicode, even if the current implementation fails to work on anything but
>>>ASCII.
>>
>>7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)?
>>
>>Ciao
>
>Just ASCII.
>
>WINDOWS-1252 (to give it its official encoding name) is all too often incorrectly declared as ISO-8859-1, thanks to Microsoft.
>
>Okay, so Unicode is a superset of ISO-8859-1, which in turn is a superset of ASCII, so you *COULD* implement the ctype functions according to the ISO-8859-1 locale, but I suspect that would be terribly confusing to those for whom that was not their default locale.

I know. It's only that I'm Italian, and the Italian language needs at least
ISO-8859-1 (with collation, etc.); ASCII is not sufficient.
Supporting only ASCII means supporting only English. While this can be
understandable for English-speaking people, I think that it's worth adding a
single bit and upgrading to ISO-8859-1, thus supporting English, Spanish, French,
Portuguese, German, Italian, etc.
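
As a rough illustration of the gap (a Python sketch, not D; `ascii_isalpha` and `letters_missed_by_ascii` are hypothetical names standing in for an ASCII-only classifier), the following finds the characters in an Italian word that a Unicode-aware test accepts as letters but an ASCII-only test rejects.

```python
def ascii_isalpha(c):
    """ASCII-only letter test, in the spirit of a C-locale isalpha()."""
    return "A" <= c <= "Z" or "a" <= c <= "z"


def letters_missed_by_ascii(text):
    """Characters that Unicode classifies as letters but ASCII rejects."""
    return [c for c in text if c.isalpha() and not ascii_isalpha(c)]


if __name__ == "__main__":
    word = "perch\u00e9"  # "perché", with U+00E9 LATIN SMALL LETTER E WITH ACUTE
    missed = letters_missed_by_ascii(word)
    print([f"U+{ord(c):04X}" for c in missed])
```

The missed letter fits in a single ISO-8859-1 byte (0xE9), which is the "single extra bit" being asked for here.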

>WINDOWS-1252 conflicts with Unicode in the range 0x80 to 0x9F, so I wouldn't recommend that at all. Anyway, Linux users wouldn't like it. Microsoft have taken over enough of the world as it is without their invading D as well. ;-)

I don't know how D handles the interface with the OS, but I think Windows would pass CP1252-encoded characters to getchar(), for example.

Ciao

June 08, 2004

Re: The Unicode Casing Algorithms

Posted by Arcane Jill
in reply to Roberto Mariottini

In article <ca3pe5$24v$1@digitaldaemon.com>, Roberto Mariottini says...

>I know. It's only that I'm Italian, and the Italian language needs at least
>ISO-8859-1 (with collation, etc.); ASCII is not sufficient.
>Supporting only ASCII means supporting only English. While this can be
>understandable for English-speaking people, I think that it's worth adding a
>single bit and upgrading to ISO-8859-1, thus supporting English, Spanish, French,
>Portuguese, German, Italian, etc.

Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backwardly compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so).

That, in conjunction with the real Unicode functions which he has also supplied, should solve all your problems. However, there is no way I would support adding explicit support to D for ISO-8859-1. I am also European, and I also use non-ASCII characters, but when I step outside the bounds of ASCII, I use Unicode, not ISO-8859-1.

Jill

PS. Unicode is a superset of ISO-8859-1 with codepoint equivalence. In this sense only, ISO-8859-1 has special status compared with, say, ISO-8859-2. (Unicode is a superset of ISO-8859-2 as well, of course, but the codepoints are different). So anything which works for Unicode will work for ISO-8859-1, codepoint for codepoint. But that's not the same as restricting it to that range.
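
The codepoint-equivalence point can be checked mechanically. This is a minimal Python sketch (Python only for illustration; `matches_unicode_codepoints` is a hypothetical helper, and the codec names are the ones Python's standard library uses) that tests whether every byte of a single-byte encoding decodes to the Unicode code point with the same numeric value.

```python
def matches_unicode_codepoints(encoding):
    """True if every byte 0x00..0xFF decodes to the code point of equal value."""
    for b in range(256):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            return False  # an unassigned byte cannot be codepoint-equivalent
        if ord(ch) != b:
            return False
    return True


if __name__ == "__main__":
    for name in ("latin-1", "iso8859_2"):
        print(name, matches_unicode_codepoints(name))
```

For ISO-8859-2 the check fails as soon as it reaches a byte such as 0xA1, which that encoding assigns to U+0104 rather than U+00A1.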

June 08, 2004

Re: The Unicode Casing Algorithms

Posted by Hauke Duden
in reply to Arcane Jill

Arcane Jill wrote:
>>I know. It's only that I'm Italian, and the Italian language needs at least
>>ISO-8859-1 (with collation, etc.); ASCII is not sufficient.
>>Supporting only ASCII means supporting only English. While this can be
>>understandable for English-speaking people, I think that it's worth adding a
>>single bit and upgrading to ISO-8859-1, thus supporting English, Spanish, French,
>>Portuguese, German, Italian, etc.
> 
> 
> Hauke has now implemented utype - a drop-in replacement for ctype, which now
> supports all Unicode characters. (I don't know how he did it. I'm not completely
> convinced that it's backwardly compatible with ctype in the ASCII range, but
> even if it isn't, I'm sure it could be made so).

It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;).
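
A test of that kind might look roughly like the sketch below. This is not the actual utype unittest; it is written in Python for illustration, the `c_is*` functions are hand-written stand-ins for C-locale ctype behaviour, and Python's own `str` methods play the role of the Unicode-aware implementation under test.

```python
def c_isalpha(c):
    return "A" <= c <= "Z" or "a" <= c <= "z"


def c_isdigit(c):
    return "0" <= c <= "9"


def c_isspace(c):
    # C-locale whitespace: space, tab, newline, vertical tab, form feed, CR.
    return c in " \t\n\v\f\r"


def ascii_mismatches(unicode_pred, ascii_pred):
    """ASCII code points on which the two predicates disagree."""
    return [
        cp
        for cp in range(128)
        if bool(unicode_pred(chr(cp))) != bool(ascii_pred(chr(cp)))
    ]


if __name__ == "__main__":
    print("isalpha:", ascii_mismatches(str.isalpha, c_isalpha))
    print("isdigit:", ascii_mismatches(str.isdigit, c_isdigit))
    print("isspace:", ascii_mismatches(str.isspace, c_isspace))
```

With Python's built-in predicates the letter and digit checks agree across all of ASCII, but `str.isspace` also accepts the separator controls 0x1C to 0x1F, which is exactly the sort of discrepancy an exhaustive ASCII check is there to catch.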

Hauke

June 08, 2004

Re: The Unicode Casing Algorithms

Posted by Arcane Jill
in reply to Hauke Duden

In article <ca3v8c$fai$1@digitaldaemon.com>, Hauke Duden says...

>> Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backwardly compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so).
>
>It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;).
>
>Hauke

Excellent! This is superb. The only thing is, the docs don't make that claim (unless I missed it). When I read the docs for utype.isspace() I kinda got the impression that it just called charIsSpace(), which obviously would not be compatible with ctype. Perhaps you could make the documentation more explicit.

All in all, I'm thoroughly impressed with this. Nice one!

Jill

PS. Did you omit charToCasefold(), or did I just miss it?

June 08, 2004

Re: The Unicode Casing Algorithms

Posted by Hauke Duden
in reply to Arcane Jill

Arcane Jill wrote:
>>>Hauke has now implemented utype - a drop-in replacement for ctype, which now
>>>supports all Unicode characters. (I don't know how he did it. I'm not completely
>>>convinced that it's backwardly compatible with ctype in the ASCII range, but
>>>even if it isn't, I'm sure it could be made so).
>>
>>It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;).
>>
>>Hauke
> 
> 
> Excellent! This is superb. The only thing is, the docs don't make that claim
> (unless I missed it).

It is there, in the module description.

> When I read the docs for utype.isspace() I kinda got the
> impression that it just called charIsSpace(), which obviously would not be
> compatible with ctype. Perhaps you could make the documentation more explicit.

The documentation of isspace states that it is equivalent to charIsSeparator. But I will make it a little more obvious.

> All in all, I'm thoroughly impressed with this. Nice one!

Thanks :).

> PS. Did you omit charToCasefold(), or did I just miss it?

No, you didn't miss it. Real case folding is another beast entirely, as it requires one-to-many mappings. It is not supported by the module.

If you want to do simple one-to-one case folding then calling charToLower on both characters should be equivalent.
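
For readers unfamiliar with the one-to-many problem, here is a short Python sketch (illustrative only; `full_fold_expansions` is a hypothetical helper, and Python's `str.casefold` is used as a stand-in for full Unicode case folding). It reports the characters whose folded form is longer than one character, which is why a char-to-char function cannot express it.

```python
def full_fold_expansions(chars):
    """Characters whose full case folding is more than one character long."""
    return {c: c.casefold() for c in chars if len(c.casefold()) > 1}


if __name__ == "__main__":
    sample = "a\u00dfZ\ufb01"  # 'a', sharp s, 'Z', the 'fi' ligature
    for c, folded in full_fold_expansions(sample).items():
        print(f"U+{ord(c):04X} folds to {len(folded)} characters: {folded}")
```

One hedged caveat: even among one-to-one mappings, folding and lowercasing are not always identical in Python's data. U+017F (long s) is left alone by `lower()` but folds to a plain "s", so the equivalence should be treated as an approximation.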

Hauke

June 08, 2004

Re: The Unicode Casing Algorithms

Posted by Arcane Jill
in reply to Hauke Duden

In article <ca49vq$10ui$1@digitaldaemon.com>, Hauke Duden says...

>> PS. Did you omit charToCasefold(), or did I just miss it?
>
>No, you didn't miss it. Real case folding is another beast entirely, as it requires one-to-many mappings. It is not supported by the module.

Yes, I know. But I think it would be nice to start getting people used to the
idea that they need to be calling toCasefold() instead of toLower() if they're
going to do case-insensitive comparisons. It's a good "new thing to learn". Even
if all it does (for now) is call charToLower(), that would be better than
nothing.


>If you want to do simple one-to-one case folding then calling charToLower on both characters should be equivalent.

I know, but basically, I'm saying that code which reads:

>       if (charToCasefold(c) == charToCasefold(d))

is more self-documenting than code which reads:

>       if (charToLower(c) == charToLower(d))

and it gets people to start thinking in the Unicode way. So - even if it does nothing useful, I think it's still a good function to have.
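
At the string level the difference between the two spellings is more than cosmetic. The following is a minimal Python sketch (not D; the two function names are invented for this example, and Python's `str.casefold` and `str.lower` stand in for a full case-folding routine and a lowercasing routine respectively).

```python
def equal_ignoring_case(a, b):
    """Caseless comparison via case folding."""
    return a.casefold() == b.casefold()


def equal_after_lowering(a, b):
    """Caseless comparison via lowercasing, shown for contrast."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    pairs = [("Hello", "hELLO"), ("Stra\u00dfe", "STRASSE")]
    for a, b in pairs:
        print(a, b, equal_ignoring_case(a, b), equal_after_lowering(a, b))
```

The two functions agree on plain ASCII input and differ once a character such as the German sharp s is involved. A complete caseless match would also need normalization, which this sketch leaves out.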

Jill

June 08, 2004

Re: The Unicode Casing Algorithms

Posted by Hauke Duden
in reply to Arcane Jill

Arcane Jill wrote:

> In article <ca49vq$10ui$1@digitaldaemon.com>, Hauke Duden says...
> 
> 
>>>PS. Did you omit charToCasefold(), or did I just miss it?
>>
>>No, you didn't miss it. Real case folding is another beast entirely, as it requires one-to-many mappings. It is not supported by the module.
> 
> 
> Yes, I know. But I think it would be nice to start getting people used to the
> idea that they need to be calling toCasefold() instead of toLower() if they're
> going to do case-insensitive comparisons. It's a good "new thing to learn". Even
> if all it does (for now) is call charToLower(), that would be better than
> nothing.

But the interface would have to be changed to return a string instead of a single character. That would break all code that uses it.

Hauke

June 08, 2004

Re: The Unicode Casing Algorithms

Posted by Arcane Jill
in reply to Hauke Duden

Okay, cancel that. I've just realized I was talking complete rubbish. You were right. I was wrong. Case folding comes into play during special casing, not simple casing. (I was thinking it was in UnicodeData.txt, but of course it isn't, it's only in SpecialCasing.txt). So I withdraw my suggestion, apologize for questioning you, and now I'm going to go and hide in a corner until I stop feeling such a prat.

Jill (embarrassed).

Copyright © 1999-2021 by the D Language Foundation