Unicode (was The Unicode Casing Algorithms) (page 2) - D Programming Language Discussion Forum


June 04, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Walter
in reply to Hauke Duden

"Hauke Duden" <H.NS.Duden@gmx.net> wrote in message news:c9q5sl$vcj$1@digitaldaemon.com...
> While I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.

How about just calling them isdigit(dchar c), etc.? Perhaps call the module std.utype. The sole remaining advantage of the std.ctype functions is they are very small. So, all a program would need to do to upgrade to unicode is replace:

    import std.ctype;
with:
    import std.utype;

and they'll get the unicode-capable versions of the same functions.
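A minimal sketch of what such a swap changes in practice. It assumes present-day Phobos, where the ASCII-only functions live in std.ascii and the Unicode-aware ones in std.uni; the std.utype module named above is the proposal from this thread and is not used here.

```d
static import std.ascii;
static import std.uni;

void main()
{
    dchar seven = '7';
    dchar arabicThree = '\u0663'; // ARABIC-INDIC DIGIT THREE
    dchar eAcute = '\u00E9';      // LATIN SMALL LETTER E WITH ACUTE

    // The ASCII-only test accepts only '0' .. '9'.
    assert(std.ascii.isDigit(seven));
    assert(!std.ascii.isDigit(arabicThree));

    // The Unicode-aware test also accepts digits from other scripts.
    assert(std.uni.isNumber(seven));
    assert(std.uni.isNumber(arabicThree));

    // Case mapping: std.ascii leaves non-ASCII input unchanged.
    assert(std.ascii.toUpper(eAcute) == eAcute);
    assert(std.uni.toUpper(eAcute) == '\u00C9');
}
```

Note that std.uni.isNumber covers more than decimal digits, so it is a wider test than C's isdigit rather than an exact replacement.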

June 04, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Arcane Jill
in reply to Walter

In article <c9qh22$1fdh$1@digitaldaemon.com>, Walter says...
>
>Oh durn, even with 20 bit unicode they are *still* having multicharacter sequences? ARRRRGGGGHHH.

It's 21 bits actually, the top codepoint being 0x10FFFF. But yeah, there is a distinction between characters and glyphs (or, if you want to get technical, "default grapheme clusters"). One character equals one dchar, no question there, but there is not a one-to-one correspondence between characters and glyphs, and there may be several different "spellings" of the same glyph. The combining characters allow you, for example, to put an acute accent over any character. It's all cunning stuff, and of course something of a nightmare for those who design fonts, make text editors, and so on.
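A short sketch of the character/grapheme distinction, assuming the std.uni and std.utf modules of present-day Phobos (neither existed in this form when the post was written). It shows two "spellings" of an accented e: one precomposed code point, and a base letter followed by a combining accent.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;
import std.utf : count;

void main()
{
    // The highest code point needs 21 bits.
    static assert(dchar.max == 0x10FFFF);

    string precomposed = "\u00E9"; // one code point: e with acute
    string decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT

    // Different code point sequences...
    assert(count(precomposed) == 1);
    assert(count(decomposed) == 2);
    assert(precomposed != decomposed);

    // ...but each is a single grapheme cluster.
    assert(precomposed.byGrapheme.walkLength == 1);
    assert(decomposed.byGrapheme.walkLength == 1);

    // Normalizing to NFC composes the pair into the single code point.
    assert(normalize!NFC(decomposed) == precomposed);
}
```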

But fortunately for us, font design is not an issue, just implementation of a few basic algorithms which someone else has already worked out for us. (Although of course, things are never that straightforward. The Consortium's algorithms are kind of "proof of concept". /Real/ implementations would have to throw in a bit of speed optimization).

No need for the aaargh, though. Once you get your head around the character/glyph distinction, it all makes complete sense. D's dchars are *characters*, and for that purpose, they are exactly what they are designed to be. D has got it right. And no - there's no need to introduce a glyph type, before anyone asks. Glyphs are only important to people who write rendering algorithms. Glyph /boundaries/ are important, but the algorithms will cover that.

I'm sure someone will take up the challenge. It's a fascinating area.

Arcane Jill

June 04, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Arcane Jill
in reply to Walter

In article <c9qh23$1fdh$2@digitaldaemon.com>, Walter says...

>replace:
>    import std.ctype;
>with:
>    import std.utype;

Hey, Hauke. You've just been offered a place in the vaunted "std" hierarchy! Go for it, man.

I must be working in the wrong field.
Jill  :(

June 04, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Hauke Duden
in reply to Walter

Walter wrote:
> "Hauke Duden" <H.NS.Duden@gmx.net> wrote in message
> news:c9q5sl$vcj$1@digitaldaemon.com...
> 
>>While I'm also working on a string class, the module I'm talking about
>>is a set of simple global functions like charToLower, charToUpper,
>>charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support
>>for the full unicode character range.
> 
> 
> How about just calling them isdigit(dchar c), etc.? Perhaps call the module
> std.utype. The sole remaining advantage of the std.ctype functions is they
> are very small. So, all a program would need to do to upgrade to unicode is
> replace:

I had three reasons for choosing these function names:

1) isdigit etc. do not conform to the convention that new words should be capitalized.

2) because of D's overloading rules (with definitions in one module being able to completely hide those in others) I'm reluctant to choose global names that could also be used in another context.

3) I wanted to improve on ctype in a few places and also keep a bit closer to the Unicode terms. For example, isspace tests for things that separate words (whitespace in ASCII). In Unicode that's more than just whitespace, thus the name doesn't fit. I also think charIsSpace should check for actual space characters instead of all whitespace.

Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.
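The alias idea can be sketched as follows. This is only an illustration under stated assumptions: it uses today's std.uni as a stand-in for the module under discussion, and the choice of which Unicode function backs each ctype-style name is a guess, not something the thread settles.

```d
// Hypothetical contents of a ctype-compatible shim module.
import std.uni : isAlpha, isNumber, isWhite, toLower, toUpper;

alias isalpha = isAlpha;  // assumption: alphabetic code points
alias isdigit = isNumber; // assumption: wider than '0' .. '9'
alias isspace = isWhite;  // assumption: all Unicode white space
alias tolower = toLower;
alias toupper = toUpper;

void main()
{
    // Existing call sites keep their ctype-style spelling.
    assert(isdigit('7'));
    assert(isdigit('\u0663'));            // ARABIC-INDIC DIGIT THREE
    assert(isspace('\n'));
    assert(toupper('\u00E9') == '\u00C9'); // e with acute -> E with acute
    assert(tolower('A') == 'a');
}
```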

Hauke

June 04, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Hauke Duden
in reply to Arcane Jill

Arcane Jill wrote:
>>replace:
>>   import std.ctype;
>>with:
>>   import std.utype;
> 
> 
> Hey, Hauke. You've just been offered a place in the vaunted "std" hierarchy! Go
> for it, man.

Thanks for cheering me on AJ ;).

But let's wait and see what Walter thinks about it when he has it in his hands - especially about the function names :).

Hauke

June 04, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by David L. Davis
in reply to Walter

In article <c9qh23$1fdh$2@digitaldaemon.com>, Walter says...
>
>
>"Hauke Duden" <H.NS.Duden@gmx.net> wrote in message news:c9q5sl$vcj$1@digitaldaemon.com...
>> While I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.
>
>How about just calling them isdigit(dchar c), etc.? Perhaps call the module std.utype. The sole remaining advantage of the std.ctype functions is they are very small. So, all a program would need to do to upgrade to unicode is replace:
>
>    import std.ctype;
>with:
>    import std.utype;
>
>and they'll get the unicode-capable versions of the same functions.
>
>

Walter: The above sounds like a good idea for the dchar character(s) in std.ctype, but what about for strings that use std.string functions and are defined as char[], or is there a dchar[] string type I've missed somewhere? And if there isn't, shouldn't the strings really be defined as dchar[] to work with unicode 32-bit?

Thxs for your answer in advance. :))

June 04, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Walter
in reply to David L. Davis

"David L. Davis" <SpottedTiger@yahoo.com> wrote in message news:c9qmr7$1nrj$1@digitaldaemon.com...
> Walter: The above sounds like a good idea for the dchar character(s) in std.ctype, but what about for strings that use std.string functions and are defined as char[], or is there a dchar[] string type I've missed somewhere? And if there isn't, shouldn't the strings really be defined as dchar[] to work with unicode 32-bit?

Check out the std.utf module, which will decode a char[] into dchars.
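A small sketch of that decoding, using std.utf as it exists in present-day Phobos; the codePoints helper is a made-up name for illustration.

```d
import std.utf : decode, toUTF32;

/// Hypothetical helper: decodes UTF-8 text one code point at a time.
dchar[] codePoints(const(char)[] s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i); // decode advances i past the sequence
    return result;
}

void main()
{
    string s = "a\u00E9\u20AC"; // 'a', e with acute, euro sign

    // char[] counts UTF-8 code units: 1 + 2 + 3 bytes.
    assert(s.length == 6);

    // Decoding yields three dchars.
    assert(codePoints(s) == "a\u00E9\u20AC"d);

    // A foreach with a dchar loop variable decodes as well.
    size_t n;
    foreach (dchar c; s)
        ++n;
    assert(n == 3);

    // Whole-string conversion to dchar[] (UTF-32).
    assert(toUTF32(s).length == 3);
}
```

So a char[] string does not have to become dchar[] to carry full Unicode; it only has to be decoded before per-character functions are applied.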

June 04, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Walter
in reply to Hauke Duden

"Hauke Duden" <H.NS.Duden@gmx.net> wrote in message news:c9qjqr$1jfv$1@digitaldaemon.com...
> I had three reasons for choosing these function names:
>
> 1) isdigit etc. do not conform to the convention that new words should be capitalized.

I know, but since these are well-established names, I think we can bend the rules a bit for them <g>.

> 2) because of D's overloading rules (with definitions in one module being able to completely hide those in others) I'm reluctant to choose global names that could also be used in another context.

I can't think of a case where they conflict. Note that the actual global names will not conflict, because the names will be prefixed by the package.module name.

> 3) I wanted to improve on ctype in a few places and also keep a bit closer to the Unicode terms. For example, isspace tests for things that separate words (whitespace in ASCII). In Unicode that's more than just whitespace, thus the name doesn't fit. I also think charIsSpace should check for actual space characters instead of all whitespace.

If you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.

> Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.

I'd do that if the utype functions didn't add significant bloat, but they do (I presume).

June 04, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Hauke Duden
in reply to Walter

Walter wrote:
>>I had three reasons for choosing these function names:
>>
>>1) isdigit etc. do not conform to the convention that new words should
>>be capitalized.
> 
> 
> I know, but since these are well-established names, I think we can bend the
> rules a bit for them <g>.

Well, if you're not going to make the cut now, when then? D is a new language and I think the standard library should at least be consistent.

>>2) because of D's overloading rules (with definitions in one module
>>being able to completely hide those in others) I'm reluctant to choose
>>global names that could also be used in another context.
> 
> 
> I can't think of a case where they conflict. Note that the actual global
> names will not conflict, because the names will be prefixed by the
> package.module name.

I can think of a few conflicts. In fact, in one of my own applications I had a function called "isSeparator" that had nothing at all to do with strings.

Regarding the prefixes: I know that you can always access the functions in a fully qualified way, but I think having to do that can be a pain. Especially when you can sometimes get away without it and at other times you have to use the module name.
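For illustration, here is how the choice between short and qualified names looks with import features available in current D (static and renamed imports). The Token type and the application-level isWhite are hypothetical, standing in for an unrelated function that happens to share a library name.

```d
static import std.ascii;               // reachable only as std.ascii.isWhite
import std.uni : isUniWhite = isWhite; // renamed on import

/// Hypothetical application type, unrelated to text handling.
struct Token
{
    int kind;
}

/// Hypothetical application function that reuses the short name.
bool isWhite(Token t)
{
    return t.kind == 0;
}

void main()
{
    assert(isWhite(Token(0)));       // the application's own function
    assert(std.ascii.isWhite(' '));  // fully qualified library call
    assert(isUniWhite('\u2003'));    // EM SPACE, via the renamed import
    assert(!std.ascii.isWhite('\u2003'));
}
```

The short name stays free for the application, at the cost of spelling out or renaming the library functions, which is the inconvenience described above.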

>>3) I wanted to improve on ctype in a few places and also keep a bit
>>closer to the Unicode terms. For example, isspace tests for things that
>>separate words (whitespace in ASCII). In Unicode that's more than just
>>whitespace, thus the name doesn't fit. I also think charIsSpace should
>>check for actual space characters instead of all whitespace.
> 
> 
> If you're changing what, say, isspace does for ASCII characters, then I
> think that's a mistake.

That's precisely why it is not called isspace in my module :). I wanted to make it obvious that it has different behaviour. The function that does what ctype.isspace does is called charIsSeparator (Unicode calls such characters "separators").

charIsSpace on the other hand tests for characters with the Unicode separator subtype "space", which does NOT include linebreaks. That is as it should be, I think.
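Present-day std.uni draws a similar line, which makes the distinction easy to try out. The sketch below uses std.uni.isWhite (the Unicode White_Space property) and std.uni.isSpace (general category Zs); these are only rough stand-ins for charIsSeparator and charIsSpace, whose exact definitions are not given in the thread.

```d
import std.uni : isSpace, isWhite;

void main()
{
    // Ordinary and no-break spaces are both white space and spaces.
    assert(isWhite(' ') && isSpace(' '));
    assert(isWhite('\u00A0') && isSpace('\u00A0')); // NO-BREAK SPACE

    // Line breaks and tabs are white space, but not space characters.
    assert(isWhite('\n') && !isSpace('\n'));
    assert(isWhite('\t') && !isSpace('\t'));
    assert(isWhite('\u2028') && !isSpace('\u2028')); // LINE SEPARATOR

    // Letters are neither.
    assert(!isWhite('a') && !isSpace('a'));
}
```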

However, I'd appreciate any ideas for a better name for charIsSpace that makes it obvious that it tests for spaces without actually using the word "space". I couldn't think of any.

>>Of course we could create a module std.utype which simply defines
>>std.c.ctype compatible aliases. Or even better, simply call the unicode
>>functions directly from std.c.ctype so that there is no wrong choice
>>anymore.
> 
> 
> I'd do that if the utype functions didn't add significant bloat, but they do
> (I presume).

Well, there's not THAT much overhead. But I guess every little bit could be too much for some specialized applications. For example, it would probably not be a good choice for embedded systems.

Right now the module will increase executable size by 12 KB and uses about 2 MB of RAM. The RAM usage could be reduced quite a bit but then the character lookup would be about 3 times slower (right now only a comparison and a simple array indexing operation are needed).
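The trade-off can be sketched with two table layouts. This is not the module's actual implementation, which the thread does not show; it assumes a single boolean property taken from std.uni.isNumber, and all names here (buildFlat, lookupFlat, TwoStage, compress) are invented for the example. The flat table needs one range check and one index. The two-stage table shares identical 256-entry blocks, which saves memory but adds a second index per lookup.

```d
import std.stdio : writefln;
import std.uni : isNumber;

enum uint codePointCount = 0x110000; // U+0000 .. U+10FFFF
enum uint blockSize = 256;

/// Flat table: one bool (one byte) per code point.
bool[] buildFlat()
{
    auto table = new bool[codePointCount];
    foreach (i; 0 .. codePointCount)
        table[i] = isNumber(cast(dchar) i);
    return table;
}

/// Flat lookup: one comparison, one array index.
bool lookupFlat(const bool[] table, dchar c)
{
    return c < table.length && table[c];
}

/// Two-stage table: identical blocks are stored only once.
struct TwoStage
{
    ushort[] index; // block number for each run of 256 code points
    bool[] blocks;  // the distinct blocks, back to back

    bool opIndex(dchar c) const
    {
        if (c >= codePointCount)
            return false;
        return blocks[index[c >> 8] * blockSize + (c & 0xFF)];
    }
}

/// Builds the two-stage form from a flat table.
TwoStage compress(const bool[] flat)
{
    assert(flat.length % blockSize == 0);
    TwoStage t;
    ushort[immutable(bool)[]] seen; // block contents -> block number
    for (size_t start = 0; start < flat.length; start += blockSize)
    {
        immutable(bool)[] block = flat[start .. start + blockSize].idup;
        if (auto known = block in seen)
            t.index ~= *known;
        else
        {
            auto id = cast(ushort)(t.blocks.length / blockSize);
            seen[block] = id;
            t.blocks ~= block;
            t.index ~= id;
        }
    }
    return t;
}

void main()
{
    auto flat = buildFlat();
    auto small = compress(flat);

    // Both layouts must agree for every code point.
    foreach (i; 0 .. codePointCount)
        assert(small[cast(dchar) i] == lookupFlat(flat, cast(dchar) i));

    writefln("flat: %s bytes, two-stage: %s bytes",
             flat.length * bool.sizeof,
             small.index.length * ushort.sizeof
                 + small.blocks.length * bool.sizeof);
}
```

A real character database stores more than one bit of information per code point, so the absolute sizes printed here say nothing about the 2 MB figure above; only the shape of the trade-off carries over.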

Hauke

June 04, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Kris
in reply to Walter

"Walter"  wrote:
> > Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.
>
> I'd do that if the utype functions didn't add significant bloat, but they do (I presume).

Well then, Walter. If that's the case, perhaps you'd apply the same rule to printf usage within the root object? As we all know, printf drags along all the floating point formatting and boatloads of other, uhhh, errrrr ... stuff.

It absolutely does not belong in the root object, and there's only a dozen or so references to it within debug code inside Phobos ...

Sorry to sound a bit snotty, but this is surely a blatant double-standard <g>

- Kris
