Unicode (was The Unicode Casing Algorithms) (page 3) - D Programming Language Discussion Forum

June 04, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Walter
in reply to Kris

"Kris" <someidiot@earthlink.dot.dot.dot.net> wrote in message news:c9qub0$22er$1@digitaldaemon.com...
> "Walter" wrote:
> > > Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode
> > > functions directly from std.c.ctype so that there is no wrong choice anymore.
> >
> > I'd do that if the utype functions didn't add significant bloat, but they do
> > (I presume).
>
> Well then, Walter. If that's the case, perhaps you'd apply the same rule to
> printf usage within the root object? As we all know, printf drags along all
> the floating point formatting and boatloads of other, uhhh, errrrr ... stuff.
>
> It absolutely does not belong in the root object, and there's only a dozen or so references to it within debug code inside Phobos ...
>
> Sorry to sound a bit snotty, but this is surely a blatant double-standard <g>

But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.

June 05, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Kris
in reply to Walter

Printf is certainly useful, but one shouldn't have to pay the bloat price when they don't even use it. Placing a printf call within Object.d (the print() method) adds zero value, and has negative impact.

It's great not having to explicitly import printf ... but having it automatically loaded where it's never actually used is so totally bogus.

BTW, there are actually only around 20 calls to Object.print(), all within Phobos (as Ben Hinkle pointed out). If you remove those, along with Object.print(), the problem just goes away ...

"Walter" wrote:
> But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.

June 05, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Kris
in reply to Walter

"Walter"  wrote:
> But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.

Walter: I realize my reply wasn't very helpful, so please permit me to re-phrase?

Yes, as you say, everyone needs printf <g>. They just don't need it in Object.print()

- Kris

June 05, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Walter
in reply to Kris

"Kris" <someidiot@earthlink.dot.dot.dot.net> wrote in message news:c9r8sq$2hnt$1@digitaldaemon.com...
> "Walter"  wrote:
> > But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.
>
> Walter: I realize my reply wasn't very helpful, so please permit me to re-phrase?
>
> Yes, as you say, everyone needs printf <g>. They just don't need it in Object.print()

Yeah, printf should probably be removed from Object.print().

June 05, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Arcane Jill
in reply to Walter

In article <c9qr0q$1tk7$2@digitaldaemon.com>, Walter says...

>If you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.

Unicode space is not whitespace. Whitespace is a completely different concept. For example, non-breaking space ('\u00A0') is not considered whitespace, but Unicode correctly identifies it as a spacing character. Even more disastrous, '\n' is whitespace, but it is not space.

Hauke is correct. These are different properties. You cannot simply re-use the old functions. You have to supply new ones, and preferably with different names.

Arcane Jill

(By the way, I couldn't download the zip file. Mozilla Firebird freaked out when I tried to click on the link).

June 05, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Sean Kelly
in reply to Arcane Jill

In article <c9rqvu$bah$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <c9qr0q$1tk7$2@digitaldaemon.com>, Walter says...
>
>>If you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.
>
>Unicode space is not whitespace. Whitespace is a completely different concept. For example, non-breaking space ('\u00A0') is not considered whitespace, but Unicode correctly identifies it as a spacing character. Even more disastrous, '\n' is whitespace, but it is not space.
>
>Hauke is correct. These are different properties. You cannot simply re-use the old functions. You have to supply new ones, and preferably with different names.

But that doesn't break the ASCII functions for the ASCII character set, it only means that new ones must be provided for Unicode characters.  Personally, I'd prefer that the new functions work for both Unicode and for ASCII, much like the locale-based functions do in C++.  Localization in C++ is probably the most complex part of the language, however, and I'd like to see if we can't find a way to simplify it a bit in D.

Sean

June 05, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Arcane Jill
in reply to Sean Kelly

In article <c9sob6$1qpn$1@digitaldaemon.com>, Sean Kelly says...

>But that doesn't break the ASCII functions for the ASCII character set, it only means that new ones must be provided for Unicode characters.  Personally, I'd prefer that the new functions work for both Unicode and for ASCII,

Obviously you are aware of this, but your choice of words gives a strange impression here. Clearly, ASCII characters *are* Unicode characters. ASCII is but a small subset of Unicode. The new functions are defined for all Unicode characters, therefore they are defined for all ASCII characters.


>much like the
>locale-based functions do in C++.  Localization in C++ is probably the most
>complex part of the language, however, and I'd like to see if we can't find a
>way to simplify it a bit in D.

Agreed, but I'm not clear what you're asking. I've been involved with a text-to-speech project which we had to internationalize and localize for a whole bunch of languages. That was in C++, so I know the issues. Using Unicode made things a whole lot easier, but localization is about a lot more than selecting a character set. Stuff like what character you use for a decimal point, how you punctuate sentences, what kind of quotation marks you use, and so on, are all relevant to localization, and it would be nice to address these. But these issues are independent of the assigned properties of Unicode characters.

But I never did like the way C handled locales. Java's tactic made more sense.

With regard to those character properties, I couldn't quite figure out if you were agreeing or disagreeing. I suspect that we are all in agreement really. Certainly I would hope so, because actually there is no decision to be taken. And for obvious reasons:

(1) The behavior of the ctype functions for the ASCII range is well and truly defined by years of precedent, and cannot be changed.

(2) Similarly, the Unicode standard, and its various classifications, is an established international standard, and one which we are also not at liberty to change.

So, either we implement Unicode properties or we don't, but if we want to be standards compliant, we /cannot/ change one single Unicode property - not even to make it compatible with isspace(), whether we agree with it or not. To do so would place us at odds with - well, basically, the rest of the world.

It follows, therefore, that we need BOTH functions - for instance, we need the old fashioned ctype isspace() AND we need the new Unicode function charIsSpace(). We need the old fashioned ctype isalpha() AND we need the new Unicode function charIsLetter().

Supplying new functions cannot possibly break the old ones! But as Hauke and I have pointed out, in general they do not agree with each other, even in the ASCII range, and certainly not in the range 0x00 to 0xFF (the range for which the ctype functions are usually implemented).

Java has a nice solution, which we might like to copy. Java implements the Unicode Standard (at least for Unicode 2.0), but they ALSO implement ADDITIONAL functions, such as isWhitespace(), isJavaIdentifierStart(), and so on.

<ping!> I've just realized what you're referring to. How dumb of me not to have seen it earlier! Ok, let me go through this.... In C, the ctype functions such as toupper(c) will return a different value for a given codepoint c, depending on the current system default locale. toupper(0xD3) might give a different answer in Russia from the one it gives in France. THIS PROBLEM DOES NOT ARISE WITH UNICODE.

However, D implements toupper(), so the question is, should toupper() be locale dependent in D as it is in C? My immediate thought would be no. No way. The C system locale selects a character encoding upon which toupper() et al. operate, but there is only one D character encoding standard. It is Unicode - the superset of all the others. And in Unicode, you *don't* call toupper(), you call Hauke's new function - charToUpper(). My inclination is that the old ctype functions should be defined only for the ASCII range (though having them take a dchar is harmless), and within that range, they be compatible with what C did.


Arcane Jill

June 05, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Walter
in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:c9t05d$26ft$1@digitaldaemon.com...
> <ping!> I've just realized what you're referring to. How dumb of me not to have
> seen it earlier! Ok, let me go through this.... In C, the ctype functions such
> as toupper(c) will return a different value for a given codepoint c, depending
> on the current system default locale. toupper(0xD3) might give a different answer in Russia from that which it does in France. THIS PROBLEM DOES NOT ARISE
> WITH UNICODE. However, D implements toupper(), so the question is, should
> toupper() be locale dependent in D as it is in C. My immediate thought would be
> no. No way. The C system locale selects a character encoding upon which toupper() et al operate, but there is only one D character encoding standard. It
> is Unicode - the superset of all the others. And in Unicode, you *don't* call
> toupper(), you call Hauke's new function - charToUpper(). My inclination is that
> the old ctype functions should be defined only for the ASCII range (though having them take a dchar is harmless), and within that range, they be compatible
> with what C did.

I've pretty much come to the same conclusions:

1) D's character types are unicode. They aren't indices into locale-dependent code pages. The library functions are unicode. If you have data that's in a locale-dependent code page, convert it to unicode before using library string functions.

2) The ctype functions will just return 0 for non-ASCII characters.

3) There will be a separate set of functions for unicode, with different names.

Thanks to you and Hauke for clarifying the issues with this.

June 05, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Sean Kelly
in reply to Walter

Walter wrote:
> "Arcane Jill" <Arcane_member@pathlink.com> wrote in message
> news:c9t05d$26ft$1@digitaldaemon.com...
> 
>><ping!> I've just realized what you're referring to. How dumb of me not to have
>>seen it earlier! Ok, let me go through this.... In C, the ctype functions such
>>as toupper(c) will return a different value for a given codepoint c, depending
>>on the current system default locale. toupper(0xD3) might give a different
>>answer in Russia from that which it does in France. THIS PROBLEM DOES NOT ARISE
>>WITH UNICODE. However, D implements toupper(), so the question is, should
>>toupper() be locale dependent in D as it is in C. My immediate thought would be
>>no. No way. The C system locale selects a character encoding upon which
>>toupper() et al operate, but there is only one D character encoding standard. It
>>is Unicode - the superset of all the others. And in Unicode, you *don't* call
>>toupper(), you call Hauke's new function - charToUpper(). My inclination is that
>>the old ctype functions should be defined only for the ASCII range (though
>>having them take a dchar is harmless), and within that range, they be compatible
>>with what C did.

Thanks for putting it so clearly.  I'm a bit rusty with C locale stuff and had forgotten about the default locale business.  I agree.  I would prefer to have a set of basic functions that are not locale dependent for the ASCII character set and have D provide its own set of unicode functions.

> I've pretty much come to the same conclusions:
> 
> 1) D's character types are unicode. They aren't indices into
> locale-dependent code pages. The library functions are unicode. If you have
> data that's in a locale-dependent code page, convert it to unicode before
> using library string functions.
> 
> 2) The ctype functions will just return 0 for non-ASCII characters.
> 
> 3) There will be a separate set of functions for unicode, with different
> names.

Sounds fantastic.


Sean

June 05, 2004

Re: Unicode (was The Unicode Casing Algorithms)

Posted by Hauke Duden
in reply to Arcane Jill

Arcane Jill wrote:
> (By the way, I couldn't download the zip file. Mozilla Firebird freaked out when
> I tried to click on the link).

It is now also available here:

http://www.hazardarea.com/unichar.zip


Hauke
