char[] casting (page 10)

"kris" <fu@bar.org> wrote in message news:dlc271$n9h$1@digitaldaemon.com...
> I suspected so, which is why that stated "suppose there were". Do you have some suggestions? I'm sure you could probably think of a handful.

I understand the issue. The best I could come up with is the c, w and d suffixes in the cases where the type cannot be inferred.

Other ways to deal with it:

1) instead of overloading the functions by parameter type, change the names as in:

     putc(char[])
     putw(wchar[])
     putd(dchar[])

2) don't provide the overload, and rely on transcoding
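To make these options concrete, here is a minimal sketch in present-day D, not the 2005 compiler the thread is about. The overloaded name `put` and its return values are assumptions made up for illustration; `putc`, `putw` and `putd` follow the names suggested above. It shows a literal suffix selecting an overload, and the renamed functions accepting an unsuffixed literal.

```d
import std.stdio : writeln;

// Overloading by parameter type: the suffix on the literal picks the overload.
// `put` is a hypothetical name used only for this sketch.
string put(const(char)[] s)  { return "char[]";  }
string put(const(wchar)[] s) { return "wchar[]"; }
string put(const(dchar)[] s) { return "dchar[]"; }

// Option 1 from the post: distinct names, so the parameter type fixes
// the literal's encoding and no suffix is needed.
size_t putc(const(char)[] s)  { return s.length; }
size_t putw(const(wchar)[] s) { return s.length; }
size_t putd(const(dchar)[] s) { return s.length; }

void main()
{
    writeln(put("hello"c)); // char[]
    writeln(put("hello"w)); // wchar[]
    writeln(put("hello"d)); // dchar[]

    // Unsuffixed literals are typed by the parameter they are passed to.
    writeln(putc("hello"), " ", putw("hello"), " ", putd("hello")); // 5 5 5
}
```

The trade-off is the one raised in the thread: suffixes put the burden on every call site, while distinct names cannot be used where the function name is fixed by the language.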

"kris" <fu@bar.org> wrote in message news:dlc116$ioe$1@digitaldaemon.com...
> Yes, that would be a problem. However, a pragma would work. I can well imagine you wouldn't care for that kind of 'solution' though. Still, there is surely some reasonable mechanism?

You're right, I don't like pragmas for that purpose, either. One big problem with pragmas is what do they apply to? The text from then on? The module? All modules? How would they interact with templates imported from other modules?

"Bruno Medeiros" <daiphoenixNO@SPAMlycos.com> wrote in message news:dl51ou$2vf9$1@digitaldaemon.com...
> There is also the option of all undecorated string literals having a
> default type (like char[] for instance), instead of "it being inferred
> from the context."
> This seems best to me, at first glance at least. What consequences could
> there be from this approach?

The problem with that approach is more political than technical. There are some older threads here where many felt very strongly that D should be agnostic about which character encoding is "preferred."

Regan Heath wrote:
> There was talk about porting a popular unicode library to D, this would likely solve many of the problems.

Not just talk... http://mango.dsource.org/classICU.html

Uses "The International Component for Unicode" (ICU4C) :
http://www-306.ibm.com/software/globalization/icu/index.jsp

--anders

Georg Wrede wrote:
> I took it for granted that you had "done the Bill Gates" (which implies using wchar and scrubbing the problem under the rug "as you're only dealing with BMP values (common for all civilised folks) you *can* chop a wchar[] anywhere you want, though this is obviously risky for some applications)", that I already wrote several lines of "less than appropriate" commentary.

Earlier versions of Java also did this... (using Java's char, the same as D's wchar)

They've changed that now, and are using "int", which is the closest they get to dchar: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html

You do know that part of this comes from Unicode changes, right? (Originally it was thought that 16 bits would be enough. It wasn't.)

--anders
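Why chopping a wchar[] is only safe inside the BMP can be shown with a small sketch in present-day D. The helper `isHighSurrogate` and the choice of U+1D11E (a musical clef symbol outside the BMP) are assumptions for illustration, not something from the thread.

```d
import std.stdio : writeln;

// Hypothetical helper: true if the UTF-16 code unit is the first half
// of a surrogate pair (range 0xD800 to 0xDBFF).
bool isHighSurrogate(wchar c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

void main()
{
    // U+1D11E lies outside the Basic Multilingual Plane.
    wstring w = "\U0001D11E"w;
    dstring d = "\U0001D11E"d;

    writeln(w.length); // 2: UTF-16 needs a surrogate pair
    writeln(d.length); // 1: UTF-32 needs one code unit

    // Slicing between the two halves leaves a lone high surrogate,
    // which is not a valid string on its own.
    wstring half = w[0 .. 1];
    writeln(isHighSurrogate(half[0])); // true
}
```

For text that stays within the BMP every wchar is a whole code point, which is why the shortcut appears to work until a supplementary character shows up.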

November 15, 2005

Re: The hunting accident is claiming more casualties

Posted by Oskar Linde
in reply to Georg Wrede

In article <43790E73.1080704@nospam.org>, Georg Wrede says...
>
>> I think we've painted ourselves in the corner by calling the UTF-8 entity "char"!!
>
>Given:
>
>     char[10] foo;
>
>What is the storage capacity of foo?
>
>  - is it 10 UTF-8 characters
>  - is it 2.5 UTF-8 characters
>  - "it depends"
>  - something else

It is obviously 10 UTF-8 code units. 10 bytes of memory get reserved. Nothing else makes sense. There is no such thing as a UTF-8 character. There are Unicode characters, each of which is encoded as 1-4 UTF-8 code units.
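A minimal sketch of that answer in present-day D follows. The sample characters are assumptions chosen to cover each encoded length; they are written as escapes so the result does not depend on how the source file is saved.

```d
import std.stdio : writeln;

void main()
{
    char[10] foo;                 // ten UTF-8 code units, i.e. ten bytes
    writeln(foo.sizeof);          // 10

    // .length of a char string counts code units, not characters.
    writeln("a".length);          // 1 (U+0061)
    writeln("\u00E4".length);     // 2 (U+00E4)
    writeln("\u20AC".length);     // 3 (U+20AC)
    writeln("\U0001D11E".length); // 4 (U+1D11E)

    // So foo holds ten ASCII letters, but only five copies of U+00E4.
    foo[] = "\u00E4\u00E4\u00E4\u00E4\u00E4";
    writeln(foo.length);          // 10
}
```

How many characters fit therefore depends on which characters they are, while the storage itself is always ten bytes.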

>Another question:
>
>How much storage capacity should I allocate here:
>
>    char[?] bar = "Äiti syö lettuja.";
>       // Finnish for "Mother eats pancakes." :-)
>    char[?] baz = "ï®”ï¯›ïºŒï¯”ïº¡ïº ï®—ï®ï®±ïº¼ïº¶";
>       // I hope the al Qaida or the CIA don't knock on my door,
>       // I hereby officially state I have no idea what I wrote.

If you don't want the extra overhead of a dynamic char[], auto will help you here:

auto baz = "ï®”ï¯›ïºŒï¯”ïº¡ïº ï®—ï®ï®±ïº¼ïº¶";

writef("%s\n",typeid(typeof(baz))); // will print char[33]

This string will always be a char[33], no matter if the file encoding is UTF-8, UTF-16 or UTF-32.
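As a hedged illustration in present-day D: a literal's length is counted in code units of its own encoding, independent of the file encoding. Note that current compilers type an unsuffixed literal as `string` (a slice of immutable char), not as a static array like the `char[33]` reported above for the 2005 compiler. The three code points in `baz` are an assumption standing in for the string in the post; each takes three UTF-8 code units, so eleven such characters would be consistent with the 33 quoted.

```d
import std.stdio : writeln;

void main()
{
    // The Finnish sentence from the post, written with escapes so the
    // result does not depend on how this source file is saved.
    string  utf8  = "\u00C4iti sy\u00F6 lettuja."c;
    wstring utf16 = "\u00C4iti sy\u00F6 lettuja."w;
    dstring utf32 = "\u00C4iti sy\u00F6 lettuja."d;

    writeln(utf8.length);  // 19 code units (two letters take 2 bytes each)
    writeln(utf16.length); // 17
    writeln(utf32.length); // 17

    // Three code points from the Arabic presentation-form range,
    // each encoded as three UTF-8 code units.
    auto baz = "\uFB94\uFBDB\uFE8C";
    static assert(is(typeof(baz) == string));
    writeln(baz.length);   // 9
}
```

So the answer to "how much storage" is the UTF-8 length for char, and the compiler can work it out from the literal.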

>(Heh, btw, upon writing that string, my cursor started looking weird. Now I'll just have to see whether my news reader or windows itself crashes first!! Which should never happen because I got the characters "legally", i.e. from the windows Character Map in the System Tools menu.)

Unicode support is hardly mature everywhere yet... :)

/Oskar

On Tue, 15 Nov 2005 09:06:15 +0100, Anders F Björklund <afb@algonet.se> wrote:
> Regan Heath wrote:
>
>> There was talk about porting a popular unicode library to D, this would likely solve many of the problems.
>
> Not just talk... http://mango.dsource.org/classICU.html
>
> Uses "The International Component for Unicode" (ICU4C) :
> http://www-306.ibm.com/software/globalization/icu/index.jsp

That's the one. Any plans to get this into phobos, or ares? Can I assume from the filename "classICU" that it's class based? Does it also contain stand alone methods, perhaps a direct port of the C API of ICU with the class built on top of that?

These questions are directed at Kris, I guess, or anyone who has the knowledge I'm too lazy to find for myself ;)

Regan

Walter Bright wrote:
> "kris" <fu@bar.org> wrote in message news:dlc116$ioe$1@digitaldaemon.com...
>
>>Yes, that would be a problem. However, a pragma would work. I can well
>>imagine you wouldn't care for that kind of 'solution' though. Still,
>>there is surely some reasonable mechanism?
>
> You're right, I don't like pragmas for that purpose, either. One big
> problem with pragmas is what do they apply to? The text from then on? The
> module? All modules? How would they interact with templates imported from
> other modules?

That's a good question;

I had wondered idly whether the pragma/whatever might be set within the class/struct/interface mechanism? Kind of like an attribute? The notion being that the class could indicate what its intentions were, regarding literal behavior (both elemental and [] varieties) for its contained methods/templates/mixins etc ~ as intended by the developer.

Being optional this could, ostensibly, support all perspectives?

Walter Bright wrote:
> I understand the issue. The best I could come up with is the c, w and d
> suffixes in the cases where the type cannot be inferred.
>
> Other ways to deal with it:
>
> 1) instead of overloading the functions by parameter type, change the names
> as in:
>
> putc(char[])
> putw(wchar[])
> putd(dchar[])

Doesn't work for opX aliases, such as opShl() for C++ compatibility :-(

> 2) don't provide the overload, and rely on transcoding

That's perhaps fine for single char instances, but it won't fly for anything other than trivial arrays :-(
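The transcoding route can be sketched as follows in present-day D, using `toUTF8` from `std.utf`. The single-encoding sink `put` is a hypothetical name for illustration. Each conversion typically builds a new array, which is the cost being objected to for anything beyond small inputs.

```d
import std.stdio : writeln;
import std.utf : toUTF8;

// Hypothetical sink that accepts UTF-8 only; callers holding other
// encodings must transcode first.
size_t put(const(char)[] s)
{
    return s.length; // number of UTF-8 code units received
}

void main()
{
    wstring w = "sy\u00F6"w; // 3 UTF-16 code units
    dstring d = "sy\u00F6"d; // 3 UTF-32 code units

    // Each call converts the whole array before put() sees it.
    writeln(put(toUTF8(w))); // 4
    writeln(put(toUTF8(d))); // 4
}
```

With one overload the API stays small, but every wchar[] or dchar[] caller pays for a full conversion per call.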

kris wrote:
> Walter Bright wrote:
>
>> "kris" <fu@bar.org> wrote in message news:dlc116$ioe$1@digitaldaemon.com...
>>
>>> Yes, that would be a problem. However, a pragma would work. I can well
>>> imagine you wouldn't care for that kind of 'solution' though. Still,
>>> there is surely some reasonable mechanism?
>>
>> You're right, I don't like pragmas for that purpose, either. One big
>> problem with pragmas is what do they apply to? The text from then on? The
>> module? All modules? How would they interact with templates imported from
>> other modules?
>
> That's a good question;
>
> I had wondered idly whether the pragma/whatever might be set within the class/struct/interface mechanism? Kind of like an attribute? The notion being that the class could indicate what its intentions were, regarding literal behavior (both elemental and [] varieties) for its contained methods/templates/mixins etc ~ as intended by the developer.
>
> Being optional this could, ostensibly, support all perspectives?

Dear gods, no! Think of the code readability, man. One should keep all the elements necessary to understand a piece of code to a minimum (and as close and local as possible). Keep things simple.

--
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
