August 28, 2004
Sean Kelly wrote:

> In article <cgqlmm$2ui$1@digitaldaemon.com>, Walter says...
> 
>>
>>"Arcane Jill" <Arcane_member@pathlink.com> wrote in message
>>news:cgmr7f$1b90$1@digitaldaemon.com...
>>
>>>UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
>>>text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
>>
>>While the rest of your post has a great deal of merit, this bit here is just
>>not true. A pure ASCII app written using UTF-16 will consume twice as much
>>memory for its data, and there are a lot of operations on that data that
>>will be correspondingly half as fast. Furthermore, you'll start swapping a
>>lot sooner, and then performance takes a dive.
>>
>>It makes sense for Java, Javascript, and for languages where performance is
>>not a top priority to standardize on one character type. But if D does not
>>handle ASCII very efficiently, it will not have a chance at interesting the
>>very performance conscious C/C++ programmers.
> 
> 
> Agreed.  Frankly, I've begun to wonder just what the purpose of this discussion
> is.  I think it's already been agreed that none of the three char types should
> be removed from D, and it seems clear that there is no "default" char type.  Is

I think the thread has gone somewhat off topic by this point.

Apparently, a lot of people feel oppressed by ASCII. I must be a bad person, since 7 bits is all I need most of the time.

On a related note, the "performance of char vs wchar" discussion recently degraded into enlightened comments along the lines of "You're a know-it-all American cowboy who discriminates against all of the Chinese, Japanese, Indians, Russians, British, and members of the European Union in the world. And Mormons, too." Or something like that.

Somehow these Unicode-related discussions bring out the best in people. :P

> this a nomenclature issue?  i.e. that the UTF-8 type is named "char" and thus
> considered to be somehow more important than the others?
> 
> 
> Sean

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
August 28, 2004
"Sean Kelly" <sean@f4.ca> wrote in message news:cgqnjm$3eq$1@digitaldaemon.com...
> In article <cgqmpo$387$1@digitaldaemon.com>, Andy Friesen says...
> >
> >I know the point is to avoid all the complications that the C++ approach entails, but this has a way of throwing a wrench in any interface that wants to handle all three.
>
> I'm inclined to agree, though I'm wary of making char types a special case
> for overload resolution.  Perhaps a prefix to indicate type?
>
> c"" // utf-8
> w"" // utf-16
> d"" // utf-32
>
> Still not ideal, but it would require less typing :/

I thought of the prefix approach, like C uses, but it just seemed redundant for the odd case where a cast(char[]) will do.
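
For example, the cast approach looks like this (a minimal sketch; the put() overloads are hypothetical):

void put(char[] s)  { /* handle UTF-8 */ }
void put(wchar[] s) { /* handle UTF-16 */ }

void example()
{
    // put("hello");            // ambiguous: the literal matches both overloads
    put(cast(char[])"hello");   // the cast picks the UTF-8 overload
    put(cast(wchar[])"hello");  // ...or the UTF-16 one
}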


August 29, 2004
J C Calvarese wrote:

> Arcane Jill wrote:
>> In article <cgqju0$2b4$1@digitaldaemon.com>, Walter says...
> ...
>> In article <cgqju1$2b4$2@digitaldaemon.com>, Walter says...
>> 
>>>>It's not invalid as such, it's just that the return type of an overloaded function has to be "covariant" with the return type of the function it's overloading. So it's a compile error /now/. But if dchar[] and char[] were to be considered mutually covariant then this would magically start to compile.
>>>
>>>That would be nice, but I don't see how to technically make that work.
>> 
>> 
>> You're right. It wouldn't work.
>> 
>> Well, this is the sense in which D does have a "default" string type. It is the case where we see clearly that char[] has special privilege. Currently, Object - and therefore /everything/ - defines a function toString() which returns a char[]. It is not possible to overload this with a function returning wchar[], even if wchar[] would be more appropriate. This affects /all/ objects.
>> 
>> What to do about it? Hmmm....
>> 
>> You could change the Object.toString to:
>> 
>> #    void toString(out char[] s) { /* implementation as before */ }
>> #    void toString(out wchar[] s) { char[] t; toString(t); s = t; }
>> #    void toString(out dchar[] s) { char[] t; toString(t); s = t; }
>> 
>> with the latter two employing implicit conversion for the "s = t;" part. Subclasses of Object which overloaded only toString(out char[]) would get the other two for free. But subclasses of Object which decided to go a bit further could return a wchar[] or a dchar[] directly to cut down on conversions.
> 
> Could we ditch toString and replace the functionality with:
> 
> toUtf8(), toUtf16(), and toUtf32()
> or toCharStr(), toWCharStr(), and toDCharStr()
> 
> Usually, the person writing the object could define one and the other two would call conversions.
> 
> There's probably some reason why this wouldn't work, but it's just such a pleasant idea to me that I was forced to share it.

Why is toString such a hot topic anyway? In Java, end users hardly ever see
the result of a toString. In D, users may well see the output of things like
toString(int) and toString(double), but in general Foo.toString should just
give a summary of the object - preferably short and easy to transcode.

I wouldn't use toString for things like getting user strings out of text fields in a GUI or reading from a file. For those cases I would use another function name like TextBox.getText or File.readStringW. Those other functions can have char[] versions and wchar[] versions as desired. As an example of a "bad" toString see std.stream.Stream.toString. It will usually create a huge string.
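
For instance, paired accessors might look like this (a sketch; the names are hypothetical):

// a text-field API with one accessor per encoding
interface TextSource
{
    char[]  getText();    // UTF-8 flavour
    wchar[] getTextW();   // UTF-16 flavour, for callers that prefer wchar[]
}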

For classes like AJ's arbitrary sized Int toString should return the whole integer since that is the best summary of the object. So we should still allow toString to return arbitrarily long strings - we just need to be careful how toString is used.


August 29, 2004
In article <cgra2k$udg$1@digitaldaemon.com>, Ben Hinkle says...

>Why is toString such a hot topic anyway?

Ask yourself why toString() exists at all. What does D use it for?

If it's an unnecessary, hardly-used function, then it should be removed from Object, because this is OOP: if something doesn't make sense for all Objects, then it should not be defined for all Objects.

On the other hand, if it /is/ necessary for all objects, it shouldn't be biased one way or the other.


>but in general Foo.toString should
>just give a summary of the object

Of the object's /value/, yes. So toString() only makes sense for objects which actually /have/ a value. I'm not sure if streams can be said to have a "value" in the sense that Ints do, so maybe it shouldn't be defined at all for streams.


>As an
>example of a "bad" toString see std.stream.Stream.toString. It will usually
>create a huge string.

Yes. Now I'm starting to wonder what toString() is actually for, and whether
implementing a three-function interface (Stringizable?) might be better than
inheriting from Object.


>For classes like AJ's arbitrary sized Int toString should return the whole integer since that is the best summary of the object. So we should still allow toString to return arbitrarily long strings - we just need to be careful how toString is used.

Let's go back to Walter on this one. Walter - why does Object have a toString() function? In what way does D require or rely on it? How badly would D be affected if it didn't exist at all or if it were an interface?

Jill


August 29, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgrsnq$168a$1@digitaldaemon.com...
> Let's go back to Walter on this one. Walter - why does Object have a
> toString() function? In what way does D require or rely on it? How badly would D be affected if it didn't exist at all or if it were an interface?

It's so when you pass an object to writef(), there's a way that it can be printed. But Ben is right, I don't see Object.toString() being used to generate very large strings, so any transcoding of it isn't going to be expensive overall.
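
For example, a minimal sketch of the mechanism (the Point class is made up):

import std.stdio;
import std.string;

class Point
{
    int x, y;
    this(int x, int y) { this.x = x; this.y = y; }

    char[] toString()
    {
        return "(" ~ std.string.toString(x) ~ ", " ~ std.string.toString(y) ~ ")";
    }
}

void main()
{
    Point p = new Point(3, 4);
    writef("p = %s\n", p);   // writef falls back on p.toString() here
}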


August 29, 2004
>Agreed.  Frankly, I've begun to wonder just what the purpose of this discussion is.

The toString() method has to return a string, but in which format? That's what
this is all about.

>I think it's already been agreed that none of the three char types should be removed from D, and it seems clear that there is no "default" char type.

Well, the language can only use one type as the return type for toString(), so there actually IS a default character type.

BTW, isn't the name "char" totally misleading? char means character, but it can't hold a full character; it can only hold part of one. This is confusing.
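
A small illustration (using std.utf):

import std.utf;

void example()
{
    char[] s = "\u20AC";     // the euro sign - one character to the user
    assert(s.length == 3);   // but three UTF-8 code units, so no single char holds it
    dchar[] d = toUTF32(s);
    assert(d.length == 1);   // one dchar does
}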

>Is
>this a nomenclature issue?  i.e. that the UTF-8 type is named "char" and thus
>considered to be somehow more important than the others?

Nope.

-- Matthias Becker


August 29, 2004
>news:cgrsnq$168a$1@digitaldaemon.com...
>> Let's go back to Walter on this one. Walter - why does Object have a
>> toString() function? In what way does D require or rely on it? How badly would D be affected if it didn't exist at all or if it were an interface?
>
>It's so when you pass an object to writef(), there's a way that it can be printed. But Ben is right, I don't see Object.toString() being used to generate very large strings, so any transcoding of it isn't going to be expensive overall.

I don't get it :(

Why isn't it possible to use a Stringizable interface instead?
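
Something like this, say (a sketch; the name is AJ's suggestion, the rest is guesswork):

interface Stringizable
{
    char[]  toUTF8();
    wchar[] toUTF16();
    dchar[] toUTF32();
}

writef() could then test for the interface instead of assuming every Object has a toString().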

-- Matthias Becker


August 29, 2004
On Fri, 27 Aug 2004 08:26:23 +0000 (UTC), Arcane Jill <Arcane_member@pathlink.com> wrote:
> In article <opsdc4sgsi5a2sq9@digitalmars.com>, Regan Heath says...
>> After all, I don't mind if my compile is a little slower if it means my
>> app is faster.
>
> UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all text
> is ASCII is going to be just as fast in UTF-16 as it is in ASCII. Remember that
> ASCII is a subset of UTF-16, just as it is a subset of UTF-8.
>
> Converting between UTF-8 and UTF-16 won't slow you down much if all your
> characters are ASCII, of course. Such a conversion is trivial - not much slower
> than a memcpy. /But/ - you're still making a copy, still allocating stuff off
> the heap and copying data from one place to another, and that's still overhead
> which you would have avoided had you used UTF-16 right through.

You're right.. I would argue, however, that space == speed when you start to run out, which happens twice as soon if you use wchar (for ASCII-only data), right? The overall efficiency of a program is made up of both its space and CPU requirements; sometimes you will need or want to lessen the space requirements.

>>> (3) Object.d should contain the line:
>>> #    alias wchar[] string;
>>
>> I'm not sure I like this.. will this hide details a programmer should be
>> aware of?
>
> It's just a strong hint. If aliases are bad then we shouldn't use them anywhere.

I wasn't suggesting aliases were bad. Aliases that serve to make type declarations clearer are very useful; they make code clearer. This alias just renames a type, so now it has two names, which will likely cause some confusion. I think we can suggest a type without it.
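
To illustrate the renaming:

alias wchar[] string;

void example()
{
    string  s = "hello";  // "string" and wchar[] are one and the same type...
    wchar[] w = s;        // ...so code can mix both names freely, which can confuse readers
}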

>>> (It could be made faster, but it can /never/ be made as fast as UTF-16).
>>> So we make wchar[], not char[], the "standard", and hey presto, things
>>> get faster
>>
>> Qualification: For non ASCII only apps.
>
> No, for /all/ apps. I don't see any reason why ASCII stored in wchar[]s would be
> any slower than ASCII stored in char[]s. Can you think of a reason why that
> would be so?
>
> ASCII is a subset of UTF-8
> ASCII is a subset of UTF-16
>
> where's the difference? The difference is space, not speed.

Correct, but space == speed (as above).

>> The fact that ICU has no char type suggests it's a bad choice for D, that
>> is, if we want to assume they knew what they were doing.
>
> See http://oss.software.ibm.com/icu/userguide/strings.html for ICU's discussion
> on this.

Thanks.

>> Are there any
>> complaints from developers about ICU anywhere, perhaps some digging for
>> dirt would help make an objective decision here?
>
> I don't know. I imagine so. People generally tend to complain about
> /everything/.

<g> true, too true...

>> I'd like some more stats and figures, simply:
>>  - how many unicode characters are in the range < U+0800?
>
> These figures are all from Unicode 4.0. Unicode now stands at 4.0.1, so these
> figures are out of date - but they'll give you the general idea.
>
> 1646
>
>>  - how many unicode characters in the range U+0800 <= x <= U+FFFF?
>
> 54014, of which 6400 are Private Use Characters
>
>>  - how many unicode characters > U+FFFF?
>
> 176012, of which 131068 are Private Use Characters
>
>
>> Then, how commonly is each range used? I imagine this differs depending on
>> exactly what you're doing.. basically when would you use characters in
>> each range and how common is that task?

..thanks for the lists/figures..

>> It used to be that ASCII < U+0800 was the most common, it still may be,
>> but I can see that it's not the future; the future is Unicode.
>
> Most common, perhaps, if you limit your text to Latin, Greek, Hebrew and
> Russian. But not otherwise.

So would you say most common worldwide, then? It may be due to the fact that I only speak English, but I see many more English-only programs than (pick a language)-only programs - ignoring those applications that come in several languages, as all the big ones do.

>> That said, would the introduction of char to Java give you anything?
>> perhaps.. it would allow you to write an app that only deals with ASCII
>> (chars < U+0800) more space efficiently, correct?
>
> Only chars < U+0080 (not U+0800) would be more space efficient in UTF-8.

Yes, what if they are all you want/need/(are going) to use...

> Between
> U+0080 and U+07FF they both need two bytes. From U+0800 upwards, UTF-8 needs
> three bytes where UTF-16 needs two.
>
> How am I doing on the convincing front?

You still have work to do <g>

> I'd still go for:
> *) wchar[] for everything in Phobos and everything DMD-generated;

What if you know you're only going to need ASCII (UTF-8)? What if all your data is going to be in ASCII (UTF-8)? Won't you want your static strings in ASCII (UTF-8) as well, to cut down on transcoding?

> *) ditch the char

I don't see the point in this. char[] is still useful regardless of which type we 'promote' as the best type to use for internationalized strings.

If it were removed, then dchar should be removed for the same reason, and both types would have to be implemented as ubyte[] and int[] instead. Coincidentally, this is what ICU has done; I quote...

"UTF-8 and UTF-32 are supported with converters (ucnv.h), macros (utf.h), and convenience functions (ustring.h), but not directly as string encoding forms for most APIs."

I'm not convinced removing them is more useful than keeping them and adding implicit transcoding.

> *) lossless implicit conversion between all remaining D string types

Here we agree, this would make life much easier.
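
Until then the conversions stay explicit, e.g. with std.utf:

import std.utf;

void example()
{
    char[]  c = "hello";
    wchar[] w = toUTF16(c);   // what lossless implicit conversion would do for us
    dchar[] d = toUTF32(c);
}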

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 29, 2004
Arcane Jill wrote:

> In article <cgra2k$udg$1@digitaldaemon.com>, Ben Hinkle says...
> 
>>Why is toString such a hot topic anyway?
> 
> Ask yourself why toString() exists at all. What does D use it for?

In Java (and I suppose C#) it is very handy when debugging to print an object out. One drawback with D's Object.toString is that the default implementation doesn't print the object's address (or, in Java's case, the hash code), so you can't distinguish one object from another. If anything, I'd like to see some guidelines for toString so that the output is consistent across D. For example, if the object doesn't have an obvious string representation, like class Foo { int n; double d; }, then the result of toString should have the form "[Foo n:0, d:0.0]" - possibly with the address or hash code in there. I think this is basically the format Java uses, but I can't remember exactly. In general, toString should avoid newlines.
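
A sketch of that guideline (leaning on the std.string.toString overloads for int and double mentioned earlier):

import std.string;

class Foo
{
    int n;
    double d;

    // follows the suggested "[Foo n:0, d:0.0]" summary format
    char[] toString()
    {
        return "[Foo n:" ~ std.string.toString(n)
             ~ ", d:" ~ std.string.toString(d) ~ "]";
    }
}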

> If it's an unnecessary, hardly-used function, then it should be removed from Object, because this is OOP: if something doesn't make sense for all Objects, then it should not be defined for all Objects.
> 
> On the other hand, if it /is/ necessary for all objects, it shouldn't be biased one way or the other.

It isn't necessary but it is nice to have around. Does it make sense for all
objects? I guess that depends on one's viewpoint of "makes sense". I think
printing the class and hash code makes sense. Maybe others don't.

>>but in general Foo.toString should
>>just give a summary of the object
> 
> Of the object's /value/, yes. So toString() only makes sense for objects which actually /have/ a value. I'm not sure if streams can be said to have a "value" in the sense that Ints do, so maybe it shouldn't be defined at all for streams.

See above comments about making sense and the default toString format.

>>As an
>>example of a "bad" toString see std.stream.Stream.toString. It will
>>usually create a huge string.
> 
> Yes. Now I'm starting to wonder what toString() is actually for, and
> whether implementing a three-function interface (Stringizable?) might be
> better than inheriting from Object.

That's an option, but it adds more work for users to make something
stringizable - two of the stringizable functions will call the "real" one
and wrap the result in toUTF8 etc. Is it worth it to do all that just for
debugging?

>>For classes like AJ's arbitrary sized Int toString should return the whole integer since that is the best summary of the object. So we should still allow toString to return arbitrarily long strings - we just need to be careful how toString is used.
> 
> Let's go back to Walter on this one. Walter - why does Object have a toString() function? In what way does D require or rely on it? How badly would D be affected if it didn't exist at all or if it were an interface?
> 
> Jill

August 30, 2004
Walter wrote:
> "Arcane Jill" <Arcane_member@pathlink.com> wrote in message
> news:cgmr7f$1b90$1@digitaldaemon.com...
> 
>>UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
>>text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
> 
> 
> While the rest of your post has a great deal of merit, this bit here is just
> not true. A pure ASCII app written using UTF-16 will consume twice as much
> memory for its data, and there are a lot of operations on that data that
> will be correspondingly half as fast. Furthermore, you'll start swapping a
> lot sooner, and then performance takes a dive.
> 
> It makes sense for Java, Javascript, and for languages where performance is
> not a top priority to standardize on one character type. But if D does not
> handle ASCII very efficiently, it will not have a chance at interesting the
> very performance conscious C/C++ programmers.
> 
> 
They won't worry about it, because if they are true performance aficionados, they will never create an array in D, because of the default initialization. :) So, ubyte[] it is <g>

Seriously, is performance a concern for D? If it truly is, default initialization should be something that can be turned off if I take ownership of the potential consequences, no?
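
(Something like an explicit "leave it uninitialized" marker would do - a sketch only, assuming such syntax were adopted:)

int[1000] a = void;   // hypothetical opt-out: skip default initialization on purpose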

Scott