August 27, 2004
In article <cgmr7f$1b90$1@digitaldaemon.com>, Arcane Jill says...

>10400..1044F; Deseret
>*) The Deseret script (a dead language, so far as I know)


"The Deseret Alphabet was designed as an alternative to the Latin alphabet for writing the English language. It was developed during the 1850s by The Church of Jesus Christ of Latter-day Saints (also known as the "Mormon" or LDS Church) under the guidance of Church President Brigham Young (1801-1877). Brigham Young's secretary, George D. Watt, was among the designers of the Deseret Alphabet."

"The LDS Church published four books using the Deseret Alphabet" #                         ^^^^^^^^^^

See http://www.molossia.org/alphabet.html

Jill


August 28, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgfen9$rhq$1@digitaldaemon.com...
> Phobos UTF-8 code could be faster, I grant you. But perhaps it will be in
> the next release. We're only talking about a tiny number of functions here,
> after all.

My first goal with std.utf is to make it work right. Then comes the optimize-the-heck-out-of-it. std.utf is shaping up to be a core dependency for D, so making it run as fast as possible is worthwhile. Any suggestions?


August 28, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgh8vi$1o7k$1@digitaldaemon.com...
> It's not invalid as such, it's just that the return type of an overloaded function has to be "covariant" with the return type of the function it's overloading. So it's a compile error /now/. But if dchar[] and char[] were to be considered mutually covariant then this would magically start to compile.

That would be nice, but I don't see how to technically make that work.


August 28, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgk32j$1fj$1@digitaldaemon.com...
> (2) String literals such as "hello world" should be interpreted as
> wchar[], not char[].

Actually, string literals are already interpreted as char[], wchar[], or dchar[] depending on the context they appear in. The compiler implicitly does a UTF conversion on them as necessary. If you have an overload based on char[] vs wchar[] vs dchar[] and pass a string literal, it should result in an ambiguity error.
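
For example (f here is just a hypothetical pair of overloads, purely for illustration):

#    void f(char[] s)  {}
#    void f(wchar[] s) {}
#
#    void test()
#    {
#        wchar[] w = "hello";   // the literal is interpreted as wchar[] here
#        dchar[] d = "hello";   // and as dchar[] here
#        // f("hello");         // ambiguous - both overloads match equally well
#        f(w);                  // fine, the type has already been decided
#    }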

The only place it would default to char[] would be when it is passed as a ... argument to a variadic function.


August 28, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgmr7f$1b90$1@digitaldaemon.com...
> UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
> text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.

While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast. Furthermore, you'll start swapping a lot sooner, and then performance takes a dive.

It makes sense for Java, Javascript, and for languages where performance is not a top priority to standardize on one character type. But if D does not handle ASCII very efficiently, it will not have a chance at interesting the very performance conscious C/C++ programmers.


August 28, 2004
Walter wrote:
> "Arcane Jill" <Arcane_member@pathlink.com> wrote in message
> news:cgk32j$1fj$1@digitaldaemon.com...
> 
>>(2) String literals such as "hello world" should be interpretted as
> 
> wchar[], not
> 
>>char[].
> 
> 
> Actually, string literals are already interpreted as char[], wchar[], or
> dchar[] depending on the context they appear in. The compiler implicitly
> does a UTF conversion on them as necessary. If you have an overload based on
> char[] vs wchar[] vs dchar[] and pass a string literal, it should result in
> an ambiguity error.

Is there any chance that this could be adjusted somehow?

I know the point is to avoid all the complications that the C++ approach entails, but this has a way of throwing a wrench in any interface that wants to handle all three.

Presently, we're given a choice: either handle all three char types and therefore demand ugly casts on all string literal arguments, or only handle one and force conversions that aren't necessarily required or desired by either the caller or the callee.
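
To make the ugly-cast half of that concrete (a hypothetical interface, purely for illustration):

#    void show(char[] s)  {}
#    void show(wchar[] s) {}
#    void show(dchar[] s) {}
#
#    void test()
#    {
#        // show("hello");              // ambiguity error under the current rules
#        show(cast(wchar[]) "hello");   // compiles, but every call site needs the cast
#    }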

If, say, in the case of an ambiguity, a string literal were assumed to be of the smallest char type for which a match exists, the code would compile and, in almost all cases, do the right thing.

It does complicate the rules some, but it seems preferable to the current dilemma.

 -- andy

August 28, 2004
In article <cgqlmm$2ui$1@digitaldaemon.com>, Walter says...
>
>
>"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgmr7f$1b90$1@digitaldaemon.com...
>> UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
>> text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
>
>While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast. Furthermore, you'll start swapping a lot sooner, and then performance takes a dive.
>
>It makes sense for Java, Javascript, and for languages where performance is not a top priority to standardize on one character type. But if D does not handle ASCII very efficiently, it will not have a chance at interesting the very performance conscious C/C++ programmers.

Agreed.  Frankly, I've begun to wonder just what the purpose of this discussion is.  I think it's already been agreed that none of the three char types should be removed from D, and it seems clear that there is no "default" char type.  Is this a nomenclature issue?  I.e., that the UTF-8 type is named "char" and thus considered to be somehow more important than the others?


Sean


August 28, 2004
In article <cgqmpo$387$1@digitaldaemon.com>, Andy Friesen says...
>
>I know the point is to avoid all the complications that the C++ approach entails, but this has a way of throwing a wrench in any interface that wants to handle all three.

I'm inclined to agree, though I'm wary of making char types a special case for overload resolution.  Perhaps a prefix to indicate type?

c"" // utf-8
w"" // utf-16
d"" // utf-32

Still not ideal, but it would require less typing than a cast :/


Sean


August 28, 2004
In article <cgqju0$2b4$1@digitaldaemon.com>, Walter says...

>My first goal with std.utf is to make it work right. Then comes the optimize-the-heck-out-of-it. std.utf is shaping up to be a core dependency for D, so making it run as fast as possible is worthwhile. Any suggestions?

It's only really UTF-8 decoding that's complicated. All the rest are pretty easy, even UTF-8 encoding (as I'm sure you know). The approach I took in the code sample I posted here a while back was to read the first byte (which will be the /only/ byte, in the case of ASCII) and use it as the index into a lookup table. A byte has only 256 possible values - 128 after you've eliminated ASCII chars, and you can look up both the sequence length (or 0 for illegal first-bytes), and the initial value for the (dchar) accumulator. Then you just get six more bits from each of the remaining bytes (after ensuring that the bit pattern is 10xxxxxx). This approach will fail to catch precisely /two/ non-shortest cases, so you have to test for them explicitly. Finally, you make sure that the resulting dchar is not a forbidden value.
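
Something along these lines, perhaps (an untested sketch only - the names, the error handling and the exact checks are mine, not lifted from std.utf, and a real version also needs a length check before each read):

#    ubyte[256] seqLen;   // sequence length for each possible first byte; 0 = illegal
#
#    static this()
#    {
#        for (uint b = 0x00; b <= 0x7F; b++) seqLen[b] = 1;   // ASCII
#        for (uint b = 0xC2; b <= 0xDF; b++) seqLen[b] = 2;
#        for (uint b = 0xE0; b <= 0xEF; b++) seqLen[b] = 3;
#        for (uint b = 0xF0; b <= 0xF4; b++) seqLen[b] = 4;
#        // everything else (0x80..0xC1 trailing/overlong first bytes, 0xF5..0xFF
#        // out-of-range first bytes) stays 0 and is rejected by the lookup alone
#    }
#
#    dchar decode(char[] s, inout uint i)
#    {
#        ubyte b = cast(ubyte) s[i++];
#        if (b < 0x80)
#            return cast(dchar) b;             // ASCII: one lookup and we're done
#        uint len = seqLen[b];
#        if (len == 0)
#            throw new Exception("invalid UTF-8 start byte");
#        uint c = b & (0x7F >> len);           // initial accumulator value
#        for (uint k = 1; k < len; k++)
#        {
#            ubyte t = cast(ubyte) s[i++];
#            if ((t & 0xC0) != 0x80)           // must be 10xxxxxx
#                throw new Exception("invalid UTF-8 trailing byte");
#            c = (c << 6) | (t & 0x3F);
#        }
#        // the two overlong cases the table can't catch, plus forbidden values
#        if ((len == 3 && c < 0x800) || (len == 4 && c < 0x10000)
#            || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
#            throw new Exception("invalid UTF-8 sequence");
#        return cast(dchar) c;
#    }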

(From memory, I think that in some cases your current code checks for errors which can never happen, such as checking for a non-shortest 5+ byte sequence /after/ overlong sequences have already been eliminated).

You could go further. Kris has mentioned that heap allocation is slow.
Presumably, you could start off by allocating a single char buffer of length 3*N
(if input=wchars) or 4*N (if input=dchars), encoding into it, and then reducing
its length. (Of course, the excess then won't be released).
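
For the dchar[] input case the shape would be roughly this (just a sketch - the name mirrors std.utf's toUTF8 but this is standalone code, and, as noted below, a real version must still reject invalid input):

#    char[] toUTF8(dchar[] s)
#    {
#        char[] buf = new char[4 * s.length];   // worst case: 4 bytes per dchar
#        uint j = 0;
#        foreach (dchar c; s)
#        {
#            if (c <= 0x7F)
#                buf[j++] = cast(char) c;
#            else if (c <= 0x7FF)
#            {
#                buf[j++] = cast(char)(0xC0 | (c >> 6));
#                buf[j++] = cast(char)(0x80 | (c & 0x3F));
#            }
#            else if (c <= 0xFFFF)
#            {
#                buf[j++] = cast(char)(0xE0 | (c >> 12));
#                buf[j++] = cast(char)(0x80 | ((c >> 6) & 0x3F));
#                buf[j++] = cast(char)(0x80 | (c & 0x3F));
#            }
#            else
#            {
#                buf[j++] = cast(char)(0xF0 | (c >> 18));
#                buf[j++] = cast(char)(0x80 | ((c >> 12) & 0x3F));
#                buf[j++] = cast(char)(0x80 | ((c >> 6) & 0x3F));
#                buf[j++] = cast(char)(0x80 | (c & 0x3F));
#            }
#        }
#        buf.length = j;   // shrink to fit; the excess isn't handed back
#        return buf;
#    }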

(But never let an invalid input go unnoticed. That would be one optimization too
many).



In article <cgqju1$2b4$2@digitaldaemon.com>, Walter says...
>>
>> It's not invalid as such, it's just that the return type of an overloaded function has to be "covariant" with the return type of the function it's overloading. So it's a compile error /now/. But if dchar[] and char[] were to be considered mutually covariant then this would magically start to compile.
>
>That would be nice, but I don't see how to technically make that work.

You're right. It wouldn't work.

Well, this is the sense in which D does have a "default" string type. It is the case where we see clearly that char[] has special privilege. Currently, Object - and therefore /everything/ - defines a function toString() which returns a char[]. It is not possible to overload this with a function returning wchar[], even if wchar[] would be more appropriate. This affects /all/ objects.
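
To make it concrete, this is exactly the thing that won't compile (the class name is made up):

#    class Message
#    {
#        // compile error: the return type has to be covariant with
#        // Object.toString()'s char[], and dchar[] is not
#        dchar[] toString() { dchar[] r = "some text"; return r; }
#    }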

What to do about it? Hmmm....

You could change the Object.toString to:

#    void toString(out char[] s) { /* implementation as before */ }
#    void toString(out wchar[] s) { char[] t; toString(t); s = t; }
#    void toString(out dchar[] s) { char[] t; toString(t); s = t; }

with the latter two employing implicit conversion for the "s = t;" part. Subclasses of Object which overloaded only toString(out char[]) would get the other two for free. But subclasses of Object which decided to go a bit further could return a wchar[] or a dchar[] directly to cut down on conversions.




>Actually, string literals are already interpreted as char[], wchar[], or dchar[] depending on the context they appear in. The compiler implicitly does a UTF conversion on them as necessary.

#    void main()
#    {
#        dchar[] s = "hello";
#        dchar[] t = s ~ " world";
#    }

The error is:
incompatible types for ((s) ~ (" world")): 'dchar[]' and 'char[]'

But yes - it works /nearly/ always, and that's cool. The above case would be covered by implicit conversion, of course (although that would defer the conversion from compile-time to run-time).

>If you have an overload based on
>char[] vs wchar[] vs dchar[] and pass a string literal, it should result in
>an ambiguity error.

Ah! That's the bit I didn't know. I was wondering how that context thing would work, given that signature matching happens /after/ the evaluation of the function's arguments' types.

You could fix this by allowing explicit UTF-8, UTF-16 and UTF-32 literals. Sean suggested c"", w"" and d"" (and similarly for char literals). That would fix it.


>> UTF-16 is not slower than UTF-8, even for pure ASCII. An app in which all
>> text is ASCII is going to be just as fast in UTF-16 as it is in ASCII.
>
>While the rest of your post has a great deal of merit, this bit here is just not true. A pure ASCII app written using UTF-16 will consume twice as much memory for its data, and there are a lot of operations on that data that will be correspondingly half as fast.

That's true. Guess I got a bit carried away there. I was thinking that statements like "c = *p++;" would compile to just one machine code instruction regardless of the data width, and that the byte-wide version wouldn't necessarily be the fastest. But I forgot about all the initializing and copying that you also have to do.


Arcane Jill


August 28, 2004
Arcane Jill wrote:
> In article <cgqju0$2b4$1@digitaldaemon.com>, Walter says...
...
> In article <cgqju1$2b4$2@digitaldaemon.com>, Walter says...
> 
>>>It's not invalid as such, it's just that the return type of an overloaded
>>>function has to be "covariant" with the return type of the function it's
>>>overloading. So it's a compile error /now/. But if dchar[] and char[] were
>>>to be considered mutually covariant then this would magically start to
>>>compile.
>>
>>That would be nice, but I don't see how to technically make that work.
> 
> 
> You're right. It wouldn't work.
> 
> Well, this is the sense in which D does have a "default" string type. It is the
> case where we see clearly that char[] has special privilege. Currently, Object -
> and therefore /everything/ - defines a function toString() which returns a
> char[]. It is not possible to overload this with a function returning wchar[],
> even if wchar[] would be more appropriate. This affects /all/ objects.
> 
> What to do about it? Hmmm....
> 
> You could change the Object.toString to:
> 
> #    void toString(out char[] s) { /* implementation as before */ }
> #    void toString(out wchar[] s) { char[] t; toString(t); s = t; }
> #    void toString(out dchar[] s) { char[] t; toString(t); s = t; }
> 
> with the latter two employing implicit conversion for the "s = t;" part.
> Subclasses of Object which overloaded only toString(out char[]) would get the
> other two for free. But subclasses of Object which decided to go a bit further
> could return a wchar[] or a dchar[] directly to cut down on conversions.

Could we ditch toString and replace the functionality with:

toUtf8(), toUtf16(), and toUtf32()
or toCharStr(), toWCharStr(), and toDCharStr()

Usually, the person writing the object could define one and the other two would call conversions.
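
Something like this, maybe (just a sketch - Widget is made up, and I'm assuming std.utf's conversion routines do the fallback work):

#    import std.utf;
#
#    class Widget
#    {
#        // the author writes only this one...
#        char[] toUtf8() { return "a widget"; }
#
#        // ...and the other two just convert
#        wchar[] toUtf16() { return toUTF16(toUtf8()); }
#        dchar[] toUtf32() { return toUTF32(toUtf8()); }
#    }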

There's probably some reason why this wouldn't work, but it's just such a pleasant idea to me that I was forced to share it.

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/