July 27, 2004
Berin Loritsch wrote:
> parabolis wrote:
> 
>> I thought I remembered reading that Java was originally designed
>> for appliance microprocessors, but I could be wrong.
> 
> 
> Ok, you are going pre-Sun involvement here...
No, what I remember reading was that Sun wanted to dabble in
'smart' appliances... But this is a vague impression I have
from an article I read 5+ years ago...
> 
>>
>> As for the unsigned primitive... Consider java.lang.String's:
>>
> 
> <snip/>
> 
> Let me just say that it doesn't have a serious impact on day to day
> programming activities--even if it is not "ideologically pure".  Most
> values used in day to day development fall well within the signed
> positive value range.  Most folks don't even worry about whether it
> would be more efficient to use a byte or an int.  We just use ints
> because the performance gains of using the smaller primitive are
> nowhere near the gains of improving the algorithm.

(Not that it matters, but I believe using a 32-bit condition variable on a 32-bit machine is actually faster than a type with fewer bits...)

> But that's just my experience (public projects I have worked on include
> Apache Avalon, Apache Cocoon, Apache JMeter, Apache Axis, and the
> D-Haven projects).  I know this is a D forum, but I am including these
> to add weight to the argument that signed vs. unsigned arguments really
> don't impact most average programs all that much.

I apologize if I seemed to be arguing that unsigned is inherently better. I was just trying to make the point
that not only do I have to avoid using my default in
Java, but I also have to guard against conditions that
are a direct result of my not getting to use unsigned.

Perhaps a better way to make the point is imagine a
language which does not allow the use of integer
types. So now fictional.lang.String has the function:

  copyValueOf(char[] data, float offset, float count)

And you have to write similar methods yourself and check
to make sure the number is both integral and positive...
This is an overstatement of my frustrations but I think it
does illustrate what I mean.
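To make the hypothetical concrete, here is a Java sketch of the guard code such an API would force on you (the method and its checks are invented purely for illustration):

```java
// Hypothetical sketch: with float indices, every implementation needs
// guards that an integer parameter type would make unnecessary.
static String copyValueOf(char[] data, float offset, float count) {
    if (offset != Math.floor(offset) || count != Math.floor(count))
        throw new IllegalArgumentException("indices must be integral");
    if (offset < 0 || count < 0)
        throw new IllegalArgumentException("indices must be non-negative");
    return new String(data, (int) offset, (int) count);
}
```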
July 27, 2004
Arcane Jill wrote:

> In article <ce6cl2$2j39$1@digitaldaemon.com>, parabolis says...
> 
> 
>>So when I see a for loop with a signed
>>condition variable I wonder why someone would choose to do that.
> 
> 
> Well, here's one possible reason:
> 
> # for (int i=9; i>=0; --i) { /* blah */ }
> 
> is likely to be a few cycles faster than
> 
> # for (uint i=0; i<10; ++i) { /* blah */ }
> 
> (depending on how good the compiler is at optimizing - a black art about which I
> know nothing)

This will not hold true for all processor types, so it is generally
better to code normally and trust the compiler to do the right
optimization (if any).

But that is a whole other topic (premature optimizations, etc.)
July 27, 2004
Arcane Jill wrote:

> In article <ce6cl2$2j39$1@digitaldaemon.com>, parabolis says...
> 
> 
>>So when I see a for loop with a signed
>>condition variable I wonder why someone would choose to do that.
> 
> 
> Well, here's one possible reason:
> 
> # for (int i=9; i>=0; --i) { /* blah */ }
> 
> is likely to be a few cycles faster than
> 
> # for (uint i=0; i<10; ++i) { /* blah */ }
> 
> (depending on how good the compiler is at optimizing - a black art about which I
> know nothing)
> 
> Jill
> 
> 

Yes, I would wonder why you wrote that, and would probably assume it was to save a few cycles... However, more often I tend to see:

    for (int i=array.length-1; i>=0; --i) { /* blah */ }

which is not likely to be a few cycles faster than

    for (uint i=0; i<array.length; ++i) { /* blah */ }

But the general case is this horrible version:

# for (int i=0; i<String.length(); i++) { /* blah */ }
  (which calls String.length() every iteration)

Also consider this: I am assuming you are pointing out that using 0 as a sentinel is faster than another number.
And don't forget that any speed benefit from using 0 as a sentinel is completely negated on a processor which does not implement unsigned addition identically to signed addition.


And finally, do not dismiss the unsigned alternative to your original suggestion:

    for( uint i = 0xFFFFFFFA; i != 0; i++ )
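For what it's worth, the wraparound behaviour can be checked with plain two's-complement ints (a Java sketch, since Java ints wrap the same way the uint does here):

```java
// 0xFFFFFFFA is -6 as a signed 32-bit int; incrementing until the counter
// wraps to 0 gives exactly six iterations, with 0 as the loop sentinel.
static int countIterations() {
    int count = 0;
    for (int i = 0xFFFFFFFA; i != 0; i++) {
        count++;
    }
    return count;
}
```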
July 27, 2004
Arcane Jill wrote:
>>Arcane Jill wrote:
>>
>>
>>>It doesn't matter for me, though, as I don't use Java, and I intend for D to do
>>>better.
>>>
>>
>>In my opinion D is off to a really bad start with Unicode.
> 
> 
> The "start" hasn't even happened yet. What we have now isn't anything like what
> we're /going/ to have. There are /loads/ of (other) things that D doesn't have
> yet (like decent streams support), but most of these things are *in progress*.
> I'd say you made your call too early.

Actually the start has happened. What I was referring to was that the conception of the string in D has seemingly been defined. The Object.toString method returns a UTF sequence.
(I will explain further down...)

> 
> Look at it like this. D has only been around for three or four years, and it was
> basically a one-person project. We're not even at version 1.0 yet, so the best
...
> 
> And as for the future - well, for stage 2 we've got the normalization, canonical
...
> For stage three - and by this stage we'll be way ahead of the field - we'll have
...

Please forgive me, but I have only started figuring out Unicode this week, so a good deal of D's planned implementation consists of features that I at best partially understand. I am happy to know there is a master plan, however.

I applaud the robot builds.

I would be curious whether non-ASCII names will be supported (i.e. class and variable names, etc).

I am also curious about whether it will be possible for a non-English speaker to use their language's version of D's reserved words (e.g. the Swedish word for synchronized).

> 
> I think you have made your judgement too early. Phobos is tiny right now,
> compared with Java's vast array of classes. Deimos is even tinier, and somewhat
> more piecemeal. But already D's Unicode support is:
> 
> * Better than C
> * Better than C++
> * Catching up with Java (and better in some areas)
> 
> To expect the full whack right at the start is unrealistic (and we /are/ still
> right at the start). Walter was way too busy getting the core of the language
> together to start worrying about how you do uppercasing in Deseret*, but the
> language has now reached the point where we can do that.
> 
> So tell me. Against what are you comparing D? Java? Tell me in what ways you
> think D is behind? Tell me what does better than D, and in what way? I suspect
> you may be hard pressed to come up with examples.
> 

Ok, before I explain what aspects of D's Unicode implementation bother me, I feel, given the context of the thread, that I need to point out that I am not comparing D to any other language. I have used Java's Unicode-related documents only to clarify the diverse Unicode technical vocabulary.

As stated above (and in the 'Source level Java to D converter' thread) I do not agree with D's apparent conception of the string.

The D docs (in Arrays.Special Array Types.Strings) say this:
================================
Dynamic arrays in D suggest the obvious solution - a string is just a dynamic array of characters. String literals become just an easy way to write character arrays.
================================

I agree that a string is a sequence of characters. However D's conception of string seems to be a Unicode string which is most decidedly NOT a sequence of characters. Unicode defines a Character in a sensible fashion:

================================
(from http://www.unicode.org/glossary/)

Character. (1) The smallest component of written language that has semantic value; ...

Unicode String. A code unit sequence ...
================================

What D calls characters are in fact code units. A char, for example, is an 8-bit code unit that may in special cases represent a Character. Of course the type name 'char' was strongly suggested by C compatibility, so the misnomer was not wanton.

However, conceptually confusing a String of Characters with a Unicode String (of code units) led to what I consider a fairly glaring omission in even the most basic or unfinished library:

The most basic of String operations, length-query and substring, are not supported. It is clearly possible to count the code units with char[].length and it is possible to slice the code units with char[i..j]. But no predefined operations actually indicate how many Characters a char[] (or wchar[]) actually contains. To put it another way:

    char[].length != <String>.length
       char[i..k] != <String>.substring(i,j)
       char[].sort  (just amusing to consider)
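The same code-unit/Character mismatch is easy to demonstrate in Java, whose String is likewise a sequence of code units (UTF-16 in its case):

```java
// 'a', MUSICAL SYMBOL G CLEF (U+1D11E, stored as a surrogate pair), 'b'
String s = "a\uD834\uDD1Eb";
int codeUnits = s.length();                       // 4 UTF-16 code units
int characters = s.codePointCount(0, s.length()); // 3 Unicode characters
```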

It may seem like I am being overly pedantic, but I came to D without any knowledge of Unicode. It took me days to finally figure out that when anything D-related says 'string' it actually means something different from the intuitive notion of a string, from the formal notion of a string (assuming an alphabet must consist of Characters), and from Unicode's technical definition.

I would strongly suggest adding a String class to Phobos which implements a String of Characters, and reserving the term string for that class alone. Hence if a String class is written then Object.toString() should return a String reference.

Of course a String class is not the only valid solution and I do not have enough experience with D or Unicode to suggest that it would be the best. I certainly would not suggest that it should be done because that is how it was done in Java...

With that said, I do have doubts that a feasible solution exists without implementing a String class. Non-class methods would have to parse UTF once for each length and substring call, whereas a proper class implementation can do it in constant time (see the implementation suggestion below if in doubt).

I am under the impression that while Unicode has entities that require more than 16 bits to represent, it has been said that such 32-bit examples will be "vanishingly rare". Thus 16 bits is the normal case, with occasional use of 32-bit entities.

Optimizing the most frequent case suggests using an internal representation of wchar[] for the 16-bit entities and another sparse wchar array for the cases in which a wchar is too small. The query-length function is obviously constant time in all cases. However, so is the substring operation - thanks to copy-on-write.
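A minimal sketch of the pay-once idea (in Java, with invented names): scan the code units a single time at construction, after which length queries are constant time. A constant-time character-indexed substring would additionally need a mapping from character positions to code-unit positions, which is elided here.

```java
// Hypothetical sketch: count Unicode characters once, then answer
// length() in O(1) instead of rescanning on every call.
final class CountedString {
    private final char[] units;   // UTF-16 code units
    private final int charCount;  // number of Unicode characters

    CountedString(String s) {
        this.units = s.toCharArray();
        int n = 0;
        for (char c : units) {
            // Skip low surrogates (0xDC00-0xDFFF): they are the trailing
            // half of a surrogate pair, not a character of their own.
            if (c < '\uDC00' || c > '\uDFFF') n++;
        }
        this.charCount = n;
    }

    int length()         { return charCount; }    // O(1)
    int codeUnitLength() { return units.length; }
}
```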

I might also suggest considering making String an interface and implementing it in 3 separate classes (or more):
    1) String8
    2) String16
    3) String32
(these are horrible names, sorry)

The interface approach of course has the benefit that anybody who wants to tune a String class to work for them can either subclass an existing String class or write their own implementation (without inheriting superclass baggage) and still have the class recognized by Object.toString() and Exception.this(String).






July 28, 2004
parabolis wrote:
> 
> What D calls characters are in fact code units. A char for example is an 8-bit code unit that may in special cases represent a Character. Of course the type name 'char' was
> strongly suggested for C-compatibility so the misnomer was not wanton.

This is a multifaceted issue.  D supports UTF-8, UTF-16, and UTF-32 representations, stored in arrays of char, wchar, and dchar, respectively.  While char strings are technically UTF-8, there is a 1-1 correspondence between characters and bytes so long as the values are within the range of the ASCII character set.  And in the case of dchars, there (as far as I know) is always a 1-1 correspondence between D characters and Unicode characters.

> However conceptually confusing a String of Characters with a Unicode String (of code units) led to what I consider a fairly glaring omission in even the most basic or unfinished library:
> 
> The most basic of String operations, length-query and substring, are not supported. It is clearly possible to count the code units with char[].length and it is possible to slice the code units with char[i..j]. But no predefined operations actually indicate how many Characters a char[] (or wchar[]) actually contains. To put it another way:
> 
>     char[].length != <String>.length
>        char[i..k] != <String>.substring(i,j)
>        char[].sort  (just amusing to consider)

Good point.  However C++ has this exact same issue with its string class.  Perhaps the problem is one of semantics.  While C++ merely claims that its strings are an ordered sequence of bytes, the D documentation suggests that these bytes are in a specific encoding format (though the language does not require this).

> I may seem like I am being overly pedantic but I came to D without any knowledge of Unicode. It took me days to finally figure out that when anything D related says 'string' it actually means something different from the intuitive notion of a string, the formal notion of a string (assuming an alphabet must consist of Characters) and Unicode's technical definition.

Part of this has come about because we've been actively discussing internationalization recently, so much of what's said about strings is done so in that context.  I'm only passingly familiar with many of the details of Unicode as well, but I do believe that there is room in the language for both definitions of "string."

> I would strongly suggest adding a String class to phobos which implements a String of Characters and reserve the term string to that class alone. Hence if a String class is written then Object.toString() should return a String reference.

True enough.  I agree that if a sequence of characters is to be printed then it must be properly encoded.  Whether the internal representation is properly encoded, however, isn't much of an issue to me, so long as there is a clear means of producing the encoded string when output is desired.

> With that said I do have doubts that a feasible solution exists without implementing a String class. Non-class methods would have to parse UTF once for each length and substring call whereas a proper class implementation can do it in constant time (see implementation suggestion below if in doubt).

True enough.  At the very least, we need some method of determining "true" string length, i.e. how many representable characters a string contains.  I have a feeling that there is a Unicode function for this, but I could not tell you its name.  Frankly, I suspect that we will begin to use dchar arrays more and more often to avoid the trouble that dealing with multibyte encodings causes.

> I might also suggest considering making String an interface and implementing it in 3 seperate classes (or more):
>     1) String8
>     2) String16
>     3) String32
> (these are horrible names, sorry)

I'm not sure if there's one in the DTL, but it might be worth waiting to see.  Assuming there is, I suspect that the signature would be along the lines of:

class String(CharT) {...}

so

String!(char);
String!(wchar);
String!(dchar);


Sean
July 28, 2004
parabolis wrote:
>
> But the general case is this horrible version:
> 
> # for (int i=0; i<String.length(); i++) { /* blah */ }
>   (which calls String.length() every iteration)

Though I'm generally too prone to premature optimization to do this, I think the above code has the potential to be just as fast as the unsigned version.  String likely contains a size_t variable to represent string length, and it would be trivial for a compiler to inline calls to the String.length() function.  Unless you want to compare results on a per-instruction basis, I would not be too concerned with performance differences between the calls.


Sean
July 28, 2004
In article <ce6pke$2o2m$1@digitaldaemon.com>, parabolis says...
>
>Actually the start has happened. What I was referring to was that the conception of the string in D has seemingly been defined. The Object.toString method returns a UTF sequence. (I will explain further down...)
>
>I would be curious whether non-ASCII names will be supported (ie classes and variable names, etc).

You'll have to ask Walter that one. (I mean, you'll have to wait and see if Walter answers this question). I suspect not, because I'm only providing a library, and it's written in D. The DMD compiler is written in C, and so can't call D libraries, and therefore won't be able to take advantage of any D library I provide. Adding Unicode support to /the compiler/ would also bloat the compiler somewhat. But that's just a guess. As I said, only Walter can answer this one definitively.


>I am also curious about whether it will be possible for a non-English speaker to use their language's version of D's reserved words (ie Swedish word for synchronized, etc).

I'd be surprised if that were so. Syntax analysis happens /before/ semantic analysis, and syntax analysis needs to know all the reserved words. But again, I'm just guessing. Only Walter can be definitive.


>Ok before I explain what aspects of D's Unicode implementation bother me I feel given the context of the thread that I need to point out that I am not comparing D to any other language. I have used Java's Unicode related documents only to clarify the diverse Unicode technical vocabulary.
>
>As stated above (and in the 'Source level Java to D converter' thread) I do not agree with D's apparent conception of the string.
>
>The D docs (in Arrays.Special Array Types.Strings) say this:
>================================
>Dynamic arrays in D suggest the obvious solution - a string is
>just a dynamic array of characters. String literals become just
>an easy way to write character arrays.
>================================
>
>I agree that a string is a sequence of characters. However D's conception of string seems to be a Unicode string which is most decidedly NOT a sequence of characters. Unicode defines a Character in a sensible fashion:
>
>================================
>(from http://www.unicode.org/glossary/)
>
>Character. (1) The smallest component of written language that has semantic value; ...
>
>Unicode String. A code unit sequence ... ================================
>
>What D calls characters are in fact code units.

Correct.


>A char for
>example is an 8-bit code unit that may in special cases
>represent a Character. Of course the type name 'char' was
>strongly suggested for C-compatibility so the misnomer was not
>wanton.

I think it was also chosen for ASCII compatibility. It makes sense for Westerners. "hello world\n" has got twelve characters in it, as well as twelve code units. See - D is trying to educate people /gently/. If it had started out with the following as basic types:

*    codeunit    // UTF-8 code unit
*    wcodeunit   // UTF-16 code unit
*    dcodeunit   // UTF-32 code unit
*    char        // 32-bit wide character (same as dcodeunit)

then everything would have worked, but people who used mostly ASCII would likely go: Eh? And ASCII strings would be four times as long.


>However conceptually confusing a String of Characters with a Unicode String (of code units) led to what I consider a fairly glaring omission in even the most basic or unfinished library:
>
>The most basic of String operations, length-query and substring, are not supported. It is clearly possible to count the code units with char[].length and it is possible to slice the code units with char[i..j]. But no predefined operations actually indicate how many Characters a char[] (or wchar[]) actually contains. To put it another way:
>
>     char[].length != <String>.length
>        char[i..k] != <String>.substring(i,j)

That's only partially true. As noted above, it's the /names/ for things that are wrong, not that things are absent. If you pretend that "dchar" is the character type, rather than "char", then you /do/ get the behavior you desire. You /could/ simply pretend that char and wchar don't exist, if you really wanted.


>        char[].sort  (just amusing to consider)

This one will actually work. Lexicographical UTF-8 order is the same as lexicographical Unicode order.


>I may seem like I am being overly pedantic but I came to D without any knowledge of Unicode.

Most people do, and you're not being overly pedantic.


>It took me days to finally figure out that when anything D related says 'string' it actually means something different from the intuitive notion of a string, the formal notion of a string (assuming an alphabet must consist of Characters) and Unicode's technical definition.

Mebe, but it's no different from a string in any other computer language. In /no/ language of which I am aware is a string an array of Unicode characters. In C and C++ on Windows, for example, a char is eight bits wide, and so /obviously/ can't store all Unicode characters. In fact, it's very hard for C source code to know the encoding of a C string, and everything will work fine only if everything sticks to the system default. This makes internationalization much harder.



>I would strongly suggest adding a String class to phobos which implements a String of Characters and reserve the term string to that class alone. Hence if a String class is written then Object.toString() should return a String reference.

We D users can write a Unicode aware String class (and I believe Hauke is doing that); we can publish it; we can even /suggest/ that it be moved into Phobos. But Walter is the only one who can approve/disapprove/implement that suggestion. Phobos is Walter's baby. Deimos is one place where we can put things in the meantime, but the tight integration that you suggest can only happen if everything is in the same place.

But I'm tempted to ask why? I mean, what's wrong with a char[] (UTF-8 sequence)?
It's good enough for many purposes, especially for mostly-ASCII strings (which
Object.toString() is likely to return), and you can always convert it to a
String (pending such a class) if you want more functionality.


>Of course a String class is not the only valid solution and I do not have enough experience with D or Unicode to suggest that it would be the best. I certainly would not suggest that it should be done because that is how it was done in Java...

True. And Java made the mistake of declaring a String class /final/. I found that damned annoying, as I couldn't extend it. If I wanted additional functionality not provided by Java String, I would have had to have written a brand new class from scratch, and even then it wouldn't have cast. I seriously hope D doesn't make /that/ mistake. However much functionality a String may provide, there's always going to be at least one user who wants /just one more function/.



>I am under the impression that while Unicode has entities that require more than 16 bits to represent, it has been said that such 32-bit examples will be "vanishingly rare". Thus 16 bits is the normal case with occasional use of 32-bit entities.

Depends what you want to do. As a musician, I've often wanted to use the musical characters U+1D100 to U+1D1DD. As a mathematician, I similarly would want to use the mathematical letters U+1D400 to U+1D7FF. Mystical types would probably like to use the tetragrams between U+1D306 and U+1D356. So you see, the characters beyond U+FFFF are not /all/ strange alphabets we've never heard of, and I certainly wouldn't call the desire to go beyond U+FFFF "vanishingly rare".



>Optimizing the most frequent case suggests using an internal representation of wchar[] for the 16 bit entities

Makes sense

>and another sparse wchar array for the cases in which a wchar is too small.

I don't understand that. UTF-16 is better, from the point of view of most common case and memory usage.


>The query-length function is obviously constant time in all cases. However so is the substring operation - thanks to copy-on-write.

It's not /that/ hard to count characters in UTF-8 and UTF-16. In UTF-8, you only have to ignore code units between 0x80 and 0xBF, and in UTF-16 you only have to ignore code units between 0xDC00 and 0xDFFF. Count all the rest and you've got the number of characters.
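That counting rule for UTF-8 can be sketched in a few lines (Java used here purely for illustration): every byte except a continuation byte starts a new character.

```java
// Count characters in UTF-8 by skipping continuation bytes, which have
// the bit pattern 10xxxxxx (values 0x80-0xBF).
static int countUtf8Chars(byte[] utf8) {
    int count = 0;
    for (byte b : utf8) {
        if ((b & 0xC0) != 0x80) count++;  // not a continuation byte
    }
    return count;
}
```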

Nice thoughts though. Keep them coming.

Jill


July 28, 2004
> The D docs (in Arrays.Special Array Types.Strings) say this:
> ================================
> Dynamic arrays in D suggest the obvious solution - a string is
> just a dynamic array of characters. String literals become just
> an easy way to write character arrays.
> ================================
> 
> I agree that a string is a sequence of characters. However D's conception of string seems to be a Unicode string which is most decidedly NOT a sequence of characters. Unicode defines a Character in a sensible fashion:
> 
> ================================
> (from http://www.unicode.org/glossary/)
> 
> Character. (1) The smallest component of written language that
> has semantic value; ...

The section of the D doc that you quote is followed by an example and then
"char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format.
dchar[] strings are in UTF-32 format."
Would it help to move those sentences to right after the one you quote
instead of putting them after the example? That way users will see the UTF-8
reference and realize how Walter is using the word "character" and the type "char".

Or maybe change the first sentence to "Dynamic arrays in D suggest the obvious solution - a string is just a dynamic array of characters in UTF-8, UTF-16 or UTF-32 format." That would nip in the bud any questions about what is meant by the word "character".


July 28, 2004
>Or maybe change the first sentence to "Dynamic arrays in D suggest the obvious solution - a string is just a dynamic array of characters in UTF-8, UTF-16 or UTF-32 format." That would nip in the bud any questions about what is meant by the word "character".

That works for me.


July 28, 2004
Sean Kelly wrote:

> parabolis wrote:
> 
>>
>> What D calls characters are in fact code units. A char for example is an 8-bit code unit that may in special cases represent a Character. Of course the type name 'char' was
>> strongly suggested for C-compatibility so the misnomer was not wanton.
> 
> 
> This is a multifaceted issue.  D supports UTF-8, UTF-16, and UTF-32 representations, stored in arrays of char, wchar, and dchar, 

Yes, Unicode calls them code units instead of characters because they do not always represent a character.

> respectively.  While char strings are technically UTF-8, there is a 1-1 correspondence between characters and bytes so long as the values are within the range of the ASCII character set.  And in the case of dchars, there (as far as I know) is always a 1-1 correspondence between D characters and Unicode characters.

Yes, that would be the special case in which a single code unit suffices to be interpreted as a Character.

> 
>> However conceptually confusing a String of Characters with a Unicode String (of code units) led to what I consider a fairly glaring omission in even the most basic or unfinished library:
>>
>> The most basic of String operations, length-query and substring, are not supported. It is clearly possible to count the code units with char[].length and it is possible to slice the code units with char[i..j]. But no predefined operations actually indicate how many Characters a char[] (or wchar[]) actually contains. To put it another way:
>>
>>     char[].length != <String>.length
>>        char[i..k] != <String>.substring(i,j)
>>        char[].sort  (just amusing to consider)
> 
> 
> Good point.  However C++ has this exact same issue with its string class.  Perhaps the problem is one of semantics.  While C++ merely claims that its strings are an ordered sequence of bytes, the D documentation suggests that these bytes are in a specific encoding format (though the language does not require this).

If there were actually a string class I would not expect the above to hold. I simply meant that there is no way currently in D to find any of:

    <String>.length
    <String>.substring(i,j)

Because only these are implemented:
     char[].length
        char[i..k]
> 
>> I may seem like I am being overly pedantic but I came to D without any knowledge of Unicode. It took me days to finally figure out that when anything D related says 'string' it actually means something different from the intuitive notion of a string, the formal notion of a string (assuming an alphabet must consist of Characters) and Unicode's technical definition.
> 
> 
> Part of this has come about because we've been actively discussing internationalization recently, so much of what's said about strings is done so in that context.  I'm only passingly familiar with many of the details of Unicode as well, but I do believe that there is room in the language for both definitions of "string."

I think you may be missing my point. I am not suggesting eliminating "Unicode string" support for the sake of a 1:1 correspondence between a primitive type and a character.  I am saying that there is really only one definition of "string", and calling sequences of code units 'strings' does not fit any standard notion of a "string".

> 
>> With that said I do have doubts that a feasible solution exists without implementing a String class. Non-class methods would have to parse UTF once for each length and substring call whereas a proper class implementation can do it in constant time (see implementation suggestion below if in doubt).
> 
> 
> True enough.  At the very least, we need some method of determining "true" string length, i.e. how many representable characters a string contains.  I have a feeling that there is a Unicode function for this, but I could not tell you its name.  Frankly, I suspect that we will begin to use dchar arrays more and more often to avoid the trouble that dealing with multibyte encodings causes.
> 
>> I might also suggest considering making String an interface and implementing it in 3 separate classes (or more):
>>     1) String8
>>     2) String16
>>     3) String32
>> (these are horrible names, sorry)
> 
> 
> I'm not sure if there's one in the DTL, but it might be worth waiting to see.  Assuming there is, I suspect that the signature would be along the lines of:
> 
> class String(CharT) {...}
> 
> so
> 
> String!(char);
> String!(wchar);
> String!(dchar);
> 

I think a templated version of String should also implement a String interface, because that would still allow other implementations to be used:

   interface String
   class StringT(CharT) : String {...}
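For what it's worth, Java's standard library already follows this pattern with its CharSequence interface, which String, StringBuilder, and CharBuffer all implement; code written against the interface accepts any implementation:

```java
// Code written against the interface accepts any string-like class.
static int totalLength(CharSequence... seqs) {
    int sum = 0;
    for (CharSequence s : seqs) sum += s.length();
    return sum;
}
```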