July 28, 2004
In article <ce7o9r$2uo$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <ce6pke$2o2m$1@digitaldaemon.com>, parabolis says...
>>
>>Actually the start has happened. What I was referring to was that the conception of the string in D has seemingly been defined. The Object.toString method returns a UTF sequence. (I will explain further down...)
>>
>>I would be curious whether non-ASCII names will be supported (ie classes and variable names, etc).
>
>You'll have to ask Walter that one. (I mean, you'll have to wait and see if Walter answers this question). I suspect not, because I'm only providing a library, and it's written in D. The DMD compiler is written in C, and so can't call D libraries, and therefore won't be able to take advantage of any D library I provide. Adding Unicode support to /the compiler/ would also bloat the compiler somewhat. But that's just a guess. As I said, only Walter can answer this one definitively.

Unless I don't understand the question (which is always a strong possibility), DMD already supports non-ASCII names for identifiers:

"Identifiers start with a letter, _, or unicode alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.)"

http://www.digitalmars.com/d/lex.html

I've tested it before and it worked for me.

jcc7
July 28, 2004
Arcane Jill wrote:

> 
> You'll have to ask Walter that one. (I mean, you'll have to wait and see if
....
> analysis, and syntax analysis needs to know all the reserved words. But again,
> I'm just guessing. Only Walter can be definitive.

I will probably wait to see if he responds to me. As I have said before, I imagine that growing Unicode acceptance will make these issues tractable for compiler writers, so languages in the near future will be designed with these aspects in mind.

>>A char for example is an 8-bit code unit that may in special cases represent a Character. Of course the type name 'char' was
>>strongly suggested for C-compatibility so the misnomer was not wanton.
> 
> 
> I think it was also chosen for ASCII compatibility. It makes sense for
> Westerners. "hello world\n" has got twelve characters in it, as well as twelve
> code units. See - D is trying to educate people /gently/. If it had started out
> with the following as basic types:
> 
> *    codeunit    // UTF-8 code unit
> *    wcodeunit   // UTF-16 code unit
> *    dcodeunit   // UTF-32 code unit
> *    char        // 32-bit wide character (same as dcodeunit)
> 
> then everything would have worked, but people who used mostly ASCII would likely
> go: Eh? And ASCII strings would be four times as long.
> 

Actually that facet of UTF is exactly why I want to see proper names used for things. People who generally use ASCII can expect a char to represent a Character, and people who generally use a subset of the 16-bit Unicode values can expect a wchar to represent a Character. Combine that with the fact that char appears to be short for Character, and it is obvious people will make the wrong overgeneralization that a char or wchar always represents a Character.

The result is very subtle bugs when assuming char[].length or wchar[].length counts the /Characters/ in an array or that char[i..k] slices /Characters/ in an array.
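For instance, here is a minimal sketch of the pitfall (assuming this source file is saved as UTF-8):
================================================================
    char[] s = "café";   // 'é' encodes as two code units: 0xC3 0xA9

    // s.length is 5 (code units), not 4 (Characters), and
    char[] t = s[3..4];
    // t now holds half of the encoding of 'é', an invalid UTF-8
    // fragment rather than a Character.
================================================================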

> 
> 
>>However conceptually confusing a String of Characters with a Unicode String (of code units) led to what I consider a fairly glaring omission in even the most basic or unfinished library:
>>
>>The most basic of String operations, length-query and substring, are not supported. It is clearly possible to count the code units with char[].length and it is possible to slice the code units with char[i..j]. But no predefined operations actually indicate how many Characters a char[] (or wchar[]) actually contains. To put it another way:
>>
>>    char[].length != <String>.length
       char[i..j] != <String>.substring(i,j)
> 
> 
> That's only partially true. As noted above, it's the /names/ for things that are
> wrong, not that things are absent. If you pretend that "dchar" is the character
> type, rather than "char", then you /do/ get the behavior you desire. You /could/
> simply pretend that char and wchar don't exist, if you really wanted.

But things are absent. D does not currently have the facility to perform the two most fundamental String operations. I really doubt that would be the case if it were not for an oversight.

> 
>>       char[].sort  (just amusing to consider)
> 
> 
> This one will actually work. Lexicographical UTF-8 order is the same as
> lexicographical Unicode order.

I agree with you, but you missed the multiple-code-unit case, where sort has the nice property that it /destroys/ a valid encoding:
================================================================
    char[] threeInOrderCharacters = [
        0xE6,0x97,0xA5,    // U+65E5
        0xE6,0x9C,0xAC,    // U+672C
        0xE8,0xAA,0x9E,    // U+8A9E
    ];

    void main(char[][] argv) {
        uint max = threeInOrderCharacters.length;
        threeInOrderCharacters.sort;
        for( uint i = 0; i < max; i++ ) {
            printf( "%2X ", threeInOrderCharacters[i] );
        }
    }
================================================================
    Output:
    97 9C 9E A5 AA AC E6 E6 E8
================================================================

> 
>>I may seem like I am being overly pedantic but I came to D without any knowledge of Unicode.
> 
> 
> Most people do, and you're not being overly pedantic.

lol, I know. I just don't want them to hate me.

> 
>>It took me days to finally figure out that when anything D related says 'string' it actually means something different from the intuitive notion of a string, the formal notion of a string (assuming an alphabet must consist of Characters) and Unicode's technical definition.
> 
> 
> Mebe, but it's no different from a string in any other computer language. In

C uses "string" the same way. If it did not then all the ctype.h functions would have to take pointers to char arrays to be able to answer the questions that they answer. The char type is so named because it was expected that a Character would be represented (wholly) by a char. So string.h was built assuming strlen gives the number of Characters.

C++ char and wchar_t /arrays/ should not be confused with a String. See the STL, which defines String to work in a manner consistent with my Characters-of-Strings notion.

Java obviously also uses "string" the same way... see java.lang.String.

I /suspect/ that Objective C and ECMA-262 also define Strings in a similar manner.

> /no/ language of which I am aware is a string an array of Unicode characters. In
> C and C++ on Windows, for example, a char is eight bits wide, and so /obviously/
> can't store all Unicode characters. In fact, it's very hard for C source code to
> know the encoding of a C string, and everything will work fine only if
> everything sticks to the system default. This makes internationalization much
> harder.

Perhaps I am being overly pedantic again, but consider U+0000 and U+0001. I believe calling the following Characters is acceptable:

    typedef bit tinyCharacter;
    tinyCharacter[] t_string = new tinyCharacter[n];

Here I have an array which is also a String since there is /always/ a 1:1 correspondence between elements and Characters.

Let me guess... You want Strings that support a larger subset of Unicode Characters?

Well, fortunately C originally supported the Unicode range from U+0000 to U+007F with arrays of Characters.

Arrays of Java's char type do not make a string. Likewise, arrays of char or wchar_t in C++ do not make strings. Fortunately there are String classes to support the more trying requirement of handling just the Unicode range from U+0000 to U+FFFF.

> 
> 
>>I would strongly suggest adding a String class to phobos which implements a String of Characters and reserve the term string to that class alone. Hence if a String class is written then Object.toString() should return a String reference.
> 
> 
> We D users can write a Unicode aware String class (and I believe Hauke is doing
> that); we can publish it; we can even /suggest/ that it be moved into Phobos.
> But Walter is the only one who can approve/disapprove/implement that suggestion.
> Phobos is Walter's baby. Deimos is one place where we can put things in the
> meantime, but the tight integration that you suggest can only happen if
> everything is in the same place.

I am happy he controls entries, as I am sure Phobos' quality will be much improved as a result.

> 
> But I'm tempted to ask why? I mean, what's wrong with a char[] (UTF-8 sequence)?

I believe I explain why further below in my post. If that does not answer the question you are asking, then please help me understand better what you want to know.

> 
>>Of course a String class is not the only valid solution and I do not have enough experience with D or Unicode to suggest that it would be the best. I certainly would not suggest that it should be done because that is how it was done in Java...
> 
> 
> True. And Java made the mistake of declaring a String class /final/. I found
> that damned annoying, as I couldn't extend it. If I wanted additional
> functionality not provided by Java String, I would have had to have written a
> brand new class from scratch, and even then it wouldn't have cast. I seriously
> hope D doesn't make /that/ mistake. However much functionality a String may
> provide, there's always going to be at least one user who wants /just one more
> function/.
> 
> 
> 
> 
>>I am under the impression that while Unicode has enities that require more than 16 bits to represent it has been said that such 32 bit examples will be "vanishingly rare". Thus 16 bits is the normal case with occasional use of 32 bit entities.
> 
> 
> Depends what you want to do. As a musician, I've often wanted to use the musical
> characters U+1D100 to U+1D1DD. As a mathematician, I similarly would want to use
> the mathematical letters U+1D400 to U+1D7FF. Mystical types would probably like
> to use the tetragrams between U+1D306 and U+1D356. So you see, the characters
> beyond U+FFFF are not /all/ strange alphabets we've never heard of, and I
> certainly wouldn't call the desire to go beyond U+FFFF "vanishingly rare".
> 

Actually the "vanishingly rare" from the Unicode documents meant the frequency with which they will be extedning beyond 32 bits. I wish I had a link so I could find it again...

However, I do still doubt the likelihood of ever seeing a full sentence which consists exclusively (or perhaps even mostly) of entities above U+FFFF. Perhaps a transcription in Linear B.

>>and another sparse wchar array for the cases in which a wchar is too small. 
> 
> 
> I don't understand that. UTF-16 is better, from the point of view of most common
> case and memory usage.

I apologize; I should have made this much clearer:
================================================================
    class String {
        private wchar[] loBits;      // low 16 bits, one entry per Character
        private SparseArray hiBits;  // upper bits for the rare Characters above U+FFFF
        // implementation here
    }
================================================================
For every Character in the String there is an entry for that Character in loBits, so length() can simply return loBits.length, which accurately indicates the number of Characters in the calling String object.

For any Unicode character with a value from U+0000 to U+FFFF, that value is stored in loBits and hiBits remains unchanged. For values greater than U+FFFF, the lowest 16 bits are stored in loBits and the upper 16 bits are stored in hiBits.

Memory usage will be almost exactly the same as encoding with UTF-16 (identical in big-O terms).
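Sketching the two operations against the class above (hiBits.slice is a hypothetical method of whatever sparse-array type is used, assumed to be constant time with copy-on-write):
================================================================
    // inside the String class above:
    uint length() {
        return loBits.length;           // one entry per Character: O(1)
    }

    String substring(uint i, uint j) {
        String s = new String;
        s.loBits = loBits[i..j];        // copy-on-write slice: O(1)
        s.hiBits = hiBits.slice(i, j);  // hypothetical slice method;
                                        // index rebasing elided
        return s;
    }
================================================================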

> 
>>The query-length function is obviously constant time in all cases. However so is the substring operation - thanks to copy-on-write.
> 
> 
> It's not /that/ hard to count characters in UTF-8 and UTF-16. In UTF-8, you only

It is not an issue of difficulty but rather efficiency. A String class can perform length and substring in constant time, whereas parsing UTF-16 will always require a loop; constant time is simply not possible.

So to recap, in big-O terms:
    1) The memory requirements of String are identical to UTF-16.
    2) For length():
       2a) The time requirement of String is O(1).
       2b) The time requirement of UTF-16 is O(N).
    3) For substring():
       3a) The time requirement of String is O(1).
       3b) The time requirement of UTF-16 is O(N).
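For comparison, here is a minimal sketch of what counting Characters in UTF-16 requires: a pass over every code unit to skip the trailing half of each surrogate pair.
================================================================
    uint utf16Length(wchar[] s) {
        uint count = 0;
        for (uint i = 0; i < s.length; i++) {
            // a lead surrogate (0xD800-0xDBFF) means the next code
            // unit belongs to the same Character
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
                i++;
            count++;
        }
        return count;
    }
================================================================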

July 28, 2004
Ben Hinkle wrote:

>>The D docs (in Arrays.Special Array Types.Strings) say this:
>>================================
>>Dynamic arrays in D suggest the obvious solution - a string is
>>just a dynamic array of characters. String literals become just
>>an easy way to write character arrays.
>>================================
>>
...
> The section of the D doc that you quote is followed by an example and then
> "char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format.
> dchar[] strings are in UTF-32 format."
> Would it help to move those sentences to right after the one you quote
> instead of putting it after the example? That way users will see that UTF-8
> and realize how Walter is using the words "character" and the type "char". 

No, I do not believe that would help. I think that would simply evade the issue that D has no type corresponding to Characters (and thus, in effect, has no String support whatsoever), because the docs also clearly state D *wants* to provide string support.

> Or maybe change the first sentence to "Dynamic arrays in D suggest the
> obvious solution - a string is just a dynamic array of characters in UTF-8,
> UTF-16 or UTF-32 format." Nipping in the bud any questions about what is
> meant by the word "character".

But that is not true. A string is a sequence of Characters, so it is not at all an obvious solution to implement strings using arrays of encoded data in which a Character occupies anywhere from one to four code units and must be parsed according to the appropriate UTF standard to obtain Character data.
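To make "must be parsed" concrete, here is a sketch of how even finding a single Character's width in UTF-8 requires inspecting the lead byte (it assumes a valid lead byte; validation is elided):
================================================================
    uint utf8Stride(ubyte lead) {
        if (lead < 0x80) return 1;   // 0xxxxxxx: ASCII range
        if (lead < 0xE0) return 2;   // 110xxxxx lead byte
        if (lead < 0xF0) return 3;   // 1110xxxx lead byte
        return 4;                    // 11110xxx lead byte
    }
================================================================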

July 28, 2004
In article <ce8e8g$al2$1@digitaldaemon.com>, J C Calvarese says...

>Unless I don't understand the question (which is always a strong possibility), DMD already supports non-ASCII names for identifiers:
>
>"Identifiers start with a letter, _, or unicode alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.)"

So it does. Cool.

I looked at that document (ISO/IEC 9899:1999(E) Appendix D). It describes a
fixed list of identifier characters, which will never change with time (as
opposed to up-to-date Unicode, which contains an ever-growing list, growing with
each new version of Unicode). Anyway, I'm impressed. This is brilliant.

Jill


July 28, 2004
J C Calvarese wrote:

> In article <ce7o9r$2uo$1@digitaldaemon.com>, Arcane Jill says...
> 
>>In article <ce6pke$2o2m$1@digitaldaemon.com>, parabolis says...
>>
>>>Actually the start has happened. What I was referring to was that the conception of the string in D has seemingly been defined. The Object.toString method returns a UTF sequence.
>>>(I will explain further down...)
>>>
>>>I would be curious whether non-ASCII names will be supported (ie classes and variable names, etc).
>>
>>You'll have to ask Walter that one. (I mean, you'll have to wait and see if
>>Walter answers this question). I suspect not, because I'm only providing a
>>library, and it's written in D. The DMD compiler is written in C, and so can't
>>call D libraries, and therefore won't be able to take advantage of any D library
>>I provide. Adding Unicode support to /the compiler/ would also bloat the
>>compiler somewhat. But that's just a guess. As I said, only Walter can answer
>>this one definitively.
> 
> 
> Unless I don't understand the question (which is always a strong possibility),
> DMD already supports non-ASCII names for identifiers:
> 
> "Identifiers start with a letter, _, or unicode alpha, and are followed by any
> number of letters, _, digits, or universal alphas. Universal alphas are as
> defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.)" 
> 
> http://www.digitalmars.com/d/lex.html
> 
> I've tested it before and it worked for me.
> 
> jcc7


Wow, I am impressed. That was really forward-thinking.
July 28, 2004
In article <ce8hhq$c9o$1@digitaldaemon.com>, parabolis says...

>The result is very subtle bugs when assuming char[].length or wchar[].length counts the /Characters/ in an array or that char[i..k] slices /Characters/ in an array.

I'm not disagreeing with you, but see my separate post on graphemes and glyphs and things. There are distinctions in Unicode which never existed in ASCII, so people are not used to them. In ASCII, every character was either a control or a grapheme. This correspondence no longer holds in Unicode, so even basing your strings on characters is not always the desirable thing to do.

What, for example, is (cast(dchar[]) "café")[3..4] ?
or...                 (cast(dchar[]) "café").length ?

The answer depends on how your text editor composed the "é" when you wrote the source code. To paraphrase you, the result is very subtle bugs when assuming dchar[].length counts the /graphemes/ in an array or that dchar[i..k] slices the /graphemes/ in an array. Of course "char" doesn't suggest "grapheme" in the same way that it suggests "character" - but in reality, most people don't know the difference (because there pretty much is no difference in ASCII).
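To make that concrete (a sketch using module-level initializers; U+0301 is the combining acute accent):

    dchar[] composed   = [ 'c', 'a', 'f', '\u00E9' ];      // 4 dchars
    dchar[] decomposed = [ 'c', 'a', 'f', 'e', '\u0301' ]; // 5 dchars

Both arrays display as "café", yet .length is 4 for one and 5 for the other, and [3..4] yields "é" in the first case but a bare "e" in the second.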

So - like I say - I'm not disagreeing with you. But I don't see where you're going with this. I see the flaws in current support, and I think "We can fix that". Hence the planned future functionality. You see the same flaws, but you seem instead to be saying "ditch the char". But you know that's not going to happen. Have I misunderstood you?


>>>       char[].sort  (just amusing to consider)
>> 
>> This one will actually work. Lexicographical UTF-8 order is the same as lexicographical Unicode order.
>
>I agree with you, but you missed the multiple-code-unit case, where sort has the nice property that it /destroys/ a valid encoding:

Yeah, my bad. I read that as char[][].sort. You're right that char[].sort will break the (conceptual) char[] invariant. You'll get a UTF conversion exception later on.

I see what you're saying, but I'm sure that a string class will exist in the future. That it doesn't exist yet, to me, makes it just something to look forward to, not the end of the world.



>See the STL, which defines String to work in a manner consistent with my Characters-of-Strings notion.

Now that's cheating. std::string (being a typedef for std::basic_string<char>)
has the same concept of character as (char *). It's dependent on the source code
encoding.


>Java obviously also uses "string" the same way... see java.lang.String.

I just looked at it. Seems to be based on 16-bit wide Java chars to me. That smells of UTF-16, hence /not/ the 1-1 correspondence you suggest.
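In D terms (a sketch): one Character beyond U+FFFF costs two of those 16-bit code units.

    wchar[] clef = [ 0xD834, 0xDD00 ]; // surrogate pair encoding U+1D100
    // clef.length is 2, yet it holds exactly one Character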



>Perhaps I am being overly pedantic again, but consider U+0000 and U+0001. I believe calling the following Characters is acceptable:
>
>     typedef bit tinyCharacter;

Errm. Sort of. Really the only definition of "character" that makes sense is that a character is a member of some character set, so if you first defined a character set with two characters in it, then you could indeed encode such characters with one bit. But you can't just go around picking arbitrary subsets of existing character sets and representing them in fewer than the required number of bits.



>Let me guess... You want Strings that support a larger subset of Unicode Characters?

Either we're talking Unicode or we're not. There are Unicode strings; there are Latin-1 strings; there are ASCII strings. I don't get the question.



>Well, fortunately C originally supported the Unicode range from U+0000 to U+007F with arrays of Characters.

If we're going to be /really/ pedantic here, it did not. It supported ASCII. The fact that there is a 1-1 correspondence between the codepoints of ASCII and the codepoints U+0000 to U+007F of Unicode was a design feature of Unicode, not a design feature of C.

But really, you know - who cares? I mean, I see no point in this little tangent. I think we've drifted into the utterly trivial here, and I'm keen to move out of it.


>However, I do still doubt the likelihood of ever seeing a full sentence which consists exclusively (or perhaps even mostly) of entities above U+FFFF.

Depends what language you speak.



July 28, 2004
Arcane Jill wrote:

> In article <ce8hhq$c9o$1@digitaldaemon.com>, parabolis says...
> 
> 
>>The result is very subtle bugs when assuming char[].length or wchar[].length counts the /Characters/ in an array or that char[i..k] slices /Characters/ in an array.
> 
> 
> I'm not disagreeing with you, but see my separate post on graphemes and glyphs
> and things. There are distinctions in Unicode which never existed in ASCII, so
> people are not used to them. In ASCII, every character was either a control or a
> grapheme. This correspondence no longer holds in Unicode, so even basing your
> strings on characters is not always the desirable thing to do.
> 
> What, for example, is (cast(dchar[]) "café")[3..4] ?
> or...                 (cast(dchar[]) "café").length ?
> 
> The answer depends on how your text editor composed the "é" when you wrote the
> source code. To paraphrase you, the result is very subtle bugs when assuming
> dchar[].length counts the /graphemes/ in an array or that dchar[i..k] slices the
> /graphemes/ in an array. Of course "char" doesn't suggest "grapheme" in the same
> way that it suggests "character" - but in reality, most people don't know the
> difference (because there pretty much is no difference in ASCII).
> 
> So - like I say - I'm not disagreeing with you. But I don't see where you're
> going with this. I see the flaws in current support, and I think "We can fix
> that". Hence the planned future functionality. You see the same flaws, but you
> seem instead to be saying "ditch the char". But you know that's not going to
> happen. Have I misunderstood you?
> 

Perhaps I am missing something, but the general idea that I am used to operating with is a standard internal to the language. I.e., all strings are encoded as UTF-32BE, but the IO layer should be able to translate between the native string format and whatever external format is necessary/available. So the file (from your editor) might be written in UTF-8, but attaching an encoding scheme to your IO stream would convert it to UTF-32BE, which would be native for the language.
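For instance, here is a minimal sketch of the boundary conversion, hand-rolled for illustration rather than using any particular library call (and with validation of the input elided):

    // decode external UTF-8 bytes into a native dchar[] string
    dchar[] toNative(ubyte[] input) {
        dchar[] result;
        uint i = 0;
        while (i < input.length) {
            ubyte b = input[i];
            uint c, extra;
            if      (b < 0x80) { c = b;        extra = 0; }
            else if (b < 0xE0) { c = b & 0x1F; extra = 1; }
            else if (b < 0xF0) { c = b & 0x0F; extra = 2; }
            else               { c = b & 0x07; extra = 3; }
            i++;
            // each continuation byte contributes 6 payload bits
            for (uint k = 0; k < extra; k++, i++)
                c = (c << 6) | (input[i] & 0x3F);
            result ~= cast(dchar) c;
        }
        return result;
    }

Writing the matching encoder for output would complete the decoupling: the program itself only ever sees dchar[].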

I am only using it as an example.  I do the same thing with things
bigger than strings myself.  For example, I have a model that becomes
the basis for decoupling the translation side and the usage side.  It
works very well.

As long as the library was consistent with its standard, wouldn't that
work well for D?
July 28, 2004
In article <ce8i48$civ$1@digitaldaemon.com>, parabolis says...

>No, I do not believe that would help. I think that would simply evade the issue that D has no type corresponding to Characters

except of course dchar

>(and thus, in effect, has no String support)
>whatsoever

except of course dchar[]


Jill



July 28, 2004
Arcane Jill wrote:

I am leaving everything you said there...


> 
> Yeah, my bad. I read that as char[][].sort. You're right that char[].sort will
> break the (conceptual) char[] invariant. You'll get a UTF conversion exception
> later on.

Actually, on the topic of UTF conversion exceptions... there really is no such thing according to the standard.

Personally, I also prefer failing fast, but I figured I would point out that it is non-standard behaviour.

> 
> I see what you're saying, but I'm sure that a string class will exist in the
> future. That it doesn't exist yet, to me, makes it just something to look
> forward to, not the end of the world.
> 

Back to the comment that started this discussion:
================================================
In my opinion D is off to a really bad start with Unicode.
================================================

And the reason for the comment:
================================================
I have only seen phobos.std.string and the D docs, which mistakenly say UTF implements Strings. I was not previously privy to D's Unicode plans. I saw what appeared to be a significant ambiguity between the docs' use of String and Unicode string, and that suggested a bad start.
================================================

I am much less pessimistic now that I know D will support intuitive Strings (and indeed a plethora at that - Character, Grapheme and Glyph Strings).

I will be quite blown away if D actually manages to pull this off without requiring any knowledge of Unicode except where Unicode-specific features are required.

> 
> But really, you know - who cares? I mean, I see no point in this little tangent.
> I think we've drifted into the utterly trivial here, and I'm keen to move out of
> it. 

I think if it has any relevance it will come up in the newly started thread and is probably best dealt with there.


>>However, I do still doubt the likelihood of ever seeing a full sentence which consists exclusively (or perhaps even mostly) of entities above U+FFFF.
> 
> 
> Depends what language you speak.

I would be surprised to find that the UC has defined characters in that range that are used in a living language. I was kind of hoping you might have an example.
July 28, 2004
Sean Kelly wrote:

> parabolis wrote:
> 
>>
>> But the general case is this horrible version:
>>
>> # for (int i=0; i<String.length(); i++) { /* blah */ }
>>   (calls String.length() every iteration)
> 
> 
> Though I'm generally too prone to premature optimization to do this, I think the above code has the potential to be just as fast as the 

Emphasis on "potential". That means you must first be able to guarantee that any compiler that gets your code will optimize it before you can feel safe writing it.
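The safe habit is to hoist the call yourself (a sketch, with s standing in for some String instance):
================================================================
    // evaluate length() once instead of once per iteration
    uint len = s.length();
    for (uint i = 0; i < len; i++) { /* blah */ }
================================================================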