September 29, 2006
Immutability, and the guarantees it provides about the validity of a string's state in a concurrent setting, are what set Java strings apart. A garbage-collected language without immutable strings in the standard library is quite out of the ordinary.

Walter Bright wrote:
> Derek Parnell wrote:
>> And is it there yet? I mean, given that a string is just a lump of text, is
>> there any text processing operation that cannot be simply done to a char[]
>> item? I can't think of any but maybe somebody else can.
> 
> I believe it's there. I don't think std::string or java.lang.String have anything over it.
> 
>> And if a char[] is just as capable as a std::string, then why not have an
>> official alias in Phobos? Will 'alias char[] string' cause anyone any
>> problems?
> 
> I don't think it'll cause problems, it just seems pointless.
September 29, 2006
BCS wrote:
> Johan Granberg wrote:
> 
>>
>>
>> I completely agree, char should hold a character independently of encoding and NOT a code unit or something else. I think it would be
>> beneficial to D in the long term if chars were done right (meaning that they can store any character). How it is implemented is not important, and I believe performance is not a problem here, so ease of use and correctness would be appreciated.
> 
> 
> Why isn't performance a problem?
> 
> If you are saying that this won't cause performance hits in run times or  memory space, I might be able to buy it, but I'm not yet convinced.
> 
> If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise.
> 
> In my opinion, any compiled language should allow fairly direct access to the most efficient practical means of doing something*. If I didn't care about speed and memory I would use some sort of scripting language.
> 
> A good set of libs should make most of this moot. Leave the char as is and define a typedef struct or whatever that provides the added functionality that you want.
> 
> * OTOH a language should not mandate code to be efficient at the expense of ease of coding.

I will go ahead and say that the current state of char[] is incorrect. That is, if you write a program manipulating char[] strings, then run it in China, you will be disappointed with the results.  It won't matter how fast the program runs, because bad stuff will happen, like entire strings becoming unreadable to the user.

Technically if you follow UTF and do your char[] manipulations very carefully, it is correct, but realistically few if any people will do such things (I won't).  Also, if you do this, your program will probably run as slow as one with the proposed char/string solution, maybe slower (since language/stdlib level support can be heavily optimized).
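To illustrate what "very carefully" means here, stepping through a char[] one code point at a time by hand looks something like this (a sketch, assuming the std.utf module of the time, whose decode advances an index past each code point):

```d
import std.utf;    // decode
import std.stdio;  // writefln

void main()
{
    char[] s = "naïve";   // the 'ï' takes two code units in UTF-8
    size_t i = 0;
    while (i < s.length)
    {
        size_t start = i;
        dchar c = std.utf.decode(s, i);  // decodes one code point, advances i
        writefln("code point '%s' spans bytes %d .. %d", ""d ~ c, start, i);
    }
}
```

Forget the decode call and index by bytes instead, and you are back to splitting multi-byte sequences.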

What I'd like, then, is a program that is correct first, and as fast as possible while remaining correct.

Sure you can get some speed gains by just using ASCII and saying to hell with UTF, but you should probably only do that when profiling has shown that such speed gains are actually useful/needed in your program.

Ultimately we have to decide whether we want D to default to UTF code, which might run slightly slower but allows better localization and international friendliness, or to default to ASCII or some such encoding that runs slightly faster but is mostly limited to English.

I'd like the default to be UTF.  Then we can have a base of code to correctly manipulate UTF strings (in Phobos and language supported). Writing correct ASCII manipulation routines without good library/language support is a lot easier than writing good UTF manipulation routines without good library/language support, and UTF will probably be used much more than ASCII.

Also, if we move over to full blown UTF, we won't have to give up ASCII.  It seems to me like the Phobos std.string functions are pretty much ASCII string manipulating functions (no multibyte string support).  So just copy those out to a separate library, call it "ASCII lib", and there's your library support for ASCII.  That leaves string literals, which is a slight problem, but I suppose easily fixed:
ubyte[] hi = "hello!"a;
Just add a postfix 'a' for strings which makes the string an ASCII literal, of type ubyte[].  D arrays don't seem powerful enough to do UTF manipulations without special attention, but they are powerful enough to do ASCII manipulations without special attention, so using ubyte[] as an ASCII string should give full language support for these.  Given that and ASCIILIB you pretty much have the current D string manipulation capabilities afaik, and it will be fast.
September 29, 2006
Chad J > wrote:

> I'd like the default to be UTF. Then we can have a base of code to
> correctly manipulate UTF strings (in phobos and language supported).
> Writing correct ASCII manipulation routine without good library/language
> support is a lot easier than writing good UTF manipulation routines
> without good library/language support, and UTF will probably be used
> much more than ASCII.

But D already uses Unicode for all strings, encoded as UTF ?

When you say "ASCII", do you mean 8-bit encodings perhaps ?
(since all proper 7-bit ASCII are already valid UTF-8 too)

> Also, if we move over to full blown UTF, we won't have to give up ASCII.  It seems to me like the phobos std.string functions are pretty much ASCII string manipulating functions (no multibyte string support).  So just copy those out to a seperate library, call it "ASCII lib", and there's your library support for ASCII.  That leaves string literals, which is a slight problem, but I suppose easily fixed:
> ubyte[] hi = "hello!"a;

I don't understand this, why can't you use UTF-8 for this ?

char[] hi = "hello!";

> Just add a postfix 'a' for strings which makes the string an ASCII literal, of type ubyte[].  D arrays don't seem powerful enough to do UTF manipulations without special attention, but they are powerful enough to do ASCII manipulations without special attention, so using ubyte[] as an ASCII string should give full language support for these.  Given that and ASCIILIB you pretty much have the current D string manipulation capabilities afaik, and it will be fast.

What is not powerful enough about the foreach(dchar c; str) ?
It will step through that UTF-8 array one codepoint at a time.

--anders
September 29, 2006
Chad J > wrote:
> I will go ahead and say that the current state of char[] is incorrect. That is, if you write a program manipulating char[] strings, then run it in China, you will be disappointed with the results.  It won't matter how fast the program runs, because bad stuff will happen like entire strings becoming unreadable to the user.

Wrong.

And that's precisely what I meant about the Daddy holding bike allegory a few messages back.

The current system seems to work "by magic". So, if you do go to China, it'll "just work".

At this point you _should_ not believe me. :-) But it still works.

---

The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of bit-twiddling individual octets in these "char" arrays.
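For instance (a sketch, assuming the std.string of the time), the search and replace routines return and accept byte indices, but a match always starts on a code point boundary, so slicing with those indices stays valid UTF-8:

```d
import std.string; // find, replace
import std.stdio;  // writefln

void main()
{
    char[] s = "Grüße aus Köln";
    int i = std.string.find(s, "Köln");  // byte index, but on a boundary
    if (i != -1)
        writefln(s[i .. s.length]);      // slicing here is safe
    writefln(std.string.replace(s, "Köln", "Wien"));
}
```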

So things just keep on working.

---

Not convinced yet? Well, a lot of folks here are from Europe, and our languages contain "non-ASCII" characters. Our text manipulating programs still work all right. And, actually D is pretty popular in Japan. Every once in a while some Japanese guys pop on-and-off here, and some of them don't even speak English, so they use a machine translator(!) to talk with us. Just guess if they use ASCII in their programs. And you know what, most of these guys even use their own characters for variable names in D!

And not one of them has complained about "disappointing results".

---

That's why I continued with: keep your eyes shut and keep on coding.
September 29, 2006
Anders F Björklund wrote:
> Chad J > wrote:
> 
>> I'd like the default to be UTF. Then we can have a base of code to
>> correctly manipulate UTF strings (in phobos and language supported).
>> Writing correct ASCII manipulation routine without good library/language
>> support is a lot easier than writing good UTF manipulation routines
>> without good library/language support, and UTF will probably be used
>> much more than ASCII.
> 
> 
> But D already uses Unicode for all strings, encoded as UTF ?
> 
> When you say "ASCII", do you mean 8-bit encodings perhaps ?
> (since all proper 7-bit ASCII are already valid UTF-8 too)
> 

Probably 7-bit.  Anything where the size of one character is ALWAYS one byte.  I am already assuming that ASCII is (at least mostly) a subset of UTF-8.  However, I talk about it in an exclusive manner because if you handle UTF-8 strings properly, the code will probably run at least slightly slower than with ASCII-only strings.

>> Also, if we move over to full blown UTF, we won't have to give up ASCII.  It seems to me like the phobos std.string functions are pretty much ASCII string manipulating functions (no multibyte string support).  So just copy those out to a seperate library, call it "ASCII lib", and there's your library support for ASCII.  That leaves string literals, which is a slight problem, but I suppose easily fixed:
>> ubyte[] hi = "hello!"a;
> 
> 
> I don't understand this, why can't you use UTF-8 for this ?
> 
> char[] hi = "hello!";
> 

I was talking about what happens IF we made char[] into a datatype that handles all of those odd corner cases correctly (slices into multibyte strings, for instance): it would no longer be the same fast ASCII-only routines. So for those who want the fast ASCII-only stuff, it would be nice to have a way to make string literals where each character takes only one byte, without ugly casting.  To get an ASCII monobyte string from a string literal in D I currently have to do the following:

ubyte[] hi = cast(ubyte[])"hello!";

hmmm, yuck.

>> Just add a postfix 'a' for strings which makes the string an ASCII literal, of type ubyte[].  D arrays don't seem powerful enough to do UTF manipulations without special attention, but they are powerful enough to do ASCII manipulations without special attention, so using ubyte[] as an ASCII string should give full language support for these.  Given that and ASCIILIB you pretty much have the current D string manipulation capabilities afaik, and it will be fast.
> 
> 
> What is not powerful enough about the foreach(dchar c; str) ?
> It will step through that UTF-8 array one codepoint at a time.
> 

I'm assuming 'str' is a char[], which would make that very nice.  But it doesn't solve correct slicing or indexing into a char[].  If nothing were done about this and I absolutely needed UTF support, I'd probably make a class like so:

class String
{
  char[] data;

  ...

  dchar opIndex( int index )
  {
    foreach( int i, dchar c; data )
    {
      if ( i == index )
        return c;

      i++;
    }
  }

  // similar thing for opSlice down here
  ...
}

Which is probably slower than could be done.
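A version without that problem could use std.utf.decode with a separate code point counter (a sketch, assuming the std.utf of the time; the function name is mine):

```d
import std.utf;   // decode

// Returns the code point at the given code point index (not byte index).
dchar codePointAt(char[] data, size_t index)
{
    size_t i = 0;   // position in code units (bytes)
    size_t n = 0;   // count of code points seen so far
    while (i < data.length)
    {
        dchar c = std.utf.decode(data, i);  // advances i past one code point
        if (n == index)
            return c;
        n++;
    }
    throw new Exception("code point index out of range");
}
```

It still has to scan from the start on every call, which is exactly the overhead a proper string type could amortize.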

All in all it is a drag that we should have to learn all of this UTF stuff.  I want char[] to just work!
September 29, 2006
Georg Wrede wrote:
> The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of bit-twiddling individual octets in these "char" arrays.
> 

But this is what I'm talking about... you can't slice them or index them.  I might actually index a character out of an array from time to time.  If I don't know about UTF, and I do just keep on coding, and I do something like this:

char[] str = "some string in nonenglish text";
for ( int i = 0; i < str.length; i++ )
{
  str[i] = doSomething( str[i] );
}

and this will fail, right?

If it does fail, then everything is not alright.  You do have to worry about UTF.  Someone has to tell you to use a foreach there.
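And even the foreach is not enough, because doSomething may map a one-byte character to a multi-byte one, so a fixed version cannot write in place at all; it has to decode and re-encode into a new string (a sketch, assuming std.utf; doSomething is hypothetical):

```d
import std.utf;    // encode

// Applies doSomething to each code point and builds a new string,
// because the result may need a different number of code units.
char[] mapString(char[] str, dchar function(dchar c) doSomething)
{
    char[] result;
    foreach (dchar c; str)
        std.utf.encode(result, doSomething(c));  // appends UTF-8 code units
    return result;
}
```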
September 29, 2006
BCS wrote:
> Why isn't performance a problem?
> 
> If you are saying that this won't cause performance hits in run times or  memory space, I might be able to buy it, but I'm not yet convinced.
> 
> If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise.
> 
> In my opinion, any compiled language should allow fairly direct access to the most efficient practical means of doing something*. If I didn't care about speed and memory I would use some sort of scripting language.
> 
> A good set of libs should make most of this moot. Leave the char as is and define a typedef struct or whatever that provides the added functionality that you want.
> 
> * OTOH a language should not mandate code to be efficient at the expense of ease of coding.

I don't think any performance hit will be so big that it causes problems (at most 4x memory and negligible computation overhead). I hope that makes clear what I meant.
September 29, 2006
Georg Wrede wrote:
> Wrong.
> 
> And that's precisely what I meant about the Daddy holding bike allegory a few messages back.
> 
> The current system seems to work "by magic". So, if you do go to China, itll "just work".
> 
> At this point you _should_ not believe me. :-) But it still works.
> 
> ---

But is this not a needless source of confusion that could be eliminated by defining char as "big enough to hold a Unicode code point", or by something else that removes the possibility of incorrectly splitting UTF sequences?

I will have to try using char[] with non-ASCII characters, though I have been using dchar for that up till now.
September 29, 2006
Chad J > wrote:

> Probably 7-bit.  Anything where the size of one character is ALWAYS one byte.  I am already assuming that ASCII is a subset or at least is mostly a subset of UTF8.  However, I talk about it in an exclusive manner because if you handle UTF8 strings properly then the code will probably run at least slightly slower than with ASCII-only strings.

It's mostly about looking out for the UTF "control" characters, which is not more than a simple assertion in your ASCII-only functions really...

I don't think handling UTF-8 properly is a burden for string functions, when you compare it with the enormous gain that it has over ASCII-only.
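That assertion could be as simple as checking that no code unit has the high bit set, since that is exactly where UTF-8 multi-byte sequences begin (a sketch):

```d
// ASCII-only precondition: every code unit must be below 0x80.
// Anything at or above 0x80 is part of a UTF-8 multi-byte sequence.
void assertAscii(char[] s)
{
    foreach (char c; s)
        assert(c < 0x80);
}
```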

>> What is not powerful enough about the foreach(dchar c; str) ?
>> It will step through that UTF-8 array one codepoint at a time.
> 
> I'm assuming 'str' is a char[], which would make that very nice.  But it doesn't solve correctly slicing or indexing into a char[].  

Well, it's also a lot "trickier" than that... For instance, my last name
can be written in Unicode as Björklund with a precomposed 'ö', or with a plain 'o' followed by a combining diaeresis. Both forms are valid, only that in the latter the 'ö' occupies two full code points!
It's still a single character, which is why Unicode avoids that term...

As you know, if you need to access your strings by codepoint (something that the Unicode group explicitly recommends against, in their FAQ) then char[] isn't a very nice format - because of the conversion overhead...
But it's still possible to translate, transform, and translate back ?
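The translate/transform/translate-back round trip could look like this (a sketch, assuming std.utf's toUTF32/toUTF8; note that even transforming by code point would still split up combining sequences like the one above):

```d
import std.utf;    // toUTF32, toUTF8

// Reverse a string by code point: widen, transform, narrow again.
char[] reverseByCodePoint(char[] s)
{
    dchar[] wide = std.utf.toUTF32(s);   // one element per code point
    for (size_t i = 0; i < wide.length / 2; i++)
    {
        dchar tmp = wide[i];
        wide[i] = wide[wide.length - 1 - i];
        wide[wide.length - 1 - i] = tmp;
    }
    return std.utf.toUTF8(wide);         // re-encode as UTF-8
}
```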

> If nothing was done about this and I absolutely needed UTF support,
> I'd probably make a class like so: [...]

In my own mock String class, I cached the dchar[] codepoints on demand.
(viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)

> All in all it is a drag that we should have to learn all of this UTF stuff.  I want char[] to just work!

Using Unicode strings and characters does require a little learning...
(where http://www.unicode.org/faq/utf_bom.html is a very good page)
And D does force you to think about string implementation, no question.
This has both pros and cons, but it is a deliberate language decision.

If you're willing to handle the "surrogates", then UTF-16 is a rather
good trade-off between the default UTF-8 and wasteful UTF-32 formats ?
A downside is that it is not "ascii-compatible" (has embedded NUL chars)
and that it is endian-dependant unlike the more universal UTF-8 format.

--anders
September 29, 2006
Chad J > wrote:

> char[] data; 

>   dchar opIndex( int index )
>   {
>     foreach( int i, dchar c; data )
>     {
>       if ( i == index )
>         return c;
> 
>       i++;
>     }
>   }

This code probably does not work as you think it does...

If you loop through a char[] using dchars (with a foreach),
then the int will get the codeunit index - *not* codepoint.
(the ++ in your code above looks more like a typo though,
since it needs to *either* foreach i, or do it "manually")

import std.stdio;
void main()
{
   char[] str = "Björklund";
   foreach(int i, dchar c; str)
   {
     writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
   }
}

Will print the following sequence:

   0 \U00000042 'B'
   1 \U0000006A 'j'
   2 \U000000F6 'ö'
   4 \U00000072 'r'
   5 \U0000006B 'k'
   6 \U0000006C 'l'
   7 \U00000075 'u'
   8 \U0000006E 'n'
   9 \U00000064 'd'

Notice how the non-ASCII character takes *two* code units ?
(if you expect indexing to use characters, that'd be wrong)
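Doing it "manually" means keeping the code point counter yourself, since the foreach index variable only gives you code unit positions (a sketch):

```d
import std.stdio;

void main()
{
    char[] str = "Björklund";
    int n = 0;                  // code point counter, kept by hand
    foreach (dchar c; str)      // note: no index variable from foreach here
    {
        writefln("%4d '%s'", n, ""d ~ c);
        n++;
    }
}
```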

More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs

--anders