September 30, 2006
Sean Kelly wrote:
> Walter Bright wrote:
>>
>> Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode.
> 
> As long as you're aware that you are working in UTF-8 I think std::string could still be used.  It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.

It's so broken that there are proposals to reengineer core C++ to add support for UTF types.

1) implementation-defined whether a char is signed or unsigned, so you've got to cast the result of any string[i]

2) none of the iteration, insertion, appending, etc., operations can handle multibyte

3) no UTF conversion or transliteration

4) C++ source text encoding is implementation-defined, so no using UTF characters in source code (have to use \u or \U notation)
October 01, 2006
Anders F Björklund wrote:
>>> What is not powerful enough about the foreach(dchar c; str) ?
>>> It will step through that UTF-8 array one codepoint at a time.
>>
>>
>> I'm assuming 'str' is a char[], which would make that very nice.  But it doesn't solve correctly slicing or indexing into a char[].  
> 
> 
> Well, it's also a lot "trickier" than that... For instance, my last name
> can be written in Unicode as Björklund or Bj¨orklund, both of which are valid - only that in one of them, the 'ö' occupies two full code points!
> It's still a single character, which is why Unicode avoids that term...
> 

So it seems to me the problem is that those two code points are two "characters" and one character at the same time.

In this case, I'd prefer being able to index to a safe default (like the whole ö, instead of the combining umlaut next to the o), or not being able to index at all.

> As you know, if you need to access your strings by codepoint (something that the Unicode group explicitly recommends against, in their FAQ) then char[] isn't a very nice format - because of the conversion overhead...
> But it's still possible to translate, transform, and translate back ?
> 

I read that FAQ at the bottom of this post, and didn't see anything about accessing strings by codepoint.  Maybe you mean a different FAQ here, in which case, could I have a link please?  I've been to the unicode site before and all I remember was being confused and having a hard time finding the info I wanted :(

Also I still am not sure exactly what a code point is.  And that FAQ at the bottom used the word "surrogate" a lot; I'm not sure about that one either.

When you say char[] isn't a nice format, I wasn't thinking about having the string class I mentioned earlier store the data ONLY as char[].  It might be wchar[].  Or dchar[].  Then it would be automatically converted between the two either at compile time (when possible) or dynamically at runtime (hopefully only when needed).  So if someone throws a Chinese character literal at it, there is a very big clue there to use UTF32 or something that can store all of the characters in a uniform width sort of way, to speed indexing.  Algorithms could be used so that a program 'learns' at runtime what kind of strings are dominating the program, and uses algorithms optimized for those.  Maybe this is a bit too complex, but I can dream, hehe.

>> If nothing was done about this and I absolutely needed UTF support,
>> I'd probably make a class like so: [...]
> 
> 
> In my own mock String class, I cached the dchar[] codepoints on demand.
> (viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)
> 
>> All in all it is a drag that we should have to learn all of this UTF stuff.  I want char[] to just work!
> 
> 
> Using Unicode strings and characters does require a little learning...
> (where http://www.unicode.org/faq/utf_bom.html is a very good page)
> And D does force you to think about string implementation, no question.
> This has both pros and cons, but it is a deliberate language decision.
> 
> If you're willing to handle the "surrogates", then UTF-16 is a rather
> good trade-off between the default UTF-8 and wasteful UTF-32 formats ?
> A downside is that it is not "ascii-compatible" (has embedded NUL chars)
> and that it is endian-dependent, unlike the more universal UTF-8 format.
> 
> --anders

My impression has gone from being quite scared of UTF to being not so worried, but only for myself.  D seems to be good at handling UTF, but only if someone tells you to never handle strings as arrays of characters.  Unfortunately, the first thing you see in a lot of D programs is "int main( char[][] args )" and there are some arrays of characters being used as strings.  This also means that some array capabilities like indexing and the braggable slicing are more dangerous than useful for string handling.  It's a newbie trap.

Like I said earlier, I either want to be able to index/slice strings safely, or not at all (or better yet, not by any intuitive means).
October 01, 2006
On Fri, 29 Sep 2006 23:11:37 -0700, Walter Bright wrote:

> Derek Parnell wrote:
>> I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurrence of the byte value for the letter 'a' which may well be inside a Japanese character.
> 
> That cannot happen, because multibyte sequences *always* have the high bit set, and 'a' does not. That's one of the things that sets UTF-8 apart from other multibyte formats. You might be thinking of the older Shift-JIS multibyte encoding, which did suffer from such problems.

Thanks. That has cleared up some misconceptions and preconceptions that I had with UTF encoding. I can remove some of my home-grown routines now and reduce the number of times that I (think I) need dchar[] ;-)


>> It may very well be pointless for your way of thinking, but your language is also for people who may not necessarily think in the same manner as yourself. I, for example, think there is a point to having my code read like its dealing with strings rather than arrays of characters. I suspect I'm not alone. We could all write the alias in all our code, but you could also be helpful and do it for us - like you did with bit/bool.
> 
> I'm concerned about just adding more names that don't add real value. As I wrote in a private email discussion about C++ typedefs, they should only be used when:
> 
> 1) they provide an abstraction against the presumption that the underlying type may change
> 
> 2) they provide a self-documentation purpose
> 
> (1) certainly doesn't apply to string.

No argument there.

>  (2) may, but char[] has no use
> other than that of being a string, as a char[] is always a string and a
> string is always a char[]. So I don't think string fits (2).

This is a little more debatable, but not worth generating hostility.

A string of text contains characters whose position in the string is significant - there are semantics to be applied to the entire text. It is quite possible to conceive of an application in which the characters in the char[] array have no importance attached to their relative position within the array *when compared to neighboring characters*. The order of characters in text is significant but not necessarily so in an arbitrary character array.

Conceptually a string is different from a char[], even though they are implemented using the same technology.

> And lastly, there's the inevitable confusion. People learning the language will see char[] and string, and wonder which should be used when. I can't think of any consistent understandable rule for that. So it just winds up being wishy-washy. Adding more names into the global space (which is what names in object.d are) should be done extremely conservatively.

And yet we have "toString" and not "toCharArray" or "toUTF"!

And we still have the "printf" in object.d too!

> If someone wants to use the string alias as their personal or company style, I have no issue with that, as other people *do* think differently than me (which is abundantly clear here!).

I'll revert Build to string again as it is a lot easier to read. It started out that way but I converted it to char[] to appease you (why I thought you needed appeasing is lost though). :-)

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
October 01, 2006
Derek Parnell wrote:
>>  (2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2).
>  This is a little more debatable, but not worth generating hostility.

I certainly hope this thread doesn't degenerate into that like some of the others.

> A string of text contains characters whose position in the string is
> significant - there are semantics to be applied to the entire text. It is
> quite possible to conceive of an application in which the characters in the
> char[] array have no importance attached to their relative position within
> the array *where compared to neighboring characters*. The order of
> characters in text is significant but not necessarily so in an arbitrary
> character array. 
> 
> Conceptually a string is different from a char[], even though they are
> implemented using the same technology.

You do have a point there.

>> And lastly, there's the inevitable confusion. People learning the language will see char[] and string, and wonder which should be used when. I can't think of any consistent understandable rule for that. So it just winds up being wishy-washy. Adding more names into the global space (which is what names in object.d are) should be done extremely conservatively.
> 
> And yet we have "toString" and not "toCharArray" or "toUTF"!

True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.

I suppose that since I grew up with char* meaning string, using char[] seems perfectly natural. I tried typedef'ing char* to string now and then, but always wound up going back to just using char*.

> And we still have the "printf" in object.d too!

I know many feel that printf doesn't belong there. It certainly isn't there for purity or consistency. It's there purely (!) for the convenience of writing short quickie programs. I tend to use it for quick debugging test cases, because it doesn't rely on the rest of D working.

>> If someone wants to use the string alias as their personal or company style, I have no issue with that, as other people *do* think differently than me (which is abundantly clear here!).
> 
> I'll revert Build to string again as it is a lot easier to read. It started
> out that way but I converted it to char[] to appease you (why I thought you
> need appeasing is lost though). :-)

No, you certainly don't need to appease me! I do care about maintaining a reasonably consistent style in Phobos, but I don't believe a language should enforce a particular style beyond the standard library. Viva la difference.

P.S. I did say to not 'enforce', but that doesn't mean I am above recommending a particular style, as in http://www.digitalmars.com/d/dstyle.html
October 01, 2006
Chad J > wrote:

> I read that FAQ at the bottom of this post, and didn't see anything about accessing strings by codepoint.  Maybe you mean a different FAQ here, in which case, could I have a link please?  I've been to the unicode site before and all I remember was being confused and having a hard time finding the info I wanted :(

I meant http://www.unicode.org/faq/utf_bom.html#12

> Also I still am not sure exactly what a code point is.  And that FAQ at the bottom used the word "surrogate" a lot; I'm not sure about that one either.

Code point is the closest thing to a "character", although it might take more than one Unicode code point to represent a single Unicode grapheme.

Surrogates are used with UTF-16, to represent "too large" code points...
i.e. they always occur in "surrogate pairs", which combine into a single code point.
> When you say char[] isn't a nice format, I wasn't thinking about having the string class I mentioned earlier store the data ONLY as char[].  It might be wchar[].  Or dchar[].  Then it would be automatically converted between the two either at compile time (when possible) or dynamically at runtime (hopefully only when needed).  So if someone throws a Chinese character literal at it, there is a very big clue there to use UTF32 or something that can store all of the characters in a uniform width sort of way, to speed indexing.  Algorithms could be used so that a program 'learns' at runtime what kind of strings are dominating the program, and uses algorithms optimized for those.  Maybe this is a bit too complex, but I can dream, hehe.

Actually I said that dchar[] (i.e. UTF-32) wasn't ideal, but anyway...
(UTF-8 or UTF-16 is preferable, for the reasons in the UTF FAQ above)

We already have char[] as the string default in D, but most models for
a String class use wchar[] (i.e. UTF-16), for instance Mango or Java:
* http://mango.dsource.org/classUString.html (uses the ICU lib)
* http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html

All formats do use Unicode, so converting from one UTF to another is mostly a question of memory/performance and not about any data loss.
However, it is not converted at compile time (without using templates)
so mixing and matching different representations is somewhat of a pain.

I think that char[] for string and wchar[] for String are good defaults.

> My impression has gone from being quite scared of UTF to being not so worried, but only for myself.  D seems to be good at handling UTF, but only if someone tells you to never handle strings as arrays of characters.  Unfortunately, the first thing you see in a lot of D programs is "int main( char[][] args )" and there are some arrays of characters being used as strings.  This also means that some array capabilities like indexing and the braggable slicing are more dangerous than useful for string handling.  It's a newbie trap.

It is, since it isn't really "arrays of characters" but "arrays of code units". What muddies the waters further is that sometimes they're equal.
That is, with ASCII characters each character fits into a single D char unit.
Without surrogates, each character (from BMP) fits into one wchar unit.

However, all code that handles the shorter formats should be prepared to handle non-ASCII (for UTF-8) and surrogates (for UTF-16), or use UTF-32:
bool isAscii(char c) { return (c <= 0x7f); }
bool isSurrogate(wchar c) { return (c >= 0xD800 && c <= 0xDFFF); }

But a warning that D uses multi-byte strings might be in order, yes...
Another warning that it only supports UTF-8 platforms* might also be ?

--anders

* "main(char[][] args)" does not work for any non-UTF consoles,
  as you will get invalid UTF sequences for the non-ASCII chars.
October 01, 2006
On Sat, 30 Sep 2006 21:18:02 -0700, Walter Bright wrote:

> P.S. I did say to not 'enforce', but that doesn't mean I am above recommending a particular style, as in http://www.digitalmars.com/d/dstyle.html

Oh, I threw that away ages ago ;-)

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
October 01, 2006
Walter Bright wrote:
> Sean Kelly wrote:
>> Walter Bright wrote:
>>>
>>> Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode.
>>
>> As long as you're aware that you are working in UTF-8 I think std::string could still be used.  It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.
> 
> It's so broken that there are proposals to reengineer core C++ to add support for UTF types.
> 
> 1) implementation-defined whether a char is signed or unsigned, so you've got to cast the result of any string[i]

Oops, forgot about this.

> 2) none of the iteration, insertion, appending, etc., operations can handle multibyte

True.  And I hinted at this above.

> 3) no UTF conversion or transliteration
> 
> 4) C++ source text encoding is implementation-defined, so no using UTF characters in source code (have to use \u or \U notation)

Personally, I see this as a language deficiency more than a deficiency in std::string.  std::string is really just a vector with some search capabilities thrown in.  It's not that great for a string class, but it works well enough as a general sequence container.  And it will work a tad better once they impose the same data contiguity guarantee that vector has (I believe that's one of the issues set to be resolved for C++0x).

Overall, I do agree with you.  Though I suppose that's obvious as I'm a former C++ advocate who now uses D quite a bit :-)


Sean
October 01, 2006
Sean Kelly wrote:
>> 3) no UTF conversion or transliteration
>>
>> 4) C++ source text encoding is implementation-defined, so no using UTF characters in source code (have to use \u or \U notation)
> 
> Personally, I see this as a language deficiency more than a deficiency in std::string.

That's why the proposals to fix it are rewriting some of the *core* C++ language.

> std::string is really just a vector with some search capabilities thrown in.

Another difficulty with it is it doesn't have a connection with std::vector<char>.

> It's not that great for a string class, but it works well enough as a general sequence container.  And it will work a tad better once they impose the same data contiguity guarantee that vector has (I believe that's one of the issues set to be resolved for C++0x).
> 
> Overall, I do agree with you.  Though I suppose that's obvious as I'm a former C++ advocate who now uses D quite a bit :-)

:-)
October 01, 2006
Derek Parnell wrote:
>>  (2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2).
>  This is a little more debatable, but not worth generating hostility.
> 
> A string of text contains characters whose position in the string is
> significant - there are semantics to be applied to the entire text. It is
> quite possible to conceive of an application in which the characters in the
> char[] array have no importance attached to their relative position within
> the array *where compared to neighboring characters*. The order of
> characters in text is significant but not necessarily so in an arbitrary
> character array. 
> 
> Conceptually a string is different from a char[], even though they are
> implemented using the same technology.
> 

Precisely! And even if such conceptual difference didn't exist, or is very rare, 'string' is nonetheless more readable than 'char[]', a fact I am constantly reminded of when I see 'int main(char[][] args)' instead of 'int main(string[] args)', which translates much more quickly into the brain as 'array of strings' than its current counterpart.

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
October 01, 2006
Johan Granberg wrote:
> BCS wrote:
> 
>> Why isn't performance a problem?
>>
[...]
>> If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise.
> 
> I don't think any performance hit will be so big that it causes problems (max x4 memory and negligible computation overhead). Hope that made clear what I meant.

If you will note, I said nothing about the size of the hit. While some may disagree, I think that any unneeded hit is a problem.

One alternative that I could live with would use 4 character types:

char	one code unit in whatever encoding the runtime uses
schar	one 8-bit code unit (ASCII or UTF-8)
wchar	one 16-bit code unit (same as before)
dchar	one 32-bit code unit (same as before)

(using the same thing for ASCII and UTF-8 may be a problem, but this isn't my field)

The point being that char, wchar and dchar are not representing numbers and should be their own types. This also preserves direct access to 8-, 16- and 32-bit types.