January 16, 2011
On 1/15/11 9:25 PM, Jonathan M Davis wrote:
> Considering that strings are already dealt with specially in order to have an
> element of dchar, I wouldn't think that it would be all that disruptive to make
> it so that they had an element type of Grapheme instead. Wouldn't that then fix
> all of std.algorithm and the like without really disrupting anything?

It would make everything related a lot (a TON) slower, and it would break all client code that uses dchar as the element type, or is otherwise unprepared to use Graphemes explicitly. There is no question there will be disruption.
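
Concretely, think of client code like this (a made-up sketch, but representative of anything that bakes dchar in as the element type):

import std.algorithm : canFind, count;

// Hypothetical client code: it assumes the element type of a string is
// dchar. If strings yielded Graphemes instead, the dchar parameter
// would no longer match and this would stop compiling.
size_t countVowels(string s)
{
    return count!((dchar c) => "aeiou".canFind(c))(s);
}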

Andrei
January 16, 2011
And how would 3rd party libraries handle Graphemes? And C modules? I think making these Graphemes the default would make quite a mess, since you would have to convert back and forth between char[] and Grapheme[] all the time (right?).
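
For instance, anything that crosses a C boundary would have to be re-encoded. A rough sketch, using the hypothetical slice-based Grapheme type floated in this thread:

import std.string : toStringz;
import core.stdc.stdio : puts;

// Hypothetical Grapheme, as sketched elsewhere in the thread.
struct Grapheme { const(char)[] rep; }

void passToC(Grapheme[] gs)
{
    char[] buf;
    foreach (g; gs)
        buf ~= g.rep;        // flatten back to UTF-8...
    puts(buf.toStringz);     // ...because the C API still wants a char*
}
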
January 16, 2011
On 1/15/11 10:45 PM, Michel Fortin wrote:
> On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> I'm unclear on where this is converging to. At this point the
>> commitment of the language and its standard library to (a) UTF array
>> representation and (b) code points conceptualization is quite strong.
>> Changing that would be quite difficult and disruptive, and the
>> benefits are virtually nonexistent for most of D's user base.
>
> There's still a disagreement about whether a string or a code unit array
> should be the default string representation, and whether iterating on a
> code unit array should give you code unit or grapheme elements. Of those
> who participated in the discussion, I don't think anyone is
> disputing the idea that a grapheme element is better than a dchar
> element for iterating over a string.

Disagreement though that may be, a simple fact that needs to be taken into account is that as of right now all of Phobos uses UTF arrays for string representation and dchar as the element type.

Besides, I for one do dispute the idea that a grapheme element is better than a dchar element for iterating over a string. The grapheme has the attractiveness of being theoretically clean but at the same time is woefully inefficient and helps languages that few D users need to work with. At least that's my perception, and we need some serious numbers instead of convincing rhetoric to make a big decision.

It's all a matter of picking one's trade-offs. Clearly ASCII is out as no serious amount of non-English text can be trafficked without diacritics. So switching to UTF makes a lot of sense, and that's what D did.

When I introduced std.range and std.algorithm, they'd handle char[] and wchar[] no differently than any other array. A lot of algorithms simply did the wrong thing by default, so I attempted to fix that situation by defining byDchar(). So instead of passing some string str to an algorithm, one would pass byDchar(str).
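
The pattern looked roughly like this (a sketch for illustration, not the exact code of the time; today's std.utf.byDchar has the same shape as the adapter described):

import std.algorithm : find;
import std.utf : byDchar; // stands in for the adapter described above

void search(string str)
{
    // A raw char[] compares code units, and 'ü' spans two UTF-8 code
    // units, so an un-decoded search is wrong by default. Decoding
    // first makes the intuitive call correct:
    auto r = find(byDchar(str), 'ü');
}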

A couple of weeks went by in testing that state of affairs, and before long I figured that I needed to insert byDchar() virtually _everywhere_. There were a couple of algorithms (e.g. Boyer-Moore) that happened to work with arrays for subtle reasons (needless to say, they won't work with graphemes at all). But by and large the situation was that the simple and intuitive code was wrong and that the correct code necessitated inserting byDchar().

So my next decision, which understandably some of the people who didn't go through the experiment may find unintuitive, was to make byDchar() the default. This cleaned up a lot of crap in std itself and saved a lot of crap in the yet-unwritten client code.

I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried.

Now, thanks to the effort people have spent in this group (thank you!), I have an understanding of the grapheme issue. I guarantee that grapheme-level iteration will incur a high cost: lost efficiency and changes throughout std. The languages that need composing characters for producing meaningful text are few and far between, so it makes sense to confine support for them to libraries that are not the default, unless we find ways to not disrupt everyone else.

>> It may be more realistic to consider using what we have as back-end
>> for grapheme-oriented processing.
>> For example:
>>
>> struct Grapheme(Char) if (isSomeChar!Char)
>> {
>>     private const Char[] rep;
>>     ...
>> }
>>
>> auto byGrapheme(S)(S s) if (isSomeString!S)
>> {
>>     ...
>> }
>>
>> string s = "Hello";
>> foreach (g; byGrapheme(s))
>> {
>>     ...
>> }
>
> No doubt it's easier to implement it that way. The problem is that in
> most cases it won't be used. How many people really know what a
> grapheme is?

How many people really should care?

> Of those, how many will forget to use byGrapheme at one time
> or another? And so in most programs string manipulation will misbehave
> in the presence of combining characters or unnormalized strings.

But most strings don't contain combining characters or unnormalized strings.

> If you want to help D programmers write correct code when it comes to
> Unicode manipulation, you need to help them iterate on real characters
> (graphemes), and you need the algorithms to apply to real characters
> (graphemes), not the approximation of a Unicode character that is a code
> point.

I don't think the situation is as clean cut, as grave, and as urgent as you say.


Andrei
January 16, 2011
On 1/15/11 10:47 PM, Michel Fortin wrote:
> On 2011-01-15 22:25:47 -0500, Jonathan M Davis <jmdavisProg@gmx.com> said:
>
>> The issue of foreach remains, but without being willing to change what
>> foreach defaults to, you can't really fix it - though I'd suggest that
>> we at least make it a warning to iterate over strings without
>> specifying the type. And if foreach were made to understand Grapheme
>> like it understands dchar, then you could do
>>
>> foreach(Grapheme g; str) { ... }
>>
>> and have the compiler warn about
>>
>> foreach(g; str) { ... }
>>
>> and tell you to use Grapheme if you want to be comparing actual
>> characters.
>
> Walter's argument against changing this for foreach was that it'd
> *silently* break compatibility with existing D1 code. Changing the
> default to a grapheme makes this argument obsolete: since a grapheme is
> essentially a string, you can't compare it with char or wchar or dchar
> directly, so it'll break at compile time with an error and you'll have
> to decide what to do.
>
> So Walter would have to find another argument to defend the status quo.

I think it's poor abstraction to represent a Grapheme as a string. It should be its own type.

Andrei
January 16, 2011
On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 1/15/11 10:45 PM, Michel Fortin wrote:
>> No doubt it's easier to implement it that way. The problem is that in
>> most cases it won't be used. How many people really know what a
>> grapheme is?
> 
> How many people really should care?

I think the only people who should *not* care are those who have validated that the input does not contain any combining code point. If you know the input *can't* contain combining code points, then it's safe to ignore them.
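
Such a validation is cheap to approximate, by the way. A rough sketch, checking only the common combining blocks rather than the full combining-class data:

bool mayContainCombining(string s)
{
    foreach (dchar c; s)
    {
        if ((c >= 0x0300 && c <= 0x036F)     // Combining Diacritical Marks
            || (c >= 0x1DC0 && c <= 0x1DFF)  // ...Supplement
            || (c >= 0x20D0 && c <= 0x20FF)  // ...for Symbols
            || (c >= 0xFE20 && c <= 0xFE2F)) // Combining Half Marks
            return true;
    }
    return false;
}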

If we don't make correct Unicode handling the default, someday someone is going to ask a developer to fix a problem where his system doesn't handle some text correctly. Later that day, he'll come to the realization that almost none of his D code and none of the D libraries he uses handle Unicode correctly, and he'll say: I can't fix this. His peer working on a similar Objective-C program will have a good laugh.

Sure, correct Unicode handling is slower and more complicated to implement, but at least you know you'll get the right results.


>> Of those, how many will forget to use byGrapheme at one time
>> or another? And so in most programs string manipulation will misbehave
>> in the presence of combining characters or unnormalized strings.
> 
> But most strings don't contain combining characters or unnormalized strings.

I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?

A few years ago, many Unicode symbols didn't even show up correctly on Windows. Today, we have Unicode domain names and people start putting funny symbols in them (for instance: <http://◉.ws>). I haven't seen it yet, but we'll surely see combining characters in domain names soon enough (if only as a way to make fun of programs that can't handle Unicode correctly). Well, let me be the first to make fun of such programs: <http://☺̭̏.michelf.com/>.

Also, not all combining characters are marks meant to be used by some foreign languages. Some are used for mathematics for instance. Or you could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay indicating some kind of prohibition.
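
Counted by code points, such a combination already looks like two elements. A small sketch:

import std.range : walkLength;

void main()
{
    // 'A' + U+20E0 COMBINING ENCLOSING CIRCLE BACKSLASH: one visible
    // character ("no A allowed"), but two code points.
    string noA = "A\u20E0";
    assert(noA.walkLength == 2); // dchar count, not character count
}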


>> If you want to help D programmers write correct code when it comes to
>> Unicode manipulation, you need to help them iterate on real characters
>> (graphemes), and you need the algorithms to apply to real characters
>> (graphemes), not the approximation of a Unicode character that is a code
>> point.
> 
> I don't think the situation is as clean cut, as grave, and as urgent as you say.

I agree it's probably not as clean cut as I say (I'm trying to keep complicated things simple here), but it's something important to decide early because the cost of changing it increases as more code is written.


Quoting the first part of the same post (out of order):

> Disagreement though that may be, a simple fact that needs to be taken into account is that as of right now all of Phobos uses UTF arrays for string representation and dchar as the element type.
> 
> Besides, I for one do dispute the idea that a grapheme element is better than a dchar element for iterating over a string. The grapheme has the attractiveness of being theoretically clean but at the same time is woefully inefficient and helps languages that few D users need to work with. At least that's my perception, and we need some serious numbers instead of convincing rhetoric to make a big decision.

You'll no doubt get more performance from a grapheme-aware specialized algorithm working directly on code points than by iterating on graphemes returned as string slices. But both will give *correct* results.

Implementing a specialized algorithm of this kind becomes an optimization, and it's likely you'll want an optimized version for most string algorithms.
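
As an illustration of what such a specialized algorithm might look like (purely a sketch; isCombining below stands in for a real combining-class test):

import std.algorithm : startsWith;
import std.utf : decode;

// Placeholder for a real Unicode combining-class lookup.
bool isCombining(dchar c)
{
    return (c >= 0x0300 && c <= 0x036F) || (c >= 0x20D0 && c <= 0x20FF);
}

// Scans by code point, yet stays grapheme-correct by refusing matches
// that the haystack extends with a combining mark.
bool graphemeStartsWith(string s, string prefix)
{
    if (!s.startsWith(prefix)) return false;
    if (prefix.length == s.length) return true;
    size_t i = prefix.length; // a code point boundary, since the prefix
                              // matched code unit for code unit
    return !isCombining(decode(s, i)); // "e" must not match "e\u0301"
}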

I'd like to have some numbers too about performance, but I have none at this time.


> It's all a matter of picking one's trade-offs. Clearly ASCII is out as no serious amount of non-English text can be trafficked without diacritics. So switching to UTF makes a lot of sense, and that's what D did.
> 
> When I introduced std.range and std.algorithm, they'd handle char[] and wchar[] no differently than any other array. A lot of algorithms simply did the wrong thing by default, so I attempted to fix that situation by defining byDchar(). So instead of passing some string str to an algorithm, one would pass byDchar(str).
> 
> A couple of weeks went by in testing that state of affairs, and before long I figured that I needed to insert byDchar() virtually _everywhere_. There were a couple of algorithms (e.g. Boyer-Moore) that happened to work with arrays for subtle reasons (needless to say, they won't work with graphemes at all). But by and large the situation was that the simple and intuitive code was wrong and that the correct code necessitated inserting byDchar().
> 
> So my next decision, which understandably some of the people who didn't go through the experiment may find unintuitive, was to make byDchar() the default. This cleaned up a lot of crap in std itself and saved a lot of crap in the yet-unwritten client code.

But were your algorithms *correct* in the first place? I'd argue that by making byDchar the default you've not saved yourself from any crap because dchar isn't the right layer of abstraction.


> I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point characters and can't imagine yourself having to deal with them on a semi-frequent basis. Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.


> Now, thanks to the effort people have spent in this group (thank you!), I have an understanding of the grapheme issue. I guarantee that grapheme-level iteration will incur a high cost: lost efficiency and changes throughout std. The languages that need composing characters for producing meaningful text are few and far between, so it makes sense to confine support for them to libraries that are not the default, unless we find ways to not disrupt everyone else.

We all are more aware of the problem now, that's a good thing. :-)


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 16, 2011
On 01/14/2011 08:20 PM, Nick Sabalausky wrote:
> "spir"<denis.spir@gmail.com>  wrote in message
> news:mailman.619.1295012086.4748.digitalmars-d@puremagic.com...
>>
>> If anyone finds a pointer to such an explanation, bravo, and thank you.
>> (You will certainly not find it in Unicode literature, for instance.)
>> Nick's explanation below is good and concise. (Just 2 notes added.)
>
> Yea, most Unicode explanations seem to talk all about "code-units vs
> code-points" and then they'll just have a brief note like "There's also
> other things like digraphs and combining codes." And that'll be all they
> mention.
>
> You're right about the Unicode literature. It's the usual standards-body
> documentation, same as W3C: "Instead of only some people understanding how
> this works, lets encode the documentation in legalese (and have twenty
> only-slightly-different versions) to make sure that nobody understands how
> it works."

If anyone is interested, ICU's documentation is far more readable (and intended for programmers). ICU is *the* reference library for dealing with Unicode (an IBM open source product, with C/C++/Java interfaces), used by many other products in the background.
ICU: http://site.icu-project.org/
user guide: http://userguide.icu-project.org/
section about text segmentation: http://userguide.icu-project.org/boundaryanalysis

Note that, just like Unicode, they consider forming graphemes (grouping codes into character representations) simply a particular case of text segmentation, which they call "boundary analysis" (but they have the nice idea of using "character" instead of "grapheme").

The only mention I found in ICU's docs of the issue we have discussed here at length is (at http://userguide.icu-project.org/strings):
"Handling Lengths, Indexes, and Offsets in Strings

The length of a string and all indexes and offsets related to the string are always counted in terms of UChar code units, not in terms of UChar32 code points. (This is the same as in common C library functions that use char * strings with multi-byte encodings.)

Often, a user thinks of a "character" as a complete unit in a language, like an 'Ä', while it may be represented with multiple Unicode code points including a base character and combining marks. (See the Unicode standard for details.) This often requires users to index and pass strings (UnicodeString or UChar *) with multiple code units or code points. It cannot be done with single-integer character types. Indexing of such "characters" is done with the BreakIterator class (in C: ubrk_ functions).

Even with such "higher-level" indexing functions, the actual index values will be expressed in terms of UChar code units. When more than one code unit is used at a time, the index value changes by more than one at a time. [...]"

(ICU's UChar is like D's wchar.)

>> You can also say there are 2 kinds of characters: simple like "u" &
>> composite like "ü" or "ü??". The former are coded with a single (base) code,
>> the latter with one (rarely more) base codes and an arbitrary number of
>> combining codes.
>
> Couple questions about the "more than one base codes":
>
> - Do you know an example offhand?

No. I know this only from its being mentioned in documentation. Unless we consider (see below) L jamo as base codes.

> - Does that mean like a ligature where the base codes form a single glyph,
> or does it mean that the combining code either spans or operates over
> multiple glyphs? Or can it go either way?

IIRC, examples like "ij" in Dutch are only considered "compatibility equivalent" to the corresponding ligatures, just like e.g. "ss" for "ß" in German. Meaning they should not be considered equal by default; this would be an additional feature, and a language- and application-dependent one. Unlike base "e" + combining "^", which really == "ê".

>> For a majority of _common_ characters made of 2 or 3 codes (Western
>> language letters, Korean Hangul syllables,...), precombined codes have
>> been added to the set. Thus, they can be coded with a single code like
>> simple characters.
>>
>
> Out of curiosity, how do decomposed Hangul characters work? (Or do you
> know?) Not actually knowing any Korean, my understanding is that they're a
> set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is
> it like a series of base codes that automatically combine, or are there
> combining characters involved?

I know nothing about the Korean language except what I studied about its writing system for Unicode algorithms (but one can also code said algorithms blindly). See http://en.wikipedia.org/wiki/Hangul and, about Hangul in Unicode, http://en.wikipedia.org/wiki/Korean_language_and_computers. What I understand (beware, these are just wild deductions) is that there are 3 kinds of "jamo" marks (noted L, V, T) that can combine into syllabic "graphemes", respectively in first, medial, and last position. These marks indeed roughly correspond to vowel or consonant phonemes.
In Unicode, in addition to such jamo, which are simple marks (like base letters and diacritics in Latin-based scripts), there are precombined codes for LV and LVT combinations (as for "ä" or "û"). We could thus think that Hangul syllables are limited to 3 jamo.
But: according to Unicode's official "grapheme cluster boundary" algorithm (read: how to group code points into characters) (http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries), codes for L jamo can also be followed by _and_ should be combined with other L, V, LV or LVT codes. Similarly, LV or V should be combined with a following V or T, and LVT or T with a following T. (Seems logical.) So, I do not know how complicated a Hangul syllable can be in practice or in theory.
If, in practice, whole syllables can follow schemes other than L / LV / LVT, then this is another example of real-language whole characters that cannot be coded as a single code point.
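
For the canonical L / LV / LVT case, at least, the precomposed code point is pure arithmetic. A sketch of the composition formula from chapter 3.12 of the Unicode standard:

// Canonical L+V(+T) composition only; it does not cover the extended
// sequences discussed above.
dchar composeHangul(dchar l, dchar v, dchar t = 0x11A7)
{
    enum SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    enum VCount = 21, TCount = 28;
    return cast(dchar)(SBase
        + ((l - LBase) * VCount + (v - VBase)) * TCount
        + (t - TBase));
}

unittest
{
    assert(composeHangul('\u1112', '\u1161', '\u11AB') == '\uD55C'); // 한
    assert(composeHangul('\u1100', '\u1161') == '\uAC00');           // 가
}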


Denis
_________________
vita es estrany
spir.wikidot.com

January 17, 2011
On 1/16/11 3:20 PM, Michel Fortin wrote:
> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> On 1/15/11 10:45 PM, Michel Fortin wrote:
>>> No doubt it's easier to implement it that way. The problem is that in
>>> most cases it won't be used. How many people really know what a
>>> grapheme is?
>>
>> How many people really should care?
>
> I think the only people who should *not* care are those who have
> validated that the input does not contain any combining code point. If
> you know the input *can't* contain combining code points, then it's safe
> to ignore them.

I agree. Now let me ask again: how many people really should care?

> If we don't make correct Unicode handling the default, someday someone
> is going to ask a developer to fix a problem where his system doesn't
> handle some text correctly. Later that day, he'll come to the
> realization that almost none of his D code and none of the D libraries
> he uses handle Unicode correctly, and he'll say: I can't fix this. His peer
> working on a similar Objective-C program will have a good laugh.
>
> Sure, correct Unicode handling is slower and more complicated to
> implement, but at least you know you'll get the right results.

I love the increased precision, but again I'm not sure how many people ever manipulate text with combining characters. Meanwhile they'll complain that D is slower than other languages.

>>> Of those, how many will forget to use byGrapheme at one time
>>> or another? And so in most programs string manipulation will misbehave
>>> in the presence of combining characters or unnormalized strings.
>>
>> But most strings don't contain combining characters or unnormalized
>> strings.
>
> I think we should expect combining marks to be used more and more as our
> OS text system and fonts start supporting them better. Them being rare
> might be true today, but what do you know about tomorrow?

I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.

> A few years ago, many Unicode symbols didn't even show up correctly on
> Windows. Today, we have Unicode domain names and people start putting
> funny symbols in them (for instance: <http://◉.ws>). I haven't seen it
> yet, but we'll surely see combining characters in domain names soon
> enough (if only as a way to make fun of programs that can't handle
> Unicode correctly). Well, let me be the first to make fun of such
> programs: <http://☺̭̏.michelf.com/>.

Would you bet the language on that?

> Also, not all combining characters are marks meant to be used by some
> foreign languages. Some are used for mathematics for instance. Or you
> could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay
> indicating some kind of prohibition.
>
>
>>> If you want to help D programmers write correct code when it comes to
>>> Unicode manipulation, you need to help them iterate on real characters
>>> (graphemes), and you need the algorithms to apply to real characters
>>> (graphemes), not the approximation of a Unicode character that is a code
>>> point.
>>
>> I don't think the situation is as clean cut, as grave, and as urgent
>> as you say.
>
> I agree it's probably not as clean cut as I say (I'm trying to keep
> complicated things simple here), but it's something important to decide
> early because the cost of changing it increases as more code is written.

Agreed.

> Quoting the first part of the same post (out of order):
>
>> Disagreement though that may be, a simple fact that needs to be taken
>> into account is that as of right now all of Phobos uses UTF arrays for
>> string representation and dchar as the element type.
>>
>> Besides, I for one do dispute the idea that a grapheme element is
>> better than a dchar element for iterating over a string. The grapheme
>> has the attractiveness of being theoretically clean but at the same
>> time is woefully inefficient and helps languages that few D users need
>> to work with. At least that's my perception, and we need some serious
>> numbers instead of convincing rhetoric to make a big decision.
>
> You'll no doubt get more performance from a grapheme-aware specialized
> algorithm working directly on code points than by iterating on graphemes
> returned as string slices. But both will give *correct* results.
>
> Implementing a specialized algorithm of this kind becomes an
> optimization, and it's likely you'll want an optimized version for most
> string algorithms.
>
> I'd like to have some numbers too about performance, but I have none at
> this time.

I spent a fair amount of time comparing ASCII vs. Unicode code speed. The fact of the matter is that the overhead is measurable and often high. Also it occurs at a very core level. For starters, the grapheme itself is larger and has one extra indirection. I am confident the marginal overhead for graphemes would be considerable.
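
Rough arithmetic, assuming the slice-based Grapheme sketched earlier:

struct Grapheme { const(char)[] rep; } // slice-backed, as sketched above

static assert(dchar.sizeof == 4);
// On 64-bit, a slice alone is 16 bytes (length + pointer), and reading
// the code units costs an indirection that a plain dchar never pays.
static assert(Grapheme.sizeof == (const(char)[]).sizeof);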

>> It's all a matter of picking one's trade-offs. Clearly ASCII is out as
>> no serious amount of non-English text can be trafficked without
>> diacritics. So switching to UTF makes a lot of sense, and that's what
>> D did.
>>
>> When I introduced std.range and std.algorithm, they'd handle char[]
>> and wchar[] no differently than any other array. A lot of algorithms
>> simply did the wrong thing by default, so I attempted to fix that
>> situation by defining byDchar(). So instead of passing some string str
>> to an algorithm, one would pass byDchar(str).
>>
>> A couple of weeks went by in testing that state of affairs, and before
>> long I figured that I needed to insert byDchar() virtually _everywhere_.
>> There were a couple of algorithms (e.g. Boyer-Moore) that happened to
>> work with arrays for subtle reasons (needless to say, they won't work
>> with graphemes at all). But by and large the situation was that the
>> simple and intuitive code was wrong and that the correct code
>> necessitated inserting byDchar().
>>
>> So my next decision, which understandably some of the people who
>> didn't go through the experiment may find unintuitive, was to make
>> byDchar() the default. This cleaned up a lot of crap in std itself and
>> saved a lot of crap in the yet-unwritten client code.
>
> But were your algorithms *correct* in the first place? I'd argue that by
> making byDchar the default you've not saved yourself from any crap
> because dchar isn't the right layer of abstraction.

It was correct for all but a couple languages. Again: most of today's languages don't ever need combining characters.

>> I think it's reasonable to understand why I'm happy with the current
>> state of affairs. It is better than anything we've had before and
>> better than everything else I've tried.
>
> It is indeed easy to understand why you're happy with the current state
> of affairs: you never had to deal with multi-code-point characters and
> can't imagine yourself having to deal with them on a semi-frequent
> basis.

Do you, and can you?

> Other people won't be so happy with this state of affairs, but
> they'll probably notice only after most of their code has been written
> unaware of the problem.

They can't be unaware and write said code.

>> Now, thanks to the effort people have spent in this group (thank
>> you!), I have an understanding of the grapheme issue. I guarantee that
>> grapheme-level iteration will incur a high cost: lost efficiency and
>> changes throughout std. The languages that need composing
>> characters for producing meaningful text are few and far between, so
>> it makes sense to confine support for them to libraries that are not
>> the default, unless we find ways to not disrupt everyone else.
>
> We all are more aware of the problem now, that's a good thing. :-)

All I wish is that it's not blown out of proportion. It fares rather low on my list of library issues that D has right now.


Andrei
January 17, 2011
On 17.01.2011 00:58, Andrei Alexandrescu wrote:
> On 1/16/11 3:20 PM, Michel Fortin wrote:
>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> said:
>>> But most strings don't contain combining characters or unnormalized
>>> strings.
>>
>> I think we should expect combining marks to be used more and more as our
>> OS text system and fonts start supporting them better. Them being rare
>> might be true today, but what do you know about tomorrow?
>
> I don't think languages will acquire more diacritics soon. I do hope, of
> course, that D applications gain more usage in the Arabic, Hebrew etc.
> world.
>

So why does D use Unicode anyway?
If you don't care about rarely used languages anyway, you could have used UCS-2 like Java. Or plain 8-bit ISO-8859-* (the user can decide which encoding he wants/needs).

You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has less overhead anyway".


>>> I think it's reasonable to understand why I'm happy with the current
>>> state of affairs. It is better than anything we've had before and
>>> better than everything else I've tried.
>>
>> It is indeed easy to understand why you're happy with the current state
>> of affairs: you never had to deal with multi-code-point characters and
>> can't imagine yourself having to deal with them on a semi-frequent
>> basis.
>
> Do you, and can you?
>
>> Other people won't be so happy with this state of affairs, but
>> they'll probably notice only after most of their code has been written
>> unaware of the problem.
>
> They can't be unaware and write said code.
>

Fun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics.

I think especially when you're handling names (and much software does, I think) it's crucial to have proper support for all kinds of chars.
Of course many programmers are not aware that just because Umlaute and ß work, it doesn't mean that all other kinds of strange characters work as well.


Cheers,
- Daniel



January 17, 2011
On 1/16/11 6:42 PM, Daniel Gibson wrote:
> On 17.01.2011 00:58, Andrei Alexandrescu wrote:
>> On 1/16/11 3:20 PM, Michel Fortin wrote:
>>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail@erdani.org> said:
>>>> But most strings don't contain combining characters or unnormalized
>>>> strings.
>>>
>>> I think we should expect combining marks to be used more and more as our
>>> OS text system and fonts start supporting them better. Them being rare
>>> might be true today, but what do you know about tomorrow?
>>
>> I don't think languages will acquire more diacritics soon. I do hope, of
>> course, that D applications gain more usage in the Arabic, Hebrew etc.
>> world.
>>
>
> So why does D use Unicode anyway?
> If you don't care about rarely used languages anyway, you could have
> used UCS-2 like Java. Or plain 8-bit ISO-8859-* (the user can decide
> which encoding he wants/needs).
>
> You could as well say "we don't need to use dchar to represent a proper
> code point, wchar is enough for most use cases and has less overhead
> anyway".

I consider UTF8 superior to all of the above.

>>>> I think it's reasonable to understand why I'm happy with the current
>>>> state of affairs. It is better than anything we've had before and
>>>> better than everything else I've tried.
>>>
>>> It is indeed easy to understand why you're happy with the current state
>>> of affairs: you never had to deal with multi-code-point characters and
>>> can't imagine yourself having to deal with them on a semi-frequent
>>> basis.
>>
>> Do you, and can you?
>>
>>> Other people won't be so happy with this state of affairs, but
>>> they'll probably notice only after most of their code has been written
>>> unaware of the problem.
>>
>> They can't be unaware and write said code.
>>
>
> Fun fact: Germany recently introduced a new ID card and some of the
> software that was developed for this and is used in some record sections
> fucks up when a name contains diacritics.
>
> I think especially when you're handling names (and much software does, I
> think) it's crucial to have proper support for all kinds of chars.
> Of course many programmers are not aware that just because Umlaute and ß
> work, it doesn't mean that all other kinds of strange characters work as well.
>
>
> Cheers,
> - Daniel

I think German text works well with dchar.


Andrei
January 17, 2011
On Sunday 16 January 2011 18:45:26 Andrei Alexandrescu wrote:
> On 1/16/11 6:42 PM, Daniel Gibson wrote:
> > On 17.01.2011 00:58, Andrei Alexandrescu wrote:
> >> On 1/16/11 3:20 PM, Michel Fortin wrote:
> >>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
> >>> 
> >>> <SeeWebsiteForEmail@erdani.org> said:
> >>>> But most strings don't contain combining characters or unnormalized strings.
> >>> 
> >>> I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?
> >> 
> >> I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.
> > 
> > So why does D use Unicode anyway?
> > If you don't care about rarely used languages anyway, you could have
> > used UCS-2 like Java. Or plain 8-bit ISO-8859-* (the user can decide
> > which encoding he wants/needs).
> > 
> > You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has less overhead anyway".
> 
> I consider UTF8 superior to all of the above.
> 
> >>>> I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried.
> >>> 
> >>> It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point characters and can't imagine yourself having to deal with them on a semi-frequent basis.
> >> 
> >> Do you, and can you?
> >> 
> >>> Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.
> >> 
> >> They can't be unaware and write said code.
> > 
> > Fun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics.
> > 
> > I think especially when you're handling names (and much software does, I
> > think) it's crucial to have proper support for all kinds of chars.
> > Of course many programmers are not aware that just because Umlaute and ß
> > work, it doesn't mean that all other kinds of strange characters work as well.
> > 
> > 
> > Cheers,
> > - Daniel
> 
> I think German text works well with dchar.

I think that whether dchar will be enough will depend primarily on where the Unicode is coming from and what the programmer is doing with it. There's plenty which will just work regardless of whether code points are pre-combined or not, and there's other stuff which will have subtle bugs if they're not pre-combined.

For the most part, Western languages should have pre-combined characters, but whether a program sees them in combined form or not will depend on where the text comes from. If it comes from a file, then it all depends on the program which wrote the file. If it comes from the console, then it depends on what that console does. If it comes from a socket or pipe or whatnot, then it depends on whatever program is sending the data.

So, the question becomes: what is the norm? Are Unicode characters normally pre-combined or left as separate code points? The majority of English text will be fine regardless, since English only uses accented characters and the like when including foreign words, but most any other European language will have accented characters and then it's an open question. If it's more likely that a D program will receive pre-combined characters than not, then many programs will likely be safe treating a code point as a character. But if the odds are high that a D program will receive characters which are not yet combined, then certain sets of text will invariably result in bugs in your average D program.
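
The failure mode in miniature: the same character in precomposed and decomposed form compares unequal by code points.

void main()
{
    string nfc = "\u00E9";  // 'é', precomposed (NFC)
    string nfd = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT (NFD)
    assert(nfc != nfd);     // code-point comparison says "different",
    // though the two are canonically equivalent; only a normalization-
    // or grapheme-aware comparison would call them equal
}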

I don't think that there's much question that from a performance standpoint and from the standpoint of trying to avoid breaking TDPL and a lot of pre-existing code, we should continue to treat a code point - a dchar - as an abstract character. Moving to graphemes could really harm performance - and there _are_ plenty of programs that couldn't care less about Unicode. However, it's quite clear that in a number of circumstances, that's going to result in buggy code. The question then is whether it's okay to take a performance hit just to correctly handle Unicode. And I expect that a _lot_ of people are going to say no to that.

D already does better at handling Unicode than many other languages, so it's definitely a step up as it is. The cost of handling Unicode completely correctly is quite high from the sounds of it - all of a sudden you're effectively (if not literally) dealing with arrays of arrays instead of arrays. So, I think that it's a viable option to say that the default path that D will take is the _mostly_ correct but still reasonably efficient path, and then - through 3rd party libraries or possibly even with a module in Phobos - we'll provide a means to handle Unicode 100% correctly for those who really care.

At minimum, we need the tools to handle Unicode correctly, but if we can't handle it both correctly and efficiently, then I'm afraid that it's just not going to be reasonable to handle it correctly - especially if we can handle it _almost_ correctly and still be efficient.

Regardless, the real question is how likely a D program is to deal with Unicode which is not pre-combined. If the odds are relatively low in the general case, then sticking to dchar should be fine. But if the odds are relatively high, then not going to graphemes could mean that there will be a _lot_ of buggy D programs out there.

- Jonathan M Davis