January 16, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/15/11 9:25 PM, Jonathan M Davis wrote:
> Considering that strings are already dealt with specially in order to have an
> element of dchar, I wouldn't think that it would be all that disruptive to make
> it so that they had an element type of Grapheme instead. Wouldn't that then fix
> all of std.algorithm and the like without really disrupting anything?

It would make everything related a lot (a TON) slower, and it would 
break all client code that uses dchar as the element type, or is 
otherwise unprepared to use Graphemes explicitly. There is no question 
there will be disruption.

Andrei
January 16, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
And how would 3rd party libraries handle Graphemes? And C modules? I
think making Graphemes the default would make quite a mess,
since you would have to convert back and forth between char[] and
Grapheme[] all the time (right?).
January 16, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/15/11 10:45 PM, Michel Fortin wrote:
> On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> I'm unclear on where this is converging to. At this point the
>> commitment of the language and its standard library to (a) UTF array
>> representation and (b) code points conceptualization is quite strong.
>> Changing that would be quite difficult and disruptive, and the
>> benefits are virtually nonexistent for most of D's user base.
>
> There's still a disagreement about whether a string or a code unit array
> should be the default string representation, and whether iterating on a
> code unit array should give you code unit or grapheme elements. Of those
> who participated in the discussion, I don't think anyone is
> disputing the idea that a grapheme element is better than a dchar
> element for iterating over a string.

Disagreement as that might be, a simple fact that needs to be taken into 
account is that as of right now all of Phobos uses UTF arrays for string 
representation and dchar as element type.

Besides, I for one do dispute the idea that a grapheme element is better 
than a dchar element for iterating over a string. The grapheme has the 
attractiveness of being theoretically clean but at the same time is 
woefully inefficient and helps languages that few D users need to work 
with. At least that's my perception, and we need some serious numbers 
instead of convincing rhetoric to make a big decision.

It's all a matter of picking one's trade-offs. Clearly ASCII is out as 
no serious amount of non-English text can be trafficked without 
diacritics. So switching to UTF makes a lot of sense, and that's what D did.

When I introduced std.range and std.algorithm, they'd handle char[] and 
wchar[] no differently than any other array. A lot of algorithms simply 
did the wrong thing by default, so I attempted to fix that situation by 
defining byDchar(). So instead of passing some string str to an 
algorithm, one would pass byDchar(str).
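
For illustration, roughly what such an adapter could look like (a 
sketch only, not the actual Phobos code; it leans on std.utf.decode, 
and a real version would also cover wchar[]):

import std.utf : decode;

struct ByDchar(S)
{
    private S str;
    private size_t idx;

    @property bool empty() { return idx >= str.length; }

    @property dchar front()
    {
        auto i = idx;
        return decode(str, i);  // decode one code point; idx stays put
    }

    void popFront()
    {
        decode(str, idx);       // decode advances idx past the code point
    }
}

auto byDchar(S)(S s) { return ByDchar!S(s); }

So instead of find(str, needle), one would write find(byDchar(str), needle).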

A couple of weeks went by in testing that state of affairs, and before 
long I figured that I needed to insert byDchar() virtually _everywhere_. 
There were a couple of algorithms (e.g. Boyer-Moore) that happened to 
work with arrays for subtle reasons (needless to say, they won't work 
with graphemes at all). But by and large the situation was that the 
simple and intuitive code was wrong and that the correct code 
necessitated inserting byDchar().

So my next decision, which understandably some of the people who didn't 
go through the experiment may find unintuitive, was to make byDchar() 
the default. This cleaned up a lot of crap in std itself and saved a lot 
of crap in the yet-unwritten client code.

I think it's reasonable to understand why I'm happy with the current 
state of affairs. It is better than anything we've had before and better 
than everything else I've tried.

Now, thanks to the effort people have spent in this group (thank you!), 
I have an understanding of the grapheme issue. I guarantee that 
grapheme-level iteration will incur a high cost, both in efficiency and 
in changes to std. The languages that need composing 
characters for producing meaningful text are few and far between, so it 
makes sense to confine support for them to libraries that are not the 
default, unless we find ways to not disrupt everyone else.

>> It may be more realistic to consider using what we have as back-end
>> for grapheme-oriented processing.
>> For example:
>>
>> struct Grapheme(Char) if (isSomeChar!Char)
>> {
>>     private const Char[] rep;
>>     ...
>> }
>>
>> auto byGrapheme(S)(S s) if (isSomeString!S)
>> {
>>     ...
>> }
>>
>> string s = "Hello";
>> foreach (g; byGrapheme(s))
>> {
>>     ...
>> }
>
> No doubt it's easier to implement it that way. The problem is that in
> most cases it won't be used. How many people really know what a
> grapheme is?

How many people really should care?

> Of those, how many will forget to use byGrapheme at one time
> or another? And so in most programs string manipulation will misbehave
> in the presence of combining characters or unnormalized strings.

But most strings don't contain combining characters or unnormalized strings.

> If you want to help D programmers write correct code when it comes to
> Unicode manipulation, you need to help them iterate on real characters
> (graphemes), and you need the algorithms to apply to real characters
> (graphemes), not the approximation of a Unicode character that is a code
> point.

I don't think the situation is as clean cut, as grave, and as urgent as 
you say.


Andrei
January 16, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/15/11 10:47 PM, Michel Fortin wrote:
> On 2011-01-15 22:25:47 -0500, Jonathan M Davis <jmdavisProg@gmx.com> said:
>
>> The issue of foreach remains, but without being willing to change what
>> foreach defaults to, you can't really fix it - though I'd suggest that
>> we at least make it a warning to iterate over strings without
>> specifying the type. And if foreach were made to understand Grapheme
>> like it understands dchar, then you could do
>>
>> foreach(Grapheme g; str) { ... }
>>
>> and have the compiler warn about
>>
>> foreach(g; str) { ... }
>>
>> and tell you to use Grapheme if you want to be comparing actual
>> characters.
>
> Walter's argument against changing this for foreach was that it'd
> *silently* break compatibility with existing D1 code. Changing the
> default to a grapheme makes this argument obsolete: since a grapheme is
> essentially a string, you can't compare it with char or wchar or dchar
> directly, so it'll break at compile time with an error and you'll have
> to decide what to do.
>
> So Walter would have to find another argument to defend the status quo.

I think it's poor abstraction to represent a Grapheme as a string. It 
should be its own type.
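
Something along these lines, say (just a sketch; the point is that 
client code can't silently compare it against a char or a string):

struct Grapheme
{
    private dchar[] codePoints;  // base character plus combining marks

    // grapheme-to-grapheme comparison only; comparing against char,
    // wchar, or dchar fails at compile time instead of silently lying
    bool opEquals(ref const Grapheme rhs) const
    {
        return codePoints == rhs.codePoints;
    }
}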

Andrei
January 16, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail@erdani.org> said:

> On 1/15/11 10:45 PM, Michel Fortin wrote:
>> No doubt it's easier to implement it that way. The problem is that in
>> most cases it won't be used. How many people really know what a
>> grapheme is?
> 
> How many people really should care?

I think the only people who should *not* care are those who have 
validated that the input does not contain any combining code point. If 
you know the input *can't* contain combining code points, then it's 
safe to ignore them.
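
That validation is cheap to express, by the way. A sketch (it 
approximates "combining code point" as "non-zero canonical combining 
class", and assumes a combiningClass lookup like the one std.uni 
provides):

import std.algorithm : any;
import std.uni : combiningClass;

// true if the string contains any combining code point
bool hasCombining(string s)
{
    // iterating a string yields decoded dchars here, per the default
    return s.any!(c => combiningClass(c) != 0);
}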

If we don't make correct Unicode handling the default, someday someone 
is going to ask a developer to fix a problem where his system doesn't 
handle some text correctly. Later that day, he'll come to the 
realization that almost none of his D code and none of the D libraries 
he uses handle Unicode correctly, and he'll say: can't fix this. His 
peer working on a similar Objective-C program will have a good laugh.

Sure, correct Unicode handling is slower and more complicated to 
implement, but at least you know you'll get the right results.


>> Of those, how many will forget to use byGrapheme at one time
>> or another? And so in most programs string manipulation will misbehave
>> in the presence of combining characters or unnormalized strings.
> 
> But most strings don't contain combining characters or unnormalized strings.

I think we should expect combining marks to be used more and more as 
our OS text system and fonts start supporting them better. Them being 
rare might be true today, but what do you know about tomorrow?

A few years ago, many Unicode symbols didn't even show up correctly on 
Windows. Today, we have Unicode domain names and people start putting 
funny symbols in them (for instance: <http://◉.ws>). I haven't seen it 
yet, but we'll surely see combining characters in domain names soon 
enough (if only as a way to make fun of programs that can't handle 
Unicode correctly). Well, let me be the first to make fun of such 
programs: <http://☺̭̏.michelf.com/>.

Also, not all combining characters are marks meant to be used by some 
foreign languages. Some are used for mathematics for instance. Or you 
could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay 
indicating some kind of prohibition.
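
For instance (two code points, one visible character):

import std.stdio;

void main()
{
    // 'P' overlaid with U+20E0 COMBINING ENCLOSING CIRCLE BACKSLASH
    string noParking = "P\u20E0";
    writeln(noParking);         // renders as one character: 'P' in the overlay
    writeln(noParking.length);  // 4 code units: 1 for 'P', 3 for U+20E0
}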


>> If you want to help D programmers write correct code when it comes to
>> Unicode manipulation, you need to help them iterate on real characters
>> (graphemes), and you need the algorithms to apply to real characters
>> (graphemes), not the approximation of a Unicode character that is a code
>> point.
> 
> I don't think the situation is as clean cut, as grave, and as urgent as 
> you say.

I agree it's probably not as clean cut as I say (I'm trying to keep 
complicated things simple here), but it's something important to decide 
early because the cost of changing it increases as more code is written.


Quoting the first part of the same post (out of order):

> Disagreement as that might be, a simple fact that needs to be taken 
> into account is that as of right now all of Phobos uses UTF arrays for 
> string representation and dchar as element type.
> 
> Besides, I for one do dispute the idea that a grapheme element is 
> better than a dchar element for iterating over a string. The grapheme 
> has the attractiveness of being theoretically clean but at the same 
> time is woefully inefficient and helps languages that few D users need 
> to work with. At least that's my perception, and we need some serious 
> numbers instead of convincing rhetoric to make a big decision.

You'll no doubt get more performance from a grapheme-aware specialized 
algorithm working directly on code points than by iterating on 
graphemes returned as string slices. But both will give *correct* 
results.

Implementing a specialized algorithm of this kind becomes an 
optimization, and it's likely you'll want an optimized version for most 
string algorithms.

I'd like to have some numbers too about performance, but I have none at 
this time.


> It's all a matter of picking one's trade-offs. Clearly ASCII is out as 
> no serious amount of non-English text can be trafficked without 
> diacritics. So switching to UTF makes a lot of sense, and that's what D 
> did.
> 
> When I introduced std.range and std.algorithm, they'd handle char[] and 
> wchar[] no differently than any other array. A lot of algorithms simply 
> did the wrong thing by default, so I attempted to fix that situation by 
> defining byDchar(). So instead of passing some string str to an 
> algorithm, one would pass byDchar(str).
> 
> A couple of weeks went by in testing that state of affairs, and before 
> long I figured that I needed to insert byDchar() virtually _everywhere_. 
> There were a couple of algorithms (e.g. Boyer-Moore) that happened to 
> work with arrays for subtle reasons (needless to say, they won't work 
> with graphemes at all). But by and large the situation was that the 
> simple and intuitive code was wrong and that the correct code 
> necessitated inserting byDchar().
> 
> So my next decision, which understandably some of the people who didn't 
> go through the experiment may find unintuitive, was to make byDchar() 
> the default. This cleaned up a lot of crap in std itself and saved a 
> lot of crap in the yet-unwritten client code.

But were your algorithms *correct* in the first place? I'd argue that 
by making byDchar the default you've not saved yourself from any crap 
because dchar isn't the right layer of abstraction.


> I think it's reasonable to understand why I'm happy with the current 
> state of affairs. It is better than anything we've had before and 
> better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state 
of affairs: you never had to deal with multi-code-point characters and 
can't imagine yourself having to deal with them on a semi-frequent 
basis. Other people won't be so happy with this state of affairs, but 
they'll probably notice only after most of their code has been written 
unaware of the problem.


> Now, thanks to the effort people have spent in this group (thank you!), 
> I have an understanding of the grapheme issue. I guarantee that 
> grapheme-level iteration will incur a high cost, both in efficiency and 
> in changes to std. The languages that need composing 
> characters for producing meaningful text are few and far between, so it 
> makes sense to confine support for them to libraries that are not the 
> default, unless we find ways to not disrupt everyone else.

We all are more aware of the problem now, that's a good thing. :-)


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 16, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/14/2011 08:20 PM, Nick Sabalausky wrote:
> "spir"<denis.spir@gmail.com>  wrote in message
> news:mailman.619.1295012086.4748.digitalmars-d@puremagic.com...
>>
>> If anyone finds a pointer to such an explanation, bravo, and thank you.
>> (You will certainly not find it in Unicode literature, for instance.)
>> Nick's explanation below is good and concise. (Just 2 notes added.)
>
> Yea, most Unicode explanations seem to talk all about "code-units vs
> code-points" and then they'll just have a brief note like "There's also
> other things like digraphs and combining codes." And that'll be all they
> mention.
>
> You're right about the Unicode literature. It's the usual standards-body
> documentation, same as W3C: "Instead of only some people understanding how
> this works, let's encode the documentation in legalese (and have twenty
> only-slightly-different versions) to make sure that nobody understands how
> it works."

If anyone is interested, ICU's documentation is far more readable (and 
intended for programmers). ICU is *the* reference library for dealing 
with unicode (an IBM open source product, with C/C++/Java interfaces), 
used by many other products in the background.
ICU: http://site.icu-project.org/
user guide: http://userguide.icu-project.org/
section about text segmentation: 
http://userguide.icu-project.org/boundaryanalysis

Note that, just like Unicode, they consider forming graphemes (grouping 
codes into character representations) simply a particular case of text 
segmentation, which they call "boundary analysis" (but they have the 
nice idea of using "character" instead of "grapheme").

The only mention I found in ICU's docs of the issue we have discussed at 
length here is (at http://userguide.icu-project.org/strings):
"Handling Lengths, Indexes, and Offsets in Strings

The length of a string and all indexes and offsets related to the string 
are always counted in terms of UChar code units, not in terms of UChar32 
code points. (This is the same as in common C library functions that use 
char * strings with multi-byte encodings.)

Often, a user thinks of a "character" as a complete unit in a language, 
like an 'Ä', while it may be represented with multiple Unicode code 
points including a base character and combining marks. (See the Unicode 
standard for details.) This often requires users to index and pass 
strings (UnicodeString or UChar *) with multiple code units or code 
points. It cannot be done with single-integer character types. Indexing 
of such "characters" is done with the BreakIterator class (in C: ubrk_ 
functions).

Even with such "higher-level" indexing functions, the actual index 
values will be expressed in terms of UChar code units. When more than 
one code unit is used at a time, the index value changes by more than 
one at a time. [...]"

(ICU's UChar is like D's wchar.)

>> You can also say there are 2 kinds of characters: simple like "u" &
>> composite like "ü" (possibly with several combining marks). The former
>> are coded with a single (base) code, the latter with one (rarely more)
>> base codes and an arbitrary number of combining codes.
>
> Couple questions about the "more than one base codes":
>
> - Do you know an example offhand?

No. I know this only from its being mentioned in documentation. Unless 
we consider (see below) L jamo as base codes.

> - Does that mean like a ligature where the base codes form a single glyph,
> or does it mean that the combining code either spans or operates over
> multiple glyphs? Or can it go either way?

IIRC examples like ij in Dutch are only considered "compatibility 
equivalent" to the corresponding ligatures, just like e.g. "ss" for "ß" 
in German. Meaning they should not be considered equal by default (this 
would be an additional feature, and language- and app-dependent). Unlike 
base "e" + combining "^", which really == "ê".

>> For a majority of _common_ characters made of 2 or 3 codes (western
>> language letters, korean Hangul syllables,...), precombined codes have
>> been added to the set. Thus, they can be coded with a single code like
>> simple characters.
>>
>
> Out of curiosity, how do decomposed Hangul characters work? (Or do you
> know?) Not actually knowing any Korean, my understanding is that they're a
> set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is
> it like a series of base codes that automatically combine, or are there
> combining characters involved?

I know nothing about Korean language except what I studied about its 
scripting system for Unicode algorithms (but one can also code said 
algorithm blindly). See http://en.wikipedia.org/wiki/Hangul and about 
Hangul in Unicode 
http://en.wikipedia.org/wiki/Korean_language_and_computers. What I 
understand (beware, it's just wild deduction) is that there are 3 kinds 
of "jamo" scripting marks (noted L, V, T) that can combine into syllabic 
"graphemes", respectively in first, medial, and last place. These marks 
somehow correspond to vowel or consonant phonemes.
In Unicode, in addition to such jamo, which are simple marks (like base 
letters and diacritics in latin-based languages), there are precombined 
codes for LV and LVT combinations (like for "ä" or "û"). We could thus 
think that Hangul syllables are limited to 3 jamo.
But: according to Unicode's official "grapheme cluster break" algorithm 
(read: how to group code points into characters) 
(http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries), codes 
for L jamo can also be followed by _and_ should be combined with other 
L, V, LV or LVT codes. Similarly, LV or V should be combined with a 
following V or T, and LVT or T with T. (Seems logical.) So, I do not 
know how complicated a Hangul syllable can be in practice or in theory.
If, in practice, whole syllables can follow schemes other than L / LV / 
LVT, then this is another example of real-language whole characters that 
cannot be coded by a single code point.
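
For the simple LV case the difference is easy to see (same syllable, 
two encodings):

import std.stdio;

void main()
{
    string precomposed = "\uAC00";        // 가 as one precombined LV code point
    string decomposed  = "\u1100\u1161";  // ᄀ (L jamo) + ᅡ (V jamo)
    writeln(precomposed == decomposed);   // false at the code unit level
}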


Denis
_________________
vita es estrany
spir.wikidot.com
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/16/11 3:20 PM, Michel Fortin wrote:
> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> On 1/15/11 10:45 PM, Michel Fortin wrote:
>>> No doubt it's easier to implement it that way. The problem is that in
>>> most cases it won't be used. How many people really know what a
>>> grapheme is?
>>
>> How many people really should care?
>
> I think the only people who should *not* care are those who have
> validated that the input does not contain any combining code point. If
> you know the input *can't* contain combining code points, then it's safe
> to ignore them.

I agree. Now let me ask again: how many people really should care?

> If we don't make correct Unicode handling the default, someday someone
> is going to ask a developer to fix a problem where his system doesn't
> handle some text correctly. Later that day, he'll come to the
> realization that almost none of his D code and none of the D libraries
> he uses handle Unicode correctly, and he'll say: can't fix this. His peer
> working on a similar Objective-C program will have a good laugh.
>
> Sure, correct Unicode handling is slower and more complicated to
> implement, but at least you know you'll get the right results.

I love the increased precision, but again I'm not sure how many people 
ever manipulate text with combining characters. Meanwhile they'll 
complain that D is slower than other languages.

>>> Of those, how many will forget to use byGrapheme at one time
>>> or another? And so in most programs string manipulation will misbehave
>>> in the presence of combining characters or unnormalized strings.
>>
>> But most strings don't contain combining characters or unnormalized
>> strings.
>
> I think we should expect combining marks to be used more and more as our
> OS text system and fonts start supporting them better. Them being rare
> might be true today, but what do you know about tomorrow?

I don't think languages will acquire more diacritics soon. I do hope, of 
course, that D applications gain more usage in the Arabic, Hebrew etc. 
world.

> A few years ago, many Unicode symbols didn't even show up correctly on
> Windows. Today, we have Unicode domain names and people start putting
> funny symbols in them (for instance: <http://◉.ws>). I haven't seen it
> yet, but we'll surely see combining characters in domain names soon
> enough (if only as a way to make fun of programs that can't handle
> Unicode correctly). Well, let me be the first to make fun of such
> programs: <http://☺̭̏.michelf.com/>.

Would you bet the language on that?

> Also, not all combining characters are marks meant to be used by some
> foreign languages. Some are used for mathematics for instance. Or you
> could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay
> indicating some kind of prohibition.
>
>
>>> If you want to help D programmers write correct code when it comes to
>>> Unicode manipulation, you need to help them iterate on real characters
>>> (graphemes), and you need the algorithms to apply to real characters
>>> (graphemes), not the approximation of a Unicode character that is a code
>>> point.
>>
>> I don't think the situation is as clean cut, as grave, and as urgent
>> as you say.
>
> I agree it's probably not as clean cut as I say (I'm trying to keep
> complicated things simple here), but it's something important to decide
> early because the cost of changing it increases as more code is written.

Agreed.

> Quoting the first part of the same post (out of order):
>
>> Disagreement as that might be, a simple fact that needs to be taken
>> into account is that as of right now all of Phobos uses UTF arrays for
>> string representation and dchar as element type.
>>
>> Besides, I for one do dispute the idea that a grapheme element is
>> better than a dchar element for iterating over a string. The grapheme
>> has the attractiveness of being theoretically clean but at the same
>> time is woefully inefficient and helps languages that few D users need
>> to work with. At least that's my perception, and we need some serious
>> numbers instead of convincing rhetoric to make a big decision.
>
> You'll no doubt get more performance from a grapheme-aware specialized
> algorithm working directly on code points than by iterating on graphemes
> returned as string slices. But both will give *correct* results.
>
> Implementing a specialized algorithm of this kind becomes an
> optimization, and it's likely you'll want an optimized version for most
> string algorithms.
>
> I'd like to have some numbers too about performance, but I have none at
> this time.

I spent a fair amount of time comparing ASCII vs. Unicode processing speed. 
The fact of the matter is that the overhead is measurable and often 
high. Also it occurs at a very core level. For starters, the grapheme 
itself is larger and has one extra indirection. I am confident the 
marginal overhead for graphemes would be considerable.
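
The core of that comparison is easy to reproduce; a sketch (numbers 
depend on machine and input; StopWatch here is the one from 
std.datetime.stopwatch; a grapheme pass would add segmentation on top 
of the decoding and only widen the gap):

import std.array : replicate;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio;
import std.utf : decode;

void main()
{
    auto text = "Hello, wörld! ".replicate(100_000);
    size_t sink;

    // pass 1: raw code units, no decoding
    auto sw = StopWatch(AutoStart.yes);
    foreach (char c; text)
        sink += c;
    writeln("code units:  ", sw.peek);

    // pass 2: decode to code points (what the dchar default does)
    sw.reset();
    for (size_t i = 0; i < text.length; )
        sink += decode(text, i);
    writeln("code points: ", sw.peek);

    writeln(sink);  // keep the loops from being optimized away
}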

>> It's all a matter of picking one's trade-offs. Clearly ASCII is out as
>> no serious amount of non-English text can be trafficked without
>> diacritics. So switching to UTF makes a lot of sense, and that's what
>> D did.
>>
>> When I introduced std.range and std.algorithm, they'd handle char[]
>> and wchar[] no differently than any other array. A lot of algorithms
>> simply did the wrong thing by default, so I attempted to fix that
>> situation by defining byDchar(). So instead of passing some string str
>> to an algorithm, one would pass byDchar(str).
>>
>> A couple of weeks went by in testing that state of affairs, and before
>> long I figured that I needed to insert byDchar() virtually _everywhere_.
>> There were a couple of algorithms (e.g. Boyer-Moore) that happened to
>> work with arrays for subtle reasons (needless to say, they won't work
>> with graphemes at all). But by and large the situation was that the
>> simple and intuitive code was wrong and that the correct code
>> necessitated inserting byDchar().
>>
>> So my next decision, which understandably some of the people who
>> didn't go through the experiment may find unintuitive, was to make
>> byDchar() the default. This cleaned up a lot of crap in std itself and
>> saved a lot of crap in the yet-unwritten client code.
>
> But were your algorithms *correct* in the first place? I'd argue that by
> making byDchar the default you've not saved yourself from any crap
> because dchar isn't the right layer of abstraction.

It was correct for all but a couple languages. Again: most of today's 
languages don't ever need combining characters.

>> I think it's reasonable to understand why I'm happy with the current
>> state of affairs. It is better than anything we've had before and
>> better than everything else I've tried.
>
> It is indeed easy to understand why you're happy with the current state
> of affairs: you never had to deal with multi-code-point characters and
> can't imagine yourself having to deal with them on a semi-frequent
> basis.

Do you, and can you?

> Other people won't be so happy with this state of affairs, but
> they'll probably notice only after most of their code has been written
> unaware of the problem.

They can't be unaware and write said code.

>> Now, thanks to the effort people have spent in this group (thank
>> you!), I have an understanding of the grapheme issue. I guarantee that
>> grapheme-level iteration will incur a high cost, both in efficiency
>> and in changes to std. The languages that need composing
>> characters for producing meaningful text are few and far between, so
>> it makes sense to confine support for them to libraries that are not
>> the default, unless we find ways to not disrupt everyone else.
>
> We all are more aware of the problem now, that's a good thing. :-)

All I wish is it's not blown out of proportion. It fares rather low on 
my list of library issues that D has right now.


Andrei
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 17.01.2011 00:58, Andrei Alexandrescu wrote:
> On 1/16/11 3:20 PM, Michel Fortin wrote:
>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> said:
>>> But most strings don't contain combining characters or unnormalized
>>> strings.
>>
>> I think we should expect combining marks to be used more and more as our
>> OS text system and fonts start supporting them better. Them being rare
>> might be true today, but what do you know about tomorrow?
>
> I don't think languages will acquire more diacritics soon. I do hope, of
> course, that D applications gain more usage in the Arabic, Hebrew etc.
> world.
>

So why does D use unicode anyway?
If you don't care about not-often used languages anyway, you could have 
used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide 
which encoding he wants/needs).

You could as well say "we don't need to use dchar to represent a proper 
code point, wchar is enough for most use cases and has less overhead 
anyway".


>>> I think it's reasonable to understand why I'm happy with the current
>>> state of affairs. It is better than anything we've had before and
>>> better than everything else I've tried.
>>
>> It is indeed easy to understand why you're happy with the current state
>> of affairs: you never had to deal with multi-code-point characters and
>> can't imagine yourself having to deal with them on a semi-frequent
>> basis.
>
> Do you, and can you?
>
>> Other people won't be so happy with this state of affairs, but
>> they'll probably notice only after most of their code has been written
>> unaware of the problem.
>
> They can't be unaware and write said code.
>

Fun fact: Germany recently introduced a new ID card and some of the 
software that was developed for this and is used in some record sections 
fucks up when a name contains diacritics.

I think especially when you're handling names (and much software does, I 
think) it's crucial to have proper support for all kinds of chars.
Of course many programmers are not aware that, just because Umlaute and 
ß work, it doesn't mean that all other kinds of strange characters work 
as well.


Cheers,
- Daniel
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/16/11 6:42 PM, Daniel Gibson wrote:
> On 17.01.2011 00:58, Andrei Alexandrescu wrote:
>> On 1/16/11 3:20 PM, Michel Fortin wrote:
>>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail@erdani.org> said:
>>>> But most strings don't contain combining characters or unnormalized
>>>> strings.
>>>
>>> I think we should expect combining marks to be used more and more as our
>>> OS text system and fonts start supporting them better. Them being rare
>>> might be true today, but what do you know about tomorrow?
>>
>> I don't think languages will acquire more diacritics soon. I do hope, of
>> course, that D applications gain more usage in the Arabic, Hebrew etc.
>> world.
>>
>
> So why does D use unicode anyway?
> If you don't care about not-often used languages anyway, you could have
> used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide
> which encoding he wants/needs).
>
> You could as well say "we don't need to use dchar to represent a proper
> code point, wchar is enough for most use cases and has less overhead
> anyway".

I consider UTF8 superior to all of the above.

>>>> I think it's reasonable to understand why I'm happy with the current
>>>> state of affairs. It is better than anything we've had before and
>>>> better than everything else I've tried.
>>>
>>> It is indeed easy to understand why you're happy with the current state
>>> of affairs: you never had to deal with multi-code-point characters and
>>> can't imagine yourself having to deal with them on a semi-frequent
>>> basis.
>>
>> Do you, and can you?
>>
>>> Other people won't be so happy with this state of affairs, but
>>> they'll probably notice only after most of their code has been written
>>> unaware of the problem.
>>
>> They can't be unaware and write said code.
>>
>
> Fun fact: Germany recently introduced a new ID card and some of the
> software that was developed for this and is used in some record sections
> fucks up when a name contains diacritics.
>
> I think especially when you're handling names (and much software does, I
> think) it's crucial to have proper support for all kinds of chars.
> Of course many programmers are not aware that, just because Umlaute and
> ß work, it doesn't mean that all other kinds of strange characters work
> as well.
>
>
> Cheers,
> - Daniel

I think German text works well with dchar.


Andrei
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Sunday 16 January 2011 18:45:26 Andrei Alexandrescu wrote:
> On 1/16/11 6:42 PM, Daniel Gibson wrote:
> > On 17.01.2011 00:58, Andrei Alexandrescu wrote:
> >> On 1/16/11 3:20 PM, Michel Fortin wrote:
> >>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
> >>> 
> >>> <SeeWebsiteForEmail@erdani.org> said:
> >>>> But most strings don't contain combining characters or unnormalized
> >>>> strings.
> >>> 
> >>> I think we should expect combining marks to be used more and more as
> >>> our OS text system and fonts start supporting them better. Them being
> >>> rare might be true today, but what do you know about tomorrow?
> >> 
> >> I don't think languages will acquire more diacritics soon. I do hope, of
> >> course, that D applications gain more usage in the Arabic, Hebrew etc.
> >> world.
> > 
> > So why does D use unicode anyway?
> > If you don't care about not-often used languages anyway, you could have
> > used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide
> > which encoding he wants/needs).
> > 
> > You could as well say "we don't need to use dchar to represent a proper
> > code point, wchar is enough for most use cases and has less overhead
> > anyway".
> 
> I consider UTF8 superior to all of the above.
> 
> >>>> I think it's reasonable to understand why I'm happy with the current
> >>>> state of affairs. It is better than anything we've had before and
> >>>> better than everything else I've tried.
> >>> 
> >>> It is indeed easy to understand why you're happy with the current state
> >>> of affairs: you never had to deal with multi-code-point characters and
> >>> can't imagine yourself having to deal with them on a semi-frequent
> >>> basis.
> >> 
> >> Do you, and can you?
> >> 
> >>> Other people won't be so happy with this state of affairs, but
> >>> they'll probably notice only after most of their code has been written
> >>> unaware of the problem.
> >> 
> >> They can't be unaware and write said code.
> > 
> > Fun fact: Germany recently introduced a new ID card and some of the
> > software that was developed for this and is used in some record sections
> > fucks up when a name contains diacritics.
> > 
> > I think especially when you're handling names (and much software does, I
> > think) it's crucial to have proper support for all kinds of chars.
> > Of course many programmers are not aware that, just because Umlaute and
> > ß work, it doesn't mean that all other kinds of strange characters work
> > as well.
> > 
> > 
> > Cheers,
> > - Daniel
> 
> I think German text works well with dchar.

I think that whether dchar will be enough will depend primarily on where the 
unicode is coming from and what the programmer is doing with it. There's plenty 
which will just work regardless of whether code points are pre-combined or not, 
and there's other stuff which will have subtle bugs if they're not pre-combined.

For the most part, Western languages should have pre-combined characters, but 
whether a program sees them in combined form or not will depend on where the 
text comes from. If it comes from a file, then it all depends on the program 
which wrote the file. If it comes from the console, then it depends on what that 
console does. If it comes from a socket or pipe or whatnot, then it depends on 
whatever program is sending the data.

So, the question becomes: what's the norm? Are unicode characters normally pre-
combined or left as separate code points? The majority of English text will be 
fine regardless, since English only uses accented characters and the like when 
including foreign words, but most any other European language will have accented 
characters and then it's an open question. If it's more likely that a D program 
will receive pre-combined characters than not, then many programs will likely be 
safe treating a code point as a character. But if the odds are high that a D 
program will receive characters which are not yet combined, then certain sets of 
text will invariably result in bugs in your average D program.

I don't think that there's much question that from a performance standpoint and 
from the standpoint of trying to avoid breaking TDPL and a lot of pre-existing 
code, we should continue to treat a code point - a dchar - as an abstract 
character. Moving to graphemes could really harm performance - and there _are_ 
plenty of programs that couldn't care less about unicode. However, it's quite 
clear that in a number of circumstances, that's going to result in buggy code. 
The question then is whether it's okay to take a performance hit just to 
correctly handle unicode. And I expect that a _lot_ of people are going to say 
no to that.

D already does better at handling unicode than many other languages, so it's 
definitely a step up as it is. The cost for handling unicode completely correctly 
is quite high from the sounds of it - all of a sudden you're effectively (if not 
literally) dealing with arrays of arrays instead of arrays. So, I think that 
it's a viable option to say that the default path that D will take is the 
_mostly_ correct but still reasonably efficient path, and then - through 3rd party 
libraries or possibly even with a module in Phobos - we'll provide a means to 
handle unicode 100% correctly for those who really care.

At minimum, we need the tools to handle unicode correctly, but if we can't 
handle it both correctly and efficiently, then I'm afraid that it's just not going 
to be reasonable to handle it correctly - especially if we can handle it 
_almost_ correctly and still be efficient.

Regardless, the real question is how likely a D program is to deal with unicode 
which is not pre-combined. If the odds are relatively low in the general case, 
then sticking to dchar should be fine. But if the odds are relatively high, then 
not going to graphemes could mean that there will be a _lot_ of buggy D programs 
out there.

- Jonathan M Davis