January 15, 2011
On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer" <schveiguy@yahoo.com> said:

> On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn  <lutger.blijdestijn@gmail.com> wrote:
> 
>> Steven Schveighoffer wrote:
>> 
>> ...
>>>> I think a good standard to evaluate our handling of Unicode is to see
>>>> how easy it is to do things the right way. In the above, foreach would
>>>> slice the string grapheme by grapheme, and the == operator would  perform
>>>> a normalized comparison. While it works correctly, it's probably not  the
>>>> most efficient way to do things, however.
>>> 
>>> I think this is a good alternative, but I'd rather not impose this on
>>> people like myself who deal mostly with English.  I think this should be
>>> possible to do with wrapper types or intermediate ranges which have
>>> graphemes as elements (per my suggestion above).
>>> 
>>> Does this sound reasonable?
>>> 
>>> -Steve
>> 
>> If it's a matter of choosing which is the 'default' range, I'd think proper
>> unicode handling is more reasonable than catering for English / ASCII only.
>> Especially since this is already the case in phobos string algorithms.
> 
> English and (if I understand correctly) most other languages.  Any  language which can be built from composable graphemes would work.  And in  fact, ones that use some graphemes that cannot be composed will also work  to some degree (for example, opEquals).
> 
> What I'm proposing (or think I'm proposing) is not exactly catering to  English and ASCII, what I'm proposing is simply not catering to more  complex languages such as Hebrew and Arabic.  What I'm trying to find is a  middle ground where most languages work, and the code is simple and  efficient, with possibilities to jump down to lower levels for performance  (i.e. switch to char[] when you know ASCII is all you are using) or jump  up to full unicode when necessary.

Why don't we build a compiler with an optimizer that generates correct code *almost* all of the time? If you are worried about it not producing correct code for a given function, you can just add "pragma(correct_code)" in front of that function to disable the risky optimizations. No harm done, right?

One thing I see very often, often on US web sites but also elsewhere, is that if you enter a name with an accented letter in a form (say Émilie), very often the accented letter gets changed to another semi-random character later in the process. Why? Because somewhere in the process lies an encoding mismatch that no one thought about and no one tested for. At the very least, the form should have rejected those unexpected characters and shown an error where it could.

Now, with proper Unicode handling up to the code point level, this kind of problem probably won't happen as often, because the whole stack works with UTF encodings. But are you going to validate all of your inputs to make sure they contain no combining code points?

Don't assume that because you're in the United States no one will try to enter characters where you don't expect them. People love to play with Unicode symbols for fun, putting them in their name, signature, or even domain names (✪df.ws). Just wait until they discover they can combine them. ☺̰̎! There is also a variety of combining mathematical symbols with no pre-combined form, such as ≸. Writing in Arabic, Hebrew, Korean, or some other foreign language isn't a prerequisite to use combining characters.
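To make the pitfall concrete: the two strings below both display as "exposé", one with the pre-combined é (U+00E9), the other with a combining acute accent (U+0301), yet today's bitwise == calls them different. A minimal sketch:

void main()
{
    string precomposed = "expos\u00E9";   // "exposé" using pre-combined U+00E9
    string decomposed  = "expose\u0301";  // "exposé" as 'e' + combining U+0301
    assert(precomposed != decomposed);    // == compares code units, not graphemes
    assert(precomposed.length == 7);      // 7 UTF-8 code units...
    assert(decomposed.length == 8);       // ...versus 8 for the same text
}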


> Essentially, we would have three levels of types:
> 
> char[], wchar[], dchar[] -- Considered to be arrays in every way.
> string_t!T (string, wstring, dstring) -- Specialized string types that do  normalization to dchars, but do not handle perfectly all graphemes.  Works  with any algorithm that deals with bidirectional ranges.  This is the  default string type, and the type for string literals.  Represented  internally by a single char[], wchar[] or dchar[] array.
> * utfstring_t!T -- specialized string to deal with full unicode, which may  perform worse than string_t, but supports everything unicode supports.   May require a battery of specialized algorithms.
> 
> * - name up for discussion
> 
> Also note that phobos currently does *no* normalization as far as I can  tell for things like opEquals.  Two char[]'s that represent equivalent  strings, but not in the same way, will compare as !=.

Basically, you're suggesting that the default way should be to handle Unicode *almost* right. And then, if you want to handle things *really* right, you need to be explicit about it by using "utfstring_t"? I understand your motivation, but it sounds backward to me.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 15, 2011
On Sat, 15 Jan 2011 15:31:23 -0500, Michel Fortin <michel.fortin@michelf.com> wrote:

> [...]
> 
> Basically, you're suggesting that the default way should be to handle Unicode *almost* right. And then, if you want to handle things *really* right, you need to be explicit about it by using "utfstring_t"? I understand your motivation, but it sounds backward to me.

You make very good points.  I concede that using dchar as the element type is not correct for Unicode strings.

-Steve
January 15, 2011
Steven Schveighoffer Wrote:

> On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo@bar.com> wrote:
> 
> > Steven Schveighoffer Wrote:
> 
> >>
> >> English and (if I understand correctly) most other languages.  Any language which can be built from composable graphemes would work.  And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals).
> >>
> >> What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic.  What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary.
> >>
> >> Essentially, we would have three levels of types:
> >>
> >> char[], wchar[], dchar[] -- Considered to be arrays in every way.
> >> string_t!T (string, wstring, dstring) -- Specialized string types that do normalization to dchars, but do not handle perfectly all graphemes.  Works with any algorithm that deals with bidirectional ranges.  This is the default string type, and the type for string literals.  Represented internally by a single char[], wchar[] or dchar[] array.
> >> * utfstring_t!T -- specialized string to deal with full unicode, which may perform worse than string_t, but supports everything unicode supports.  May require a battery of specialized algorithms.
> >>
> >> * - name up for discussion
> >>
> >> Also note that phobos currently does *no* normalization as far as I can tell for things like opEquals.  Two char[]'s that represent equivalent strings, but not in the same way, will compare as !=.
> >>
> >> -Steve
> >
> > The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I prefer the standard lib to not provide normalization at all and force me to use a 3rd party lib, rather than provide an incomplete implementation that will give me a false sense of correctness and cause very subtle, hard-to-find bugs.
> 
> I feel like you might be exaggerating, but maybe I'm completely wrong on this; I'm not well-versed in Unicode, or even in languages that require it.  The clear benefit I see is that with a string type which normalizes to canonical code points, you can use it in any algorithm without making that algorithm unicode-aware, for *most languages*.  At least, that is how I see it.  I'm looking at it as a code-reuse proposition.
> 
> It's like calendars.  There are quite a few different calendars in different cultures.  But most people use a Gregorian calendar.  So we have three options:
> 
> a) Use a Gregorian calendar, and leave the other calendars to a 3rd party library.
> b) Use a complicated calendar system where Gregorian calendars are treated with equal respect to all other calendars, none being the default.
> c) Use a Gregorian calendar by default, but include the other calendars as a separate module for those who wish to use them.
> 
> I'm looking at my proposal as more of a c) solution.
> 

The calendar example is a very good one. What you're saying is equivalent to saying that most people use the Gregorian calendar, but for efficiency and other reasons you don't want to implement Feb 29th.

> Can you show how normalization causes subtle bugs?
> 

That was already shown by Michel and Spir: the equality operator is incorrect in the presence of diacritics (the example with exposé). Your solution makes this far worse, since it reduces the bug to far fewer cases, making the problem far less obvious. One would test with exposé, which will work, and then test with (say) Hebrew, which will *not* work, and unless the programmer is a Unicode expert (which is very unlikely) he is left scratching his head.

> > Moreover, even if you ignore Hebrew as a tiny, insignificant minority, you cannot do the same for Arabic, which has over one *billion* people that use that language.
> 
> I hope that the medium type works 'good enough' for those languages, with the high level type needed for advanced usages.  At a minimum, comparison and substring should work for all languages.
> 

As I explained above, 'good enough' in this case is far worse, because it masks the problem.  Also, if you want comparison to work in all languages, including Hebrew/Arabic, then it simply isn't good enough.

> > I firmly believe that in accordance with D's principle that the default behavior should be the correct & safe option, D should have the full unicode type (utfstring_t above) as the default.
> >
> > You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Use an ASCII type instead. In the same vein, a geneticist should use a DNA sequence type and not Unicode text.
> 
> Or French, or Spanish, or German, etc...
> 
> Look, even the lowest level is valid unicode, but if you want to start extracting individual graphemes, you need more machinery.  In 99% of cases, I'd think you want to use strings as strings, not as sequences of graphemes, or code-units.
> 
> -Steve

I'd like to have full Unicode support. I think it is a good thing for D to have in order to expand in the world. As an alternative, I'd settle for loud errors that make it absolutely clear to the non-Unicode-expert programmer that D simply does NOT support e.g. normalization.

As Spir already said, Unicode is something few understand, and even its own official docs do not explain such issues properly. We should not confuse users even further with incomplete support.
January 15, 2011
On Sat, 15 Jan 2011 15:46:11 -0500, foobar <foo@bar.com> wrote:


> I'd like to have full Unicode support. I think it is a good thing for D to have in order to expand in the world. As an alternative, I'd settle for loud errors that make it absolutely clear to the non-Unicode-expert programmer that D simply does NOT support e.g. normalization.
>
> As Spir already said, Unicode is something few understand, and even its own official docs do not explain such issues properly. We should not confuse users even further with incomplete support.

Well said, I've changed my mind.  Thanks for explaining.

-Steve
January 15, 2011
On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" <schveiguy@yahoo.com> said:

>> I'm not suggesting we impose it, just that we make it the default. If  you want to iterate by dchar, wchar, or char, just write:
>> 
>> 	foreach (dchar c; "exposé") {}
>> 	foreach (wchar c; "exposé") {}
>> 	foreach (char c; "exposé") {}
>> 	// or
>> 	foreach (dchar c; "exposé".by!dchar()) {}
>> 	foreach (wchar c; "exposé".by!wchar()) {}
>> 	foreach (char c; "exposé".by!char()) {}
>> 
>> and it'll work. But the default would be a slice containing the  grapheme, because this is the right way to represent a Unicode character.
> 
> I think this is a good idea.  I previously was nervous about it, but I'm  not sure it makes a huge difference.  Returning a char[] is certainly less  work than normalizing a grapheme into one or more code points, and then  returning them.  All that it takes is to detect all the code points within  the grapheme.  Normalization can be done if needed, but would probably  have to output another char[], since a normalized grapheme can occupy more  than one dchar.

I'm glad we agree on that now.
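As an aside, here is a minimal sketch of what the by!T adapter from my example above could look like for T == dchar (the body is just one possible implementation, not something that exists today; by!char would simply return the array itself):

import std.utf : decode, stride;

auto by(T : dchar)(string s)
{
    static struct ByDchar
    {
        string rest;
        @property bool empty() { return rest.length == 0; }
        @property dchar front()
        {
            size_t i = 0;
            return decode(rest, i);  // decode one code point, don't consume it
        }
        void popFront() { rest = rest[stride(rest, 0) .. $]; }
    }
    return ByDchar(s);
}

// usage: foreach (dchar c; "exposé".by!dchar()) { ... }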


> What if I modified my proposed string_t type to return T[] as its element  type, as you say, and string literals are typed as string_t!(whatever)?   In addition, the restrictions I imposed on slicing a code point actually  get imposed on slicing a grapheme.  That is, it is illegal to substring a  string_t in a way that slices through a grapheme (and by deduction, a code  point)?

I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments:

I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?). If strings and arrays of code units are distinct, slicing in the middle of a grapheme or in the middle of a code point could throw an error, but for performance reasons it should probably check for that only when array bounds checking is turned on (that would require compiler support however).
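For the code-point half of that check, the test is cheap: in UTF-8 a valid slice boundary never falls on a continuation byte, which always has the bit pattern 0b10xxxxxx. A sketch (the helper name is made up, and grapheme boundaries would need more context than one byte):

bool onCodePointBoundary(const(char)[] s, size_t i)
{
    // the end index is always a boundary; any other index must not point
    // at a UTF-8 continuation byte (0b10xxxxxx)
    return i == s.length || (s[i] & 0xC0) != 0x80;
}

A checked opSlice could then assert that both ends pass this test whenever bounds checking is enabled.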


> Actually, we would need a grapheme to be its own type, because comparing two char[]'s that don't contain equivalent bits and having them be equal violates the expectation that char[] is an array.
> 
> So the string_t!char would return a grapheme_t!char (names to be  discussed) as its element type.

Or you could make a grapheme a string_t. ;-)


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 15, 2011
On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin <michel.fortin@michelf.com> wrote:

> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" <schveiguy@yahoo.com> said:
>
>>> I'm not suggesting we impose it, just that we make it the default. If  you want to iterate by dchar, wchar, or char, just write:
>>>  	foreach (dchar c; "exposé") {}
>>> 	foreach (wchar c; "exposé") {}
>>> 	foreach (char c; "exposé") {}
>>> 	// or
>>> 	foreach (dchar c; "exposé".by!dchar()) {}
>>> 	foreach (wchar c; "exposé".by!wchar()) {}
>>> 	foreach (char c; "exposé".by!char()) {}
>>>  and it'll work. But the default would be a slice containing the  grapheme, because this is the right way to represent a Unicode character.
>>  I think this is a good idea.  I previously was nervous about it, but I'm  not sure it makes a huge difference.  Returning a char[] is certainly less  work than normalizing a grapheme into one or more code points, and then  returning them.  All that it takes is to detect all the code points within  the grapheme.  Normalization can be done if needed, but would probably  have to output another char[], since a normalized grapheme can occupy more  than one dchar.
>
> I'm glad we agree on that now.

It's a matter of me slowly wrapping my brain around Unicode and how it's used.  It seems like a typical committee-defined standard where there are 10 ways to do everything; I was trying to weed out the lesser-used (or so I perceived) pieces to allow a more implementable library.  It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).

I once told a colleague who was on a standards committee that their proposed KLV standard (key length value) was ridiculous.  The wise committee had decided that in order to avoid future issues, the length would be encoded as a single byte if < 128, or 128 + length of the length field for anything higher.  This means you could potentially have to parse and process a 127-byte integer!
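For the record, that scheme is essentially the definite-length encoding from BER. A sketch of the encoder (the function name is mine):

ubyte[] encodeLength(ulong n)
{
    if (n < 128)
        return [cast(ubyte) n];  // short form: the length itself in one byte
    ubyte[] digits;
    for (; n != 0; n >>= 8)
        digits = cast(ubyte)(n & 0xFF) ~ digits;  // big-endian length bytes
    // long form: first byte is 0x80 + the number of length bytes that follow
    return cast(ubyte)(0x80 + digits.length) ~ digits;
}

A first byte of 0xFF thus announces 127 length bytes to parse -- the 127-byte integer above.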

>
>
>> What if I modified my proposed string_t type to return T[] as its element  type, as you say, and string literals are typed as string_t!(whatever)?   In addition, the restrictions I imposed on slicing a code point actually  get imposed on slicing a grapheme.  That is, it is illegal to substring a  string_t in a way that slices through a grapheme (and by deduction, a code  point)?
>
> I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments:
>
> I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).

I would not be opposed to getting rid of those types.  But I am very opposed to char[] not being an array.  If you want a string to be something other than an array, make it have a different syntax.  We also have to consider C compatibility.

However, we are in radical-change mode then, and this is probably pushed to D3 ;)  If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.

> If strings and arrays of code units are distinct, slicing in the middle of a grapheme or in the middle of a code point could throw an error, but for performance reasons it should probably check for that only when array bounds checking is turned on (that would require compiler support however).

Not really: it could use assert, but that throws an AssertError instead of a RangeError.  Of course, both are errors and will abort the program.  I do wish there was a version(noboundscheck) to do this kind of stuff with...

>> Actually, we would need a grapheme to be its own type, because comparing two char[]'s that don't contain equivalent bits and having them be equal violates the expectation that char[] is an array.
>>  So the string_t!char would return a grapheme_t!char (names to be  discussed) as its element type.
>
> Or you could make a grapheme a string_t. ;-)

I'm a little uneasy having a range return itself as its element type.  For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t.

It does seem daunting to have so many types, but at the same time, types convey relationships at compile time that can make coding impossible to get wrong, or make things actually possible when having a single type doesn't.

I'll give you an example from a previous life:

Tango had a type called DateTime.  This type represented *either* a point in time, or a span of time (depending on how you used it).  But I proposed we switch to two distinct types, one for a point in time, one for a span of time.  It was argued that both were so similar, why couldn't we just keep one type?  The answer is simple -- having them be separate types allows me to express relationships that the compiler enforces.  For example, you can add two time spans together, but you can't add two points in time together.  Or maybe you want a function to accept a time span (like a sleep operation).  If there was only one type, then sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;)
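In code, the distinction looks something like this (illustrative names, not Tango's actual API):

struct TimeSpan
{
    long hnsecs;
    TimeSpan opBinary(string op : "+")(TimeSpan rhs)
    { return TimeSpan(hnsecs + rhs.hnsecs); }  // span + span: fine
}

struct TimePoint
{
    long hnsecs;
    TimePoint opBinary(string op : "+")(TimeSpan rhs)
    { return TimePoint(hnsecs + rhs.hnsecs); } // point + span: fine
    // no overload for point + point, so that mistake cannot compile
}

void sleep(TimeSpan duration) { /* ... */ }
// sleep(TimePoint(...)); // compile error instead of a 2011-year nap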

I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality.  Catching bugs during compilation is soooo much better than experiencing them during runtime.

-Steve
January 15, 2011
Steven Schveighoffer Wrote:

> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin <michel.fortin@michelf.com> wrote:
> 
> > [...]
> > Or you could make a grapheme a string_t. ;-)
> 
> I'm a little uneasy having a range return itself as its element type.  For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t.
> 
> [...]
> 
> I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality.  Catching bugs during compilation is soooo much better than experiencing them during runtime.
> 
> -Steve

I like Michel's proposed semantics and I also agree with you that it should be a distinct string type and not break consistency of regular arrays.

Regarding your last point: do you mean that a grapheme would be a sub-type of string (a specialization where the string represents a single element)? If so, then it sounds good to me.

January 15, 2011
On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" <schveiguy@yahoo.com> said:

> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  <michel.fortin@michelf.com> wrote:
> 
>> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  <schveiguy@yahoo.com> said:
>> 
>>>> I'm not suggesting we impose it, just that we make it the default. If   you want to iterate by dchar, wchar, or char, just write:
>>>>  	foreach (dchar c; "exposé") {}
>>>> 	foreach (wchar c; "exposé") {}
>>>> 	foreach (char c; "exposé") {}
>>>> 	// or
>>>> 	foreach (dchar c; "exposé".by!dchar()) {}
>>>> 	foreach (wchar c; "exposé".by!wchar()) {}
>>>> 	foreach (char c; "exposé".by!char()) {}
>>>>  and it'll work. But the default would be a slice containing the   grapheme, because this is the right way to represent a Unicode  character.
>>>  I think this is a good idea.  I previously was nervous about it, but  I'm  not sure it makes a huge difference.  Returning a char[] is  certainly less  work than normalizing a grapheme into one or more code  points, and then  returning them.  All that it takes is to detect all  the code points within  the grapheme.  Normalization can be done if  needed, but would probably  have to output another char[], since a  normalized grapheme can occupy more  than one dchar.
>> 
>> I'm glad we agree on that now.
> 
> It's a matter of me slowly wrapping my brain around unicode and how it's  used.  It seems like it's a typical committee defined standard where there  are 10 ways to do everything, I was trying to weed out the lesser used (or  so I perceived) pieces to allow a more implementable library.  It's doubly  hard for me since I have limited experience with other languages, and I've  never tried to write them with a computer (my language classes in high  school were back in the days of actually writing stuff down on paper).

Actually, I don't think Unicode was so badly designed. It's just that nobody had any idea of the real scope of the problem they had on their hands at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now; can you name one other standard that has evolved enough to get 6 major versions? I'm surprised it's not worse, given all that it must support.

That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.


>> I'm not opposed to that on principle. I'm a little uneasy about having  so many types representing a string however. Some other raw comments:
>> 
>> I agree that things would be more coherent if char[], wchar[], and  dchar[] behaved like other arrays, but I can't really see a  justification for those types to be in the language if there's nothing  special about them (why not a library type?).
> 
> I would not be opposed to getting rid of those types.  But I am very  opposed to char[] not being an array.  If you want a string to be  something other than an array, make it have a different syntax.  We also  have to consider C compatibility.
> 
> However, we are in radical-change mode then, and this is probably pushed  to D3 ;)  If we can find some way to fix the situation without  invalidating TDPL, we should strive for that first IMO.

Indeed, the change would probably be too radical for D2.

I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works.

Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the grounds that it would silently break D1 compatibility. This is a valid point in my opinion.

I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point.

That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do.
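For instance, under the proposal:

foreach (g; "exposé")   // g is a string slice, not a dchar
    if (g == 'é')       // error: comparing a string slice to a character literal
        break;

The fix -- writing "é" instead of 'é' -- is a deliberate, visible change, not a silent one.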

One more thing:

NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of a grapheme; it just uses them internally, and you're pretty much limited to the built-in operations. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.


>> Or you could make a grapheme a string_t. ;-)
> 
> I'm a little uneasy having a range return itself as its element type.  For  all intents and purposes, a grapheme is a string of one 'element', so it  could potentially be a string_t.
> 
> It does seem daunting to have so many types, but at the same time, types  convey relationships at compile time that can make coding impossible to  get wrong, or make things actually possible when having a single type  doesn't.
> 
> I'll give you an example from a previous life:
> 
> [...]
> I feel that making extra types when the relationship between them is  important is worth the possible repetition of functionality.  Catching  bugs during compilation is soooo much better than experiencing them during  runtime.

I can understand the utility of a separate type in your DateTime example, but in this case I fail to see any advantage.

I mean, a grapheme is a slice of a string, can have multiple code points (like a string), can be appended the same way as a string, can be composed or decomposed using canonical normalization or compatibility normalization (like a string), and should be sorted, uppercased, and lowercased according to Unicode rules (like a string). Basically, a grapheme is just a string that happens to contain only one grapheme. What would a custom type do differently than a string?

Also, grapheme == "a" is easy to understand because both are strings. But if a grapheme is a separate type, what would a grapheme literal look like?

So in the end I don't think a grapheme needs a specific type, at least not for general purpose text processing. If I split a string on whitespace, do I get a range where elements are of type "word"? No, just sliced strings.

That said, I'm much less concerned by the type used to represent a grapheme than by the Unicode correctness. I'm not opposed to a separate type, I just don't really see the point.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 16, 2011
On 1/15/11 4:45 PM, Michel Fortin wrote:
> [...]
> 
> One more thing:
> 
> NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of a grapheme; it just uses them internally, and you're pretty much limited to the built-in operations. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.

I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code-point conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base.

It may be more realistic to consider using what we have as back-end for grapheme-oriented processing. For example:

struct Grapheme(Char) if (isSomeChar!Char)
{
    private const Char[] rep;
    ...
}

auto byGrapheme(S)(S s) if (isSomeString!S)
{
   ...
}

string s = "Hello";
foreach (g; byGrapheme(s))
{
    ...
}
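One naive way to fill in byGrapheme's body (restricted to string for brevity, and under a big simplifying assumption: only code points in the Combining Diacritical Marks block, U+0300-U+036F, extend a grapheme; real UAX #29 segmentation covers far more, such as other mark categories and Hangul jamo, so this is a sketch rather than a proposal):

import std.utf : decode, stride;

auto byGrapheme(string s)
{
    static struct Result
    {
        string rest;
        @property bool empty() { return rest.length == 0; }
        @property string front()
        {
            size_t i = stride(rest, 0);     // the base code point
            while (i < rest.length)
            {
                size_t j = i;
                immutable dchar c = decode(rest, j);
                if (c < 0x0300 || c > 0x036F)
                    break;                  // not a combining mark: stop
                i = j;                      // extend over the combining mark
            }
            return rest[0 .. i];
        }
        void popFront() { rest = rest[front.length .. $]; }
    }
    return Result(s);
}

// foreach (g; byGrapheme("expose\u0301")) yields "e", "x", "p", "o", "s", "é"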

Andrei
January 16, 2011
On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
> 
> <lutger.blijdestijn@gmail.com> said:
> > Nick Sabalausky wrote:
> >> "Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message news:ignon1$2p4k$1@digitalmars.com...
> >> 
> >>> This may sometimes not be what the user expected; most of the time they'd care about the code points.
> >> 
>> I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.
> > 
> > I agree. This is a very informative thread, thanks spir and everybody else.
> > 
> > Going back to the topic, it seems to me that a Unicode string is a surprisingly complicated data structure that can be viewed from multiple types of ranges. In the light of this thread, a dchar doesn't seem like such a useful type anymore; it is still a low-level abstraction for the purpose of correctly dealing with text. Perhaps even less useful, since it gives the illusion of correctness for those who are not in the know.
> > 
> > The algorithms in std.string can be upgraded to work correctly with all the issues mentioned, but the generic ones in std.algorithm will just subtly do the wrong thing when presented with dchar ranges. And, as I understood it, the purpose of a VleRange was exactly to make generic algorithms just work (tm) for strings.
> > 
> > Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if VleRange of string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often used string algorithms could be specialized for performance, but if not, generic algorithms would still work.
> 
> I have my idea.
> 
> I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.
> 
> The second component would be to make the string equality operator (==)
> for strings compare them in their normalized form, so that ("e" with
> combining acute accent) == (pre-combined "é"). I think this would make
> D support for Unicode much more intuitive.
> 
> This implies some semantic changes, mainly that everywhere you write a "character" you must use double-quotes (string "a") instead of single quote (code point 'a'), but from the user's point of view that's pretty much all there is to change.
> 
> There'll still be plenty of room for specialized algorithms, but their purpose would be limited to optimization. Correctness would be taken care of by the basic range interface, and foreach should follow suit and iterate by grapheme by default.
> 
> I wrote this example (or something similar) earlier in this thread:
> 
> 	foreach (grapheme; "exposé")
> 		if (grapheme == "é")
> 			break;
> 
> In this example, even if one of these two strings use the pre-combined form of "é" and the other uses a combining acute accent, the equality would still hold since foreach iterates on full graphemes and == compares using normalization.
> 
> The important thing to keep in mind here is that the grapheme-splitting algorithm should be optimized for the case where there is no combining character and the compare algorithm for the case where the string is already normalized, since most strings will exhibit these characteristics.
> 
> As for ASCII, we could make it easier to use ubyte[] for it by making string literals implicitly convert to ubyte[] if all their characters are in ASCII range.

I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types.

Now, given that dchar can't actually work completely as an element type, you'd either need the string type to be a new type or the element type to be a new type. So, either the string type has char[], wchar[], or dchar[] for its element type, or char[], wchar[], and dchar[] have something like uchar as their element type, where uchar is a struct which contains a char[], wchar[], or dchar[] which holds a single grapheme.

I think that it's a great idea that programmers try to use substrings and slices rather than dchar, but making the element type a slice of the original type sounds like it's really asking for trouble.
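To illustrate the trouble (with a hypothetical helper): generic code frequently recurses on element types, and if ElementType!(string) were string itself, every element would be a non-empty range of the same type:

import std.range : ElementType, isInputRange;

void deepIterate(R)(R r)
{
    static if (isInputRange!(ElementType!R))
    {
        foreach (e; r)
            deepIterate(e);  // with ElementType!R == R, this never bottoms out
    }
    else
    {
        foreach (e; r) { /* process a leaf element */ }
    }
}

Today this terminates because ElementType!string is dchar; with string-typed elements, iterating "a" would just yield "a" again, forever.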

- Jonathan M Davis