January 15, 2011
On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin <michel.fortin@michelf.com> wrote:

> On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer" <schveiguy@yahoo.com> said:
>
>> On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir@gmail.com> wrote:
>>
>>> The point is not playing like that with Unicode flexibility. Rather that composite characters are just normal thingies in most languages of the world. Actually, on this point, English is a rare exception (discarding letters imported from foreign languages like French 'à'); to the point of being, I guess, the only western language without any diacritic.
>>  Is it common to have multiple modifiers on a single character?
>
> Not to my knowledge. But I rarely deal with non-Latin texts; there are probably some scripts out there that take advantage of this.
>
>
>> The  problem I see with using decomposed canonical form for strings is that we  would have to return a dchar[] for each 'element', which severely  complicates code that, for instance, only expects to handle English.
>
> Actually, returning a sliced char[] or wchar[] could also be valid. User-perceived characters are basically a substring of one or more code points. I'm not sure it complicates that much the semantics of the language -- what's complicated about writing str.front == "a" instead of str.front == 'a'? -- although it probably would complicate the generated code and make it a little slower.

Hm... this pushes the normalization outside the type, and into the algorithms (such as find).  I was hoping to avoid that.  I think I can come up with an algorithm that normalizes into canonical form as it iterates.  It just might return part of a grapheme if the grapheme cannot be composed.

I do think that we could make a byGrapheme member to aid in this:

foreach(grapheme; s.byGrapheme) // grapheme is a substring that contains one composed grapheme.
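
As a rough illustration of what such a range could look like -- a sketch only: the isCombiningMark predicate below covers just the basic combining-diacritics block, and real segmentation would need the full Unicode grapheme-break rules:

import std.utf : decode, stride;

// Stand-in predicate: only U+0300..U+036F.  A real implementation would
// consult the Unicode grapheme-break properties.
bool isCombiningMark(dchar c)
{
    return c >= 0x0300 && c <= 0x036F;
}

// Minimal byGrapheme-style range: each front is a slice holding one base
// code point plus any directly following combining marks.
struct ByGrapheme
{
    string s;

    @property bool empty() const { return s.length == 0; }

    @property string front() const
    {
        size_t i = stride(s, 0);   // length of the base code point
        while (i < s.length)
        {
            size_t j = i;
            if (!isCombiningMark(decode(s, j)))
                break;
            i = j;                 // absorb the combining mark
        }
        return s[0 .. i];
    }

    void popFront() { s = s[front.length .. $]; }
}

ByGrapheme byGrapheme(string s) { return ByGrapheme(s); }

Each front is a slice of the original string, so nothing needs to be allocated or normalized until you actually ask for it.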

>
> In the case of NSString in Cocoa, you can only access the 'characters' in their UTF-16 form. But everything from comparison to searching for a substring is done using graphemes. It's as if they implemented specialized Unicode-aware algorithms for these functions. There's nothing generic about how it handles graphemes.
>
> I'm not sure yet about what would be the right approach for D.

I hope we can use generic versions, so the type itself handles the conversions.  That makes any algorithm using the string range correct.

>> I was hoping to lazily transform a string into its composed canonical  form, allowing the (hopefully rare) exception when a composed character  does not exist.  My thinking was that this at least gives a useful string  representation for 90% of usages, leaving the remaining 10% of usages to  find a more complex representation (like your Text type).  If we only get  like 20% or 30% there by making dchar the element type, then we haven't  made it useful enough.
>>  Either way, we need a string type that can be compared canonically for  things like searches or opEquals.
>
> I wonder if normalized string comparison shouldn't be built directly in the char[] wchar[] and dchar[] types instead.

No, in my vision of how strings should be typed, char[] is an array, not a string.  It should be treated like an array of code-units, where two forms that create the same grapheme are considered different.

> Also bring the idea above that iterating on a string would yield graphemes as char[] and this code would work perfectly irrespective of whether you used combining characters:
>
> 	foreach (grapheme; "exposé") {
> 		if (grapheme == "é")
> 			break;
> 	}
>
> I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things, however.

I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English.  I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above).

Does this sound reasonable?

-Steve
January 15, 2011
Steven Schveighoffer wrote:

...
>> I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things, however.
> 
> I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English.  I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above).
> 
> Does this sound reasonable?
> 
> -Steve

If it's a matter of choosing which is the 'default' range, I'd think proper Unicode handling is more reasonable than catering to English / ASCII only. Especially since this is already the case in Phobos string algorithms.
January 15, 2011
On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn <lutger.blijdestijn@gmail.com> wrote:

> Steven Schveighoffer wrote:
>
> ...
>>> I think a good standard to evaluate our handling of Unicode is to see
>>> how easy it is to do things the right way. In the above, foreach would
>>> slice the string grapheme by grapheme, and the == operator would perform
>>> a normalized comparison. While it works correctly, it's probably not the
>>> most efficient way to do things, however.
>>
>> I think this is a good alternative, but I'd rather not impose this on
>> people like myself who deal mostly with English.  I think this should be
>> possible to do with wrapper types or intermediate ranges which have
>> graphemes as elements (per my suggestion above).
>>
>> Does this sound reasonable?
>>
>> -Steve
>
> If it's a matter of choosing which is the 'default' range, I'd think proper
> Unicode handling is more reasonable than catering to English / ASCII only.
> Especially since this is already the case in Phobos string algorithms.

English and (if I understand correctly) most other languages.  Any language which can be built from composable graphemes would work.  And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals).

What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic.  What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary.

Essentially, we would have three levels of types:

char[], wchar[], dchar[] -- Considered to be arrays in every way.
string_t!T (string, wstring, dstring) -- Specialized string types that do normalization to dchars, but do not handle perfectly all graphemes.  Works with any algorithm that deals with bidirectional ranges.  This is the default string type, and the type for string literals.  Represented internally by a single char[], wchar[] or dchar[] array.
* utfstring_t!T -- specialized string to deal with full unicode, which may perform worse than string_t, but supports everything unicode supports.  May require a battery of specialized algorithms.

* - name up for discussion

Also note that phobos currently does *no* normalization as far as I can tell for things like opEquals.  Two char[]'s that represent equivalent strings, but not in the same way, will compare as !=.
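
A tiny demonstration of that point (the two literals are canonically equivalent spellings of é):

unittest
{
    string precomposed = "\u00E9";      // é as a single code point
    string decomposed  = "e\u0301";     // e followed by a combining acute accent
    assert(precomposed != decomposed);  // array opEquals is bitwise, so they differ
}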

-Steve
January 15, 2011
Michel Fortin Wrote:

> What I don't understand is in what way using a string type would make the API less complex and use fewer templates?
> 
> More generally, in what way would your string type behave differently than char[], wchar[], and dchar[]? I think we need to clarify how you expect your string type to behave before I can answer anything. I mean, besides cosmetic changes such as having a codePoint property instead of by!dchar or byDchar, what is your string type doing differently?
> 
> The above algorithm is already possible with strings as they are, provided you implement the 'decompose' and the 'compose' function returning a range. In fact, you only changed two things in it: by!dchar became codePoints, and array() became string(). Surely you're expecting more benefits than that.
> 
> -- 
> Michel Fortin
> michel.fortin@michelf.com
> http://michelf.com/
> 

First of all, the question of possibility is irrelevant, since I could also write the same algorithm in Brainfuck or assembly (with a lot more code). It's never a question of possibility but rather a question of ease of use for the user.

What I want is to encapsulate all the low-level implementation details in one place so that, as a user, I will not need to deal with them everywhere. One such detail is the encoding.

auto text = w"whatever"; // should be equivalent to:
auto text = new Text("whatever", Encoding.UTF16);

now I want to write my own string function:

void func(Text a); // instead of the current:
void func(T)(T a) if (isTextType!T); // why does the USER need all this?

Of course, the Text type would do the correct thing by default, which we both agree should be graphemes. Only if I need something advanced, like in the previous algorithm, do I explicitly need to specify that I work on code points or code units.

In a sentence: "Make the common case trivial and the complex case possible". The common case is what we humans think of as characters (graphemes) and the complex case is the encoding level.
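
A minimal sketch of what I mean by a Text type -- the Encoding enum and the internal layout here are purely illustrative, and a real implementation would transcode to the requested encoding rather than just record a tag:

enum Encoding { UTF8, UTF16, UTF32 }

// Illustrative only: one type that owns its payload and remembers the
// encoding, so user code never touches char[]/wchar[]/dchar[] directly.
class Text
{
    private immutable(ubyte)[] payload;  // raw code units
    private Encoding enc;

    this(string s, Encoding e = Encoding.UTF8)
    {
        payload = cast(immutable(ubyte)[]) s;  // sketch: no transcoding here
        enc = e;
    }

    @property Encoding encoding() const { return enc; }
}

// User code no longer needs a template and a constraint just to accept text:
void func(Text a) { /* iterate a by graphemes, etc. */ }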

January 15, 2011
Steven Schveighoffer Wrote:

> On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn <lutger.blijdestijn@gmail.com> wrote:
> 
> > Steven Schveighoffer wrote:
> >
> > ...
> >>> I think a good standard to evaluate our handling of Unicode is to see
> >>> how easy it is to do things the right way. In the above, foreach would
> >>> slice the string grapheme by grapheme, and the == operator would
> >>> perform
> >>> a normalized comparison. While it works correctly, it's probably not
> >>> the
> >>> most efficient way to do things, however.
> >>
> >> I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English.  I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above).
> >>
> >> Does this sound reasonable?
> >>
> >> -Steve
> >
> > If it's a matter of choosing which is the 'default' range, I'd think
> > proper
> > Unicode handling is more reasonable than catering to English / ASCII
> > only.
> > Especially since this is already the case in Phobos string algorithms.
> 
> English and (if I understand correctly) most other languages.  Any language which can be built from composable graphemes would work.  And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals).
> 
> What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic.  What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary.
> 
> Essentially, we would have three levels of types:
> 
> char[], wchar[], dchar[] -- Considered to be arrays in every way.
> string_t!T (string, wstring, dstring) -- Specialized string types that do
> normalization to dchars, but do not handle perfectly all graphemes.  Works
> with any algorithm that deals with bidirectional ranges.  This is the
> default string type, and the type for string literals.  Represented
> internally by a single char[], wchar[] or dchar[] array.
> * utfstring_t!T -- specialized string to deal with full unicode, which may
> perform worse than string_t, but supports everything unicode supports.
> May require a battery of specialized algorithms.
> 
> * - name up for discussion
> 
> Also note that phobos currently does *no* normalization as far as I can tell for things like opEquals.  Two char[]'s that represent equivalent strings, but not in the same way, will compare as !=.
> 
> -Steve

The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I prefer the standard lib to not provide normalization at all and force me to use a 3rd party lib rather than provide an incomplete implementation that will give me a false sense of correctness and cause very subtle and hard to find bugs.

Moreover, even if you ignore Hebrew as a tiny, insignificant minority, you cannot do the same for Arabic, which has over one *billion* people that use that language.

I firmly believe that in accordance with D's principle that the default behavior should be the correct & safe option, D should have the full unicode type (utfstring_t above) as the default.

You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Use an ASCII type instead. In the same vein, a geneticist should use a DNA sequence type and not Unicode text.
January 15, 2011
Nick Sabalausky wrote:
> Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes.

I know some German, and to the best of my knowledge there are zero combining characters for it. The umlauts and the ß both have their own code points.

> legend has it there are others that can only be represented using a combining character.

??? I've never seen or heard of any. Not even in the old script that was in common use in Germany until after WW2.
January 15, 2011
On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer" <schveiguy@yahoo.com> said:

> On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin  <michel.fortin@michelf.com> wrote:
> 
>> Actually, returning a sliced char[] or wchar[] could also be valid.  User-perceived characters are basically a substring of one or more code  points. I'm not sure it complicates that much the semantics of the  language -- what's complicated about writing str.front == "a" instead of  str.front == 'a'? -- although it probably would complicate the generated  code and make it a little slower.
> 
> Hm... this pushes the normalization outside the type, and into the  algorithms (such as find).
> 
> I was hoping to avoid that.

Not really. It pushes the normalization to the string comparison operator, as explained later.


> I think I can  come up with an algorithm that normalizes into canonical form as it  iterates.  It just might return part of a grapheme if the grapheme cannot  be composed.

The problem with normalization while iterating is that you lose information about which code points actually make up the grapheme. If you wanted to count the number of graphemes containing a particular code point, you've lost that information.

Moreover, if all you want is to count the number of graphemes, normalizing the characters is a waste of time.

I suggested in another post that we implement ranges for decomposing and recomposing a string on the fly into its normalized form. That's basically the same thing as you suggest, but it'd have to be explicit to avoid the problem above.
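
Roughly along these lines for the decomposing side -- a sketch only: the single é -> e + U+0301 pair stands in for the real canonical decomposition data, and the recomposing range would be the mirror image, buffering a base code point and absorbing combining marks:

// Lazily decomposing range over a dstring.
struct Decompose
{
    dstring src;
    dstring pending;   // decomposition of the current code point, if any

    @property bool empty() const { return src.length == 0 && pending.length == 0; }

    @property dchar front()
    {
        if (pending.length == 0)
            prime();
        return pending[0];
    }

    void popFront()
    {
        if (pending.length == 0)
            prime();
        pending = pending[1 .. $];
    }

    private void prime()
    {
        dchar c = src[0];
        src = src[1 .. $];
        pending = (c == '\u00E9') ? "e\u0301"d : [c].idup;
    }
}

Decompose decompose(dstring s) { return Decompose(s); }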


>> I wonder if normalized string comparison shouldn't be built directly in  the char[] wchar[] and dchar[] types instead.
> 
> No, in my vision of how strings should be typed, char[] is an array, not a  string.  It should be treated like an array of code-units, where two forms  that create the same grapheme are considered different.

Well, I agree there's a need for that sometime. But if what you want is just a dumb array of code units, why not use ubyte[], ushort[] and uint[] instead?

It seems to me that the whole point of having a different type for char[], wchar[], and dchar[] is that you know they are Unicode strings and can treat them as such. And if you treat them as Unicode strings, then perhaps the runtime and the compiler should too, for consistency's sake.


>> Also bring the idea above that iterating on a string would yield  graphemes as char[] and this code would work perfectly irrespective of  whether you used combining characters:
>> 
>> 	foreach (grapheme; "exposé") {
>> 		if (grapheme == "é")
>> 			break;
>> 	}
>> 
>> I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things, however.
> 
> I think this is a good alternative, but I'd rather not impose this on  people like myself who deal mostly with English.

I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write:

	foreach (dchar c; "exposé") {}
	foreach (wchar c; "exposé") {}
	foreach (char c; "exposé") {}
	// or
	foreach (dchar c; "exposé".by!dchar()) {}
	foreach (wchar c; "exposé".by!wchar()) {}
	foreach (char c; "exposé".by!char()) {}

and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.
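
For what it's worth, by!T doesn't need to be anything fancy; an eager stand-in built on std.utf would already do (a real version would presumably be a lazy range):

import std.utf : toUTF8, toUTF16, toUTF32;

// Eager stand-in for the by!T adapter used above: hand foreach a sequence
// whose element type is the requested code unit / code point type.
auto by(T)(string s)
{
    static if (is(T == char))       return toUTF8(s);
    else static if (is(T == wchar)) return toUTF16(s);
    else static if (is(T == dchar)) return toUTF32(s);
    else static assert(0, "by!T supports char, wchar and dchar only");
}

With that, the foreach loops above compile as written, via UFCS on the string.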


> I think this should be  possible to do with wrapper types or intermediate ranges which have  graphemes as elements (per my suggestion above).

I think it should be the reverse. If you want your code to break when it encounters multi-code-point graphemes then it's your choice, but you should have to make your choice explicit. The default should be to handle strings correctly.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 15, 2011
On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo@bar.com> wrote:

> Steven Schveighoffer Wrote:

>>
>> English and (if I understand correctly) most other languages.  Any
>> language which can be built from composable graphemes would work.  And in
>> fact, ones that use some graphemes that cannot be composed will also work
>> to some degree (for example, opEquals).
>>
>> What I'm proposing (or think I'm proposing) is not exactly catering to
>> English and ASCII, what I'm proposing is simply not catering to more
>> complex languages such as Hebrew and Arabic.  What I'm trying to find is a
>> middle ground where most languages work, and the code is simple and
>> efficient, with possibilities to jump down to lower levels for performance
>> (i.e. switch to char[] when you know ASCII is all you are using) or jump
>> up to full unicode when necessary.
>>
>> Essentially, we would have three levels of types:
>>
>> char[], wchar[], dchar[] -- Considered to be arrays in every way.
>> string_t!T (string, wstring, dstring) -- Specialized string types that do
>> normalization to dchars, but do not handle perfectly all graphemes.  Works
>> with any algorithm that deals with bidirectional ranges.  This is the
>> default string type, and the type for string literals.  Represented
>> internally by a single char[], wchar[] or dchar[] array.
>> * utfstring_t!T -- specialized string to deal with full unicode, which may
>> perform worse than string_t, but supports everything unicode supports.
>> May require a battery of specialized algorithms.
>>
>> * - name up for discussion
>>
>> Also note that phobos currently does *no* normalization as far as I can
>> tell for things like opEquals.  Two char[]'s that represent equivalent
>> strings, but not in the same way, will compare as !=.
>>
>> -Steve
>
> The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I prefer the standard lib to not provide normalization at all and force me to use a 3rd party lib rather than provide an incomplete implementation that will give me a false sense of correctness and cause very subtle and hard to find bugs.

I feel like you might be exaggerating, but maybe I'm completely wrong on this; I'm not well-versed in Unicode, or even in languages that require it. The clear benefit I see is that with a string type which normalizes to canonical code points, you can use it in any algorithm without that algorithm having to be Unicode-aware, for *most languages*. At least, that is how I see it. I'm looking at it as a code-reuse proposition.

It's like calendars.  There are quite a few different calendars in different cultures.  But most people use a Gregorian calendar.  So we have three options:

a) Use a Gregorian calendar, and leave the other calendars to a 3rd party library
b) Use a complicated calendar system where Gregorian calendars are treated with equal respect to all other calendars, none are the default.
c) Use a Gregorian calendar by default, but include the other calendars as a separate module for those who wish to use them.

I'm looking at my proposal as more of a c) solution.

Can you show how normalization causes subtle bugs?

> Moreover, even if you ignore Hebrew as a tiny, insignificant minority, you cannot do the same for Arabic, which has over one *billion* people that use that language.

I hope that the medium type works 'good enough' for those languages, with the high level type needed for advanced usages.  At a minimum, comparison and substring should work for all languages.

> I firmly believe that in accordance with D's principle that the default behavior should be the correct & safe option, D should have the full unicode type (utfstring_t above) as the default.
>
> You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Use an ASCII type instead. In the same vein, a geneticist should use a DNA sequence type and not Unicode text.

Or French, or Spanish, or German, etc...

Look, even the lowest level is valid unicode, but if you want to start extracting individual graphemes, you need more machinery.  In 99% of cases, I'd think you want to use strings as strings, not as sequences of graphemes, or code-units.

-Steve
January 15, 2011
On Sat, 15 Jan 2011 13:32:10 -0500, Michel Fortin <michel.fortin@michelf.com> wrote:

> On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer" <schveiguy@yahoo.com> said:
>
>> On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin  <michel.fortin@michelf.com> wrote:
>>
>>> Actually, returning a sliced char[] or wchar[] could also be valid.  User-perceived characters are basically a substring of one or more code  points. I'm not sure it complicates that much the semantics of the  language -- what's complicated about writing str.front == "a" instead of  str.front == 'a'? -- although it probably would complicate the generated  code and make it a little slower.
>>  Hm... this pushes the normalization outside the type, and into the  algorithms (such as find).
>>  I was hoping to avoid that.
>
> Not really. It pushes the normalization to the string comparison operator, as explained later.
>
>
>> I think I can  come up with an algorithm that normalizes into canonical form as it  iterates.  It just might return part of a grapheme if the grapheme cannot  be composed.
>
> The problem with normalization while iterating is that you lose information about which code points actually make up the grapheme. If you wanted to count the number of graphemes containing a particular code point, you've lost that information.

Are these common requirements?  I thought users mostly care about graphemes, not code points.  Asking in the dark here, since I have next to zero experience with unicode strings.

>
> Moreover, if all you want is to count the number of graphemes, normalizing the characters is a waste of time.

This is true.  I can see this being a common need.

>
> I suggested in another post that we implement ranges for decomposing and recomposing a string on the fly into its normalized form. That's basically the same thing as you suggest, but it'd have to be explicit to avoid the problem above.

OK, I see your point.

>
>
>>> I wonder if normalized string comparison shouldn't be built directly in  the char[] wchar[] and dchar[] types instead.
>>  No, in my vision of how strings should be typed, char[] is an array, not a  string.  It should be treated like an array of code-units, where two forms  that create the same grapheme are considered different.
>
> Well, I agree there's a need for that sometime. But if what you want is just a dumb array of code units, why not use ubyte[], ushort[] and uint[] instead?

Because ubyte[], ushort[], and uint[] do not say that their data is Unicode text. The point is, I want to write a function that takes UTF-8; ubyte[] opens it up to any data, not just UTF-8 data. But if we have a method of iterating code units as you specify below, then I think we are OK.

> It seems to me that the whole point of having a different type for char[], wchar[], and dchar[] is that you know they are Unicode strings and can treat them as such. And if you treat them as Unicode strings, then perhaps the runtime and the compiler should too, for consistency's sake.

I'd agree with you, but then there's that pesky [] after it indicating it's an array.  For consistency's sake, I'd say the compiler should treat T[] as an array of T's.

>>> Also bring the idea above that iterating on a string would yield  graphemes as char[] and this code would work perfectly irrespective of  whether you used combining characters:
>>>  	foreach (grapheme; "exposé") {
>>> 		if (grapheme == "é")
>>> 			break;
>>> 	}
>>> I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things, however.
>>  I think this is a good alternative, but I'd rather not impose this on  people like myself who deal mostly with English.
>
> I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write:
>
> 	foreach (dchar c; "exposé") {}
> 	foreach (wchar c; "exposé") {}
> 	foreach (char c; "exposé") {}
> 	// or
> 	foreach (dchar c; "exposé".by!dchar()) {}
> 	foreach (wchar c; "exposé".by!wchar()) {}
> 	foreach (char c; "exposé".by!char()) {}
>
> and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.

I think this is a good idea.  I previously was nervous about it, but I'm not sure it makes a huge difference.  Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them.  All that it takes is to detect all the code points within the grapheme.  Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.

What if I modified my proposed string_t type to return T[] as its element type, as you say, and string literals were typed as string_t!(whatever)? In addition, the restrictions I proposed on slicing through a code point would instead be imposed on slicing through a grapheme. That is, it would be illegal to substring a string_t in a way that slices through a grapheme (and, by extension, a code point).

Actually, we would need a grapheme to be its own type, because comparing two char[]'s that don't contain equivalent bits and having them be equal violates the expectation that char[] is an array.

So the string_t!char would return a grapheme_t!char (names to be discussed) as its element type.
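
Something like the following, perhaps -- just a sketch, where the toy composition function handles a single pair and stands in for the real canonical composition tables:

import std.utf : toUTF32;

// Toy canonical composition: only e + U+0301 -> U+00E9, enough to show the
// comparison; a real version would use the Unicode composition data.
dchar[] toyCompose(const(char)[] s)
{
    auto src = toUTF32(s);
    dchar[] result;
    for (size_t i = 0; i < src.length; ++i)
    {
        if (src[i] == 'e' && i + 1 < src.length && src[i + 1] == '\u0301')
        {
            result ~= '\u00E9';
            ++i;
        }
        else
            result ~= src[i];
    }
    return result;
}

// grapheme_t sketch: holds the code units of one grapheme and compares
// canonically instead of bitwise.
struct grapheme_t
{
    string units;

    bool opEquals(const grapheme_t other) const
    {
        return toyCompose(units) == toyCompose(other.units);
    }
}

unittest
{
    assert(grapheme_t("\u00E9") == grapheme_t("e\u0301"));
}

The important part is only that opEquals compares canonical forms while the raw code units stay accessible.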

>
>
>> I think this should be  possible to do with wrapper types or intermediate ranges which have  graphemes as elements (per my suggestion above).
>
> I think it should be the reverse. If you want your code to break when it encounters multi-code-point graphemes then it's your choice, but you should have to make your choice explicit. The default should be to handle strings correctly.

You are probably right.

-Steve
January 15, 2011
On Sat, 15 Jan 2011 14:51:47 -0500, Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> I feel like you might be exaggerating, but maybe I'm completely wrong on this; I'm not well-versed in Unicode, or even in languages that require it. The clear benefit I see is that with a string type which normalizes to canonical code points, you can use it in any algorithm without that algorithm having to be Unicode-aware, for *most languages*. At least, that is how I see it. I'm looking at it as a code-reuse proposition.
>
> It's like calendars.  There are quite a few different calendars in different cultures.  But most people use a Gregorian calendar.  So we have three options:
>
> a) Use a Gregorian calendar, and leave the other calendars to a 3rd party library
> b) Use a complicated calendar system where Gregorian calendars are treated with equal respect to all other calendars, none are the default.
> c) Use a Gregorian calendar by default, but include the other calendars as a separate module for those who wish to use them.
>
> I'm looking at my proposal as more of a c) solution.
>
> Can you show how normalization causes subtle bugs?

I see from Michel's post how automatic normalization can be bad. I also see that it can be wasteful. So I've shifted my position.

Now I agree that we need a full unicode-compliant string type as the default.  See my reply to Michel for more info on my revised proposal.

-Steve