January 16, 2011
On Saturday 15 January 2011 15:59:27 Andrei Alexandrescu wrote:
> On 1/15/11 4:45 PM, Michel Fortin wrote:
> > On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"
> > 
> > <schveiguy@yahoo.com> said:
> >> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
> >> 
> >> <michel.fortin@michelf.com> wrote:
> >>> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
> >>> 
> >>> <schveiguy@yahoo.com> said:
> >>>>> I'm not suggesting we impose it, just that we make it the default.
> >>>>> If you want to iterate by dchar, wchar, or char, just write:
> >>>>> foreach (dchar c; "exposé") {}
> >>>>> foreach (wchar c; "exposé") {}
> >>>>> foreach (char c; "exposé") {}
> >>>>> // or
> >>>>> foreach (dchar c; "exposé".by!dchar()) {}
> >>>>> foreach (wchar c; "exposé".by!wchar()) {}
> >>>>> foreach (char c; "exposé".by!char()) {}
> >>>>> and it'll work. But the default would be a slice containing the
> >>>>> grapheme, because this is the right way to represent a Unicode
> >>>>> character.
> >>>> 
> >>>> I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.
> >>> 
> >>> I'm glad we agree on that now.
> >> 
> >> It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).
> > 
> > Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now; can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support.
> > 
> > That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.
> > 
> >>> I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments:
> >>> 
> >>> I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).
> >> 
> >> I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility.
> >> 
> >> However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.
> > 
> > Indeed, the change would probably be too radical for D2.
> > 
> > I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works.
> > 
> > Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the grounds that it would silently break D1 compatibility. This is a valid point in my opinion.
> > 
> > I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point.
> > 
> > That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do.
> > 
> > One more thing:
> > 
> > NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operations. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.
> 
> I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base.
> 
> It may be more realistic to consider using what we have as back-end for grapheme-oriented processing. For example:
> 
> struct Grapheme(Char) if (isSomeChar!Char)
> {
>      private const Char[] rep;
>      ...
> }
> 
> auto byGrapheme(S)(S s) if (isSomeString!S)
> {
>     ...
> }
> 
> string s = "Hello";
> foreach (g; byGrapheme(s))
> {
>      ...
> }

Considering that strings are already dealt with specially in order to have an element type of dchar, I wouldn't think that it would be all that disruptive to make it so that they had an element type of Grapheme instead. Wouldn't that then fix all of std.algorithm and the like without really disrupting anything?

The issue of foreach remains, but without being willing to change what foreach defaults to, you can't really fix it - though I'd suggest that we at least make it a warning to iterate over strings without specifying the type. And if foreach were made to understand Grapheme like it understands dchar, then you could do

foreach(Grapheme g; str) { ... }

and have the compiler warn about

foreach(g; str) { ... }

and tell you to use Grapheme if you want to be comparing actual characters. Regardless, by making strings ranges of Grapheme rather than dchar, I would think that we would solve most of the problem. At minimum, we'd have pretty much the same problems that we have right now with char and wchar arrays, but we'd get rid of a whole class of unicode problems. So, nothing would be worse, but some of it would be better.
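
As a rough illustration of what the explicit foreach element types mentioned above do today, here is a minimal sketch. The helper name countAs is hypothetical, and the sample string is "exposé" spelled with a combining acute accent so that the three counts can be compared with the six characters a reader sees.

```d
import std.stdio : writeln;
import std.traits : Unqual;

// Hypothetical helper: count foreach iterations for an explicit element type.
size_t countAs(T)(string s)
{
    size_t n;
    foreach (T c; s)   // the compiler decodes to T's encoding on the fly
        ++n;
    return n;
}

void main()
{
    // "exposé" written as 'e' followed by U+0301 (combining acute accent).
    string s = "expose\u0301";

    writeln(countAs!char(s));   // UTF-8 code units
    writeln(countAs!wchar(s));  // UTF-16 code units
    writeln(countAs!dchar(s));  // code points

    // With no explicit type, foreach over a string infers char (code units).
    foreach (c; s)
        static assert(is(Unqual!(typeof(c)) == char));
}
```

None of the three loops stops at grapheme boundaries, which is the gap a Grapheme element type would be meant to close.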

- Jonathan M Davis
January 16, 2011
On Saturday 15 January 2011 19:25:47 Jonathan M Davis wrote:
> On Saturday 15 January 2011 15:59:27 Andrei Alexandrescu wrote:
> > On 1/15/11 4:45 PM, Michel Fortin wrote:
> > > On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"
> > > 
> > > <schveiguy@yahoo.com> said:
> > >> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
> > >> 
> > >> <michel.fortin@michelf.com> wrote:
> > >>> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
> > >>> 
> > >>> <schveiguy@yahoo.com> said:
> > >>>>> I'm not suggesting we impose it, just that we make it the default.
> > >>>>> If you want to iterate by dchar, wchar, or char, just write:
> > >>>>> foreach (dchar c; "exposé") {}
> > >>>>> foreach (wchar c; "exposé") {}
> > >>>>> foreach (char c; "exposé") {}
> > >>>>> // or
> > >>>>> foreach (dchar c; "exposé".by!dchar()) {}
> > >>>>> foreach (wchar c; "exposé".by!wchar()) {}
> > >>>>> foreach (char c; "exposé".by!char()) {}
> > >>>>> and it'll work. But the default would be a slice containing the
> > >>>>> grapheme, because this is the right way to represent a Unicode
> > >>>>> character.
> > >>>> 
> > >>>> I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.
> > >>> 
> > >>> I'm glad we agree on that now.
> > >> 
> > >> It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).
> > > 
> > > Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now; can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support.
> > > 
> > > That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.
> > > 
> > >>> I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments:
> > >>> 
> > >>> I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).
> > >> 
> > >> I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility.
> > >> 
> > >> However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.
> > > 
> > > Indeed, the change would probably be too radical for D2.
> > > 
> > > I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works.
> > > 
> > > Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the grounds that it would silently break D1 compatibility. This is a valid point in my opinion.
> > > 
> > > I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point.
> > > 
> > > That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do.
> > > 
> > > One more thing:
> > > 
> > > NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operations. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.
> > 
> > I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base.
> > 
> > It may be more realistic to consider using what we have as back-end for grapheme-oriented processing. For example:
> > 
> > struct Grapheme(Char) if (isSomeChar!Char)
> > {
> >      private const Char[] rep;
> >      ...
> > }
> > 
> > auto byGrapheme(S)(S s) if (isSomeString!S)
> > {
> >     ...
> > }
> > 
> > string s = "Hello";
> > foreach (g; byGrapheme(s))
> > {
> >      ...
> > }
> 
> Considering that strings are already dealt with specially in order to have an element type of dchar, I wouldn't think that it would be all that disruptive to make it so that they had an element type of Grapheme instead. Wouldn't that then fix all of std.algorithm and the like without really disrupting anything?
> 
> The issue of foreach remains, but without being willing to change what foreach defaults to, you can't really fix it - though I'd suggest that we at least make it a warning to iterate over strings without specifying the type. And if foreach were made to understand Grapheme like it understands dchar, then you could do
> 
> foreach(Grapheme g; str) { ... }
> 
> and have the compiler warn about
> 
> foreach(g; str) { ... }
> 
> and tell you to use Grapheme if you want to be comparing actual characters. Regardless, by making strings ranges of Grapheme rather than dchar, I would think that we would solve most of the problem. At minimum, we'd have pretty much the same problems that we have right now with char and wchar arrays, but we'd get rid of a whole class of unicode problems. So, nothing would be worse, but some of it would be better.

I suppose that the one major omission though is that string comparisons would be by code unit, not graphemes, which would be a problem. == could be made to use graphemes instead, but then you couldn't compare them by code units or code points unless you cast to ubyte[], ushort[], or uint[]... It would still probably be worth making == use graphemes though.
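
To make the comparison issue concrete, here is a small sketch of how the built-in == behaves today next to a comparison done on normalized text. It assumes a current Phobos in which std.uni provides normalize and the NFC form; the helper name sameText is hypothetical and is not a proposal for how == itself would be implemented.

```d
import std.uni : NFC, normalize;

// Hypothetical helper: equality after canonical composition (NFC).
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00E9";   // é as a single code point
    string decomposed  = "e\u0301";  // e followed by a combining acute accent

    // The built-in == compares code units, so the two spellings differ.
    assert(precomposed != decomposed);

    // After normalization both spellings compare equal.
    assert(sameText(precomposed, decomposed));
}
```

Comparing the raw code units remains possible simply by not normalizing, which is the trade-off described above.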

- Jonathan M Davis
January 16, 2011
On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base.

There's still a disagreement about whether a string or a code unit array should be the default string representation, and whether iterating on a code unit array should give you code unit or grapheme elements. Of those who participated in the discussion, I don't think anyone is disputing the idea that a grapheme element is better than a dchar element for iterating over a string.


> It may be more realistic to consider using what we have as back-end for grapheme-oriented processing.
> For example:
> 
> struct Grapheme(Char) if (isSomeChar!Char)
> {
>      private const Char[] rep;
>      ...
> }
> 
> auto byGrapheme(S)(S s) if (isSomeString!S)
> {
>     ...
> }
> 
> string s = "Hello";
> foreach (g; byGrapheme(s))
> {
>      ...
> }

No doubt it's easier to implement it that way. The problem is that in most cases it won't be used. How many people really know what a grapheme is? Of those, how many will forget to use byGrapheme at one time or another? And so in most programs string manipulation will misbehave in the presence of combining characters or unnormalized strings.

If you want to help D programmers write correct code when it comes to Unicode manipulation, you need to help them iterate on real characters (graphemes), and you need the algorithms to apply to real characters (graphemes), not the approximation of a Unicode character that is a code point.
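
The gap between code points and graphemes can be seen in a short sketch. It assumes a current Phobos, where std.uni offers a byGrapheme range (an opt-in library function along the lines sketched earlier in the thread); graphemeCount is a hypothetical helper name.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of user-perceived characters in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // "exposé" with the accent stored as a separate combining code point.
    string s = "expose\u0301";

    assert(s.length == 8);          // UTF-8 code units
    assert(s.walkLength == 7);      // code points (range primitives decode to dchar)
    assert(graphemeCount(s) == 6);  // graphemes, matching what a reader sees
}
```

Only the last count matches the six characters on screen, and it is the one that has to be asked for explicitly.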


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 16, 2011
On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg@gmx.com> said:

> On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
>> I have my idea.
>> 
>> I think it'd be a good idea to improve upon Andrei's first idea --
>> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
>> elements -- by changing the element type to be the same as the string.
>> For instance, iterating on a char[] would give you slices of char[],
>> each having one grapheme.
>> 
>> The second component would be to make the string equality operator (==)
>> for strings compare them in their normalized form, so that ("e" with
>> combining acute accent) == (pre-combined "é"). I think this would make
>> D support for Unicode much more intuitive.
>> 
>> This implies some semantic changes, mainly that everywhere you write a
>> "character" you must use double-quotes (string "a") instead of single
>> quote (code point 'a'), but from the user's point of view that's pretty
>> much all there is to change.
>> 
>> There'll still be plenty of room for specialized algorithms, but their
>> purpose would be limited to optimization. Correctness would be taken
>> care of by the basic range interface, and foreach should follow suit
>> and iterate by grapheme by default.
>> 
>> I wrote this example (or something similar) earlier in this thread:
>> 
>> 	foreach (grapheme; "exposé")
>> 		if (grapheme == "é")
>> 			break;
>> 
>> In this example, even if one of these two strings use the pre-combined
>> form of "é" and the other uses a combining acute accent, the equality
>> would still hold since foreach iterates on full graphemes and ==
>> compares using normalization.
> 
> I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types.

I remember that someone already complained about this issue because he had a tree of ranges, and Andrei said he would take a look at this problem eventually. Perhaps now would be a good time.


> Now, given that dchar can't actually work completely as an element type, you'd either need the string type to be a new type or the element type to be a new type. So, either the string type has char[], wchar[], or dchar[] for its element type, or char[], wchar[], and dchar[] have something like uchar as their element type, where uchar is a struct which contains a char[], wchar[], or dchar[] which holds a single grapheme.

Having a new type for grapheme would work too. My preference still goes to reusing the string type because it makes the semantics simpler to understand, especially when comparing graphemes with literals.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 16, 2011
On 2011-01-15 22:25:47 -0500, Jonathan M Davis <jmdavisProg@gmx.com> said:

> The issue of foreach remains, but without being willing to change what foreach defaults to, you can't really fix it - though I'd suggest that we at least make it a warning to iterate over strings without specifying the type. And if foreach were made to understand Grapheme like it understands dchar, then you could do
> 
> foreach(Grapheme g; str) { ... }
> 
> and have the compiler warn about
> 
> foreach(g; str) { ... }
> 
> and tell you to use Grapheme if you want to be comparing actual characters.

Walter's argument against changing this for foreach was that it'd *silently* break compatibility with existing D1 code. Changing the default to a grapheme makes this argument obsolete: since a grapheme is essentially a string, you can't compare it with char or wchar or dchar directly, so it'll break at compile time with an error and you'll have to decide what to do.
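
The "breaks at compile time" claim can be checked with a small sketch. It relies only on the existing rule that an array and a single character have no == defined between them; canCompare is a hypothetical helper, and a string slice stands in for the proposed grapheme element.

```d
// Hypothetical helper: true if `A == B` type-checks for values of these types.
enum canCompare(A, B) = __traits(compiles, A.init == B.init);

void main()
{
    // A grapheme represented as a string slice cannot be compared
    // with a code unit or a code point: the compiler rejects it.
    static assert(!canCompare!(string, char));
    static assert(!canCompare!(string, wchar));
    static assert(!canCompare!(string, dchar));

    // Comparing against a string literal is still fine.
    static assert(canCompare!(string, string));

    // By contrast, code written for char elements keeps compiling if the
    // element type becomes dchar, so that change can go unnoticed.
    static assert(canCompare!(dchar, char));
}
```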

So Walter would have to find another argument to defend the status quo.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 16, 2011
On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:
> On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg@gmx.com> said:
> > On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
> >> I have my idea.
> >> 
> >> I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.
> >> 
> >> The second component would be to make the string equality operator (==) for strings compare them in their normalized form, so that ("e" with combining acute accent) == (pre-combined "é"). I think this would make D support for Unicode much more intuitive.
> >> 
> >> This implies some semantic changes, mainly that everywhere you write a "character" you must use double-quotes (string "a") instead of single quote (code point 'a'), but from the user's point of view that's pretty much all there is to change.
> >> 
> >> There'll still be plenty of room for specialized algorithms, but their purpose would be limited to optimization. Correctness would be taken care of by the basic range interface, and foreach should follow suit and iterate by grapheme by default.
> >> 
> >> I wrote this example (or something similar) earlier in this thread:
> >> 	foreach (grapheme; "exposé")
> >> 		if (grapheme == "é")
> >> 			break;
> >> 
> >> In this example, even if one of these two strings use the pre-combined form of "é" and the other uses a combining acute accent, the equality would still hold since foreach iterates on full graphemes and == compares using normalization.
> > 
> > I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types.
> 
> I remember that someone already complained about this issue because he had a tree of ranges, and Andrei said he would take a look at this problem eventually. Perhaps now would be a good time.
> 
> > Now, given that dchar can't actually work completely as an element
> > type, you'd either need the string type to be a new type or the element
> > type to be a new type. So, either the string type has char[], wchar[],
> > or dchar[] for its element type, or char[], wchar[], and dchar[] have
> > something like uchar as their element type, where uchar is a struct
> > which contains a char[], wchar[], or dchar[]
> > which holds a single grapheme.
> 
> Having a new type for grapheme would work too. My preference still goes to reusing the string type because it makes the semantic simpler to understand, especially when comparing graphemes with literals.

If a character literal actually became a grapheme instead of a dchar, then that would likely solve that issue. But I fear that the semantics of having a range be its own element type actually make understanding it _harder_, not simpler. Being forced to compare string literals against what should be a character would definitely confuse programmers. Making a new character or grapheme type which represented a grapheme would be _far_ simpler to understand IMO. However, making it work really well would likely require that the compiler know about the grapheme type like it knows about dchar.

- Jonathan M Davis
January 16, 2011
On 2011-01-15 23:58:30 -0500, Jonathan M Davis <jmdavisProg@gmx.com> said:

> On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:
>> On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg@gmx.com> said:
>>> On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
>>>> I have my idea.
>>>> 
>>>> I think it'd be a good idea to improve upon Andrei's first idea --
>>>> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
>>>> elements -- by changing the element type to be the same as the string.
>>>> For instance, iterating on a char[] would give you slices of char[],
>>>> each having one grapheme.
>>>> 
>>>> The second component would be to make the string equality operator (==)
>>>> for strings compare them in their normalized form, so that ("e" with
>>>> combining acute accent) == (pre-combined "é"). I think this would make
>>>> D support for Unicode much more intuitive.
>>>> 
>>>> This implies some semantic changes, mainly that everywhere you write a
>>>> "character" you must use double-quotes (string "a") instead of single
>>>> quote (code point 'a'), but from the user's point of view that's pretty
>>>> much all there is to change.
>>>> 
>>>> There'll still be plenty of room for specialized algorithms, but their
>>>> purpose would be limited to optimization. Correctness would be taken
>>>> care of by the basic range interface, and foreach should follow suit
>>>> and iterate by grapheme by default.
>>>> 
>>>> I wrote this example (or something similar) earlier in this thread:
>>>> 	foreach (grapheme; "exposé")
>>>> 		if (grapheme == "é")
>>>> 			break;
>>>> 
>>>> In this example, even if one of these two strings use the pre-combined
>>>> form of "é" and the other uses a combining acute accent, the equality
>>>> would still hold since foreach iterates on full graphemes and ==
>>>> compares using normalization.
>>> 
>>> I think that that would cause definite problems. Having the element
>>> type of the range be the same type as the range seems like it could
>>> cause a lot of problems in std.algorithm and the like, and it's
>>> _definitely_ going to confuse programmers. I'd expect it to be highly
>>> bug-prone. They _need_ to be separate types.
>> 
>> I remember that someone already complained about this issue because he
>> had a tree of ranges, and Andrei said he would take a look at this
>> problem eventually. Perhaps now would be a good time.
>> 
>>> Now, given that dchar can't actually work completely as an element
>>> type, you'd either need the string type to be a new type or the element
>>> type to be a new type. So, either the string type has char[], wchar[],
>>> or dchar[] for its element type, or char[], wchar[], and dchar[] have
>>> something like uchar as their element type, where uchar is a struct
>>> which contains a char[], wchar[], or dchar[]
>>> which holds a single grapheme.
>> 
>> Having a new type for grapheme would work too. My preference still goes
>> to reusing the string type because it makes the semantic simpler to
>> understand, especially when comparing graphemes with literals.
> 
> If a character literal actually became a grapheme instead of a dchar, then
> that would likely solve that issue. But I fear that the semantics of having a range
> be its own element type actually make understanding it _harder_, not simpler.
> Being forced to compare string literals against what should be a character would definitely confuse programmers.

Character literals are treated as simple numbers by the language. By that I mean that you can write 'b' - 'a' == 1 and it'll be true. Arithmetic makes absolutely no sense for graphemes. If you want a special literal for graphemes, I'm afraid you'll have to invent something new. And at this point, why not use a string?
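
For readers unfamiliar with this corner of the language, the arithmetic in question looks like the following minimal sketch. The function nextLetter is a hypothetical name used only for illustration.

```d
// Hypothetical helper: the letter following c, using plain integer arithmetic.
char nextLetter(char c)
{
    // c + 1 is promoted to int, so a cast is needed to get back to char.
    return cast(char)(c + 1);
}

void main()
{
    // Character literals take part in ordinary integer arithmetic.
    static assert('b' - 'a' == 1);
    static assert(is(typeof('b' - 'a') == int));

    assert(nextLetter('a') == 'b');
}
```

Nothing comparable is meaningful for a multi-code-point grapheme such as "e" plus a combining accent, which is the objection being raised.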


> Making a new character or grapheme type which represented a grapheme would be _far_ simpler to understand IMO. However, making it work really well would likely require that the compiler know about the grapheme type like it knows about dchar.

I'm looking for a simple solution. One that doesn't involve inventing a new grapheme literal syntax or adding new types the compiler must know about. I'm not really opposed to any of this, but the more complicated the solution is, the less likely it is to be adopted.

All I'm asking is that Unicode strings behave as Unicode strings should behave. Making iteration use graphemes by default and string comparison use the normalized form by default seems like a simple way to achieve that goal.

The most important thing is not the implementation, but that the default behaviour be the right behaviour.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 16, 2011
Michel Fortin Wrote:


> Character literals are treated as simple numbers by the language. By that I mean that you can write 'b' - 'a' == 1 and it'll be true. Arithmetic makes absolutely no sense for graphemes. If you want a special literal for graphemes, I'm afraid you'll have to invent something new. And at this point, why not use a string?
> 
> 
> > Making a new character or grapheme type which represented a grapheme would be _far_ simpler to understand IMO. However, making it work really well would likely require that the compiler know about the grapheme type like it knows about dchar.
> 
> I'm looking for a simple solution. One that doesn't involve inventing a new grapheme literal syntax or adding new types the compiler must know about. I'm not really opposed to any of this, but the more complicated the solution is, the less likely it is to be adopted.
> 
> All I'm asking is that Unicode strings behave as Unicode strings should behave. Making iteration use graphemes by default and string comparison use the normalized form by default seems like a simple way to achieve that goal.
> 
> The most important thing is not the implementation, but that the default behaviour be the right behaviour.
> 
> 
> -- 
> Michel Fortin
> michel.fortin@michelf.com
> http://michelf.com/
> 

I understand your concern regarding a simpler implementation. You want to minimize the disruption caused by the proposed change.

I'd argue that creating a specialized string type as Steve suggests makes integration *easier*. Your suggestion requires that foreach be changed to default to grapheme. I agree that this can be done because it will not break silently, but with Steve's string type this is unnecessary, since the type itself would provide a grapheme range interface and the compiler doesn't need to know about this type at all. string becomes a regular library type.

Of course, the type should support:
string foo = "bar";
by making an implicit conversion from current arrays (to minimize compiler changes)

The only disruption as far as I can tell would be using 'a' type literals instead of "a" but that will come up in compilation after string defaults to the new type. Also, all occurrences of:
string foo = ...;
foreach (c; foo) {...} // c is now a grapheme
will now do the correct thing by default.

January 16, 2011
On 2011-01-16 02:11:14 -0500, foobar <foo@bar.com> said:

> I understand your concern regarding a simpler implementation. You want to minimize the disruption caused by the proposed change.
> 
> I'd argue that creating a specialized string type as Steve suggests makes integration *easier*. Your suggestion requires that foreach be changed to default to grapheme. I agree that this can be done because it will not break silently, but with Steve's string type this is unnecessary, since the type itself would provide a grapheme range interface and the compiler doesn't need to know about this type at all. string becomes a regular library type.
> 
> Of course, the type should support:
> string foo = "bar";
> by making an implicit conversion from current arrays (to minimize compiler changes)

It should also work for:

	auto foo = "bar";


> The only disruption as far as I can tell would be using 'a' type literals instead of "a" but that will come up in compilation after string defaults to the new type.

You say "after string defaults to the new type", but I don't think this change to the language will pass. It'll break TDPL for one thing, so it's surely out for D2. And I somewhat doubt it's low-level enough for Walter's taste.

I don't care much if the default type is an array or not, I just want the default type to work properly as a Unicode string. The very small participation in this thread from the key decision makers (Andrei and Walter) worries me however. I'm not even sure we'll achieve that goal.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 16, 2011
Michel Fortin Wrote:

> On 2011-01-16 02:11:14 -0500, foobar <foo@bar.com> said:
> 
> > I understand your concern regarding a simpler implementation. You want to minimize the disruption caused by the proposed change.
> > 
> > I'd argue that creating a specialized string type as Steve suggests makes integration *easier*. Your suggestion requires that foreach be changed to default to grapheme. I agree that this can be done because it will not break silently, but with Steve's string type this is unnecessary, since the type itself would provide a grapheme range interface and the compiler doesn't need to know about this type at all. string becomes a regular library type.
> > 
> > Of course, the type should support:
> > string foo = "bar";
> > by making an implicit conversion from current arrays (to minimize
> > compiler changes)
> 
> It should also work for:
> 
> 	auto foo = "bar";

Right. This does require compiler changes.

> 
> 
> > The only disruption as far as I can tell would be using 'a' type literals instead of "a" but that will come up in compilation after string defaults to the new type.
> 
> You say "after string defaults to the new type", but I don't think this change to the language will pass. It'll break TDPL for one thing, so it's surely out for D2. And I somewhat doubt it's low-level enough for Walter's taste.
> 

string is an alias in Phobos, so it's more of a stdlib change, but I see your point about TDPL. I did get the feeling that Andrei is willing to make a change if it proves worthwhile by preventing the writing of bad code (which we both agree this change accomplishes).

> I don't care much if the default type is an array or not, I just want the default type to work properly as a Unicode string. The very small participation in this thread from the key decision makers (Andrei and Walter) worries me however. I'm not even sure we'll achieve that goal.
> 
> 

Andrei did take part and even asked for links that explain the subject. Perhaps the quiet is due to the mastermind doing research on the topic rather than reluctance to make any changes. :)

> -- 
> Michel Fortin
> michel.fortin@michelf.com
> http://michelf.com/
>