January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer" 
<schveiguy@yahoo.com> said:

> On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn  
> <lutger.blijdestijn@gmail.com> wrote:
> 
>> Steven Schveighoffer wrote:
>> 
>> ...
>>>> I think a good standard to evaluate our handling of Unicode is to see
>>>> how easy it is to do things the right way. In the above, foreach would
>>>> slice the string grapheme by grapheme, and the == operator would  perform
>>>> a normalized comparison. While it works correctly, it's probably not  the
>>>> most efficient way to do things, however.
>>> 
>>> I think this is a good alternative, but I'd rather not impose this on
>>> people like myself who deal mostly with English.  I think this should be
>>> possible to do with wrapper types or intermediate ranges which have
>>> graphemes as elements (per my suggestion above).
>>> 
>>> Does this sound reasonable?
>>> 
>>> -Steve
>> 
>> If it's a matter of choosing which is the 'default' range, I'd think  proper
>> unicode handling is more reasonable than catering for english / ascii  only.
>> Especially since this is already the case in phobos string algorithms.
> 
> English and (if I understand correctly) most other languages.  Any  
> language which can be built from composable graphemes would work.  And 
> in  fact, ones that use some graphemes that cannot be composed will 
> also work  to some degree (for example, opEquals).
> 
> What I'm proposing (or think I'm proposing) is not exactly catering to  
> English and ASCII, what I'm proposing is simply not catering to more  
> complex languages such as Hebrew and Arabic.  What I'm trying to find 
> is a  middle ground where most languages work, and the code is simple 
> and  efficient, with possibilities to jump down to lower levels for 
> performance  (i.e. switch to char[] when you know ASCII is all you are 
> using) or jump  up to full unicode when necessary.

Why don't we build a compiler with an optimizer that generates correct 
code *almost* all of the time? If you are worried about it not 
producing correct code for a given function, you can just add 
"pragma(correct_code)" in front of that function to disable the risky 
optimizations. No harm done, right?

One thing I see very often, often on US web sites but also elsewhere, 
is that if you enter a name with an accented letter in a form (say 
Émilie), very often the accented letter gets changed to another 
semi-random character later in the process. Why? Because somewhere in 
the process lies an encoding mismatch that no one thought about and no 
one tested for. At the very least, the form should have rejected those 
unexpected characters and shown an error when it could.

Now, with proper Unicode handling up to the code point level, this kind 
of problem probably won't happen as often because the whole stack works 
with UTF encodings. But are you going to validate all of your inputs to 
make sure they have no combining code point?
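The mismatch Michel describes can be made concrete outside D. A minimal Python sketch (Python used purely for illustration here, not the D API under discussion) using the standard `unicodedata` module shows two encodings of the same visible string comparing unequal until they are normalized:

```python
import unicodedata

# The same visible string "exposé", encoded two ways:
precomposed = "expos\u00e9"    # 'é' as one code point U+00E9 (NFC form)
decomposed = "expose\u0301"    # 'e' followed by combining acute U+0301 (NFD form)

# A naive code-point comparison says they differ.
print(precomposed == decomposed)   # False

# Normalizing both to the same form makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)          # True
```

This is exactly the "combining code point" input that slips past a stack that only validates UTF encoding.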

Don't assume that because you're in the United States no one will try 
to enter characters where you don't expect them. People love to play 
with Unicode symbols for fun, putting them in their name, signature, or 
even domain names (✪df.ws). Just wait until they discover they can 
combine them. ☺̰̎! There is also a variety of combining mathematical 
symbols with no pre-combined form, such as ≸. Writing in Arabic, 
Hebrew, Korean, or some other foreign language isn't a prerequisite to 
use combining characters.
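To see why screening inputs for combining characters is non-trivial, here is a small Python sketch (again, illustration only, not D): `unicodedata.combining` reports the canonical combining class, which is non-zero for combining marks, so a decorated smiley is three code points but one user-perceived character:

```python
import unicodedata

def has_combining(s: str) -> bool:
    """True if any code point in s is a combining mark
    (non-zero canonical combining class)."""
    return any(unicodedata.combining(c) for c in s)

# U+263A WHITE SMILING FACE plus two combining marks: one visible character.
smiley = "\u263a\u0330\u030e"
print(len(smiley))              # 3 code points
print(has_combining(smiley))    # True
print(has_combining("hello"))   # False
```

Note this only detects combining marks; real grapheme segmentation (UAX #29) involves more rules than this.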


> Essentially, we would have three levels of types:
> 
> char[], wchar[], dchar[] -- Considered to be arrays in every way.
> string_t!T (string, wstring, dstring) -- Specialized string types that 
> do  normalization to dchars, but do not handle perfectly all graphemes. 
>  Works  with any algorithm that deals with bidirectional ranges.  This 
> is the  default string type, and the type for string literals.  
> Represented  internally by a single char[], wchar[] or dchar[] array.
> * utfstring_t!T -- specialized string to deal with full unicode, which 
> may  perform worse than string_t, but supports everything unicode 
> supports.   May require a battery of specialized algorithms.
> 
> * - name up for discussion
> 
> Also note that phobos currently does *no* normalization as far as I can 
>  tell for things like opEquals.  Two char[]'s that represent equivalent 
>  strings, but not in the same way, will compare as !=.

Basically, you're suggesting that the default way should be to handle 
Unicode *almost* right. And then, if you want to handle things *really* 
right, you need to be explicit about it by using "utfstring_t"? I 
understand your motivation, but it sounds backward to me.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Sat, 15 Jan 2011 15:31:23 -0500, Michel Fortin  
<michel.fortin@michelf.com> wrote:

> On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer"  
> <schveiguy@yahoo.com> said:
>
> ...
>
> Basically, you're suggesting that the default way should be to handle  
> Unicode *almost* right. And then, if you want to handle things *really*  
> right, you need to be explicit about it by using "utfstring_t"? I  
> understand your motivation, but it sounds backward to me.

You make very good points.  I concede that using dchar as the element  
type is not correct for unicode strings.

-Steve
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
Steven Schveighoffer Wrote:

> On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo@bar.com> wrote:
> 
> > Steven Schveighoffer Wrote:
> 
> >> ...
> >>
> >> Essentially, we would have three levels of types:
> >>
> >> char[], wchar[], dchar[] -- Considered to be arrays in every way.
> >> string_t!T (string, wstring, dstring) -- Specialized string types that  
> >> do
> >> normalization to dchars, but do not handle perfectly all graphemes.   
> >> Works
> >> with any algorithm that deals with bidirectional ranges.  This is the
> >> default string type, and the type for string literals.  Represented
> >> internally by a single char[], wchar[] or dchar[] array.
> >> * utfstring_t!T -- specialized string to deal with full unicode, which  
> >> may
> >> perform worse than string_t, but supports everything unicode supports.
> >> May require a battery of specialized algorithms.
> >>
> >> * - name up for discussion
> >>
> >> Also note that phobos currently does *no* normalization as far as I can
> >> tell for things like opEquals.  Two char[]'s that represent equivalent
> >> strings, but not in the same way, will compare as !=.
> >>
> >> -Steve
> >
> > The above compromise provides zero benefit. The proposed default type  
> > string_t is incorrect and will cause bugs. I prefer the standard lib to  
> > not provide normalization at all and force me to use a 3rd party lib  
> > rather than provide an incomplete implementation that will give me a  
> > false sense of correctness and cause very subtle and hard to find bugs.
> 
> I feel like you might be exaggerating, but maybe I'm completely wrong on  
> this, I'm not well-versed in unicode, or even languages that require  
> unicode.  The clear benefit I see is that with a string type which  
> normalizes to canonical code points, you can use this in any algorithm  
> without having it be unicode-aware for *most languages*.  At least, that  
> is how I see it.  I'm looking at it as a code-reuse proposition.
> 
> It's like calendars.  There are quite a few different calendars in  
> different cultures.  But most people use a Gregorian calendar.  So we have  
> three options:
> 
> a) Use a Gregorian calendar, and leave the other calendars to a 3rd party  
> library
> b) Use a complicated calendar system where Gregorian calendars are treated  
> with equal respect to all other calendars, none are the default.
> c) Use a Gregorian calendar by default, but include the other calendars as  
> a separate module for those who wish to use them.
> 
> I'm looking at my proposal as more of a c) solution.
> 

The calendar example is a very good one. What you're saying is equivalent to saying that most people use the Gregorian calendar, but for efficiency and other reasons you don't want to implement February 29th.

> Can you show how normalization causes subtle bugs?
> 

That was already shown by Michel and Spir, where the equality operator is incorrect due to diacritics (the example with exposé). Your solution makes this far worse, since it reduces the bug to far fewer cases, making the problem far less obvious. 
One test with exposé will work, and another test (let's say in Hebrew) will *not* work, and unless the programmer is a Unicode expert (which is very unlikely) he is left scratching his head.

> > Moreover, even if you ignore Hebrew as a tiny insignificant minority,  
> > you cannot do the same for Arabic, which has over one *billion* people  
> > that use that language.
> 
> I hope that the medium type works 'good enough' for those languages, with  
> the high level type needed for advanced usages.  At a minimum, comparison  
> and substring should work for all languages.
> 

As I explained above, 'good enough' in this case is far worse because it masks the problem. Also, if you want comparison to work in all languages, including Hebrew/Arabic, then it simply isn't good enough.

> > I firmly believe that in accordance with D's principle that the default  
> > behavior should be the correct & safe option, D should have the full  
> > unicode type (utfstring_t above) as the default.
> >
> > You need only a subset of the functionality because you only use  
> > English? For the same reason, you don't want the Unicode overhead? Use  
> > an ASCII type instead. In the same vain, a geneticist should use a DNA  
> > sequence type and not Unicode text.
> 
> Or French, or Spanish, or German, etc...
> 
> Look, even the lowest level is valid unicode, but if you want to start  
> extracting individual graphemes, you need more machinery.  In 99% of  
> cases, I'd think you want to use strings as strings, not as sequences of  
> graphemes, or code-units.
> 
> -Steve

I'd like to have full Unicode support. I think it is a good thing for D to have in order to expand in the world. As an alternative, I'd settle for loud errors that make it absolutely clear to the non-Unicode-expert programmer that D simply does NOT support e.g. normalization. 

As Spir already said, Unicode is something few understand, and even its own official docs do not explain such issues properly. We should not confuse users even further with incomplete support.
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Sat, 15 Jan 2011 15:46:11 -0500, foobar <foo@bar.com> wrote:


> I'd like to have full Unicode support. I think it is a good thing for D  
> to have in order to expand in the world. As an alternative, I'd settle  
> for loud errors that make it absolutely clear to the non-Unicode-expert  
> programmer that D simply does NOT support e.g. normalization.
>
> As Spir already said, Unicode is something few understand, and even its  
> own official docs do not explain such issues properly. We should not  
> confuse users even further with incomplete support.

Well said, I've changed my mind.  Thanks for explaining.

-Steve
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" 
<schveiguy@yahoo.com> said:

>> I'm not suggesting we impose it, just that we make it the default. If  
>> you want to iterate by dchar, wchar, or char, just write:
>> 
>> 	foreach (dchar c; "exposé") {}
>> 	foreach (wchar c; "exposé") {}
>> 	foreach (char c; "exposé") {}
>> 	// or
>> 	foreach (dchar c; "exposé".by!dchar()) {}
>> 	foreach (wchar c; "exposé".by!wchar()) {}
>> 	foreach (char c; "exposé".by!char()) {}
>> 
>> and it'll work. But the default would be a slice containing the  
>> grapheme, because this is the right way to represent a Unicode 
>> character.
> 
> I think this is a good idea.  I previously was nervous about it, but 
> I'm  not sure it makes a huge difference.  Returning a char[] is 
> certainly less  work than normalizing a grapheme into one or more code 
> points, and then  returning them.  All that it takes is to detect all 
> the code points within  the grapheme.  Normalization can be done if 
> needed, but would probably  have to output another char[], since a 
> normalized grapheme can occupy more  than one dchar.

I'm glad we agree on that now.


> What if I modified my proposed string_t type to return T[] as its 
> element  type, as you say, and string literals are typed as 
> string_t!(whatever)?   In addition, the restrictions I imposed on 
> slicing a code point actually  get imposed on slicing a grapheme.  That 
> is, it is illegal to substring a  string_t in a way that slices through 
> a grapheme (and by deduction, a code  point)?

I'm not opposed to that on principle. I'm a little uneasy about having 
so many types representing a string however. Some other raw comments:

I agree that things would be more coherent if char[], wchar[], and 
dchar[] behaved like other arrays, but I can't really see a 
justification for those types to be in the language if there's nothing 
special about them (why not a library type?). If strings and arrays of 
code units are distinct, slicing in the middle of a grapheme or in the 
middle of a code point could throw an error, but for performance 
reasons it should probably check for that only when array bounds 
checking is turned on (that would require compiler support however).
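The code-point half of that check is cheap to sketch. In UTF-8, continuation bytes have the bit pattern 10xxxxxx, so a slice boundary is invalid if it lands on one. A Python illustration (not D; and note this only catches mid-code-point slices, while mid-grapheme detection needs the fuller machinery discussed above):

```python
def check_slice_boundary(buf: bytes, i: int) -> None:
    """Raise if index i falls inside a multi-byte UTF-8 sequence.

    Continuation bytes match 10xxxxxx, i.e. (b & 0xC0) == 0x80,
    so a valid boundary must not land on one.
    """
    if 0 < i < len(buf) and (buf[i] & 0xC0) == 0x80:
        raise IndexError(f"slice at byte {i} splits a code point")

s = "expos\u00e9".encode("utf-8")  # 'é' occupies two bytes (0xC3 0xA9)
check_slice_boundary(s, 5)         # ok: index 5 is the start of 'é'
try:
    check_slice_boundary(s, 6)     # lands on the continuation byte of 'é'
except IndexError as e:
    print(e)
```

This is the kind of check that could be compiled in only when bounds checking is on, as Michel suggests.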


> Actually, we would need a grapheme to be its own type, because 
> comparing  two char[]'s that don't contain equivalent bits and having 
> them be equal,  violates the expectation that char[] is an array.
> 
> So the string_t!char would return a grapheme_t!char (names to be  
> discussed) as its element type.

Or you could make a grapheme a string_t. ;-)


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
<michel.fortin@michelf.com> wrote:

> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
> <schveiguy@yahoo.com> said:
>
>>> I'm not suggesting we impose it, just that we make it the default. If   
>>> you want to iterate by dchar, wchar, or char, just write:
>>>  	foreach (dchar c; "exposé") {}
>>> 	foreach (wchar c; "exposé") {}
>>> 	foreach (char c; "exposé") {}
>>> 	// or
>>> 	foreach (dchar c; "exposé".by!dchar()) {}
>>> 	foreach (wchar c; "exposé".by!wchar()) {}
>>> 	foreach (char c; "exposé".by!char()) {}
>>>  and it'll work. But the default would be a slice containing the   
>>> grapheme, because this is the right way to represent a Unicode  
>>> character.
>>  I think this is a good idea.  I previously was nervous about it, but  
>> I'm  not sure it makes a huge difference.  Returning a char[] is  
>> certainly less  work than normalizing a grapheme into one or more code  
>> points, and then  returning them.  All that it takes is to detect all  
>> the code points within  the grapheme.  Normalization can be done if  
>> needed, but would probably  have to output another char[], since a  
>> normalized grapheme can occupy more  than one dchar.
>
> I'm glad we agree on that now.

It's a matter of me slowly wrapping my brain around unicode and how it's  
used.  It seems like it's a typical committee-defined standard where there  
are 10 ways to do everything; I was trying to weed out the lesser-used (or  
so I perceived) pieces to allow a more implementable library.  It's doubly  
hard for me since I have limited experience with other languages, and I've  
never tried to write them with a computer (my language classes in high  
school were back in the days of actually writing stuff down on paper).

I once told a colleague who was on a standards committee that their  
proposed KLV standard (key length value) was ridiculous.  The wise  
committee had decided that in order to avoid future issues, the length  
would be encoded as a single byte if < 128, or 128 + length of the length  
field for anything higher.  This means you could potentially have to parse  
and process a 127-byte integer!
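As I read the scheme Steven describes (it resembles ASN.1 BER definite-length encoding), a sketch in Python looks like this. The function names are my own for illustration:

```python
def encode_length(n: int) -> bytes:
    """One byte if n < 128; otherwise a prefix byte of 0x80 plus the
    byte-count of the length, followed by the length itself in
    big-endian bytes -- up to 127 of them, as the text notes."""
    if n < 128:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    assert len(body) <= 127, "length of length exceeds the scheme's limit"
    return bytes([0x80 + len(body)]) + body

def decode_length(data: bytes) -> tuple[int, int]:
    """Return (length, number of bytes consumed)."""
    first = data[0]
    if first < 128:
        return first, 1
    count = first - 0x80
    return int.from_bytes(data[1:1 + count], "big"), 1 + count

print(encode_length(100).hex())                # '64'
print(encode_length(1000).hex())               # '8203e8'
print(decode_length(bytes.fromhex("8203e8")))  # (1000, 3)
```

The 127-byte worst case comes from the prefix byte reserving 7 bits for the count of length bytes.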

>
>
>> What if I modified my proposed string_t type to return T[] as its  
>> element  type, as you say, and string literals are typed as  
>> string_t!(whatever)?   In addition, the restrictions I imposed on  
>> slicing a code point actually  get imposed on slicing a grapheme.  That  
>> is, it is illegal to substring a  string_t in a way that slices through  
>> a grapheme (and by deduction, a code  point)?
>
> I'm not opposed to that on principle. I'm a little uneasy about having  
> so many types representing a string however. Some other raw comments:
>
> I agree that things would be more coherent if char[], wchar[], and  
> dchar[] behaved like other arrays, but I can't really see a  
> justification for those types to be in the language if there's nothing  
> special about them (why not a library type?).

I would not be opposed to getting rid of those types.  But I am very  
opposed to char[] not being an array.  If you want a string to be  
something other than an array, make it have a different syntax.  We also  
have to consider C compatibility.

However, we are in radical-change mode then, and this is probably pushed  
to D3 ;)  If we can find some way to fix the situation without  
invalidating TDPL, we should strive for that first IMO.

> If strings and arrays of code units are distinct, slicing in the middle  
> of a grapheme or in the middle of a code point could throw an error, but  
> for performance reasons it should probably check for that only when  
> array bounds checking is turned on (that would require compiler support  
> however).

Not really, it could use assert, but that throws an assert error instead  
of a RangeError.  Of course, both are errors and will abort the program.   
I do wish there was a version(noboundscheck) to do this kind of stuff  
with...

>> Actually, we would need a grapheme to be its own type, because  
>> comparing  two char[]'s that don't contain equivalent bits and having  
>> them be equal,  violates the expectation that char[] is an array.
>>  So the string_t!char would return a grapheme_t!char (names to be   
>> discussed) as its element type.
>
> Or you could make a grapheme a string_t. ;-)

I'm a little uneasy having a range return itself as its element type.  For  
all intents and purposes, a grapheme is a string of one 'element', so it  
could potentially be a string_t.

It does seem daunting to have so many types, but at the same time, types  
convey relationships at compile time that can make coding impossible to  
get wrong, or make things actually possible when having a single type  
doesn't.

I'll give you an example from a previous life:

Tango had a type called DateTime.  This type represented *either* a point  
in time, or a span of time (depending on how you used it).  But I proposed  
we switch to two distinct types, one for a point in time, one for a span  
of time.  It was argued that both were so similar, why couldn't we just  
keep one type?  The answer is simple -- having them be separate types  
allows me to express relationships that the compiler enforces.  For  
example, you can add two time spans together, but you can't add two points  
in time together.  Or maybe you want a function to accept a time span  
(like a sleep operation).  If there was only one type, then  
sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;)

I feel that making extra types when the relationship between them is  
important is worth the possible repetition of functionality.  Catching  
bugs during compilation is soooo much better than experiencing them during  
runtime.
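Steve's point-versus-span distinction maps directly onto a split Python's standard library already makes between `datetime` and `timedelta`, which illustrates the payoff (Python here is only an analogy, not the Tango API):

```python
from datetime import datetime, timedelta

nap = timedelta(seconds=30)
print(nap + nap)         # spans add meaningfully: 0:01:00

now = datetime.now()
try:
    now + now            # points in time don't add; the type system says so
except TypeError as e:
    print("rejected:", e)

def sleep_for(span: timedelta) -> None:
    """Accepting only a span makes sleep_for(datetime.now()) nonsensical
    by construction, instead of sleeping for ~2011 years."""
    print(f"sleeping {span.total_seconds()} s")

sleep_for(nap)
```

Two types, one relationship each: span + span is allowed, point + point is rejected before it becomes a runtime surprise.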

-Steve
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
Steven Schveighoffer Wrote:

> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
> <michel.fortin@michelf.com> wrote:
> 
> > On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
> > <schveiguy@yahoo.com> said:
> >
> > ...
> 
> >> Actually, we would need a grapheme to be its own type, because  
> >> comparing  two char[]'s that don't contain equivalent bits and having  
> >> them be equal,  violates the expectation that char[] is an array.
> >>  So the string_t!char would return a grapheme_t!char (names to be   
> >> discussed) as its element type.
> >
> > Or you could make a grapheme a string_t. ;-)
> 
> I'm a little uneasy having a range return itself as its element type.  For  
> all intents and purposes, a grapheme is a string of one 'element', so it  
> could potentially be a string_t.
> 
> It does seem daunting to have so many types, but at the same time, types  
> convey relationships at compile time that can make coding impossible to  
> get wrong, or make things actually possible when having a single type  
> doesn't.
> 
> I'll give you an example from a previous life:
> 
> Tango had a type called DateTime.  This type represented *either* a point  
> in time, or a span of time (depending on how you used it).  But I proposed  
> we switch to two distinct types, one for a point in time, one for a span  
> of time.  It was argued that both were so similar, why couldn't we just  
> keep one type?  The answer is simple -- having them be separate types  
> allows me to express relationships that the compiler enforces.  For  
> example, you can add two time spans together, but you can't add two points  
> in time together.  Or maybe you want a function to accept a time span  
> (like a sleep operation).  If there was only one type, then  
> sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;)
> 
> I feel that making extra types when the relationship between them is  
> important is worth the possible repetition of functionality.  Catching  
> bugs during compilation is soooo much better than experiencing them during  
> runtime.
> 
> -Steve

I like Michel's proposed semantics and I also agree with you that it should be a distinct string type and not break consistency of regular arrays. 

Regarding your last point: do you mean that a grapheme would be a sub-type of string (a specialization where the string represents a single element)? If so, then it sounds good to me.
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" 
<schveiguy@yahoo.com> said:

> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
> <michel.fortin@michelf.com> wrote:
> 
>> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
>> <schveiguy@yahoo.com> said:
>> 
>>>> I'm not suggesting we impose it, just that we make it the default. If   
>>>> you want to iterate by dchar, wchar, or char, just write:
>>>>  	foreach (dchar c; "exposé") {}
>>>> 	foreach (wchar c; "exposé") {}
>>>> 	foreach (char c; "exposé") {}
>>>> 	// or
>>>> 	foreach (dchar c; "exposé".by!dchar()) {}
>>>> 	foreach (wchar c; "exposé".by!wchar()) {}
>>>> 	foreach (char c; "exposé".by!char()) {}
>>>>  and it'll work. But the default would be a slice containing the   
>>>> grapheme, because this is the right way to represent a Unicode  
>>>> character.
>>>  I think this is a good idea.  I previously was nervous about it, but  
>>> I'm  not sure it makes a huge difference.  Returning a char[] is  
>>> certainly less  work than normalizing a grapheme into one or more code  
>>> points, and then  returning them.  All that it takes is to detect all  
>>> the code points within  the grapheme.  Normalization can be done if  
>>> needed, but would probably  have to output another char[], since a  
>>> normalized grapheme can occupy more  than one dchar.
>> 
>> I'm glad we agree on that now.
> 
> It's a matter of me slowly wrapping my brain around Unicode and how 
> it's used.  It seems like a typical committee-defined standard where 
> there are 10 ways to do everything; I was trying to weed out the 
> lesser-used (or so I perceived) pieces to allow a more implementable 
> library.  It's doubly hard for me since I have limited experience with 
> other languages, and I've never tried to write them with a computer 
> (my language classes in high school were back in the days of actually 
> writing stuff down on paper).

Actually, I don't think Unicode was so badly designed. It's just that 
nobody had an idea of the real scope of the problem at hand at first, 
and so they had to add a lot of things while keeping things 
backward-compatible. We're at Unicode 6.0 now; can you name one other 
standard that has evolved through six major versions? I'm surprised 
it's not worse given all that it must support.

That said, I'm sure if someone could redesign Unicode by breaking 
backward-compatibility we'd have something simpler. You could probably 
get rid of pre-combined characters and reduce the number of 
normalization forms. But would you be able to get rid of normalization 
entirely? I don't think so. Reinventing Unicode is probably not worth 
it.
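To make the normalization issue concrete, here is a small Python sketch using the standard `unicodedata` module (Python rather than D, purely for illustration): the pre-combined and decomposed spellings of "é" are distinct code-point sequences, and only compare equal after canonical normalization.

```python
import unicodedata

precombined = "\u00e9"   # 'é' as a single pre-combined code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# A raw code-point comparison sees two different strings,
# even though both render as "é".
assert precombined != decomposed

# After canonical normalization (NFC or NFD) they compare equal.
assert unicodedata.normalize("NFC", decomposed) == precombined
assert unicodedata.normalize("NFD", precombined) == decomposed
```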


>> I'm not opposed to that on principle. I'm a little uneasy about having  
>> so many types representing a string however. Some other raw comments:
>> 
>> I agree that things would be more coherent if char[], wchar[], and  
>> dchar[] behaved like other arrays, but I can't really see a  
>> justification for those types to be in the language if there's nothing  
>> special about them (why not a library type?).
> 
> I would not be opposed to getting rid of those types.  But I am very  
> opposed to char[] not being an array.  If you want a string to be  
> something other than an array, make it have a different syntax.  We 
> also  have to consider C compatibility.
> 
> However, we are in radical-change mode then, and this is probably 
> pushed  to D3 ;)  If we can find some way to fix the situation without  
> invalidating TDPL, we should strive for that first IMO.

Indeed, the change would probably be too radical for D2.

I think we agree that the default type should behave as a Unicode 
string, not an array of characters. I understand your opposition to 
conflating arrays of char with strings, and I agree with you to a 
certain extent that it could have been done better. But we can't really 
change the type of string literals, can we? The only thing we can 
change (I hope) at this point is how iterating on strings works.

Walter said earlier that he opposes changing foreach's default element 
type to dchar for char[] and wchar[] (as Andrei did for ranges) on the 
grounds that it would silently break D1 compatibility. This is a valid 
point in my opinion.

I think you're right when you say that not treating char[] as an array 
of characters breaks, to a certain extent, C compatibility. Another 
valid point.

That said, I want to emphasize that iterating by grapheme, contrary to 
iterating by dchar, does not break any code *silently*. The compiler 
will complain loudly that you're comparing a string to a char, so 
you'll have to change your code somewhere if you want things to 
compile. You'll have to look at the code and decide what to do.

One more thing:

NSString in Cocoa is in essence the same thing as I'm proposing here: 
an array of UTF-16 code units, but with string behaviour. It supports 
by-code-unit indexing, but appending, comparing, searching for 
substrings, etc. all behave correctly as a Unicode string. Again, I 
agree that it's probably not the best design, but I can tell you it 
works well in practice. In fact, NSString doesn't even expose the 
concept of grapheme; it just uses them internally, and you're pretty 
much limited to the built-in operations. I think what we have here in 
concept is much better... even if it somewhat conflates code-unit 
arrays and strings.


>> Or you could make a grapheme a string_t. ;-)
> 
> I'm a little uneasy having a range return itself as its element type.  
> For  all intents and purposes, a grapheme is a string of one 'element', 
> so it  could potentially be a string_t.
> 
> It does seem daunting to have so many types, but at the same time, 
> types  convey relationships at compile time that can make coding 
> impossible to  get wrong, or make things actually possible when having 
> a single type  doesn't.
> 
> I'll give you an example from a previous life:
> 
> [...]
> I feel that making extra types when the relationship between them is  
> important is worth the possible repetition of functionality.  Catching  
> bugs during compilation is soooo much better than experiencing them 
> during  runtime.

I can understand the utility of a separate type in your DateTime 
example, but in this case I fail to see any advantage.

I mean, a grapheme is a slice of a string, can have multiple code 
points (like a string), can be appended the same way as a string, can 
be composed or decomposed using canonical normalization or 
compatibility normalization (like a string), and should be sorted, 
uppercased, and lowercased according to Unicode rules (like a string). 
Basically, a grapheme is just a string that happens to contain only one 
grapheme. What would a custom type do differently than a string?

Also, grapheme == "a" is easy to understand because both are strings. 
But if a grapheme is a separate type, what would a grapheme literal 
look like?

So in the end I don't think a grapheme needs a specific type, at least 
not for general purpose text processing. If I split a string on 
whitespace, do I get a range where elements are of type "word"? No, 
just sliced strings.
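The idea that each grapheme is just a slice of the parent string can be sketched as follows. This is an illustrative Python approximation that groups trailing combining marks with their base code point via `unicodedata.combining`; full grapheme segmentation per UAX #29 has more rules (Hangul jamo, ZWJ sequences, etc.):

```python
import unicodedata

def graphemes(s):
    """Yield slices of s, one per approximate grapheme cluster.

    Simplification: a cluster is one base code point plus any trailing
    combining marks; real UAX #29 segmentation covers more cases.
    """
    i, n = 0, len(s)
    while i < n:
        j = i + 1
        while j < n and unicodedata.combining(s[j]) != 0:
            j += 1
        yield s[i:j]  # each grapheme is itself just a small string
        i = j

# "expose" + combining acute: the last grapheme is a 2-code-point slice.
assert list(graphemes("expose\u0301")) == ["e", "x", "p", "o", "s", "e\u0301"]
```

No separate grapheme type is involved: splitting yields plain sliced strings, just as splitting on whitespace yields plain strings rather than a "word" type.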

That said, I'm much less concerned by the type used to represent a 
grapheme than by the Unicode correctness. I'm not opposed to a separate 
type, I just don't really see the point.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 16, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/15/11 4:45 PM, Michel Fortin wrote:
> On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"
> <schveiguy@yahoo.com> said:
>
>> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
>> <michel.fortin@michelf.com> wrote:
>>
>>> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
>>> <schveiguy@yahoo.com> said:
>>>
>>>>> I'm not suggesting we impose it, just that we make it the default.
>>>>> If you want to iterate by dchar, wchar, or char, just write:
>>>>> foreach (dchar c; "exposé") {}
>>>>> foreach (wchar c; "exposé") {}
>>>>> foreach (char c; "exposé") {}
>>>>> // or
>>>>> foreach (dchar c; "exposé".by!dchar()) {}
>>>>> foreach (wchar c; "exposé".by!wchar()) {}
>>>>> foreach (char c; "exposé".by!char()) {}
>>>>> and it'll work. But the default would be a slice containing the
>>>>> grapheme, because this is the right way to represent a Unicode
>>>>> character.
>>>> I think this is a good idea. I previously was nervous about it, but
>>>> I'm not sure it makes a huge difference. Returning a char[] is
>>>> certainly less work than normalizing a grapheme into one or more
>>>> code points, and then returning them. All that it takes is to detect
>>>> all the code points within the grapheme. Normalization can be done
>>>> if needed, but would probably have to output another char[], since a
>>>> normalized grapheme can occupy more than one dchar.
>>>
>>> I'm glad we agree on that now.
>>
>> It's a matter of me slowly wrapping my brain around Unicode and how
>> it's used. It seems like a typical committee-defined standard
>> where there are 10 ways to do everything; I was trying to weed out the
>> lesser-used (or so I perceived) pieces to allow a more implementable
>> library. It's doubly hard for me since I have limited experience with
>> other languages, and I've never tried to write them with a computer
>> (my language classes in high school were back in the days of actually
>> writing stuff down on paper).
>
> Actually, I don't think Unicode was so badly designed. It's just that
> nobody had an idea of the real scope of the problem at hand at first,
> and so they had to add a lot of things while keeping things
> backward-compatible. We're at Unicode 6.0 now; can you name one other
> standard that has evolved through six major versions? I'm surprised it's
> not worse given all that it must support.
>
> That said, I'm sure if someone could redesign Unicode by breaking
> backward-compatibility we'd have something simpler. You could probably
> get rid of pre-combined characters and reduce the number of
> normalization forms. But would you be able to get rid of normalization
> entirely? I don't think so. Reinventing Unicode is probably not worth it.
>
>
>>> I'm not opposed to that on principle. I'm a little uneasy about
>>> having so many types representing a string however. Some other raw
>>> comments:
>>>
>>> I agree that things would be more coherent if char[], wchar[], and
>>> dchar[] behaved like other arrays, but I can't really see a
>>> justification for those types to be in the language if there's
>>> nothing special about them (why not a library type?).
>>
>> I would not be opposed to getting rid of those types. But I am very
>> opposed to char[] not being an array. If you want a string to be
>> something other than an array, make it have a different syntax. We
>> also have to consider C compatibility.
>>
>> However, we are in radical-change mode then, and this is probably
>> pushed to D3 ;) If we can find some way to fix the situation without
>> invalidating TDPL, we should strive for that first IMO.
>
> Indeed, the change would probably be too radical for D2.
>
> I think we agree that the default type should behave as a Unicode
> string, not an array of characters. I understand your opposition to
> conflating arrays of char with strings, and I agree with you to a
> certain extent that it could have been done better. But we can't really
> change the type of string literals, can we? The only thing we can change
> (I hope) at this point is how iterating on strings works.
>
> Walter said earlier that he opposes changing foreach's default element
> type to dchar for char[] and wchar[] (as Andrei did for ranges) on the
> grounds that it would silently break D1 compatibility. This is a valid
> point in my opinion.
>
> I think you're right when you say that not treating char[] as an array
> of characters breaks, to a certain extent, C compatibility. Another valid
> point.
>
> That said, I want to emphasize that iterating by grapheme, contrary to
> iterating by dchar, does not break any code *silently*. The compiler
> will complain loudly that you're comparing a string to a char, so you'll
> have to change your code somewhere if you want things to compile. You'll
> have to look at the code and decide what to do.
>
> One more thing:
>
> NSString in Cocoa is in essence the same thing as I'm proposing here: an
> array of UTF-16 code units, but with string behaviour. It supports
> by-code-unit indexing, but appending, comparing, searching for
> substrings, etc. all behave correctly as a Unicode string. Again, I
> agree that it's probably not the best design, but I can tell you it
> works well in practice. In fact, NSString doesn't even expose the
> concept of grapheme; it just uses them internally, and you're pretty
> much limited to the built-in operations. I think what we have here in
> concept is much better... even if it somewhat conflates code-unit arrays
> and strings.

I'm unclear on where this is converging to. At this point the commitment 
of the language and its standard library to (a) UTF array representation 
and (b) code-point conceptualization is quite strong. Changing that 
would be quite difficult and disruptive, and the benefits are virtually 
nonexistent for most of D's user base.

It may be more realistic to consider using what we have as back-end for 
grapheme-oriented processing. For example:

struct Grapheme(Char) if (isSomeChar!Char)
{
    private const Char[] rep;
    ...
}

auto byGrapheme(S)(S s) if (isSomeString!S)
{
   ...
}

string s = "Hello";
foreach (g; byGrapheme(s))
{
    ...
}
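A rough executable analogue of this back-end idea, in Python for illustration (the names `Grapheme` and `by_grapheme` mirror the D sketch above but are hypothetical; the combining-mark grouping is a simplification of full UAX #29 segmentation):

```python
import unicodedata

class Grapheme:
    """Wrapper over a slice of the source string, like the `rep` field above."""
    def __init__(self, rep):
        self.rep = rep

    def __eq__(self, other):
        other_rep = other.rep if isinstance(other, Grapheme) else other
        # Compare in canonical form, so composed and decomposed
        # spellings of the same character match.
        return (unicodedata.normalize("NFC", self.rep) ==
                unicodedata.normalize("NFC", other_rep))

def by_grapheme(s):
    """Simplified segmentation: one base code point plus trailing combining marks."""
    i, n = 0, len(s)
    while i < n:
        j = i + 1
        while j < n and unicodedata.combining(s[j]) != 0:
            j += 1
        yield Grapheme(s[i:j])
        i = j

# Finds "é" whether the haystack used the decomposed or pre-combined form.
assert any(g == "\u00e9" for g in by_grapheme("expose\u0301"))
```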

Andrei
January 16, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
> 
> <lutger.blijdestijn@gmail.com> said:
> > Nick Sabalausky wrote:
> >> "Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message
> >> news:ignon1$2p4k$1@digitalmars.com...
> >> 
> >>> This may sometimes not be what the user expected; most of the time
> >>> they'd care about the code points.
> >> 
> >> I dunno, spir has successfully convinced me that most of the time it's
> >> graphemes the user cares about, not code points. Using code points is
> >> just as misleading as using UTF-16 code units.
> > 
> > I agree. This is a very informative thread, thanks spir and everybody
> > else.
> > 
> > Going back to the topic, it seems to me that a unicode string is a
> > surprisingly complicated data structure that can be viewed from multiple
> > types of ranges. In the light of this thread, a dchar doesn't seem like
> > such a useful type anymore, it is still a low level abstraction for the
> > purpose of correctly dealing with text. Perhaps even less useful, since
> > it gives the illusion of correctness for those who are not in the know.
> > 
> > The algorithms in std.string can be upgraded to work correctly with all
> > the issues mentioned, but the generic ones in std.algorithm will just
> > subtly do the wrong thing when presented with dchar ranges. And, as I
> > understood it, the purpose of a VleRange was exactly to make generic
> > algorithms just work (tm) for strings.
> > 
> > Is it still possible to solve this problem or are we stuck with
> > specialized string algorithms? Would it work if VleRange of string was a
> > bidirectional range with string slices of graphemes as the ElementType
> > and indexing with code units? Often used string algorithms could be
> > specialized for performance, but if not, generic algorithms would still
> > work.
> 
> Here's my idea.
> 
> I think it'd be a good idea to improve upon Andrei's first idea --
> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
> elements -- by changing the element type to be the same as the string.
> For instance, iterating on a char[] would give you slices of char[],
> each having one grapheme.
> 
> The second component would be to make the string equality operator (==)
> for strings compare them in their normalized form, so that ("e" with
> combining acute accent) == (pre-combined "é"). I think this would make
> D support for Unicode much more intuitive.
> 
> This implies some semantic changes, mainly that everywhere you write a
> "character" you must use double-quotes (string "a") instead of single
> quote (code point 'a'), but from the user's point of view that's pretty
> much all there is to change.
> 
> There'll still be plenty of room for specialized algorithms, but their
> purpose would be limited to optimization. Correctness would be taken
> care of by the basic range interface, and foreach should follow suit
> and iterate by grapheme by default.
> 
> I wrote this example (or something similar) earlier in this thread:
> 
> 	foreach (grapheme; "exposé")
> 		if (grapheme == "é")
> 			break;
> 
> In this example, even if one of these two strings use the pre-combined
> form of "é" and the other uses a combining acute accent, the equality
> would still hold since foreach iterates on full graphemes and ==
> compares using normalization.
> 
> The important thing to keep in mind here is that the grapheme-splitting
> algorithm should be optimized for the case where there is no combining
> character and the compare algorithm for the case where the string is
> already normalized, since most strings will exhibit these
> characteristics.
> 
> As for ASCII, we could make it easier to use ubyte[] for it by making
> string literals implicitly convert to ubyte[] if all their characters
> are in ASCII range.

I think that that would cause definite problems. Having the element type of the 
range be the same type as the range seems like it could cause a lot of problems 
in std.algorithm and the like, and it's _definitely_ going to confuse 
programmers. I'd expect it to be highly bug-prone. They _need_ to be separate 
types.

Now, given that dchar can't actually work completely as an element type, you'd 
either need the string type to be a new type or the element type to be a new 
type. So, either the string type has char[], wchar[], or dchar[] for its element 
type, or char[], wchar[], and dchar[] have something like uchar as their element 
type, where uchar is a struct which contains a char[], wchar[], or dchar[] which 
holds a single grapheme.

I think that it's a great idea that programmers try to use substrings and slices 
rather than dchar, but making the element type a slice of the original type sounds 
like it's really asking for trouble.

- Jonathan M Davis