January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin  
<michel.fortin@michelf.com> wrote:

> On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer"  
> <schveiguy@yahoo.com> said:
>
>> On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir@gmail.com> wrote:
>>
>>> The point is not playing like that with Unicode flexibility. Rather,  
>>> composite characters are just normal thingies in most languages of the  
>>> world. Actually, on this point, English is a rare exception (discarding  
>>> letters imported from foreign languages like French 'à'); to the point  
>>> of being, I guess, the only western language without any diacritic.
>>  Is it common to have multiple modifiers on a single character?
>
> Not to my knowledge. But I rarely deal with non-latin texts; there are  
> probably some scripts out there that take advantage of this.
>
>
>> The  problem I see with using decomposed canonical form for strings is  
>> that we  would have to return a dchar[] for each 'element', which  
>> severely  complicates code that, for instance, only expects to handle  
>> English.
>
> Actually, returning a sliced char[] or wchar[] could also be valid.  
> User-perceived characters are basically a substring of one or more code  
> points. I'm not sure it complicates that much the semantics of the  
> language -- what's complicated about writing str.front == "a" instead of  
> str.front == 'a'? -- although it probably would complicate the generated  
> code and make it a little slower.

Hm... this pushes the normalization outside the type, and into the  
algorithms (such as find).  I was hoping to avoid that.  I think I can  
come up with an algorithm that normalizes into canonical form as it  
iterates.  It just might return part of a grapheme if the grapheme cannot  
be composed.

I do think that we could make a byGrapheme member to aid in this:

foreach(grapheme; s.byGrapheme) // grapheme is a substring that contains one composed grapheme
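The idea can be illustrated outside D. Below is a minimal Python sketch using only the standard unicodedata module (Python stands in for the D machinery here; the name by_grapheme is hypothetical, mirroring the member proposed above). It attaches combining marks to their base character so that each yielded substring is one user-perceived character:

```python
import unicodedata

def by_grapheme(s):
    """Naive grapheme segmentation: attach combining marks (combining
    class != 0) to the preceding base character, yielding each
    user-perceived character as a substring.  Full Unicode grapheme
    clustering (UAX #29) has more rules; this shows only the core idea."""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch):
            cluster += ch                 # mark joins the current cluster
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

s = "expose\u0301"                        # 'e' + COMBINING ACUTE ACCENT
print(len(s))                             # 7 code points
print(len(list(by_grapheme(s))))          # 6 user-perceived characters
```

Note the last yielded element is the two-code-point substring "e\u0301", exactly the "substring of one or more code points" described above.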

>
> In the case of NSString in Cocoa, you can only access the 'characters'  
> in their UTF-16 form. But everything from comparison to search for  
> substring is done using graphemes. It's like they implemented  
> specialized Unicode-aware algorithms for these functions. There's no  
> genericness about how it handles graphemes.
>
> I'm not sure yet about what would be the right approach for D.

I hope we can use generic versions, so the type itself handles the  
conversions.  That makes any algorithm using the string range correct.

>> I was hoping to lazily transform a string into its composed canonical   
>> form, allowing the (hopefully rare) exception when a composed character  
>>  does not exist.  My thinking was that this at least gives a useful  
>> string  representation for 90% of usages, leaving the remaining 10% of  
>> usages to  find a more complex representation (like your Text type).   
>> If we only get  like 20% or 30% there by making dchar the element type,  
>> then we haven't  made it useful enough.
>>  Either way, we need a string type that can be compared canonically  
>> for  things like searches or opEquals.
>
> I wonder if normalized string comparison shouldn't be built directly in  
> the char[] wchar[] and dchar[] types instead.

No, in my vision of how strings should be typed, char[] is an array, not a  
string.  It should be treated like an array of code-units, where two forms  
that create the same grapheme are considered different.

> Also bring the idea above that iterating on a string would yield  
> graphemes as char[] and this code would work perfectly irrespective of  
> whether you used combining characters:
>
> 	foreach (grapheme; "exposé") {
> 		if (grapheme == "é")
> 			break;
> 	}
>
> I think a good standard to evaluate our handling of Unicode is to see  
> how easy it is to do things the right way. In the above, foreach would  
> slice the string grapheme by grapheme, and the == operator would perform  
> a normalized comparison. While it works correctly, it's probably not the  
> most efficient way to do things, however.

I think this is a good alternative, but I'd rather not impose this on  
people like myself who deal mostly with English.  I think this should be  
possible to do with wrapper types or intermediate ranges which have  
graphemes as elements (per my suggestion above).

Does this sound reasonable?

-Steve
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
Steven Schveighoffer wrote:

...
>> I think a good standard to evaluate our handling of Unicode is to see
>> how easy it is to do things the right way. In the above, foreach would
>> slice the string grapheme by grapheme, and the == operator would perform
>> a normalized comparison. While it works correctly, it's probably not the
>> most efficient way to do thing however.
> 
> I think this is a good alternative, but I'd rather not impose this on
> people like myself who deal mostly with English.  I think this should be
> possible to do with wrapper types or intermediate ranges which have
> graphemes as elements (per my suggestion above).
> 
> Does this sound reasonable?
> 
> -Steve

If it's a matter of choosing which is the 'default' range, I'd think proper 
Unicode handling is more reasonable than catering for English / ASCII only. 
Especially since this is already the case in Phobos string algorithms.
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn  
<lutger.blijdestijn@gmail.com> wrote:

> Steven Schveighoffer wrote:
>
> ...
>>> I think a good standard to evaluate our handling of Unicode is to see
>>> how easy it is to do things the right way. In the above, foreach would
>>> slice the string grapheme by grapheme, and the == operator would  
>>> perform
>>> a normalized comparison. While it works correctly, it's probably not  
>>> the
>>> most efficient way to do thing however.
>>
>> I think this is a good alternative, but I'd rather not impose this on
>> people like myself who deal mostly with English.  I think this should be
>> possible to do with wrapper types or intermediate ranges which have
>> graphemes as elements (per my suggestion above).
>>
>> Does this sound reasonable?
>>
>> -Steve
>
> If its a matter of choosing which is the 'default' range, I'd think  
> proper
> unicode handling is more reasonable than catering for english / ascii  
> only.
> Especially since this is already the case in phobos string algorithms.

English and (if I understand correctly) most other languages.  Any  
language which can be built from composable graphemes would work.  And in  
fact, ones that use some graphemes that cannot be composed will also work  
to some degree (for example, opEquals).

What I'm proposing (or think I'm proposing) is not exactly catering to  
English and ASCII, what I'm proposing is simply not catering to more  
complex languages such as Hebrew and Arabic.  What I'm trying to find is a  
middle ground where most languages work, and the code is simple and  
efficient, with possibilities to jump down to lower levels for performance  
(i.e. switch to char[] when you know ASCII is all you are using) or jump  
up to full unicode when necessary.

Essentially, we would have three levels of types:

char[], wchar[], dchar[] -- Considered to be arrays in every way.
string_t!T (string, wstring, dstring) -- Specialized string types that do  
normalization to dchars, but do not handle perfectly all graphemes.  Works  
with any algorithm that deals with bidirectional ranges.  This is the  
default string type, and the type for string literals.  Represented  
internally by a single char[], wchar[] or dchar[] array.
* utfstring_t!T -- specialized string to deal with full unicode, which may  
perform worse than string_t, but supports everything unicode supports.   
May require a battery of specialized algorithms.

* - name up for discussion

Also note that phobos currently does *no* normalization as far as I can  
tell for things like opEquals.  Two char[]'s that represent equivalent  
strings, but not in the same way, will compare as !=.

-Steve
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
Michel Fortin Wrote:

> What I don't understand is in what way using a string type would make 
> the API less complex and use less templates?
> 
> More generally, in what way would your string type behave differently 
> than char[], wchar[], and dchar[]? I think we need to clarify how 
> you expect your string type to behave before I can answer anything. I 
> mean, beside cosmetic changes such as having a codePoint property 
> instead of by!dchar or byDchar, what is your string type doing 
> differently?
> 
> The above algorithm is already possible with strings as they are, 
> provided you implement the 'decompose' and the 'compose' function 
> returning a range. In fact, you only changed two things in it: by!dchar 
> became codePoints, and array() became string(). Surely you're expecting 
> more benefits than that.
> 
> -- 
> Michel Fortin
> michel.fortin@michelf.com
> http://michelf.com/
> 

First thing, the question of possibility is irrelevant, since I could also write the same algorithm in brainfuck or assembly (with a lot more code). It's never a question of possibility but rather a question of ease of use for the user. 

What I want is to encapsulate all the low-level implementation details in one place so that, as a user, I will not need to deal with them everywhere. One such detail is the encoding. 

auto text = w"whatever"; // should be equivalent to:
auto text = new Text("whatever", Encoding.UTF16);

now I want to write my own string function:

void func(Text a); // instead of current:
void func(T)(T a) if isTextType(T); // why the USER needs all this?

Of course, the Text type would do the correct thing by default, which we both agree should be graphemes. Only if I need something advanced, like in the previous algorithm, do I explicitly need to specify that I work on code points or code units. 

In a sentence: "Make the common case trivial and the complex case possible". The common case is what we humans think of as characters (graphemes), and the complex case is the encoding level.
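That division of labor can be sketched outside D. The hypothetical Text class below (Python for illustration only; none of these names are the proposed D API) makes the grapheme view the default and the code-point view explicit:

```python
import unicodedata

class Text:
    """Hypothetical Text type sketching the proposal above: the default
    iteration yields graphemes (user-perceived characters); lower-level
    views such as code points must be requested explicitly."""
    def __init__(self, s):
        self._s = s

    def __iter__(self):                   # the common case: graphemes
        cluster = ""
        for ch in self._s:
            if cluster and unicodedata.combining(ch):
                cluster += ch             # combining mark joins the cluster
            else:
                if cluster:
                    yield cluster
                cluster = ch
        if cluster:
            yield cluster

    def code_points(self):                # the complex case, made explicit
        return iter(self._s)

def func(text):                           # takes a Text; no template needed
    return sum(1 for _ in text)

t = Text("nai\u0308ve")                   # 'i' + COMBINING DIAERESIS
print(func(t))                            # 5 user-perceived characters
print(len(list(t.code_points())))         # 6 code points
```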
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
Steven Schveighoffer Wrote:

> On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn  
> <lutger.blijdestijn@gmail.com> wrote:
> 
> > Steven Schveighoffer wrote:
> >
> > ...
> >>> I think a good standard to evaluate our handling of Unicode is to see
> >>> how easy it is to do things the right way. In the above, foreach would
> >>> slice the string grapheme by grapheme, and the == operator would  
> >>> perform
> >>> a normalized comparison. While it works correctly, it's probably not  
> >>> the
> >>> most efficient way to do thing however.
> >>
> >> I think this is a good alternative, but I'd rather not impose this on
> >> people like myself who deal mostly with English.  I think this should be
> >> possible to do with wrapper types or intermediate ranges which have
> >> graphemes as elements (per my suggestion above).
> >>
> >> Does this sound reasonable?
> >>
> >> -Steve
> >
> > If its a matter of choosing which is the 'default' range, I'd think  
> > proper
> > unicode handling is more reasonable than catering for english / ascii  
> > only.
> > Especially since this is already the case in phobos string algorithms.
> 
> English and (if I understand correctly) most other languages.  Any  
> language which can be built from composable graphemes would work.  And in  
> fact, ones that use some graphemes that cannot be composed will also work  
> to some degree (for example, opEquals).
> 
> What I'm proposing (or think I'm proposing) is not exactly catering to  
> English and ASCII, what I'm proposing is simply not catering to more  
> complex languages such as Hebrew and Arabic.  What I'm trying to find is a  
> middle ground where most languages work, and the code is simple and  
> efficient, with possibilities to jump down to lower levels for performance  
> (i.e. switch to char[] when you know ASCII is all you are using) or jump  
> up to full unicode when necessary.
> 
> Essentially, we would have three levels of types:
> 
> char[], wchar[], dchar[] -- Considered to be arrays in every way.
> string_t!T (string, wstring, dstring) -- Specialized string types that do  
> normalization to dchars, but do not handle perfectly all graphemes.  Works  
> with any algorithm that deals with bidirectional ranges.  This is the  
> default string type, and the type for string literals.  Represented  
> internally by a single char[], wchar[] or dchar[] array.
> * utfstring_t!T -- specialized string to deal with full unicode, which may  
> perform worse than string_t, but supports everything unicode supports.   
> May require a battery of specialized algorithms.
> 
> * - name up for discussion
> 
> Also note that phobos currently does *no* normalization as far as I can  
> tell for things like opEquals.  Two char[]'s that represent equivalent  
> strings, but not in the same way, will compare as !=.
> 
> -Steve

The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I would prefer that the standard lib not provide normalization at all and force me to use a 3rd-party lib, rather than provide an incomplete implementation that gives me a false sense of correctness and causes very subtle and hard-to-find bugs.

Moreover, even if you ignore Hebrew as a tiny, insignificant minority, you cannot do the same for Arabic, which has over one *billion* people that use that language. 

I firmly believe that in accordance with D's principle that the default behavior should be the correct & safe option, D should have the full unicode type (utfstring_t above) as the default. 

You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Use an ASCII type instead. In the same vein, a geneticist should use a DNA sequence type and not Unicode text.
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
Nick Sabalausky wrote:
> Those *both* get rendered exactly the same, and both represent the same 
> four-letter sequence. In the second example, the 'u' and the {umlaut 
> combining character} combine to form one grapheme. The f's and n's just 
> happen to be single-code-point graphemes.

I know some German, and to the best of my knowledge there are zero combining 
characters for it. The umlauts and the ß both have their own code points.

> legend has it there are others that can only be 
> represented using a combining character.

??? I've never seen or heard of any. Not even in the old script that was in 
common use in Germany until after WW2.
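Both posters are right in part, and the point is checkable. A Python illustration (standard unicodedata module; not D code): the umlaut does have its own precomposed code point, yet the base-plus-combining-mark spelling is equally valid Unicode, and NFC maps the latter to the former.

```python
import unicodedata

precomposed = "\u00fc"            # LATIN SMALL LETTER U WITH DIAERESIS
combined    = "u\u0308"           # 'u' + COMBINING DIAERESIS

print(unicodedata.name(precomposed))   # the umlaut has its own code point
print(precomposed == combined)         # False: different code point sequences

# NFC recomposes the combining form into the precomposed code point:
print(unicodedata.normalize("NFC", combined) == precomposed)   # True
```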
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer" 
<schveiguy@yahoo.com> said:

> On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin  
> <michel.fortin@michelf.com> wrote:
> 
>> Actually, returning a sliced char[] or wchar[] could also be valid.  
>> User-perceived characters are basically a substring of one or more code 
>>  points. I'm not sure it complicates that much the semantics of the  
>> language -- what's complicated about writing str.front == "a" instead 
>> of  str.front == 'a'? -- although it probably would complicate the 
>> generated  code and make it a little slower.
> 
> Hm... this pushes the normalization outside the type, and into the  
> algorithms (such as find).
> 
> I was hoping to avoid that.

Not really. It pushes the normalization to the string comparison 
operator, as explained later.


> I think I can  come up with an algorithm that normalizes into canonical 
> form as it  iterates.  It just might return part of a grapheme if the 
> grapheme cannot  be composed.

The problem with normalization while iterating is that you lose 
information about which code points were actually part of the grapheme. If 
you wanted to count the graphemes containing a particular code point, 
you've lost that information.

Moreover, if all you want is to count the number of graphemes, 
normalizing the characters is a waste of time.
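The information-loss point can be demonstrated concretely (Python with the standard unicodedata module, as an illustration only): after eager NFC composition, a search for the combining acute accent no longer finds it, even though it was present in the source text.

```python
import unicodedata

original = "cafe\u0301"                       # decomposed: 'e' + U+0301
composed = unicodedata.normalize("NFC", original)

print("\u0301" in original)    # True: the combining mark is present
print("\u0301" in composed)    # False: NFC folded it into precomposed 'é'
```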

I suggested in another post that we implement ranges for decomposing 
and recomposing on-the-fly a string in its normalized form. That's 
basically the same thing as you suggest, but it'd have to be explicit 
to avoid the problem above.
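Such an explicit, on-the-fly recomposing range could look like the following minimal Python sketch (unicodedata stands in for the D machinery; composed_graphemes is a hypothetical name, and this is not UAX #29-complete). It yields NFC-composed graphemes lazily, one cluster at a time, without materializing the whole normalized string:

```python
import unicodedata

def composed_graphemes(s):
    """Lazily yield NFC-composed graphemes of s, one cluster at a time.
    Combining marks (combining class != 0) are grouped with their base
    character, then each cluster is composed on the fly."""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch):
            cluster += ch
        else:
            if cluster:
                yield unicodedata.normalize("NFC", cluster)
            cluster = ch
    if cluster:
        yield unicodedata.normalize("NFC", cluster)

result = list(composed_graphemes("noe\u0308l"))   # 'e' + combining diaeresis
print(result)    # the middle cluster comes out as the single code point 'ë'
```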


>> I wonder if normalized string comparison shouldn't be built directly in 
>>  the char[] wchar[] and dchar[] types instead.
> 
> No, in my vision of how strings should be typed, char[] is an array, 
> not a  string.  It should be treated like an array of code-units, where 
> two forms  that create the same grapheme are considered different.

Well, I agree there's a need for that sometimes. But if what you want is 
just a dumb array of code units, why not use ubyte[], ushort[] and 
uint[] instead?

It seems to me that the whole point of having a different type for 
char[], wchar[], and dchar[] is that you know they are Unicode strings 
and can treat them as such. And if you treat them as Unicode strings, 
then perhaps the runtime and the compiler should too, for consistency's 
sake.


>> Also bring the idea above that iterating on a string would yield  
>> graphemes as char[] and this code would work perfectly irrespective of  
>> whether you used combining characters:
>> 
>> 	foreach (grapheme; "exposé") {
>> 		if (grapheme == "é")
>> 			break;
>> 	}
>> 
>> I think a good standard to evaluate our handling of Unicode is to see  
>> how easy it is to do things the right way. In the above, foreach would  
>> slice the string grapheme by grapheme, and the == operator would 
>> perform  a normalized comparison. While it works correctly, it's 
>> probably not the  most efficient way to do thing however.
> 
> I think this is a good alternative, but I'd rather not impose this on  
> people like myself who deal mostly with English.

I'm not suggesting we impose it, just that we make it the default. If 
you want to iterate by dchar, wchar, or char, just write:

	foreach (dchar c; "exposé") {}
	foreach (wchar c; "exposé") {}
	foreach (char c; "exposé") {}
	// or
	foreach (dchar c; "exposé".by!dchar()) {}
	foreach (wchar c; "exposé".by!wchar()) {}
	foreach (char c; "exposé".by!char()) {}

and it'll work. But the default would be a slice containing the 
grapheme, because this is the right way to represent a Unicode 
character.
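The by-code-unit and by-code-point views are what other languages already expose directly. For illustration only (Python, not the proposed D by!T API), here are the same two lower-level views of a string whose 'é' is a single precomposed code point:

```python
s = "expos\u00e9"                  # 'é' as a single precomposed code point

code_units  = list(s.encode("utf-8"))   # UTF-8 code units (like by!char)
code_points = list(s)                   # code points (like by!dchar)

print(len(code_units))    # 7: 'é' takes two UTF-8 code units
print(len(code_points))   # 6
```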


> I think this should be  possible to do with wrapper types or 
> intermediate ranges which have  graphemes as elements (per my 
> suggestion above).

I think it should be the reverse. If you want your code to break when 
it encounters multi-code-point graphemes then it's your choice, but you 
should have to make your choice explicit. The default should be to 
handle strings correctly.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo@bar.com> wrote:

> Steven Schveighoffer Wrote:

>>
>> English and (if I understand correctly) most other languages.  Any
>> language which can be built from composable graphemes would work.  And  
>> in
>> fact, ones that use some graphemes that cannot be composed will also  
>> work
>> to some degree (for example, opEquals).
>>
>> What I'm proposing (or think I'm proposing) is not exactly catering to
>> English and ASCII, what I'm proposing is simply not catering to more
>> complex languages such as Hebrew and Arabic.  What I'm trying to find  
>> is a
>> middle ground where most languages work, and the code is simple and
>> efficient, with possibilities to jump down to lower levels for  
>> performance
>> (i.e. switch to char[] when you know ASCII is all you are using) or jump
>> up to full unicode when necessary.
>>
>> Essentially, we would have three levels of types:
>>
>> char[], wchar[], dchar[] -- Considered to be arrays in every way.
>> string_t!T (string, wstring, dstring) -- Specialized string types that  
>> do
>> normalization to dchars, but do not handle perfectly all graphemes.   
>> Works
>> with any algorithm that deals with bidirectional ranges.  This is the
>> default string type, and the type for string literals.  Represented
>> internally by a single char[], wchar[] or dchar[] array.
>> * utfstring_t!T -- specialized string to deal with full unicode, which  
>> may
>> perform worse than string_t, but supports everything unicode supports.
>> May require a battery of specialized algorithms.
>>
>> * - name up for discussion
>>
>> Also note that phobos currently does *no* normalization as far as I can
>> tell for things like opEquals.  Two char[]'s that represent equivalent
>> strings, but not in the same way, will compare as !=.
>>
>> -Steve
>
> The above compromise provides zero benefit. The proposed default type  
> string_t is incorrect and will cause bugs. I prefer the standard lib to  
> not provide normalization at all and force me to use a 3rd party lib  
> rather than provide an incomplete implementation that will give me a  
> false sense of correctness and cause very subtle and hard to find bugs.

I feel like you might be exaggerating, but maybe I'm completely wrong on  
this, I'm not well-versed in unicode, or even languages that require  
unicode.  The clear benefit I see is that with a string type which  
normalizes to canonical code points, you can use this in any algorithm  
without having it be unicode-aware for *most languages*.  At least, that  
is how I see it.  I'm looking at it as a code-reuse proposition.

It's like calendars.  There are quite a few different calendars in  
different cultures.  But most people use a Gregorian calendar.  So we have  
three options:

a) Use a Gregorian calendar, and leave the other calendars to a 3rd party  
library
b) Use a complicated calendar system where Gregorian calendars are treated  
with equal respect to all other calendars, none are the default.
c) Use a Gregorian calendar by default, but include the other calendars as  
a separate module for those who wish to use them.

I'm looking at my proposal as more of a c) solution.

Can you show how normalization causes subtle bugs?

> More over, Even if you ignore Hebrew as a tiny insignificant minority  
> you cannot do the same for Arabic which has over one *billion* people  
> that use that language.

I hope that the medium type works 'good enough' for those languages, with  
the high level type needed for advanced usages.  At a minimum, comparison  
and substring should work for all languages.

> I firmly believe that in accordance with D's principle that the default  
> behavior should be the correct & safe option, D should have the full  
> unicode type (utfstring_t above) as the default.
>
> You need only a subset of the functionality because you only use  
> English? For the same reason, you don't want the Unicode overhead? Use  
> an ASCII type instead. In the same vain, a geneticist should use a DNA  
> sequence type and not Unicode text.

Or French, or Spanish, or German, etc...

Look, even the lowest level is valid unicode, but if you want to start  
extracting individual graphemes, you need more machinery.  In 99% of  
cases, I'd think you want to use strings as strings, not as sequences of  
graphemes, or code-units.

-Steve
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Sat, 15 Jan 2011 13:32:10 -0500, Michel Fortin  
<michel.fortin@michelf.com> wrote:

> On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer"  
> <schveiguy@yahoo.com> said:
>
>> On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin   
>> <michel.fortin@michelf.com> wrote:
>>
>>> Actually, returning a sliced char[] or wchar[] could also be valid.   
>>> User-perceived characters are basically a substring of one or more  
>>> code  points. I'm not sure it complicates that much the semantics of  
>>> the  language -- what's complicated about writing str.front == "a"  
>>> instead of  str.front == 'a'? -- although it probably would complicate  
>>> the generated  code and make it a little slower.
>>  Hm... this pushes the normalization outside the type, and into the   
>> algorithms (such as find).
>>  I was hoping to avoid that.
>
> Not really. It pushes the normalization to the string comparison  
> operator, as explained later.
>
>
>> I think I can  come up with an algorithm that normalizes into canonical  
>> form as it  iterates.  It just might return part of a grapheme if the  
>> grapheme cannot  be composed.
>
> The problem with normalization while iterating is that you lose  
> information about what the actual code points part of the grapheme. If  
> you wanted to count the number of grapheme with a particular code point  
> you're lost that information.

Are these common requirements?  I thought users mostly care about  
graphemes, not code points.  Asking in the dark here, since I have next to  
zero experience with unicode strings.

>
> Moreover, if all you want is to count the number of grapheme,  
> normalizing the character is a waste of time.

This is true.  I can see this being a common need.

>
> I suggested in another post that we implement ranges for decomposing and  
> recomposing on-the-fly a string in its normalized form. That's basically  
> the same thing as you suggest, but it'd have to be explicit to avoid the  
> problem above.

OK, I see your point.

>
>
>>> I wonder if normalized string comparison shouldn't be built directly  
>>> in  the char[] wchar[] and dchar[] types instead.
>>  No, in my vision of how strings should be typed, char[] is an array,  
>> not a  string.  It should be treated like an array of code-units, where  
>> two forms  that create the same grapheme are considered different.
>
> Well, I agree there's a need for that sometime. But if what you want is  
> just a dumb array of code units, why not use ubyte[], ushort[] and  
> uint[] instead?

Because ubyte[], ushort[] and uint[] do not say that their data is Unicode  
text.  The point is, I want to write a function that takes UTF-8; ubyte[]  
opens it up to any data, not just UTF-8 data.  But if we have a method of  
iterating code-units as you specify below, then I think we are OK.

> It seems to me that the whole point of having a different type for  
> char[], wchar[], and dchar[] is that you know they are Unicode strings  
> and can treat them as such. And if you treat them as Unicode strings,  
> then perhaps the runtime and the compiler should too, for consistency's  
> sake.

I'd agree with you, but then there's that pesky [] after it indicating  
it's an array.  For consistency's sake, I'd say the compiler should treat  
T[] as an array of T's.

>>> Also bring the idea above that iterating on a string would yield   
>>> graphemes as char[] and this code would work perfectly irrespective  
>>> of  whether you used combining characters:
>>>  	foreach (grapheme; "exposé") {
>>> 		if (grapheme == "é")
>>> 			break;
>>> 	}
>>>  I think a good standard to evaluate our handling of Unicode is to  
>>> see  how easy it is to do things the right way. In the above, foreach  
>>> would  slice the string grapheme by grapheme, and the == operator  
>>> would perform  a normalized comparison. While it works correctly, it's  
>>> probably not the  most efficient way to do thing however.
>>  I think this is a good alternative, but I'd rather not impose this on   
>> people like myself who deal mostly with English.
>
> I'm not suggesting we impose it, just that we make it the default. If  
> you want to iterate by dchar, wchar, or char, just write:
>
> 	foreach (dchar c; "exposé") {}
> 	foreach (wchar c; "exposé") {}
> 	foreach (char c; "exposé") {}
> 	// or
> 	foreach (dchar c; "exposé".by!dchar()) {}
> 	foreach (wchar c; "exposé".by!wchar()) {}
> 	foreach (char c; "exposé".by!char()) {}
>
> and it'll work. But the default would be a slice containing the  
> grapheme, because this is the right way to represent a Unicode character.

I think this is a good idea.  I previously was nervous about it, but I'm  
not sure it makes a huge difference.  Returning a char[] is certainly less  
work than normalizing a grapheme into one or more code points, and then  
returning them.  All that it takes is to detect all the code points within  
the grapheme.  Normalization can be done if needed, but would probably  
have to output another char[], since a normalized grapheme can occupy more  
than one dchar.

What if I modified my proposed string_t type to return T[] as its element  
type, as you say, and string literals are typed as string_t!(whatever)?   
In addition, the restrictions I imposed on slicing a code point actually  
get imposed on slicing a grapheme.  That is, it is illegal to substring a  
string_t in a way that slices through a grapheme (and by deduction, a code  
point)?

Actually, we would need a grapheme to be its own type, because comparing  
two char[]'s that don't contain equivalent bits and having them be equal,  
violates the expectation that char[] is an array.
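The distinction can be sketched concretely. Below, a hypothetical Grapheme class (Python for illustration; grapheme_t itself is the name under discussion, and this is not its real API) compares by canonical equivalence, while its underlying arrays remain unequal, exactly the two behaviors that must not share one type:

```python
import unicodedata

class Grapheme:
    """Hypothetical grapheme_t analogue: wraps the code points of one
    user-perceived character and compares by canonical equivalence."""
    def __init__(self, code_points):
        self.code_points = code_points

    def __eq__(self, other):
        # Canonical comparison: normalize both sides before comparing.
        return (unicodedata.normalize("NFC", self.code_points)
                == unicodedata.normalize("NFC", other.code_points))

a = Grapheme("\u00e9")        # precomposed 'é'
b = Grapheme("e\u0301")       # decomposed 'e' + combining acute

print(a.code_points == b.code_points)   # False: the raw arrays differ
print(a == b)                           # True: canonically equivalent
```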

So the string_t!char would return a grapheme_t!char (names to be  
discussed) as its element type.

>
>
>> I think this should be  possible to do with wrapper types or  
>> intermediate ranges which have  graphemes as elements (per my  
>> suggestion above).
>
> I think it should be the reverse. If you want your code to break when it  
> encounters multi-code-point graphemes then it's your choice, but you  
> should have to make your choice explicit. The default should be to  
> handle strings correctly.

You are probably right.

-Steve
January 15, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Sat, 15 Jan 2011 14:51:47 -0500, Steven Schveighoffer  
<schveiguy@yahoo.com> wrote:

> I feel like you might be exaggerating, but maybe I'm completely wrong on  
> this, I'm not well-versed in unicode, or even languages that require  
> unicode.  The clear benefit I see is that with a string type which  
> normalizes to canonical code points, you can use this in any algorithm  
> without having it be unicode-aware for *most languages*.  At least, that  
> is how I see it.  I'm looking at it as a code-reuse proposition.
>
> It's like calendars.  There are quite a few different calendars in  
> different cultures.  But most people use a Gregorian calendar.  So we  
> have three options:
>
> a) Use a Gregorian calendar, and leave the other calendars to a 3rd  
> party library
> b) Use a complicated calendar system where Gregorian calendars are  
> treated with equal respect to all other calendars, none are the default.
> c) Use a Gregorian calendar by default, but include the other calendars  
> as a separate module for those who wish to use them.
>
> I'm looking at my proposal as more of a c) solution.
>
> Can you show how normalization causes subtle bugs?

I see from Michel's post how normalization automatically can be bad.  I  
also see that it can be wasteful.  So I've shifted my position.

Now I agree that we need a full unicode-compliant string type as the  
default.  See my reply to Michel for more info on my revised proposal.

-Steve