VLERange: a range in between BidirectionalRange and RandomAccessRange (page 14) - D Programming Language Discussion Forum

January 17, 2011

Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Posted by Daniel Gibson
in reply to Andrei Alexandrescu

On 17.01.2011 03:45, Andrei Alexandrescu wrote:
> On 1/16/11 6:42 PM, Daniel Gibson wrote:
>> On 17.01.2011 00:58, Andrei Alexandrescu wrote:
>>> On 1/16/11 3:20 PM, Michel Fortin wrote:
>>>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>>>> <SeeWebsiteForEmail@erdani.org> said:
>>>>> But most strings don't contain combining characters or unnormalized
>>>>> strings.
>>>>
>>>> I think we should expect combining marks to be used more and more as our
>>>> OS text system and fonts start supporting them better. Them being rare
>>>> might be true today, but what do you know about tomorrow?
>>>
>>> I don't think languages will acquire more diacritics soon. I do hope, of
>>> course, that D applications gain more usage in the Arabic, Hebrew etc.
>>> world.
>>>
>>
>> So why does D use unicode anyway?
>> If you don't care about not-often used languages anyway, you could have
>> used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide
>> which encoding he wants/needs).
>>
>> You could as well say "we don't need to use dchar to represent a proper
>> code point, wchar is enough for most use cases and has less overhead
>> anyway".
>
> I consider UTF8 superior to all of the above.
>

Really? UTF32 - maybe. But IMHO even when not considering graphemes and such UTF8 sucks hard in comparison to those because one code point consists of 1-4 code units (even in German 1-2 code units).
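To make the code unit counts concrete, here is a minimal sketch in D. The sample strings are chosen only for illustration; the only thing it relies on is that `.length` on a D string counts code units of the respective encoding.

```d
import std.stdio : writeln;

void main()
{
    // "Grüße", written with escapes so the source file encoding cannot change the result.
    string  s8  = "Gr\u00FC\u00DFe";
    wstring s16 = "Gr\u00FC\u00DFe"w;
    dstring s32 = "Gr\u00FC\u00DFe"d;

    // .length counts code units, not code points.
    writeln(s8.length);  // 7: ü and ß take two UTF-8 code units each
    writeln(s16.length); // 5
    writeln(s32.length); // 5

    // A code point outside the BMP (U+1D11E, musical G clef).
    writeln("\U0001D11E".length);  // 4 UTF-8 code units
    writeln("\U0001D11E"w.length); // 2 UTF-16 code units (a surrogate pair)
    writeln("\U0001D11E"d.length); // 1 UTF-32 code unit
}
```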

>>>>> I think it's reasonable to understand why I'm happy with the current
>>>>> state of affairs. It is better than anything we've had before and
>>>>> better than everything else I've tried.
>>>>
>>>> It is indeed easy to understand why you're happy with the current state
>>>> of affairs: you never had to deal with multi-code-point character and
>>>> can't imagine yourself having to deal with them on a semi-frequent
>>>> basis.
>>>
>>> Do you, and can you?
>>>
>>>> Other people won't be so happy with this state of affairs, but
>>>> they'll probably notice only after most of their code has been written
>>>> unaware of the problem.
>>>
>>> They can't be unaware and write said code.
>>>
>>
>> Fun fact: Germany recently introduced a new ID card and some of the
>> software that was developed for this and is used in some record sections
>> fucks up when a name contains diacritics.
>>
>> I think especially when you're handling names (and much software does, I
>> think) it's crucial to have proper support for all kinds of chars.
>> Of course many programmers are not aware that, even if umlauts and ß work, it
>> doesn't mean that all other kinds of strange characters work as well.
>>
>>
>> Cheers,
>> - Daniel
>
> I think German text works well with dchar.
>

Yes, but even in Germany there are people whose names contain "strange" characters ;)
Is it common to have programs that deal with text in a specific language but not with names?

I do understand your resistance to supporting Unicode properly - it's a lot of trouble and makes things inefficient (more inefficient than UTF8/16 already are because of that code point != code unit thing).
Another thing is that due to bad support from fonts or console/GUI technology it may happen (quite often) that one grapheme is *not* displayed as a single character, thus messing up formatting anyway (Still you probably should cut a string within a grapheme).

So here's what I think can be done (and, at least the first two points, especially the first, should be done):

1. Mention the Grapheme and Digraph situation in string related documentation (std.string and maybe string-related stuff in std.algorithm like Splitter) to make sure people who use Phobos are aware of the problem. Then at least they can't say that nobody told them when their Objective-C using colleagues are laughing at their broken unicode-support ;)

2. Maybe add some functions that *do* deal with this.
Like "bool isPartOfGrapheme(dchar c)" or "bool isDigraph(dchar c)" so people can check themselves, if they just split their string within a grapheme or something.

3. Include a proper Unicode-string type/module, if somebody has the time and knowledge to develop one. spir already started something like that AFAIK and Steven Schveighoffer also is even working on a complete string type - maybe these efforts could be combined?
I guess default strings will stay mostly the way they are (but please add an ASCII type or allow ubyte[] asciiStr = "asdf";).
Having an additional type in Phobos that works correctly in all cases (e.g. Arabic, Hebrew, Japanese, ..) would be really great, though.

  UniString uStr = new UniString("sdfüñẫ");
  UniString uStr2 = uStr[3..$]; // "üñẫ"
  UniGraph ug = uStr[5]; // 'ẫ'
  size_t i = uStr2.length; // 3
something like that maybe (of course plus a lot of other stuff like proper comparison for different encodings of the same char like a modified icmp() discussed before).
But something like
  size_t len = uniLen("sdfüñẫ"); // 6
  string s = uniSlice(str, 3, str.length); // == str.uniSlice(3, str.length);
etc may be just as good.
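A rough sketch of how the free-function variant could look. `uniLen` and `uniSlice` are the hypothetical names from above, not Phobos functions. The sketch assumes a current Phobos in which `std.uni` provides `byGrapheme` and `graphemeStride` (both postdate this thread), and it takes the slice bounds as grapheme indices, which is an assumption about the intended semantics.

```d
import std.range : walkLength;
import std.uni : byGrapheme, graphemeStride;

/// Hypothetical helper: number of graphemes (user-perceived characters) in s.
size_t uniLen(string s)
{
    return s.byGrapheme.walkLength;
}

/// Hypothetical helper: slice of s covering graphemes [from, to).
/// Indices past the end are clamped to the end of the string.
string uniSlice(string s, size_t from, size_t to)
{
    assert(from <= to, "from must not exceed to");
    size_t pos = 0;          // current offset in code units
    size_t index = 0;        // current grapheme index
    size_t begin = s.length; // code unit offset where the slice starts
    while (pos < s.length && index < to)
    {
        if (index == from)
            begin = pos;
        pos += graphemeStride(s, pos); // skip one whole grapheme
        ++index;
    }
    if (index == from)       // empty slice, or from is exactly at the end
        begin = pos;
    return s[begin .. pos];
}

void main()
{
    string str = "sdf\u00FC\u00F1\u1EAB"; // "sdfüñẫ", precomposed
    assert(str.length == 10);             // UTF-8 code units
    assert(uniLen(str) == 6);
    assert(uniSlice(str, 3, uniLen(str)) == "\u00FC\u00F1\u1EAB");
    assert(uniLen("e\u0301") == 1);       // e + combining acute accent
}
```

Both helpers are O(n) in the length of the string, since grapheme boundaries can only be found by walking from the start.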

(I hope this all made sense)

>
> Andrei

Cheers,
- Daniel

January 17, 2011

Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Posted by Daniel Gibson
in reply to Daniel Gibson

On 17.01.2011 04:38, Daniel Gibson wrote:
> On 17.01.2011 03:45, Andrei Alexandrescu wrote:
>> On 1/16/11 6:42 PM, Daniel Gibson wrote:
>>> On 17.01.2011 00:58, Andrei Alexandrescu wrote:
>>>> On 1/16/11 3:20 PM, Michel Fortin wrote:
>>>>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>>>>> <SeeWebsiteForEmail@erdani.org> said:
>>>>>> But most strings don't contain combining characters or unnormalized
>>>>>> strings.
>>>>>
>>>>> I think we should expect combining marks to be used more and more as our
>>>>> OS text system and fonts start supporting them better. Them being rare
>>>>> might be true today, but what do you know about tomorrow?
>>>>
>>>> I don't think languages will acquire more diacritics soon. I do hope, of
>>>> course, that D applications gain more usage in the Arabic, Hebrew etc.
>>>> world.
>>>>
>>>
>>> So why does D use unicode anyway?
>>> If you don't care about not-often used languages anyway, you could have
>>> used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide
>>> which encoding he wants/needs).
>>>
>>> You could as well say "we don't need to use dchar to represent a proper
>>> code point, wchar is enough for most use cases and has less overhead
>>> anyway".
>>
>> I consider UTF8 superior to all of the above.
>>
>
> Really? UTF32 - maybe. But IMHO even when not considering graphemes and such
> UTF8 sucks hard in comparison to those because one code point consists of 1-4
> code units (even in German 1-2 code units).
>
>>>>>> I think it's reasonable to understand why I'm happy with the current
>>>>>> state of affairs. It is better than anything we've had before and
>>>>>> better than everything else I've tried.
>>>>>
>>>>> It is indeed easy to understand why you're happy with the current state
>>>>> of affairs: you never had to deal with multi-code-point character and
>>>>> can't imagine yourself having to deal with them on a semi-frequent
>>>>> basis.
>>>>
>>>> Do you, and can you?
>>>>
>>>>> Other people won't be so happy with this state of affairs, but
>>>>> they'll probably notice only after most of their code has been written
>>>>> unaware of the problem.
>>>>
>>>> They can't be unaware and write said code.
>>>>
>>>
>>> Fun fact: Germany recently introduced a new ID card and some of the
>>> software that was developed for this and is used in some record sections
>>> fucks up when a name contains diacritics.
>>>
>>> I think especially when you're handling names (and much software does, I
>>> think) it's crucial to have proper support for all kinds of chars.
>>> Of course many programmers are not aware that, even if umlauts and ß work, it
>>> doesn't mean that all other kinds of strange characters work as well.
>>>
>>>
>>> Cheers,
>>> - Daniel
>>
>> I think German text works well with dchar.
>>
>
> Yes, but even in Germany there are people whose names contain "strange"
> characters ;)
> Is it common to have programs that deal with text in a specific language but not
> with names?
>
>
> I do understand your resistance to supporting Unicode properly - it's a lot of
> trouble and makes things inefficient (more inefficient than UTF8/16 already are
> because of that code point != code unit thing).
> Another thing is that due to bad support from fonts or console/GUI technology it
> may happen (quite often) that one grapheme is *not* displayed as a single
> character, thus messing up formatting anyway (Still you probably should cut a
> string within a grapheme).

I meant you should *not* cut a string within a grapheme.
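A check along these lines could tell whether a given cut position is safe. This is only a sketch: `isGraphemeBoundary` is a made-up name, and it assumes a current Phobos where `std.uni.graphemeStride` exists (it did not when this was written).

```d
import std.uni : graphemeStride;

/// Hypothetical helper: true if slicing s at code unit `offset`
/// would not cut through a grapheme (or through a code point).
bool isGraphemeBoundary(string s, size_t offset)
{
    if (offset > s.length)
        return false;
    size_t pos = 0;
    // Jump from grapheme to grapheme; a boundary is hit only if we land exactly on offset.
    while (pos < offset)
        pos += graphemeStride(s, pos);
    return pos == offset;
}

void main()
{
    string s = "e\u0301x"; // e + combining acute accent, then x (4 code units)
    assert(isGraphemeBoundary(s, 0));
    assert(!isGraphemeBoundary(s, 1)); // between e and its combining mark
    assert(!isGraphemeBoundary(s, 2)); // inside the combining mark's UTF-8 sequence
    assert(isGraphemeBoundary(s, 3));  // before x
    assert(isGraphemeBoundary(s, 4));  // end of string
}
```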

>
> So here's what I think can be done (and, at least the first two points,
> especially the first, should be done):
>
> 1. Mention the Grapheme and Digraph situation in string related documentation
> (std.string and maybe string-related stuff in std.algorithm like Splitter) to
> make sure people who use Phobos are aware of the problem. Then at least they
> can't say that nobody told them when their Objective-C using colleagues are
> laughing at their broken unicode-support ;)
>
> 2. Maybe add some functions that *do* deal with this.
> Like "bool isPartOfGrapheme(dchar c)" or "bool isDigraph(dchar c)" so people can
> check themselves, if they just split their string within a grapheme or something.
>
> 3. Include a proper Unicode-string type/module, if somebody has the time and
> knowledge to develop one. spir already started something like that AFAIK and
> Steven Schveighoffer also is even working on a complete string type - maybe
> these efforts could be combined?
> I guess default strings will stay mostly the way they are (but please add an
> ASCII type or allow ubyte[] asciiStr = "asdf";).
> Having an additional type in Phobos that works correctly in all cases (e.g.
> Arabic, Hebrew, Japanese, ..) would be really great, though.
>
> UniString uStr = new UniString("sdfüñẫ");
> UniString uStr2 = uStr[3..$]; // "üñẫ"
> UniGraph ug = uStr[5]; // 'ẫ'
> size_t i = uStr2.length; // 3

of course I forgot:
  string s = uStr2.toString();
  dstring s2 = uStr2.toDString();
to convert it back to a "normal" string

> something like that maybe (of course plus a lot of other stuff like proper
> comparison for different encodings of the same char like a modified icmp()
> discussed before).
> But something like
> size_t len = uniLen("sdfüñẫ"); // 6
> string s = uniSlice(str, 3, str.length); // == str.uniSlice(3, str.length);
> etc may be just as good.
>
> (I hope this all made sense)
>
>>
>> Andrei
>
> Cheers,
> - Daniel

January 17, 2011

Re: VLERange: a range in between BidirectionalRange and

Posted by Steven Schveighoffer
in reply to foobar

On Sat, 15 Jan 2011 17:19:48 -0500, foobar <foo@bar.com> wrote:

> I like Michel's proposed semantics and I also agree with you that it should be a distinct string type and not break consistency of regular arrays.
>
> Regarding your last point: Do you mean that a grapheme would be a sub-type of string (a specialization where the string represents a single element)? If so, then it sounds good to me.

A grapheme would be its own specialized type.  I'd probably remove the range primitives to really differentiate it.  Unfortunately, due to the inability to statically check this, the invariant would have to be a runtime check.  Most likely this check would be disabled in release mode.

This can cause problems, and I can see why it is attractive to use strings to implement graphemes, but that also has its problems.  With grapheme being its own type, we are providing a way to optimize functions, and allow further restrictions on function parameters.

At the end of the day, perhaps grapheme *should* just be a string.  We'll have to see how this breaks in practice, either way.

-Steve

January 17, 2011

Re: VLERange: a range in between BidirectionalRange and

Posted by Jonathan M Davis
in reply to Steven Schveighoffer

On Monday 17 January 2011 04:08:08 Steven Schveighoffer wrote:
> On Sat, 15 Jan 2011 17:19:48 -0500, foobar <foo@bar.com> wrote:
> > I like Michel's proposed semantics and I also agree with you that it should be a distinct string type and not break consistency of regular arrays.
> > 
> > Regarding your last point: Do you mean that a grapheme would be a sub-type of string (a specialization where the string represents a single element)? If so, then it sounds good to me.
> 
> A grapheme would be its own specialized type.  I'd probably remove the range primitives to really differentiate it.  Unfortunately, due to the inability to statically check this, the invariant would have to be a runtime check.  Most likely this check would be disabled in release mode.
> 
> This can cause problems, and I can see why it is attractive to use strings to implement graphemes, but that also has its problems.  With grapheme being its own type, we are providing a way to optimize functions, and allow further restrictions on function parameters.
> 
> At the end of the day, perhaps grapheme *should* just be a string.  We'll have to see how this breaks in practice, either way.

I think that it would make good sense for a grapheme to be a struct which holds a string, as Andrei suggested:

struct Grapheme(Char) if (isSomeChar!Char)
{
     private const Char[] rep;
     ...
}

I really think that trying to use strings to represent graphemes is asking for it. The element of a range should be a different type than that of the range itself.
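To give an idea of what the elided part might contain, here is one possible sketch, not a proposal for the actual Phobos type. The name `OneGrapheme`, the `slice` property, and the `opEquals` overload are all assumptions for illustration, and the invariant relies on `std.uni.graphemeStride` from a current Phobos. It uses `const(Char)[]` rather than `const Char[]` for the field so that values stay assignable.

```d
import std.traits : isSomeChar;
import std.uni : graphemeStride;

/// Hypothetical wrapper: a slice that is checked to hold exactly one grapheme.
struct OneGrapheme(Char) if (isSomeChar!Char)
{
    private const(Char)[] rep;

    this(const(Char)[] s)
    {
        rep = s;
    }

    // Runtime check only; it is compiled out when building with -release.
    invariant()
    {
        assert(rep.length == 0 || graphemeStride(rep, 0) == rep.length,
               "OneGrapheme must hold exactly one grapheme");
    }

    /// Comparable to a string or string literal (code unit comparison, no normalization).
    bool opEquals(scope const(Char)[] s) const
    {
        return rep == s;
    }

    /// Read-only access to the underlying code units.
    @property const(Char)[] slice() const
    {
        return rep;
    }
}

void main()
{
    auto g = OneGrapheme!char("e\u0301"); // e + combining acute accent
    assert(g == "e\u0301");
    assert(g.slice.length == 3);          // three UTF-8 code units, one grapheme
}
```

Note that it deliberately has no range primitives and no append operation, so it cannot be mistaken for a string.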

- Jonathan M Davis

January 17, 2011

Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Posted by Steven Schveighoffer
in reply to Michel Fortin

On Sat, 15 Jan 2011 17:45:37 -0500, Michel Fortin <michel.fortin@michelf.com> wrote:

> On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" <schveiguy@yahoo.com> said:
>
>> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  <michel.fortin@michelf.com> wrote:
>>
>>> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  <schveiguy@yahoo.com> said:
>>>
>>>>> I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write:
>>>>>
>>>>>     foreach (dchar c; "exposé") {}
>>>>>     foreach (wchar c; "exposé") {}
>>>>>     foreach (char c; "exposé") {}
>>>>>     // or
>>>>>     foreach (dchar c; "exposé".by!dchar()) {}
>>>>>     foreach (wchar c; "exposé".by!wchar()) {}
>>>>>     foreach (char c; "exposé".by!char()) {}
>>>>>
>>>>> and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.
>>>>
>>>> I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.
>>>
>>> I'm glad we agree on that now.
>>
>> It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).
>
> Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now, can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support.

I didn't read the standard, all I understand about unicode is from this NG ;)  What I meant was the ability to do things more than one way seems like a committee-designed standard.  Usually with one of those, you have one party who "absolutely needs" one way of doing things (most likely because all their code is based on it), and other parties who want it a different way.  When compromises occur, the end result is, you have a standard that's unnecessarily difficult to implement.

> Indeed, the change would probably be too radical for D2.
>
> I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works.

I was hoping to change string literal types.  If we don't do that, we have a half-ass solution.  I don't think it's going to be impossible, because string, wstring, dstring are all aliases.

In fact, with my current proposed type, this already works:

mystring s = "hello";

But this doesn't:

auto s = "hello"; // still typed as immutable(char)[]

This isn't so bad, just require one to specify the type, right?  Well, it fails miserably here:

void foo(mystring s) {...}
foo("hello"); // fails to match.

In order to have a string type, string literals have to be typed as that type.
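The three cases can be reproduced with any struct that has a constructor taking a string. The sketch below uses a stand-in `MyString` (not the proposed type, just a minimal placeholder) to show where the literal converts and where it does not.

```d
/// Stand-in for a library string type; only the constructor matters here.
struct MyString
{
    private string data;
    this(string s) { data = s; }
}

size_t foo(MyString s) { return s.data.length; }

void main()
{
    // Works: initialization from a literal calls this(string).
    MyString a = "hello";

    // Compiles, but the variable is a plain string, not MyString.
    auto b = "hello";
    static assert(is(typeof(b) == string));

    // Does not compile: D performs no implicit construction for function arguments.
    static assert(!__traits(compiles, foo("hello")));

    // The caller has to spell out the type.
    assert(foo(MyString("hello")) == 5);
}
```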

> Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the ground that it would silently break D1 compatibility. This is a valid point in my opinion.
>
> I think you're right when you say that not treating char[] as an array of character breaks, to a certain extent, C compatibility. Another valid point.
>
> That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do.

Changing iteration and not indexing is not going to fix the mess we have right now.

> One more thing:
>
> NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operations. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.

But is NSString typed the *exact same* as an array, or is it a wrapper for an array?  Looking at the docs, it appears it is not.

>>> Or you could make a grapheme a string_t. ;-)
>>
>> I'm a little uneasy having a range return itself as its element type. For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t.
>>
>> It does seem daunting to have so many types, but at the same time, types convey relationships at compile time that can make coding impossible to get wrong, or make things actually possible when having a single type doesn't.
>>
>> I'll give you an example from a previous life:
>>
>> [...]
>>
>> I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality. Catching bugs during compilation is soooo much better than experiencing them during runtime.
>
> I can understand the utility of a separate type in your DateTime example, but in this case I fail to see any advantage.
>
> I mean, a grapheme is a slice of a string, can have multiple code points (like a string), can be appended the same way as a string, can be composed or decomposed using canonical normalization or compatibility normalization (like a string), and should be sorted, uppercased, and lowercased according to Unicode rules (like a string). Basically, a grapheme is just a string that happens to contain only one grapheme. What would a custom type do differently than a string?

A grapheme type would not be a range, it would be an element of the string range.  You could not append to it (otherwise, that makes it into a string).

In all other respects, it should act similar to a string (as you say, printing, upper-casing, comparison, etc.)

>
> Also, grapheme == "a" is easy to understand because both are strings. But if a grapheme is a separate type, what would a grapheme literal look like?

A grapheme should be comparable to a string literal.  It should be assignable to a string literal.  The drawback is we would need a runtime check to ensure the string literal was actually one grapheme.  Some compiler help in this regard would be useful, but I'm not sure how the mechanics would work (you couldn't exactly type a literal differently based on its contents).  Another possibility is to come up with a different syntax to denote grapheme literals.

> So in the end I don't think a grapheme needs a specific type, at least not for general purpose text processing. If I split a string on whitespace, do I get a range where elements are of type "word"? No, just sliced strings.

It is not clear that using a separate type is the "right answer."  It may be that an element of a string should be a string.  This does work in other languages that don't have a concept of a character.  An extra type however, allows us to have more concrete positions to work with.

> That said, I'm much less concerned by the type used to represent a grapheme than by the Unicode correctness. I'm not opposed to a separate type, I just don't really see the point.

I will try to explain better by making an actual candidate type.

-Steve

January 17, 2011

Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Posted by Steven Schveighoffer
in reply to Andrei Alexandrescu

On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 1/15/11 9:25 PM, Jonathan M Davis wrote:
>> Considering that strings are already dealt with specially in order to have an
>> element of dchar, I wouldn't think that it would be all that disruptive to make
>> it so that they had an element type of Grapheme instead. Wouldn't that then fix
>> all of std.algorithm and the like without really disrupting anything?
>
> It would make everything related a lot (a TON) slower, and it would break all client code that uses dchar as the element type, or is otherwise unprepared to use Graphemes explicitly. There is no question there will be disruption.

I would have agreed with you last week.  Now I understand that using dchar is just as useless for unicode as using char.

Will it be slower?  Perhaps.  A TON slower?  Probably not.

But it will be correct.  Correct and slow is better than incorrect and fast.  If I showed you a shortest-path algorithm that ran in O(V) time, but didn't always find the shortest path, would you call it a success?

We need to get some real numbers together.  I'll see what I can create for a type, but someone else needs to supply the input :)  I'm on short supply of unicode data, and any attempts I've made to create some result in failure.  I have one example of one composed character in this thread that I can cling to, but in order to supply some real numbers, we need a large amount of data.
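As a starting point for such measurements, something like the following could be used. It is only a sketch: the input is synthetic (one combining sequence repeated many times), which is an assumption and no substitute for real text, and it relies on `std.uni.byGrapheme` and `std.datetime.stopwatch` from a current Phobos.

```d
import std.array : replicate;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;

void main()
{
    // Synthetic input: "exposé " with a decomposed é, repeated 100,000 times.
    auto text = "expose\u0301 ".replicate(100_000);

    auto sw = StopWatch(AutoStart.yes);
    size_t codePoints = text.walkLength;            // iterates by dchar
    auto tCodePoints = sw.peek;

    sw.reset();
    size_t graphemes = text.byGrapheme.walkLength;  // iterates by grapheme
    auto tGraphemes = sw.peek;

    writefln("code points: %s in %s", codePoints, tCodePoints);
    writefln("graphemes:   %s in %s", graphemes, tGraphemes);
}
```

The timings depend on compiler, flags, and input, so the numbers only mean something when compared on the same machine with realistic data.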

-Steve

January 17, 2011

Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Posted by Lars T. Kyllingstad
in reply to Steven Schveighoffer

On Mon, 17 Jan 2011 07:44:17 -0500, Steven Schveighoffer wrote:

> We need to get some real numbers together.  I'll see what I can create for a type, but someone else needs to supply the input :)  I'm on short supply of unicode data, and any attempts I've made to create some result in failure.  I have one example of one composed character in this thread that I can cling to, but in order to supply some real numbers, we need a large amount of data.

Googling "unicode sample document" turned up a few examples.  This one looks promising:

http://www.humancomp.org/unichtm/unichtm.htm

-Lars

January 17, 2011

Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Posted by spir
in reply to Steven Schveighoffer

On 01/15/2011 05:59 PM, Steven Schveighoffer wrote:
> I think this is a good alternative, but I'd rather not impose this on
> people like myself who deal mostly with English.  I think this should be
> possible to do with wrapper types or intermediate ranges which have
> graphemes as elements (per my suggestion above).

I am unsure now about the question of a text's (apparent) natural language in relation to unicode issues. For instance, English in particular seems to often include foreign words literally (or is that a kind of pedantry from highly educated people?). In fact, users are free to include whatever characters they like, as long as their text-composition interface allows it. All main OSes, I guess, now have at least one standard way to type in characters (or code points) that are not directly accessible on keyboards, and applications sometimes offer another.
Some kinds of users love to play with such flexibility. So, maybe, the right question is not one of natural language but of text-composition means. I guess that as soon as a human user may have freely typed or edited a text, we cannot guarantee much about its actual content, what do you think?
The case of historic ASCII-only text is relevant, indeed, but will quickly become less so. And how does an application writer recognise such text without iterating over the whole content? (The encoding is UTF-8 compatible.)

Denis
_________________
vita es estrany
spir.wikidot.com

January 17, 2011

Re: VLERange: a range in between BidirectionalRange and

Posted by Andrei Alexandrescu
in reply to Jonathan M Davis

On 1/17/11 6:25 AM, Jonathan M Davis wrote:
> On Monday 17 January 2011 04:08:08 Steven Schveighoffer wrote:
>> On Sat, 15 Jan 2011 17:19:48 -0500, foobar<foo@bar.com>  wrote:
>>> I like Michel's proposed semantics and I also agree with you that it
>>> should be a distinct string type and not break consistency of regular
>>> arrays.
>>>
>>> Regarding your last point: Do you mean that a grapheme would be a
>>> sub-type of string (a specialization where the string represents a
>>> single element)? If so, then it sounds good to me.
>>
>> A grapheme would be its own specialized type.  I'd probably remove the
>> range primitives to really differentiate it.  Unfortunately, due to the
>> inability to statically check this, the invariant would have to be a
>> runtime check.  Most likely this check would be disabled in release mode.
>>
>> This can cause problems, and I can see why it is attractive to use strings
>> to implement graphemes, but that also has its problems.  With grapheme
>> being its own type, we are providing a way to optimize functions, and
>> allow further restrictions on function parameters.
>>
>> At the end of the day, perhaps grapheme *should* just be a string.  We'll
>> have to see how this breaks in practice, either way.
>
> I think that it would make good sense for a grapheme to be a struct which holds a
> string, as Andrei suggested:
>
> struct Grapheme(Char) if (isSomeChar!Char)
> {
>       private const Char[] rep;
>       ...
> }
>
> I really think that trying to use strings to represent graphemes is asking for
> it. The element of a range should be a different type than that of the range
> itself.
>
> - Jonathan M Davis

If someone makes a careful submission of a Grapheme to Phobos as described above, it has a high chance of being accepted.

Andrei

January 17, 2011

Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Posted by Andrei Alexandrescu
in reply to Steven Schveighoffer

On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
> On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 1/15/11 9:25 PM, Jonathan M Davis wrote:
>>> Considering that strings are already dealt with specially in order to
>>> have an
>>> element of dchar, I wouldn't think that it would be all that
>>> disruptive to make
>>> it so that they had an element type of Grapheme instead. Wouldn't
>>> that then fix
>>> all of std.algorithm and the like without really disrupting anything?
>>
>> It would make everything related a lot (a TON) slower, and it would
>> break all client code that uses dchar as the element type, or is
>> otherwise unprepared to use Graphemes explicitly. There is no question
>> there will be disruption.
>
> I would have agreed with you last week. Now I understand that using
> dchar is just as useless for unicode as using char.

This is one extreme. Char only works for English. Dchar works for most languages. It won't work for a few. That doesn't make it useless for languages that work with it.
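For reference, a small sketch of the case where iterating by dchar and iterating by grapheme disagree. It assumes a current Phobos, where `std.uni` offers `byGrapheme` and `normalize` (neither existed at the time of this thread); the sample strings are chosen for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize;

void main()
{
    string precomposed = "\u00E9";  // é as a single code point
    string decomposed  = "e\u0301"; // e followed by a combining acute accent

    // By dchar (code point), the two spellings of the same character differ.
    assert(precomposed.walkLength == 1);
    assert(decomposed.walkLength == 2);
    assert(precomposed != decomposed);

    // By grapheme, both are one user-perceived character.
    assert(precomposed.byGrapheme.walkLength == 1);
    assert(decomposed.byGrapheme.walkLength == 1);

    // Normalizing to NFC (the default form) composes the pair into one code point.
    assert(normalize(decomposed) == precomposed);
}
```

For text that is already in NFC and uses only characters with precomposed forms, the dchar view and the grapheme view coincide.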

> Will it be slower? Perhaps. A TON slower? Probably not.

It will be a ton slower.

> But it will be correct. Correct and slow is better than incorrect and
> fast. If I showed you a shortest-path algorithm that ran in O(V) time,
> but didn't always find the shortest path, would you call it a success?

The comparison doesn't apply.

> We need to get some real numbers together. I'll see what I can create
> for a type, but someone else needs to supply the input :) I'm on short
> supply of unicode data, and any attempts I've made to create some result
> in failure. I have one example of one composed character in this thread
> that I can cling to, but in order to supply some real numbers, we need a
> large amount of data.

I very much appreciate that you're doing actual work on this.


Andrei

Copyright © 1999-2021 by the D Language Foundation