October 24, 2012
On Wednesday, October 24, 2012 13:39:50 Timon Gehr wrote:
> You realize that the proposed solution is that arrays of code units would no longer be arrays of code units?

Yes and no. They'd be arrays of code units, but any operations on them which weren't Unicode-safe would require using the rep property. So, for instance, using ptr on them to pass to C functions would be fine, but slicing wouldn't. It definitely would be a case of violating the turtles-all-the-way-down principle, because arrays of code units wouldn't really be proper arrays anymore, but as long as they're treated as actual arrays, they _will_ be misused. The trick is doing something that's both correct and reasonably efficient by default while still allowing fully efficient code if you code with an understanding of Unicode, and to do that, you can't have arrays of code units like we do now. But for better or worse, that doesn't look like it's going to change.
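
For example, with today's arrays of code units, slicing blindly cuts into code points - a quick, runnable illustration:

    import std.stdio : writeln;

    void main()
    {
        string s = "Пи";
        writeln(s.length);  // 4: code units, not characters
        writeln(s[0 .. 1]); // a lone 0xD0 byte: slicing works on code units
    }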

What we have right now actually works quite well if you understand the issues involved, but it's not newbie friendly at all.

- Jonathan M Davis
October 24, 2012
Thanks, that is exactly what I wanted to clarify.
Can I do a pull request for this, or do you plan to?
October 24, 2012
On Wednesday, 24 October 2012 at 12:03:10 UTC, Jonathan M Davis wrote:
> On Wednesday, October 24, 2012 13:39:50 Timon Gehr wrote:
>> You realize that the proposed solution is that arrays of code units
>> would no longer be arrays of code units?
>
> Yes and no. They'd be arrays of code units, but any operations on them
> which weren't Unicode-safe would require using the rep property. So, for
> instance, using ptr on them to pass to C functions would be fine, but
> slicing wouldn't. It definitely would be a case of violating the
> turtles-all-the-way-down principle, because arrays of code units wouldn't
> really be proper arrays anymore, but as long as they're treated as actual
> arrays, they _will_ be misused. The trick is doing something that's both
> correct and reasonably efficient by default while still allowing fully
> efficient code if you code with an understanding of Unicode, and to do
> that, you can't have arrays of code units like we do now. But for better
> or worse, that doesn't look like it's going to change.
>
> What we have right now actually works quite well if you understand the issues
> involved, but it's not newbie friendly at all.
>
> - Jonathan M Davis

What about a compromise - turning this proposal upside down and requiring something like "utfstring".decode to operate on symbols? (There is front & co. in std.array, but I am thinking of something more tightly coupled to strings.) It would remove the necessity of copy-pasting the very same checks into all algorithms and would move the decision about using code points vs. code units to the user's side. Yes, it does not prohibit a lot of senseless operations, but at least it is a consistent approach. I personally believe that not being able to tell what to expect from a basic algorithm/operation applied to a string (without looking at the library source code) is a much more difficult situation than the necessity of properly understanding Unicode.
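
Roughly, something like this (just a sketch - the range type and the name decoded are made up here; std.utf.decode does the real work):

    import std.utf : decode;

    // A range of code points that the caller must ask for explicitly,
    // so the code unit vs. code point decision is made at the call site.
    struct Decoded
    {
        string str;

        @property bool empty() { return str.length == 0; }

        @property dchar front()
        {
            size_t i = 0;
            return decode(str, i); // decode one code point, don't advance
        }

        void popFront()
        {
            size_t i = 0;
            decode(str, i);     // find the front code point's length
            str = str[i .. $];  // and slice past it
        }
    }

    Decoded decoded(string s) { return Decoded(s); }

    unittest
    {
        import std.algorithm : equal;
        assert("Пи".decoded.equal("Пи"d));
    }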
October 24, 2012
On Wednesday, October 24, 2012 04:54:36 Jonathan M Davis wrote:
> That being said, there _is_ a bug in commonPrefix that I just noticed when looking it over. It currently operates on code units rather than code points. It can operate on strings just fine like it's doing now (even returning a slice), but it needs to decode the code points as it iterates over them, and it's not doing that.

Wait. No. I think that it's (mostly) okay. I was thinking that you could have different sequences of code units which resolved to the same code point, and upon reflection, I don't think that you can. It's graphemes which can be represented by multiple sequences of code points, not code points which can be represented by multiple sequences of code units (Unicode is overly confusing, to say the least).

There's still an issue with the predicate though (hence the "mostly" above). If anything _other_ than == or != is used, then the code units would have to be decoded in order to pass dchars to the predicate. So, commonPrefix should be fine as-is in all cases except for when a custom predicate is given, and it's operating on narrow strings.

- Jonathan M Davis
October 24, 2012
Wait. So you consider commonPrefix returning a malformed string to be fine? You have lost me here. For example, for the code sample given above, the output is:

==========
Пи
П[\D0]
==========

The problem is that if you use == on code units, you can match only part of a valid symbol.
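
A minimal repro (this assumes commonPrefix compares code units, as it currently does):

    import std.algorithm : commonPrefix;
    import std.stdio : writeln;

    void main()
    {
        // 'П' is 0xD0 0x9F; 'и' (0xD0 0xB8) and 'о' (0xD0 0xBE) share
        // their first code unit, so a code-unit comparison stops after
        // the 0xD0 and returns a slice that ends mid-code-point.
        writeln(commonPrefix("Пи", "По")); // П plus a lone 0xD0 byte
    }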
October 24, 2012
On Wednesday, 24 October 2012 at 06:43:13 UTC, Simen Kjaeraas wrote:
> As long as typeof("") != String, this is not going to work:
>
> auto s = "";

Gah, I hate literals.
October 24, 2012
On Wednesday, October 24, 2012 14:37:33 mist wrote:
> Wait. So you consider commonPrefix returning a malformed string to be fine? You have lost me here. For example, for the code sample given above, the output is:
> 
> ==========
> Пи
> П[\D0]
> ==========
> 
> The problem is that if you use == on code units, you can match only part of a valid symbol.

Hmmm. Let me think this through for a moment. Every code point starts with a code unit that tells you how many code units are in the code point, and each code point should have only one sequence of code units which represents it, so something like find or startsWith should be able to just use code units. commonPrefix is effectively doing a startsWith/find, but it short-circuits as soon as there's a difference, and that _could_ be in the middle of a code point, since you could have a code point with 3 code units where the first 2 match but not the third one. So, yes. There is a bug here.

Now, a full decode still isn't necessary. It just has to keep track of how long the code point is and return a slice that ends at the end of the previous code point when only part of a code point matches, but you've definitely found a bug.
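
Something along these lines, I'd guess (an untested sketch - instead of tracking lengths, it backs up over UTF-8 continuation bytes, which always have the form 0b10xxxxxx):

    import std.algorithm : min;

    // commonPrefix for narrow strings that never splits a code point.
    inout(char)[] commonPrefixNarrow(inout(char)[] a, const(char)[] b)
    {
        size_t i = 0;
        immutable n = min(a.length, b.length);
        while (i < n && a[i] == b[i])
            ++i;
        // If the mismatch landed inside a code point, back up to its
        // start; continuation bytes look like 0b10xxxxxx.
        while (i > 0 && i < a.length && (a[i] & 0xC0) == 0x80)
            --i;
        return a[0 .. i];
    }

    unittest
    {
        assert(commonPrefixNarrow("Пи", "По") == "П");
        assert(commonPrefixNarrow("abc", "abd") == "ab");
    }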

- Jonathan M Davis
October 24, 2012
On Wed, Oct 24, 2012 at 12:38:41PM -0700, Jonathan M Davis wrote:
> On Wednesday, October 24, 2012 14:37:33 mist wrote:
> > Wait. So you consider commonPrefix returning a malformed string to be fine? You have lost me here. For example, for the code sample given above, the output is:
> > 
> > ==========
> > Пи
> > П[\D0]
> > ==========
> > 
> > The problem is that if you use == on code units, you can match only part of a valid symbol.
> 
> Hmmm. Let me think this through for a moment. Every code point starts with a code unit that tells you how many code units are in the code point, and each code point should have only one sequence of code units which represents it, so something like find or startsWith should be able to just use code units. commonPrefix is effectively doing a startsWith/find, but it short-circuits as soon as there's a difference, and that _could_ be in the middle of a code point, since you could have a code point with 3 code units where the first 2 match but not the third one. So, yes. There is a bug here.
> 
> Now, a full decode still isn't necessary. It just has to keep track of how long the code point is and return a slice that ends at the end of the previous code point when only part of a code point matches, but you've definitely found a bug.
[...]

For many algorithms, full decode is not necessary. This is something that Phobos should take advantage of (at least in theory; I'm not sure how practical this is with the current codebase).

Actually, in the above case, *no* decode is necessary at all. UTF-8 was designed specifically for this: if you see a byte with its highest bits set to 0b10, that means you're in the middle of a code point. You can scan forwards or backwards until the first byte whose highest bits aren't 0b10; that's guaranteed to be the start of a code point (provided the original string is actually well-formed UTF-8). There is no need to keep track of length at all.

Many algorithms can be optimized to take advantage of this. Counting the number of code points is simply counting the number of bytes whose highest bits are not 0b10. Given some arbitrary offset into a char[], you can use std.range.radial to find the nearest code point boundary (i.e., byte whose upper bits are not 0b10).
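
For instance, the counting trick as code (a sketch; it assumes well-formed UTF-8):

    // Count code points by counting the bytes that are not
    // continuation bytes (i.e. not of the form 0b10xxxxxx).
    size_t countCodePoints(const(char)[] s)
    {
        size_t n = 0;
        foreach (c; s)
            if ((c & 0xC0) != 0x80)
                ++n;
        return n;
    }

    unittest
    {
        assert(countCodePoints("abc") == 3);
        assert(countCodePoints("Пи") == 2); // 4 code units, 2 code points
    }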

Given a badly-truncated UTF-8 string (i.e., it got cut in the middle of a code point), you can recover the still-valid substring by deleting the bytes with high bits 0b10 at the beginning/end of the string. You'll lose the truncated code point, but the rest of the string is still usable.
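
That recovery step might look like this (untested sketch; it assumes the string was well-formed UTF-8 before being truncated at the end):

    // Trim a UTF-8 string that may have been cut mid-code-point back
    // to its longest valid prefix.
    const(char)[] trimTruncated(const(char)[] s)
    {
        if (s.length == 0 || (s[$ - 1] & 0x80) == 0)
            return s; // empty or ends in ASCII: nothing to do

        // Find the lead byte of the final (possibly partial) code point.
        size_t start = s.length - 1;
        while (start > 0 && (s[start] & 0xC0) == 0x80)
            --start;

        // The lead byte encodes the sequence length: 110xxxxx -> 2 bytes,
        // 1110xxxx -> 3 bytes, 11110xxx -> 4 bytes.
        immutable c = s[start];
        immutable size_t expected =
            (c & 0xE0) == 0xC0 ? 2 :
            (c & 0xF0) == 0xE0 ? 3 : 4;

        // Keep the final code point only if all of its bytes are there.
        return s.length - start == expected ? s : s[0 .. start];
    }

    unittest
    {
        assert(trimTruncated("Пи") == "Пи");        // already valid
        assert(trimTruncated("Пи"[0 .. 3]) == "П"); // cut mid-'и'
    }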

Etc.


T

-- 
Marketing: the art of convincing people to pay for what they didn't need before which you can't deliver after.
October 24, 2012
On Wednesday, October 24, 2012 12:53:23 H. S. Teoh wrote:
> For many algorithms, full decode is not necessary. This is something that Phobos should take advantage of (at least in theory; I'm not sure how practical this is with the current codebase).

It does take advantage of it in a number of cases, but not necessarily everywhere that it could. One major issue with ranges, though, is that if you've wrapped a string in a range at all (via map, filter, take, or whatever), then the resultant range is forced to decode on every call to front or popFront (well, partially decode on popFront anyway), whereas functions can special-case strings to avoid extraneous decoding. So, you can take a performance hit if you're operating on wrapped strings rather than on strings directly.
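
A small illustration of the decoding that wrapping forces (nothing here is special to filter - any wrapper does the same):

    import std.algorithm : filter;

    void main()
    {
        auto r = "Пи".filter!(c => c != 'и');
        // The wrapped range's element type is dchar, so every call to
        // front has to decode code units, whether the caller needs
        // code points or not.
        static assert(is(typeof(r.front) == dchar));
    }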

> Actually, in the above case, *no* decode is necessary at all. UTF-8 was designed specifically for this: if you see a byte with its highest bits set to 0b10, that means you're in the middle of a code point. You can scan forwards or backwards until the first byte whose highest bits aren't 0b10; that's guaranteed to be the start of a code point (provided the original string is actually well-formed UTF-8). There is no need to keep track of length at all.

I wouldn't say that "no" decoding is necessary. Rather, I'd say that partial decoding is necessary. If you have to examine the code units to determine where code points are or how long they are or whatnot, then you're still doing part of what decode has to do, whereas a function like find can forgo checking any of that entirely and merely compare the values of the code units. _That_'s what I'd consider to be no decoding required, and commonPrefix is buggy precisely because it's doing no decoding rather than partial decoding. But I suppose that's arguing semantics.

- Jonathan M Davis
October 25, 2012
On Wednesday, October 24, 2012 14:18:04 mist wrote:
> What about a compromise - turning this proposal upside down and requiring something like "utfstring".decode to operate on symbols? (There is front & co. in std.array, but I am thinking of something more tightly coupled to strings.) It would remove the necessity of copy-pasting the very same checks into all algorithms and would move the decision about using code points vs. code units to the user's side. Yes, it does not prohibit a lot of senseless operations, but at least it is a consistent approach.

I'm afraid that I don't understand what you're proposing.

> I personally believe that not being able to tell what to expect
> from a basic algorithm/operation applied to a string (without
> looking at the library source code) is a much more difficult
> situation than the necessity of properly understanding Unicode.

Well, to use ranges in general, you need to understand hasLength, hasSlicing, isRandomAccessRange, etc., and you need to understand what it means when template constraints fail based on those templates. That being the case, strings are no different from any other range: if they fail to instantiate with a particular function, then you need to look at the template constraints and see what the function requires, and sometimes you just can't know without looking at the template constraints, because it's not always obvious which range operations a particular function will require based only on what it's supposed to do. The main issue is understanding which range-based operations arrays have that narrow strings don't. Then, when a template fails to instantiate because a string isn't random-access or sliceable or whatnot, you understand why rather than getting totally confused about int[] working and char[] not working.
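
Concretely, these are the traits in question (all of these checks should pass with Phobos as it stands):

    import std.range;

    // Dynamic arrays support the full set of range operations...
    static assert(isRandomAccessRange!(int[]));
    static assert(hasSlicing!(int[]));
    static assert(hasLength!(int[]));

    // ...but narrow strings are treated as ranges of dchar, so they
    // deliberately fail the traits that would slice raw code units.
    static assert(!isRandomAccessRange!(char[]));
    static assert(!hasSlicing!(char[]));
    static assert(!hasLength!(char[]));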

- Jonathan M Davis