May 31, 2016
On 05/31/2016 07:21 PM, Andrei Alexandrescu wrote:
> Could you please substantiate that? My understanding is that code unit
> is a higher-level Unicode notion independent of encoding, whereas code
> point is an encoding-dependent representation detail. -- Andrei

You got the terms mixed up. Code unit is lower level. Code point is higher level.

One code point is encoded with one or more code units. char is a UTF-8 code unit. wchar is a UTF-16 code unit. dchar is both a UTF-32 code unit and a code point, because in UTF-32 it's a 1-to-1 relation.
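
The relation between code units and code points can be checked directly in D. The following is only a small sketch, not something posted in the thread; the sample text and the emoji code point are chosen purely for illustration.

```d
import std.stdio : writeln;

void main()
{
    // The same three-character text in each of D's string types.
    string  s8  = "日本語";  // immutable(char)[]:  UTF-8 code units
    wstring s16 = "日本語"w; // immutable(wchar)[]: UTF-16 code units
    dstring s32 = "日本語"d; // immutable(dchar)[]: UTF-32 code units

    // .length counts code units, so the result depends on the encoding.
    writeln(s8.length);  // 9: each of these code points takes three UTF-8 code units
    writeln(s16.length); // 3: each fits in a single UTF-16 code unit
    writeln(s32.length); // 3: UTF-32 is always one code unit per code point

    // A code point outside the Basic Multilingual Plane (illustrative choice).
    writeln("\U0001F600".length);  // 4 UTF-8 code units
    writeln("\U0001F600"w.length); // 2 UTF-16 code units (a surrogate pair)
    writeln("\U0001F600"d.length); // 1 UTF-32 code unit
}
```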
May 31, 2016
On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
>
> The 'length' of a character is not one in all contexts.
> The following text takes six columns in my terminal:
>
> 日本語
> 123456

That's a property of your font and font rendering engine, not Unicode. (Also, it's probably not quite six columns; most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.)

-Wyatt
May 31, 2016
On Tuesday, May 31, 2016 21:20:19 Timon Gehr via Digitalmars-d wrote:
> On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:
> > On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
> >> On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
> >>> On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
> >>>> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
> >>>>> Saying that operating at the code point level - UTF-32 - is correct
> >>>>> is like saying that operating at UTF-16 instead of UTF-8 is correct.
> >>>>
> >>>> Could you please substantiate that? My understanding is that code unit
> >>>> is a higher-level Unicode notion independent of encoding, whereas code
> >>>> point is an encoding-dependent representation detail. -- Andrei
> >>
> >> Does walkLength yield the same number for all representations?
> >
> > walkLength treats a code point like it's a character. My point is that
> > that's incorrect behavior. It will not result in correct string processing
> > in the general case, because a code point is not guaranteed to be a
> > full character.
> > ...
>
> What's "correct"? Maybe the user intended to count the number of code points in order to pre-allocate a dchar[] of the correct size.
>
> Generally, I don't see how algorithms become magically "incorrect" when applied to utf code units.

In the vast majority of cases what folks care about is full characters, which is not what code points are. But the fact that they want different things in different situations just highlights that converting everything to code points by default is a bad idea. Even worse, code points are usually the worst choice. Many operations don't require decoding and can be done at the code unit level, meaning that operating at the code point level is just plain inefficient. And the vast majority of the operations that can't operate at the code point level need to operate on full characters, which means that they need to operate at the grapheme level. Code points are in this weird middle ground that's useful in some cases but usually isn't what you want or need.

We need to be able to operate at the code unit level, the code point level, and the grapheme level. But defaulting to the code point level really makes no sense.
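
As a rough illustration of choosing the level explicitly, the sketch below counts one string three ways using `byCodeUnit` and `byDchar` from std.utf and `byGrapheme` from std.uni. The `Counts` struct and the `countAll` helper are hypothetical names made up for this example; they are not part of Phobos.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

/// Hypothetical helper type: the size of one string at each level.
struct Counts { size_t codeUnits, codePoints, graphemes; }

/// Hypothetical helper: the caller sees all three counts side by side.
Counts countAll(string s)
{
    return Counts(
        s.byCodeUnit.walkLength,  // no decoding; same value as s.length
        s.byDchar.walkLength,     // explicit decoding to code points
        s.byGrapheme.walkLength); // grapheme clusters ("full characters")
}

void main()
{
    import std.stdio : writeln;

    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    auto c = countAll("e\u0301");
    writeln(c.codeUnits, " ", c.codePoints, " ", c.graphemes); // 3 2 1
}
```

For pure ASCII input all three numbers agree, which is one reason code that picks the wrong level can go unnoticed for a long time.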

> > walkLength does not report the length of a character as one in all cases just like length does not report the length of a character as one in all cases. walkLength is counting bigger units than length, but it's still counting pieces of a character rather than counting full characters.
>
> The 'length' of a character is not one in all contexts.
> The following text takes six columns in my terminal:
>
> 日本語
> 123456

Well, that's getting into displaying characters, which is a whole other can of worms, but it also highlights that assuming that the programmer wants a particular level of Unicode is not a particularly good idea and that we should avoid converting for them without being asked, since it risks being inefficient to no benefit.

- Jonathan M Davis


May 31, 2016
On Tue, May 31, 2016 at 07:40:13PM +0000, Wyatt via Digitalmars-d wrote:
> On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
> > 
> > The 'length' of a character is not one in all contexts.
> > The following text takes six columns in my terminal:
> > 
> > 日本語
> > 123456
> 
> That's a property of your font and font rendering engine, not Unicode. (Also, it's probably not quite six columns; most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.)
[...]

I believe he was talking about a console terminal that uses 2 columns to render the so-called "double width" characters. The CJK block does contain "double-width" versions of selected blocks (e.g., the ASCII block), to be used with said characters.

Of course, using string length to measure string width is a risky venture fraught with pitfalls, because your terminal may not actually render them the way you think it should. Nevertheless, it does serve to highlight why a construct like s.walkLength is essentially buggy, because there is not enough information to determine which length it should return -- length of the buffer in bytes, or the number of code points, or the number of graphemes, or the width of the string. No matter which choice you make, it only works for a subset of cases and is wrong for the other cases.  This is a prime illustration of why forcing autodecoding on every string in D is a wrong design.


T

-- 
Не дорог подарок, дорога любовь.
May 31, 2016
On 31-May-2016 01:00, Walter Bright wrote:
> On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
>> I don't agree on changing those. Indexing and slicing a char[] is really useful
>> and actually not hard to do correctly (at least with regard to handling code
>> units).
>
> Yup. It isn't hard at all to use arrays of codeunits correctly.

Ehm, as long as all you care about is operating on substrings, I'd say.
Working with individual characters requires either decoding or clever tricks like operating on encoded UTF directly.
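
One reading of "operating on encoded UTF directly" is to encode the character being searched for once and then compare raw code units, relying on the fact that a valid UTF-8 sequence can only match at a code point boundary. The sketch below takes that approach; `indexOfEncoded` is a hypothetical helper written for this example, not a Phobos function, and it assumes the haystack is valid UTF-8.

```d
import std.algorithm.searching : countUntil;
import std.string : representation;
import std.utf : encode;

/// Hypothetical helper: code unit index of the first occurrence of `c`
/// in `s`, or -1 if absent. The haystack is never decoded.
ptrdiff_t indexOfEncoded(string s, dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c);                // UTF-8 form of c, 1 to 4 code units
    auto needle = cast(const(ubyte)[]) buf[0 .. len];
    return s.representation.countUntil(needle);    // plain byte-wise search
}

void main()
{
    import std.stdio : writeln;

    writeln(indexOfEncoded("日本語", '本')); // 3: the second character starts at byte 3
    writeln(indexOfEncoded("abc", 'x'));     // -1
}
```

Note that this finds a code point, not a grapheme: searching for 'e' this way would also match the base letter of "e" plus a combining accent.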

-- 
Dmitry Olshansky
May 31, 2016
On 31.05.2016 21:40, Wyatt wrote:
> On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
>>
>> The 'length' of a character is not one in all contexts.
>> The following text takes six columns in my terminal:
>>
>> 日本語
>> 123456
>
> That's a property of your font and font rendering engine, not Unicode.

Sure. Hence "context". If you are e.g. trying to manually underline some text in console output, for example in a compiler error message, counting the number of characters will not actually be what you want, even though it works reliably for ASCII text.

> (Also, it's probably not quite six columns; most fonts I've tested, 漢字
> are rendered as something like 1.5 characters wide, assuming your
> terminal doesn't overlap them.)
>
> -Wyatt

It's precisely six columns in my terminal (also in emacs and in gedit).

My point was, how can std.algorithm ever guess correctly what you /actually/ intended to do?
May 31, 2016
On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
> On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> [...]
>> Does walkLength yield the same number for all representations?
>
> Let's put the question this way. Given the following string, what do
> *you* think walkLength should return?

Compiler error.

-Steve
May 31, 2016
On Tuesday, May 31, 2016 15:33:38 Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote:
> > walkLength treats a code point like it's a character.
>
> No, it treats a code point like it's a code point. -- Andrei

Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up as is so easy to do when operating at the code unit level? Thanks to how Phobos treats strings as ranges of dchar, most D code treats code points as if they were characters. So, whether it's correct or not, a _lot_ of D code is treating walkLength like it returns the number of characters in a string. And if walkLength doesn't provide the number of characters in a string, why would I want to use it under normal circumstances? Why would I want to be operating at the code point level in my code? It's not necessarily a full character, since it's not necessarily a grapheme. So, by using walkLength and front and popFront and whatnot with strings, I'm not getting full characters. I'm still only getting pieces of characters - just like would happen if strings were treated as ranges of code units. I'm just getting bigger pieces of the characters out of the deal. But if they're not full characters, what's the point?

I am sure that there is code that is going to want to operate at the code point level, but your average program is either operating on strings as a whole or individual characters. As long as strings are being operated on as a whole, code units are generally plenty, and careful encoding of characters into code units for comparisons means that much of the time that you want to operate on individual characters, you can still operate at the code unit level. But if you can't, then you need the grapheme level, because a code point is not necessarily a full character.

So, what is the point of operating on code points in your average D program? walkLength will not always tell me the number of characters in a string. front risks giving me a partial character rather than a whole one. Slicing dchar[] risks chopping up characters just like slicing char[] does. Operating on code points by default does not result in correct string processing.
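
A small sketch of the three failure modes just listed, assuming a Phobos in which narrow strings are still autodecoded. The input, "e" followed by U+0301 COMBINING ACUTE ACCENT, is picked for illustration only.

```d
import std.range : front, walkLength;
import std.uni : byGrapheme;

void main()
{
    import std.stdio : writeln;

    string s = "e\u0301"; // one user-perceived character, two code points

    writeln(s.walkLength);            // 2: autodecoding counts code points
    writeln(s.byGrapheme.walkLength); // 1: the number of full characters
    writeln(s.front == 'e');          // true: front yields only the base letter

    // Slicing UTF-32 between code points still splits the character.
    dstring d = "e\u0301"d;
    writeln(d[0 .. 1] == "e"d);       // true: the accent has been cut off
}
```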

I honestly don't see how autodecoding is defensible. We may not be able to get rid of it due to the breakage that doing that would cause, but I fail to see how it is at all desirable that we have autodecoded strings. I can understand how we got it if it's based on a misunderstanding on your part about how Unicode works. We all make mistakes. But I fail to see how autodecoding wasn't a mistake. It's the worst of both worlds - inefficient while still incorrect. At least operating at the code unit level would be fast while being incorrect, and it would be obviously incorrect once you did anything with non-ASCII values, whereas it's easy to miss that ranges of dchar are doing the wrong thing too.

- Jonathan M Davis

May 31, 2016
On Tuesday, May 31, 2016 21:48:36 Timon Gehr via Digitalmars-d wrote:
> On 31.05.2016 21:40, Wyatt wrote:
> > On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
> >> The 'length' of a character is not one in all contexts.
> >> The following text takes six columns in my terminal:
> >>
> >> 日本語
> >> 123456
> >
> > That's a property of your font and font rendering engine, not Unicode.
>
> Sure. Hence "context". If you are e.g. trying to manually underline some text in console output, for example in a compiler error message, counting the number of characters will not actually be what you want, even though it works reliably for ASCII text.
>
> > (Also, it's probably not quite six columns; most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.)
> >
> > -Wyatt
>
> It's precisely six columns in my terminal (also in emacs and in gedit).
>
> My point was, how can std.algorithm ever guess correctly what you /actually/ intended to do?

It can't, which is precisely why having it select for you was a bad design decision. The programmer needs to be making that decision. And the fact that Phobos currently makes that decision for you means that it's often doing the wrong thing, and the fact that it chose to decode code points by default means that it's often eating up unnecessary cycles to boot.

- Jonathan M Davis


May 31, 2016
On Tuesday, May 31, 2016 22:47:56 Dmitry Olshansky via Digitalmars-d wrote:
> On 31-May-2016 01:00, Walter Bright wrote:
> > On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
> >> I don't agree on changing those. Indexing and slicing a char[] is really useful
> >> and actually not hard to do correctly (at least with regard to handling code
> >> units).
> >
> > Yup. It isn't hard at all to use arrays of codeunits correctly.
>
> Ehm, as long as all you care about is operating on substrings, I'd say. Working with individual characters requires either decoding or clever tricks like operating on encoded UTF directly.

Yeah, but Phobos provides the tools to do that reasonably easily even when autodecoding isn't involved. Sure, it's slightly more tedious to call std.utf.decode or std.utf.encode yourself rather than letting autodecoding take care of it, but it's easy enough to do and allows you to control when it's done. And we have stuff like byDchar or byGrapheme for the cases where you don't want to actually operate on arrays of code units.
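
For instance, decoding by hand with std.utf.decode might look like the sketch below. `codePointsOf` is a hypothetical helper written for this example; the point is only that the caller decides where decoding happens.

```d
import std.utf : decode;

/// Hypothetical helper: decodes a UTF-8 string into its code points,
/// one explicit decode call at a time.
dchar[] codePointsOf(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i); // decode advances i past the code units it consumed
    return result;
}

void main()
{
    import std.stdio : writeln;

    writeln(codePointsOf("日本語").length); // 3 code points from 9 code units
}
```

decode throws a UTFException on invalid input by default, so real code would need to decide how to handle malformed text.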

- Jonathan M Davis