December 31, 2011
Andrei Alexandrescu:

> We need .raw and we must abolish .length and [] for narrow strings.

I don't know if we need them, but I agree those things would be an improvement over the current state. To replace the disabled slicing, I think something like Python's islice() would be useful.
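For instance (a sketch, not a concrete proposal; it assumes std.range's drop and take):

import std.range : drop, take;

void main()
{
    // Rough equivalent of Python's islice(s, 1, 4) over the decoded
    // code points of a narrow string: walk the range, don't index it.
    auto slice = "héllo".drop(1).take(3);  // yields 'é', 'l', 'l'
}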

Bye,
bearophile
December 31, 2011
On 31.12.2011 01:56, Timon Gehr wrote:
> On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
>> On 12/30/11 6:07 PM, Timon Gehr wrote:
>>> alias std.string.representation raw;
>>
>> I meant your implementation is incomplete.
>
> It was more a sketch than an implementation. It is not even type safe :o).
>
>>
>> But the main point is that presence of representation/raw is not the
>> issue.
>> The availability of good-for-nothing .length and operator[] are
>> the issue. Putting in place the convention of using .raw is hardly
>> useful within the context.
>>
>
> D strings are arrays. An array without .length and operator[] is close
> to being good for nothing. The language specification is quite clear
> about the fact that e.g. char is not a character but a UTF-8 code unit.
> Therefore char[] is an array of code units.

No, it isn't. That's the problem. char[] is not an array of char.
It has an additional invariant: it is a UTF-8 string. If you randomly change elements, the invariant is violated.

In reality, char[] and wchar[] are compressed forms of dstring.
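A small illustration of that invariant being broken (a sketch; std.utf.validate is just one way to observe it):

import std.utf : validate;

void main()
{
    char[] s = "héllo".dup; // 'é' occupies two code units: 0xC3 0xA9
    validate(s);            // fine: well-formed UTF-8
    s[1] = 'x';             // overwrites the first byte of 'é'
    // validate(s) would now throw: 0xA9 is a stray continuation byte
}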

> .raw would return ubyte[], therefore it
> would lose all type information. Effectively, what .raw does is a type
> cast that will let code point data alias with integral data.

Exactly. It's just an "I know what I'm doing" signal.
December 31, 2011
On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 12/31/11 2:04 AM, Walter Bright wrote:
> 
>> We're chasing phantoms here, and I worry a lot about over-engineering
>> trivia.
> 
I disagree. I understand that it seems trivial to you, but that doesn't make your opinion any less wrong, not to mention provincial in its insistence that it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself, after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But its being moot does not oblige me to concede the point. I am right.

Perfect? At one time Java and other frameworks started to treat UTF-16 code units as if they were characters, and that turned out badly for them. Now we know that not even code points should be considered characters, thanks to characters spanning multiple code points. You might call it perfect, but for that you have made two assumptions:

1. treating code points as characters is good enough, and
2. the performance penalty of decoding everything is tolerable

Ranges of code points might be perfect for you, but it's a tradeoff that won't work in every situation.

The whole concept of generic algorithms working on strings efficiently doesn't work. Applying generic algorithms to strings by treating them as a range of code points is both wasteful (because it forces you to decode everything) and incomplete (because of multi-code-point characters), and it should be avoided. Algorithms working on Unicode strings should be designed with Unicode in mind. And the best way to design efficient Unicode algorithms is to access the array of code units directly, read each character at the level of abstraction required, and know what you're doing.
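A sketch of that style (countLetters is a hypothetical helper; it assumes std.uni.isAlpha for the non-ASCII path):

import std.uni : isAlpha;
import std.utf : decode;

// Count alphabetic characters, staying at the code-unit level for
// ASCII and decoding a code point only when a multi-byte sequence
// actually starts.
size_t countLetters(string s)
{
    size_t n, i;
    while (i < s.length)
    {
        char c = s[i];
        if (c < 0x80)   // ASCII: no decoding needed
        {
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
                ++n;
            ++i;
        }
        else            // non-ASCII: decode exactly one code point
        {
            if (isAlpha(decode(s, i)))  // decode advances i
                ++n;
        }
    }
    return n;
}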

I'm not against making strings more opaque to encourage people to use the Unicode algorithms from the standard library instead of rolling their own. But I doubt the current approach of using .raw alone will prevent many from doing dumb things. On the other hand, I'm sure it'll make it more complicated to write Unicode algorithms, because accessing and especially slicing the raw content of char[] will become tiresome. I'm not convinced it's a net win.

As for Walter being the only one coding by looking at the code units directly, that's not true. All my parser code looks at code units directly and only decodes to code points where necessary (just look at the XML parsing code I posted a while ago to get an idea of how it can apply to ranges). And I don't think it's because I've seen Walter's code before; I think it's because I know how Unicode works and I want to make my parser efficient. I've done the same for a parser in C++ a while ago. I can hardly imagine I'm the only one (with Walter and you). I think this is how efficient algorithms dealing with Unicode should be written.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

December 31, 2011
On 12/31/11 8:17 CST, Michel Fortin wrote:
> On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> On 12/31/11 2:04 AM, Walter Bright wrote:
>>
>>> We're chasing phantoms here, and I worry a lot about over-engineering
>>> trivia.
>>
>> I disagree. I understand that it seems trivial to you, but that doesn't
>> make your opinion any less wrong, not to mention provincial in its
>> insistence that it's applicable beyond a small team of experts. Again: I
>> know no other - I literally mean not one - person who writes string
>> code like you do (and myself, after learning it from you); the current
>> system is adequate; the proposed system is perfect - save for breaking
>> backwards compatibility, which makes the discussion moot. But its being
>> moot does not oblige me to concede the point. I am right.
>
> Perfect?

Sorry, I exaggerated. I meant "a net improvement while keeping simplicity".

> At one time Java and other frameworks started to treat UTF-16 code
> units as if they were characters, and that turned out badly for them.
> Now we know that not even code points should be considered characters,
> thanks to characters spanning multiple code points. You might call it
> perfect, but for that you have made two assumptions:
>
> 1. treating code points as characters is good enough, and
> 2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I drew such assumptions.

> Ranges of code points might be perfect for you, but it's a tradeoff that
> won't work in every situation.

Ranges can be defined over logical glyphs that span multiple code points.
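One much-simplified way such a range could group code points (a sketch; real grapheme segmentation per UAX #29 has more rules than just combining marks, and byGlyphSketch is a hypothetical name assuming std.uni.isMark):

import std.uni : isMark;  // combining marks

// Attach combining marks to the preceding base character.
dstring[] byGlyphSketch(dstring s)
{
    dstring[] glyphs;
    size_t start = 0;
    foreach (i, dchar c; s)
        if (i > start && !isMark(c))
        {
            glyphs ~= s[start .. i];
            start = i;
        }
    if (start < s.length)
        glyphs ~= s[start .. $];
    return glyphs;
}

// "cafe\u0301" ('e' + combining acute) yields "c", "a", "f", "e\u0301".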

> The whole concept of generic algorithms working on strings efficiently
> doesn't work.

Apparently std.algorithm does.

> Applying generic algorithms to strings by treating them as
> a range of code points is both wasteful (because it forces you to decode
> everything) and incomplete (because of multi-code-point characters),
> and it should be avoided.

An algorithm that gains by accessing the encoding can do so - and indeed some do. Spanning multi-code-point characters is a matter of defining the range appropriately; it doesn't break the abstraction.
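For a flavor of that (an illustrative sketch, not the actual std.algorithm source): UTF-8 is self-synchronizing, so an ASCII needle can be searched at the code-unit level without decoding anything.

// An ASCII byte never appears inside a multi-byte UTF-8 sequence,
// so comparing raw code units cannot produce a false match.
size_t findAscii(string haystack, char needle)
{
    assert(needle < 0x80);
    foreach (i, c; haystack)  // code-unit iteration, no decoding
        if (c == needle)
            return i;
    return haystack.length;
}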

> Algorithms working on Unicode strings should be
> designed with Unicode in mind. And the best way to design efficient
> Unicode algorithms is to access the array of code units directly,
> read each character at the level of abstraction required, and know
> what you're doing.

As I said, that's happening already.

> I'm not against making strings more opaque to encourage people to use
> the Unicode algorithms from the standard library instead of rolling
> their own.

I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical characters vs. array of code units) more clearly distinguished from each other. That's a Good Thing(tm).

> But I doubt the current approach of using .raw alone will
> prevent many from doing dumb things.

I agree. But I think it would be a sensible improvement over the status quo, where you get to do a ton of dumb things with much more ease.

> On the other hand, I'm sure it'll
> make it more complicated to write Unicode algorithms, because
> accessing and especially slicing the raw content of char[] will become
> tiresome. I'm not convinced it's a net win.

Many Unicode algorithms don't need slicing. Those that do carefully mix manipulation of code points with manipulation of representation. It is a net win that the two operations are explicitly distinguished.

> As for Walter being the only one coding by looking at the code units
> directly, that's not true. All my parser code looks at code units
> directly and only decodes to code points where necessary (just look at
> the XML parsing code I posted a while ago to get an idea of how it can
> apply to ranges). And I don't think it's because I've seen Walter's code
> before; I think it's because I know how Unicode works and I want to
> make my parser efficient. I've done the same for a parser in C++ a while
> ago. I can hardly imagine I'm the only one (with Walter and you). I
> think this is how efficient algorithms dealing with Unicode should be
> written.

Congratulations.


Andrei
December 31, 2011
Timon Gehr wrote:
> Me too. I think the way we have it now is optimal. The only reason we
> are discussing this is because of fear that uneducated users will write
> code that does not take into account Unicode characters above code point
> 0x80.

+1

From D's string docs:

"char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format. dchar[] strings are in UTF-32 format."

I would add some clarifications:

char[] is an array of 8-bit code units. A Unicode code point may take up to 4 chars.
wchar[] is an array of 16-bit code units. A Unicode code point may take up to 2 wchars.
dchar[] is an array of 32-bit code units. A Unicode code point always fits into one dchar.

Each of these formats may encode any Unicode string.

If you need indexing or slicing, use:
* char[] or string when working with ASCII code points.
* wchar[] or wstring when working with Basic Multilingual Plane (BMP) code points.
* dchar[] or dstring when working with all possible code points.

If you do not need indexing or slicing, you may use any of the formats.
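A short example of how the same text measures up in each format (é is U+00E9, a BMP code point):

void main()
{
    string  s = "é";   // UTF-8
    wstring w = "é"w;  // UTF-16
    dstring d = "é"d;  // UTF-32
    assert(s.length == 2);  // two 8-bit code units
    assert(w.length == 1);  // one 16-bit code unit (BMP fits in one)
    assert(d.length == 1);  // one code unit per code point, always
}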
December 31, 2011
On 12/31/2011 01:15 PM, Don wrote:
> On 31.12.2011 01:56, Timon Gehr wrote:
>> On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
>>> On 12/30/11 6:07 PM, Timon Gehr wrote:
>>>> alias std.string.representation raw;
>>>
>>> I meant your implementation is incomplete.
>>
>> It was more a sketch than an implementation. It is not even type safe
>> :o).
>>
>>>
>>> But the main point is that presence of representation/raw is not the
>>> issue.
>>> The availability of good-for-nothing .length and operator[] are
>>> the issue. Putting in place the convention of using .raw is hardly
>>> useful within the context.
>>>
>>
>> D strings are arrays. An array without .length and operator[] is close
>> to being good for nothing. The language specification is quite clear
>> about the fact that e.g. char is not a character but a UTF-8 code unit.
>> Therefore char[] is an array of code units.
>
> No, it isn't. That's the problem. char[] is not an array of char.
> It has an additional invariant: it is a UTF-8 string. If you randomly
> change elements, the invariant is violated.

char[] is an array of char and the additional invariant is not enforced by the language.

>
> In reality, char[] and wchar[] are compressed forms of dstring.
>
>> .raw would return ubyte[], therefore it
>> would lose all type information. Effectively, what .raw does is a type
>> cast that will let code point data alias with integral data.
>
> Exactly. It's just an "I know what I'm doing" signal.

No, it is an "I don't know what I'm doing" signal: ubyte[] does not carry any sign of an additional invariant, and the aliasing can be used to break the invariant that is commonly assumed for char[]. That was my point.
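A sketch of that point (the cast stands in for what a .raw-style property would do):

void main()
{
    char[] s = "héllo".dup;
    ubyte[] raw = cast(ubyte[]) s;  // nothing here says "UTF-8 inside"
    raw[1] = 0xFF;                  // perfectly legal for a ubyte[]...
    // ...but s now aliases broken UTF-8, and nothing signalled it.
}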
December 31, 2011
On 12/31/2011 03:17 PM, Michel Fortin wrote:
>
> As for Walter being the only one coding by looking at the code units
> directly, that's not true. All my parser code looks at code units
> directly and only decodes to code points where necessary (just look at
> the XML parsing code I posted a while ago to get an idea of how it can
> apply to ranges). And I don't think it's because I've seen Walter's code
> before; I think it's because I know how Unicode works and I want to
> make my parser efficient. I've done the same for a parser in C++ a while
> ago. I can hardly imagine I'm the only one (with Walter and you). I
> think this is how efficient algorithms dealing with Unicode should be
> written.
>

+1.
December 31, 2011
On Saturday, 31 December 2011 at 15:03:13 UTC, Andrei Alexandrescu wrote:
>> The whole concept of generic algorithms working on strings efficiently
>> doesn't work.
>
> Apparently std.algorithm does.

According to my research[1], std.array.replace (which uses std.algorithm under the hood) can be at least 40% faster when there is a match and 70% faster when there isn't one.

I don't think this is actually related to UTF, though.

[1]: http://dump.thecybershadow.net/5cfb6713ce6628686c6aa8a23b15c99e/test.d
December 31, 2011
I'm not sure I understand what's wrong with length. Of all the times I take a length in one sizable internationalized app at work, I can think of only one instance where I actually want the character count rather than the byte count. Is there some other reason I'm not aware of that makes length undesirable?

Sent from my iPhone

On Dec 30, 2011, at 4:12 PM, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 12/30/11 6:07 PM, Timon Gehr wrote:
>> alias std.string.representation raw;
> 
> I meant your implementation is incomplete.
> 
> But the main point is that presence of representation/raw is not the issue. The availability of good-for-nothing .length and operator[] are the issue. Putting in place the convention of using .raw is hardly useful within the context.
> 
> 
> Andrei
December 31, 2011
On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 12/31/11 8:17 CST, Michel Fortin wrote:
>> At one time Java and other frameworks started to treat UTF-16 code
>> units as if they were characters, and that turned out badly for them.
>> Now we know that not even code points should be considered characters,
>> thanks to characters spanning multiple code points. You might call it
>> perfect, but for that you have made two assumptions:
>> 
>> 1. treating code points as characters is good enough, and
>> 2. the performance penalty of decoding everything is tolerable
> 
> I'm not sure how you concluded I drew such assumptions.

1: Because treating UTF-8 strings as a range of code points encourages people to think so. 2: From things you posted on the newsgroup previously. Sorry, I don't have the references; it'd take too long to dig them up.

>> Ranges of code points might be perfect for you, but it's a tradeoff that
>> won't work in every situation.
> 
> Ranges can be defined over logical glyphs that span multiple code points.

I'm talking about the default interpretation, where string ranges are ranges of code points, making that tradeoff the default.

And also, I think we can agree that a logical glyph range would be terribly inefficient in practice, although it could be a nice teaching tool.

>> The whole concept of generic algorithms working on strings efficiently
>> doesn't work.
> 
> Apparently std.algorithm does.

First, it doesn't really work. It seems to work fine, but it doesn't yet handle characters spanning multiple code points. To handle this case, you could use a logical glyph range, but that'd be quite inefficient. Or you can improve the algorithm working on code points so that it checks for combining characters at the edges, but then is it still a generic algorithm?
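For instance (a small illustration; "é" is written here as 'e' plus a combining acute, and std.algorithm's canFind is assumed):

import std.algorithm : canFind;

void main()
{
    string s = "cafe\u0301";  // displays as "café": 'e' + U+0301
    // A code-point-level search happily matches the bare 'e' that is
    // only one piece of the on-screen character:
    assert(s.canFind('e'));
}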

Second, it doesn't work efficiently. Sure, you can specialize the algorithm so that it does not decode code units when it's not necessary, but then does it still qualify as a generic algorithm?

My point is that *generic* algorithms cannot work *efficiently* with Unicode, not that they can't work at all. And even then, for the inefficient generic algorithm to work correctly with all input, the user needs to choose the correct Unicode representation for the problem at hand, which requires some general knowledge of Unicode.

Which is why I'd just discourage generic algorithms for strings.


>> I'm not against making strings more opaque to encourage people to use
>> the Unicode algorithms from the standard library instead of rolling
>> their own.
> 
> I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical characters vs. array of code units) more clearly distinguished from each other. That's a Good Thing(tm).

It's a good abstraction to show the theory of Unicode. But it's not the way to go if you want efficiency. For efficiency, you need to handle each element in the string at the lowest abstraction level required for it, so your algorithm needs to know about the various abstraction layers.

This is the kind of "range" I'd use to create algorithms dealing with Unicode properly:

struct UnicodeRange(U)
{
	U frontUnit() @property;               // current code unit, no decoding
	dchar frontPoint() @property;          // decode one code point at the front
	immutable(U)[] frontGlyph() @property; // slice covering one whole glyph

	void popFrontUnit();                   // advance by a single code unit
	void popFrontPoint();                  // advance past one code point
	void popFrontGlyph();                  // advance past one glyph

	...
}

Not really a range per your definition of ranges, but basically it lets you intermix working with units, code points, and glyphs. Add a way to slice at the unit level and a way to know the length at the unit level, and it's all I need to make an efficient parser, or any algorithm really.
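For illustration, here is how such a range might be driven (hypothetical code against the sketch above):

// Skip ASCII whitespace at the unit level, then decode exactly one
// element, without ever constructing a second range for the units.
dchar readNonSpace(U)(ref UnicodeRange!U r)
{
    while (r.frontUnit == ' ' || r.frontUnit == '\t')
        r.popFrontUnit();      // fast path: no decoding
    auto c = r.frontPoint;     // decode one code point
    r.popFrontPoint();
    return c;
}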

The problem with .raw is that it creates a separate range for the units. This means you can't look at frontUnit, decide to pop the unit and look at the next one, decide you need to decode using frontPoint, then call popFrontPoint and return to looking at the front unit.

Also, I'm not sure the "glyph" part of that range is required most of the time, because most of the time you don't need to decode glyphs to be glyph-aware. But it'd be nice if you wanted to count them, and having it there alongside the rest makes users aware of them.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/