December 31, 2011
On 2011-12-31 16:47:40 +0000, Sean Kelly <sean@invisibleduck.org> said:

> I don't know that Unicode expertise is really required here anyway.  All one
> has to know is that UTF8 is a multibyte encoding and built-in string
> attributes talk in bytes. Knowing when one wants bytes vs characters isn't
> rocket science.

It's not bytes vs. characters, it's code units vs. code points vs. user-perceived characters (grapheme clusters). One character can span multiple code points, and can be represented in various ways depending on which Unicode normalization you pick. But most people don't know that.
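To make the three levels concrete, here is a minimal D sketch, assuming a Phobos version that provides std.uni.byGrapheme. The helper names codeUnits, codePoints and graphemes are made up for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, one per level of abstraction.
size_t codeUnits(string s)  { return s.length; }                // UTF-8 code units
size_t codePoints(string s) { return s.walkLength; }            // decoded code points
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // user-perceived characters

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: renders as a single "é"
    string s = "e\u0301";

    assert(codeUnits(s) == 3);  // 1 byte for 'e', 2 bytes for U+0301
    assert(codePoints(s) == 2);
    assert(graphemes(s) == 1);
}
```

The same visible character yields three different "lengths" depending on which level you ask about.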

If you want to count the number of *characters*, counting code points isn't really it, as you should avoid counting the combining ones. If you want to search for a substring, you need to be sure both strings use the same normalization first, and if not, normalize them appropriately so that equivalent code point combinations are always represented the same way.
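A small sketch of the substring case, assuming std.uni.normalize is available. The function name containsNormalized is hypothetical, and NFC is picked arbitrarily; any form works as long as both sides use the same one.

```d
import std.algorithm.searching : canFind;
import std.uni : NFC, normalize;

// Hypothetical helper: bring both strings to the same normalization form
// before searching.
bool containsNormalized(string haystack, string needle)
{
    return normalize!NFC(haystack).canFind(normalize!NFC(needle));
}

void main()
{
    string decomposed  = "cafe\u0301"; // 'e' + combining acute accent
    string precomposed = "caf\u00E9";  // single precomposed code point

    // Same text to a reader, different code point sequences:
    assert(!decomposed.canFind(precomposed));
    assert(containsNormalized(decomposed, precomposed));
}
```

Note that this only addresses equivalence. A normalized search can still match in the middle of a grapheme cluster, so it is not a complete answer on its own.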

That said, if you are implementing an XML or JSON parser, since those specs are defined in terms of code points you should probably write your code in terms of code points (hopefully without decoding code points when you don't need to). On the other hand, if you're writing something that processes text (like counting the average number of *characters* per word in a document), then you should be aware of combining characters.

How to pack all this into an easy-to-use package is the most challenging part.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

December 31, 2011
On 12/31/11 2:44 PM, Michel Fortin wrote:
> But will s.raw.popFront() also pop a single unit from s? "raw" would
> need to be defined as a reinterpret cast of the reference to the char[]
> to do what I want, something like this:
>
> ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; }
>
> The current std.string.representation doesn't do that at all.

You just found a bug!
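For reference, a sketch of the difference being discussed. It assumes std.string.representation returns a new slice value rather than a reference to the original, which is how it behaves as far as I know. The `return ref` annotation is my addition to Michel's signature, to keep newer compilers happy.

```d
import std.range.primitives : popFront;
import std.string : representation;

// Michel's accessor: the very same slice, reinterpreted as ubyte[].
ref ubyte[] raw(return ref char[] s) { return *cast(ubyte[]*)&s; }

void main()
{
    char[] s = "h\u00E9llo".dup;   // 6 code units, the accented letter takes two

    auto r = s.representation;      // separate ubyte[] slice over the same memory
    r.popFront();
    assert(r.length == 5 && s.length == 6); // s itself did not advance

    s.raw.popFront();               // advances s by exactly one code unit
    assert(s.length == 5);
}
```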

Andrei
December 31, 2011
On 12/31/2011 07:56 PM, Andrei Alexandrescu wrote:
> On 12/31/11 10:47 AM, Michel Fortin wrote:
>> This means you can't look at the frontUnit and then decide to pop the
>> unit and then look at the next, decide you need to decode using
>> frontPoint, then call popPoint and return to looking at the front unit.
>
> Of course you can.
>
> while (condition) {
> if (s.raw.front == someFrontUnitThatICareAbout) {
> s.raw.popFront();
> auto c = s.front;
> s.popFront();
> }
> }
>
> Now that I wrote it I'm even more enthralled with the coolness of the
> scheme. You essentially have access to two separate ranges on top of the
> same fabric.
>
>
> Andrei

There is nothing wrong with the scheme on the conceptual level (except maybe that .raw.popFront() lets you invalidate the code point range). But making built-in arrays behave that way is like fitting a square peg in a round hole. immutable(char)[] is actually what .raw should return, not what it should be called on. It is already the raw representation.
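To illustrate the invalidation point, a minimal sketch using std.utf.validate. The helper isWellFormed is a made-up name; the slicing stands in for what a single raw popFront would do.

```d
import std.utf : UTFException, validate;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isWellFormed(const(char)[] s)
{
    try { validate(s); return true; }
    catch (UTFException) { return false; }
}

void main()
{
    char[] s = "\u00E9t\u00E9".dup; // 5 code units: 2 + 1 + 2
    assert(isWellFormed(s));

    s = s[1 .. $];                   // drop only the lead byte of the first sequence
    assert(!isWellFormed(s));        // now begins with a stray continuation byte
}
```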
January 01, 2012
Sorry, I was simplifying. The distinction I was trying to make was between generic operations (in my experience the majority) vs. encoding-aware ones.

On Dec 31, 2011, at 12:48 PM, Michel Fortin <michel.fortin@michelf.com> wrote:

> On 2011-12-31 16:47:40 +0000, Sean Kelly <sean@invisibleduck.org> said:
> 
>> I don't know that Unicode expertise is really required here anyway.  All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science.
> 
> It's not bytes vs. characters, it's code units vs. code points vs. user-perceived characters (grapheme clusters). One character can span multiple code points, and can be represented in various ways depending on which Unicode normalization you pick. But most people don't know that.
> 
> If you want to count the number of *characters*, counting code points isn't really it, as you should avoid counting the combining ones. If you want to search for a substring, you need to be sure both strings use the same normalization first, and if not, normalize them appropriately so that equivalent code point combinations are always represented the same way.
> 
> That said, if you are implementing an XML or JSON parser, since those specs are defined in terms of code points you should probably write your code in terms of code points (hopefully without decoding code points when you don't need to). On the other hand, if you're writing something that processes text (like counting the average number of *characters* per word in a document), then you should be aware of combining characters.
> 
> How to pack all this into an easy to use package is most challenging.
> 
> 
> -- 
> Michel Fortin
> michel.fortin@michelf.com
> http://michelf.com/
> 
January 01, 2012
On 12/31/2011 02:02 PM, Timon Gehr wrote:
> On 12/31/2011 07:22 PM, Chad J wrote:
>> On 12/30/2011 02:55 PM, Timon Gehr wrote:
>>> On 12/30/2011 08:33 PM, Joshua Reusch wrote:
>>>> Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
>>>>> On 12/29/11 12:28 PM, Don wrote:
>>>>>> On 28.12.2011 20:00, Andrei Alexandrescu wrote:
>>>>>>> Oh, one more thing - one good thing that could come out of this
>>>>>>> thread
>>>>>>> is abolition (through however slow a deprecation path) of
>>>>>>> s.length and
>>>>>>> s[i] for narrow strings. Requiring s.rep.length instead of s.length
>>>>>>> and
>>>>>>> s.rep[i] instead of s[i] would improve the quality of narrow strings
>>>>>>> tremendously. Also, s.rep[i] should return ubyte/ushort, not
>>>>>>> char/wchar.
>>>>>>> Then, people would access the decoding routines on the needed
>>>>>>> occasions,
>>>>>>> or would consciously use the representation.
>>>>>>>
>>>>>>> Yum.
>>>>>>
>>>>>>
>>>>>> If I understand this correctly, most others don't. Effectively, .rep
>>>>>> just means, "I know what I'm doing", and there's no change to
>>>>>> existing
>>>>>> semantics, purely a syntax change.
>>>>>
>>>>> Exactly!
>>>>>
>>>>>> If you change s[i] into s.rep[i], it does the same thing as now.
>>>>>> There's
>>>>>> no loss of functionality -- it just stops you from accidentally
>>>>>> doing
>>>>>> the wrong thing. Like .ptr for getting the address of an array.
>>>>>> Typically all the ".rep" everywhere would get annoying, so you would
>>>>>> write:
>>>>>> ubyte [] u = s.rep;
>>>>>> and use u from then on.
>>>>>>
>>>>>> I don't like the name 'rep'. Maybe 'raw' or 'utf'?
>>>>>> Apart from that, I think this would be perfect.
>>>>>
>>>>> Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great.
>>>>>
>>>>> Now I'm twice sorry this will not happen...
>>>>>
>>>>
>>>> Maybe it could happen if we
>>>> 1. make dstring the default string type --
>>>
>>> Inefficient.
>>>
>>
>> But correct (enough).
>>
>>>> code units and characters would be the same
>>>
>>> Wrong.
>>>
>>
>> *sigh*, FINE.  Code units and /code points/ would be the same.
> 
> Relax.
> 

I'll do one better and ultra relax:
http://www.youtube.com/watch?v=jimQoWXzc0Q
;)

>>
>>>> or 2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindex
>>>
>>> Inconsistent and inefficient (it blows up the algorithmic complexity).
>>>
>>
>> Inconsistent?  How?
> 
> int[]
> bool[]
> float[]
> char[]
> 

I'll refer to another limb of this thread when foobar mentioned a mental model of strings as strings of letters.  Now, given annoying corner cases, we probably can't get strings of /letters/, but I'd at least like to make it as far as code points.  That seems very doable.  I mention this because I find that forwarding string.length and opIndex would be much more consistent with this mental model of strings as strings of unicode code points, which, IMO, is more important than it being binary consistent with the other things.  I'd much rather have char[] behave more like an array of code points than an array of bytes.  I don't need an array of bytes.  That's ubyte[]; I have that already.
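For what it's worth, this is roughly what forwarding length and indexing to std.utf would amount to. It is a sketch only: codePointLength and codePointAt are hypothetical names, not a proposed API.

```d
import std.utf : count, decode, toUTFindex;

// Hypothetical: what s.length would mean under the code point model.
size_t codePointLength(string s) { return count(s); }

// Hypothetical: what s[n] would mean under the code point model.
dchar codePointAt(string s, size_t n)
{
    size_t i = toUTFindex(s, n); // scans from the start of the string
    return decode(s, i);         // decodes the sequence starting at code unit i
}

void main()
{
    string s = "h\u00E9llo";
    assert(s.length == 6);             // code units today
    assert(codePointLength(s) == 5);   // code points
    assert(codePointAt(s, 1) == '\u00E9');
    assert(codePointAt(s, 2) == 'l');
}
```

Both helpers have to walk the string, which is where the complexity objection comes from.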

>>
>> Inefficiency is a lot easier to deal with than incorrectness.  If something is inefficient, then in the right places I will NOTICE.  If something is incorrect, it can hide for years until that one person (or country, in this case) with a different usage pattern than the others uncovers it.
>>
>>>>
>>>> so programmers could use the slices/indexing/length (no laziness
>>>> problems), and if they really want code units use .raw/.rep (or better
>>>> .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32))
>>>>
>>>
>>> Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.
>>>
>>
>> What about people who want to write correct string processing code AND want to use this handy slicing feature?  Because I totally want both of these.  Slicing is super useful for script-like coding.
>>
> 
> Except that the proposal would make slicing strings go away.
> 

Yeah, Andrei's proposal says that.  But I'm speaking of Joshua's:

>>>> so programmers could use the slices/indexing/length ...

I kind-of like either, but I'd prefer Joshua's suggestion.

>>>> But generally I liked the idea of just having an alias for strings...
>>>
>>> Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?
>>>
>>> 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.
>>>
>>
>> How do you know they are only working with ASCII?  They might be /now/.
>> But what if someone else uses the program a couple years later when the
>> original author is no longer maintaining that chunk of code?
> 
> Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.
> 

Or, you know, we could design the language a little differently and make this become mostly a non-problem.  That would be cool.

>>
>>> 2. They get screwed up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again.
>>>
>>
>> Except they don't.  Because there are a lot of programmers that will never put in non-ascii strings to begin with.  But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in.  This could make some messes.
>>
>>>
>>> I have *never* seen a user in D.learn complain about it. There might have been some I missed, but it is certainly not a prevalent problem. Also, just because a user can type .rep does not mean he understands Unicode: He is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_.
>>>
>>
>> You know, here in America (Amurica?) we don't know that other countries exist.  I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases.  These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with.  Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in.  No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those.  I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.
>>
> 
> There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.
> 

Probably not.  I played fast and loose with this a lot in my early D code.  Then this same conversation happened like ~3 years ago on this newsgroup.  Then I learned more about unicode and had a bit of a bitter taste regarding char[] and how it handled indexing.  I thought I could just index char[]s willy nilly.  But no, I can't.  And the compiler won't tell me.  It just silently does what I don't want.

Maybe unicode is easy, but we sure as hell aren't born with it, and the language doesn't give beginners ANY red flags about this.

I find myself pretty fortified against this issue due to having known about it before anything unpleasant happened, but I don't like the idea of others having to learn the hard way.

>> ...
>>
>> There's another issue at play here too: efficiency vs correctness as a default.
>>
>> Here's the tradeoff --
>>
>> Option A:
>> char[i] returns the i'th byte of the string as a (char) type.
>> Consequences:
>> (1) Code is efficient and INcorrect.
> 
> Do you have an example of impactful incorrect code resulting from those semantics?
> 

Nope.  Sorry.  I learned about it before it had a chance to bite me. But this is only because I frequent(ed) the newsgroup and had a good throw on my dice roll.

>> (2) It requires extra effort to write correct code.
>> (3) Detecting the incorrect code may take years, as these errors can
>> hide easily.
> 
> None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things:
> 
> 1. char[] is an array of char
> 2. immutable(char)[] is the default string type
> 3. the programmer does not know about 1. and/or 2.
> 
> I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.
> 

I can get behind this.

Honestly I'd like the default string type to be intelligent and optimize itself into whichever UTF-N encoding is optimal for content I throw into it.  Maybe this means it should lazily expand itself to the narrowest character type that maintains a 1-to-1 ratio between code units and code points so that indexing/slicing remain O(1), or maybe it's a bag of disparate encodings, or maybe someone can think of a better strategy. Just make it /reasonably/ fast and help me with correctness as much as possible.  If I need more performance or more unicode pedantics, I'll do my homework then and only then.
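Just to sketch the "narrowest character type" idea: the function below picks the smallest code unit size at which a given string has one code unit per code point. The name narrowestUnitSize is hypothetical, and this is only the selection step, not the self-adjusting string type itself.

```d
// Hypothetical helper: returns 1 (char), 2 (wchar) or 4 (dchar), the smallest
// code unit size for which this text needs exactly one unit per code point.
size_t narrowestUnitSize(string s)
{
    size_t width = 1;               // pure ASCII fits in char
    foreach (dchar c; s)            // decode code point by code point
    {
        if (c > 0xFFFF) return 4;   // outside the BMP: UTF-16 would need surrogates
        if (c > 0x7F) width = 2;    // non-ASCII: UTF-8 would need several units
    }
    return width;
}

void main()
{
    assert(narrowestUnitSize("plain ascii") == 1);
    assert(narrowestUnitSize("h\u00E9llo") == 2);
    assert(narrowestUnitSize("smile \U0001F600") == 4);
}
```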

Of course this is probably never going to happen I'm afraid.  Even the problem of making such a (probably) struct work at compile time in templates as if it were a native type... agh, headaches.

>>
>> Option B:
>> char[i] returns the i'th codepoint of the string as a (dchar) type.
>> Consequences:
>> (1) Code is INefficient and correct.
> 
> It is awfully optimistic to assume the code will be correct.
> 
>> (2) It requires extra effort to write efficient code.
>> (3) Detecting the inefficient code happens in minutes.  It is VERY
>> noticeable when your program runs too slowly.
>>
> 
> Except when in testing only small inputs are used and only 2 years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DOS'd. In practice, polynomial blowup in runtime can be just as large a problem as a correctness bug.
> 

I see what you mean there.  I'm still not entirely happy with it though.
I don't think these are reasonable requirements.  It sounds like forced
premature optimization to me.

I have found myself in a number of places in different problem domains where optimality-is-correctness.  Make it too slow and the program isn't worth writing.  I can't imagine doing this for workloads I can't test on or anticipate though: I'd have to operate like NASA and make things 10x more expensive than they need to be.

Correctness, on the other hand, can be easily (relatively speaking) obtained by only allowing the user to input data you can handle and then making sure the program can handle it as promised.  Test, test, test, etc.

>>
>> This is how I see it.
>>
>> And I really like my correct code.  If it's too slow, and I'll /know/ when it's too slow, then I'll profile->tweak->profile->etc until the slowness goes away.  I'm totally digging option B.
> 
> Those kinds of inefficiencies build up and make the whole program run sluggish, and it will possibly be too late when you notice.
> 

I get the feeling that the typical divide-and-conquer profiling strategy will find the more expensive operations /at least/ most of the time. Unfortunately, I have only experience to speak from on this matter.

> Option B is not even on the table. This thread is about a breaking interface change and special casing T[] for T in {char, wchar}.
> 
> 

Yeah, I know.  I'm referring to what Joshua wrote, because I like option B.  Even if it's academic, I'll say I like it anyways, if only for the sake of argument.
January 01, 2012
On 01/01/2012 02:34 AM, Chad J wrote:
> On 12/31/2011 02:02 PM, Timon Gehr wrote:
>> On 12/31/2011 07:22 PM, Chad J wrote:
>>> On 12/30/2011 02:55 PM, Timon Gehr wrote:
>>>> On 12/30/2011 08:33 PM, Joshua Reusch wrote:
>>>>> Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
>>>>>> On 12/29/11 12:28 PM, Don wrote:
>>>>>>> On 28.12.2011 20:00, Andrei Alexandrescu wrote:
>>>>>>>> Oh, one more thing - one good thing that could come out of this
>>>>>>>> thread
>>>>>>>> is abolition (through however slow a deprecation path) of
>>>>>>>> s.length and
>>>>>>>> s[i] for narrow strings. Requiring s.rep.length instead of s.length
>>>>>>>> and
>>>>>>>> s.rep[i] instead of s[i] would improve the quality of narrow strings
>>>>>>>> tremendously. Also, s.rep[i] should return ubyte/ushort, not
>>>>>>>> char/wchar.
>>>>>>>> Then, people would access the decoding routines on the needed
>>>>>>>> occasions,
>>>>>>>> or would consciously use the representation.
>>>>>>>>
>>>>>>>> Yum.
>>>>>>>
>>>>>>>
>>>>>>> If I understand this correctly, most others don't. Effectively, .rep
>>>>>>> just means, "I know what I'm doing", and there's no change to
>>>>>>> existing
>>>>>>> semantics, purely a syntax change.
>>>>>>
>>>>>> Exactly!
>>>>>>
>>>>>>> If you change s[i] into s.rep[i], it does the same thing as now.
>>>>>>> There's
>>>>>>> no loss of functionality -- it just stops you from accidentally
>>>>>>> doing
>>>>>>> the wrong thing. Like .ptr for getting the address of an array.
>>>>>>> Typically all the ".rep" everywhere would get annoying, so you would
>>>>>>> write:
>>>>>>> ubyte [] u = s.rep;
>>>>>>> and use u from then on.
>>>>>>>
>>>>>>> I don't like the name 'rep'. Maybe 'raw' or 'utf'?
>>>>>>> Apart from that, I think this would be perfect.
>>>>>>
>>>>>> Yes, I mean "rep" as a short for "representation" but upon first sight
>>>>>> the connection is tenuous. "raw" sounds great.
>>>>>>
>>>>>> Now I'm twice sorry this will not happen...
>>>>>>
>>>>>
>>>>> Maybe it could happen if we
>>>>> 1. make dstring the default string type --
>>>>
>>>> Inefficient.
>>>>
>>>
>>> But correct (enough).
>>>
>>>>> code units and characters would be the same
>>>>
>>>> Wrong.
>>>>
>>>
>>> *sigh*, FINE.  Code units and /code points/ would be the same.
>>
>> Relax.
>>
>
> I'll do one better and ultra relax:
> http://www.youtube.com/watch?v=jimQoWXzc0Q
> ;)
>
>>>
>>>>> or 2. forward string.length to std.utf.count and opIndex to
>>>>> std.utf.toUTFindex
>>>>
>>>> Inconsistent and inefficient (it blows up the algorithmic complexity).
>>>>
>>>
>>> Inconsistent?  How?
>>
>> int[]
>> bool[]
>> float[]
>> char[]
>>
>
> I'll refer to another limb of this thread when foobar mentioned a mental
> model of strings as strings of letters.  Now, given annoying corner
> cases, we probably can't get strings of /letters/, but I'd at least like
> to make it as far as code points.  That seems very doable.  I mention
> this because I find that forwarding string.length and opIndex would be
> much more consistent with this mental model of strings as strings of
> unicode code points, which, IMO, is more important than it being binary
> consistent with the other things.  I'd much rather have char[] behave
> more like an array of code points than an array of bytes.  I don't need
> an array of bytes.  That's ubyte[]; I have that already.
>

char[] is not an array of bytes: it is an array of UTF-8 code units.
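The distinction is visible in the types. A quick sketch, relying only on language facts and std.string.representation:

```d
import std.string : representation;

void main()
{
    string s = "\u00E9"; // one code point, stored as two UTF-8 code units

    // Indexing yields a char (a UTF-8 code unit), not a ubyte:
    static assert(is(typeof(s[0]) == immutable(char)));
    // The byte view is a separate type, obtained explicitly:
    static assert(is(typeof(s.representation) == immutable(ubyte)[]));

    assert(s.length == 2);
    assert(s.representation == [0xC3, 0xA9]);

    // char even defaults to 0xFF, a value that never occurs in valid UTF-8:
    assert(char.init == 0xFF);
}
```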

>>>
>>> Inefficiency is a lot easier to deal with than incorrectness.  If something
>>> is inefficient, then in the right places I will NOTICE.  If something is
>>> incorrect, it can hide for years until that one person (or country, in
>>> this case) with a different usage pattern than the others uncovers it.
>>>
>>>>>
>>>>> so programmers could use the slices/indexing/length (no laziness
>>>>> problems), and if they really want code units use .raw/.rep (or better
>>>>> .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32))
>>>>>
>>>>
>>>> Anyone who intends to write efficient string processing code needs this.
>>>> Anyone who does not want to write string processing code will not need
>>>> to index into a string -- standard library functions will suffice.
>>>>
>>>
>>> What about people who want to write correct string processing code AND
>>> want to use this handy slicing feature?  Because I totally want both of
>>> these.  Slicing is super useful for script-like coding.
>>>
>>
>> Except that the proposal would make slicing strings go away.
>>
>
> Yeah, Andrei's proposal says that.  But I'm speaking of Joshua's:
>
>>>>> so programmers could use the slices/indexing/length ...
>
> I kind-of like either, but I'd prefer Joshua's suggestion.
>
>>>>> But generally I liked the idea of just having an alias for strings...
>>>>
>>>> Me too. I think the way we have it now is optimal. The only reason we
>>>> are discussing this is because of fear that uneducated users will write
>>>> code that does not take into account Unicode characters above code point
>>>> 0x80. But what is the worst thing that can happen?
>>>>
>>>> 1. They don't notice. Then it is not a problem, because they are
>>>> obviously only using ASCII characters and it is perfectly reasonable to
>>>> assume that code units and characters are the same thing.
>>>>
>>>
>>> How do you know they are only working with ASCII?  They might be /now/.
>>> But what if someone else uses the program a couple years later when the
>>> original author is no longer maintaining that chunk of code?
>>
>> Then they obviously need to fix the code, because the requirements have
>> changed. Most of it will already work correctly though, because UTF-8
>> extends ASCII in a natural way.
>>
>
> Or, you know, we could design the language a little differently and make
> this become mostly a non-problem.  That would be cool.
>

It is imo already mostly a non-problem, but YMMV:

import std.stdio;

void main(){
    string s = readln();
    int nest = 0;
    foreach(x;s){ // iterates by code unit
        // '(' and ')' are ASCII, so they never match part of a multi-byte sequence
        if(x=='(') nest++;
        else if(x==')' && --nest<0) goto unbalanced;
    }
    if(!nest){
        writeln("balanced parentheses");
        return;
    }
unbalanced:
    writeln("unbalanced parentheses");
}

That code is UTF aware, even though it does not explicitly deal with UTF. I'd claim it is like this most of the time.
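The same logic as a testable function, with non-ASCII input thrown at it. This is a sketch; the name balanced is made up. It works because every code unit of a multi-byte UTF-8 sequence is 0x80 or above, so an ASCII comparison can never match inside one.

```d
// Hypothetical refactoring of the loop above into a function.
bool balanced(const(char)[] s)
{
    int nest = 0;
    foreach (char x; s) // by code unit, no decoding
    {
        if (x == '(') ++nest;
        else if (x == ')' && --nest < 0) return false;
    }
    return nest == 0;
}

void main()
{
    assert(balanced("(h\u00E9(\u65E5\u672C\u8A9E))")); // Latin-1 and CJK inside
    assert(!balanced("(\u00E9))("));

    // 2-, 3- and 4-byte sequences: no code unit falls in the ASCII range
    foreach (char x; "\u00E9\u65E5\U0001F600")
        assert(x >= 0x80);
}
```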


>>>
>>>> 2. They get screwed up string output, look for the reason, patch up
>>>> their code with some functions from std.utf and will never make the same
>>>> mistakes again.
>>>>
>>>
>>> Except they don't.  Because there are a lot of programmers that will
>>> never put in non-ascii strings to begin with.  But that has nothing to
>>> do with whether or not the /users/ or /maintainers/ of that code will
>>> put non-ascii strings in.  This could make some messes.
>>>
>>>>
>>>> I have *never* seen a user in D.learn complain about it. There might
>>>> have been some I missed, but it is certainly not a prevalent problem.
>>>> Also, just because a user can type .rep does not mean he understands
>>>> Unicode: He is able to make just the same mistakes as before, even more
>>>> so, as the array he is getting back has the _wrong element type_.
>>>>
>>>
>>> You know, here in America (Amurica?) we don't know that other countries
>>> exist.  I think there is a large population of programmers here that
>>> don't even know how to enter non-latin characters, much less would think
>>> to include such characters in their test cases.  These programmers won't
>>> necessarily be found on the internet much, but they will be found in
>>> cubicles all around, doing their 9-to-5 and writing mediocre code that
>>> the rest of us have to put up with.  Their code will pass peer review
>>> (their peers are also from America) and continue working just fine until
>>> someone from one of those confusing other places decides to type in the
>>> characters they feel comfortable typing in.  No, there will not be
>>> /tests/ for code points greater than 0x80, because there is no one
>>> around to write those.  I'd feel a little better if D herds people into
>>> writing correct code to begin with, because they won't otherwise.
>>>
>>
>> There is no way to 'herd people into writing correct code' and UTF-8 is
>> quite easy to deal with.
>>
>
> Probably not.  I played fast and loose with this a lot in my early D
> code.  Then this same conversation happened like ~3 years ago on this
> newsgroup.  Then I learned more about unicode and had a bit of a bitter
> taste regarding char[] and how it handled indexing.  I thought I could
> just index char[]s willy nilly.  But no, I can't.  And the compiler
> won't tell me.  It just silently does what I don't want.
>

How often do you actually need to get, for example, the 10th character of a string? I think it is a very uncommon operation. If the indexing is just part of an iteration that looks once at each char and handles some ASCII characters in certain ways, there is no potential correctness problem. As soon as code talks about non-ascii characters, it has to be UTF aware anyway.

> Maybe unicode is easy, but we sure as hell aren't born with it, and the
> language doesn't give beginners ANY red flags about this.
>
> I find myself pretty fortified against this issue due to having known
> about it before anything unpleasant happened, but I don't like the idea
> of others having to learn the hard way.
>

Hm, well. The first thing I looked up when I learned D supports Unicode is how Unicode/UTF work in detail. After that, the semantics of char[] were very clear to me.

>>> ...
>>>
>>> There's another issue at play here too: efficiency vs correctness as a
>>> default.
>>>
>>> Here's the tradeoff --
>>>
>>> Option A:
>>> char[i] returns the i'th byte of the string as a (char) type.
>>> Consequences:
>>> (1) Code is efficient and INcorrect.
>>
>> Do you have an example of impactful incorrect code resulting from those
>> semantics?
>>
>
> Nope.  Sorry.  I learned about it before it had a chance to bite me.
> But this is only because I frequent(ed) the newsgroup and had a good
> throw on my dice roll.
>

I might be wrong, but I somewhat have the impression we might be chasing phantoms here. I have so far never seen a bug in real world code caused by inadvertent misuse of D string indexing or slicing.

>>> (2) It requires extra effort to write correct code.
>>> (3) Detecting the incorrect code may take years, as these errors can
>>> hide easily.
>>
>> None of those is a direct consequence of char[i] returning char. They
>> are the consequence of at least 3 things:
>>
>> 1. char[] is an array of char
>> 2. immutable(char)[] is the default string type
>> 3. the programmer does not know about 1. and/or 2.
>>
>> I say, 1. is inevitable. You say 3. is inevitable. If we are both right,
>> then 2. is the culprit.
>>
>
> I can get behind this.
>
> Honestly I'd like the default string type to be intelligent and optimize
> itself into whichever UTF-N encoding is optimal for content I throw into
> it.  Maybe this means it should lazily expand itself to the narrowest
> character type that maintains a 1-to-1 ratio between code units and code
> points so that indexing/slicing remain O(1), or maybe it's a bag of
> disparate encodings, or maybe someone can think of a better strategy.
> Just make it /reasonably/ fast and help me with correctness as much as
> possible.  If I need more performance or more unicode pedantics, I'll do
> my homework then and only then.
>
> Of course this is probably never going to happen I'm afraid.  Even the
> problem of making such a (probably) struct work at compile time in
> templates as if it were a native type... agh, headaches.
>
>>>
>>> Option B:
>>> char[i] returns the i'th codepoint of the string as a (dchar) type.
>>> Consequences:
>>> (1) Code is INefficient and correct.
>>
>> It is awfully optimistic to assume the code will be correct.
>>
>>> (2) It requires extra effort to write efficient code.
>>> (3) Detecting the inefficient code happens in minutes.  It is VERY
>>> noticeable when your program runs too slowly.
>>>
>>
>> Except when in testing only small inputs are used and only 2 years later
>> maintainers throw your program at a larger problem instance and wonder
>> why it does not terminate. Or your program is DOS'd. In practice,
>> polynomial blowup in runtime can be just as large a problem as a
>> correctness bug.
>>
>
> I see what you mean there.  I'm still not entirely happy with it though.
> I don't think these are reasonable requirements.  It sounds like forced
> premature optimization to me.
>

It is using a better algorithm that performs faster by a linear factor. I would be very leery of something that looks like a constant-time array indexing operation taking linear time. I think premature optimization is about writing near-optimal code that is hard to debug and maintain, and that only gains some constant factors in parts of the code that are not performance critical.
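A sketch of the kind of code I mean, with made-up function names. Both functions compute the same result, but the first one is what an innocent-looking index loop would turn into if s[i] meant "the i-th code point".

```d
import std.utf : count, decode, toUTFindex;

// O(n^2): every "index" operation rescans the string from the start.
size_t countNonAsciiByIndex(string s)
{
    size_t n = 0;
    immutable len = count(s);          // O(n) just to get the length
    foreach (i; 0 .. len)
    {
        size_t at = toUTFindex(s, i);  // O(i) per lookup
        if (decode(s, at) > 0x7F) ++n;
    }
    return n;
}

// O(n): a single decoding pass.
size_t countNonAsciiByIteration(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        if (c > 0x7F) ++n;
    return n;
}

void main()
{
    string s = "na\u00EFve caf\u00E9";
    assert(countNonAsciiByIndex(s) == 2);
    assert(countNonAsciiByIteration(s) == 2);
}
```

On short test inputs the two are indistinguishable, which is exactly the problem.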

> I have found myself in a number of places in different problem domains
> where optimality-is-correctness.  Make it too slow and the program isn't
> worth writing.  I can't imagine doing this for workloads I can't test on
> or anticipate though: I'd have to operate like NASA and make things 10x
> more expensive than they need to be.
>
> Correctness, on the other hand, can be easily (relatively speaking)
> obtained by only allowing the user to input data you can handle and then
> making sure the program can handle it as promised.  Test, test, test, etc.
>
>>>
>>> This is how I see it.
>>>
>>> And I really like my correct code.  If it's too slow, and I'll /know/
>>> when it's too slow, then I'll profile->tweak->profile->etc until the
>>> slowness goes away.  I'm totally digging option B.
>>
>> Those kinds of inefficiencies build up and make the whole program run
>> sluggish, and it will possibly be too late when you notice.
>>
>
> I get the feeling that the typical divide-and-conquer profiling strategy
> will find the more expensive operations /at least/ most of the time.
> Unfortunately, I have only experience to speak from on this matter.
>

Yes, what I meant is that if the inefficiencies are spread out more or less uniformly, then fixing it all up might seem to be too much work and too much risk.

>> Option B is not even on the table. This thread is about a breaking
>> interface change and special casing T[] for T in {char, wchar}.
>>
>>
>
> Yeah, I know.  I'm referring to what Joshua wrote, because I like option
> B.  Even if it's academic, I'll say I like it anyways, if only for the
> sake of argument.

OK.
January 01, 2012
On 12/31/2011 09:17 PM, Timon Gehr wrote:
> On 01/01/2012 02:34 AM, Chad J wrote:
>> On 12/31/2011 02:02 PM, Timon Gehr wrote:
>>> On 12/31/2011 07:22 PM, Chad J wrote:
>>>> On 12/30/2011 02:55 PM, Timon Gehr wrote:
>>>>> On 12/30/2011 08:33 PM, Joshua Reusch wrote:
>>>>>>
>>>>>> Maybe it could happen if we
>>>>>> 1. make dstring the default string type --
>>>>>
>>>>> Inefficient.
>>>>>
>>>>
>>>> But correct (enough).
>>>>
>>>>>> code units and characters would be the same
>>>>>
>>>>> Wrong.
>>>>>
>>>>
>>>> *sigh*, FINE.  Code units and /code points/ would be the same.
>>>
>>> Relax.
>>>
>>
>> I'll do one better and ultra relax:
>> http://www.youtube.com/watch?v=jimQoWXzc0Q
>> ;)
>>
>>>>
>>>>>> or 2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindex
>>>>>
>>>>> Inconsistent and inefficient (it blows up the algorithmic complexity).
>>>>>
>>>>
>>>> Inconsistent?  How?
>>>
>>> int[]
>>> bool[]
>>> float[]
>>> char[]
>>>
>>
>> I'll refer to another limb of this thread where foobar mentioned a mental model of strings as strings of letters.  Now, given annoying corner cases, we probably can't get strings of /letters/, but I'd at least like to make it as far as code points.  That seems very doable.  I mention this because I find that forwarding string.length and opIndex would be much more consistent with this mental model of strings as strings of Unicode code points, which, IMO, is more important than it being binary consistent with the other things.  I'd much rather have char[] behave more like an array of code points than an array of bytes.  I don't need an array of bytes.  That's ubyte[]; I have that already.
>>
> 
> char[] is not an array of bytes: it is an array of UTF-8 code units.
> 

Meh, I'd still prefer it be an array of UTF-8 code /points/ represented by an array of bytes (which are the UTF-8 code units).
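
The distinction being argued here can be made concrete with a short sketch. It uses std.utf.count and std.utf.toUTFindex, the two functions named earlier in the thread; the helper name codePointSlice is hypothetical and only for illustration. Note that both functions scan the string, so each call is O(n).

```d
import std.stdio : writeln;
import std.utf : count, toUTFindex;

// Hypothetical helper: slice out the n-th code point of s.
// toUTFindex walks the string from the start, so this is O(n).
string codePointSlice(string s, size_t n)
{
    immutable start = toUTFindex(s, n);     // code unit index of code point n
    immutable end = toUTFindex(s, n + 1);   // code unit index of the next one
    return s[start .. end];
}

void main()
{
    string s = "na\u00EFve";   // U+00EF takes two UTF-8 code units

    writeln(s.length);         // 6: number of code units
    writeln(count(s));         // 5: number of code points
    writeln(codePointSlice(s, 2).length); // 2: the third code point spans two code units
}
```

Here s[2] on its own would be only the first of the two code units that encode U+00EF, which is the behaviour the two sides disagree about.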

>>>>
>>>> Inefficiency is a lot easier to deal with than incorrect.  If something
>>>> is inefficient, then in the right places I will NOTICE.  If
>>>> something is
>>>> incorrect, it can hide for years until that one person (or country, in
>>>> this case) with a different usage pattern than the others uncovers it.
>>>>
>>>>>>
>>>>>> so programmers could use the slices/indexing/length (no laziness
>>>>>> problems), and if they really want code units use .raw/.rep (or better
>>>>>> .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32))
>>>>>>
>>>>>
>>>>> Anyone who intends to write efficient string processing code needs
>>>>> this.
>>>>> Anyone who does not want to write string processing code will not need
>>>>> to index into a string -- standard library functions will suffice.
>>>>>
>>>>
>>>> What about people who want to write correct string processing code AND want to use this handy slicing feature?  Because I totally want both of these.  Slicing is super useful for script-like coding.
>>>>
>>>
>>> Except that the proposal would make slicing strings go away.
>>>
>>
>> Yeah, Andrei's proposal says that.  But I'm speaking of Joshua's:
>>
>>>>>> so programmers could use the slices/indexing/length ...
>>
>> I kind-of like either, but I'd prefer Joshua's suggestion.
>>
>>>>>> But generally I liked the idea of just having an alias for strings...
>>>>>
>>>>> Me too. I think the way we have it now is optimal. The only reason we
>>>>> are discussing this is because of fear that uneducated users will
>>>>> write
>>>>> code that does not take into account Unicode characters above code
>>>>> point
>>>>> 0x80. But what is the worst thing that can happen?
>>>>>
>>>>> 1. They don't notice. Then it is not a problem, because they are
>>>>> obviously only using ASCII characters and it is perfectly
>>>>> reasonable to
>>>>> assume that code units and characters are the same thing.
>>>>>
>>>>
>>>> How do you know they are only working with ASCII?  They might be /now/.
>>>> But what if someone else uses the program a couple years later when the
>>>> original author is no longer maintaining that chunk of code?
>>>
>>> Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.
>>>
>>
>> Or, you know, we could design the language a little differently and make this become mostly a non-problem.  That would be cool.
>>
> 
> It is imo already mostly a non-problem, but YMMV:
> 
> void main(){
>     string s = readln();
>     int nest = 0;
>     foreach(x;s){ // iterates by code unit
>         if(x=='(') nest++;
>         else if(x==')' && --nest<0) goto unbalanced;
>     }
>     if(!nest){
>         writeln("balanced parentheses");
>         return;
>     }
> unbalanced:
>     writeln("unbalanced parentheses");
> }
> 
> That code is UTF aware, even though it does not explicitly deal with UTF. I'd claim it is like this most of the time.
> 
> 

I'm willing to agree with this.

I still don't like the possibility that folks encounter corner-cases in that not-most-of-the-time.

I'm not going to rage-face too hard if this never changes though.  There would be a number of other things more important to fix before this, IMO.
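
The reason the quoted loop stays correct on non-ASCII input is a property of UTF-8: every code unit of a multi-byte sequence has a value of 0x80 or higher, so it can never compare equal to an ASCII character such as '('. A minimal sketch of the same check as a function, exercised with non-ASCII text (the name balanced is made up for this example):

```d
import std.stdio : writeln;

// Returns true if the parentheses in s are balanced.
// Scans UTF-8 code units directly; no decoding is needed because
// '(' and ')' are ASCII and cannot occur inside a multi-byte sequence.
bool balanced(const(char)[] s)
{
    int nest = 0;
    foreach (char c; s) // iterate by code unit
    {
        if (c == '(') nest++;
        else if (c == ')' && --nest < 0) return false;
    }
    return nest == 0;
}

void main()
{
    writeln(balanced("(\u00E4(\u4E2D\u6587))")); // true
    writeln(balanced("\u00FC)("));               // false
}
```

This only holds for code that looks for ASCII characters. As soon as the code compares against a non-ASCII character, it has to decode.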

>>>>
>>>>> 2. They get screwed up string output, look for the reason, patch up
>>>>> their code with some functions from std.utf and will never make the
>>>>> same
>>>>> mistakes again.
>>>>>
>>>>
>>>> Except they don't.  Because there are a lot of programmers that will never put in non-ascii strings to begin with.  But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in.  This could make some messes.
>>>>
>>>>>
>>>>> I have *never* seen a user in D.learn complain about it. There might
>>>>> have been some I missed, but it is certainly not a prevalent problem.
>>>>> Also, just because a user can type .rep does not mean he understands
>>>>> Unicode: he is able to make just the same mistakes as before, even
>>>>> more so, as the array he is getting back has the _wrong element type_.
>>>>>
>>>>
>>>> You know, here in America (Amurica?) we don't know that other countries
>>>> exist.  I think there is a large population of programmers here that
>>>> don't even know how to enter non-latin characters, much less would
>>>> think
>>>> to include such characters in their test cases.  These programmers
>>>> won't
>>>> necessarily be found on the internet much, but they will be found in
>>>> cubicles all around, doing their 9-to-5 and writing mediocre code that
>>>> the rest of us have to put up with.  Their code will pass peer review
>>>> (their peers are also from America) and continue working just fine
>>>> until
>>>> someone from one of those confusing other places decides to type in the
>>>> characters they feel comfortable typing in.  No, there will not be
>>>> /tests/ for code points greater than 0x80, because there is no one
>>>> around to write those.  I'd feel a little better if D herds people into
>>>> writing correct code to begin with, because they won't otherwise.
>>>>
>>>
>>> There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.
>>>
>>
>> Probably not.  I played fast and loose with this a lot in my early D code.  Then this same conversation happened like ~3 years ago on this newsgroup.  Then I learned more about unicode and had a bit of a bitter taste regarding char[] and how it handled indexing.  I thought I could just index char[]s willy nilly.  But no, I can't.  And the compiler won't tell me.  It just silently does what I don't want.
>>
> 
> How often do you actually need to get, for example, the 10th character of a string? I think it is a very uncommon operation. If the indexing is just part of an iteration that looks once at each char and handles some ASCII characters in certain ways, there is no potential correctness problem. As soon as code talks about non-ascii characters, it has to be UTF aware anyway.
> 

If you haven't been educated about unicode or how D handles it, you might write this:

char[] str;
... load str ...
for ( int i = 0; i < str.length; i++ )
{
    font.render(str[i]); // Ewww.
    ...
}

It'd be neat if that gave a compiler error, or just passed code points as dchar's.  Maybe a compiler error is best in this light.
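
For comparison, the language already offers a way to get the second behaviour: a foreach with an explicit dchar loop variable decodes the char[] into code points. A minimal sketch follows; render is a hypothetical stand-in for font.render, whose real signature is not given in this thread, and codePoints exists only so the result can be inspected.

```d
import std.stdio : writefln;

// Hypothetical stand-in for font.render; it just prints the code point.
void render(dchar c)
{
    writefln("U+%04X", cast(uint) c);
}

// Collects the decoded code points of str.
dchar[] codePoints(const(char)[] str)
{
    dchar[] result;
    foreach (dchar c; str) // the dchar loop variable makes foreach decode UTF-8
        result ~= c;
    return result;
}

void main()
{
    char[] str = "a\u20ACb".dup; // 5 code units, 3 code points

    foreach (dchar c; str)
        render(c); // called three times, once per code point
}
```

This still passes code points, not user-perceived characters, so combining sequences would be rendered piece by piece.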

>> Maybe unicode is easy, but we sure as hell aren't born with it, and the language doesn't give beginners ANY red flags about this.
>>
>> I find myself pretty fortified against this issue due to having known about it before anything unpleasant happened, but I don't like the idea of others having to learn the hard way.
>>
> 
> Hm, well. The first thing I looked up when I learned D supports Unicode is how Unicode/UTF work in detail. After that, the semantics of char[] were very clear to me.
> 
>>>> ...
>>>>
>>>> There's another issue at play here too: efficiency vs correctness as a default.
>>>>
>>>> Here's the tradeoff --
>>>>
>>>> Option A:
>>>> char[i] returns the i'th byte of the string as a (char) type.
>>>> Consequences:
>>>> (1) Code is efficient and INcorrect.
>>>
>>> Do you have an example of impactful incorrect code resulting from those semantics?
>>>
>>
>> Nope.  Sorry.  I learned about it before it had a chance to bite me. But this is only because I frequent(ed) the newsgroup and had a good throw on my dice roll.
>>
> 
> I might be wrong, but I somewhat have the impression we might be chasing phantoms here. I have so far never seen a bug in real world code caused by inadvertent misuse of D string indexing or slicing.
> 

Possibly.

>>>> (2) It requires extra effort to write correct code.
>>>> (3) Detecting the incorrect code may take years, as these errors can
>>>> hide easily.
>>>
>>> None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things:
>>>
>>> 1. char[] is an array of char
>>> 2. immutable(char)[] is the default string type
>>> 3. the programmer does not know about 1. and/or 2.
>>>
>>> I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.
>>>
>>
>> I can get behind this.
>>
>> Honestly I'd like the default string type to be intelligent and optimize itself into whichever UTF-N encoding is optimal for content I throw into it.  Maybe this means it should lazily expand itself to the narrowest character type that maintains a 1-to-1 ratio between code units and code points so that indexing/slicing remain O(1), or maybe it's a bag of disparate encodings, or maybe someone can think of a better strategy. Just make it /reasonably/ fast and help me with correctness as much as possible.  If I need more performance or more unicode pedantics, I'll do my homework then and only then.
>>
>> Of course this is probably never going to happen I'm afraid.  Even the problem of making such a (probably) struct work at compile time in templates as if it were a native type... agh, headaches.
>>
>>>>
>>>> Option B:
>>>> char[i] returns the i'th codepoint of the string as a (dchar) type.
>>>> Consequences:
>>>> (1) Code is INefficient and correct.
>>>
>>> It is awfully optimistic to assume the code will be correct.
>>>
>>>> (2) It requires extra effort to write efficient code.
>>>> (3) Detecting the inefficient code happens in minutes.  It is VERY
>>>> noticeable when your program runs too slowly.
>>>>
>>>
>>> Except when in testing only small inputs are used and only 2 years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DoS'd. In practice, polynomial blowup in runtime can be just as large a problem as a correctness bug.
>>>
>>
>> I see what you mean there.  I'm still not entirely happy with it though.
>> I don't think these are reasonable requirements.  It sounds like forced
>> premature optimization to me.
>>
> 
> It is using a better algorithm that performs faster by a linear factor. I would be very leery of something that looks like a constant time array indexing operation taking linear time. I think premature optimization is about writing near-optimal, hard-to-debug and hard-to-maintain code that only gains some constant factors in parts of the code that are not performance critical.
> 

This wouldn't be the first data structure to require linear time indexing.  I mean, linked lists exist.

I do feel that heavy-duty optimization puts the onus on the programmer
to know what to do.  The programming language is responsible for merely
making it possible, not for making it the default path.  The latter is
fairly impossible.  Correctness, on the other hand, should involve some
hand-holding.  It's that notion of the language catching me when I fall.
 I think the language should (and can) help a lot with program
correctness if designed right.  D is already really good on these
counts, and even helps quite a bit when optimization gets down-and-dirty.
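
The cost being debated can be sketched as follows, assuming Option B's indexing were implemented by scanning from the start of the string on every access (an assumption; the thread does not specify an implementation). The names codePointAt, sumIndexed and sumSequential are hypothetical.

```d
import std.stdio : writeln;
import std.utf : count, decode, toUTFindex;

// Hypothetical code point indexing: rescans from the start, so O(n) per call.
dchar codePointAt(const(char)[] s, size_t n)
{
    size_t i = toUTFindex(s, n);
    return decode(s, i);
}

// Index-based loop: n lookups at O(n) each, O(n^2) overall.
uint sumIndexed(const(char)[] s)
{
    uint total = 0;
    immutable n = count(s);
    foreach (k; 0 .. n)
        total += codePointAt(s, k);
    return total;
}

// Single decoding pass: O(n) overall, same result.
uint sumSequential(const(char)[] s)
{
    uint total = 0;
    foreach (dchar c; s)
        total += c;
    return total;
}

void main()
{
    writeln(sumIndexed("a\u20ACb"));    // 8559
    writeln(sumSequential("a\u20ACb")); // 8559
}
```

Both loops look alike at the call site, which is the concern raised above: the quadratic version gives no visual hint of its cost.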

>> I have found myself in a number of places in different problem domains where optimality-is-correctness.  Make it too slow and the program isn't worth writing.  I can't imagine doing this for workloads I can't test on or anticipate though: I'd have to operate like NASA and make things 10x more expensive than they need to be.
>>
>> Correctness, on the other hand, can be easily (relatively speaking) obtained by only allowing the user to input data you can handle and then making sure the program can handle it as promised.  Test, test, test, etc.
>>
>>>>
>>>> This is how I see it.
>>>>
>>>> And I really like my correct code.  If it's too slow, and I'll /know/ when it's too slow, then I'll profile->tweak->profile->etc until the slowness goes away.  I'm totally digging option B.
>>>
>>> Those kinds of inefficiencies build up and make the whole program run sluggish, and it will possibly be too late when you notice.
>>>
>>
>> I get the feeling that the typical divide-and-conquer profiling strategy will find the more expensive operations /at least/ most of the time. Unfortunately, I have only experience to speak from on this matter.
>>
> 
> Yes, what I meant is that if the inefficiencies are spread out more or less uniformly, then fixing them all might seem like too much work and too much risk.
> 

Ah, right.  Because code refactoring tends to suck.  I get you.

This is, of course, still the same reason why I'd never want to have to go through my code and replace all of the "font.render(str[i]);".  Yeah, starting a number of years ago it won't happen to me, but it might get someone else.

>>> Option B is not even on the table. This thread is about a breaking interface change and special casing T[] for T in {char, wchar}.
>>>
>>>
>>
>> Yeah, I know.  I'm referring to what Joshua wrote, because I like option B.  Even if it's academic, I'll say I like it anyways, if only for the sake of argument.
> 
> OK.

January 01, 2012
On 31.12.2011 17:13, Timon Gehr wrote:
> On 12/31/2011 01:15 PM, Don wrote:
>> On 31.12.2011 01:56, Timon Gehr wrote:
>>> On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
>>>> On 12/30/11 6:07 PM, Timon Gehr wrote:
>>>>> alias std.string.representation raw;
>>>>
>>>> I meant your implementation is incomplete.
>>>
>>> It was more a sketch than an implementation. It is not even type safe
>>> :o).
>>>
>>>>
>>>> But the main point is that presence of representation/raw is not the
>>>> issue.
>>>> The availability of good-for-nothing .length and operator[] are
>>>> the issue. Putting in place the convention of using .raw is hardly
>>>> useful within the context.
>>>>
>>>
>>> D strings are arrays. An array without .length and operator[] is close
>>> to being good for nothing. The language specification is quite clear
>>> about the fact that e.g. char is not a character but a UTF-8 code unit.
>>> Therefore char[] is an array of code units.
>>
>> No, it isn't. That's the problem. char[] is not an array of char.
>> It has an additional invariant: it is a UTF8 string. If you randomly
>> change elements, the invariant is violated.
>
> char[] is an array of char and the additional invariant is not enforced
> by the language.

No, it isn't an ordinary array. Take concatenation, for example: char[] ~ int will never create an invalid string. You can end up with multiple chars being appended, even from a single append. foreach is different, too. Strings are a bit magical.
There's quite a lot of code in the compiler to make sure that strings remain valid.

The additional invariant is not enforced in the case of slicing; that's the point.
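
Both halves of this point can be sketched in a few lines. The example appends a dchar rather than an int (an assumption made here so the example is unambiguous) and uses std.utf.validate to check the result; the wrapper isValidUtf8 is a hypothetical name.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

// Hypothetical wrapper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    char[] s = "ab".dup;
    dchar euro = '\u20AC';

    s ~= euro;                       // one append, three code units added
    writeln(s.length);               // 5
    writeln(isValidUtf8(s));         // true: the append encoded the code point

    writeln(isValidUtf8(s[0 .. 3])); // false: the slice cuts U+20AC in half
}
```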
January 01, 2012
> Meh, I'd still prefer it be an array of UTF-8 code /points/ represented by an array of bytes (which are the UTF-8 code units).

By saying you want an array of code points you already define the representation. And if you want that, there already is dchar[]. You probably meant a range of code points represented by an array of code units. But such a range can't have opIndex, since opIndex implies a constant time operation.  If you want the nth element of the range, you can use std.range.drop or write your own nth() function.
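
A minimal sketch of such an nth() function, built on std.range.drop. It relies on Phobos treating a string as a range of dchar (so drop skips whole code points); the function itself is hypothetical and is O(n) by design, which is why it is a named function and not opIndex.

```d
import std.range : drop, empty, front;
import std.stdio : writeln;

// Hypothetical nth(): the n-th code point of s, found by a linear scan.
dchar nth(string s, size_t n)
{
    auto rest = s.drop(n); // skips n code points, not n code units
    assert(!rest.empty, "code point index past the end of the string");
    return rest.front;     // decodes the next code point
}

void main()
{
    string s = "h\u00E9llo";
    writeln(cast(uint) nth(s, 1)); // 233, i.e. U+00E9
    writeln(nth(s, 2));            // l
}
```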
January 01, 2012
On 01/01/2012 05:53 AM, Chad J wrote:
>
> If you haven't been educated about unicode or how D handles it, you
> might write this:
>
> char[] str;
> ... load str ...
> for ( int i = 0; i < str.length; i++ )
> {
>      font.render(str[i]); // Ewww.
>      ...
> }
>

That actually looks like a bug that might happen in real world code. What is the signature of font.render?