March 09, 2014
On Friday, 7 March 2014 at 23:13:50 UTC, H. S. Teoh wrote:
> On Fri, Mar 07, 2014 at 10:35:46PM +0000, Sarath Kodali wrote:
>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>> >On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu
>> >wrote:
> [...]
>> >>Clearly one might argue that their app has no business dealing
>> >>with diacriticals or Asian characters. But that's the typical
>> >>provincial view that marred many languages' approach to UTF and
>> >>internationalization.
>> >
>> >So is yours, if you think that making everything magically a dchar
>> >is going to solve all problems.
>> >
>> >The TDPL example only showcases the problem. Yes, it works with
>> >Swedish. Now try it again with Sanskrit.
>> 
>> +1
>> In Indian languages, a character consists of one or more Unicode
>> code points. For example, in Sanskrit "ddhrya"
>> http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
>> consists of 7 Unicode code points. So to search for this char I have
>> to use string search.
> [...]
>
> That's what I've been arguing for. The most general form of character
> searching in Unicode requires substring searching, and similarly many
> character-based operations on Unicode strings are effectively
> substring-based operations, because said "character" may be a multibyte
> code point, or, in your case, multiple code points. Since that's the
> case, we might as well just forget about the distinction between
> "character" and "string", and treat all such operations as substring
> operations (even if the operand is supposedly "just 1 character long").
>
> This would allow us to get rid of the hackish auto-decoding of narrow
> strings, and thus eliminate the needless overhead of always decoding.

That won't work, because your needle might be in a different normalization form than your haystack, so a byte-by-byte comparison will not be able to find it.
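
A minimal sketch of that mismatch (the literals are written with explicit escapes, so no normalization library is needed):

import std.algorithm : canFind;

void main()
{
    string haystack = "cafe\u0301"; // 'é' in decomposed form: 'e' + combining acute
    string needle   = "\u00E9";     // 'é' as the single precomposed code point
    // The two spellings render identically, but neither a byte-wise nor a
    // code-point-wise search will match one against the other:
    assert(!haystack.canFind(needle));
}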
March 09, 2014
On 2014-03-09 14:12:28 +0000, "Marc Schütz" <schuetzm@gmx.net> said:

> That won't work, because your needle might be in a different normalization form than your haystack, so a byte-by-byte comparison will not be able to find it.

The core of the problem is that sometimes this byte-by-byte comparison is exactly what you want: when searching for some terminal character(s) in some kind of parser, for instance.
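
For example, a quick sketch of such a scan (findClosingQuote is a hypothetical helper, not a Phobos function); because UTF-8 guarantees that the bytes of multi-byte sequences never collide with ASCII values, comparing raw code units is safe here:

// Hypothetical helper: find the next '"' without any decoding.
size_t findClosingQuote(string s, size_t start)
{
    foreach (i; start .. s.length)
        if (s[i] == '"') // plain code-unit comparison
            return i;
    return s.length;
}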

Other times you want to do a proper Unicode search using Unicode comparison algorithms: when the user is searching for a particular string in a text document, for instance.

The former is very easy to do with the current API. But what's the API for the latter?

And how do we make the correct API the obvious choice depending on the use case?

These two questions are what this thread should be about. Although not unimportant, performance of std.array.front() and whether it should decode is a secondary issue in comparison.

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

March 09, 2014
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
> IMO, the "normalization" argument is overrated. I've yet to encounter a real-world case of normalization: only hand-written counter-examples. Not saying it doesn't exist, just that:
> 1. It occurs only in special cases that the program should be aware of beforehand.
> 2. Arguably, it can be taken care of eagerly, or in a special pass.
>
> As for "the belief that iterating by code point has utility", I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail.

We don't "handle" code points (when have you ever wanted to handle a combining character separately from the character it combines with?)

You are just thinking of a subset of languages and locales.

Normalization is an issue any time you have a user enter text into your program and you then want to search for that text. I hope we can agree this isn't a rare occurrence.


>> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character?
>
> But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to.

Searching, equality testing, copying, sorting, hashing, splitting, joining...

I can't think of a single use-case for searching for a non-ASCII code point. You can search for strings, but searching by code unit is just as good (and fast by default).


> AFAIK, the most common algorithm "case insensitive search" *must* decode.

But it must also normalize and take locales into account, so working by code point alone is insufficient (unless you are willing to ignore languages like Turkish). See the Turkish I:

http://en.wikipedia.org/wiki/Turkish_I
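
A minimal illustration (std.uni.toLower works per code point and, as far as I know, has no notion of locale):

import std.uni : toLower;

void main()
{
    assert(toLower('I') == 'i'); // fine for English
    // In Turkish the lowercase of 'I' is the dotless 'ı' (U+0131), so a
    // case-insensitive search built on per-code-point toLower gives wrong
    // answers for Turkish text:
    assert(toLower('I') != 'ı');
}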

Sure, if you just want to ignore normalization and several languages, then by code point is just fine... but that's the point: by code point is incorrect in general.


> There may still be cases where it is not working as intended in the face of normalization, but it is leaps and bounds better than what we get iterating with code units.
>
> To turn it the other way around, *what* are you guys doing that doesn't require decoding, and where performance is such a killer?

Searching, equality testing, copying, sorting, hashing, splitting, joining...

The performance thing can be fixed in the library, but my concern is that (a) it takes a significant amount of code to do so, and (b) it complicates implementations. There are many, many algorithms in Phobos that are special-cased for strings, and I don't think it needs to be that way.


>> To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this?
>
> I do not know of a single bug report regarding buggy Phobos code that used front/popFront. Not_a_single_one (AFAIK).
>
> On the other hand, there are plenty of bugs caused by attempting not to decode strings, or by decoding them incorrectly. They are being corrected on a continuous basis.

Can you provide a link to a bug?

Also, you haven't answered the question :-)  Can you give a real-life example of a case where decoding to code points was necessary and code units wouldn't have sufficed?

You have mentioned case-insensitive searching, but I think I've adequately demonstrated that this doesn't work in general by code point: you need to normalize and take locales into account.
March 09, 2014
On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu wrote:
> On 3/8/14, 8:24 PM, Vladimir Panteleev wrote:
>> On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote:
>>> What exactly is the consensus? From your wiki page I see "One of the
>>> proposals in the thread is to switch the iteration type of string
>>> ranges from dchar to the string's character type."
>>>
>>> I can tell you straight out: That will not happen for as long as I'm
>>> working on D.
>>
>> Why?
>
> From the cycle "going in circles": because I think the breakage is way too large compared to the alleged improvement.

All right. I was wondering if there was something more fundamental behind such an ultimatum.

> In fact I believe that that design is inferior to the current one regardless.

I was hoping we could come to an agreement at least on this point.

---

BTW, a thought struck me while thinking about the problem yesterday.

char and dchar should not be implicitly convertible to one another, or comparable with each other.

void main()
{
    string s = "Привет";
    // foreach over a string iterates by char (UTF-8 code units); one of the
    // Cyrillic lead bytes (0xD1) compares equal to the dchar 'Ñ' (U+00D1)
    // through implicit conversion, so this assert fails at runtime.
    foreach (c; s)
        assert(c != 'Ñ');
}

Instead, std.conv.to should allow converting between character types, iff they represent one whole code point and fit into the destination type, and throw an exception otherwise (similar to how it deals with integer overflow). Char literals should be special-cased by the compiler to implicitly convert to any sufficiently large type.
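
A rough sketch of the kind of checked conversion this implies (charFrom is a hypothetical helper written for illustration, not an existing Phobos function):

import std.conv : ConvException;

// Hypothetical: narrowing succeeds only if the code point fits entirely in
// the destination character type, and throws otherwise (analogous to how
// std.conv.to handles integer overflow).
char charFrom(dchar c)
{
    if (c > 0x7F) // only ASCII code points fit in a single UTF-8 code unit
        throw new ConvException("code point does not fit in a char");
    return cast(char) c;
}

unittest
{
    assert(charFrom('A') == 'A');
    // charFrom('Ñ') would throw instead of silently truncating to 0xD1.
}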

This would break more[1] code, but it would avoid the silent failures of the earlier proposal.

[1] I went through my own larger programs. I actually couldn't find any uses of dchar which would be impacted by such a hypothetical change.
March 09, 2014
On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote:
>> - In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring.
>
> With all due respect, D's string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. For other cases, ubyte[] is there.

This is an arbitrary self-imposed limitation caused by the choice in how strings are handled in Phobos.
March 09, 2014
On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote:
> On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
>> On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
>>> Can we look at some example situations that this will break?
>>
>> Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?
>
> This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead.

Why? There's no reason why dchar[] would stop being a range. It would be treated as it is now, like any other array.
March 09, 2014
On Sunday, 9 March 2014 at 13:47:26 UTC, Marc Schütz wrote:
> On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
>> 2) It is a regression back to the C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if the language/Phobos doesn't punish you for it, it will be extremely widespread.
>
> Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.

Andrei has made it clear that the code breakage this would involve would be unacceptable.
March 09, 2014
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
> On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win.

Care to elaborate?

> I'd be tempted not to ask "how do we back out", but rather "how can we take this further"? I'd love to ditch the whole "char"/"dchar" thing altogether and work with graphemes. But that would be a massive undertaking.

As has been discussed, this does not make sense. Graphemes are also a concept that applies only to certain writing systems; all it would do is exchange one set of trade-offs for another, without solving anything. Text isn't that simple.
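
For concreteness, a sketch of the three levels under discussion (assuming std.uni's normalize and byGrapheme behave as documented), where the same short word has three different "lengths":

import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFD;

void main()
{
    auto s = "cassé".normalize!NFD;       // 'é' stored as 'e' + combining acute
    assert(s.length == 7);                // 7 UTF-8 code units
    assert(s.walkLength == 6);            // 6 code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 5); // 5 user-perceived characters
}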
March 09, 2014
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
> On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
>>> The current approach is a cut above treating strings as arrays of bytes
>>> for some languages, and still utterly broken for others. If I'm
>>> operating on a right to left language like Hebrew, what would I expect
>>> the result to be from something like countUntil?
>>
>> The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind.
>>
>> Andrei
>
> I'm pretty sure that all string operations are actually "front to back". If I recall correctly, even languages that "read" right to left are stored in a front-to-back manner: e.g. string[0] would be the right-most character. It is only a question of "display", and changes nothing in the code. As for "countUntil", it would still work perfectly fine, as an RTL reader would expect the counting to start at the "beginning", i.e. the "right" side.
>
> I'm pretty confident RTL is 100% supported. The only issue is the "front"/"left" ambiguity, and the only one I know of is the oddly named "stripLeft" function, which actually does a "stripFront" anyway.
>
> So I wouldn't worry about RTL.

Yeah, I think RTL strings are preceded by a code point that indicates RTL display. It was just something I mentioned because some operations might be confusing to the programmer.


> But as mentioned, it is languages like the Indian ones, which have complex graphemes, or languages with accented characters, e.g. most European ones, that can have problems, such as canFind("cassé", 'e').

True. I still question why anyone would want to do character-based operations on Unicode strings. I guess substring searches could end up with the same problem in some cases, for the same reason, if not implemented specifically for Unicode, but those should be far less common.
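
A small sketch of the canFind("cassé", 'e') case mentioned above (assuming std.uni.normalize behaves as documented):

import std.algorithm : canFind;
import std.uni : normalize, NFC, NFD;

void main()
{
    // In NFC, 'é' is a single code point, so there is no standalone 'e':
    assert(!"cassé".normalize!NFC.canFind('e'));
    // In NFD, 'é' is stored as 'e' + a combining acute, so the search
    // "finds" a letter the user perceives as part of 'é':
    assert("cassé".normalize!NFD.canFind('e'));
}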
March 09, 2014
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
> As for "the belief that iterating by code point has utility", I have to strongly disagree. Unicode is composed of code points, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail.

But you don't deal with Unicode. You deal with *text*. Unless you are implementing Unicode algorithms, code points solve nothing in the general case.

> Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong.

Sorting a string has quite limited use in the general case, so I think this is another artificial example.

> Even Walter pointed out that such code should work. *Maybe* it is still wrong with regard to graphemes and normalization, but at *least* the result is not a corrupted UTF-8 stream.

I think this is no worse than having all the combining marks clustered at the end of the string, attached to the last non-combining letter.
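
For reference, a sketch of the difference being discussed, using only std.algorithm.sort on the two element types:

import std.algorithm : sort;
import std.array : array;

void main()
{
    // Sorting the raw code units of a non-ASCII string corrupts the UTF-8:
    auto units = cast(ubyte[]) "éa".dup;
    sort(units); // 0x61, 0xA9, 0xC3: a continuation byte now precedes its
                 // lead byte, so the buffer is no longer valid UTF-8

    // Sorting by code point at least keeps the result well-formed:
    auto points = "éa".array; // dchar[] thanks to auto-decoding
    sort(points);             // "aé"
}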