March 13, 2014
On Thursday, March 06, 2014 18:37:13 Walter Bright wrote:
> Is there any hope of fixing this?

I agree with Andrei. I don't think that there's really anything to fix. The problem is that there are roughly three levels at which string operations can be done:

1. By code unit
2. By code point
3. By grapheme

and which is correct depends on what you're trying to do. Phobos attempts to go for correctness by default without seriously impacting performance, so it treats all strings as ranges of dchar (i.e., level #2). If we went with #1, then pretty much any algorithm that operates on individual characters would be broken, because unless your strings are ASCII-only, code units are very much the wrong level to be operating on when dealing with characters. If we went with #3, then we'd have full correctness, but we'd tank performance. With #2, we're far more correct than is typically the case with C++ while still being reasonably performant. Those who want full performance can use immutable(ubyte)[] to get #1, and those who want #3 can use the grapheme support in std.uni.
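
For a concrete picture, here's a minimal sketch of the three levels on one string (walkLength counts range elements; byGrapheme is part of the std.uni grapheme support mentioned above):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis
    assert(s.length == 6);                // 1. code units (UTF-8 bytes)
    assert(s.walkLength == 5);            // 2. code points (range of dchar)
    assert(s.byGrapheme.walkLength == 4); // 3. graphemes (n, o, ë, l)
}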

We've gone to great lengths in Phobos to specialize on narrow strings in order to make it more efficient while still maintaining correctness, and anyone who really wants performance can do the same. But by operating at the code point level, we at least get a reasonable level of Unicode-correctness by default. With your suggestion, I'd fully expect most D programs to be wrong with regards to Unicode, because most programmers don't know or care about how Unicode works. And changing what we're doing now would be code breakage of astronomical proportions. It would essentially break all range-based string code, and it would certainly be the largest code breakage that D has seen in years, if not ever. So, it's almost certainly a bad idea, but if it isn't, we need to be darn sure that what we change to is significantly better and worth the huge amount of code breakage that it will cause.
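
As a rough sketch of what that looks like in user code - searching for an ASCII character with and without decoding (representation reinterprets a string as immutable(ubyte)[]):

import std.algorithm : find;
import std.string : representation;

void main()
{
    string s = "hello, wörld";
    // Decoding search: the string is consumed as a range of dchar.
    assert(find(s, 'w') == "wörld");
    // Code-unit search: no decoding at all. Safe here because 'w' is
    // ASCII, and in UTF-8 no code unit of a multi-byte sequence is
    // below 0x80, so a false match is impossible.
    assert(find(s.representation, cast(ubyte) 'w') == "wörld".representation);
}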

I really don't think that there's any way to get this right. Regardless of which level you operate at by default - be it code unit, code point, or grapheme - it will be wrong a good chunk of the time. So, it becomes a question of which of the three has the best tradeoffs, and I think that our current solution of operating on code points by default strikes that balance. If there are things that we can do to better support operating on code units or graphemes for those who want it, then great. And it's great if we can find ways to make operating at the code point level more efficient, or less prone to bugs caused by not operating at the grapheme level. But I think that operating at the code point level like we currently do is by far the best approach.
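
As one concrete example of the code point level still being wrong (a quick sketch): reversing a string by code point tears combining marks off their base letters.

import std.array : array;
import std.range : retro;

void main()
{
    string s = "re\u0301sume\u0301"; // "résumé" with combining acute accents
    // Reversing by code point detaches each combining accent from its
    // base letter and attaches it to the wrong neighbor:
    auto reversed = s.retro.array; // a dchar[]
    assert(reversed == "\u0301emus\u0301er"d);
}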

If anything, it's the fact that the language itself doesn't operate at the code point level that's a bigger concern IMHO - the main place where that's an issue being that foreach iterates by code unit by default. But I don't know of a good way to solve that other than treating all arrays of char, wchar, and dchar specially and disabling their array operations the way ranges do, so that you have to go through the representation function in order to operate on them as code units - which Andrei has suggested a number of times before, but you've shot him down each time. If that were fixed, then at least we'd be consistent, and inconsistency is usually the biggest complaint with regards to how D treats strings. But I really don't think that there's a magical fix for range-based string operations, and I think that our current approach is a good one.
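
A minimal sketch of that foreach inconsistency:

void main()
{
    import std.string : representation;

    string s = "héllo";
    size_t units, points;
    foreach (char c; s)  ++units;  // the default for string: code units
    foreach (dchar c; s) ++points; // opt-in decoding: code points
    assert(units == 6 && points == 5); // 'é' is two UTF-8 code units

    // Ranges, in contrast, always present a string as dchars;
    // representation is the explicit escape hatch down to code units.
    immutable(ubyte)[] raw = s.representation;
    assert(raw.length == 6);
}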

- Jonathan M Davis
March 18, 2014
Am Mon, 10 Mar 2014 17:44:22 -0400
schrieb Nick Sabalausky <SeeWebsiteToContactMe@semitwist.com>:

> On 3/7/2014 8:40 AM, Michel Fortin wrote:
> > On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS@lycos.com> said:
> >
> >> Walter Bright:
> >>
> >>> I understand this all too well. (Note that we currently have a different silent problem: unnoticed large performance problems.)
> >>
> >> On the other hand your change could introduce Unicode-related bugs in
> >> future code (that the current Phobos avoids) (and here I am not
> >> talking about code breakage).
> >
> > The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combining diacritics or character variant modifiers -- so decoding at the code-point level is often insufficient for correctness.
> >
> 
> Well, it is *more* correct, as text in many Western languages is more likely to "just work" in most cases with current Phobos. It's just that things still aren't completely correct overall.
> 
> >  From my experience, I'd suggest these basic operations for a "string
> > range" instead of the regular range interface:
> >
> > .empty
> > .frontCodeUnit
> > .frontCodePoint
> > .frontGrapheme
> > .popFrontCodeUnit
> > .popFrontCodePoint
> > .popFrontGrapheme
> > .codeUnitLength (aka length)
> > .codePointLength (for dchar[] only)
> > .codePointLengthLinear
> > .graphemeLengthLinear
> >
> > Someone should be able to mix all the three 'front' and 'pop' function variants above in any code dealing with a string type. In my XML parser for instance I regularly use frontCodeUnit to avoid the decoding penalty when matching the next character with an ASCII one such as '<' or '&'. An API like the one above forces you to be aware of the level you're working on, making bugs and inefficiencies stand out (as long as you're familiar with each representation).
> >
> > If someone wants to use a generic array/range algorithm with a string, my opinion is that he should have to wrap it in a range type that maps front and popFront to one of the above variants. Having to do that should make the inefficiency obvious: you're using an algorithm that wasn't tailored to work with strings, and more decoding than strictly necessary is being done.
> >
> 
> I actually like this suggestion quite a bit.

+1 Reminds me of my proposal for Rust (https://github.com/mozilla/rust/issues/7043#issuecomment-19187984)
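
A rough sketch of how such a wrapper might look in D - a hypothetical type (not a Phobos one) showing only the code-unit and code-point variants; the grapheme variants would sit on top of std.uni.decodeGrapheme in the same way:

import std.utf : decode, stride;

// Hypothetical sketch; the names follow the list above.
struct StringRange
{
    string data;

    @property bool empty() { return data.length == 0; }

    // Level 1: raw UTF-8 code unit, no decoding cost.
    @property char frontCodeUnit() { return data[0]; }
    void popFrontCodeUnit() { data = data[1 .. $]; }

    // Level 2: decoded code point.
    @property dchar frontCodePoint()
    {
        size_t i = 0;
        return decode(data, i);
    }
    void popFrontCodePoint() { data = data[stride(data, 0) .. $]; }

    @property size_t codeUnitLength() { return data.length; }
}

void main()
{
    auto r = StringRange("<tag>");
    // Matching an ASCII delimiter needs no decoding:
    if (!r.empty && r.frontCodeUnit == '<')
        r.popFrontCodeUnit();
    assert(r.frontCodePoint == 't');
}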

-- 
Marco

March 19, 2014
Am Sat, 08 Mar 2014 22:07:09 +0000
schrieb "Sean Kelly" <sean@invisibleduck.org>:

> On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu wrote:
> >
> > Pretty much everyone using ICU hates it.
> 
> I think the biggest problem with ICU is documentation.  It can take a long time to figure out how to do something if you've never done it before.  Also, the C interface in ICU seems better than the C++ interface.  And I'll grant that a few things are just far harder than they need to be.  I wanted a transcoding iterator, and ICU almost has this, but not quite, so I've got to write my own.  In fact, iterating across an arbitrary encoding is in general at least not intuitive and perhaps not possible, so I kinda gave up on that.  Um, and there's the use of UTF-16 as the standard encoding, which forces many transcoding operations to go through two conversions.  Okay, I guess there are a lot of problems with ICU, but it handles nearly every requirement I have, which is in itself quite a lot.

You'll find the answer here: http://userguide.icu-project.org/icufaq#TOC-What-is-the-performance-difference-between-UTF-8-and-UTF-16-

In addition, it is infeasible to maintain code for direct conversions between all of the encodings ICU supports. The project doesn't aim to provide any one specific transcoding, but all of them equally, so everything is routed through a UTF-16 pivot. What can you do? For Java it is easier to accept, since Java uses UTF-16 internally.
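
The arithmetic behind that: with a pivot, N encodings need only 2·N conversion paths instead of N·(N-1) direct ones. Here's a toy sketch of the resulting double conversion (a hypothetical helper, not ICU's API - Latin-1 in, UTF-16 pivot, UTF-8 out):

import std.conv : to;

// Hypothetical illustration of pivot transcoding (two conversions):
string latin1ToUtf8(const(ubyte)[] latin1)
{
    wchar[] pivot;
    pivot.reserve(latin1.length);
    foreach (b; latin1)
        pivot ~= cast(wchar) b; // Latin-1 maps 1:1 onto U+0000..U+00FF
    return pivot.to!string;     // second conversion: UTF-16 to UTF-8
}

void main()
{
    const(ubyte)[] input = [0x68, 0xE9]; // "hé" in Latin-1
    assert(latin1ToUtf8(input) == "hé");
}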

-- 
Marco
