Major performance problem with std.array.front()

On Friday, 7 March 2014 at 03:32:50 UTC, H. S. Teoh wrote:

> Calling count() on a narrow string will not return the expected value, for example.

I would argue that, unless it's been made clear that the program is expected to work only for certain languages, code that relied on this was wrong in the first place.

On 03/07/2014 12:56 PM, Vladimir Panteleev wrote:

> I'm glad I'm not the only one who feels this way. Implicit decoding must die.
>
> I strongly believe that implicit decoding of character points in std.range has been a mistake.
>
> - Algorithms such as "countUntil" will count code points. These numbers are useless for slicing, and can introduce hard-to-find bugs.

+1

see my pull requests for std.string:
https://github.com/D-Programming-Language/phobos/pull/1952
https://github.com/D-Programming-Language/phobos/pull/1977
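
To make the countUntil complaint concrete, here is a minimal sketch. The sample string is an assumption chosen for illustration. It shows that the code-point count returned by countUntil on a narrow string differs from the code-unit offset that slicing needs, which std.string.indexOf returns.

```d
import std.algorithm.searching : countUntil;
import std.string : indexOf;

void main()
{
    // "héllo wörld": é and ö each take two UTF-8 code units.
    string s = "h\u00E9llo w\u00F6rld";

    // countUntil auto-decodes, so it counts code points before 'w'.
    auto codePoints = s.countUntil('w');
    assert(codePoints == 6);

    // indexOf returns a code-unit offset, which is what slicing expects.
    auto codeUnits = s.indexOf('w');
    assert(codeUnits == 7);

    // Slicing with the code-unit offset lands on 'w' ...
    assert(s[codeUnits .. $] == "w\u00F6rld");
    // ... while the code-point count is one short and lands on the space.
    assert(s[codePoints .. $] == " w\u00F6rld");
}
```

With more non-ASCII text before the match, the code-point count can also point into the middle of a multi-byte sequence, in which case the slice is not even valid UTF-8.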

On Fri, 07 Mar 2014 05:41:18 -0500, Walter Bright <newshound2@digitalmars.com> wrote:

> On 3/7/2014 2:27 AM, Dmitry Olshansky wrote:
>> Where have you been when it was introduced? :)
>
> It slipped by me. What can I say? I'm not the only committer :-)

No, this is intrinsic in the problem of treating strings as ranges of dchar. This one function is a symptom, not the problem.

-Steve

March 07, 2014

Re: Major performance problem with std.array.front()

Posted by Michel Fortin
in reply to bearophile

On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS@lycos.com> said:

> Walter Bright:
> 
>> I understand this all too well. (Note that we currently have a different silent problem: unnoticed large performance problems.)
> 
> On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).

The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combining diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness.

The problem with Unicode strings is that the representation you must work with depends on the things you want to do. If you want to count the characters then you need graphemes; if you want to parse XML then you'll need to work with code points (in theory; in practice you might still want direct access to code units for performance reasons); and if you want to slice or copy a string then you need to deal with code units. Because of this multiple-representation-for-different-purpose thing, generic algorithms for arrays don't map very well to strings.
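
As a small illustration of the three levels described above, the sketch below measures the same string in code units, code points and graphemes. The sample string is an assumption. byGrapheme comes from std.uni, and walkLength on a narrow string counts decoded code points.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one user-perceived character built from two code points.
    string s = "e\u0301";

    assert(s.length == 3);                 // UTF-8 code units (1 + 2)
    assert(s.walkLength == 2);             // code points
    assert(s.byGrapheme.walkLength == 1);  // graphemes
}
```

Each of the three numbers is the right answer to a different question, which is why no single notion of length fits every use of a string.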

From my experience, I'd suggest these basic operations for a "string range" instead of the regular range interface:

.empty
.frontCodeUnit
.frontCodePoint
.frontGrapheme
.popFrontCodeUnit
.popFrontCodePoint
.popFrontGrapheme
.codeUnitLength (aka length)
.codePointLength (for dchar[] only)
.codePointLengthLinear
.graphemeLengthLinear

Someone should be able to mix all the three 'front' and 'pop' function variants above in any code dealing with a string type. In my XML parser for instance I regularly use frontCodeUnit to avoid the decoding penalty when matching the next character with an ASCII one such as '<' or '&'. An API like the one above forces you to be aware of the level you're working on, making bugs and inefficiencies stand out (as long as you're familiar with each representation).
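
A rough sketch of how part of the interface listed above could be put together from existing Phobos primitives. The struct name StringRange and its layout are hypothetical and are not taken from the post or from Phobos. Only the std.utf and std.uni calls are real, and the linear length members are left out for brevity.

```d
import std.uni : Grapheme, decodeGrapheme, graphemeStride;
import std.utf : decode, stride;

/// Hypothetical string range exposing all three levels side by side.
struct StringRange
{
    string data;

    @property bool empty() { return data.length == 0; }

    // Code-unit level: no decoding at all.
    @property char frontCodeUnit() { return data[0]; }
    void popFrontCodeUnit() { data = data[1 .. $]; }

    // Code-point level: decode one UTF-8 sequence.
    @property dchar frontCodePoint()
    {
        size_t i = 0;
        return decode(data, i);
    }
    void popFrontCodePoint() { data = data[stride(data, 0) .. $]; }

    // Grapheme level: decode one grapheme cluster.
    @property Grapheme frontGrapheme()
    {
        string copy = data; // decodeGrapheme advances its argument
        return decodeGrapheme(copy);
    }
    void popFrontGrapheme() { data = data[graphemeStride(data, 0) .. $]; }

    @property size_t codeUnitLength() { return data.length; }
}

void main()
{
    auto r = StringRange("<e\u0301>");

    // Cheap ASCII check without decoding, as in the XML parser case.
    assert(r.frontCodeUnit == '<');
    r.popFrontCodeUnit();

    assert(r.frontCodePoint == 'e');
    assert(r.frontGrapheme.length == 2); // 'e' plus the combining accent
    r.popFrontGrapheme();

    assert(r.frontCodeUnit == '>');
    assert(r.codeUnitLength == 1);
    r.popFrontCodePoint();
    assert(r.empty);
}
```

Mixing the variants freely works because each pop leaves the underlying slice on a boundary that is valid for every level, provided the caller only pops a code unit when the front is a single-unit (ASCII) character.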

If someone wants to use a generic array/range algorithm with a string, my opinion is that he should have to wrap it in a range type that maps front and popFront to one of the above variants. Having to do that should make it obvious that there's an inefficiency there, as you're using an algorithm that wasn't tailored to work with strings and that more decoding than strictly necessary is being done.
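
The wrapping idea can be sketched with two small adapters. The names ByUnit and ByPoint are hypothetical, written here only for illustration. Phobos itself offers comparable explicit adapters in std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme.

```d
import std.algorithm.searching : count;
import std.range : walkLength;
import std.range.primitives : isInputRange;
import std.utf : decode, stride;

/// Hypothetical adapter: front/popFront work on code units, no decoding.
struct ByUnit
{
    string s;
    @property bool empty() { return s.length == 0; }
    @property char front() { return s[0]; }
    void popFront() { s = s[1 .. $]; }
}

/// Hypothetical adapter: front/popFront work on decoded code points.
struct ByPoint
{
    string s;
    @property bool empty() { return s.length == 0; }
    @property dchar front()
    {
        size_t i = 0;
        return decode(s, i);
    }
    void popFront() { s = s[stride(s, 0) .. $]; }
}

// Both satisfy the regular input range interface.
static assert(isInputRange!ByUnit && isInputRange!ByPoint);

void main()
{
    string s = "a\u00E9\u00E9"; // "aéé": 5 code units, 3 code points

    // The generic algorithm sees whichever level the wrapper names.
    assert(ByUnit(s).walkLength == 5);
    assert(ByPoint(s).walkLength == 3);
    assert(ByPoint(s).count('\u00E9') == 2);
}
```

The choice of level shows up at the call site, so a reader can see where decoding happens and where it does not.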

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

On Thu, 06 Mar 2014 21:37:13 -0500, Walter Bright <newshound2@digitalmars.com> wrote:

> Is there any hope of fixing this?

Yes, make D strings not char arrays, but a library-defined struct with an array as backing.

    auto x = "...";

compiles to =>

    auto x = string(cast(immutable(char)[])"...");

Then define string to be whatever kind of range you want in the library, with whatever functionality you want.

Then if you want by-char traversal, explicitly use immutable(char)[] as x's type. And in the string range's members, we can provide whatever access we want.

Note, this also fixes foreach, and many other problems we have. Most likely code that works today will continue to work, since it's much more of a bear to type immutable(char)[] instead of string :)

-Steve

On Friday, 7 March 2014 at 04:19:16 UTC, Walter Bright wrote:

> I'd rather fix the compiler's codegen than add a pragma.

The codegen isn't broken; the current this pointer behavior is needed for full compatibility with the C ABI. It would be opt-in to an ABI tweak that the caller needs to be aware of, rather than a traditional optimization where the outside world would never know.

On Friday, 7 March 2014 at 13:56:48 UTC, Adam D. Ruppe wrote:

> On Friday, 7 March 2014 at 04:19:16 UTC, Walter Bright wrote:
>> I'd rather fix the compiler's codegen than add a pragma.
>
> The codegen isn't broken; the current this pointer behavior is needed for full compatibility with the C ABI. It would be opt-in to an ABI tweak that the caller needs to be aware of, rather than a traditional optimization where the outside world would never know.

We don't need C ABI compatibility for stuff that is not extern(C), do we?

On Friday, 7 March 2014 at 10:44:46 UTC, Kagamin wrote:

> Now it's passed by value.

That won't work for operator overloading though (which is the really interesting case here).

> Though, I needed checked arithmetic only twice: for cast from long to int and for cast from double to long. If you expect your number type to overflow, you probably chose wrong type.

I very rarely need it too, but it is nice to have in a convenient package that is fairly efficient at the same time.

On Friday, 7 March 2014 at 14:04:53 UTC, Dicebot wrote:

> We don't need C ABI compatibility for stuff that is not extern(C), do we?

That's a good point, though personally I'd still like some way to magic it up, even in extern(C). Consider the example of library typedef.

If C did:

    typedef void* HANDLE;

and D did

    struct HANDLE { void* foo; alias foo this; }

it is almost the same, but then when you declare

    HANDLE OpenFile(...);

it won't work, since the compiler will pass a hidden struct pointer (which is exactly what C would expect if it was a typedef struct { void* } on its side too) instead of expecting the value in the accumulator as it would with the void*.
