Major performance problem with std.array.front() (page 3)

07-Mar-2014 07:22, bearophile wrote:
> Walter Bright:
>
>> You use ranges a lot. Would it break any of your code?
>
> I need to try the changes to be sure. But the magnitude of this change is so large that I guess some code will surely break.
>
> One advantage of your change is that this code will work:
>
> auto s = "hello".dup;
> s.sort();

Which it shouldn't unless there is an ascii type or some such.

--
Dmitry Olshansky
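As a hedged illustration of the "ascii type or some such" remark: in current Phobos, `std.string.representation` gives a `ubyte[]` view of a `char[]`, so sorting becomes an explicit code-unit operation rather than something that silently works on text. The helper name `sortCodeUnits` is made up for this sketch, and the result is only meaningful for ASCII input.

```d
import std.algorithm.sorting : sort;
import std.string : representation;

/// Hypothetical helper: sorts the UTF-8 code units of `s` in place.
/// Only meaningful for pure-ASCII input; sorting the bytes of a
/// multi-byte sequence would produce invalid UTF-8.
char[] sortCodeUnits(char[] s)
{
    // representation() returns a ubyte[] view over the same memory,
    // which is a random-access range and therefore sortable.
    sort(s.representation);
    return s;
}
```

Calling `sortCodeUnits("hello".dup)` reorders the bytes of the copy; the caller has to opt in to the byte view, which keeps the "strings are not arrays of characters" distinction visible.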

07-Mar-2014 07:52, Walter Bright wrote:
> On 3/6/2014 7:06 PM, Walter Bright wrote:
>> On 3/6/2014 6:37 PM, Walter Bright wrote:
>>> Is there any hope of fixing this?
>>
>> Is there any way we can provide an upgrade path for this? Silent breakage is terrible. Any ideas?
>
> Ok, I have a plan. Each step will be separated by at least one version:
>
> 1. implement decode() as an algorithm for string types, so one can write:
>
> string s;
> s.decode.algorithm...
>
> suggest that people start doing that instead of:
>
> s.algorithm...
>

This would also be a great fit in cases where 'decode' is decoding some other encoding.

> 2. Emit warning when people use std.array.front(s) with strings.
>
> 3. Deprecate std.array.front for strings.
>
> 4. Error for std.array.front for strings.

This sounds fine to me. I would even prefer to only offer explicit wrappers:

.raw - ubyte/ushort for UTF-8/UTF-16 etc.
.decode - dchars as Nick suggests.

Then there is also the horrible ElementEncodingType vs ElementType. I would love to see ElementEncodingType die.

>
> 5. Implement new std.array.front for strings that doesn't decode.

It would make it simple to think that strings are arrays of characters. This illusion was broken (and good thing it was), no point in reestablishing it to save a couple of keystrokes for those "who really know what they are doing".

--
Dmitry Olshansky
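A minimal sketch of what explicit wrappers look like in practice. It does not use the `.raw` and `.decode` names proposed above, which are only suggestions in this thread; instead it assumes the existing Phobos functions `std.utf.byCodeUnit`, `std.utf.byDchar` and `std.string.representation` as stand-ins. The function names `rawLength`, `decodedLength` and `rawBytes` are hypothetical.

```d
import std.range : walkLength;
import std.string : representation;
import std.utf : byCodeUnit, byDchar;

/// Counts UTF-8 code units: the explicit "raw" view, no decoding.
size_t rawLength(string s)
{
    return s.byCodeUnit.walkLength;
}

/// Counts code points: decoding is requested explicitly by the caller.
size_t decodedLength(string s)
{
    return s.byDchar.walkLength;
}

/// Returns the underlying bytes without any UTF interpretation.
immutable(ubyte)[] rawBytes(string s)
{
    return s.representation;
}
```

With wrappers like these, the choice between code units and code points is spelled out at the call site instead of being decided by the range primitives.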

07-Mar-2014 06:37, Walter Bright wrote:
> In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges.
>
> Throughout D's history, there are regular and repeated proposals to redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e. so D will automatically generate code to decode and encode on every attempt to index char[].
...
> Is there any hope of fixing this?

Where have you been when it was introduced? :)

--
Dmitry Olshansky

On 3/7/2014 2:11 AM, Dmitry Olshansky wrote:
> Then there is also the horrible ElementEncodingType vs ElementType.
> I would love to see ElementEncodingType die.

I agree. ElementEncodingType is a giant red flag saying we screwed things up.

On 3/7/2014 2:27 AM, Dmitry Olshansky wrote:
> Where have you been when it was introduced? :)

It slipped by me. What can I say? I'm not the only committer :-)

But after spending non-trivial time suffering as auto-decode blasted my kingdom, I've concluded that it needs to die. Working around it is not easy.

I know that auto-decode has negatively impacted your regex, too.

Basically, auto-decode is like booking a flight from Seattle to San Francisco with a plane change in Atlanta.

On Friday, 7 March 2014 at 04:01:15 UTC, Adam D. Ruppe wrote:
> BTW you know what would help this? A pragma we can attach to a struct which makes it a very thin value type.
>
> pragma(thin_struct)
> struct A {
>    int a;
>    int foo() { return a; }
>    static A get() { return A(10); }
> }
>
> void test() {
>    A a = A.get();
>    printf("%d", a.foo());
> }
>
> With the pragma, A would be completely indistinguishable from int in all ways.
>
> What do I mean?
> $ dmd -release -O -inline test56 -c
>
> Let's look at A.foo:
>
> A.foo:
>    0:   55                      push   ebp
>    1:   8b ec                   mov    ebp,esp
>    3:   50                      push   eax
>    4:   8b 00                   mov    eax,DWORD PTR [eax] ; waste!
>    6:   8b e5                   mov    esp,ebp
>    8:   5d                      pop    ebp
>    9:   c3                      ret
>
> It is line four that bugs me: the struct is passed as a *pointer*, but its only contents are an int, which could just as well be passed as a value. Let's compare it to an identical function in operation:
>
> int identity(int a) { return a; }
>
> 00000000 <_D6test568identityFiZi>:
>    0:   55                      push   ebp
>    1:   8b ec                   mov    ebp,esp
>    3:   83 ec 04                sub    esp,0x4
>    6:   c9                      leave
>    7:   c3                      ret
>
> lol it *still* wastes time, setting up a stack frame for nothing. But we could just as well write asm { naked; ret; } and it would work as expected: the argument is passed in EAX and the return value is expected in EAX. The function doesn't actually have to do anything.

struct A {
   int a;
   //int foo() { return a; }
   static A get() { return A(10); }
}

int foo(A a) { return a.a; }

printf("%d", a.foo());

Now it's passed by value.

Though, I needed checked arithmetic only twice: for cast from long to int and for cast from double to long. If you expect your number type to overflow, you probably chose the wrong type.

March 07, 2014

Re: Major performance problem with std.array.front()

Posted by Vladimir Panteleev
in reply to Walter Bright

On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote:
> In "Lots of low hanging fruit in Phobos" the issue came up about the automatic encoding and decoding of char ranges.
>
> Throughout D's history, there are regular and repeated proposals to redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e. so D will automatically generate code to decode and encode on every attempt to index char[].

I'm glad I'm not the only one who feels this way. Implicit decoding must die.

I strongly believe that implicit decoding of code points in std.range has been a mistake.

- Algorithms such as "countUntil" will count code points. These numbers are useless for slicing, and can introduce hard-to-find bugs.

- In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring.
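The countUntil point can be sketched as follows. This assumes current Phobos behavior, where `std.algorithm.searching.countUntil` walks a string by code point and `std.string.indexOf` returns a code unit index; the wrapper names `codePointOffset` and `codeUnitOffset` are hypothetical and exist only to label the two results.

```d
import std.algorithm.searching : countUntil;
import std.string : indexOf;

/// Position of `needle` counted in code points (what range algorithms
/// see when strings are auto-decoded).
ptrdiff_t codePointOffset(string haystack, dchar needle)
{
    return haystack.countUntil(needle);
}

/// Position of `needle` counted in UTF-8 code units, which is what
/// slicing expects.
ptrdiff_t codeUnitOffset(string haystack, dchar needle)
{
    return haystack.indexOf(needle);
}
```

For any haystack with non-ASCII text before the needle the two numbers differ, and only the second one is a valid slice index; using the first can cut into the middle of a multi-byte sequence.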

Furthermore, it doesn't actually solve anything completely! The only thing it solves is a subset of cases for a subset of languages!

People want to look at a string "character by character". If a Unicode code point is a character in your language and alphabet, I'm really happy for you, but that's not how it is for everyone. Combining marks, complex scripts etc. make this point just a fallacy that in the end will cause programmers to make mistakes that will affect certain users somewhere.
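A small sketch of why "code point" and "character" are not the same thing, assuming `std.uni.byGrapheme` from current Phobos for the user-perceived view. The three function names are hypothetical labels for the three ways of measuring the same string.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Number of UTF-8 code units (array length).
size_t codeUnits(string s)
{
    return s.length;
}

/// Number of code points; walkLength auto-decodes narrow strings.
size_t codePoints(string s)
{
    return s.walkLength;
}

/// Number of grapheme clusters, i.e. user-perceived characters.
size_t graphemes(string s)
{
    return s.byGrapheme.walkLength;
}
```

For "e" followed by U+0301 (combining acute accent) these give three different answers, and only the grapheme count matches what a reader would call one character.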

Why do people want to look at individual characters? There are a lot of misconceptions about Unicode, and I think some of that applies here.

- Do you want to split a string by whitespace? Some languages have no notion of whitespace. What do you need it for? Line wrapping? Employ the Unicode line-breaking algorithm instead.

- Do you want to uppercase the first letter of a string? Some languages have no notion of letter case, and some use it for different reasons. Furthermore, even languages with a Latin-based alphabet may not have 1:1 mapping for case, e.g. the German ß letter.

- Do you want to count how wide a string will be in a fixed-width font? Wrong... Combining and control characters, zero-width whitespace, etc. will render this approach futile.

- Do you want to split or flush a stream to a character device at a point so that there's no garbage? I believe this is the case in TDPL's mention of the subject. Again, combining characters or complex scripts will still be broken by this approach.

You need to either go all-out and provide complete implementations of the relevant Unicode algorithms to perform tasks such as the above that will work in all locales, or you need to draw a line somewhere for which languages, alphabets, and locales you want to support in your program. D's line is drawn at the point where it considers that code points == characters. However, the outcome of this is not made clear anywhere in its documentation, and for such an arbitrary decision (from a cultural point of view), it is embedded too deep into the language itself. With std.ascii, at least, it's clear to the user that the functions there will only work with English or languages using the same alphabet.

This doesn't apply universally. There are still cases like, e.g., regular expression ranges. [a-z] makes sense in English, and [а-я] makes sense in Russian, but I don't think that makes sense for all languages. However, for the most part, I think implicit decoding must be axed, and instead we need implementations of Unicode algorithms and the documentation to instruct users why and how to use them.
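A hedged sketch of the regular expression case, assuming `std.regex` character classes operate on code point ranges. The function name `isBasicCyrillicLower` is hypothetical. Note that [а-я] covers U+0430 through U+044F, so even within Russian it leaves out ё (U+0451).

```d
import std.regex : matchFirst, regex;

/// Returns true if `word` consists only of code points in the
/// range U+0430..U+044F (Cyrillic а through я).
bool isBasicCyrillicLower(string word)
{
    // The class is a plain code point interval, not a notion of
    // "letters of the Russian alphabet".
    return !matchFirst(word, regex(`^[а-я]+$`)).empty;
}
```

The range works for "привет" but rejects "ёлка", which shows that a code point interval is a convenience for some alphabets rather than a general definition of a letter.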

On Friday, 7 March 2014 at 03:32:50 UTC, H. S. Teoh wrote:
> On Thu, Mar 06, 2014 at 06:59:36PM -0800, Walter Bright wrote:
>> On 3/6/2014 6:54 PM, bearophile wrote:
>> >Walter Bright:
>> >>Is there any hope of fixing this?
>> >
>> >I don't think we can change that in D2. You can change it in D3.
>>
>> You use ranges a lot. Would it break any of your code?
>
> This is very high risk change IMO.

+1

This will be the most disruptive change in D's history...

On 3/7/14, Vladimir Panteleev <vladimir@thecybershadow.net> wrote:
> - Do you want to split a string by whitespace?
> - Do you want to uppercase the first letter of a string?
> - Do you want to count how wide a string will be in a fixed-width font?
> - Do you want to split or flush a stream to a character device at a point so that there's no garbage?

We could later make a page on dlang (or the wiki) describing how to do these common things.

On 2014-03-07 04:17:34 +0000, Walter Bright said:
> On 3/6/2014 7:59 PM, bearophile wrote:
>> Walter Bright:
>>
>>> I understand this all too well. (Note that we currently have a different silent problem: unnoticed large performance problems.)
>>
>> On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).
>
> This comes up repeatedly as justification for D trying to hide the UTF-8 nature of strings that I discussed upthread.
>
> To my mind it's like trying to pretend that floating point doesn't have roundoff issues, integers have infinite range, memory is infinite, etc. That has a place in other languages, but not in a systems/native language.

Is it possible to add a warning notice when .front() is used on char?

I would say fix it now, add a warning, and then remove the warning later.

-S.
