March 29, 2016 Re: char array weirdness
Posted in reply to Jack Stouffer

On Wed, Mar 30, 2016 at 03:22:48AM +0000, Jack Stouffer via Digitalmars-d-learn wrote:
> On Tuesday, 29 March 2016 at 23:42:07 UTC, H. S. Teoh wrote:
> > Believe it or not, it was only last year (IIRC, maybe the year before) that Walter "discovered" that Phobos does autodecoding, and got pretty upset over it. If even Walter wasn't aware of this for that long...
>
> The link (I think this is what you're referring to) to that discussion: http://forum.dlang.org/post/lfbg06$30kh$1@digitalmars.com
>
> It's a shame Walter never got his way. Special casing ranges like this is a huge mistake.

To be fair, one *could* make a case for autodecoding, if it were done right, i.e., segmenting by graphemes, which is what users really expect when they think of "characters". This would allow users to truly think in terms of characters (in the intuitive sense) when they work with strings.

However, segmenting by graphemes is, in general, quite expensive, and few algorithms actually need it. A pretty large part of string processing consists of looking for certain markers, mostly punctuation and control characters, and treating the stuff in between as opaque data. If we didn't have autodecoding, that would be a simple matter of searching for sentinel substrings. This also indicates that most of the work done by autodecoding is unnecessary -- it's wasted work, since most of the string data is treated opaquely anyway.

The unfortunate situation in Phobos currently is that we are neither doing it right (segmenting by graphemes) nor doing it efficiently, because we're constantly decoding data that the application is mostly going to treat as opaque anyway. It's the worst of both worlds. I wish we could get consensus for implementing Walter's plan to phase out autodecoding (as proposed in the linked thread above).

T

--
Freedom of speech: the whole world has no right *not* to hear my spouting off!
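[Editor's note: a minimal D sketch, not part of the original post, illustrating the three levels of segmentation being discussed -- code units, code points, and graphemes -- using Phobos ranges that exist today:]

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "noël" spelled with a combining diaeresis (U+0308)
    string s = "noe\u0308l";

    assert(s.length == 6);                // UTF-8 code units (bytes)
    assert(s.byCodeUnit.walkLength == 6); // same, bypassing auto-decoding
    assert(s.walkLength == 5);            // code points, via auto-decoding
    assert(s.byGrapheme.walkLength == 4); // user-perceived characters
}
```

Auto-decoding lands on the middle level (code points), which matches neither the cheap representation nor the intuitive notion of a character.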
March 29, 2016 Re: char array weirdness
Posted in reply to Steven Schveighoffer

On Tue, Mar 29, 2016 at 08:05:29PM -0400, Steven Schveighoffer via Digitalmars-d-learn wrote:
[...]
> Phobos treats narrow strings (wchar[], char[]) as ranges of dchar. It was discovered that auto decoding strings isn't always the smartest thing to do, especially for performance.
>
> So you get things like this: https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm/searching.d#L1622
>
> That's right. Phobos insists that auto decoding must happen for narrow strings. Except that's not the best thing to do, so it inserts lots of exceptions -- for narrow strings.
>
> Mind blown?
[...]

Mind not blown, mostly because I've seen many, many instances of similar code in Phobos. It's what I was alluding to when I said that special-casing strings has caused a ripple of exceptional cases to percolate throughout Phobos, increasing code complexity and making things very hard to maintain.

I mean, honestly, just look at the code linked above. Can anyone honestly claim that this is maintainable code? For something as trivial as a linear search of strings, that's some heavy hackery just to make strings work, as contrasted with, say, the one-line call to simpleMindedFind(). Who would have thought linear string searching would require a typecast, a @trusted hack, and templates named "force" just to make things work?

It's code like this -- and its pervasive ugliness throughout Phobos -- that slowly eroded my original pro-autodecoding stance. It's becoming clearer and clearer to me that autodecoding just isn't pulling its weight against the dramatic increase in Phobos code complexity, never mind the detrimental performance consequences.

T

--
Obviously, some things aren't very obvious.
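[Editor's note: a small illustrative sketch, not the Phobos internals linked above, of why substring search doesn't need decoding in the first place. UTF-8 is self-synchronizing -- the encoding of one code point never occurs inside the encoding of another -- so searching raw code units finds exactly the right matches even for multi-byte needles:]

```d
import std.algorithm.searching : find;
import std.utf : byCodeUnit;

void main()
{
    string s = "Γειά σου, κόσμε";
    // Byte-wise search over code units; no decoding to dchar
    // is required to match the multi-byte Greek needle.
    auto hit = s.byCodeUnit.find("κόσμε".byCodeUnit);
    assert(!hit.empty);
}
```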
March 30, 2016 Re: char array weirdness
Posted in reply to H. S. Teoh

On Wednesday, 30 March 2016 at 05:16:04 UTC, H. S. Teoh wrote:
> If we didn't have autodecoding, that would be a simple matter of searching for sentinel substrings. This also indicates that most of the work done by autodecoding is unnecessary -- it's wasted work, since most of the string data is treated opaquely anyway.
Just to drive this point home, I made a very simple benchmark. Iterating over code points when you don't need to is 100x slower than iterating over code units.
import std.datetime;
import std.stdio;
import std.array;
import std.utf;
import std.uni;
enum testCount = 1_000_000;
enum var = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent justo ante, vehicula in felis vitae, finibus tincidunt dolor. Fusce sagittis.";
void test()
{
    auto a = var.array;
}

void test2()
{
    auto a = var.byCodeUnit.array;
}

void test3()
{
    auto a = var.byGrapheme.array;
}

void main()
{
    import std.conv : to;

    auto r = benchmark!(test, test2, test3)(testCount);
    auto result = to!Duration(r[0] / testCount);
    auto result2 = to!Duration(r[1] / testCount);
    auto result3 = to!Duration(r[2] / testCount);
    writeln("auto-decoding", "\t\t", result);
    writeln("byCodeUnit", "\t\t", result2);
    writeln("byGrapheme", "\t\t", result3);
}
$ ldc2 -O3 -release -boundscheck=off test.d
$ ./test
auto-decoding 1 μs
byCodeUnit 0 hnsecs
byGrapheme 11 μs
March 31, 2016 Re: char array weirdness
Posted in reply to Jack Stouffer

On 30.03.2016 19:30, Jack Stouffer wrote:
> Just to drive this point home, I made a very simple benchmark. Iterating over code points when you don't need to is 100x slower than iterating over code units.
[...]
> enum testCount = 1_000_000;
> enum var = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent justo ante, vehicula in felis vitae, finibus tincidunt dolor. Fusce sagittis.";
>
> void test()
> {
>     auto a = var.array;
> }
>
> void test2()
> {
>     auto a = var.byCodeUnit.array;
> }
>
> void test3()
> {
>     auto a = var.byGrapheme.array;
> }
[...]
> $ ldc2 -O3 -release -boundscheck=off test.d
> $ ./test
> auto-decoding 1 μs
> byCodeUnit 0 hnsecs
> byGrapheme 11 μs

When byCodeUnit takes no time at all, isn't 1 µs infinitely slower, rather than 100 times? And I think byCodeUnit's 1 µs is so low that noise is going to mess with any ratios you compute.

byCodeUnit taking no time at all suggests that it's been optimized away completely. To avoid that, don't hardcode the test data, and produce some output that depends on the calculations actually being done. There was a little thread about this recently: http://forum.dlang.org/post/sdmdwyhfgmbppfflkljz@forum.dlang.org

I also think creating arrays from the ranges is relatively costly and noisy, and it's not of interest when you want to compare iteration speed.
March 31, 2016 Re: char array weirdness
Posted in reply to ag0aep6g

On Wednesday, 30 March 2016 at 22:49:24 UTC, ag0aep6g wrote:
> When byCodeUnit takes no time at all, isn't 1µs infinite times slower, instead of 100 times? And I think byCodeUnits's 1µs is so low that noise is going to mess with any ratios you make.

It's not that it's taking no time at all; it's just that it takes less than 1 hnsec (hecto-nanosecond), which is the smallest unit that benchmark works with. Observe what happens when the times are no longer averaged. I also made some other changes to the script:

import std.datetime;
import std.stdio;
import std.array;
import std.utf;
import std.uni;

enum testCount = 1_000_000;

void test(char[] var)
{
    auto a = var.array;
}

void test2(char[] var)
{
    auto a = var.byCodeUnit.array;
}

void test3(char[] var)
{
    auto a = var.byGrapheme.array;
}

void main()
{
    import std.conv : to;
    import std.random : uniform;
    import std.string : assumeUTF;

    // random string
    ubyte[] data;
    foreach (_; 0 .. 200)
    {
        data ~= cast(ubyte) uniform(33, 126);
    }

    auto result = to!Duration(benchmark!(() => test(data.assumeUTF))(testCount)[0]);
    auto result2 = to!Duration(benchmark!(() => test2(data.assumeUTF))(testCount)[0]);
    auto result3 = to!Duration(benchmark!(() => test3(data.assumeUTF))(testCount)[0]);
    writeln("auto-decoding", "\t\t", result);
    writeln("byCodeUnit", "\t\t", result2);
    writeln("byGrapheme", "\t\t", result3);
}

$ ldc2 -O3 -release -boundscheck=off test.d
$ ./test
auto-decoding    1 sec, 757 ms, and 946 μs
byCodeUnit       87 ms, 731 μs, and 8 hnsecs
byGrapheme       14 secs, 769 ms, 796 μs, and 6 hnsecs
March 31, 2016 Re: char array weirdness
Posted in reply to Jack Stouffer

On 31.03.2016 07:40, Jack Stouffer wrote:
> $ ldc2 -O3 -release -boundscheck=off test.d
> $ ./test
> auto-decoding 1 sec, 757 ms, and 946 μs
> byCodeUnit 87 ms, 731 μs, and 8 hnsecs
> byGrapheme 14 secs, 769 ms, 796 μs, and 6 hnsecs
So the auto-decoding version takes about twenty times as long as the non-decoding one (1758 / 88 ≅ 20).
I still think the allocations from the `.array` calls should be eliminated to see how just iterating compares.
Here's a quick edit to get rid of the `.array`s:
----
uint accumulator = 0;

void test(char[] var)
{
    foreach (dchar d; var) accumulator += d;
}

void test2(char[] var)
{
    foreach (c; var.byCodeUnit) accumulator += c;
}
----
I then get these timings:
----
auto-decoding 642 ms, 969 μs, and 1 hnsec
byCodeUnit 84 ms, 980 μs, and 3 hnsecs
----
And 643 / 85 ≅ 8.
March 31, 2016 Re: char array weirdness
Posted in reply to ag0aep6g

On Thursday, 31 March 2016 at 12:49:57 UTC, ag0aep6g wrote:
> I get these timings then:
> ----
> auto-decoding 642 ms, 969 μs, and 1 hnsec
> byCodeUnit 84 ms, 980 μs, and 3 hnsecs
> ----
> And 643 / 85 ≅ 8.
Ok, so not as bad as 100x, but still not great by any means. I think I will do some investigation into why building an array of dchar is so much slower than calling array on a char[].
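[Editor's note: a rough sketch, not the actual Phobos code path, of the per-element work auto-decoding adds. Each front/popFront under auto-decoding must validate and combine 1-4 UTF-8 code units into a dchar via std.utf.decode; byCodeUnit skips all of this and just reads bytes:]

```d
import std.utf : decode;

// Approximately what building a dchar[] from a char[] costs:
// every element requires a validating multi-byte decode step.
dchar[] decodeAll(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i); // advances i past the encoded sequence
    return result;
}
```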
Copyright © 1999-2021 by the D Language Foundation