dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

On 2021-11-05 20:03, Walter Bright wrote:
> On 11/5/2021 2:43 PM, deadalnix wrote:
>> I have not checked for GCC, but modern versions of LLVM are pretty good at optimizing in the presence of landing pads.
>
> I saw a presentation by Chandler Carruth at CPPCON 3 years back or so where he said that LLVM abandoned much of the optimizations in the presence of rewind blocks.

Three years is a long time in this industry.

> Optimizations will do better, of course, if your tight loops don't call functions that might throw.
>
> You'll also lose simply because the extra bulk of the EH code will push more of your hot code out of the cache.

Turns out the EH code is very well separated. Gcc goes so far as to generate two separate functions, one for hot and one for cold. Clang also does a good job separating the paths.

It happens in our metier that good judgment becomes prejudice. It seems that's what's happening with "exceptions are expensive" right now.

On 11/5/2021 5:40 PM, Andrei Alexandrescu wrote:
> Turns out the EH code is very well separated. Gcc goes so far as to generate two separate functions, one for hot and one for cold. Clang also does a good job separating the paths.

How does one decide in advance to call the non-throwing function?

> It happens in our metier that good judgment becomes prejudice. It seems that's what's happening with "exceptions are expensive" right now.

I remain skeptical. My playing with gcc shows it moves the unwind blocks past the end of the function, which keeps them somewhat out of the hot path. Doesn't fix the register allocation problem, though.

BTW, dmd also moves the unwind blocks past the end.

On Saturday, 6 November 2021 at 00:40:41 UTC, Andrei Alexandrescu wrote:
> Turns out the EH code is very well separated. Gcc goes so far as to generate two separate functions, one for hot and one for cold. Clang also does a good job separating the paths.
>

You bet, I wrote the code that separates the two :)

> It happens in our metier that good judgment becomes prejudice. It seems that's what's happening with "exceptions are expensive" right now.

It is on Windows, due to the whole funclet business, and it is under some specific conditions (for instance if icache pressure is the bottleneck), but in most cases the impact is fairly minimal beyond binary size.

On Saturday, 6 November 2021 at 01:55:47 UTC, deadalnix wrote:
> On Saturday, 6 November 2021 at 00:40:41 UTC, Andrei Alexandrescu wrote:
>> Turns out the EH code is very well separated. Gcc goes so far as to generate two separate functions, one for hot and one for cold. Clang also does a good job separating the paths.
>>
>
> You bet, I wrote the code that separates the two :)
>

To expand on that, I also wrote code that sends all the exception handling code into a cold section of the executable (and, if PGO is enabled, also the really cold code paths). The impact on benchmarks was fairly minimal, so this ended up not being merged.

November 06, 2021

Re: dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

Posted by Alexey
in reply to Walter Bright

On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:
> https://issues.dlang.org/show_bug.cgi?id=22473
>
> I've tried to fix this before, but too many people objected.
>
> Are we fed up with this yet? I sure am.
>
> Who wants to take up this cudgel and fix the durned thing once and for all?
>
> (It's unclear if it would even break existing code.)

I didn't read the thread. And I'm not an expert in D or Unicode, of course.

But if I needed to solve the problem of Unicode handling, I would do the following:

1. define a type for the 'grapheme', so that a grapheme can store any Unicode symbol;
2. define a string of graphemes as an array of grapheme, so the programmer can use the usual array tools on it at any time, and things like .length and slicing [x..y] work as usual. Call this, for instance, 'gstring' or 'graphstring';
3. IMHO, one grapheme should be an alias to ubyte[] or to one BigInt;
4. conversion from string/wstring/dstring/ubyte[]/BigInt[]/etc to ['gstring' or 'graphstring'] should be automatic, and this should be stated in the documentation;
5. ['gstring' or 'graphstring'] should have functions to convert to string/wstring/dstring/ubyte[]/BigInt[]/etc.

On Saturday, 6 November 2021 at 04:07:35 UTC, Alexey wrote:
> 3. IMHO, one grapheme should be an alias to ubyte[] or to one BigInt;

Or maybe even define one grapheme as dchar[]. Or maybe even define a new separate type for 'codepoint' and define one grapheme as codepoint[].
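
The "grapheme as codepoint[]" idea from this post could be sketched roughly as follows. This is only an illustration in C++ (the language of the one code snippet in this thread), not D and not an existing library: `Grapheme`, `GString`, `slice`, and `to_code_points` are hypothetical names, and the hard part, splitting text into graphemes in the first place, is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types following the proposal in the post.
using Grapheme = std::u32string;         // one user-perceived character, stored as its code points
using GString  = std::vector<Grapheme>;  // the proposed 'gstring': an array of graphemes

// Slice [from, to) counted in graphemes, similar in spirit to s[from .. to].
GString slice(const GString& s, std::size_t from, std::size_t to) {
    assert(from <= to && to <= s.size());  // debug-only precondition check
    return GString(s.begin() + static_cast<std::ptrdiff_t>(from),
                   s.begin() + static_cast<std::ptrdiff_t>(to));
}

// Flatten back to a plain code point string (item 5 of the proposal).
std::u32string to_code_points(const GString& s) {
    std::u32string out;
    for (const Grapheme& g : s) {
        out += g;
    }
    return out;
}
```

With `GString s{Grapheme(U"e\u0301"), Grapheme(U"x")}`, `s.size()` is 2 even though the text holds three code points, which is the "usual array tools work" property the proposal is after.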

On Friday, 5 November 2021 at 23:01:24 UTC, Ali Çehreli wrote:
> On 11/5/21 5:38 AM, max haughton wrote:
>
> > I have never ever seen someone use a static array by mistake
>
> Related, although safe, vector::at is almost never used because the more convenient (but unsafe) vector.operator[] exists:
>
>   v[42]    // What Ali saw in the wild
>   v.at(42) // What Ali did not see as much in the wild
>
> Ali

Although I understand what Walter is trying to say, he picked a poor example; this one does actually make sense. Although in the world of sanitizers and such it is not a hard thing to catch, bounds checking by default is a win.
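
For readers less familiar with the C++ side of this comparison, here is a minimal sketch of the difference being discussed. The helper names `checked_get` and `unchecked_get` are made up for illustration; only `std::vector::at`, `operator[]`, and `std::out_of_range` come from the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: reads v[i] through the bounds-checked at().
// Returns false instead of letting std::out_of_range propagate.
bool checked_get(const std::vector<int>& v, std::size_t i, int& out) {
    try {
        out = v.at(i);  // throws std::out_of_range when i >= v.size()
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}

// Hypothetical helper: unchecked access through operator[].
// The caller must guarantee i < v.size(); otherwise the behavior is undefined.
int unchecked_get(const std::vector<int>& v, std::size_t i) {
    assert(i < v.size());  // debug-only check; compiled out when NDEBUG is defined
    return v[i];
}
```

On a three-element vector, `checked_get(v, 42, out)` returns false, while `v[42]` compiles without complaint and is undefined behavior at run time, which is the sense in which the convenient spelling is the unsafe one.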

On Sat, Nov 06, 2021 at 04:18:51AM +0000, Alexey via Digitalmars-d wrote:
> On Saturday, 6 November 2021 at 04:07:35 UTC, Alexey wrote:
>
> > 3. IMHO, one grapheme should be and alias to ubyte[] or to one BigInt;
>
> or may be, even, define one grapheme as dchar[]. or maybe, even, define new separate type for 'codepoint' and define one grapheme as codepoint[].

Unfortunately, codepoint != grapheme. This was the fundamental error with autodecoding that made it so bad. It costs us a performance hit but doesn't even produce the right results in return.

And even more unfortunately, grapheme segmentation is an extremely convoluted (i.e. slow) operation that normally you would *not* want to do unless your code absolutely has to.

T

--
Let's eat some disquits while we format the biskettes.
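
To make "codepoint != grapheme" concrete, here is a small sketch, written in C++ rather than D and assuming valid UTF-8 input. `count_code_points` and `check_combining_example` are hypothetical helpers for illustration. The string is the letter e followed by U+0301 COMBINING ACUTE ACCENT, which a reader sees as the single character "é".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by counting every byte that is
// not a continuation byte (bit pattern 10xxxxxx). Assumes valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Usage example: one user-perceived character, three code units, two code points.
void check_combining_example() {
    const std::string s = "e\xCC\x81";  // 'e' + U+0301 (encoded as 0xCC 0x81)
    assert(s.size() == 3);              // UTF-8 code units (bytes)
    assert(count_code_points(s) == 2);  // code points, not graphemes
}
```

Decoding to code points therefore still gives the "wrong" length of 2 for what is visually one character; getting 1 requires grapheme segmentation, which this sketch does not attempt.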
