Switch ignores case (?) (page 2)

On 11/23/16 2:30 PM, ketmar wrote: > On Wednesday, 23 November 2016 at 19:07:49 UTC, Chris wrote: >> It has something to do with the smart quote, e.g.: > > it is wrong binary search in `_d_switch_string()`. > > strings for switch are lexically sorted, and compiler calls > `_d_switch_string()` to select one. the thing is that comparison in > `_d_switch_string()` is done with `memcmp()`. still not clear? ok, let's > see how cases are encoded: > > body _d_switch_dstring() > 'U0027' (ca) > table[0] = 1, 'U0027' > table[1] = 1, 'U2019' > > or, in memory: > > table[0] = 1, 0x27, 0x00 > table[1] = 1, 0x19, 0x20 > > so, memcmp for `table[1]` returns... 1! 'cause 0x27 is greater than > 0x19. and binsearch is broken from here on. the same is true for > `_d_switch_ustring()`, of course. > > this can be fixed either by using slow char-by-char comparisons in > druntime, or by fixing codegen, so it would sort strings as byte arrays. Oh wow, so this is really an endian issue. On a big endian machine, the code would work. Interesting! I think it makes the most sense to remove the memcmp, and do binary search based on actual char values. Thanks for finding this. -Steve

On 11/23/16 4:31 PM, Steven Schveighoffer wrote: > On 11/23/16 2:30 PM, ketmar wrote: >> this can be fixed either by using slow char-by-char comparisons in >> druntime, or by fixing codegen, so it would sort strings as byte arrays. > I think it makes the most sense to remove the memcmp, and do binary > search based on actual char values. On second thought, this is compiler-generated data, will never change. We may as well make it as efficient as possible. So the better way is to sort based on byte array, and just use memcmp for everything. -Steve

On Wednesday, 23 November 2016 at 21:31:08 UTC, Steven Schveighoffer wrote: > I think it makes the most sense to remove the memcmp, and do binary search based on actual char values. yeah. there is wmemcmp, which can be used to speed up one of the cases ('cause wchar_t has different size on windows and GNU/Linux), as i did in my hackfix. yet i am not sure that wmemcmp is really there on windows in all C runtimes (hence HACKfix ;-).

On Wednesday, 23 November 2016 at 21:40:52 UTC, Steven Schveighoffer wrote: > So the better way is to sort based on byte array, and just use memcmp for everything. i am not completely sure that this is really better. sorting and generating tables is done in glue layer, which can be different for different codegens. and `.compare` method, that is used for sorting strings may be used in other places too. by introducing something like `switchCompare()` we will add unnecessary complexity to the compiler, will make sort order less obvious, and won't really get significant speedup, as binary search doesn't really do alot of comparisons anyway. i thing it is better to be fixed in druntime.

November 23, 2016

Re: Switch ignores case (?)

Posted by Steven Schveighoffer
in reply to ketmar

Permalink

Steven Schveighoffer

Posted in reply to ketmar

Permalink

On 11/23/16 4:44 PM, ketmar wrote:
> On Wednesday, 23 November 2016 at 21:40:52 UTC, Steven Schveighoffer wrote:
>> So the better way is to sort based on byte array, and just use memcmp
>> for everything.
>
> i am not completely sure that this is really better. sorting and
> generating tables is done in glue layer, which can be different for
> different codegens. and `.compare` method, that is used for sorting
> strings may be used in other places too. by introducing something like
> `switchCompare()` we will add unnecessary complexity to the compiler,
> will make sort order less obvious, and won't really get significant
> speedup, as binary search doesn't really do alot of comparisons anyway.

I'm not certain of how difficult this will be, or how much risk there is. What I am certain of is that the user doesn't care that the compiler really isn't sorting the strings alphabetically, and that he wants the fastest code possible for a switch statement.

I can't see why you need to deal with the glue layer at all -- just tell the glue layer that it's a list of strings and not dstrings ;)

I also don't actually know that memcmp is faster than wmemcmp, so maybe there is even an advantage to changing behavior for dstring searching.

> i thing it is better to be fixed in druntime.

This can be a solution, for sure. And in reality, it's not *that* much slower -- you are still doing a binary search.

I wonder if there is a "binary search among arrays" algorithm that can be optimized for this.

-Steve

On Wednesday, 23 November 2016 at 22:00:58 UTC, Steven Schveighoffer wrote: > I can't see why you need to deal with the glue layer at all -- just tell the glue layer that it's a list of strings and not dstrings ;) 'cause that is how s2ir.d is done in dmd. ;-) it actually sorts *objects*, and objects knows how to order 'emselves. at this stage it is not trivial to treat objects as byte arrays (easy, but not a one-liner). sure, other compilers may do that differently, and it doesn't really matter how exactly the code for switch is generated until it works correctly. ;-)

On Wednesday, 23 November 2016 at 22:13:38 UTC, ketmar wrote: > On Wednesday, 23 November 2016 at 22:00:58 UTC, Steven Schveighoffer wrote: >> I can't see why you need to deal with the glue layer at all -- just tell the glue layer that it's a list of strings and not dstrings ;) > > 'cause that is how s2ir.d is done in dmd. ;-) it actually sorts *objects*, and objects knows how to order 'emselves. at this stage it is not trivial to treat objects as byte arrays (easy, but not a one-liner). > > sure, other compilers may do that differently, and it doesn't really matter how exactly the code for switch is generated until it works correctly. ;-) Great job, ketmar! I'm only surprised that this bug wasn't discovered earlier, I mean it goes back to (at least) dmd 2.040.

On Thursday, 24 November 2016 at 09:56:25 UTC, Chris wrote: > Great job, ketmar! I'm only surprised that this bug wasn't discovered earlier, I mean it goes back to (at least) dmd 2.040. thanks. tbh, i am surprised myself -- it is completely fubared, but nobody noticed. maybe that says something about real-world useability of dstring/wstring... ;-)

On Thursday, 24 November 2016 at 10:12:40 UTC, ketmar wrote: > > thanks. tbh, i am surprised myself -- it is completely fubared, but nobody noticed. maybe that says something about real-world useability of dstring/wstring... ;-) Well, I came across it, because I wanted to work around autodecode, our old "friend" (who need enemies, if you have a friend like that?). Except, it doesn't work as expected. For my needs, I can work around it with `(c == "\u2019")`, because switch works for the rest of the (common) cases like full stop, question mark etc. The whole string handling issue in D needs to be fixed asap, because text processing is one of _the_ big things in IT these days. Think of all the data harvesting and evaluation and whatnot.

On Wednesday, 23 November 2016 at 22:13:38 UTC, ketmar wrote: > On Wednesday, 23 November 2016 at 22:00:58 UTC, Steven Schveighoffer wrote: >> I can't see why you need to deal with the glue layer at all -- just tell the glue layer that it's a list of strings and not dstrings ;) > > 'cause that is how s2ir.d is done in dmd. ;-) it actually sorts *objects*, and objects knows how to order 'emselves. at this stage it is not trivial to treat objects as byte arrays (easy, but not a one-liner). > > sure, other compilers may do that differently, and it doesn't really matter how exactly the code for switch is generated until it works correctly. ;-) Could it be possible to ping M. Nowak to include the fix for this bug (due to its importance) in the final 2.072.1 release? Antonio

Forums