April 22, 2012 Re: avoid toLower in std.algorithm.sort compare alias | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | On Sunday, 22 April 2012 at 02:29:45 UTC, Jonathan M Davis wrote: > Regardless of whether it's the Big(O) complexity or the constant factor that's > the problem here, clearly there's enough additional overhead that it's causing > problems for Jay's particular case. It's also the sort of thing that can be > easy to miss and then end up wondering why your code is so slow (if it > actually matters in your particular situation). > > - Jonathan M Davis I haven't looked at strncmpi code, but I suspect it is a lot more efficient. For example, in comparing AbbbCdddEfffXabcdEfgh AbbbCdddEfffYabcdEfgh it is not necessary to do case conversion on anything except X and Y, and if isUpper(X)==isUpper(Y) then X and Y can be compared without conversion, and since X and Y are not equal the remaining characters don't have to be converted. The comment below worries me a little bit about std.string.icmp. What if they are two 1MB strings that differ in he first character? Does it really convert both strings to lower case before comparing the first character? http://dlang.org/phobos/std_string.html#icmp "Technically, icmp(r1, r2) is equivalent to cmp!"std.uni.toLower(a) < std.uni.toLower(b)"(r1, r2). " |
April 22, 2012 Re: avoid toLower in std.algorithm.sort compare alias | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jay Norwood | On Sunday, April 22, 2012 08:20:13 Jay Norwood wrote:
> The comment below worries me a little bit about std.string.icmp. What if they are two 1MB strings that differ in he first character? Does it really convert both strings to lower case before comparing the first character?
>
> http://dlang.org/phobos/std_string.html#icmp
>
> "Technically, icmp(r1, r2) is equivalent to
> cmp!"std.uni.toLower(a) < std.uni.toLower(b)"(r1, r2). "
You can look at the code. It checks each of the characters in place. Unlike toLower, it doesn't need to generate a new string. But as far as the comparison goes, they're the same - hence that line in the docs.
- Jonathan M Davis
|
April 22, 2012 Re: avoid toLower in std.algorithm.sort compare alias | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | On Sunday, 22 April 2012 at 06:26:42 UTC, Jonathan M Davis wrote:
>
> You can look at the code. It checks each of the characters in place. Unlike
> toLower, it doesn't need to generate a new string. But as far as the
> comparison goes, they're the same - hence that line in the docs.
>
> - Jonathan M Davis
ok, I did look at the code just now, and I'll sleep better
knowing that it doesn't do the whole string conversion. I
misunderstood your pseudo-code to mean that two lower case
strings were being created prior to the compare.
However, icmp code does appear to call the toLower conversion on
both characters without first comparing the characters for
equality, which misses the chance to do a simple compare that
would avoid the two calls.
|
April 22, 2012 Re: toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias | ||||
---|---|---|---|---|
| ||||
Posted in reply to Ali Çehreli | On 22.04.2012 5:43, Ali Çehreli wrote: > On 04/21/2012 04:24 PM, Jay Norwood wrote: > > While playing with sorting the unzip archive entries I tried use of the > > last example in http://dlang.org/phobos/std_algorithm.html#sort > > > > std.algorithm.sort!("toLower(a.name) < > > toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries); > > Stealing this thread to point out that converting a letter to upper or > lower case cannot be done without knowing the writing system. Phobos's > toLower() documentation currently says: "Returns a string which is > identical to s except that all of its characters are lowercase (in > unicode, not just ASCII)." Oh, come on. This function wasn't updated for ages. I bet this wording here is intact since unicode 4.0 ;) > > Unicode cannot define the conversions of at least the following letters > without knowing the actual alphabet that the text is written in: > > - Lowercase of I is ı in some alphabets[*] and i in many others. > > - Uppercase of i is İ in some alphabets[*] and I in many others. > Fair point. The list however is not that long and a system may choose to support this or not (changing behavior based on writing system is called tailoring I believe). > Ali > > [*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc. > -- Dmitry Olshansky |
April 22, 2012 Re: avoid toLower in std.algorithm.sort compare alias | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jay Norwood | On 2012-04-22 01:24, Jay Norwood wrote: > While playing with sorting the unzip archive entries I tried use of the > last example in http://dlang.org/phobos/std_algorithm.html#sort > > std.algorithm.sort!("toLower(a.name) < > toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries); > > It was terribly slow for sorting the 34k entries in my test case. I'd > guess it is doing the toLower call on both strings for every compare. > > It was much, much faster to add an entries.lowerCaseName = > std.string.toLower(entries.name) foreach entry prior to the sort > execution and then use > > std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName > ",std.algorithm.SwapStrategy.stable)(entries); > > The difference was on the order of 10 secs vs no noticeable delay when > executing the sort operation in the debugger. Perhaps a function that does case folding would be better in this case. But as far as I know Phobos doesn't have a function for that. -- /Jacob Carlborg |
April 22, 2012 Re: avoid toLower in std.algorithm.sort compare alias | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jay Norwood | On Saturday, 21 April 2012 at 23:24:57 UTC, Jay Norwood wrote:
> While playing with sorting the unzip archive entries I tried use of the last example in http://dlang.org/phobos/std_algorithm.html#sort
>
> std.algorithm.sort!("toLower(a.name) < toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
>
> It was terribly slow for sorting the 34k entries in my test case. I'd guess it is doing the toLower call on both strings for every compare.
>
> It was much, much faster to add an entries.lowerCaseName = std.string.toLower(entries.name) foreach entry prior to the sort execution and then use
>
> std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName ",std.algorithm.SwapStrategy.stable)(entries);
>
> The difference was on the order of 10 secs vs no noticeable delay when executing the sort operation in the debugger.
Good point. Perhaps this should be added in the documentation of std.algorithm ? It's easy to get trapped.
|
April 23, 2012 Re: avoid toLower in std.algorithm.sort compare alias | ||||
---|---|---|---|---|
| ||||
Posted in reply to SomeDude | On Sunday, 22 April 2012 at 00:36:19 UTC, bearophile wrote: > Performing the toLower every time the cmp function is called doesn't change the O complexity. In Phobos there is an alternative sorting (Schwartzian sorting routime) that applies a function to each item before sorting them, usually is much slower than the normal sorting, but maybe this time it's convenient. > Bye, > bearophile The shwartzSort works fine. Thanks, Jay std.algorithm.schwartzSort!(std.string.toLower, "a < b")(dirs); G:\d\rmdir2\rmdir2\rmdir2\Debug>rmdir2 removing: g:\tz finished! time:699 ms |
April 23, 2012 Re: avoid toLower in std.algorithm.sort compare alias | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jay Norwood | On Sat, 21 Apr 2012 19:24:56 -0400, Jay Norwood <jayn@prismnet.com> wrote:
> While playing with sorting the unzip archive entries I tried use of the last example in http://dlang.org/phobos/std_algorithm.html#sort
>
> std.algorithm.sort!("toLower(a.name) < toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
>
> It was terribly slow for sorting the 34k entries in my test case. I'd guess it is doing the toLower call on both strings for every compare.
>
> It was much, much faster to add an entries.lowerCaseName = std.string.toLower(entries.name) foreach entry prior to the sort execution and then use
>
> std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName ",std.algorithm.SwapStrategy.stable)(entries);
>
> The difference was on the order of 10 secs vs no noticeable delay when executing the sort operation in the debugger.
I'll point out what I haven't seen yet:
the issue is not so much toLower being called on every comparison, but more that toLower allocates (and then throws away!).
I think using std.string.icmp is the best solution. I would expect it to outperform even schwartz sort.
Note, to answer your question elsewhere, the comment is accurate, std.uni.toLower(a) is a function that accepts a dchar, not a string. What the comment is saying is that for the "ranges" (i.e. strings) given, it runs the given comparison on the std.uni.toLower() result for each element (i.e. dchar).
-Steve
|
April 23, 2012 Re: avoid toLower in std.algorithm.sort compare alias | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | On Monday, 23 April 2012 at 11:27:40 UTC, Steven Schveighoffer wrote: > I think using std.string.icmp is the best solution. I would expect it to outperform even schwartz sort. > > -Steve icmp took longer... added about 1 sec vs 0.3 sec (for schwartzSort ) to the program execution time. bool myComp(string x, string y) { return std.string.icmp(x,y)<0; } std.algorithm.sort!(myComp)(dirs); finished! time:1396 ms |
April 23, 2012 Re: avoid toLower in std.algorithm.sort compare alias | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jay Norwood | On Mon, 23 Apr 2012 09:49:50 -0400, Jay Norwood <jayn@prismnet.com> wrote:
> On Monday, 23 April 2012 at 11:27:40 UTC, Steven Schveighoffer
> wrote:
>
>> I think using std.string.icmp is the best solution. I would expect it to outperform even schwartz sort.
>>
>> -Steve
>
> icmp took longer... added about 1 sec vs 0.3 sec (for
> schwartzSort ) to the program execution time.
>
> bool myComp(string x, string y) { return std.string.icmp(x,y)<0; }
> std.algorithm.sort!(myComp)(dirs);
>
> finished! time:1396 ms
Well, that's surprising :) Perhaps there's some room for improvement in icmp or std.uni.toLower. There may be some constructs that are preventing inlining (enforce is the worst offender).
While dealing with unicode in my std.stream rewrite, I've found that hand-decoding dchars is way faster than using library calls.
-Steve
|
Copyright © 1999-2021 by the D Language Foundation