avoid toLower in std.algorithm.sort compare alias - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » Learn » avoid toLower in std.algorithm.sort compare alias

Thread overview

avoid toLower in std.algorithm.sort compare alias

Apr 21, 2012

Apr 21, 2012

Jonathan M Davis

Apr 22, 2012

Apr 22, 2012

Jonathan M Davis

Apr 22, 2012

Apr 22, 2012

Apr 22, 2012

Jonathan M Davis

toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias
Apr 22, 2012 Ali Çehreli
Apr 22, 2012 Jonathan M Davis
Apr 22, 2012 Dmitry Olshansky

Apr 22, 2012

Jonathan M Davis

Apr 22, 2012

Apr 22, 2012

Jonathan M Davis

Apr 22, 2012

Apr 26, 2012

Apr 22, 2012

Apr 22, 2012

Apr 23, 2012

Apr 23, 2012

Steven Schveighoffer

Apr 23, 2012

Apr 23, 2012

Steven Schveighoffer

Apr 24, 2012

Apr 24, 2012

Steven Schveighoffer

Apr 24, 2012

Steven Schveighoffer

Apr 24, 2012

Jonathan M Davis

April 21, 2012

avoid toLower in std.algorithm.sort compare alias

Posted by Jay Norwood

Jay Norwood

While playing with sorting the unzip archive entries I tried use of the last example in http://dlang.org/phobos/std_algorithm.html#sort

std.algorithm.sort!("toLower(a.name) < toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);

It was terribly slow  for sorting the 34k entries in my test case.  I'd guess it is doing the toLower call on both strings for every compare.

It was much, much faster to add an entries.lowerCaseName = std.string.toLower(entries.name) foreach entry prior to the sort execution and then use

std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName ",std.algorithm.SwapStrategy.stable)(entries);

The difference was on the order of 10 secs vs no noticeable delay when executing the sort operation in the debugger.

April 21, 2012

Re: avoid toLower in std.algorithm.sort compare alias

Posted by Jonathan M Davis
in reply to Jay Norwood

Jonathan M Davis

Posted in reply to Jay Norwood

On Sunday, April 22, 2012 01:24:56 Jay Norwood wrote:
> While playing with sorting the unzip archive entries I tried use of the last example in http://dlang.org/phobos/std_algorithm.html#sort
> 
> std.algorithm.sort!("toLower(a.name) <
> toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
> 
> It was terribly slow  for sorting the 34k entries in my test case.  I'd guess it is doing the toLower call on both strings for every compare.
> 
> It was much, much faster to add an entries.lowerCaseName = std.string.toLower(entries.name) foreach entry prior to the sort execution and then use
> 
> std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName
> ",std.algorithm.SwapStrategy.stable)(entries);
> 
> The difference was on the order of 10 secs vs no noticeable delay when executing the sort operation in the debugger.

Yeah. toLower would be called on both strings on _every_ compare. And since that involves a loop, that would make the overall call to sort an order of magnitude worse than if you didn't call toLower at all. I'm not sure if it's an order of magnitude worse than your solution though, since it's been a while since I studied Big(O) notation (doing the conversion only once is still more expensive than not converting, but I'm not sure how much more expensive - it might cost less than sort such that it actually doesn't matter as for as Big(O) goes though).

- Jonathan M Davis

April 22, 2012

Re: avoid toLower in std.algorithm.sort compare alias

Posted by bearophile
in reply to Jonathan M Davis

bearophile

Posted in reply to Jonathan M Davis

Jonathan M Davis:

> I'm not sure if it's an order of magnitude worse than your solution though, since it's been a while since I studied Big(O) notation (doing the conversion only once is still more expensive than not converting, but I'm not sure how much more expensive - it might cost less than sort such that it actually doesn't matter as for as Big(O) goes though).

Performing the toLower every time the cmp function is called doesn't change the O complexity. In Phobos there is an alternative sorting (Schwartzian sorting routime) that applies a function to each item before sorting them, usually is much slower than the normal sorting, but maybe this time it's convenient.
The performance improvement in the OP message is large, maybe it's a problem of memory allocations of the converted strings... More info on the original code is needed to give better answers.

Bye,
bearophile

April 22, 2012

Re: avoid toLower in std.algorithm.sort compare alias

Posted by Jonathan M Davis
in reply to bearophile

Jonathan M Davis

Posted in reply to bearophile

On Saturday, April 21, 2012 20:36:18 bearophile wrote:
> Jonathan M Davis:
> > I'm not sure if it's
> > an order of magnitude worse than your solution though, since it's been a
> > while since I studied Big(O) notation (doing the conversion only once is
> > still more expensive than not converting, but I'm not sure how much more
> > expensive - it might cost less than sort such that it actually doesn't
> > matter as for as Big(O) goes though).
> 
> Performing the toLower every time the cmp function is called doesn't change the O complexity.

Yes it does. It adds a loop to each comparison (two loops actually, but since they're not nested, Big(O) only cares about the one), since toLower has to loop over all of the characters. As sort loops over each of the strings to compare them for moving them into a sorted position or not, it calls toLower, which adds a nested loop, so it increases the Big(O) complexity. Something like this

foreach(str; strings)
    str < otherStr;

becomes

foreach(str; strings)
{
    foreach(dchar c; str)
        //lower characters

    foreach(dchar c; otherStr)
        //lower characters

    str < otherStr;
}

though that's obviously very pseudo-code-ish and not exactly what sort does. Regardless, those extra loops when the comparison happens are nested and therefore increase the Big(O) complexity of the overall algorithm.

- Jonathan M Davis

April 22, 2012

Re: avoid toLower in std.algorithm.sort compare alias

Posted by H. S. Teoh

H. S. Teoh

On Sat, Apr 21, 2012 at 05:45:35PM -0700, Jonathan M Davis wrote:
> On Saturday, April 21, 2012 20:36:18 bearophile wrote:
> > Jonathan M Davis:
> > > I'm not sure if it's an order of magnitude worse than your solution though, since it's been a while since I studied Big(O) notation (doing the conversion only once is still more expensive than not converting, but I'm not sure how much more expensive - it might cost less than sort such that it actually doesn't matter as for as Big(O) goes though).
> > 
> > Performing the toLower every time the cmp function is called doesn't change the O complexity.
> 
> Yes it does. It adds a loop to each comparison (two loops actually, but since they're not nested, Big(O) only cares about the one), since toLower has to loop over all of the characters. As sort loops over each of the strings to compare them for moving them into a sorted position or not, it calls toLower, which adds a nested loop, so it increases the Big(O) complexity. Something like this
> 
> foreach(str; strings)
>     str < otherStr;
> 
> becomes
> 
> foreach(str; strings)
> {
>     foreach(dchar c; str)
>         //lower characters
> 
>     foreach(dchar c; otherStr)
>         //lower characters
> 
>     str < otherStr;
> }
> 
> though that's obviously very pseudo-code-ish and not exactly what sort does.  Regardless, those extra loops when the comparison happens are nested and therefore increase the Big(O) complexity of the overall algorithm.
[...]

Actually, I don't think the nested loops affect Big-O complexity at all.
The O(l) complexity (where l = string length) is already inherent in the
string comparison "str < otherStr". Adding more loops over the strings
doesn't change the Big-O complexity (you just make O(l) into 3*O(l)
which is the same as O(l)).

However, always remember that Big-O notation hides constant factors and terms. These factors are negligible given a sufficiently large problem size, but for small problem sizes, the hidden constants are very much significant. An O(n log n) algorithm may actually take n*log(10*n) steps or 1000*n*log(n) steps; for large enough n, they're approximately the same, but when n is small, the 1000 has a very significant hit on observed performance.

That's why optimizing the constant factors is sometimes necessary (provided you've already minimized the big-O complexity to the full, since otherwise you're just pinching pennies yet freely spending $100 bills). Inner loop optimization, like strength reduction, loop invariant hoisting, etc., are examples of where constant factors get optimized. If you have a loop:

	real x;
	for (i=0; i < n; i++) {
		// bigComplexCalculation is independent of i
		real n = bigComplexCalculation(x);

		// make use of n in some way
	}

Then moving the bigComplexCalculation() call outside the loop improves the observed performance significantly, even though you're not changing the big-O complexity: a small change from 10*n to 8*n means a 20% improvement in observed performance, even though the algorithm still degrades linearly with problem size just like before.

T

-- 
Microsoft is to operating systems & security ... what McDonalds is to gourmet cooking.

April 22, 2012

toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

Posted by Ali Çehreli
in reply to Jay Norwood

Ali Çehreli

Posted in reply to Jay Norwood

On 04/21/2012 04:24 PM, Jay Norwood wrote:
> While playing with sorting the unzip archive entries I tried use of the
> last example in http://dlang.org/phobos/std_algorithm.html#sort
>
> std.algorithm.sort!("toLower(a.name) <
> toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);

Stealing this thread to point out that converting a letter to upper or lower case cannot be done without knowing the writing system. Phobos's toLower() documentation currently says: "Returns a string which is identical to s except that all of its characters are lowercase (in unicode, not just ASCII)."

Unicode cannot define the conversions of at least the following letters without knowing the actual alphabet that the text is written in:

- Lowercase of I is ı in some alphabets[*] and i in many others.

- Uppercase of i is İ in some alphabets[*] and I in many others.

Ali

[*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.

April 22, 2012

Re: avoid toLower in std.algorithm.sort compare alias

Posted by Jay Norwood
in reply to Jonathan M Davis

Jay Norwood

Posted in reply to Jonathan M Davis

On Saturday, 21 April 2012 at 23:54:26 UTC, Jonathan M Davis wrote:
> Yeah. toLower would be called on both strings on _every_ compare. And since
> that involves a loop, that would make the overall call to sort an order of
> magnitude worse than if you didn't call toLower at all. I'm not sure if it's
> an order of magnitude worse than your solution though, since it's been a while
> since I studied Big(O) notation (doing the conversion only once is still more
> expensive than not converting, but I'm not sure how much more expensive - it
> might cost less than sort such that it actually doesn't matter as for as
> Big(O) goes though).
>
> - Jonathan M Davis

use of toLower in the sort expression adds around 11.2 secs overhead to a 0.3 sec operation which reads and sorts the 34k directory entries in this 2GB layout on the ssd drive, so it isn't an option for me.

finished! time:312 ms

finished! time:11598 ms

std.algorithm.sort!("toLower(a) < toLower(b)",std.algorithm.SwapStrategy.stable)(dirs);
//std.algorithm.sort!("a < b", std.algorithm.SwapStrategy.stable)(dirs);

April 22, 2012

Re: avoid toLower in std.algorithm.sort compare alias

Posted by Jonathan M Davis
in reply to Jay Norwood

Jonathan M Davis

Posted in reply to Jay Norwood

On Sunday, April 22, 2012 03:47:30 Jay Norwood wrote:
> On Saturday, 21 April 2012 at 23:54:26 UTC, Jonathan M Davis
> 
> wrote:
> > Yeah. toLower would be called on both strings on _every_
> > compare. And since
> > that involves a loop, that would make the overall call to sort
> > an order of
> > magnitude worse than if you didn't call toLower at all. I'm not
> > sure if it's
> > an order of magnitude worse than your solution though, since
> > it's been a while
> > since I studied Big(O) notation (doing the conversion only once
> > is still more
> > expensive than not converting, but I'm not sure how much more
> > expensive - it
> > might cost less than sort such that it actually doesn't matter
> > as for as
> > Big(O) goes though).
> > 
> > - Jonathan M Davis
> 
> use of toLower in the sort expression adds around 11.2 secs overhead to a 0.3 sec operation which reads and sorts the 34k directory entries in this 2GB layout on the ssd drive, so it isn't an option for me.
> 
> finished! time:312 ms
> 
> finished! time:11598 ms
> 
> std.algorithm.sort!("toLower(a) <
> toLower(b)",std.algorithm.SwapStrategy.stable)(dirs);
> //std.algorithm.sort!("a < b",
> std.algorithm.SwapStrategy.stable)(dirs);

It wasn't saying that it _was_ an option. I was just trying to point out the likely reason why it's so bad with toLower - algorithmically-speaking. This definitely appears to be a case where doing some extra computation ahead of time will save you a lot later.

- Jonathan M Davis

April 22, 2012

Re: toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

Posted by Jonathan M Davis
in reply to Ali Çehreli

Jonathan M Davis

Posted in reply to Ali Çehreli

On Saturday, April 21, 2012 18:43:23 Ali Çehreli wrote:
> On 04/21/2012 04:24 PM, Jay Norwood wrote:
>  > While playing with sorting the unzip archive entries I tried use of the
>  > last example in http://dlang.org/phobos/std_algorithm.html#sort
>  >
>  > std.algorithm.sort!("toLower(a.name) <
>  > toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
> 
> Stealing this thread to point out that converting a letter to upper or lower case cannot be done without knowing the writing system. Phobos's toLower() documentation currently says: "Returns a string which is identical to s except that all of its characters are lowercase (in unicode, not just ASCII)."
> 
> Unicode cannot define the conversions of at least the following letters without knowing the actual alphabet that the text is written in:
> 
> - Lowercase of I is ı in some alphabets[*] and i in many others.
> 
> - Uppercase of i is İ in some alphabets[*] and I in many others.
> 
> Ali
> 
> [*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.

toLower and toUpper get pretty screwing with unicode. I don't know enough about non-English alphabets to know what affects what, but at minimum, there are a number of cases where toLower does not reverse toUpper (and vice versa). Rather, it converts the character into yet another letter. So, toLower to toUpper with unicode and definitely a bit iffy. I suppose that they do the job if you call them enough on the string that it doesn't change anymore, but I don't know.

I also don't know how they act with regards to the various alphabets and how their implementation was decided upon. IIRC, Walter wrote them, and I'm sure that they're based on the unicode standard, but what that amounts to, I don't know.

- Jonathan M Davis

April 22, 2012

Re: avoid toLower in std.algorithm.sort compare alias

Posted by Jonathan M Davis

Jonathan M Davis

On Saturday, April 21, 2012 18:26:42 H. S. Teoh wrote:
> Actually, I don't think the nested loops affect Big-O complexity at all.
> The O(l) complexity (where l = string length) is already inherent in the
> string comparison "str < otherStr". Adding more loops over the strings
> doesn't change the Big-O complexity (you just make O(l) into 3*O(l)
> which is the same as O(l)).

If the loop is really nested, then it will increase the Big(O) complexity, whereas if it's parallel to a similar loop, it will just increase the constant factor. Which it really is in this case, I don't know without sitting down and sketching out the exact algorithm, but certainly upon a first glance, it was my conclusion that the loops were nested. If they're not, then they're not.

Regardless of whether it's the Big(O) complexity or the constant factor that's the problem here, clearly there's enough additional overhead that it's causing problems for Jay's particular case. It's also the sort of thing that can be easy to miss and then end up wondering why your code is so slow (if it actually matters in your particular situation).

- Jonathan M Davis

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation