Jump to page: 1 2 3
Thread overview
avoid toLower in std.algorithm.sort compare alias
Apr 21, 2012
Jay Norwood
Apr 21, 2012
Jonathan M Davis
Apr 22, 2012
bearophile
Apr 22, 2012
Jonathan M Davis
Apr 22, 2012
H. S. Teoh
Apr 22, 2012
Jay Norwood
Apr 22, 2012
Jonathan M Davis
toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias
Apr 22, 2012
Ali Çehreli
Apr 22, 2012
Jonathan M Davis
Apr 22, 2012
Dmitry Olshansky
Apr 22, 2012
Jonathan M Davis
Apr 22, 2012
Jay Norwood
Apr 22, 2012
Jonathan M Davis
Apr 22, 2012
Jay Norwood
Apr 26, 2012
Marco Leise
Apr 22, 2012
Jacob Carlborg
Apr 22, 2012
SomeDude
Apr 23, 2012
Jay Norwood
Apr 23, 2012
Jay Norwood
Apr 24, 2012
Regan Heath
Apr 24, 2012
Jonathan M Davis
April 21, 2012
While playing with sorting the unzip archive entries I tried use of the last example in http://dlang.org/phobos/std_algorithm.html#sort

std.algorithm.sort!("toLower(a.name) < toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);

It was terribly slow  for sorting the 34k entries in my test case.  I'd guess it is doing the toLower call on both strings for every compare.

It was much, much faster to add an entries.lowerCaseName = std.string.toLower(entries.name) foreach entry prior to the sort execution and then use

std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName ",std.algorithm.SwapStrategy.stable)(entries);

The difference was on the order of 10 secs vs no noticeable delay when executing the sort operation in the debugger.



April 21, 2012
On Sunday, April 22, 2012 01:24:56 Jay Norwood wrote:
> While playing with sorting the unzip archive entries I tried use of the last example in http://dlang.org/phobos/std_algorithm.html#sort
> 
> std.algorithm.sort!("toLower(a.name) <
> toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
> 
> It was terribly slow  for sorting the 34k entries in my test case.  I'd guess it is doing the toLower call on both strings for every compare.
> 
> It was much, much faster to add an entries.lowerCaseName = std.string.toLower(entries.name) foreach entry prior to the sort execution and then use
> 
> std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName
> ",std.algorithm.SwapStrategy.stable)(entries);
> 
> The difference was on the order of 10 secs vs no noticeable delay when executing the sort operation in the debugger.

Yeah. toLower would be called on both strings on _every_ compare. And since that involves a loop, that would make the overall call to sort an order of magnitude worse than if you didn't call toLower at all. I'm not sure if it's an order of magnitude worse than your solution though, since it's been a while since I studied Big(O) notation (doing the conversion only once is still more expensive than not converting, but I'm not sure how much more expensive - it might cost less than sort such that it actually doesn't matter as for as Big(O) goes though).

- Jonathan M Davis
April 22, 2012
Jonathan M Davis:

> I'm not sure if it's an order of magnitude worse than your solution though, since it's been a while since I studied Big(O) notation (doing the conversion only once is still more expensive than not converting, but I'm not sure how much more expensive - it might cost less than sort such that it actually doesn't matter as for as Big(O) goes though).

Performing the toLower every time the cmp function is called doesn't change the O complexity. In Phobos there is an alternative sorting (Schwartzian sorting routime) that applies a function to each item before sorting them, usually is much slower than the normal sorting, but maybe this time it's convenient.
The performance improvement in the OP message is large, maybe it's a problem of memory allocations of the converted strings... More info on the original code is needed to give better answers.

Bye,
bearophile
April 22, 2012
On Saturday, April 21, 2012 20:36:18 bearophile wrote:
> Jonathan M Davis:
> > I'm not sure if it's
> > an order of magnitude worse than your solution though, since it's been a
> > while since I studied Big(O) notation (doing the conversion only once is
> > still more expensive than not converting, but I'm not sure how much more
> > expensive - it might cost less than sort such that it actually doesn't
> > matter as for as Big(O) goes though).
> 
> Performing the toLower every time the cmp function is called doesn't change the O complexity.

Yes it does. It adds a loop to each comparison (two loops actually, but since they're not nested, Big(O) only cares about the one), since toLower has to loop over all of the characters. As sort loops over each of the strings to compare them for moving them into a sorted position or not, it calls toLower, which adds a nested loop, so it increases the Big(O) complexity. Something like this

foreach(str; strings)
    str < otherStr;

becomes

foreach(str; strings)
{
    foreach(dchar c; str)
        //lower characters

    foreach(dchar c; otherStr)
        //lower characters

    str < otherStr;
}

though that's obviously very pseudo-code-ish and not exactly what sort does. Regardless, those extra loops when the comparison happens are nested and therefore increase the Big(O) complexity of the overall algorithm.

- Jonathan M Davis
April 22, 2012
On Sat, Apr 21, 2012 at 05:45:35PM -0700, Jonathan M Davis wrote:
> On Saturday, April 21, 2012 20:36:18 bearophile wrote:
> > Jonathan M Davis:
> > > I'm not sure if it's an order of magnitude worse than your solution though, since it's been a while since I studied Big(O) notation (doing the conversion only once is still more expensive than not converting, but I'm not sure how much more expensive - it might cost less than sort such that it actually doesn't matter as for as Big(O) goes though).
> > 
> > Performing the toLower every time the cmp function is called doesn't change the O complexity.
> 
> Yes it does. It adds a loop to each comparison (two loops actually, but since they're not nested, Big(O) only cares about the one), since toLower has to loop over all of the characters. As sort loops over each of the strings to compare them for moving them into a sorted position or not, it calls toLower, which adds a nested loop, so it increases the Big(O) complexity. Something like this
> 
> foreach(str; strings)
>     str < otherStr;
> 
> becomes
> 
> foreach(str; strings)
> {
>     foreach(dchar c; str)
>         //lower characters
> 
>     foreach(dchar c; otherStr)
>         //lower characters
> 
>     str < otherStr;
> }
> 
> though that's obviously very pseudo-code-ish and not exactly what sort does.  Regardless, those extra loops when the comparison happens are nested and therefore increase the Big(O) complexity of the overall algorithm.
[...]

Actually, I don't think the nested loops affect Big-O complexity at all.
The O(l) complexity (where l = string length) is already inherent in the
string comparison "str < otherStr". Adding more loops over the strings
doesn't change the Big-O complexity (you just make O(l) into 3*O(l)
which is the same as O(l)).

However, always remember that Big-O notation hides constant factors and terms. These factors are negligible given a sufficiently large problem size, but for small problem sizes, the hidden constants are very much significant. An O(n log n) algorithm may actually take n*log(10*n) steps or 1000*n*log(n) steps; for large enough n, they're approximately the same, but when n is small, the 1000 has a very significant hit on observed performance.

That's why optimizing the constant factors is sometimes necessary (provided you've already minimized the big-O complexity to the full, since otherwise you're just pinching pennies yet freely spending $100 bills). Inner loop optimization, like strength reduction, loop invariant hoisting, etc., are examples of where constant factors get optimized. If you have a loop:

	real x;
	for (i=0; i < n; i++) {
		// bigComplexCalculation is independent of i
		real n = bigComplexCalculation(x);

		// make use of n in some way
	}

Then moving the bigComplexCalculation() call outside the loop improves the observed performance significantly, even though you're not changing the big-O complexity: a small change from 10*n to 8*n means a 20% improvement in observed performance, even though the algorithm still degrades linearly with problem size just like before.


T

-- 
Microsoft is to operating systems & security ... what McDonalds is to gourmet cooking.
April 22, 2012
On 04/21/2012 04:24 PM, Jay Norwood wrote:
> While playing with sorting the unzip archive entries I tried use of the
> last example in http://dlang.org/phobos/std_algorithm.html#sort
>
> std.algorithm.sort!("toLower(a.name) <
> toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);

Stealing this thread to point out that converting a letter to upper or lower case cannot be done without knowing the writing system. Phobos's toLower() documentation currently says: "Returns a string which is identical to s except that all of its characters are lowercase (in unicode, not just ASCII)."

Unicode cannot define the conversions of at least the following letters without knowing the actual alphabet that the text is written in:

- Lowercase of I is ı in some alphabets[*] and i in many others.

- Uppercase of i is İ in some alphabets[*] and I in many others.

Ali

[*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.

April 22, 2012
On Saturday, 21 April 2012 at 23:54:26 UTC, Jonathan M Davis wrote:
> Yeah. toLower would be called on both strings on _every_ compare. And since
> that involves a loop, that would make the overall call to sort an order of
> magnitude worse than if you didn't call toLower at all. I'm not sure if it's
> an order of magnitude worse than your solution though, since it's been a while
> since I studied Big(O) notation (doing the conversion only once is still more
> expensive than not converting, but I'm not sure how much more expensive - it
> might cost less than sort such that it actually doesn't matter as for as
> Big(O) goes though).
>
> - Jonathan M Davis

use of toLower in the sort expression adds around 11.2 secs overhead to a 0.3 sec operation which reads and sorts the 34k directory entries in this 2GB layout on the ssd drive, so it isn't an option for me.

finished! time:312 ms

finished! time:11598 ms

std.algorithm.sort!("toLower(a) < toLower(b)",std.algorithm.SwapStrategy.stable)(dirs);
//std.algorithm.sort!("a < b", std.algorithm.SwapStrategy.stable)(dirs);


April 22, 2012
On Sunday, April 22, 2012 03:47:30 Jay Norwood wrote:
> On Saturday, 21 April 2012 at 23:54:26 UTC, Jonathan M Davis
> 
> wrote:
> > Yeah. toLower would be called on both strings on _every_
> > compare. And since
> > that involves a loop, that would make the overall call to sort
> > an order of
> > magnitude worse than if you didn't call toLower at all. I'm not
> > sure if it's
> > an order of magnitude worse than your solution though, since
> > it's been a while
> > since I studied Big(O) notation (doing the conversion only once
> > is still more
> > expensive than not converting, but I'm not sure how much more
> > expensive - it
> > might cost less than sort such that it actually doesn't matter
> > as for as
> > Big(O) goes though).
> > 
> > - Jonathan M Davis
> 
> use of toLower in the sort expression adds around 11.2 secs overhead to a 0.3 sec operation which reads and sorts the 34k directory entries in this 2GB layout on the ssd drive, so it isn't an option for me.
> 
> finished! time:312 ms
> 
> finished! time:11598 ms
> 
> std.algorithm.sort!("toLower(a) <
> toLower(b)",std.algorithm.SwapStrategy.stable)(dirs);
> //std.algorithm.sort!("a < b",
> std.algorithm.SwapStrategy.stable)(dirs);

It wasn't saying that it _was_ an option. I was just trying to point out the likely reason why it's so bad with toLower - algorithmically-speaking. This definitely appears to be a case where doing some extra computation ahead of time will save you a lot later.

- Jonathan M Davis
April 22, 2012
On Saturday, April 21, 2012 18:43:23 Ali Çehreli wrote:
> On 04/21/2012 04:24 PM, Jay Norwood wrote:
>  > While playing with sorting the unzip archive entries I tried use of the
>  > last example in http://dlang.org/phobos/std_algorithm.html#sort
>  >
>  > std.algorithm.sort!("toLower(a.name) <
>  > toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
> 
> Stealing this thread to point out that converting a letter to upper or lower case cannot be done without knowing the writing system. Phobos's toLower() documentation currently says: "Returns a string which is identical to s except that all of its characters are lowercase (in unicode, not just ASCII)."
> 
> Unicode cannot define the conversions of at least the following letters without knowing the actual alphabet that the text is written in:
> 
> - Lowercase of I is ı in some alphabets[*] and i in many others.
> 
> - Uppercase of i is İ in some alphabets[*] and I in many others.
> 
> Ali
> 
> [*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.

toLower and toUpper get pretty screwing with unicode. I don't know enough about non-English alphabets to know what affects what, but at minimum, there are a number of cases where toLower does not reverse toUpper (and vice versa). Rather, it converts the character into yet another letter. So, toLower to toUpper with unicode and definitely a bit iffy. I suppose that they do the job if you call them enough on the string that it doesn't change anymore, but I don't know.

I also don't know how they act with regards to the various alphabets and how their implementation was decided upon. IIRC, Walter wrote them, and I'm sure that they're based on the unicode standard, but what that amounts to, I don't know.

- Jonathan M Davis
April 22, 2012
On Saturday, April 21, 2012 18:26:42 H. S. Teoh wrote:
> Actually, I don't think the nested loops affect Big-O complexity at all.
> The O(l) complexity (where l = string length) is already inherent in the
> string comparison "str < otherStr". Adding more loops over the strings
> doesn't change the Big-O complexity (you just make O(l) into 3*O(l)
> which is the same as O(l)).

If the loop is really nested, then it will increase the Big(O) complexity, whereas if it's parallel to a similar loop, it will just increase the constant factor. Which it really is in this case, I don't know without sitting down and sketching out the exact algorithm, but certainly upon a first glance, it was my conclusion that the loops were nested. If they're not, then they're not.

Regardless of whether it's the Big(O) complexity or the constant factor that's the problem here, clearly there's enough additional overhead that it's causing problems for Jay's particular case. It's also the sort of thing that can be easy to miss and then end up wondering why your code is so slow (if it actually matters in your particular situation).

- Jonathan M Davis
« First   ‹ Prev
1 2 3