View mode: basic / threaded / horizontal-split · Log in · Help
April 21, 2012
avoid toLower in std.algorithm.sort compare alias
While playing with sorting the unzip archive entries I tried use 
of the last example in 
http://dlang.org/phobos/std_algorithm.html#sort

std.algorithm.sort!("toLower(a.name) < 
toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);

It was terribly slow  for sorting the 34k entries in my test 
case.  I'd guess it is doing the toLower call on both strings for 
every compare.

It was much, much faster to add an entries.lowerCaseName = 
std.string.toLower(entries.name) foreach entry prior to the sort 
execution and then use

std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName 
",std.algorithm.SwapStrategy.stable)(entries);

The difference was on the order of 10 secs vs no noticeable delay 
when executing the sort operation in the debugger.
April 21, 2012
Re: avoid toLower in std.algorithm.sort compare alias
On Sunday, April 22, 2012 01:24:56 Jay Norwood wrote:
> While playing with sorting the unzip archive entries I tried use
> of the last example in
> http://dlang.org/phobos/std_algorithm.html#sort
> 
> std.algorithm.sort!("toLower(a.name) <
> toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
> 
> It was terribly slow  for sorting the 34k entries in my test
> case.  I'd guess it is doing the toLower call on both strings for
> every compare.
> 
> It was much, much faster to add an entries.lowerCaseName =
> std.string.toLower(entries.name) foreach entry prior to the sort
> execution and then use
> 
> std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName
> ",std.algorithm.SwapStrategy.stable)(entries);
> 
> The difference was on the order of 10 secs vs no noticeable delay
> when executing the sort operation in the debugger.

Yeah. toLower would be called on both strings on _every_ compare. And since 
that involves a loop, that would make the overall call to sort an order of 
magnitude worse than if you didn't call toLower at all. I'm not sure if it's 
an order of magnitude worse than your solution though, since it's been a while 
since I studied Big(O) notation (doing the conversion only once is still more 
expensive than not converting, but I'm not sure how much more expensive - it 
might cost less than sort such that it actually doesn't matter as for as 
Big(O) goes though).

- Jonathan M Davis
April 22, 2012
Re: avoid toLower in std.algorithm.sort compare alias
Jonathan M Davis:

> I'm not sure if it's 
> an order of magnitude worse than your solution though, since it's been a while 
> since I studied Big(O) notation (doing the conversion only once is still more 
> expensive than not converting, but I'm not sure how much more expensive - it 
> might cost less than sort such that it actually doesn't matter as for as 
> Big(O) goes though).

Performing the toLower every time the cmp function is called doesn't change the O complexity. In Phobos there is an alternative sorting (Schwartzian sorting routime) that applies a function to each item before sorting them, usually is much slower than the normal sorting, but maybe this time it's convenient.
The performance improvement in the OP message is large, maybe it's a problem of memory allocations of the converted strings... More info on the original code is needed to give better answers.

Bye,
bearophile
April 22, 2012
Re: avoid toLower in std.algorithm.sort compare alias
On Saturday, April 21, 2012 20:36:18 bearophile wrote:
> Jonathan M Davis:
> > I'm not sure if it's
> > an order of magnitude worse than your solution though, since it's been a
> > while since I studied Big(O) notation (doing the conversion only once is
> > still more expensive than not converting, but I'm not sure how much more
> > expensive - it might cost less than sort such that it actually doesn't
> > matter as for as Big(O) goes though).
> 
> Performing the toLower every time the cmp function is called doesn't change
> the O complexity.

Yes it does. It adds a loop to each comparison (two loops actually, but since 
they're not nested, Big(O) only cares about the one), since toLower has to 
loop over all of the characters. As sort loops over each of the strings to 
compare them for moving them into a sorted position or not, it calls toLower, 
which adds a nested loop, so it increases the Big(O) complexity. Something 
like this

foreach(str; strings)
   str < otherStr;

becomes

foreach(str; strings)
{
   foreach(dchar c; str)
       //lower characters

   foreach(dchar c; otherStr)
       //lower characters

   str < otherStr;
}

though that's obviously very pseudo-code-ish and not exactly what sort does. 
Regardless, those extra loops when the comparison happens are nested and 
therefore increase the Big(O) complexity of the overall algorithm.

- Jonathan M Davis
April 22, 2012
Re: avoid toLower in std.algorithm.sort compare alias
On Sat, Apr 21, 2012 at 05:45:35PM -0700, Jonathan M Davis wrote:
> On Saturday, April 21, 2012 20:36:18 bearophile wrote:
> > Jonathan M Davis:
> > > I'm not sure if it's an order of magnitude worse than your
> > > solution though, since it's been a while since I studied Big(O)
> > > notation (doing the conversion only once is still more expensive
> > > than not converting, but I'm not sure how much more expensive - it
> > > might cost less than sort such that it actually doesn't matter as
> > > for as Big(O) goes though).
> > 
> > Performing the toLower every time the cmp function is called doesn't change
> > the O complexity.
> 
> Yes it does. It adds a loop to each comparison (two loops actually,
> but since they're not nested, Big(O) only cares about the one), since
> toLower has to loop over all of the characters. As sort loops over
> each of the strings to compare them for moving them into a sorted
> position or not, it calls toLower, which adds a nested loop, so it
> increases the Big(O) complexity. Something like this
> 
> foreach(str; strings)
>     str < otherStr;
> 
> becomes
> 
> foreach(str; strings)
> {
>     foreach(dchar c; str)
>         //lower characters
> 
>     foreach(dchar c; otherStr)
>         //lower characters
> 
>     str < otherStr;
> }
> 
> though that's obviously very pseudo-code-ish and not exactly what sort
> does.  Regardless, those extra loops when the comparison happens are
> nested and therefore increase the Big(O) complexity of the overall
> algorithm.
[...]

Actually, I don't think the nested loops affect Big-O complexity at all.
The O(l) complexity (where l = string length) is already inherent in the
string comparison "str < otherStr". Adding more loops over the strings
doesn't change the Big-O complexity (you just make O(l) into 3*O(l)
which is the same as O(l)).

However, always remember that Big-O notation hides constant factors and
terms. These factors are negligible given a sufficiently large problem
size, but for small problem sizes, the hidden constants are very much
significant. An O(n log n) algorithm may actually take n*log(10*n) steps
or 1000*n*log(n) steps; for large enough n, they're approximately the
same, but when n is small, the 1000 has a very significant hit on
observed performance.

That's why optimizing the constant factors is sometimes necessary
(provided you've already minimized the big-O complexity to the full,
since otherwise you're just pinching pennies yet freely spending $100
bills). Inner loop optimization, like strength reduction, loop invariant
hoisting, etc., are examples of where constant factors get optimized. If
you have a loop:

	real x;
	for (i=0; i < n; i++) {
		// bigComplexCalculation is independent of i
		real n = bigComplexCalculation(x);

		// make use of n in some way
	}

Then moving the bigComplexCalculation() call outside the loop improves
the observed performance significantly, even though you're not changing
the big-O complexity: a small change from 10*n to 8*n means a 20%
improvement in observed performance, even though the algorithm still
degrades linearly with problem size just like before.


T

-- 
Microsoft is to operating systems & security ... what McDonalds is to gourmet cooking.
April 22, 2012
toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias
On 04/21/2012 04:24 PM, Jay Norwood wrote:
> While playing with sorting the unzip archive entries I tried use of the
> last example in http://dlang.org/phobos/std_algorithm.html#sort
>
> std.algorithm.sort!("toLower(a.name) <
> toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);

Stealing this thread to point out that converting a letter to upper or 
lower case cannot be done without knowing the writing system. Phobos's 
toLower() documentation currently says: "Returns a string which is 
identical to s except that all of its characters are lowercase (in 
unicode, not just ASCII)."

Unicode cannot define the conversions of at least the following letters 
without knowing the actual alphabet that the text is written in:

- Lowercase of I is ı in some alphabets[*] and i in many others.

- Uppercase of i is İ in some alphabets[*] and I in many others.

Ali

[*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.
April 22, 2012
Re: avoid toLower in std.algorithm.sort compare alias
On Saturday, 21 April 2012 at 23:54:26 UTC, Jonathan M Davis 
wrote:
> Yeah. toLower would be called on both strings on _every_ 
> compare. And since
> that involves a loop, that would make the overall call to sort 
> an order of
> magnitude worse than if you didn't call toLower at all. I'm not 
> sure if it's
> an order of magnitude worse than your solution though, since 
> it's been a while
> since I studied Big(O) notation (doing the conversion only once 
> is still more
> expensive than not converting, but I'm not sure how much more 
> expensive - it
> might cost less than sort such that it actually doesn't matter 
> as for as
> Big(O) goes though).
>
> - Jonathan M Davis

use of toLower in the sort expression adds around 11.2 secs 
overhead to a 0.3 sec operation which reads and sorts the 34k 
directory entries in this 2GB layout on the ssd drive, so it 
isn't an option for me.

finished! time:312 ms

finished! time:11598 ms

std.algorithm.sort!("toLower(a) < 
toLower(b)",std.algorithm.SwapStrategy.stable)(dirs);
//std.algorithm.sort!("a < b", 
std.algorithm.SwapStrategy.stable)(dirs);
April 22, 2012
Re: avoid toLower in std.algorithm.sort compare alias
On Sunday, April 22, 2012 03:47:30 Jay Norwood wrote:
> On Saturday, 21 April 2012 at 23:54:26 UTC, Jonathan M Davis
> 
> wrote:
> > Yeah. toLower would be called on both strings on _every_
> > compare. And since
> > that involves a loop, that would make the overall call to sort
> > an order of
> > magnitude worse than if you didn't call toLower at all. I'm not
> > sure if it's
> > an order of magnitude worse than your solution though, since
> > it's been a while
> > since I studied Big(O) notation (doing the conversion only once
> > is still more
> > expensive than not converting, but I'm not sure how much more
> > expensive - it
> > might cost less than sort such that it actually doesn't matter
> > as for as
> > Big(O) goes though).
> > 
> > - Jonathan M Davis
> 
> use of toLower in the sort expression adds around 11.2 secs
> overhead to a 0.3 sec operation which reads and sorts the 34k
> directory entries in this 2GB layout on the ssd drive, so it
> isn't an option for me.
> 
> finished! time:312 ms
> 
> finished! time:11598 ms
> 
> std.algorithm.sort!("toLower(a) <
> toLower(b)",std.algorithm.SwapStrategy.stable)(dirs);
> //std.algorithm.sort!("a < b",
> std.algorithm.SwapStrategy.stable)(dirs);

It wasn't saying that it _was_ an option. I was just trying to point out the 
likely reason why it's so bad with toLower - algorithmically-speaking. This 
definitely appears to be a case where doing some extra computation ahead of 
time will save you a lot later.

- Jonathan M Davis
April 22, 2012
Re: toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias
On Saturday, April 21, 2012 18:43:23 Ali Çehreli wrote:
> On 04/21/2012 04:24 PM, Jay Norwood wrote:
>  > While playing with sorting the unzip archive entries I tried use of the
>  > last example in http://dlang.org/phobos/std_algorithm.html#sort
>  > 
>  > std.algorithm.sort!("toLower(a.name) <
>  > toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
> 
> Stealing this thread to point out that converting a letter to upper or
> lower case cannot be done without knowing the writing system. Phobos's
> toLower() documentation currently says: "Returns a string which is
> identical to s except that all of its characters are lowercase (in
> unicode, not just ASCII)."
> 
> Unicode cannot define the conversions of at least the following letters
> without knowing the actual alphabet that the text is written in:
> 
> - Lowercase of I is ı in some alphabets[*] and i in many others.
> 
> - Uppercase of i is İ in some alphabets[*] and I in many others.
> 
> Ali
> 
> [*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.

toLower and toUpper get pretty screwing with unicode. I don't know enough 
about non-English alphabets to know what affects what, but at minimum, there 
are a number of cases where toLower does not reverse toUpper (and vice versa). 
Rather, it converts the character into yet another letter. So, toLower to 
toUpper with unicode and definitely a bit iffy. I suppose that they do the job 
if you call them enough on the string that it doesn't change anymore, but I 
don't know.

I also don't know how they act with regards to the various alphabets and how 
their implementation was decided upon. IIRC, Walter wrote them, and I'm sure 
that they're based on the unicode standard, but what that amounts to, I don't 
know.

- Jonathan M Davis
April 22, 2012
Re: avoid toLower in std.algorithm.sort compare alias
On Saturday, April 21, 2012 18:26:42 H. S. Teoh wrote:
> Actually, I don't think the nested loops affect Big-O complexity at all.
> The O(l) complexity (where l = string length) is already inherent in the
> string comparison "str < otherStr". Adding more loops over the strings
> doesn't change the Big-O complexity (you just make O(l) into 3*O(l)
> which is the same as O(l)).

If the loop is really nested, then it will increase the Big(O) complexity, 
whereas if it's parallel to a similar loop, it will just increase the constant 
factor. Which it really is in this case, I don't know without sitting down and 
sketching out the exact algorithm, but certainly upon a first glance, it was my 
conclusion that the loops were nested. If they're not, then they're not.

Regardless of whether it's the Big(O) complexity or the constant factor that's 
the problem here, clearly there's enough additional overhead that it's causing 
problems for Jay's particular case. It's also the sort of thing that can be 
easy to miss and then end up wondering why your code is so slow (if it 
actually matters in your particular situation).

- Jonathan M Davis
« First   ‹ Prev
1 2 3
Top | Discussion index | About this forum | D home