April 22, 2012
On Sunday, 22 April 2012 at 02:29:45 UTC, Jonathan M Davis wrote:

> Regardless of whether it's the Big(O) complexity or the constant factor that's
> the problem here, clearly there's enough additional overhead that it's causing
> problems for Jay's particular case. It's also the sort of thing that can be
> easy to miss and then end up wondering why your code is so slow (if it
> actually matters in your particular situation).
>
> - Jonathan M Davis

I haven't looked at strncmpi code, but I suspect it is a lot more efficient.  For example, in comparing

AbbbCdddEfffXabcdEfgh
AbbbCdddEfffYabcdEfgh

it is not necessary to do case conversion on anything except X and Y, and if isUpper(X)==isUpper(Y) then X and Y can be compared without conversion, and since X and Y are not equal the remaining characters don't have to be converted.

The comment below worries me a little bit about std.string.icmp.  What if they are two 1MB strings that differ in he first character?  Does it really convert both strings to lower case before comparing the first character?

http://dlang.org/phobos/std_string.html#icmp

"Technically, icmp(r1, r2) is equivalent to cmp!"std.uni.toLower(a) < std.uni.toLower(b)"(r1, r2). "


April 22, 2012
On Sunday, April 22, 2012 08:20:13 Jay Norwood wrote:
> The comment below worries me a little bit about std.string.icmp. What if they are two 1MB strings that differ in he first character?  Does it really convert both strings to lower case before comparing the first character?
> 
> http://dlang.org/phobos/std_string.html#icmp
> 
> "Technically, icmp(r1, r2) is equivalent to
> cmp!"std.uni.toLower(a) < std.uni.toLower(b)"(r1, r2). "

You can look at the code. It checks each of the characters in place. Unlike toLower, it doesn't need to generate a new string. But as far as the comparison goes, they're the same - hence that line in the docs.

- Jonathan M Davis
April 22, 2012
On Sunday, 22 April 2012 at 06:26:42 UTC, Jonathan M Davis wrote:
>
> You can look at the code. It checks each of the characters in place. Unlike
> toLower, it doesn't need to generate a new string. But as far as the
> comparison goes, they're the same - hence that line in the docs.
>
> - Jonathan M Davis

ok, I did look at the code just now, and I'll sleep better
knowing that it doesn't do the whole string conversion.  I
misunderstood your pseudo-code to mean that two lower case
strings were being created prior to the compare.

However,  icmp code does appear to call the toLower conversion on
both characters without first comparing the characters for
equality, which misses the chance to do a simple compare that
would avoid the two calls.



April 22, 2012
On 22.04.2012 5:43, Ali Çehreli wrote:
> On 04/21/2012 04:24 PM, Jay Norwood wrote:
>  > While playing with sorting the unzip archive entries I tried use of the
>  > last example in http://dlang.org/phobos/std_algorithm.html#sort
>  >
>  > std.algorithm.sort!("toLower(a.name) <
>  > toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
>
> Stealing this thread to point out that converting a letter to upper or
> lower case cannot be done without knowing the writing system. Phobos's
> toLower() documentation currently says: "Returns a string which is
> identical to s except that all of its characters are lowercase (in
> unicode, not just ASCII)."

Oh, come on. This function wasn't updated for ages. I bet this wording here is intact since unicode 4.0 ;)

>
> Unicode cannot define the conversions of at least the following letters
> without knowing the actual alphabet that the text is written in:
>
> - Lowercase of I is ı in some alphabets[*] and i in many others.
>
> - Uppercase of i is İ in some alphabets[*] and I in many others.
>

Fair point. The list however is not that long and a system may choose to support this or not (changing behavior based on writing system is called tailoring I believe).

> Ali
>
> [*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.
>


-- 
Dmitry Olshansky
April 22, 2012
On 2012-04-22 01:24, Jay Norwood wrote:
> While playing with sorting the unzip archive entries I tried use of the
> last example in http://dlang.org/phobos/std_algorithm.html#sort
>
> std.algorithm.sort!("toLower(a.name) <
> toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
>
> It was terribly slow for sorting the 34k entries in my test case. I'd
> guess it is doing the toLower call on both strings for every compare.
>
> It was much, much faster to add an entries.lowerCaseName =
> std.string.toLower(entries.name) foreach entry prior to the sort
> execution and then use
>
> std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName
> ",std.algorithm.SwapStrategy.stable)(entries);
>
> The difference was on the order of 10 secs vs no noticeable delay when
> executing the sort operation in the debugger.

Perhaps a function that does case folding would be better in this case. But as far as I know Phobos doesn't have a function for that.

-- 
/Jacob Carlborg
April 22, 2012
On Saturday, 21 April 2012 at 23:24:57 UTC, Jay Norwood wrote:
> While playing with sorting the unzip archive entries I tried use of the last example in http://dlang.org/phobos/std_algorithm.html#sort
>
> std.algorithm.sort!("toLower(a.name) < toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
>
> It was terribly slow  for sorting the 34k entries in my test case.  I'd guess it is doing the toLower call on both strings for every compare.
>
> It was much, much faster to add an entries.lowerCaseName = std.string.toLower(entries.name) foreach entry prior to the sort execution and then use
>
> std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName ",std.algorithm.SwapStrategy.stable)(entries);
>
> The difference was on the order of 10 secs vs no noticeable delay when executing the sort operation in the debugger.

Good point. Perhaps this should be added in the documentation of std.algorithm ? It's easy to get trapped.
April 23, 2012
On Sunday, 22 April 2012 at 00:36:19 UTC, bearophile wrote:
> Performing the toLower every time the cmp function is called doesn't change the O complexity. In Phobos there is an alternative sorting (Schwartzian sorting routime) that applies a function to each item before sorting them, usually is much slower than the normal sorting, but maybe this time it's convenient.
> Bye,
> bearophile

The shwartzSort works fine.
Thanks,
Jay

std.algorithm.schwartzSort!(std.string.toLower, "a < b")(dirs);

G:\d\rmdir2\rmdir2\rmdir2\Debug>rmdir2
removing: g:\tz
finished! time:699 ms

April 23, 2012
On Sat, 21 Apr 2012 19:24:56 -0400, Jay Norwood <jayn@prismnet.com> wrote:

> While playing with sorting the unzip archive entries I tried use of the last example in http://dlang.org/phobos/std_algorithm.html#sort
>
> std.algorithm.sort!("toLower(a.name) < toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
>
> It was terribly slow  for sorting the 34k entries in my test case.  I'd guess it is doing the toLower call on both strings for every compare.
>
> It was much, much faster to add an entries.lowerCaseName = std.string.toLower(entries.name) foreach entry prior to the sort execution and then use
>
> std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName ",std.algorithm.SwapStrategy.stable)(entries);
>
> The difference was on the order of 10 secs vs no noticeable delay when executing the sort operation in the debugger.

I'll point out what I haven't seen yet:

the issue is not so much toLower being called on every comparison, but more that toLower allocates (and then throws away!).

I think using std.string.icmp is the best solution.  I would expect it to outperform even schwartz sort.

Note, to answer your question elsewhere, the comment is accurate, std.uni.toLower(a) is a function that accepts a dchar, not a string.  What the comment is saying is that for the "ranges" (i.e. strings) given, it runs the given comparison on the std.uni.toLower() result for each element (i.e. dchar).

-Steve
April 23, 2012
On Monday, 23 April 2012 at 11:27:40 UTC, Steven Schveighoffer
wrote:

> I think using std.string.icmp is the best solution.  I would expect it to outperform even schwartz sort.
>
> -Steve

icmp took longer... added about 1 sec vs  0.3 sec (for
schwartzSort ) to the program execution time.

bool myComp(string x, string y) { return std.string.icmp(x,y)<0; }
std.algorithm.sort!(myComp)(dirs);

finished! time:1396 ms




April 23, 2012
On Mon, 23 Apr 2012 09:49:50 -0400, Jay Norwood <jayn@prismnet.com> wrote:

> On Monday, 23 April 2012 at 11:27:40 UTC, Steven Schveighoffer
> wrote:
>
>> I think using std.string.icmp is the best solution.  I would expect it to outperform even schwartz sort.
>>
>> -Steve
>
> icmp took longer... added about 1 sec vs  0.3 sec (for
> schwartzSort ) to the program execution time.
>
> bool myComp(string x, string y) { return std.string.icmp(x,y)<0; }
> std.algorithm.sort!(myComp)(dirs);
>
> finished! time:1396 ms

Well, that's surprising :)  Perhaps there's some room for improvement in icmp or std.uni.toLower.  There may be some constructs that are preventing inlining (enforce is the worst offender).

While dealing with unicode in my std.stream rewrite, I've found that hand-decoding dchars is way faster than using library calls.

-Steve