August 05, 2013 Re: Which option is faster... | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile | On Monday, 5 August 2013 at 14:50:27 UTC, bearophile wrote:
> jicman:
>
>> How would you make it faster in D1?
>
> Compute std.string.tolower(fext[0]) and put it in a temporary variable. And then compare that variable with all your string literals. In most cases that's fast enough. If it's not enough, you could create a little finite state machine that represents your directed acyclic word graph, and uses gotos to jump around states. The amount of strings is small, so perhaps there are not enough code cache misses to nullify this optimization.
>
> Bye,
> bearophile
Thanks. This is great.
|
August 05, 2013 Re: Which option is faster... | ||||
---|---|---|---|---|
| ||||
Posted in reply to jicman | Am 05.08.2013 15:59, schrieb jicman: > > Greetings! > > I have this code, > > foreach (...) > { > > if (std.string.tolower(fext[0]) == "doc" || > std.string.tolower(fext[0]) == "docx" || > std.string.tolower(fext[0]) == "xls" || > std.string.tolower(fext[0]) == "xlsx" || > std.string.tolower(fext[0]) == "ppt" || > std.string.tolower(fext[0]) == "pptx") > continue; > } > That's how I would do it. if(["doc", "docx", "xls", "xlsx", "ppt", "pptx"].canFind(fext[0].toLower) { continue; } With the array beeing a runtime/compiletime constant with a sensible name. WORD_EXTENSIONS or something. |
August 05, 2013 Re: Which option is faster... | ||||
---|---|---|---|---|
| ||||
Posted in reply to John Colvin | On Monday, 5 August 2013 at 15:18:42 UTC, John Colvin wrote:
> On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
>>
>> Greetings!
>>
>> I have this code,
>>
>> foreach (...)
>> {
>>
>> if (std.string.tolower(fext[0]) == "doc" ||
>> std.string.tolower(fext[0]) == "docx" ||
>> std.string.tolower(fext[0]) == "xls" ||
>> std.string.tolower(fext[0]) == "xlsx" ||
>> std.string.tolower(fext[0]) == "ppt" ||
>> std.string.tolower(fext[0]) == "pptx")
>> continue;
>> }
>>
>> foreach (...)
>> {
>> if (std.string.tolower(fext[0]) == "doc")
>> continue;
>> if (std.string.tolower(fext[0]) == "docx")
>> continue;
>> if (std.string.tolower(fext[0]) == "xls")
>> continue;
>> if (std.string.tolower(fext[0]) == "xlsx")
>> continue;
>> if (std.string.tolower(fext[0]) == "ppt")
>> continue;
>> if (std.string.tolower(fext[0]) == "pptx")
>> continue;
>> ...
>> ...
>> }
>>
>> thanks.
>>
>> josé
>
> better:
>
> foreach (...)
> {
> auto tmp = std.string.tolower(fext[0]);
> if(tmp == "doc" || tmp == "docx"
> || tmp == "xls" || tmp == "xlsx"
> || tmp == "ppt" || tmp == "pptx")
> {
> continue;
> }
> }
>
> but still not super-fast as (unless the compiler is very clever) it still means multiple passes over tmp. Also, it converts the whole string to lower case even when it's not necessary.
>
> If you have large numbers of possible matches you will probably want to be clever with your data structures / algorithms. E.g.
>
> You could create a tree-like structure to quickly eliminate possibilities as you read successive letters. You read one character, follow the appropriate branch, check if there are any further branches, if not then no match and break. Else, read the next character and follow the appropriate branch and so on.... Infeasible for large (or even medium-sized) character-sets without hashing, but might be pretty fast for a-z and a large number of short strings.
As the Hispanic speaking community say, muchas gracias.
|
August 05, 2013 Re: Which option is faster... | ||||
---|---|---|---|---|
| ||||
Posted in reply to jicman | Am 05.08.2013 17:18, schrieb jicman:
>
> It is a tool that was a script, but I have turned it into do,
> which now has taken two hours from the last jscript script. I
> have not benchmarked it, yet. I may. But I see that a great
> idea has been provided, which I will use. Thanks for the help.
> have not benchmarked it, yet. I may.
so its totaly unclear if the presented code is your 2h monster, what was
the runtime of your jscript?
> But I see that a great idea has been provided
using a local variable for not lowercasing on each if is not an great idea it is default programming style
and i don't think you're be able to implement the tree statemachine
when doing such simple performance killer like multiple lowercase calls, and try to help youselfe by introducing "continue"...
|
August 05, 2013 Re: Which option is faster... | ||||
---|---|---|---|---|
| ||||
Posted in reply to David | On Monday, 5 August 2013 at 15:21:25 UTC, David wrote: > Am 05.08.2013 15:59, schrieb jicman: >> >> Greetings! >> >> I have this code, >> >> foreach (...) >> { >> >> if (std.string.tolower(fext[0]) == "doc" || >> std.string.tolower(fext[0]) == "docx" || >> std.string.tolower(fext[0]) == "xls" || >> std.string.tolower(fext[0]) == "xlsx" || >> std.string.tolower(fext[0]) == "ppt" || >> std.string.tolower(fext[0]) == "pptx") >> continue; >> } >> > > That's how I would do it. > > if(["doc", "docx", "xls", "xlsx", "ppt", "pptx"].canFind(fext[0].toLower) { > continue; > } > > With the array beeing a runtime/compiletime constant with a sensible > name. WORD_EXTENSIONS or something. In case some of the strings are actually rather long, one could use toLower lazily (either std.ascii.toLower or std.uni.toLower, depending on which you need): if(WORD_EXT.canFind(fext[0].map!toLower())) |
August 05, 2013 Re: Which option is faster... | ||||
---|---|---|---|---|
| ||||
Posted in reply to dennis luehring | On Monday, 5 August 2013 at 15:27:09 UTC, dennis luehring wrote: > Am 05.08.2013 17:18, schrieb jicman: > > >> It is a tool that was a script, but I have turned it into do, >> which now has taken two hours from the last jscript script. I >> have not benchmarked it, yet. I may. But I see that a great >> idea has been provided, which I will use. Thanks for the help. > > > have not benchmarked it, yet. I may. > > so its totaly unclear if the presented code is your 2h monster, what was > the runtime of your jscript? The files are in a network drive, so, that has some slowness already involved because of that. The jscript use to take over 8 hours. The new D program has dropped that to under less than six. This is huge to us. But, I know that I can probably fine tune the program to make it a few minutes less. :-) > > But I see that a great idea has been provided > > using a local variable for not lowercasing on each if is not an great idea it is default programming style This will help. If there are 100K files, which I know that there are more than that, it will help a little bit. > and i don't think you're be able to implement the tree statemachine > when doing such simple performance killer like multiple lowercase calls, and try to help youselfe by introducing "continue"... Perhaps, but it's a good idea, nonetheless. |
August 05, 2013 Re: Which option is faster... | ||||
---|---|---|---|---|
| ||||
Posted in reply to John Colvin | On Monday, 5 August 2013 at 15:18:42 UTC, John Colvin wrote:
>
> better:
>
> foreach (...)
> {
> auto tmp = std.string.tolower(fext[0]);
> if(tmp == "doc" || tmp == "docx"
> || tmp == "xls" || tmp == "xlsx"
> || tmp == "ppt" || tmp == "pptx")
> {
> continue;
> }
> }
>
> but still not super-fast as (unless the compiler is very clever) it still means multiple passes over tmp. Also, it converts the whole string to lower case even when it's not necessary.
>
> If you have large numbers of possible matches you will probably want to be clever with your data structures / algorithms. E.g.
>
> You could create a tree-like structure to quickly eliminate possibilities as you read successive letters. You read one character, follow the appropriate branch, check if there are any further branches, if not then no match and break. Else, read the next character and follow the appropriate branch and so on.... Infeasible for large (or even medium-sized) character-sets without hashing, but might be pretty fast for a-z and a large number of short strings.
Arguably, you'd do even better with:
foreach (...)
{
auto tmp = std.string.tolower(fext[0]);
switch(tmp)
{
case "doc", "docx":
case "xls", "xlsx":
case "ppt", "pptx":
continue;
default:
}
}
Since it gives the compiler more wiggle room to optimize things as it wishes. For example, it *could* (who knows :D !) implement the switch as a hash table, or a tree.
BTW, a very convenient "tree-like" structure that could be used is a "heap": it is a basic binary tree, but stored inside an array. You could build it during compilation, and then simply search it.
A possible optimization is to first switch on string length:
foreach (...)
{
auto tmp = std.string.tolower(fext[0]);
switch(tmp.length)
{
case 3:
switch(tmp)
{
case "doc", "xls", "ppt":
continue;
default:
}
break;
case 4:
switch(tmp)
{
case "docx", "xlsx", "pptx":
continue;
default:
}
break;
default:
}
}
That said, I'm not even sure this would be faster, so a bench would be called for. Further more, I'd really be tempted to say that at this point, we are in the realm of premature optimization.
|
August 05, 2013 Re: Which option is faster... | ||||
---|---|---|---|---|
| ||||
Posted in reply to monarch_dodra | On Monday, 5 August 2013 at 17:04:36 UTC, monarch_dodra wrote:
> On Monday, 5 August 2013 at 15:18:42 UTC, John Colvin wrote:
>>
>> better:
>>
>> foreach (...)
>> {
>> auto tmp = std.string.tolower(fext[0]);
>> if(tmp == "doc" || tmp == "docx"
>> || tmp == "xls" || tmp == "xlsx"
>> || tmp == "ppt" || tmp == "pptx")
>> {
>> continue;
>> }
>> }
>>
>> but still not super-fast as (unless the compiler is very clever) it still means multiple passes over tmp. Also, it converts the whole string to lower case even when it's not necessary.
>>
>> If you have large numbers of possible matches you will probably want to be clever with your data structures / algorithms. E.g.
>>
>> You could create a tree-like structure to quickly eliminate possibilities as you read successive letters. You read one character, follow the appropriate branch, check if there are any further branches, if not then no match and break. Else, read the next character and follow the appropriate branch and so on.... Infeasible for large (or even medium-sized) character-sets without hashing, but might be pretty fast for a-z and a large number of short strings.
>
> Arguably, you'd do even better with:
> foreach (...)
> {
> auto tmp = std.string.tolower(fext[0]);
> switch(tmp)
> {
> case "doc", "docx":
> case "xls", "xlsx":
> case "ppt", "pptx":
> continue;
> default:
> }
> }
>
> Since it gives the compiler more wiggle room to optimize things as it wishes. For example, it *could* (who knows :D !) implement the switch as a hash table, or a tree.
>
> BTW, a very convenient "tree-like" structure that could be used is a "heap": it is a basic binary tree, but stored inside an array. You could build it during compilation, and then simply search it.
>
> A possible optimization is to first switch on string length:
>
> foreach (...)
> {
> auto tmp = std.string.tolower(fext[0]);
> switch(tmp.length)
> {
> case 3:
> switch(tmp)
> {
> case "doc", "xls", "ppt":
> continue;
> default:
> }
> break;
>
> case 4:
> switch(tmp)
> {
> case "docx", "xlsx", "pptx":
> continue;
> default:
> }
> break;
>
> default:
> }
> }
>
> That said, I'm not even sure this would be faster, so a bench would be called for. Further more, I'd really be tempted to say that at this point, we are in the realm of premature optimization.
:-) Now that's funny. He he he he...
By the way, your coding makes sense. For 100K plus files, it may not mean much, but we are thinking that this amount will be only growing.
|
August 05, 2013 Re: Which option is faster... | ||||
---|---|---|---|---|
| ||||
Posted in reply to jicman | On 08/05/2013 10:04 AM, jicman wrote: > The files are in a network drive, so, that has some slowness already > involved because of that. Which may dominate the running time. > The jscript use to take over 8 hours. The new D program has > dropped that to under less than six. And that is proof that a sizable amount of time is spent during computations. :) > But, I know that I can probably fine tune the program to make > it a few minutes less. :-) You really should profile where the time is spent. If file I/O is still a big issue, there are WAN optimization products that may solve that problem amazingly efficiently: http://en.wikipedia.org/wiki/WAN_optimization Ali |
August 05, 2013 Re: Which option is faster... | ||||
---|---|---|---|---|
| ||||
Posted in reply to jicman | On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
>
> Greetings!
>
> I have this code,
>
> foreach (...)
> {
>
> if (std.string.tolower(fext[0]) == "doc" ||
> std.string.tolower(fext[0]) == "docx" ||
> std.string.tolower(fext[0]) == "xls" ||
> std.string.tolower(fext[0]) == "xlsx" ||
> std.string.tolower(fext[0]) == "ppt" ||
> std.string.tolower(fext[0]) == "pptx")
> continue;
> }
>
> foreach (...)
> {
> if (std.string.tolower(fext[0]) == "doc")
> continue;
> if (std.string.tolower(fext[0]) == "docx")
> continue;
> if (std.string.tolower(fext[0]) == "xls")
> continue;
> if (std.string.tolower(fext[0]) == "xlsx")
> continue;
> if (std.string.tolower(fext[0]) == "ppt")
> continue;
> if (std.string.tolower(fext[0]) == "pptx")
> continue;
> ...
> ...
> }
>
> thanks.
>
> josé
Both are slow, those trying to optimize out tolower are missing the boat. A good compiler should optimize that out.
The issue is that n compare are done in those cases with each successive compare getting slower and slower. if the match is doc then it happens in one compare in both cases. If it's pptx then it is 6. So if there are many pptx matches then the code will perform slower than for doc.
This can be good or bad behavior depending on if you are guarantee that your order is representative of what actually happens(many doc's few pptx's).
To get O(1) behavior, you must use some sort of jump list. A hash table can provide such behavior. If there are many comparisons needed then at some point this will be, on average, faster.
If performance is critical and the string length is small enough then the fastest method is to use a direct lookup table.
For n character of char. It is 256^n. For n = 4, this is 4GB of memory! (This is one reason to use a hash table). Since you are not using the full ascii you could compact this a bit, say 36^n, for n = 4 it is only 1.6MB of memory. The cost is packing the strings for lookup and computing the index.
One way to speed up the loop is to also compare for what is not used. In this case you can compare just on the first char.
if (fext[0][0]) == "d")
{
...
}
if (fext[0][0]) == "x")
{
...
}
if (fext[0][0]) == "p")
{
...
}
This pares down the results. In this case it is only 3 comparisions + 2 comparisons = 5 for pptx rather than 6.
You can try and develop your own hash to get optimal amortized performance.
|
Copyright © 1999-2021 by the D Language Foundation