Which option is faster... (page 3) - D Programming Language Discussion Forum

August 06, 2013

Re: Which option is faster...

Posted by Raphaël Jakse
in reply to jicman

On 05/08/2013 15:59, jicman wrote:
>
> Greetings!
>
> I have this code,
>
> foreach (...)
> {
>
>    if (std.string.tolower(fext[0]) == "doc" ||
>      std.string.tolower(fext[0]) == "docx" ||
>      std.string.tolower(fext[0]) == "xls" ||
>      std.string.tolower(fext[0]) == "xlsx" ||
>      std.string.tolower(fext[0]) == "ppt" ||
>      std.string.tolower(fext[0]) == "pptx")
>     continue;
> }
>
> foreach (...)
> {
>    if (std.string.tolower(fext[0]) == "doc")
>      continue;
>    if (std.string.tolower(fext[0]) == "docx")
>      continue;
>    if (std.string.tolower(fext[0]) == "xls")
>      continue;
>    if (std.string.tolower(fext[0]) == "xlsx")
>      continue;
>    if (std.string.tolower(fext[0]) == "ppt")
>      continue;
>    if (std.string.tolower(fext[0]) == "pptx")
>     continue;
>    ...
>    ...
> }
>
> thanks.
>
> josé


As others have said, writing if (...) continue; if (...) continue; versus if (... || ...) continue; is not the problem in this code.
Lowercasing the string again for every comparison is the bigger problem, and it can be spotted at first glance. Storing the lowercased value in a variable and then using that variable in the comparisons would be better.

Both versions of your code should be equivalent, though this should be verified.

I would suggest another way of doing this:

      switch(std.string.tolower(fext[0])) {
         case "doc":
         case "docx":
         case "xls":
         case "xlsx":
         case "ppt":
         case "pptx":
            continue;
         default:
           // do something else
      }

This way of writing your code seems far more readable and elegant to me.

Even more concise:
switch(std.string.tolower(fext[0])) {
   case "doc","docx","xls","xlsx","ppt","pptx":
      continue;
   default:
      // do something else
}

Much like what monarch_dodra proposed, I guess.

Note, however, that tolower is deprecated in the current D2 version; toLower should be used instead. If you are still using D1, moving to D2 may be a better first step than this kind of optimization.

I would also suggest avoiding continue where you don't need it. Here, you can probably write something more structured, like:

import std.string : toLower;

// ...

foreach(...) {
   switch(fext[0].toLower) {
      case "doc","docx","xls","xlsx","ppt","pptx":
         // do what is to be done here with these cases, or nothing
         break;
      default:
         // do something for other cases
   }
}

If you need the value of fext[0].toLower in either branch, I would suggest writing this kind of code instead:


import std.string : toLower;

// ...

auto ext = fext[0].toLower;
switch(ext) {
   case "doc","docx","xls","xlsx","ppt","pptx":
      // do what is to be done here with these cases, or nothing
      // use ext instead of fext[0].toLower, if needed
      break;
   default:
      // do something for other cases
      // use ext instead of fext[0].toLower, if needed
}


auto ext = fext[0].toLower; should be placed before the foreach loop if fext[0] doesn't change inside the loop (avoid computing a value more than once when you can, though thinking before applying this or any other rule is not forbidden).

Faster solutions might exist, such as testing the first letter before the second, checking the length of the string first, or using a finite state machine to save comparisons. But mind the maintainability of your code, and if you go down this road, triple-check that these solutions really are faster in practice.
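
To make the "length first" idea concrete, here is a minimal sketch. The helper name isOfficeExt is made up for illustration and is not from the thread, and I am assuming the extension is passed without its leading dot. Whether this actually beats the plain switch would have to be measured.

```d
import std.uni : toLower;

/// Hypothetical helper: true when `ext` (no leading dot) is one of the six
/// Office extensions from the thread, compared case-insensitively.
bool isOfficeExt(string ext)
{
    // Cheap rejection first: every candidate is 3 or 4 characters long,
    // so anything else is ruled out without lowercasing at all.
    if (ext.length < 3 || ext.length > 4)
        return false;

    // Lowercase once, then let the switch do the comparisons.
    switch (ext.toLower)
    {
        case "doc", "docx", "xls", "xlsx", "ppt", "pptx":
            return true;
        default:
            return false;
    }
}
```

Inside the loop this would read if (isOfficeExt(fext[0])) continue;, which keeps the filter in one place if the list of extensions ever changes.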

See the posts by monarch_dodra, JS, bearophile and others for more precise information.

If this part of your code is not known to be a bottleneck, I would favor readability and elegance over over-optimization. In this particular case, I would trust the compiler to optimize the switch version well enough (though this is no more than an opinion).

August 06, 2013

Re: Which option is faster...

Posted by Andre Artus
in reply to jicman

On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
>
> Greetings!
>
> I have this code,
>
> foreach (...)
> {
>
>   if (std.string.tolower(fext[0]) == "doc" ||
>     std.string.tolower(fext[0]) == "docx" ||
>     std.string.tolower(fext[0]) == "xls" ||
>     std.string.tolower(fext[0]) == "xlsx" ||
>     std.string.tolower(fext[0]) == "ppt" ||
>     std.string.tolower(fext[0]) == "pptx")
>    continue;
> }
>
> foreach (...)
> {
>   if (std.string.tolower(fext[0]) == "doc")
>     continue;
>   if (std.string.tolower(fext[0]) == "docx")
>     continue;
>   if (std.string.tolower(fext[0]) == "xls")
>     continue;
>   if (std.string.tolower(fext[0]) == "xlsx")
>     continue;
>   if (std.string.tolower(fext[0]) == "ppt")
>     continue;
>   if (std.string.tolower(fext[0]) == "pptx")
>    continue;
>   ...
>   ...
> }
>
> thanks.
>
> josé

What exactly are you trying to do with this? I get the impression that this is an attempt at "local optimization" when a broader approach could lead to better results.

For instance, using the OS's facilities to filter (six requests, one each for "*.doc", "*.docx", and so on) could actually end up being a lot faster.

If you could give more detail about what you are trying to achieve, it might be possible to get better results.

August 06, 2013

Re: Which option is faster...

Posted by dennis luehring
in reply to jicman

On 05.08.2013 19:04, jicman wrote:
>> so it's totally unclear if the presented code is your 2h monster,
>> what was
>> the runtime of your jscript?
> The files are on a network drive, so there is some slowness
> already involved because of that.  The jscript used to take over 8
> hours.  The new D program has dropped that to under
> six.  This is huge to us.  But, I know that I can probably fine
> tune the program to make it a few minutes less. :-)

can you describe in more detail what you are doing - are you also reading
these files? why not run your tool on the server and only collect the results over the network (wouldn't that be much faster)?

what is the impact of using the network drive? just copy your biggest scenario to a local machine and run the tool against it to get a feeling for how painful the network drive communication is

>> > But I see that a great idea has been provided
>>
>> using a local variable to avoid lowercasing on each if is not a
>> great idea, it is default programming style
> This will help.  If there are 100K files, which I know that there
> are more than that, it will help a little bit.
>> and i don't think you'll be able to implement the tree
>> state machine
>> when making such simple performance killers as multiple
>> lowercase calls, and trying to help yourself by introducing
>> "continue"...
> Perhaps, but it's a good idea, nonetheless.

maybe your program can be better optimized globally, but we need more information about what the program is doing - can you give some
pseudo code like:

1. read all filenames recursively -> 100k filenames
2. reduce down to known extensions -> 10k filenames
3. ...
4. ...
5. ...

i don't think that your lowercasing is the major part of the 6h, there
must be other very slow parts in your project - which you won't find without doing some benchmarking (and i don't understand why you don't start with benchmarking)
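
a minimal sketch of what such a benchmark could look like, assuming a recent Phobos where benchmark lives in std.datetime.stopwatch. the sample extensions, the function names and the hits counter are made up for illustration; real numbers should come from the real file list.

```d
import std.datetime.stopwatch : benchmark;
import std.uni : toLower;

// Made-up sample data (assumption); use the real extensions when measuring.
string[] sampleExts = ["DOC", "txt", "Xlsx", "jpg", "pptx", "d"];

// Accumulated so that the comparisons have an observable effect.
size_t hits;

// Variant 1: lowercase again for every comparison, as in the original post.
void repeatedLower()
{
    foreach (e; sampleExts)
        if (toLower(e) == "doc" || toLower(e) == "docx" ||
            toLower(e) == "xls" || toLower(e) == "xlsx" ||
            toLower(e) == "ppt" || toLower(e) == "pptx")
            ++hits;
}

// Variant 2: lowercase once per file, then compare.
void singleLower()
{
    foreach (e; sampleExts)
    {
        auto ext = toLower(e);
        if (ext == "doc" || ext == "docx" || ext == "xls" ||
            ext == "xlsx" || ext == "ppt" || ext == "pptx")
            ++hits;
    }
}

// Runs each variant `runs` times and returns the two total durations:
// index 0 is repeatedLower, index 1 is singleLower.
auto compareVariants(uint runs)
{
    return benchmark!(repeatedLower, singleLower)(runs);
}
```

calling compareVariants(100_000) and printing both entries shows how much the lowercase calls really cost. timing the directory listing and the file reads in the same way would show whether this part matters at all next to the network drive.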

August 06, 2013

Re: Which option is faster...

Posted by jicman
in reply to Andre Artus

On Tuesday, 6 August 2013 at 04:10:57 UTC, Andre Artus wrote:
> On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
>>
>> Greetings!
>>
>> I have this code,
>>
>> foreach (...)
>> {
>>
>>  if (std.string.tolower(fext[0]) == "doc" ||
>>    std.string.tolower(fext[0]) == "docx" ||
>>    std.string.tolower(fext[0]) == "xls" ||
>>    std.string.tolower(fext[0]) == "xlsx" ||
>>    std.string.tolower(fext[0]) == "ppt" ||
>>    std.string.tolower(fext[0]) == "pptx")
>>   continue;
>> }
>>
>> foreach (...)
>> {
>>  if (std.string.tolower(fext[0]) == "doc")
>>    continue;
>>  if (std.string.tolower(fext[0]) == "docx")
>>    continue;
>>  if (std.string.tolower(fext[0]) == "xls")
>>    continue;
>>  if (std.string.tolower(fext[0]) == "xlsx")
>>    continue;
>>  if (std.string.tolower(fext[0]) == "ppt")
>>    continue;
>>  if (std.string.tolower(fext[0]) == "pptx")
>>   continue;
>>  ...
>>  ...
>> }
>>
>> thanks.
>>
>> josé
>
> What exactly are you trying to do with this? I get the impression that this is an attempt at "local optimization" when a broader approach could lead to better results.
>
> For instance, using the OS's facilities to filter (six requests, one each for "*.doc", "*.docx", and so on) could actually end up being a lot faster.
>
> If you could give more detail about what you are trying to achieve, it might be possible to get better results.

The files are on a network drive, and listing *.doc, *.docx, etc. separately will be more expensive than getting the list of all the files at once and then processing them accordingly.
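
A minimal sketch of that "one listing, then filter locally" shape, assuming the full list of paths has already been fetched in a single pass. The names isOfficeFile and skipOfficeFiles are made up for illustration and are not the actual program.

```d
import std.algorithm.iteration : filter;
import std.array : array;
import std.path : extension;
import std.uni : toLower;

/// Hypothetical helper: true when the path ends in one of the six
/// Office extensions, compared case-insensitively.
bool isOfficeFile(string path)
{
    // std.path.extension returns the extension including the leading dot,
    // or an empty string when the file name has none.
    switch (path.extension.toLower)
    {
        case ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx":
            return true;
        default:
            return false;
    }
}

/// Keeps only the paths that the original loop would NOT skip.
string[] skipOfficeFiles(string[] paths)
{
    return paths.filter!(p => !isOfficeFile(p)).array;
}
```

The list itself could come from something like std.file.dirEntries(root, SpanMode.depth), with root standing in for the share. Whether one big listing really is cheaper than six filtered ones over the network is exactly the kind of thing that would need measuring.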

August 06, 2013

Re: Which option is faster...

Posted by Andre Artus
in reply to jicman

On Tuesday, 6 August 2013 at 12:32:13 UTC, jicman wrote:
> On Tuesday, 6 August 2013 at 04:10:57 UTC, Andre Artus wrote:
>> On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
>>>
>>> Greetings!
>>>
>>> I have this code,
>>>
>>> foreach (...)
>>> {
>>>
>>> if (std.string.tolower(fext[0]) == "doc" ||
>>>   std.string.tolower(fext[0]) == "docx" ||
>>>   std.string.tolower(fext[0]) == "xls" ||
>>>   std.string.tolower(fext[0]) == "xlsx" ||
>>>   std.string.tolower(fext[0]) == "ppt" ||
>>>   std.string.tolower(fext[0]) == "pptx")
>>>  continue;
>>> }
>>>
>>> foreach (...)
>>> {
>>> if (std.string.tolower(fext[0]) == "doc")
>>>   continue;
>>> if (std.string.tolower(fext[0]) == "docx")
>>>   continue;
>>> if (std.string.tolower(fext[0]) == "xls")
>>>   continue;
>>> if (std.string.tolower(fext[0]) == "xlsx")
>>>   continue;
>>> if (std.string.tolower(fext[0]) == "ppt")
>>>   continue;
>>> if (std.string.tolower(fext[0]) == "pptx")
>>>  continue;
>>> ...
>>> ...
>>> }
>>>
>>> thanks.
>>>
>>> josé
>>
>> What exactly are you trying to do with this? I get the impression that this is an attempt at "local optimization" when a broader approach could lead to better results.
>>
>> For instance, using the OS's facilities to filter (six requests, one each for "*.doc", "*.docx", and so on) could actually end up being a lot faster.
>>
>> If you could give more detail about what you are trying to achieve, it might be possible to get better results.
>
> The files are on a network drive, and listing *.doc, *.docx, etc. separately will be more expensive than getting the list of all the files at once and then processing them accordingly.

Again, what are you trying to achieve?
Your statement is not necessarily true, for a myriad of reasons, but it depends entirely on what you want to do.
I would reiterate Dennis Luehring's reply: why are you not benchmarking? It seems like you are guessing at what the problems are, and that is hardly ever useful.
One of the first rules of network optimization is to reduce the amount of data, which normally means filtering at the server. The next is that coarse-grained is better than fine-grained (BOCTAOE/L).

August 07, 2013

Re: Which option is faster...

Posted by jicman
in reply to Andre Artus

On Tuesday, 6 August 2013 at 14:49:42 UTC, Andre Artus wrote:
> On Tuesday, 6 August 2013 at 12:32:13 UTC, jicman wrote:
>> On Tuesday, 6 August 2013 at 04:10:57 UTC, Andre Artus wrote:
>>> On Monday, 5 August 2013 at 13:59:24 UTC, jicman wrote:
>>>>
>>>> Greetings!
>>>>
>>>> I have this code,
>>>>
>>>> foreach (...)
>>>> {
>>>>
>>>> if (std.string.tolower(fext[0]) == "doc" ||
>>>>  std.string.tolower(fext[0]) == "docx" ||
>>>>  std.string.tolower(fext[0]) == "xls" ||
>>>>  std.string.tolower(fext[0]) == "xlsx" ||
>>>>  std.string.tolower(fext[0]) == "ppt" ||
>>>>  std.string.tolower(fext[0]) == "pptx")
>>>> continue;
>>>> }
>>>>
>>>> foreach (...)
>>>> {
>>>> if (std.string.tolower(fext[0]) == "doc")
>>>>  continue;
>>>> if (std.string.tolower(fext[0]) == "docx")
>>>>  continue;
>>>> if (std.string.tolower(fext[0]) == "xls")
>>>>  continue;
>>>> if (std.string.tolower(fext[0]) == "xlsx")
>>>>  continue;
>>>> if (std.string.tolower(fext[0]) == "ppt")
>>>>  continue;
>>>> if (std.string.tolower(fext[0]) == "pptx")
>>>> continue;
>>>> ...
>>>> ...
>>>> }
>>>>
>>>> thanks.
>>>>
>>>> josé
>>>
>>> What exactly are you trying to do with this? I get the impression that this is an attempt at "local optimization" when a broader approach could lead to better results.
>>>
>>> For instance, using the OS's facilities to filter (six requests, one each for "*.doc", "*.docx", and so on) could actually end up being a lot faster.
>>>
>>> If you could give more detail about what you are trying to achieve, it might be possible to get better results.
>>
>> The files are on a network drive, and listing *.doc, *.docx, etc. separately will be more expensive than getting the list of all the files at once and then processing them accordingly.
>
> Again, what are you trying to achieve?
> Your statement is not necessarily true, for a myriad of reasons, but it depends entirely on what you want to do.
> I would reiterate Dennis Luehring's reply: why are you not benchmarking? It seems like you are guessing at what the problems are, and that is hardly ever useful.
> One of the first rules of network optimization is to reduce the amount of data, which normally means filtering at the server. The next is that coarse-grained is better than fine-grained (BOCTAOE/L).

It's a long story and I will return in a few months and give you the whole story, but right now, time is not on my side.  I have answers for all the questions you folks have asked, and I appreciate all the input.  I have the answer that I was looking for, so in a few months, I will come back and explain the whole story.  Thanks for all the responses and suggestions.

jic

August 07, 2013

Re: Which option is faster...

Posted by dennis luehring
in reply to jicman

On 07.08.2013 06:30, jicman wrote:
>> Again, what are you trying to achieve?
>> Your statement is not necessarily true, for a myriad of
>> reasons, but it depends entirely on what you want to do.
>> I would reiterate Dennis Luehring's reply: why are you not
>> benchmarking? It seems like you are guessing at what the problems
>> are, and that is hardly ever useful.
>> One of the first rules of network optimization is to reduce
>> the amount of data, which normally means filtering at the
>> server. The next is that coarse-grained is better than
>> fine-grained (BOCTAOE/L).
>
> It's a long story and I will return in a few months and give you
> the whole story, but right now, time is not on my side.  I have
> answers for all the questions you folks have asked, and I
> appreciate all the input.  I have the answer that I was looking
> for, so in a few months, I will come back and explain the whole
> story.  Thanks for all the responses and suggestions.

after making us girls all wet to help you - your reply is
"no sex on the first date, i'm a gentleman... but maybe in a few months"

so:

you have a jscript doing something with files and file extensions over a network drive - it runs around 8h

you ported that jscript to D - now it runs for 6h

you noob-guessed the lowercase-if-party could be evil (btw: it costs more time to guess than to benchmark)

you get trivial answers that won't gain you very much; the lowercase fix will not boost your speed that much and the network drive latency will kill all the other state machine ideas

you don't answer trivial questions about the big picture - and now you're out of time

open questions:
-why not collect the data on the server itself - instead of grabbing tiny bits over the network? - this is for understanding your environment

-how big is the speed drop with your tool on the very same drive locally versus over a network drive? - this is for understanding the latency

-are you also reading these files or just doing a filename search (recursively?) and throwing out non-office extensions?
this is for getting an idea of whether built-in OS (operating system) features can help

see you in a few months

August 07, 2013

Re: Which option is faster...

Posted by Andre Artus
in reply to dennis luehring

>> It's a long story and I will return in a few months and give you
>> the whole story, but right now, time is not on my side.  I have
>> answers for all the questions you folks have asked, and I
>> appreciate all the input.  I have the answer that I was looking
>> for, so in a few months, I will come back and explain the whole
>> story.  Thanks for all the responses and suggestions.
>
> after making us girls all wet to help you - your reply is
> "no sex on the first date, i'm a gentleman... but maybe in a few months"
>
> so:
>
> you have a jscript doing something with files and file extensions over a network drive - it runs around 8h
>
> you ported that jscript to D - now it runs for 6h
>
> you noob-guessed the lowercase-if-party could be evil (btw: it costs more time to guess than to benchmark)
>
> you get trivial answers that won't gain you very much; the lowercase fix will not boost your speed that much and the network drive latency will kill all the other state machine ideas
>
> you don't answer trivial questions about the big picture - and now you're out of time
>
> open questions:
> -why not collect the data on the server itself - instead of grabbing tiny bits over the network? - this is for understanding your environment

Just as a sanity check, I implemented a quick client-server setup in which the daemon takes a filespec from the client and returns a line-by-line list compressed into one packet.

The total running time on 8 terabytes of files stored across a dozen drives, searched recursively: less than 1 minute.

Same over slow WiFi: negligible difference (the list compresses to a few KB with LZMA).

I did not even bother to search each physical drive separately; I just produced the list sequentially.
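
As a rough sketch of the "whole list in one compressed packet" part only (this is not the code I ran): it swaps in zlib via std.zlib for LZMA, because Phobos ships zlib, and the function names packFileList and unpackFileList are made up for illustration. The daemon and the socket handling are left out.

```d
import std.array : join;
import std.string : splitLines;
import std.zlib : compress, uncompress;

/// Server side (hypothetical helper): one path per line, compressed
/// into a single buffer that can be sent as one response.
ubyte[] packFileList(string[] paths)
{
    return compress(paths.join("\n"));
}

/// Client side (hypothetical helper): inflate the buffer and split it
/// back into one path per entry.
string[] unpackFileList(const(ubyte)[] packet)
{
    // uncompress returns a freshly allocated buffer, so viewing it as a
    // string is safe here.
    auto text = cast(string) uncompress(packet);
    return text.splitLines();
}
```

File lists are highly repetitive (shared directory prefixes, a handful of extensions), which is why they compress so well, and why one coarse request tends to beat many small ones over a slow link.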

>
> -how big is the speed drop with your tool on the very same drive locally versus over a network drive? - this is for understanding the latency
>
> -are you also reading these files or just doing a filename search (recursively?) and throwing out non-office extensions?
> this is for getting an idea of whether built-in OS (operating system) features can help
>
> see you in a few months

It's impossible to help people who refuse to give basic information.

August 10, 2013

Re: Which option is faster...

Posted by H. S. Teoh
in reply to jicman

On Mon, Aug 05, 2013 at 03:59:23PM +0200, jicman wrote:
> 
> Greetings!
> 
> I have this code,
> 
> foreach (...)
> {
> 
>   if (std.string.tolower(fext[0]) == "doc" ||
>     std.string.tolower(fext[0]) == "docx" ||
>     std.string.tolower(fext[0]) == "xls" ||
>     std.string.tolower(fext[0]) == "xlsx" ||
>     std.string.tolower(fext[0]) == "ppt" ||
>     std.string.tolower(fext[0]) == "pptx")
>    continue;
> }
> 
> foreach (...)
> {
>   if (std.string.tolower(fext[0]) == "doc")
>     continue;
>   if (std.string.tolower(fext[0]) == "docx")
>     continue;
>   if (std.string.tolower(fext[0]) == "xls")
>     continue;
>   if (std.string.tolower(fext[0]) == "xlsx")
>     continue;
>   if (std.string.tolower(fext[0]) == "ppt")
>     continue;
>   if (std.string.tolower(fext[0]) == "pptx")
>    continue;
>   ...
>   ...
> }
[...]

It would appear that your bottleneck is not in the continue statements, but in the repeated calls to std.string.tolower. It's probably better to write it this way:

	foreach (...)
	{
		auto ext = std.string.tolower(fext[0]);
		if (ext == "doc" || ext == "docx" || ... )
			continue;
	}

This way you save the overhead of many identical function calls, which probably outweighs any benefit you get from optimizing continue.


T

-- 
It's amazing how careful choice of punctuation can leave you hanging:

August 10, 2013

Re: Which option is faster...

Posted by H. S. Teoh
in reply to dennis luehring

On Mon, Aug 05, 2013 at 04:47:36PM +0200, dennis luehring wrote:
> > Ok, how would you make it faster?
> 
> i don't see a better solution here - how could you reduce ONE lowercase and SOME compares any further? (i don't think a hash or something will help) but i know that anything like your continue-party is worth nothing (feels a little bit like script-kiddies' "do it in assembler, that would make it a million times faster" blabla)

If you really want optimal performance, use std.regex:

	import std.regex;
	// ^...$ anchors the pattern, so that e.g. "docs" does not match "doc"
	auto reExtMatch = ctRegex!(`^(doc|docx|exe|...)$`, "i");
	foreach (...) {
		if (fext[0].match(reExtMatch))
			continue;
	}

The regex is set up to ignore case (the "i" flag), and is compiled at compile-time to generate optimal matching code for the given list of extensions. To get any faster than this, you'll have to hand-optimize or write in assembly. :)


T

-- 
Computerese Irregular Verb Conjugation: I have preferences.  You have biases.  He/She has prejudices. -- Gene Wirchenko


Copyright © 1999-2021 by the D Language Foundation