some regex vs std.ascii vs handcode times
March 19, 2012
I'm timing operations processing 10 2MB text files in parallel.  I haven't gotten to the part where I put the words in the map, but I've done enough through this point to say a few things about the measurements.

This first one took 1ms, which tells me that the taskPool tasks are doing a good job, and this will not be a problem.

// do nothing
//finished! time: 1 ms
void wcp_nothing(string fn)
{
	//G:\d\a7\a7\Release>a7

}


This second one was just to read in the files to get a baseline. One surprise, seen later, is that byChunk actually completed faster by several ms.

// read files
//finished! time: 31 ms
void wcp_whole_file(string fn)
{
	auto input = std.file.read(fn);
}


On the other end of the spectrum is the byLine version of the read.  This is way too slow to be promoting in our examples; if you are using it in code, you should instead read chunks ... maybe 1MB like in my example later below, and then split up the lines yourself (a rough sketch of that follows the snippet below).

// read files by line ... yikes! don't want to do this
//finished! time: 485 ms
void wcp_byLine(string fn)
{
	auto f = File(fn);
	foreach(line; f.byLine(std.string.KeepTerminator.yes)){
	}
}
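
Here is a rough sketch of the chunked alternative mentioned above (not one of my timed runs; the function name and the carry-buffer handling are just illustrative, and it assumes the same imports as the other snippets).  The carry buffer deals with lines that straddle chunk boundaries.

// read by chunk and split the lines yourself ... sketch only
void wcp_lines_from_chunks(string fn)
{
	ulong l_cnt;
	ubyte[] carry;                       // partial line left over from the previous chunk
	auto f = File(fn);
	foreach(chunk; f.byChunk(1_000_000)){
		auto data = carry ~ chunk;       // prepend the leftover bytes
		size_t start = 0;
		foreach(i, c; data){
			if (c == '\n'){
				auto line = data[start..i];  // one complete line, terminator excluded
				l_cnt++;
				start = i + 1;
			}
		}
		carry = data[start..$].dup;      // keep the unfinished tail for the next chunk
	}
	if (carry.length)
		l_cnt++;                         // count a final line with no trailing '\n'
}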


Ok, this was the good surprise.  Reading by chunks was faster than reading the whole file, by several ms.

// read files by chunk ...!better than full input
//finished! time: 23 ms
void wcp_files_by_chunk(string fn)
{
	auto f = File(fn);
	foreach(chunk; f.byChunk(1_000_000)){
	}
}


So this is doing the line count while reading chunks.  It is back to the same time as the read of full input.  It is ok though since it would avoid issues with reading in a really big file.  So the final thing will probably use this rather than the full input read version.  But for now I'm going to read the full input for the other tests.

// read lc by chunk ...same as full input
//finished! time: 34 ms
void wcp_lc_bychunk (string fn)
{
	ulong l_cnt;
	auto f = File(fn);
	foreach(chunk; f.byChunk(1_000_000)){
		foreach(c;chunk){
			if (c=='\n')
				l_cnt++;
		}
	}
}

There doesn't appear to be any significant overhead to reading dchar vs char.  I haven't looked at the code (or decode) underneath.  I presume it is decoding and expanding into dchar...

// read dchar lc by chunk ...same
//finished! time: 34 ms
void wcp_dchar(string fn)
{
	ulong l_cnt;
	auto f = File(fn);
	foreach(chunk; f.byChunk(1_000_000)){
		foreach(dchar c;chunk){
			if (c=='\n')
				l_cnt++;
		}
	}
}
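
A side note on why the char and dchar versions time the same (my reading of the foreach rules, so take it as an assumption): the chunk is a ubyte[], so foreach(dchar c; chunk) just widens each byte and never decodes UTF-8.  A variant that really decodes would iterate a char[] view of the chunk, something like this sketch:

// read dchar lc by chunk, forcing UTF-8 decoding ... sketch only, not timed
void wcp_dchar_decoded(string fn)
{
	ulong l_cnt;
	auto f = File(fn);
	foreach(chunk; f.byChunk(1_000_000)){
		// foreach over char[] with a dchar loop variable decodes UTF-8;
		// note: a multi-byte sequence split across a chunk boundary would throw
		foreach(dchar c; cast(char[])chunk){
			if (c=='\n')
				l_cnt++;
		}
	}
}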

So here is some surprise ... why is regex 136 ms vs 34 ms for hand code?

// count lines with regex
//finished! time: 136 ms
void wcp_lines_with_regex(string fn)
{
	string input = cast(string)std.file.read(fn);
	auto rx = regex("\n");
	ulong l_cnt;
	foreach(e; splitter(input, rx))
	{
		l_cnt ++;
	}
}

So this is faster than the non-compiled regex, but still slower than hand code, which was in the 35 ms range.  Also, the documentation implies that the "g" flag should have read in all the matches at once, but when I tested for that, the length from the result of a single call to match was always 1.  So I think it must be broken.


// count lines with compiled ctRegex matcher
//finished! time: 97 ms
void wcp_ctRegex(string fn)
{
	string input = cast(string)std.file.read(fn);
	enum ctr =  ctRegex!("\n","g");
	ulong l_cnt;
	foreach(m; match(input,ctr))
	{
		l_cnt ++;
	}
}


//count lines with char match, this or the one with chunks about the same
//finished! time: 34 ms
void wcp_char(string fn)
{
	string input = cast(string)std.file.read(fn);
	ulong l_cnt;
	foreach(c; input)
	{
		if (c == '\n')
		l_cnt ++;
	}
}

This is the fastest of the word finders, maybe by 40ms, which is meaningful if you project to larger tasks.  This is similar to the code used in the u++ benchmark processing of alice.txt.  Never mind the line counts for now; I just want to measure the word count impact separately.

//find words using pointers
//finished! time: 110 ms
void wcp_pointer(string fn)
{
	string input = cast(string)std.file.read(fn);
	ulong w_cnt;
	char c;
	auto p = input.ptr;
	auto pe = p+input.length;
	while(p<pe)
	{
		c = *p;

		if (c >= 'a' && c <= 'z' ||
			c >= 'A' && c <= 'Z')
		{
			++w_cnt;
			auto st = p++;
			while (p<pe){
				c = *p;
				if 	(!(c >= 'a' && c <= 'z' ||
					 c >= 'A' && c <= 'Z' ||
					 c >= '0' && c <= '9'))
				{
					break;
				}
				p++;
			}
			auto wpend = p;
		}
		p++;
	}
}

Ok, this is way too slow with both ctRegex and regex, so again I think there is something broken with the greedy flag, since moving the match outside the foreach only returned a length of 1 in both cases.

// count words with compiled ctRegex matcher !way too slow
//finished! time: 1299 ms for ctRegex
//finished! time: 2608 ms for regex
void wcp_words_ctRegex(string fn)
{
	string input = cast(string)std.file.read(fn);
	enum ctr =  regex("[a-zA-Z][a-zA-Z0-9]*","g");
	ulong w_cnt;
	foreach(m; match(input,ctr))
	{
		//auto s = m.hit;
		w_cnt ++;
	}
}

Using slices is slower than the previous pointer version by about 40ms.  I turned off range checking in this release build, so that doesn't explain it.

//find words with slices .. ok this is slower by a bunch
//finished! time: 153 ms
void wcp_word_slices(string fn)
{
	string input = cast(string)std.file.read(fn);
	ulong w_cnt;
	while (!input.empty)
	{
		auto c = input[0];
  		if (c >= 'a' && c <= 'z' ||
			c >= 'A' && c <= 'Z')
		{
			++w_cnt;
			auto word = input;
 			foreach(j,char w ;input){
				if 	(!(w >= 'a' && w <= 'z' ||
					   w >= 'A' && w <= 'Z' ||
					   w >= '0' && w <= '9'))
				{
					word = input[0..j];
					input = input[j..$];
					break;
				}
			}
		}
		else input = input[1..$];
	}
}

Note that using std.ascii.isXx is actually calling a function, which I find surprising given all the emphasis on template code.  Note that it added another 80ms vs the in-line code above.  So I would say provide a mixin template for this, because it is currently being reused by several libraries (a lookup-table sketch of the idea follows the snippet below).

//find words with std.ascii.isAlpha and isAlphaNum .. worse
//finished! time: 232 ms
void wcp_std_ascii (string fn)
{
	string input = cast(string)std.file.read(fn);
	ulong w_cnt;
	while (!input.empty)
	{
		auto c = input[0];
  		if (std.ascii.isAlpha(c))
		{
			++w_cnt;
			auto word = input;
 			foreach(j,char w ;input){
				if 	(!std.ascii.isAlphaNum(w))
				{
					word = input[0..j];
					input = input[j..$];
					break;
				}
			}
		}
		else input = input[1..$];
	}
}
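
For what it's worth, here is one shape an inlineable helper could take (only a sketch of the idea; the table and function names are mine, and it assumes the same imports as above).  A pair of lookup tables built at compile time keeps the per-character test to a single array load:

// character-class lookup tables built at compile time ... sketch only
immutable bool[256] isAlphaTbl = () {
	bool[256] t;
	foreach(c; 0 .. 256)
		t[c] = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
	return t;
}();

immutable bool[256] isAlnumTbl = () {
	bool[256] t;
	foreach(c; 0 .. 256)
		t[c] = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
	return t;
}();

//find words with table lookups instead of function calls
void wcp_word_table(string fn)
{
	string input = cast(string)std.file.read(fn);
	ulong w_cnt;
	bool inWord;
	foreach(char c; input)
	{
		if (!inWord)
		{
			if (isAlphaTbl[c]) { ++w_cnt; inWord = true; }   // a word starts on an alpha char
		}
		else if (!isAlnumTbl[c])
			inWord = false;                                  // the word ended
	}
}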


March 19, 2012
Please put the test files and test code (which seem to fit in one file)
somewhere so others can check your timings.

On 19.03.2012 05:12, Jay Norwood wrote:
> I'm timing operations processing 10 2MB text files in parallel.
> I haven't gotten to the part where I put the words in the map,
> but I've done enough through this point to say a few things about
> the measurements.
[....]

March 19, 2012
On 19.03.2012 8:12, Jay Norwood wrote:
> I'm timing operations processing 10 2MB text files in parallel. I
> haven't gotten to the part where I put the words in the map, but I've
> done enough through this point to say a few things about the measurements.
>

I hope you are going to tell us the dmd flags being used. Hopefully -inline is among them.

[....]
> There doesn't appear to be any significant overhead to reading dchar vs
> char. I haven't looked at the code (or decode) underneath. I presume it
> is decoding and expanding into dchar...
>
> // read dchar lc by chunk ...same
> //finished! time: 34 ms
> void wcp_dchar(string fn)
> {
> ulong l_cnt;
> auto f = File(fn);
> foreach(chunk; f.byChunk(1_000_000)){
> foreach(dchar c;chunk){
> if (c=='\n')
> l_cnt++;
> }
> }
> }

That was strange - decoding takes time, which should surely tip a few msecs in favor of the non-decoding version. Probably I/O is hiding the facts here.
>
> So here is some surprise ... why is regex 136ms vs 34 ms hand code?

Are you really that surprised? regex is a very generic tool, versus a hardcoded byte comparison.
In detail: regex still finds a "hit" and takes a slice of the input, plus it works as if "\n" were an unknown string before running.
A more logical comparison would be splitting the chunk by a string separator held in a variable, along the lines of the sketch below.
I actually find the timing not bad at all here.
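
Something like this, for instance (an illustration only, not code from the thread; it assumes the usual std.algorithm/std.file imports): splitter with a plain string needle, so the separator is still a run-time value but no regex engine is involved.

// count lines by splitting on a run-time string separator ... illustration
void wcp_lines_with_splitter(string fn, string sep = "\n")
{
	string input = cast(string)std.file.read(fn);
	ulong l_cnt;
	foreach(e; splitter(input, sep))
	{
		l_cnt ++;
	}
}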

>
> // count lines with regex
> //finished! time: 136 ms
> void wcp_lines_with_regex(string fn)
> {
> string input = cast(string)std.file.read(fn);
> auto rx = regex("\n");
> ulong l_cnt;
> foreach(e; splitter(input, rx))
> {
> l_cnt ++;
> }
> }
>
> So this is faster than the non-compiled regex, but still slower than
> hand code, which was in the 35 ms range. Also, the documentation implies
> that the "g" flag should have read in all the matches at once, but when
> I tested for that, the length from the result of a single call to match
> was always 1. So I think it must be broken.

It's by design: match returns a lazy _range_, and "g" makes it search for more than the first match.
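
So counting all the matches just means walking that lazy range to the end, e.g. (an illustrative snippet, assuming the std.regex imports used elsewhere in the thread):

// count matches by exhausting the lazy range returned by match ... illustration
ulong countNewlines(string input)
{
	import std.range : walkLength;
	auto rx = regex("\n", "g");
	return walkLength(match(input, rx));   // visits every element of the lazy match range
}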

[....]
> This is the fastest of the word finders, maybe by 40ms, which is
> meaningful if you project to larger tasks. This is similar to the code
> used in the u++ benchmark processing of alice.txt. Never mind the line
> counts for now. I'm just wanting to measure the word count impact
> separately.
>

Sorry to spoil your excitement, but none of them really count words - not in a proper Unicode sense at all.


> //find words using pointers
> //finished! time: 110 ms
> void wcp_pointer(string fn)
> {
> string input = cast(string)std.file.read(fn);
> ulong w_cnt;
> char c;
> auto p = input.ptr;
> auto pe = p+input.length;
> while(p<pe)
> {
> c = *p;
>
> if (c >= 'a' && c <= 'z' ||
> c >= 'A' && c <= 'Z')
> {
> ++w_cnt;
> auto st = p++;
> while (p<pe){
> c = *p;
> if (!(c >= 'a' && c <= 'z' ||
> c >= 'A' && c <= 'Z' ||
> c >= '0' && c <= '9'))
> {
> break;
> }
> p++;
> }
> auto wpend = p;
> }
> p++;
> }
> }
>
> Ok, this is way too slow with both ctRegex and with regex, and so again
> I think there is something broken with the greedy flag since moving it
> outside the foreach only returned a length of 1 in both cases.

Nothing is wrong with the "g" flag; it means the regex seeks all matches but returns a lazy range. Doing all the work eagerly in one go would require allocations. Again, the extra work being done by the R-T regex here is quite high and unavoidable; C-T can do much better though.



-- 
Dmitry Olshansky
March 19, 2012
On 3/18/12 11:12 PM, Jay Norwood wrote:
> I'm timing operations processing 10 2MB text files in parallel. I
> haven't gotten to the part where I put the words in the map, but I've
> done enough through this point to say a few things about the measurements.

Great work! This prompts quite a few bug reports and enhancement suggestions - please submit them to bugzilla.

Two quick notes:

> On the other end of the spectrum is the byLine version of the read. So
> this is way too slow to be promoting in our examples, and if anyone is
> using this in the code you should instead read chunks ... maybe 1MB like
> in my example later below, and then split up the lines yourself.
>
> // read files by line ... yikes! don't want to do this
> //finished! time: 485 ms
> void wcp_byLine(string fn)
> {
> auto f = File(fn);
> foreach(line; f.byLine(std.string.KeepTerminator.yes)){
> }
> }

What OS did you use? (The implementation of byLine varies a lot across OSs.)

I have wanted for a long time to improve byLine by allowing it to do its own buffering. That means once you have used byLine it's not possible to stop it, get back to the original File, and continue reading it. Using byLine is a commitment. This is what most uses of it do anyway.

> Ok, this was the good surprise. Reading by chunks was faster than
> reading the whole file, by several ms.

What may be at work here is cache effects. Reusing the same 1MB may place it in faster cache memory, whereas reading 20MB at once may spill into slower memory.


Andrei
March 19, 2012
On Monday, 19 March 2012 at 17:23:36 UTC, Andrei Alexandrescu wrote:
> On 3/18/12 11:12 PM, Jay Norwood wrote:
>> I'm timing operations processing 10 2MB text files in parallel. I
>> haven't gotten to the part where I put the words in the map, but I've
>> done enough through this point to say a few things about the measurements.
>
> Great work! This prompts quite a few bug reports and enhancement suggestions - please submit them to bugzilla.

I don't know if they are bugs.  On D.learn I got the explanation that matches.captures.length() just returns the matches for the expressions surrounded by (), so I don't think it can be used, other than in a for loop, to count lines, for example.  std.algorithm.count works ok, but I was hoping that there was something in ctRegex that would make it work as fast as the hand-coded string scan.

>
> Two quick notes:
>
>> On the other end of the spectrum is the byLine version of the read. So
>> this is way too slow to be promoting in our examples, and if anyone is
>> using this in the code you should instead read chunks ... maybe 1MB like
>> in my example later below, and then split up the lines yourself.
>>
>> // read files by line ... yikes! don't want to do this
>> //finished! time: 485 ms
>> void wcp_byLine(string fn)
>> {
>> auto f = File(fn);
>> foreach(line; f.byLine(std.string.KeepTerminator.yes)){
>> }
>> }
>
> What OS did you use? (The implementation of byLine varies a lot across OSs.)

I'm doing everything on win7-64 right now.


>
> I wanted for a long time to improve byLine by allowing it to do its own buffering. That means once you used byLine it's not possible to stop it, get back to the original File, and continue reading it. Using byLine is a commitment. This is what most uses of it do anyway.
>
>> Ok, this was the good surprise. Reading by chunks was faster than
>> reading the whole file, by several ms.
>
> What may be at work here is cache effects. Reusing the same 1MB may place it in faster cache memory, whereas reading 20MB at once may spill into slower memory.

Yes, I would guess that's the problem. This Core i7 has an 8MB cache, and the taskPool creates 7 worker threads by default, as I understand it, so even 1MB blocks are on the border when running in parallel. I'll lower the chunk size to some level that seems reasonable and retest.

>
>
> Andrei


March 21, 2012
On Monday, 19 March 2012 at 17:23:36 UTC, Andrei Alexandrescu wrote:

[.....]

>
> I wanted for a long time to improve byLine by allowing it to do its own buffering. That means once you used byLine it's not possible to stop it, get back to the original File, and continue reading it. Using byLine is a commitment. This is what most uses of it do anyway.

Great!! Perhaps we don't have to choose. We may have both!!
Allow me to suggest:

      byLineBuffered(bufferSize, keepTerminator);
or    byLineOnly(bufferSize, keepTerminator);
or    byLineChunked(bufferSize, keepTerminator);
or    byLineFastAndDangerous :-) hahah :-)

Or the other way around:

      byLine(keepTerminator, underlyingBufferSize);
renaming the current one to:
      byLineUnbuffered(keepTerminator);

Other ideas (I think I read them somewhere about
this same byLine topic):
  * I think it'd be cool if 'line' could be a slice of the
underlying buffer when possible if buffering is added.
  * Another good idea would be a new argument, maxLineLength,
so that one can avoid reading and allocating the whole
file into a big line string if there are no newlines
in the file, and one knows the max length desired.

--jm






March 21, 2012
On Monday, 19 March 2012 at 04:12:33 UTC, Jay Norwood wrote:

[....]

> Ok, this was the good surprise.  Reading by chunks was faster than reading the whole file, by several ms.
>
> // read files by chunk ...!better than full input
> //finished! time: 23 ms
> void wcp_files_by_chunk(string fn)
> {
> 	auto f = File(fn);
> 	foreach(chunk; f.byChunk(1_000_000)){
> 	}
> }

Try copying each received chunk into an array the size of
the file, so that the comparison is fair (the allocation time
of one big array != the allocation time of a single chunk;
plus, std.file.read() might reallocate).
  Also, I think that the GC runs at the end of a program,
and it's definitely not the same to have it scan through
a bigger heap (20Mb) than through just 1Mb of RAM,
and with 20Mb of a file, it might have gotten 'distracted'
by pointer-looking data.
  Are you benchmarking the time of the whole program,
or of just that snippet? Is the big array out of scope
after the std.file.read()? If so, try putting the benchmark
start and end inside the function, at the same scope,
maybe inside that wcp_whole_file() function.
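
Something along these lines (an illustrative sketch only; the function name is made up and it assumes the same imports as the other snippets):

// read by chunk but copy into one whole-file buffer, for a fairer comparison
void wcp_chunks_into_buffer(string fn)
{
	auto buf = new ubyte[cast(size_t)std.file.getSize(fn)];
	size_t pos;
	auto f = File(fn);
	foreach(chunk; f.byChunk(1_000_000)){
		buf[pos .. pos + chunk.length] = chunk[];   // copy the chunk into its slot
		pos += chunk.length;
	}
}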


[....]
>
> So here is some surprise ... why is regex 136ms vs 34 ms hand code?

It's not surprising to me. I don't think that there
is a single regex engine in the world (I don't think even
the legendary Ken Thompson machine code engine) that can
surpass a hand coded:
     foreach(c; buffer) { lineCount += (c == '\n'); }
for line counting.

The same goes for indexOf. If you are searching for
the position of a single char, I think nothing beats
indexOf (unless maybe a regex engine is optimized for
that specific case; I'd like to know of one if it exists).


Regular expressions are for what you would rather never hand code,
or for convenience.
I would rather write a regex to do a complex substitution in Vim
than write a Vim function just for that. They have amazing
descriptive power: you are basically describing a program
in a single regex line.


When people talk about *fast* regexen, they are speaking
relative to other engines, not in absolute terms.
This confuses a lot of people.

--jm



March 21, 2012
On Wednesday, 21 March 2012 at 05:49:29 UTC, Juan Manuel Cabo wrote:
> On Monday, 19 March 2012 at 04:12:33 UTC, Jay Norwood wrote:
>
> [....]
>
>> Ok, this was the good surprise.  Reading by chunks was faster than reading the whole file, by several ms.
>>
>> // read files by chunk ...!better than full input
>> //finished! time: 23 ms
>> void wcp_files_by_chunk(string fn)
>> {
>> 	auto f = File(fn);
>> 	foreach(chunk; f.byChunk(1_000_000)){
>> 	}
>> }
>

mmm, I just looked in std/file.d and std/stdio.d.
The std.file.read() function calls GetFileSize()
before reading, and you are dealing with very tight
differences (23ms vs 31ms). So there is a chance that
the difference is accounted for either by the extra
GetFileSize() (and the extra stat() in the posix version),
or by your threads/process losing their scheduled slice
for that extra I/O call of GetFileSize().
  Also, on Windows, std.file.read() uses ReadFile, while
byChunk uses fread(), though it should all be the same
in the end.

  It would be better to try with files big enough to see
whether the timing difference gets bigger or stays
just around those 8ms.

I'll later try all this too!!! Nice benchmarking
by the way!! Got me interested!

--jm



March 21, 2012
On Mon, 19 Mar 2012 08:56:28 +0100, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

> There doesn't appear to be any significant overhead to reading dchar vs
> char. I haven't looked at the code (or decode) underneath. I presume it
> is decoding and expanding into dchar...
>  // read dchar lc by chunk ...same
> //finished! time: 34 ms
> void wcp_dchar(string fn)
> {
> ulong l_cnt;
> auto f = File(fn);
> foreach(chunk; f.byChunk(1_000_000)){
> foreach(dchar c;chunk){
> if (c=='\n')
> l_cnt++;
> }
> }
> }
>  That was strange - decoding takes time, which should surely tip a few msecs in favor of the non-decoding version. Probably I/O is hiding the facts here.

Decoding is 1.5x-3x slower, and we haven't even ported the faster Phobos
std.utf.decode back to druntime, which is what foreach calls.
March 22, 2012
On Wednesday, 21 March 2012 at 05:49:29 UTC, Juan Manuel Cabo wrote:

>   Are you benchmarking the time of the whole program,
> or of just that snippet? Is the big array out of scope
> after the std.file.read() ? If so, try putting the benchmark
> start and end inside the function, at the same scope
> maybe inside that wcp_whole_file() function.

The empty loop measurement, which was the first benchmark, shows that the overhead of everything outside the measurement is only 1ms.  What I'm measuring is a parallel foreach loop that calls the small functions on 10 files, with the default number of threadpool threads, which is 7, based on the documentation in std.parallelism.

I'm not measuring to the end of the program or even the console output.  It is all measured with the StopWatch timer around the parallel foreach loop (roughly as in the sketch below).
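
The harness looks roughly like this (a reconstruction from the description above, not the exact code; it uses the std.datetime StopWatch of that era):

// measurement harness around the parallel foreach ... sketch, details may differ
void bench(void function(string) wcp, string[] files)
{
	import std.datetime : StopWatch;
	import std.parallelism : parallel;
	import std.stdio : writeln;

	StopWatch sw;
	sw.start();
	foreach(fn; parallel(files))     // default taskPool, 7 worker threads on this machine
		wcp(fn);
	sw.stop();
	writeln("finished! time: ", sw.peek().msecs, " ms");
	// e.g. bench(&wcp_char, files);
}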

>
>
> [....]
>>
>> So here is some surprise ... why is regex 136ms vs 34 ms hand code?
>
> It's not surprising to me. I don't think that there
> is a single regex engine in the world (I don't think even
> the legendary Ken Thompson machine code engine) that can
> surpass a hand coded:
>      foreach(c; buffer) { lineCount += (c == '\n'); }
> for line counting.

Well, ok, but I think comments like the one below from std.regex raise hopes of something approaching hand code.

"  //create static regex at compile-time, contains fast native code
  enum ctr = ctRegex!(`^.*/([^/]+)/?$`);
"

I'll put the test code on github this weekend.  I still want to try a few things that have been suggested.

On the chunk measurements ... I don't understand what is not "fair". In both measurements I processed the entire file.


On the use of larger files ... yes, that will be interesting, but for these current measurements the file reads are only taking on the order of 30ms for 20MB, which tells me they are already being cached, either by win7 or by the SSD's cache.

I'll use the article instructions below to put the files being read into the cache prior to the test, so that the file read time should be small and consistent relative to the other buffer processing time inside the loops.

http://us.generation-nt.com/activate-windows-file-caching-tip-tips-tricks-2130881-0.html


Thanks



