Slow GC?
March 14, 2008
While writing a program I noticed it was quite slow (about 35-40 seconds for a run that takes less than 8 seconds in Python). I have progressively simplified it down to the following D code, which shows a similar problem:


import std.file, std.string, std.gc;
import std.stdio: putr = writefln;
static import std.c.time;
double clock() {
    auto t = std.c.time.clock();
    return t / cast(double)std.c.time.CLOCKS_PER_SEC;
}

void main() {
    //disable; // uncomment to turn the GC off (from std.gc)
    auto t0 = clock();
    auto txt = cast(string)read("text.txt"); // 6.3 MB of text
    auto t1 = clock();
    auto words = txt.split();
    auto t2 = clock();

    putr("loading time: ", t1 - t0);   // 0.08 s
    putr("splitting time: ", t2 - t1); // 3.69 s with GC, 0.66 s without
    // Total running time with GC = 10.85 s
}


As you can see, the splitting time is about 3.7 seconds, but the program then needs about 7 more seconds to deallocate the memory it used.

I have written the same program in Python, without using Psyco, disabling the GC, or doing anything fancy:


from time import clock
t0 = clock()
txt = file("text.txt").read()  # 6.3 MB of text
t1 = clock()
words = txt.split()
t2 = clock()
print "loading time:", round(t1 - t0, 2) # 0.29 s
print "splitting time:", round(t2 - t1, 2) # 1.8 s
# Total running time = 3.18 s


Splitting is twice as fast, but that's not the most important thing: when the program finishes, it needs only about 1.1 seconds to deallocate its memory and terminate.
It seems there is some problem in the D GC (in my original D program, out of a total running time of about 40 seconds, 17 are spent on the final memory deallocation, while the same program in Python needs less than 2 seconds for the final memory and virtual-machine deallocation).

Tests performed on Win2K, 256 MB RAM, Pentium III 500 MHz, with Python 2.5.2 and DMD 1.028. My timings are approximate, but a 0.2 s resolution is more than enough to show the situation.

Bye,
bearophile
March 14, 2008
On Fri, 14 Mar 2008 04:38:23 +0200, bearophile <bearophileHUGS@lycos.com> wrote:

> While writing a program I noticed it was quite slow (about 35-40 seconds for a run that takes less than 8 seconds in Python). I have progressively simplified it down to the following D code, which shows a similar problem:

Splitting a huge block of text on whitespace produces an array comparable in size to the original text (a slice is 8 bytes, so if words are 6-7 characters long on average, the sizes are about the same). The slowdown appears because the GC needs to look up every one of those pointers, all of which point into the same large memory region.
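As a rough back-of-the-envelope check of that claim (a Python sketch; the 8-byte figure is a 32-bit D1 slice, i.e. a 4-byte pointer plus a 4-byte length, and the average word length is an assumption):

```python
# Estimate how big the slice array from split() is, relative to the text.
text_bytes = 6_300_000  # the 6.3 MB file from the original post
avg_word = 7            # assumed average word length, in characters
slice_bytes = 8         # a D1 slice on 32-bit: 4-byte ptr + 4-byte length

n_words = text_bytes // (avg_word + 1)   # +1 for the separating whitespace
slice_array_bytes = n_words * slice_bytes

print(n_words)            # 787500 words
print(slice_array_bytes)  # 6300000 bytes -- as big as the text itself,
                          # and every slice holds a pointer the GC must scan
```

So the mark phase has to classify on the order of 800,000 pointers, nearly all landing in the same big allocation.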

I made a patch for Tango's GC which caches the last page during the "mark" part of the scan. This reduces the execution time of your program tenfold for me, but it only helps when there are many consecutive pointers to the same GC page belonging to a big (>2048 bytes) allocation. There should be no noticeable performance penalty in the general case (the overhead is just one compare and sometimes an assignment), so I've created an enhancement ticket:

http://dsource.org/projects/tango/ticket/982

Interestingly, when I first compiled your program with Tango's runtime, it ran a lot slower. I found the bottleneck, then created a ticket with an explanation and patch here:

http://dsource.org/projects/tango/ticket/981

I also have a patch for Phobos (for expanding arrays, not the GC), but it doesn't yield a big improvement (about 13% faster).

-- 
Best regards,
 Vladimir                          mailto:thecybershadow@gmail.com
March 14, 2008
On Fri, 14 Mar 2008 09:54:22 +0200, Vladimir Panteleev <thecybershadow@gmail.com> wrote:

> I made a patch for Tango's GC which caches the last page during the "mark" part of the scan.

Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923

Improvement is also almost tenfold :)

-- 
Best regards,
 Vladimir                          mailto:thecybershadow@gmail.com
March 14, 2008
Vladimir Panteleev:

You are very fast :-)

>Splitting a huge block of text by whitespace will cause an array that's comparable to the size of the original text (a slice is 8 bytes, so if words are 6-7 characters long by average, the size would be the same).<

Note that in practical situations I don't use split() on such a big chunk of text. Instead I use xsplit(), the lazy version (which you can find in my D libs, and which I hope to see added to both the Python and D string libraries); it is "hugely" faster in such situations :-)
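bearophile's xsplit() itself isn't shown in the thread; a minimal lazy whitespace splitter along the same lines might look like this in Python (a sketch of the idea, not his actual implementation):

```python
def xsplit(text):
    """Yield whitespace-separated words one at a time,
    instead of building the whole list of slices up front."""
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                yield text[start:i]
                start = None
        elif start is None:
            start = i
    if start is not None:      # flush a trailing word
        yield text[start:]

# Only one word is materialized at a time:
words = xsplit("the quick  brown fox")
print(next(words))  # the
```

Because the words are produced on demand, there is never a multi-megabyte array of slices for the GC to scan.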


>Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923<

I'd like to try that code; can I do it?


>Improvement is also almost tenfold :)<

Sorry for the potentially stupid question: is this an improvement in the text splitting time, in the final garbage collection time when the program is finished, or is it just a blind sum of both?

General note: I think more people should compare the speed of programs written in other languages (Python, Java, OCaml, Clean, C with GCC, Perl, etc.) with the same programs written in D, because this often shows spots where D may need improvement. I know D is still a language in development, so too much early optimization may be bad, but I think a bit of care in this regard helps anyway.

Bye and thank you,
bearophile
March 14, 2008
On Fri, 14 Mar 2008 15:00:48 +0200, bearophile <bearophileHUGS@lycos.com> wrote:

>>Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923<
>
> I'd like to try that code; can I do it?

Yes, just apply the patch (to src/phobos/internal/gc/gcx.d) and rebuild Phobos (using the makefiles).

>>Improvement is also almost tenfold :)<
>
> Sorry for the potentially stupid question: is this an improvement in the text splitting time, in the final garbage collection time when the program is finished, or is it just a blind sum of both?

Of the total running time. The slowdown happens when D starts a garbage collection cycle while split() is running.

Here are the exact times for me:

== Before ==
loading time: 0.14
splitting time: 0.875
total execution time: 3.234 seconds.

== After ==
loading time: 0.063
splitting time: 0.156
total execution time: 0.358 seconds.

> General note: I think more people should compare the speed of programs written in other languages (Python, Java, OCaml, Clean, C with GCC, Perl, etc.) with the same programs written in D, because this often shows spots where D may need improvement. I know D is still a language in development, so too much early optimization may be bad, but I think a bit of care in this regard helps anyway.

Good optimizations often require significant and complicated changes to the code. Luckily this one doesn't :)

-- 
Best regards,
 Vladimir                          mailto:thecybershadow@gmail.com
March 15, 2008
Vladimir Panteleev wrote:
> Good optimizations often require significant and complicated changes
> to the code. Luckily this one doesn't :)


What exactly does the patch do?
March 15, 2008
Vladimir Panteleev:

>Yes, just apply the patch (to src/phobos/internal/gc/gcx.d) and rebuild Phobos (using the makefiles).<

I'm sure for most of you doing this is really easy, but for me it wasn't immediate. I have succeeded.

Phobos of DMD 1.028 for Windows compiled with the patch (I'll keep the files online for just a few days):
http://www.fantascienza.net/leonardo/phobos.7z
Or a bigger download (use the smaller one if possible):
http://www.fantascienza.net/leonardo/phobos.zip
I suggest interested people try this patch, to find out whether it improves (or worsens) code that uses the GC.


>Here are the exact times for me:<

My timings; the deallocation is about 9 times faster!

OLD TIMINGS, with GC:
loading time: 0.41 s
splitting time: 3.76 s
TOTAL time: 11.4 s

OLD TIMINGS, without GC:
loading time: 0.40 s
splitting time: 0.67 s
TOTAL time: 8.25 s


NEW TIMINGS, with GC:
loading time: 0.39 s
splitting time: 0.8 s
TOTAL time: 1.71 s

NEW TIMINGS, without GC:
loading time: 0.39 s
splitting time: 0.67 s
TOTAL time: 1.58 s


The generation of a long list of matches with a regex like the following goes from 7.63 s to 3.28 s:
auto txt = cast(string)read("text.txt");
auto words = RegExp("[a-z]+", "g").match(txt);


The "bintrees" benchmark of the Shootout version (with the version of the code that doesn't use manual memory management. The code is online in that site, look for my nick in the code) goes (with input n = 15) from 42.1s to 34.1s.


>Good optimizations often require significant and complicated changes to the code. Luckily this one doesn't :)<

I know, things aren't easy, but:
- Various tests have shown me that the D GC is (was? With this patch it may be a bit faster in some situations) up to 10-11 times slower than the GC of the latest Java, so I presume there's room for improvement.
- Comparing the speed of the same programs in D and in other fast languages (C with GCC, OCaml, Clean, etc.) helps you *know* the spots where performance is bad. Then you can leave things as they are, or try to improve them; otherwise you simply don't know, and you can't even start thinking about solving the problems. It's like a higher-level form of code (language) profiling :-)

And by the way, now the D program I was talking about is about 15% faster than the Python version ;-)

Bye and thank you,
bearophile
March 15, 2008
bearophile:
> My timings, the deallocation is about 9 times faster!

But watch the RAM usage too; to me it seems to use more...

Bye,
bearophile
March 15, 2008
On Sat, 15 Mar 2008 04:48:27 +0200, Walter Bright <newshound1@digitalmars.com> wrote:

> Vladimir Panteleev wrote:
>> Good optimizations often require significant and complicated changes to the code. Luckily this one doesn't :)
>
>
> What exactly does the patch do?

In gcx.mark(), it remembers the last pointer it examined that pointed to a B_PAGE or B_PAGEPLUS memory page. If the next value that looks like a pointer (is >=minAddr && <maxAddr) also points to the same memory page ((lastPointer & ~(PAGESIZE-1)) == (newPointer & ~(PAGESIZE-1))), it skips looking at it, thus skipping a call to findPool, looking up the page's bin size, and testing / setting the respective bits.

This greatly optimizes the GC for scanning areas that have consecutive pointers pointing to the same block of memory, e.g. slices of a bigger array. Note that I also changed "if (p >= minAddr)" to "if (p >= minAddr && p < maxAddr)" (which should also give a speed boost when scanning objects with static byte/ubyte/char arrays or floats).
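In outline, the idea can be sketched like this (a Python sketch of the logic, not the actual gcx.d code; the function and parameter names are illustrative, and the real patch applies the cache only to B_PAGE/B_PAGEPLUS pages of big allocations):

```python
PAGESIZE = 4096  # size of a GC memory page

def mark(candidates, min_addr, max_addr, find_pool):
    """Scan a range of possible pointer values, skipping the expensive
    pool lookup for consecutive pointers into the same GC page."""
    last_page = None
    for p in candidates:
        # Range test, including the added upper bound (p < maxAddr):
        if not (min_addr <= p < max_addr):
            continue
        page = p & ~(PAGESIZE - 1)       # round down to the page start
        if page == last_page:            # same page as the last pointer:
            continue                     # skip findPool / bin / bit work
        last_page = page
        find_pool(p)                     # the expensive lookup, now rare
```

For an array of slices into one big allocation, almost every candidate hits the cached page, so findPool runs a handful of times instead of once per slice.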

-- 
Best regards,
 Vladimir                          mailto:thecybershadow@gmail.com
March 15, 2008
On Sat, 15 Mar 2008 05:00:14 +0200, bearophile <bearophileHUGS@lycos.com> wrote:

> bearophile:
>> My timings, the deallocation is about 9 times faster!
>
> But watch the RAM used too, to me it seems to use more...

Odd, it shouldn't. The only memory overhead is one DWORD on the stack.

-- 
Best regards,
 Vladimir                          mailto:thecybershadow@gmail.com