Thread overview
How can I make a program which uses all cores and 100% of cpu power?
Oct 11, 2019
Murilo
Oct 11, 2019
Daniel Kozak
Oct 11, 2019
Daniel Kozak
Oct 11, 2019
Ali Çehreli
Dec 06, 2019
Murilo
Oct 11, 2019
Russel Winder
Dec 06, 2019
Murilo
October 11, 2019
I have started working with neural networks and for that I need a lot of computing power but the programs I make only use around 30% of the cpu, or at least that is what Task Manager tells me. How can I make it use all 4 cores of my AMD FX-4300 and how can I make it use 100% of it?
October 11, 2019
On Fri, Oct 11, 2019 at 2:45 AM Murilo via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:
>
> I have started working with neural networks and for that I need a lot of computing power but the programs I make only use around 30% of the cpu, or at least that is what Task Manager tells me. How can I make it use all 4 cores of my AMD FX-4300 and how can I make it use 100% of it?

You should use at least as many threads as you have cores, so in your
case 4 or even more.
Then you should buy a new CPU if you really need a lot of computing
power :). Another possible issue is blocking IO, which leaves your
threads idle, so can stress your CPU.
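
A minimal sketch (not from this thread) of what "as many threads as cores" can look like with std.parallelism, which sizes its default task pool to the core count; the busy-loop body is a stand-in for real work:

```d
import std.parallelism;
import std.stdio;

void main() {
    auto results = new ulong[totalCPUs];
    // parallel() distributes the loop iterations across the default
    // task pool (one worker per core, including the calling thread),
    // so a CPU-bound body keeps every core busy.
    foreach (ref r; parallel(results)) {
        foreach (j; 0 .. 1_000_000)
            r += j;
    }
    writeln(results);
}
```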
October 11, 2019
On Fri, Oct 11, 2019 at 6:58 AM Daniel Kozak <kozzi11@gmail.com> wrote:
>
>  so can stress your CPU.

can't
October 10, 2019
On 10/10/2019 05:41 PM, Murilo wrote:
> I have started working with neural networks and for that I need a lot of
> computing power but the programs I make only use around 30% of the cpu,
> or at least that is what Task Manager tells me. How can I make it use
> all 4 cores of my AMD FX-4300 and how can I make it use 100% of it?

Your threads must allocate as little memory as possible because memory allocation can trigger garbage collection and garbage collection stops all threads (except the one that's performing collection).

We studied the effects of different allocation schemes during our last local D meetup[1]. The following program has two similar worker functions. One allocates in an inner scope, the other one uses a static Appender and clears its state as needed.

The program sets 'w' to 'worker' inside main(). Change it to 'worker2' to see a huge difference: On my 4-core laptop it's 100% versus 400% CPU usage.

import std.algorithm;
import std.array;        // needed for Appender and .array
import std.concurrency;
import std.parallelism;
import std.random;
import std.range;

enum inner_N = 100;

void worker() {
  ulong result;
  while (true) {
    int[] arr;
    foreach (j; 0 .. inner_N) {
      arr ~= uniform(0, 2);
    }
    result += arr.sum;
  }
}

void worker2() {
  ulong result;
  static Appender!(int[]) arr;
  while (true) {
    arr.clear();
    foreach (j; 0 .. inner_N) {
      arr ~= uniform(0, 2);
    }
    result += arr.data.sum;
  }
}

void main() {
  // Replace with 'worker2' to see the speedup
  alias w = worker;

  auto workers = totalCPUs.iota.map!(_ => spawn(&w)).array;

  w();
}

The static Appender is thread-safe because each thread gets its own copy, as data is thread-local by default in D. However, that doesn't mean the functions are reentrant: if they were called recursively, perhaps indirectly, the subsequent executions would corrupt previous executions' Appender states.
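
A hypothetical illustration of that reentrancy hazard (the function name and logic are invented, not from the program above): the recursive call clears the static buffer out from under the outer call, so the outer call returns the inner call's data.

```d
import std.array;

int[] fill(int depth) {
    static Appender!(int[]) buf;  // one buffer per thread, shared by all calls
    buf.clear();                  // wipes any in-progress outer call's data!
    foreach (i; 0 .. 3)
        buf ~= depth;
    if (depth > 0)
        fill(depth - 1);          // recursion corrupts buf
    return buf.data.dup;
}
// fill(1) returns [0, 0, 0], not [1, 1, 1]: the inner call's clear()
// destroyed the outer call's appended values.
```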

Ali

[1] https://www.meetup.com/D-Lang-Silicon-Valley/events/kmqcvqyzmbzb/ Are you someone in the Bay Area but do not come to our meetups? We've been eating your falafel wraps! ;)

October 11, 2019
On Fri, 2019-10-11 at 00:41 +0000, Murilo via Digitalmars-d-learn wrote:
> I have started working with neural networks and for that I need a lot of computing power but the programs I make only use around 30% of the cpu, or at least that is what Task Manager tells me. How can I make it use all 4 cores of my AMD FX-4300 and how can I make it use 100% of it?

Why do you want to get CPU utilisation to 100%?

I would have thought you'd want the neural net to be as fast as possible; that
does not necessarily imply that all CPU cycles must be used.

A neural net is, at its heart, a set of communicating nodes. This is as much
an I/O-bound model as it is a compute-bound one – nodes are generally waiting
for input as much as they are computing a value. The obvious solution
architecture for a small computer is to create a task per node on a thread
pool, with a few more threads in the pool than you have processors, and hope
that you can organise the communication between tasks so as to avoid cache
misses. This can be tricky when using multi-core processors. It gets even
worse when you have hyperthreads – many organisations doing CPU-bound
computations switch off hyperthreads as they cause more problems than they
solve.
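
A rough sketch of that task-per-node architecture using std.parallelism (the node function and node count are invented stand-ins; the real communication between nodes is the hard part and is not shown):

```d
import std.parallelism;

double computeNode(double input) {
    // stand-in for a node's real activation/computation work
    return input * 2.0;
}

void main() {
    // A few more threads in the pool than you have processors.
    auto pool = new TaskPool(totalCPUs + 2);
    scope(exit) pool.finish(true);

    // One task per node; here 8 hypothetical nodes.
    typeof(task!computeNode(0.0))[] nodes;
    foreach (i; 0 .. 8) {
        auto t = task!computeNode(cast(double) i);
        pool.put(t);
        nodes ~= t;
    }

    // Wait for each node's value; yieldForce lets the waiting thread
    // yield rather than spin.
    foreach (i, t; nodes)
        assert(t.yieldForce == i * 2.0);
}
```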

-- 
Russel.
===========================================
Dr Russel Winder      t: +44 20 7585 2200
41 Buckmaster Road    m: +44 7770 465 077
London SW11 1EN, UK   w: www.russel.org.uk



December 06, 2019
On Friday, 11 October 2019 at 06:18:03 UTC, Ali Çehreli wrote:
> Your threads must allocate as little memory as possible because memory allocation can trigger garbage collection and garbage collection stops all threads (except the one that's performing collection).
> We studied the effects of different allocation schemes during our last local D meetup[1]. The following program has two similar worker threads. One allocates in an inner scope, the other one uses a static Appender and clears its state as needed.
> The static Appender is thread-safe because each thread gets its own copy, as data is thread-local by default in D. However, that doesn't mean the functions are reentrant: if they were called recursively, perhaps indirectly, the subsequent executions would corrupt previous executions' Appender states.

Thanks for the information, it was very helpful.
December 06, 2019
On Friday, 11 October 2019 at 06:57:46 UTC, Russel Winder wrote:
> A neural net is, at its heart, a set of communicating nodes. This is as much
> an I/O bound model as it is compute bound one – nodes are generally waiting
> for input as much as they are computing a value. The obvious solution
> architecture for a small computer is to create a task per node on a thread
> pool, with a few more threads in the pool than you have processors, and hope
> that you can organise the communication between tasks so as to avoid cache
> misses. This can be tricky when using multi-core processors. It gets even
> worse when you have hyperthreads – many organisations doing CPU bound
> computations switch off hyperthreads as they cause more problems than they solve.

Thanks, that helped a lot. But I already figured out a new training algorithm that is a lot faster, so there's no need to use parallelism anymore.