On Sunday, 26 December 2021 at 11:24:54 UTC, rikki cattermole wrote:
> I would start by removing the use of stdout in your loop kernel - I'm not familiar with what you are calculating, but if you can basically have the (parallel) loop operate from (say) one array directly into another then you can get extremely good parallel scaling with almost no effort.
I'm basically generating a default list of LFSRs for my Reed-Solomon codes. An LFSR can be used for pseudo-random numbers, but in this case it's to build a Galois field for error correction.
Using it is simple: you need a binary number that, when XORed in whenever a 1 bit exits the range, cycles through the maximum number of values (excluding zero). So if we do 4 bits (XOR value of 3) you'd get the following (there's a small code sketch of this cycle after the table):
0 0001 -- initial
0 0010
0 0100
0 1000
1 0011 <- 0000
0 0110
0 1100
1 1011 <- 1000
1 0101 <- 0110
0 1010
1 0111 <- 0100
0 1110
1 1111 <- 1100
1 1101 <- 1110
1 1001 <- 1010
1 0001 <- 0010 -- back to our initial value
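A quick throwaway D sketch of that 4-bit cycle (just to illustrate the table above, not my real code; the tap value is 3 and the "exit" bit is at position 4):

import std.stdio;

void main() {
    enum uint taps = 0b0011;      // the XOR value for 4 bits
    enum uint top  = 1u << 4;     // the bit that exits the range
    uint lfsr = 0b0001;           // initial value

    foreach (step; 0 .. 15) {     // 2^4 - 1 non-zero states, then it wraps
        writefln("%04b", lfsr);
        lfsr <<= 1;
        if (lfsr & top)           // a 1 bit left the range, fold it back in
            lfsr = (lfsr ^ taps) & (top - 1);
    }
}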
As such, the bulk of the work is done in this function. Other functions leading up to this mostly figure out which values to try, according to some rules I set, before doing the real work (quite a few only need 2 bits on).
bool testfunc(ulong value, ulong bitswide) {
    ulong cnt = 1, lfsr = 2, up = 1UL << bitswide;
    value |= up;                // setting the top bit eliminates the need to AND the result
    while (cnt < up && lfsr != 1) {
        lfsr <<= 1;
        if (lfsr & up)          // a 1 bit exited the range, fold it back in
            lfsr ^= value;
        cnt++;
    }
    return cnt == up - 1;       // true only if every non-zero state was visited
}
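For reference, checking it against the 4-bit table above (just a sanity check, not part of the real program):

unittest {
    assert( testfunc(0b0011, 4));   // 3 hits all 15 non-zero states
    assert(!testfunc(0b0101, 4));   // 5 falls into a shorter cycle
}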
//within main, cyclebits will call testfunc when value is calculated
foreach (bitwidth; taskPool.parallel(iota(start, end))) {
    for (ulong bitson = 2; bitson <= bitwidth; bitson += 1) {
        ulong v = cyclebits(bitwidth, bitson, &testfunc);
        if (v) {
            writeln("\t0x", cast(void*)v, ",\t/*", bitwidth, "*/"); //only place IO takes place
            break;
        }
    }
}
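If I did pull the IO out of the loop like suggested, I guess it would look something like this (untested sketch; each task only writes its own slot so nothing competes, and the printing all happens once at the end; needs the same imports as before plus writefln from std.stdio):

auto results = new ulong[end - start];      // one slot per bitwidth, nothing shared

foreach (bitwidth; taskPool.parallel(iota(start, end))) {
    for (ulong bitson = 2; bitson <= bitwidth; bitson += 1) {
        ulong v = cyclebits(bitwidth, bitson, &testfunc);
        if (v) {
            results[bitwidth - start] = v;  // each thread touches a different index
            break;
        }
    }
}

foreach (i, v; results)                     // all IO in one place, after the work
    writefln("\t0x%x,\t/*%s*/", v, i + start);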
rikki cattermole wrote:
> Your question at the moment doesn't really have much context to it so it's difficult to suggest where you should go directly.
I suppose if I started doing work where I'm sharing resources (probably memory), I would have to go with semaphores and locks. I remember trying to read up on how to use threads in C/C++ in the past, and it was such a headache to set up that I just gave up.
I assume it's best to divide the work up so it can be completed without competing for resources or causing race conditions, and in hefty enough chunks to make it worth the cost of spinning up the thread in the first place. So aside from the library documentation, is there a good source for learning/using parallelism and best practices? I'd love to be using more of this in the future if it isn't as big a blunder as it's made out to be.
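From what little I've read so far, if I really did have to share something, D's synchronized statement at least keeps the lock part short. Something like this made-up example (not from my program, the work is just a stand-in):

import std.parallelism : taskPool;
import std.range : iota;

void main() {
    ulong[] found;                  // shared between the tasks
    Object lock = new Object;       // any object can serve as the mutex

    foreach (n; taskPool.parallel(iota(2UL, 20UL))) {
        ulong v = n * n;            // stand-in for the real work
        synchronized (lock) {       // only the append is serialized
            found ~= v;
        }
    }
}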
> Not using stdout in the actual loop should make the code faster even without threads, because having a function call in the hot code will mean the compiler's optimizer will give up on certain transformations - i.e. do all the work as compactly as possible then output the data in one step at the end.
In this case I'm not sure how long each step takes, so I'm hoping intermediate results I can copy by hand will work (a result may take a second or several minutes). If this wasn't a brute-force elimination of so many combinations, I'm sure a different approach would work.
rikki cattermole wrote:
> On 27/12/2021 12:10 AM, max haughton wrote:
>> It'll speed it up significantly.
>
> Standard IO has locks in it. So you end up with all calculations grinding to a halt waiting for another thread to finish doing something.
I assume that's only when they are actually trying to use it? Early in the cycles (under 30) they were outputting quickly, but after 31 it can be minutes between results, and each thread (if I'm right) is working on a different number. So the ones whose result is 3, 5, or 9 are found pretty fast, while all the others have a lot of failures before I get a good result.