gdc internals: weird way to translate "while" loops makes autovectorization impossible

Feb 06, 2007

Feb 25, 2007

Feb 26, 2007

At the moment, gdc seems to be incapable of vectorizing simple loops. Example. The following C program will, if compiled with gcc -O3 -ftree-vectorize -msse, generate SSE vectorized assembler code. This can be verified by compiling with -ftree-vectorizer-verbose=7. > #include "stdlib.h" > > int main() { > float*ptr=malloc(sizeof(float)*8); > float *tptr=ptr+8; > while (ptr!=tptr) { *ptr=1.0; ptr++; } > return 0; > } > I translated the program to D. Result: > import std.c.stdlib : malloc; > > int main() { > float*ptr=cast(float*)malloc(float.sizeof*8); > float *tptr=ptr+8; > while (ptr!=tptr) { *ptr=1.0; ptr++; } > return 0; > } > Using the same flags, gdc doesn't vectorize the loop. I took the gdc vectorization process apart, added a few fprintf statements, and got the following log: test.d.t77.vect > ;; Function main (_Dmain) > > > test.d:6: note: ===== analyze_loop_nest ===== > test.d:6: note: === vect_analyze_loop_form === > test.d:6: note: not vectorized: too many BBs in loop. > > I added that below. > dropping additional loop info ... > ;; > ;; Loop 1: > ;; header 4, latch 3 > ;; depth 1, level 1, outer 0 > ;; nodes: 4 3 2 > dropping block info ... > Block 4 > ;; basic block 4, loop depth 1, count 0 > ;; prev block 3, next block 5 > ;; pred: 3 [100.0%] (fallthru,exec) 1 [100.0%] (fallthru,exec) > ;; succ: 2 [100.0%] (fallthru,exec) > # ivtmp.27_1 = PHI <ivtmp.27_10(3), 8(1)>; > # TMT.5_18 = PHI <TMT.5_13(3), TMT.5_12(1)>; > # ptr_17 = PHI <ptr_9(3), ptr_3(1)>; > <L2>:; > # TMT.5_13 = V_MAY_DEF <TMT.5_18>; > *ptr_17 = 1.0e+0; > ptr_9 = ptr_17 + 4; > goto <bb 2> (<L1>); > > Block 3 > ;; basic block 3, loop depth 1, count 0 > ;; prev block 2, next block 4 > ;; pred: 2 [88.9%] (dfs_back,false,exec) > ;; succ: 4 [100.0%] (fallthru,exec) > <L11>:; > > Block 2 > ;; basic block 2, loop depth 1, count 0 > ;; prev block 1, next block 3 > ;; pred: 4 [100.0%] (fallthru,exec) > ;; succ: 5 [11.1%] (loop_exit,true,exec) 3 [88.9%] (dfs_back,false,exec) > <L1>:; > ivtmp.27_10 = ivtmp.27_1 - 1; > if (ivtmp.27_10 == 0) goto <L3>; else goto <L11>; > > Done > > From here on out, it's normal gdc output. > > test.d:6: note: bad loop form. > test.d:6: note: vectorized 0 loops in function. > main () > { > uintD.1160 ivtmp.27D.1244; > floatD.1163 * ptrD.1186; > floatD.1163 * tptrD.1189; > intD.1177 D.1196; > boolD.1173 D.1195; > boolD.1173 D.1194; > voidD.1154 * D.1191; > > # BLOCK 0 freq:1111, starting at line 4 > # PRED: ENTRY > [test.d : 4] D.1191_2 = [test.d : 4] malloc (32); > [test.d : 4] ptrD.1186_3 = (floatD.1163 *) D.1191_2; > [test.d : 5] tptrD.1189_4 = ptrD.1186_3 + 32; > [test.d : 6] if (ptrD.1186_3 == tptrD.1189_4) [test.d : 6] goto <L3>; else goto <L9>; > # SUCC: 5 1 > > # BLOCK 1 freq:988 > # PRED: 0 > <L9>:; > goto <bb 4> (<L2>); > # SUCC: 4 > > # BLOCK 2 freq:8889, starting at line 6 > # PRED: 4 > [test.d : 6] <L1>:; > ivtmp.27D.1244_10 = ivtmp.27D.1244_1 - 1; > [test.d : 6] if (ivtmp.27D.1244_10 == 0) [test.d : 6] goto <L3>; else goto <L11>; > # SUCC: 5 3 > > # BLOCK 3 freq:7901 > # PRED: 2 > <L11>:; > # SUCC: 4 > > # BLOCK 4 freq:8889, starting at line 6 > # PRED: 3 1 > # ivtmp.27D.1244_1 = PHI <ivtmp.27D.1244_10(3), 8(1)>; > # ptrD.1186_17 = PHI <ptrD.1186_9(3), ptrD.1186_3(1)>; > <L2>:; > [test.d : 6] *ptrD.1186_17 = 1.0e+0; > [test.d : 6] ptrD.1186_9 = ptrD.1186_17 + 4; > [test.d : 6] goto <bb 2> (<L1>); > # SUCC: 2 > > # BLOCK 5 freq:1111 > # PRED: 2 0 > <L3>:; > return 0; > # SUCC: EXIT > > } Note the block 3, which only consists of a label and nothing more. Apparently, having three BBs instead of two blocks gdc from optimizing the loop. See also, tree-vect-analyze.c:1861. I don't know enough about gdc to see where that empty block is generated, but could we get rid of it, possibly, please? Greetings .. downs

downs wrote: > At the moment, gdc seems to be incapable of vectorizing simple loops. > Example. > The following C program will, if compiled with gcc -O3 -ftree-vectorize -msse, generate SSE vectorized assembler code. > This can be verified by compiling with -ftree-vectorizer-verbose=7. >> #include "stdlib.h" >> >> int main() { >> float*ptr=malloc(sizeof(float)*8); >> float *tptr=ptr+8; >> while (ptr!=tptr) { *ptr=1.0; ptr++; } return 0; >> } >> > > I translated the program to D. Result: > >> import std.c.stdlib : malloc; >> >> int main() { >> float*ptr=cast(float*)malloc(float.sizeof*8); >> float *tptr=ptr+8; >> while (ptr!=tptr) { *ptr=1.0; ptr++; } return 0; >> } >> > > Using the same flags, gdc doesn't vectorize the loop. > I took the gdc vectorization process apart, added a few fprintf statements, and got the following log: > test.d.t77.vect > >> ;; Function main (_Dmain) >> >> >> test.d:6: note: ===== analyze_loop_nest ===== >> test.d:6: note: === vect_analyze_loop_form === >> test.d:6: note: not vectorized: too many BBs in loop. >>> I added that below. >> dropping additional loop info ... ;; >> ;; Loop 1: >> ;; header 4, latch 3 >> ;; depth 1, level 1, outer 0 >> ;; nodes: 4 3 2 >> dropping block info ... Block 4 >> ;; basic block 4, loop depth 1, count 0 >> ;; prev block 3, next block 5 >> ;; pred: 3 [100.0%] (fallthru,exec) 1 [100.0%] (fallthru,exec) >> ;; succ: 2 [100.0%] (fallthru,exec) >> # ivtmp.27_1 = PHI <ivtmp.27_10(3), 8(1)>; >> # TMT.5_18 = PHI <TMT.5_13(3), TMT.5_12(1)>; >> # ptr_17 = PHI <ptr_9(3), ptr_3(1)>; >> <L2>:; >> # TMT.5_13 = V_MAY_DEF <TMT.5_18>; >> *ptr_17 = 1.0e+0; >> ptr_9 = ptr_17 + 4; >> goto <bb 2> (<L1>); >> >> Block 3 >> ;; basic block 3, loop depth 1, count 0 >> ;; prev block 2, next block 4 >> ;; pred: 2 [88.9%] (dfs_back,false,exec) >> ;; succ: 4 [100.0%] (fallthru,exec) >> <L11>:; >> >> Block 2 >> ;; basic block 2, loop depth 1, count 0 >> ;; prev block 1, next block 3 >> ;; pred: 4 [100.0%] (fallthru,exec) >> ;; succ: 5 [11.1%] (loop_exit,true,exec) 3 [88.9%] (dfs_back,false,exec) >> <L1>:; >> ivtmp.27_10 = ivtmp.27_1 - 1; >> if (ivtmp.27_10 == 0) goto <L3>; else goto <L11>; >> >> Done >>> From here on out, it's normal gdc output. >> test.d:6: note: bad loop form. >> test.d:6: note: vectorized 0 loops in function. >> main () >> { >> uintD.1160 ivtmp.27D.1244; >> floatD.1163 * ptrD.1186; >> floatD.1163 * tptrD.1189; >> intD.1177 D.1196; >> boolD.1173 D.1195; >> boolD.1173 D.1194; >> voidD.1154 * D.1191; >> >> # BLOCK 0 freq:1111, starting at line 4 >> # PRED: ENTRY >> [test.d : 4] D.1191_2 = [test.d : 4] malloc (32); >> [test.d : 4] ptrD.1186_3 = (floatD.1163 *) D.1191_2; >> [test.d : 5] tptrD.1189_4 = ptrD.1186_3 + 32; >> [test.d : 6] if (ptrD.1186_3 == tptrD.1189_4) [test.d : 6] goto <L3>; else goto <L9>; >> # SUCC: 5 1 >> >> # BLOCK 1 freq:988 >> # PRED: 0 >> <L9>:; >> goto <bb 4> (<L2>); >> # SUCC: 4 >> >> # BLOCK 2 freq:8889, starting at line 6 >> # PRED: 4 >> [test.d : 6] <L1>:; >> ivtmp.27D.1244_10 = ivtmp.27D.1244_1 - 1; >> [test.d : 6] if (ivtmp.27D.1244_10 == 0) [test.d : 6] goto <L3>; else goto <L11>; >> # SUCC: 5 3 >> >> # BLOCK 3 freq:7901 >> # PRED: 2 >> <L11>:; >> # SUCC: 4 >> >> # BLOCK 4 freq:8889, starting at line 6 >> # PRED: 3 1 >> # ivtmp.27D.1244_1 = PHI <ivtmp.27D.1244_10(3), 8(1)>; >> # ptrD.1186_17 = PHI <ptrD.1186_9(3), ptrD.1186_3(1)>; >> <L2>:; >> [test.d : 6] *ptrD.1186_17 = 1.0e+0; >> [test.d : 6] ptrD.1186_9 = ptrD.1186_17 + 4; >> [test.d : 6] goto <bb 2> (<L1>); >> # SUCC: 2 >> >> # BLOCK 5 freq:1111 >> # PRED: 2 0 >> <L3>:; >> return 0; >> # SUCC: EXIT >> >> } > > Note the block 3, which only consists of a label and nothing more. Apparently, having three BBs instead of two blocks gdc from optimizing the loop. > See also, tree-vect-analyze.c:1861. > I don't know enough about gdc to see where that empty block is generated, but could we get rid of it, possibly, please? > Greetings .. downs > Any news on this? I believe the fact that a rather important optimization feature of GCC is completely broken on GDC merits at least a brief response. If you don't plan to fix this, then please, do tell me so. If I made some kind of obvious, plain mistake, then please do tell me so. But either way, I would really appreciate a response. Oh, btw: the following code serves to illustrate the problem further. > import std.c.stdlib : malloc; > > int main() { > float*ptr=cast(float*)malloc(float.sizeof*8); float*tptr=ptr+8; > version(spaghetti) { loop: *ptr=1.0; ptr++; if (ptr!=tptr) goto loop; } > else { do { *ptr=1.0; ptr++; } while (ptr!=tptr) } > return 0; > } One of these versions will get autovectorized, the other not. Anyway, greetings. --downs

February 26, 2007

Re: gdc internals: weird way to translate "while" loops makes autovectorization impossible

Posted by David Friedman
in reply to Downs

Permalink

David Friedman

Posted in reply to Downs

Permalink

Downs wrote:
> downs wrote:
> 
>> At the moment, gdc seems to be incapable of vectorizing simple loops.
>> Example.
>> The following C program will, if compiled with gcc -O3 -ftree-vectorize -msse, generate SSE vectorized assembler code.
>> This can be verified by compiling with -ftree-vectorizer-verbose=7.
>>
>>> #include "stdlib.h"
>>>
>>> int main() {
>>>  float*ptr=malloc(sizeof(float)*8);
>>>  float *tptr=ptr+8;
>>>  while (ptr!=tptr) { *ptr=1.0; ptr++; }  return 0;
>>> }
>>>
>>
>> I translated the program to D. Result:
>>
>>> import std.c.stdlib : malloc;
>>>
>>> int main() {
>>>  float*ptr=cast(float*)malloc(float.sizeof*8);
>>>  float *tptr=ptr+8;
>>>  while (ptr!=tptr) { *ptr=1.0; ptr++; }  return 0;
>>> }
>>>
>>
>> Using the same flags, gdc doesn't vectorize the loop.
>> I took the gdc vectorization process apart, added a few fprintf statements, and got the following log:
>> test.d.t77.vect
>>
>>> ;; Function main (_Dmain)
>>>
>>>
>>> test.d:6: note: ===== analyze_loop_nest =====
>>> test.d:6: note: === vect_analyze_loop_form ===
>>> test.d:6: note: not vectorized: too many BBs in loop.
>>>
>>>> I added that below.
>>>
>>> dropping additional loop info ... ;;
>>> ;; Loop 1:
>>> ;;  header 4, latch 3
>>> ;;  depth 1, level 1, outer 0
>>> ;;  nodes: 4 3 2
>>> dropping block info ...  Block 4
>>> ;; basic block 4, loop depth 1, count 0
>>> ;; prev block 3, next block 5
>>> ;; pred:       3 [100.0%]  (fallthru,exec) 1 [100.0%]  (fallthru,exec)
>>> ;; succ:       2 [100.0%]  (fallthru,exec)
>>> # ivtmp.27_1 = PHI <ivtmp.27_10(3), 8(1)>;
>>> # TMT.5_18 = PHI <TMT.5_13(3), TMT.5_12(1)>;
>>> # ptr_17 = PHI <ptr_9(3), ptr_3(1)>;
>>> <L2>:;
>>> #   TMT.5_13 = V_MAY_DEF <TMT.5_18>;
>>> *ptr_17 = 1.0e+0;
>>> ptr_9 = ptr_17 + 4;
>>> goto <bb 2> (<L1>);
>>>
>>>  Block 3
>>> ;; basic block 3, loop depth 1, count 0
>>> ;; prev block 2, next block 4
>>> ;; pred:       2 [88.9%]  (dfs_back,false,exec)
>>> ;; succ:       4 [100.0%]  (fallthru,exec)
>>> <L11>:;
>>>
>>>  Block 2
>>> ;; basic block 2, loop depth 1, count 0
>>> ;; prev block 1, next block 3
>>> ;; pred:       4 [100.0%]  (fallthru,exec)
>>> ;; succ:       5 [11.1%]  (loop_exit,true,exec) 3 [88.9%]  (dfs_back,false,exec)
>>> <L1>:;
>>> ivtmp.27_10 = ivtmp.27_1 - 1;
>>> if (ivtmp.27_10 == 0) goto <L3>; else goto <L11>;
>>>
>>> Done
>>>
>>>> From here on out, it's normal gdc output.
>>>
>>> test.d:6: note: bad loop form.
>>> test.d:6: note: vectorized 0 loops in function.
>>> main ()
>>> {
>>>   uintD.1160 ivtmp.27D.1244;
>>>   floatD.1163 * ptrD.1186;
>>>   floatD.1163 * tptrD.1189;
>>>   intD.1177 D.1196;
>>>   boolD.1173 D.1195;
>>>   boolD.1173 D.1194;
>>>   voidD.1154 * D.1191;
>>>
>>>   # BLOCK 0 freq:1111, starting at line 4
>>>   # PRED: ENTRY
>>>   [test.d : 4] D.1191_2 = [test.d : 4] malloc (32);
>>>   [test.d : 4] ptrD.1186_3 = (floatD.1163 *) D.1191_2;
>>>   [test.d : 5] tptrD.1189_4 = ptrD.1186_3 + 32;
>>>   [test.d : 6] if (ptrD.1186_3 == tptrD.1189_4) [test.d : 6] goto <L3>; else goto <L9>;
>>>   # SUCC: 5 1
>>>
>>>   # BLOCK 1 freq:988
>>>   # PRED: 0
>>> <L9>:;
>>>   goto <bb 4> (<L2>);
>>>   # SUCC: 4
>>>
>>>   # BLOCK 2 freq:8889, starting at line 6
>>>   # PRED: 4
>>> [test.d : 6] <L1>:;
>>>   ivtmp.27D.1244_10 = ivtmp.27D.1244_1 - 1;
>>>   [test.d : 6] if (ivtmp.27D.1244_10 == 0) [test.d : 6] goto <L3>; else goto <L11>;
>>>   # SUCC: 5 3
>>>
>>>   # BLOCK 3 freq:7901
>>>   # PRED: 2
>>> <L11>:;
>>>   # SUCC: 4
>>>
>>>   # BLOCK 4 freq:8889, starting at line 6
>>>   # PRED: 3 1
>>>   # ivtmp.27D.1244_1 = PHI <ivtmp.27D.1244_10(3), 8(1)>;
>>>   # ptrD.1186_17 = PHI <ptrD.1186_9(3), ptrD.1186_3(1)>;
>>> <L2>:;
>>>   [test.d : 6] *ptrD.1186_17 = 1.0e+0;
>>>   [test.d : 6] ptrD.1186_9 = ptrD.1186_17 + 4;
>>>   [test.d : 6] goto <bb 2> (<L1>);
>>>   # SUCC: 2
>>>
>>>   # BLOCK 5 freq:1111
>>>   # PRED: 2 0
>>> <L3>:;
>>>   return 0;
>>>   # SUCC: EXIT
>>>
>>> }
>>
>>
>> Note the block 3, which only consists of a label and nothing more. Apparently, having three BBs instead of two blocks gdc from optimizing the loop.
>> See also, tree-vect-analyze.c:1861.
>> I don't know enough about gdc to see where that empty block is generated, but could we get rid of it, possibly, please?
>> Greetings .. downs
>>
> 
> Any news on this? I believe the fact that a rather important optimization feature of GCC is completely broken on GDC merits at least a brief response.
> If you don't plan to fix this, then please, do tell me so.
> If I made some kind of obvious, plain mistake, then please do tell me so.
> But either way, I would really appreciate a response.
> 
> 
> Oh, btw: the following code serves to illustrate the problem further.
>  > import std.c.stdlib : malloc;
>  >
>  > int main() {
>  >  float*ptr=cast(float*)malloc(float.sizeof*8); float*tptr=ptr+8;
>  >  version(spaghetti) { loop: *ptr=1.0; ptr++; if (ptr!=tptr) goto loop; }
>  >  else { do { *ptr=1.0; ptr++; } while (ptr!=tptr) }
>  >  return 0;
>  > }
> One of these versions will get autovectorized, the other not.
> Anyway, greetings. --downs


I will fix this, but I am not doing any work on it presently.  My best guess so far is that the C compiler puts loop conditions at the end of the loop to reduce the number of basic blocks (but has an extra initial goto.)  This is the form that the vectorizer expects.

David

Forums