GCC auto vectorizer alignment issues

Hi everyone,

I'm having a hard time getting the GCC auto vectorizer to auto vectorize. I believe the problem has to do with its ability to figure out the stride/alignment of pointers. Consider the following minimal (not) working example:
void func(const float *src, float *dst, const float *factors) {
    const float * __restrict__ alignedSrc = (const float *)__builtin_assume_aligned(src, 32);
    float * __restrict__ alignedDst = (float *)__builtin_assume_aligned(dst, 32);
    const float * __restrict__ unaliasedFactors = factors;

    enum {
        NUM_OUTER = 4,
        NUM_INNER = 32
    };

    for (unsigned k = 0; k < NUM_OUTER; k++) {
        const float factor = unaliasedFactors[k];

        const float * __restrict__ srcChunk = alignedSrc + k * NUM_INNER;
        float * __restrict__ dstChunk = alignedDst + k * NUM_INNER;

        for (int j = 0; j < NUM_INNER; j++)
            dstChunk[j] = srcChunk[j] * factor;
    }
}
It is two nested loops, sequentially looping over an array of size 32*4. It takes four factors and multiplies the first 32 elements by the first factor, the next 32 elements by the second factor, and so on. Results are stored sequentially in an output array. I use "__builtin_assume_aligned" and "__restrict__" to tell the compiler that the arrays are 32-byte aligned and not aliased. This should be prime meat for a vectorizer. Sadly, the output looks like this:

(compiled with -march=native -ffast-math -std=c++14 -O3 on gcc 4.9.2)
0000000000000000 <_ZN2ml3mlp4funcEPKfPfS2_>:
   0:	4c 8d 54 24 08       	lea    0x8(%rsp),%r10
   5:	48 83 e4 e0          	and    $0xffffffffffffffe0,%rsp
   9:	49 89 f0             	mov    %rsi,%r8
   c:	41 ff 72 f8          	pushq  -0x8(%r10)
  10:	55                   	push   %rbp
  11:	48 89 f9             	mov    %rdi,%rcx
  14:	45 31 c9             	xor    %r9d,%r9d
  17:	48 89 e5             	mov    %rsp,%rbp
  1a:	41 56                	push   %r14
  1c:	41 55                	push   %r13
  1e:	41 54                	push   %r12
  20:	41 52                	push   %r10
  22:	53                   	push   %rbx
  23:	49 8d 40 20          	lea    0x20(%r8),%rax
  27:	c5 fa 10 02          	vmovss (%rdx),%xmm0
  2b:	48 39 c1             	cmp    %rax,%rcx
  2e:	73 0d                	jae    3d <_ZN2ml3mlp4funcEPKfPfS2_+0x3d>
  30:	48 8d 41 20          	lea    0x20(%rcx),%rax
  34:	49 39 c0             	cmp    %rax,%r8
  37:	0f 82 2b 02 00 00    	jb     268 <_ZN2ml3mlp4funcEPKfPfS2_+0x268>
  3d:	48 89 c8             	mov    %rcx,%rax
  40:	83 e0 1f             	and    $0x1f,%eax
  43:	48 c1 e8 02          	shr    $0x2,%rax
  47:	48 f7 d8             	neg    %rax
  4a:	83 e0 07             	and    $0x7,%eax
  4d:	0f 84 ed 01 00 00    	je     240 <_ZN2ml3mlp4funcEPKfPfS2_+0x240>
  53:	c5 fa 59 09          	vmulss (%rcx),%xmm0,%xmm1
  57:	c4 c1 7a 11 08       	vmovss %xmm1,(%r8)
  5c:	83 f8 01             	cmp    $0x1,%eax
  5f:	0f 84 2b 02 00 00    	je     290 <_ZN2ml3mlp4funcEPKfPfS2_+0x290>
  65:	c5 fa 59 49 04       	vmulss 0x4(%rcx),%xmm0,%xmm1
  6a:	c4 c1 7a 11 48 04    	vmovss %xmm1,0x4(%r8)
  70:	83 f8 02             	cmp    $0x2,%eax
  73:	0f 84 8f 02 00 00    	je     308 <_ZN2ml3mlp4funcEPKfPfS2_+0x308>
  79:	c5 fa 59 49 08       	vmulss 0x8(%rcx),%xmm0,%xmm1
  7e:	c4 c1 7a 11 48 08    	vmovss %xmm1,0x8(%r8)
  84:	83 f8 03             	cmp    $0x3,%eax
  87:	0f 84 63 02 00 00    	je     2f0 <_ZN2ml3mlp4funcEPKfPfS2_+0x2f0>
  8d:	c5 fa 59 49 0c       	vmulss 0xc(%rcx),%xmm0,%xmm1
  92:	c4 c1 7a 11 48 0c    	vmovss %xmm1,0xc(%r8)
  98:	83 f8 04             	cmp    $0x4,%eax
  9b:	0f 84 37 02 00 00    	je     2d8 <_ZN2ml3mlp4funcEPKfPfS2_+0x2d8>
  a1:	c5 fa 59 49 10       	vmulss 0x10(%rcx),%xmm0,%xmm1
  a6:	c4 c1 7a 11 48 10    	vmovss %xmm1,0x10(%r8)
  ac:	83 f8 05             	cmp    $0x5,%eax
  af:	0f 84 0b 02 00 00    	je     2c0 <_ZN2ml3mlp4funcEPKfPfS2_+0x2c0>
  b5:	c5 fa 59 49 14       	vmulss 0x14(%rcx),%xmm0,%xmm1
  ba:	c4 c1 7a 11 48 14    	vmovss %xmm1,0x14(%r8)
  c0:	83 f8 07             	cmp    $0x7,%eax
  c3:	0f 85 df 01 00 00    	jne    2a8 <_ZN2ml3mlp4funcEPKfPfS2_+0x2a8>
  c9:	c5 fa 59 49 18       	vmulss 0x18(%rcx),%xmm0,%xmm1
  ce:	41 bb 19 00 00 00    	mov    $0x19,%r11d
  d4:	41 ba 07 00 00 00    	mov    $0x7,%r10d
  da:	c4 c1 7a 11 48 18    	vmovss %xmm1,0x18(%r8)
  e0:	bb 20 00 00 00       	mov    $0x20,%ebx
  e5:	41 89 c5             	mov    %eax,%r13d
  e8:	41 bc 18 00 00 00    	mov    $0x18,%r12d
  ee:	29 c3                	sub    %eax,%ebx
  f0:	41 be 03 00 00 00    	mov    $0x3,%r14d
  f6:	4b 8d 04 a9          	lea    (%r9,%r13,4),%rax
  fa:	c4 e2 7d 18 c8       	vbroadcastss %xmm0,%ymm1
  ff:	4c 8d 2c 07          	lea    (%rdi,%rax,1),%r13
 103:	48 01 f0             	add    %rsi,%rax
 106:	c4 c1 74 59 55 00    	vmulps 0x0(%r13),%ymm1,%ymm2
 10c:	c5 fc 11 10          	vmovups %ymm2,(%rax)
 110:	c4 c1 74 59 55 20    	vmulps 0x20(%r13),%ymm1,%ymm2
 116:	c5 fc 11 50 20       	vmovups %ymm2,0x20(%rax)
 11b:	c4 c1 74 59 55 40    	vmulps 0x40(%r13),%ymm1,%ymm2
 121:	c5 fc 11 50 40       	vmovups %ymm2,0x40(%rax)
 126:	41 83 fe 04          	cmp    $0x4,%r14d
 12a:	75 0b                	jne    137 <_ZN2ml3mlp4funcEPKfPfS2_+0x137>
 12c:	c4 c1 74 59 4d 60    	vmulps 0x60(%r13),%ymm1,%ymm1
 132:	c5 fc 11 48 60       	vmovups %ymm1,0x60(%rax)
 137:	43 8d 04 22          	lea    (%r10,%r12,1),%eax
 13b:	45 89 da             	mov    %r11d,%r10d
 13e:	45 29 e2             	sub    %r12d,%r10d
 141:	44 39 e3             	cmp    %r12d,%ebx
 144:	0f 84 c5 00 00 00    	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 14a:	4c 63 d8             	movslq %eax,%r11
 14d:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 151:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 157:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 15d:	44 8d 58 01          	lea    0x1(%rax),%r11d
 161:	41 83 fa 01          	cmp    $0x1,%r10d
 165:	0f 84 a4 00 00 00    	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 16b:	4d 63 db             	movslq %r11d,%r11
 16e:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 172:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 178:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 17e:	44 8d 58 02          	lea    0x2(%rax),%r11d
 182:	41 83 fa 02          	cmp    $0x2,%r10d
 186:	0f 84 83 00 00 00    	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 18c:	4d 63 db             	movslq %r11d,%r11
 18f:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 193:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 199:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 19f:	44 8d 58 03          	lea    0x3(%rax),%r11d
 1a3:	41 83 fa 03          	cmp    $0x3,%r10d
 1a7:	74 66                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 1a9:	4d 63 db             	movslq %r11d,%r11
 1ac:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 1b0:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 1b6:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 1bc:	44 8d 58 04          	lea    0x4(%rax),%r11d
 1c0:	41 83 fa 04          	cmp    $0x4,%r10d
 1c4:	74 49                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 1c6:	4d 63 db             	movslq %r11d,%r11
 1c9:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 1cd:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 1d3:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 1d9:	44 8d 58 05          	lea    0x5(%rax),%r11d
 1dd:	41 83 fa 05          	cmp    $0x5,%r10d
 1e1:	74 2c                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 1e3:	4d 63 db             	movslq %r11d,%r11
 1e6:	83 c0 06             	add    $0x6,%eax
 1e9:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 1ed:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 1f3:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 1f9:	41 83 fa 06          	cmp    $0x6,%r10d
 1fd:	74 10                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 1ff:	48 98                	cltq   
 201:	49 8d 04 81          	lea    (%r9,%rax,4),%rax
 205:	c5 fa 59 04 07       	vmulss (%rdi,%rax,1),%xmm0,%xmm0
 20a:	c5 fa 11 04 06       	vmovss %xmm0,(%rsi,%rax,1)
 20f:	49 83 e9 80          	sub    $0xffffffffffffff80,%r9
 213:	48 83 c2 04          	add    $0x4,%rdx
 217:	49 83 e8 80          	sub    $0xffffffffffffff80,%r8
 21b:	48 83 e9 80          	sub    $0xffffffffffffff80,%rcx
 21f:	49 81 f9 00 02 00 00 	cmp    $0x200,%r9
 226:	0f 85 f7 fd ff ff    	jne    23 <_ZN2ml3mlp4funcEPKfPfS2_+0x23>
 22c:	c5 f8 77             	vzeroupper 
 22f:	5b                   	pop    %rbx
 230:	41 5a                	pop    %r10
 232:	41 5c                	pop    %r12
 234:	41 5d                	pop    %r13
 236:	41 5e                	pop    %r14
 238:	5d                   	pop    %rbp
 239:	49 8d 62 f8          	lea    -0x8(%r10),%rsp
 23d:	c3                   	retq   
 23e:	66 90                	xchg   %ax,%ax
 240:	41 bc 20 00 00 00    	mov    $0x20,%r12d
 246:	41 be 04 00 00 00    	mov    $0x4,%r14d
 24c:	bb 20 00 00 00       	mov    $0x20,%ebx
 251:	45 31 ed             	xor    %r13d,%r13d
 254:	41 bb 20 00 00 00    	mov    $0x20,%r11d
 25a:	45 31 d2             	xor    %r10d,%r10d
 25d:	e9 94 fe ff ff       	jmpq   f6 <_ZN2ml3mlp4funcEPKfPfS2_+0xf6>
 262:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
 268:	31 c0                	xor    %eax,%eax
 26a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
 270:	c5 fa 59 0c 01       	vmulss (%rcx,%rax,1),%xmm0,%xmm1
 275:	c4 c1 7a 11 0c 00    	vmovss %xmm1,(%r8,%rax,1)
 27b:	48 83 c0 04          	add    $0x4,%rax
 27f:	48 3d 80 00 00 00    	cmp    $0x80,%rax
 285:	75 e9                	jne    270 <_ZN2ml3mlp4funcEPKfPfS2_+0x270>
 287:	eb 86                	jmp    20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 289:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 290:	41 bb 1f 00 00 00    	mov    $0x1f,%r11d
 296:	41 ba 01 00 00 00    	mov    $0x1,%r10d
 29c:	e9 3f fe ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
 2a1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 2a8:	41 bb 1a 00 00 00    	mov    $0x1a,%r11d
 2ae:	41 ba 06 00 00 00    	mov    $0x6,%r10d
 2b4:	e9 27 fe ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
 2b9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 2c0:	41 bb 1b 00 00 00    	mov    $0x1b,%r11d
 2c6:	41 ba 05 00 00 00    	mov    $0x5,%r10d
 2cc:	e9 0f fe ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
 2d1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 2d8:	41 bb 1c 00 00 00    	mov    $0x1c,%r11d
 2de:	41 ba 04 00 00 00    	mov    $0x4,%r10d
 2e4:	e9 f7 fd ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
 2e9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 2f0:	41 bb 1d 00 00 00    	mov    $0x1d,%r11d
 2f6:	41 ba 03 00 00 00    	mov    $0x3,%r10d
 2fc:	e9 df fd ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
 301:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 308:	41 bb 1e 00 00 00    	mov    $0x1e,%r11d
 30e:	41 ba 02 00 00 00    	mov    $0x2,%r10d
 314:	e9 c7 fd ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
There is some vectorization happening there, but most of the code is scalar and looks like some kind of Duff's device. I played around with this and found out that the following "hint" produces the output that I want:
 
void func(const float *src, float *dst, const float *factors) {
    const float * __restrict__ alignedSrc = (const float *)__builtin_assume_aligned(src, 32);
    float * __restrict__ alignedDst = (float *)__builtin_assume_aligned(dst, 32);
    const float * __restrict__ unaliasedFactors = factors;

    enum {
        NUM_OUTER = 4,
        NUM_INNER = 32
    };

    for (unsigned k = 0; k < NUM_OUTER; k++) {
        const float factor = unaliasedFactors[k];

        const float * __restrict__ srcChunk = alignedSrc + k * NUM_INNER;
        float * __restrict__ dstChunk = alignedDst + k * NUM_INNER;

        // <HINT>
        if (NUM_INNER % 8 == 0) { // the gcc tree vectorizer won't recognize this on its own?!?
            srcChunk = (const float *)__builtin_assume_aligned(srcChunk, 32);
            dstChunk = (float *)__builtin_assume_aligned(dstChunk, 32);
        }
        // </HINT>

        for (int j = 0; j < NUM_INNER; j++)
            dstChunk[j] = srcChunk[j] * factor;
    }
}

0000000000000000 <_ZN2ml3mlp4funcEPKfPfS2_>:
   0:	48 8d 8f 00 02 00 00 	lea    0x200(%rdi),%rcx
   7:	48 8d 46 20          	lea    0x20(%rsi),%rax
   b:	c5 fa 10 02          	vmovss (%rdx),%xmm0
   f:	48 39 f8             	cmp    %rdi,%rax
  12:	76 09                	jbe    1d <_ZN2ml3mlp4funcEPKfPfS2_+0x1d>
  14:	48 8d 47 20          	lea    0x20(%rdi),%rax
  18:	48 39 f0             	cmp    %rsi,%rax
  1b:	77 43                	ja     60 <_ZN2ml3mlp4funcEPKfPfS2_+0x60>
  1d:	c4 e2 7d 18 c0       	vbroadcastss %xmm0,%ymm0
  22:	c5 fc 59 0f          	vmulps (%rdi),%ymm0,%ymm1
  26:	c5 fc 29 0e          	vmovaps %ymm1,(%rsi)
  2a:	c5 fc 59 4f 20       	vmulps 0x20(%rdi),%ymm0,%ymm1
  2f:	c5 fc 29 4e 20       	vmovaps %ymm1,0x20(%rsi)
  34:	c5 fc 59 4f 40       	vmulps 0x40(%rdi),%ymm0,%ymm1
  39:	c5 fc 29 4e 40       	vmovaps %ymm1,0x40(%rsi)
  3e:	c5 fc 59 47 60       	vmulps 0x60(%rdi),%ymm0,%ymm0
  43:	c5 fc 29 46 60       	vmovaps %ymm0,0x60(%rsi)
  48:	48 83 ef 80          	sub    $0xffffffffffffff80,%rdi
  4c:	48 83 c2 04          	add    $0x4,%rdx
  50:	48 83 ee 80          	sub    $0xffffffffffffff80,%rsi
  54:	48 39 cf             	cmp    %rcx,%rdi
  57:	75 ae                	jne    7 <_ZN2ml3mlp4funcEPKfPfS2_+0x7>
  59:	c5 f8 77             	vzeroupper 
  5c:	c3                   	retq   
  5d:	0f 1f 00             	nopl   (%rax)
  60:	31 c0                	xor    %eax,%eax
  62:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  68:	c5 fa 59 0c 07       	vmulss (%rdi,%rax,1),%xmm0,%xmm1
  6d:	c5 fa 11 0c 06       	vmovss %xmm1,(%rsi,%rax,1)
  72:	48 83 c0 04          	add    $0x4,%rax
  76:	48 3d 80 00 00 00    	cmp    $0x80,%rax
  7c:	75 ea                	jne    68 <_ZN2ml3mlp4funcEPKfPfS2_+0x68>
  7e:	eb c8                	jmp    48 <_ZN2ml3mlp4funcEPKfPfS2_+0x48>
This is more in line with what I wanted and it is actually twice as fast. In my real code, the speed difference is even bigger. Both versions produce correct output.
Note that when NUM_INNER % 8 == 0, each chunk starts at a byte offset of k * NUM_INNER * sizeof(float) (here k * 128 bytes), which is a multiple of 32, so alignedSrc + k * NUM_INNER is 32-byte aligned if and only if alignedSrc is. This is something the compiler should be able to figure out on its own. Or am I missing something here?

Do you have any experience with this, or any advice on how to fix it without resorting to lots of hand-crafted "hints" throughout the code? Do I really have to provide such an alignment hint for every strided access that happens?
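The only mitigation I can think of so far is to centralize the hint in a small helper, so the cast at least isn't repeated everywhere. Just a sketch of that idea (the helper name is made up, and it obviously doesn't answer why the hint is needed in the first place):

#include <cstddef>

// Hypothetical helper: wraps __builtin_assume_aligned so the alignment promise
// lives in one place instead of being scattered around as casts.
template <std::size_t Alignment, typename T>
static inline T *assume_aligned_ptr(T *p) {
    return static_cast<T *>(__builtin_assume_aligned(p, Alignment));
}

// Inside the outer loop, the hint then shrinks to:
//     srcChunk = assume_aligned_ptr<32>(srcChunk);
//     dstChunk = assume_aligned_ptr<32>(dstChunk);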
Thanks in advance for any help or advice with this.

If you really want to use those vector instructions and be guaranteed that they are being used, you need to use intrinsics.

 

The auto-vectorization provided by compilers, as Ravyne mentioned, should be viewed as a bonus if it occurs. If you must get vector instructions, it's far simpler to just write the intrinsics yourself than to jump through strange hoops to write C/C++ code that the auto-vectorizer will accept, which is a totally backwards (and unreliable) way of getting these instructions to be emitted. Additionally, I've found that it's just annoying to try to get the compiler to realize things that I know are true for my code (such as alignment or aliasing). Intrinsics are the way to go here.
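For illustration, here is a rough, untested sketch of your loop written directly with AVX intrinsics. It assumes the same preconditions your hinted version does (src/dst 32-byte aligned, NUM_INNER a multiple of 8):

#include <immintrin.h>

// Sketch only: the same computation as func(), hand-vectorized with AVX.
void func_avx(const float *src, float *dst, const float *factors) {
    enum { NUM_OUTER = 4, NUM_INNER = 32 };

    for (int k = 0; k < NUM_OUTER; k++) {
        const __m256 factor = _mm256_set1_ps(factors[k]);    // broadcast one factor
        const float *srcChunk = src + k * NUM_INNER;
        float *dstChunk = dst + k * NUM_INNER;

        for (int j = 0; j < NUM_INNER; j += 8) {
            const __m256 v = _mm256_load_ps(srcChunk + j);   // aligned 8-float load
            _mm256_store_ps(dstChunk + j, _mm256_mul_ps(v, factor));
        }
    }
}
Built with -mavx (or -march=native on an AVX machine), that should come out as essentially the same vmulps/vmovaps pairs per chunk as your hinted version, without any runtime alignment checks, and it keeps working regardless of what the vectorizer feels like doing.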

 

I tried compiling your code on g++ 4.8.4 with the options you give (except c++14, since my compiler is not new enough), and I get identical assembly for both functions; it is in the aligned form, with no prologue/epilogue to deal with unaligned values.

If you're relying on this code to be performant, and you have the resources and skills, then you really want to look at vector intrinsics (or linking against a vector routine written with... I forget the name, but it's a vector-aware C-like compiler, or perhaps an assembler, depending on which of those dependencies is more comfortable).

 

You might be thinking of ISPC.

Thank you all for the feedback. ISPC looks interesting, but sadly the code is part of an elaborate template mechanism right now, so ISPC isn't really an option there. But it looks like a tool worth keeping in your toolbox.

I was hoping that auto-vectorization had progressed further, after seeing some pretty impressive vectorizations for ARM-NEON. But given how fragile it is, in your experience as well, I guess I'll go back to intrinsics.

Intrinsics are probably your best bet WRT template mechanisms, but I see now the appeal of vectorization in your case -- the hope that auto-vectorization would produce optimal vector routines for different vector ISAs from common code, rather than providing different intrinsic code for different ISAs (I'm actually not sure whether ARM-NEON intrinsics are different from, say, SSE 4.x intrinsics -- but even if common functions share an identical name, you still have ISA capabilities and ISA-specific optimization patterns to account for, I would assume).

 

There are ways to mitigate the brittleness of auto-vectorization. One would be to create tooling that can determine whether auto-vectorization has regressed in new builds -- for example, by building that portion of the code into a kind of unit test where performance profiling is part of the pass/fail criteria, and by instructing the compiler to output assembly listings and comparing them to prior known-good vectorizations. You can automate that at least to the level of detecting anomalies and notifying a qualified human to take a closer look.
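As a very rough sketch of the performance-gate half of that idea (hypothetical names and a placeholder time budget that you would calibrate against a known-good build):

#include <chrono>
#include <cstdio>

// Declared elsewhere; the routine whose vectorization we want to guard.
void func(const float *src, float *dst, const float *factors);

// Hypothetical regression gate: time the hot routine and fail if it exceeds a
// budget measured on a build that is known to vectorize properly.
bool vectorization_not_regressed() {
    alignas(32) static float src[128], dst[128], factors[4];
    for (int i = 0; i < 128; i++) src[i] = 1.0f;
    for (int i = 0; i < 4; i++)   factors[i] = 2.0f;

    const int iterations = 1000000;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
        func(src, dst, factors);
    const double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    const double budget = 0.05;  // placeholder: seconds allowed for all iterations
    std::printf("func: %.4f s for %d iterations (budget %.4f s)\n",
                elapsed, iterations, budget);
    return elapsed <= budget;
}
The assembly-comparison half is easier to do outside the program (save the compiler's -S output for the hot translation unit and diff it against a known-good listing), so it isn't sketched here.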

 

Intrinsics will still avoid the fragility of auto-vectorization outright, and will give you full control of code patterns per ISA. But if you have a great many vector ISAs to support, and particularly if you aren't an expert in all of them, then perhaps auto-vectorization with tooling-supported defenses against vector code regressions could be at least part of the solution.
