# GCC auto vectorizer alignment issues


## Recommended Posts

Hi everyone,

I'm having a hard time getting the GCC auto vectorizer to auto vectorize. I believe the problem has to do with its ability to figure out the stride/alignment of pointers. Consider the following minimal (non-)working example:
void func(const float *src, float *dst, const float *factors) {
    const float * __restrict__ alignedSrc = (const float *)__builtin_assume_aligned(src, 32);
    float * __restrict__ alignedDst = (float *)__builtin_assume_aligned(dst, 32);
    const float * __restrict__ unaliasedFactors = factors;

    enum {
        NUM_OUTER = 4,
        NUM_INNER = 32
    };

    for (unsigned k = 0; k < NUM_OUTER; k++) {
        const float factor = unaliasedFactors[k];

        const float * __restrict__ srcChunk = alignedSrc + k * NUM_INNER;
        float * __restrict__ dstChunk = alignedDst + k * NUM_INNER;

        for (int j = 0; j < NUM_INNER; j++)
            dstChunk[j] = srcChunk[j] * factor;
    }
}

It's two nested loops that walk sequentially over an array of 32*4 elements. The code takes four factors and multiplies the first 32 elements by the first factor, the next 32 elements by the second, and so on. Results are stored sequentially in an output array. I use "__builtin_assume_aligned" and "__restrict__" to tell the compiler that the arrays are 32-byte aligned and not aliased. This should be prime meat for a vectorizer. Sadly, the output looks like this:

(compiled with -march=native -ffast-math -std=c++14 -O3 on gcc 4.9.2)
0000000000000000 <_ZN2ml3mlp4funcEPKfPfS2_>:
0:	4c 8d 54 24 08       	lea    0x8(%rsp),%r10
5:	48 83 e4 e0          	and    $0xffffffffffffffe0,%rsp
9:	49 89 f0             	mov    %rsi,%r8
c:	41 ff 72 f8          	pushq  -0x8(%r10)
10:	55                   	push   %rbp
11:	48 89 f9             	mov    %rdi,%rcx
14:	45 31 c9             	xor    %r9d,%r9d
17:	48 89 e5             	mov    %rsp,%rbp
1a:	41 56                	push   %r14
1c:	41 55                	push   %r13
1e:	41 54                	push   %r12
20:	41 52                	push   %r10
22:	53                   	push   %rbx
23:	49 8d 40 20          	lea    0x20(%r8),%rax
27:	c5 fa 10 02          	vmovss (%rdx),%xmm0
2b:	48 39 c1             	cmp    %rax,%rcx
2e:	73 0d                	jae    3d <_ZN2ml3mlp4funcEPKfPfS2_+0x3d>
30:	48 8d 41 20          	lea    0x20(%rcx),%rax
34:	49 39 c0             	cmp    %rax,%r8
37:	0f 82 2b 02 00 00    	jb     268 <_ZN2ml3mlp4funcEPKfPfS2_+0x268>
3d:	48 89 c8             	mov    %rcx,%rax
40:	83 e0 1f             	and    $0x1f,%eax
43:	48 c1 e8 02          	shr    $0x2,%rax
47:	48 f7 d8             	neg    %rax
4a:	83 e0 07             	and    $0x7,%eax
4d:	0f 84 ed 01 00 00    	je     240 <_ZN2ml3mlp4funcEPKfPfS2_+0x240>
53:	c5 fa 59 09          	vmulss (%rcx),%xmm0,%xmm1
57:	c4 c1 7a 11 08       	vmovss %xmm1,(%r8)
5c:	83 f8 01             	cmp    $0x1,%eax
5f:	0f 84 2b 02 00 00    	je     290 <_ZN2ml3mlp4funcEPKfPfS2_+0x290>
65:	c5 fa 59 49 04       	vmulss 0x4(%rcx),%xmm0,%xmm1
6a:	c4 c1 7a 11 48 04    	vmovss %xmm1,0x4(%r8)
70:	83 f8 02             	cmp    $0x2,%eax
73:	0f 84 8f 02 00 00    	je     308 <_ZN2ml3mlp4funcEPKfPfS2_+0x308>
79:	c5 fa 59 49 08       	vmulss 0x8(%rcx),%xmm0,%xmm1
7e:	c4 c1 7a 11 48 08    	vmovss %xmm1,0x8(%r8)
84:	83 f8 03             	cmp    $0x3,%eax
87:	0f 84 63 02 00 00    	je     2f0 <_ZN2ml3mlp4funcEPKfPfS2_+0x2f0>
8d:	c5 fa 59 49 0c       	vmulss 0xc(%rcx),%xmm0,%xmm1
92:	c4 c1 7a 11 48 0c    	vmovss %xmm1,0xc(%r8)
98:	83 f8 04             	cmp    $0x4,%eax
9b:	0f 84 37 02 00 00    	je     2d8 <_ZN2ml3mlp4funcEPKfPfS2_+0x2d8>
a1:	c5 fa 59 49 10       	vmulss 0x10(%rcx),%xmm0,%xmm1
a6:	c4 c1 7a 11 48 10    	vmovss %xmm1,0x10(%r8)
ac:	83 f8 05             	cmp    $0x5,%eax
af:	0f 84 0b 02 00 00    	je     2c0 <_ZN2ml3mlp4funcEPKfPfS2_+0x2c0>
b5:	c5 fa 59 49 14       	vmulss 0x14(%rcx),%xmm0,%xmm1
ba:	c4 c1 7a 11 48 14    	vmovss %xmm1,0x14(%r8)
c0:	83 f8 07             	cmp    $0x7,%eax
c3:	0f 85 df 01 00 00    	jne    2a8 <_ZN2ml3mlp4funcEPKfPfS2_+0x2a8>
c9:	c5 fa 59 49 18       	vmulss 0x18(%rcx),%xmm0,%xmm1
ce:	41 bb 19 00 00 00    	mov    $0x19,%r11d
d4:	41 ba 07 00 00 00    	mov    $0x7,%r10d
da:	c4 c1 7a 11 48 18    	vmovss %xmm1,0x18(%r8)
e0:	bb 20 00 00 00       	mov    $0x20,%ebx
e5:	41 89 c5             	mov    %eax,%r13d
e8:	41 bc 18 00 00 00    	mov    $0x18,%r12d
ee:	29 c3                	sub    %eax,%ebx
f0:	41 be 03 00 00 00    	mov    $0x3,%r14d
f6:	4b 8d 04 a9          	lea    (%r9,%r13,4),%rax
fa:	c4 e2 7d 18 c8       	vbroadcastss %xmm0,%ymm1
ff:	4c 8d 2c 07          	lea    (%rdi,%rax,1),%r13
103:	48 01 f0             	add    %rsi,%rax
106:	c4 c1 74 59 55 00    	vmulps 0x0(%r13),%ymm1,%ymm2
10c:	c5 fc 11 10          	vmovups %ymm2,(%rax)
110:	c4 c1 74 59 55 20    	vmulps 0x20(%r13),%ymm1,%ymm2
116:	c5 fc 11 50 20       	vmovups %ymm2,0x20(%rax)
11b:	c4 c1 74 59 55 40    	vmulps 0x40(%r13),%ymm1,%ymm2
121:	c5 fc 11 50 40       	vmovups %ymm2,0x40(%rax)
126:	41 83 fe 04          	cmp    $0x4,%r14d
12a:	75 0b                	jne    137 <_ZN2ml3mlp4funcEPKfPfS2_+0x137>
12c:	c4 c1 74 59 4d 60    	vmulps 0x60(%r13),%ymm1,%ymm1
132:	c5 fc 11 48 60       	vmovups %ymm1,0x60(%rax)
137:	43 8d 04 22          	lea    (%r10,%r12,1),%eax
13b:	45 89 da             	mov    %r11d,%r10d
13e:	45 29 e2             	sub    %r12d,%r10d
141:	44 39 e3             	cmp    %r12d,%ebx
144:	0f 84 c5 00 00 00    	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
14a:	4c 63 d8             	movslq %eax,%r11
14d:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
151:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
157:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
15d:	44 8d 58 01          	lea    0x1(%rax),%r11d
161:	41 83 fa 01          	cmp    $0x1,%r10d
165:	0f 84 a4 00 00 00    	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
16b:	4d 63 db             	movslq %r11d,%r11
16e:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
172:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
178:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
17e:	44 8d 58 02          	lea    0x2(%rax),%r11d
182:	41 83 fa 02          	cmp    $0x2,%r10d
186:	0f 84 83 00 00 00    	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
18c:	4d 63 db             	movslq %r11d,%r11
18f:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
193:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
199:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
19f:	44 8d 58 03          	lea    0x3(%rax),%r11d
1a3:	41 83 fa 03          	cmp    $0x3,%r10d
1a7:	74 66                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
1a9:	4d 63 db             	movslq %r11d,%r11
1ac:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
1b0:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
1b6:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
1bc:	44 8d 58 04          	lea    0x4(%rax),%r11d
1c0:	41 83 fa 04          	cmp    $0x4,%r10d
1c4:	74 49                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
1c6:	4d 63 db             	movslq %r11d,%r11
1c9:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
1cd:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
1d3:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
1d9:	44 8d 58 05          	lea    0x5(%rax),%r11d
1dd:	41 83 fa 05          	cmp    $0x5,%r10d
1e1:	74 2c                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
1e3:	4d 63 db             	movslq %r11d,%r11
1e6:	83 c0 06             	add    $0x6,%eax
1e9:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
1ed:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
1f3:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
1f9:	41 83 fa 06          	cmp    $0x6,%r10d
1fd:	74 10                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
1ff:	48 98                	cltq
201:	49 8d 04 81          	lea    (%r9,%rax,4),%rax
205:	c5 fa 59 04 07       	vmulss (%rdi,%rax,1),%xmm0,%xmm0
20a:	c5 fa 11 04 06       	vmovss %xmm0,(%rsi,%rax,1)
20f:	49 83 e9 80          	sub    $0xffffffffffffff80,%r9
213:	48 83 c2 04          	add    $0x4,%rdx
217:	49 83 e8 80          	sub    $0xffffffffffffff80,%r8
21b:	48 83 e9 80          	sub    $0xffffffffffffff80,%rcx
21f:	49 81 f9 00 02 00 00 	cmp    $0x200,%r9
226:	0f 85 f7 fd ff ff    	jne    23 <_ZN2ml3mlp4funcEPKfPfS2_+0x23>
22c:	c5 f8 77             	vzeroupper
22f:	5b                   	pop    %rbx
230:	41 5a                	pop    %r10
232:	41 5c                	pop    %r12
234:	41 5d                	pop    %r13
236:	41 5e                	pop    %r14
238:	5d                   	pop    %rbp
239:	49 8d 62 f8          	lea    -0x8(%r10),%rsp
23d:	c3                   	retq
23e:	66 90                	xchg   %ax,%ax
240:	41 bc 20 00 00 00    	mov    $0x20,%r12d
246:	41 be 04 00 00 00    	mov    $0x4,%r14d
24c:	bb 20 00 00 00       	mov    $0x20,%ebx
251:	45 31 ed             	xor    %r13d,%r13d
254:	41 bb 20 00 00 00    	mov    $0x20,%r11d
25a:	45 31 d2             	xor    %r10d,%r10d
25d:	e9 94 fe ff ff       	jmpq   f6 <_ZN2ml3mlp4funcEPKfPfS2_+0xf6>
262:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
268:	31 c0                	xor    %eax,%eax
26a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
270:	c5 fa 59 0c 01       	vmulss (%rcx,%rax,1),%xmm0,%xmm1
275:	c4 c1 7a 11 0c 00    	vmovss %xmm1,(%r8,%rax,1)
27b:	48 83 c0 04          	add    $0x4,%rax
27f:	48 3d 80 00 00 00    	cmp    $0x80,%rax
285:	75 e9                	jne    270 <_ZN2ml3mlp4funcEPKfPfS2_+0x270>
287:	eb 86                	jmp    20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
289:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
290:	41 bb 1f 00 00 00    	mov    $0x1f,%r11d
296:	41 ba 01 00 00 00    	mov    $0x1,%r10d
29c:	e9 3f fe ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
2a1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
2a8:	41 bb 1a 00 00 00    	mov    $0x1a,%r11d
2ae:	41 ba 06 00 00 00    	mov    $0x6,%r10d
2b4:	e9 27 fe ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
2b9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
2c0:	41 bb 1b 00 00 00    	mov    $0x1b,%r11d
2c6:	41 ba 05 00 00 00    	mov    $0x5,%r10d
2cc:	e9 0f fe ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
2d1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
2d8:	41 bb 1c 00 00 00    	mov    $0x1c,%r11d
2de:	41 ba 04 00 00 00    	mov    $0x4,%r10d
2e4:	e9 f7 fd ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
2e9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
2f0:	41 bb 1d 00 00 00    	mov    $0x1d,%r11d
2f6:	41 ba 03 00 00 00    	mov    $0x3,%r10d
2fc:	e9 df fd ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
301:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
308:	41 bb 1e 00 00 00    	mov    $0x1e,%r11d
30e:	41 ba 02 00 00 00    	mov    $0x2,%r10d
314:	e9 c7 fd ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>

There is some vectorization happening there, but most of the code is scalar and looks like some kind of Duff's device. I played around with this and found out that the following "hint" produces the output that I want:

void func(const float *src, float *dst, const float *factors) {
    const float * __restrict__ alignedSrc = (const float *)__builtin_assume_aligned(src, 32);
    float * __restrict__ alignedDst = (float *)__builtin_assume_aligned(dst, 32);
    const float * __restrict__ unaliasedFactors = factors;

    enum {
        NUM_OUTER = 4,
        NUM_INNER = 32
    };

    for (unsigned k = 0; k < NUM_OUTER; k++) {
        const float factor = unaliasedFactors[k];

        const float * __restrict__ srcChunk = alignedSrc + k * NUM_INNER;
        float * __restrict__ dstChunk = alignedDst + k * NUM_INNER;

        // <HINT>
        if (NUM_INNER % 8 == 0) { // the gcc tree vectorizer won't recognize this on its own?!?
            srcChunk = (const float *)__builtin_assume_aligned(srcChunk, 32);
            dstChunk = (float *)__builtin_assume_aligned(dstChunk, 32);
        }
        // </HINT>

        for (int j = 0; j < NUM_INNER; j++)
            dstChunk[j] = srcChunk[j] * factor;
    }
}


0000000000000000 <_ZN2ml3mlp4funcEPKfPfS2_>:
0:	48 8d 8f 00 02 00 00 	lea    0x200(%rdi),%rcx
7:	48 8d 46 20          	lea    0x20(%rsi),%rax
b:	c5 fa 10 02          	vmovss (%rdx),%xmm0
f:	48 39 f8             	cmp    %rdi,%rax
12:	76 09                	jbe    1d <_ZN2ml3mlp4funcEPKfPfS2_+0x1d>
14:	48 8d 47 20          	lea    0x20(%rdi),%rax
18:	48 39 f0             	cmp    %rsi,%rax
1b:	77 43                	ja     60 <_ZN2ml3mlp4funcEPKfPfS2_+0x60>
1d:	c4 e2 7d 18 c0       	vbroadcastss %xmm0,%ymm0
22:	c5 fc 59 0f          	vmulps (%rdi),%ymm0,%ymm1
26:	c5 fc 29 0e          	vmovaps %ymm1,(%rsi)
2a:	c5 fc 59 4f 20       	vmulps 0x20(%rdi),%ymm0,%ymm1
2f:	c5 fc 29 4e 20       	vmovaps %ymm1,0x20(%rsi)
34:	c5 fc 59 4f 40       	vmulps 0x40(%rdi),%ymm0,%ymm1
39:	c5 fc 29 4e 40       	vmovaps %ymm1,0x40(%rsi)
3e:	c5 fc 59 47 60       	vmulps 0x60(%rdi),%ymm0,%ymm0
43:	c5 fc 29 46 60       	vmovaps %ymm0,0x60(%rsi)
48:	48 83 ef 80          	sub    $0xffffffffffffff80,%rdi
4c:	48 83 c2 04          	add    $0x4,%rdx
50:	48 83 ee 80          	sub    $0xffffffffffffff80,%rsi
54:	48 39 cf             	cmp    %rcx,%rdi
57:	75 ae                	jne    7 <_ZN2ml3mlp4funcEPKfPfS2_+0x7>
59:	c5 f8 77             	vzeroupper
5c:	c3                   	retq
5d:	0f 1f 00             	nopl   (%rax)
60:	31 c0                	xor    %eax,%eax
62:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
68:	c5 fa 59 0c 07       	vmulss (%rdi,%rax,1),%xmm0,%xmm1
6d:	c5 fa 11 0c 06       	vmovss %xmm1,(%rsi,%rax,1)
72:	48 83 c0 04          	add    $0x4,%rax
76:	48 3d 80 00 00 00    	cmp    $0x80,%rax
7c:	75 ea                	jne    68 <_ZN2ml3mlp4funcEPKfPfS2_+0x68>
7e:	eb c8                	jmp    48 <_ZN2ml3mlp4funcEPKfPfS2_+0x48>

This is more in line with what I wanted, and it is actually twice as fast. In my real code, the speed difference is even bigger. Both versions produce correct output.
Note that for NUM_INNER % 8 == 0, alignedSrc + k * NUM_INNER is always 32-byte aligned if and only if alignedSrc is 32-byte aligned. This is something the compiler should be able to figure out on its own. Or am I missing something here?
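A quick sketch of the offset arithmetic behind that claim (illustrative only, not from the original post): each chunk starts k * NUM_INNER floats past the base pointer, i.e. k * 32 * sizeof(float) = k * 128 bytes, and 128 is a multiple of 32.

```cpp
// Compile-time restatement of the argument above (illustrative only):
// with NUM_INNER == 32 floats, the chunk stride is 128 bytes, so every
// chunk inherits the 32-byte alignment of the base pointer.
static_assert(32 * sizeof(float) == 128, "chunk stride is 128 bytes");
static_assert(32 * sizeof(float) % 32 == 0, "stride preserves 32-byte alignment");
```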

Do you have any experience with this, or any advice on how to fix it without resorting to lots of hand-crafted "hints" throughout the code? Do I really have to provide such alignment hints for every strided access that happens?
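One way to keep such hints manageable, sketched here as a suggestion rather than something taken from the thread: wrap the builtin in a small helper so each strided chunk needs only one call, and ask GCC itself why a loop was not vectorized.

```cpp
// Hypothetical helper (my own naming) that centralizes the alignment hint;
// __builtin_assume_aligned returns void*, so cast back to the original type.
template <typename T>
static inline T *assume_aligned32(T *p) {
    return static_cast<T *>(__builtin_assume_aligned(p, 32));
}

// Usage inside the loop from the post:
//   const float *srcChunk = assume_aligned32(alignedSrc + k * NUM_INNER);
//   float       *dstChunk = assume_aligned32(alignedDst + k * NUM_INNER);
```

Compiling with -fopt-info-vec-missed (available in GCC of roughly this vintage, as far as I know) also makes the vectorizer report which loops it gave up on and why, which helps confirm whether alignment or aliasing is the actual blocker.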

##### Share on other sites

If you really want to use those vector instructions and be guaranteed that they are being used, you need to use intrinsics.

The auto-vectorization provided by compilers, as Ravyne mentioned, should be viewed as a bonus if it occurs. If you must get vector instructions, it's far simpler to just write the intrinsics yourself than to jump through strange hoops writing C/C++ code that the auto-vectorizer will accept, which is a totally backwards (and unreliable) way of getting these instructions emitted. Additionally, I've found that it's just annoying to try to get the compiler to realize things that I know are true for my code (such as alignment or aliasing). Intrinsics are the way to go here.
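For reference, a minimal AVX intrinsics version of the kernel from the original post might look like the sketch below (my own code, not from the thread; it assumes src and dst are 32-byte aligned, that NUM_INNER stays a multiple of 8, and that you compile with -mavx or -march=native):

```cpp
#include <immintrin.h>

void func_intrin(const float *src, float *dst, const float *factors) {
    enum { NUM_OUTER = 4, NUM_INNER = 32 };

    for (unsigned k = 0; k < NUM_OUTER; k++) {
        const __m256 factor = _mm256_set1_ps(factors[k]); // broadcast one factor
        const float *srcChunk = src + k * NUM_INNER;
        float *dstChunk = dst + k * NUM_INNER;

        for (int j = 0; j < NUM_INNER; j += 8) {
            __m256 v = _mm256_load_ps(srcChunk + j);      // aligned 8-float load
            _mm256_store_ps(dstChunk + j, _mm256_mul_ps(v, factor));
        }
    }
}
```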

I tried compiling your code on g++ 4.8.4 with the options you give (except -std=c++14, since my compiler is not new enough), and I get identical assembly for both functions; the assembly is in the aligned form, with no prologue/epilogue to deal with unaligned values.

##### Share on other sites

If you're relying on this code to be performant, and you have the resources and skills, then you really want to look at vector intrinsics (or linking in a vector routine written with... I forget the name, but it's a vector-aware C-like compiler, or perhaps an assembler, depending on which of those dependencies is more comfortable).

You might be thinking of ISPC

##### Share on other sites

You might be thinking of ISPC

Yep, that's the one.

##### Share on other sites
Thank you all for the feedback. ISPC looks interesting, but sadly the code is part of an elaborate template mechanism right now, so ISPC isn't really an option there. But it looks like a tool worth keeping in your toolbox.

I was hoping that auto vectorization had progressed further, after seeing some pretty impressive vectorizations for ARM-NEON. But given how fragile it is, in your experience as well, I guess I'll go back to intrinsics.

##### Share on other sites

Intrinsics are probably your best bet WRT template mechanisms, but I do see the appeal of auto-vectorization in your case -- the hope that it would produce optimal vector routines for different vector ISAs from common code, rather than requiring different intrinsic code for each ISA (I'm actually not sure whether ARM-NEON intrinsics are different from, say, SSE 4.x intrinsics -- but even if common functions share an identical name, you still have ISA capabilities and ISA-specific optimization patterns to deal with, I would assume).

There are ways to mitigate the brittleness of auto-vectorization -- one would be to create tooling that can determine whether auto-vectorization has regressed in new builds: for example, by building that portion of the code against a kind of unit test where performance profiling is part of the pass/fail criteria, and by instructing the compiler to output assembly listings and comparing them against prior known-good vectorizations. You can automate that at least to the level of detecting anomalies and notifying a qualified human to take a closer look. A rough sketch of the timing side of that idea follows below.
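As an illustration of the pass/fail idea (entirely a sketch; the budget value and names are made up, not from the thread), the performance guard can be as small as timing the kernel and failing when it drifts past a budget recorded from a known-good build:

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>

// The kernel under test from the original post, defined elsewhere.
void func(const float *src, float *dst, const float *factors);

int main() {
    // 32-byte aligned buffers, matching the kernel's alignment assumption.
    alignas(32) static float src[128];
    alignas(32) static float dst[128];
    float factors[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    for (int i = 0; i < 128; i++) src[i] = float(i);

    const int iterations = 1000000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
        func(src, dst, factors);
    double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    // Budget taken from a build whose assembly was checked by hand;
    // the value here is a placeholder, not a real measurement.
    const double budgetSeconds = 0.05;
    std::printf("elapsed: %.4f s (budget %.4f s)\n", elapsed, budgetSeconds);
    return elapsed <= budgetSeconds ? EXIT_SUCCESS : EXIT_FAILURE;
}
```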

Intrinsics will still avoid the fragility of auto-vectorization outright and will give you full control of code patterns per ISA, but if you have a great many vector ISAs to support, and particularly if you aren't an expert in all of them, then auto-vectorization with tooling-supported defenses against vector code regressions could be at least part of the solution.
