
Ohforf sake

Member Since 04 Mar 2008
Offline Last Active May 30 2016 02:39 PM

#5244914 GCC auto vectorizer alignment issues

Posted by Ohforf sake on 07 August 2015 - 03:25 AM

Hi everyone,

I'm having a hard time getting the GCC auto vectorizer to auto vectorize. I believe the problem has to do with its ability to figure out the stride/alignment of pointers. Consider the following minimal (not) working example:
void func(const float *src, float *dst, const float *factors) {
    const float * __restrict__ alignedSrc = (const float *)__builtin_assume_aligned(src, 32);
    float * __restrict__ alignedDst = (float *)__builtin_assume_aligned(dst, 32);
    const float * __restrict__ unaliasedFactors = factors;

    enum {
        NUM_OUTER = 4,
        NUM_INNER = 32
    };

    for (unsigned k = 0; k < NUM_OUTER; k++) {
        const float factor = unaliasedFactors[k];

        const float * __restrict__ srcChunk = alignedSrc + k * NUM_INNER;
        float * __restrict__ dstChunk = alignedDst + k * NUM_INNER;

        for (int j = 0; j < NUM_INNER; j++)
            dstChunk[j] = srcChunk[j] * factor;
    }
}
It is two nested loops, sequentially looping over an array of size 32*4. It gets four factors and multiplies the first 32 elements by the first factor, the next 32 elements by the second, and so on. Results are stored sequentially in an output array. Now, I use "__builtin_assume_aligned" and "__restrict__" to tell the compiler that the arrays are 32-byte aligned and not aliased. This should be prime meat for a vectorizer. Sadly, the output looks like this:

(compiled with -march=native -ffast-math -std=c++14 -O3 on gcc 4.9.2)
0000000000000000 <_ZN2ml3mlp4funcEPKfPfS2_>:
   0:	4c 8d 54 24 08       	lea    0x8(%rsp),%r10
   5:	48 83 e4 e0          	and    $0xffffffffffffffe0,%rsp
   9:	49 89 f0             	mov    %rsi,%r8
   c:	41 ff 72 f8          	pushq  -0x8(%r10)
  10:	55                   	push   %rbp
  11:	48 89 f9             	mov    %rdi,%rcx
  14:	45 31 c9             	xor    %r9d,%r9d
  17:	48 89 e5             	mov    %rsp,%rbp
  1a:	41 56                	push   %r14
  1c:	41 55                	push   %r13
  1e:	41 54                	push   %r12
  20:	41 52                	push   %r10
  22:	53                   	push   %rbx
  23:	49 8d 40 20          	lea    0x20(%r8),%rax
  27:	c5 fa 10 02          	vmovss (%rdx),%xmm0
  2b:	48 39 c1             	cmp    %rax,%rcx
  2e:	73 0d                	jae    3d <_ZN2ml3mlp4funcEPKfPfS2_+0x3d>
  30:	48 8d 41 20          	lea    0x20(%rcx),%rax
  34:	49 39 c0             	cmp    %rax,%r8
  37:	0f 82 2b 02 00 00    	jb     268 <_ZN2ml3mlp4funcEPKfPfS2_+0x268>
  3d:	48 89 c8             	mov    %rcx,%rax
  40:	83 e0 1f             	and    $0x1f,%eax
  43:	48 c1 e8 02          	shr    $0x2,%rax
  47:	48 f7 d8             	neg    %rax
  4a:	83 e0 07             	and    $0x7,%eax
  4d:	0f 84 ed 01 00 00    	je     240 <_ZN2ml3mlp4funcEPKfPfS2_+0x240>
  53:	c5 fa 59 09          	vmulss (%rcx),%xmm0,%xmm1
  57:	c4 c1 7a 11 08       	vmovss %xmm1,(%r8)
  5c:	83 f8 01             	cmp    $0x1,%eax
  5f:	0f 84 2b 02 00 00    	je     290 <_ZN2ml3mlp4funcEPKfPfS2_+0x290>
  65:	c5 fa 59 49 04       	vmulss 0x4(%rcx),%xmm0,%xmm1
  6a:	c4 c1 7a 11 48 04    	vmovss %xmm1,0x4(%r8)
  70:	83 f8 02             	cmp    $0x2,%eax
  73:	0f 84 8f 02 00 00    	je     308 <_ZN2ml3mlp4funcEPKfPfS2_+0x308>
  79:	c5 fa 59 49 08       	vmulss 0x8(%rcx),%xmm0,%xmm1
  7e:	c4 c1 7a 11 48 08    	vmovss %xmm1,0x8(%r8)
  84:	83 f8 03             	cmp    $0x3,%eax
  87:	0f 84 63 02 00 00    	je     2f0 <_ZN2ml3mlp4funcEPKfPfS2_+0x2f0>
  8d:	c5 fa 59 49 0c       	vmulss 0xc(%rcx),%xmm0,%xmm1
  92:	c4 c1 7a 11 48 0c    	vmovss %xmm1,0xc(%r8)
  98:	83 f8 04             	cmp    $0x4,%eax
  9b:	0f 84 37 02 00 00    	je     2d8 <_ZN2ml3mlp4funcEPKfPfS2_+0x2d8>
  a1:	c5 fa 59 49 10       	vmulss 0x10(%rcx),%xmm0,%xmm1
  a6:	c4 c1 7a 11 48 10    	vmovss %xmm1,0x10(%r8)
  ac:	83 f8 05             	cmp    $0x5,%eax
  af:	0f 84 0b 02 00 00    	je     2c0 <_ZN2ml3mlp4funcEPKfPfS2_+0x2c0>
  b5:	c5 fa 59 49 14       	vmulss 0x14(%rcx),%xmm0,%xmm1
  ba:	c4 c1 7a 11 48 14    	vmovss %xmm1,0x14(%r8)
  c0:	83 f8 07             	cmp    $0x7,%eax
  c3:	0f 85 df 01 00 00    	jne    2a8 <_ZN2ml3mlp4funcEPKfPfS2_+0x2a8>
  c9:	c5 fa 59 49 18       	vmulss 0x18(%rcx),%xmm0,%xmm1
  ce:	41 bb 19 00 00 00    	mov    $0x19,%r11d
  d4:	41 ba 07 00 00 00    	mov    $0x7,%r10d
  da:	c4 c1 7a 11 48 18    	vmovss %xmm1,0x18(%r8)
  e0:	bb 20 00 00 00       	mov    $0x20,%ebx
  e5:	41 89 c5             	mov    %eax,%r13d
  e8:	41 bc 18 00 00 00    	mov    $0x18,%r12d
  ee:	29 c3                	sub    %eax,%ebx
  f0:	41 be 03 00 00 00    	mov    $0x3,%r14d
  f6:	4b 8d 04 a9          	lea    (%r9,%r13,4),%rax
  fa:	c4 e2 7d 18 c8       	vbroadcastss %xmm0,%ymm1
  ff:	4c 8d 2c 07          	lea    (%rdi,%rax,1),%r13
 103:	48 01 f0             	add    %rsi,%rax
 106:	c4 c1 74 59 55 00    	vmulps 0x0(%r13),%ymm1,%ymm2
 10c:	c5 fc 11 10          	vmovups %ymm2,(%rax)
 110:	c4 c1 74 59 55 20    	vmulps 0x20(%r13),%ymm1,%ymm2
 116:	c5 fc 11 50 20       	vmovups %ymm2,0x20(%rax)
 11b:	c4 c1 74 59 55 40    	vmulps 0x40(%r13),%ymm1,%ymm2
 121:	c5 fc 11 50 40       	vmovups %ymm2,0x40(%rax)
 126:	41 83 fe 04          	cmp    $0x4,%r14d
 12a:	75 0b                	jne    137 <_ZN2ml3mlp4funcEPKfPfS2_+0x137>
 12c:	c4 c1 74 59 4d 60    	vmulps 0x60(%r13),%ymm1,%ymm1
 132:	c5 fc 11 48 60       	vmovups %ymm1,0x60(%rax)
 137:	43 8d 04 22          	lea    (%r10,%r12,1),%eax
 13b:	45 89 da             	mov    %r11d,%r10d
 13e:	45 29 e2             	sub    %r12d,%r10d
 141:	44 39 e3             	cmp    %r12d,%ebx
 144:	0f 84 c5 00 00 00    	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 14a:	4c 63 d8             	movslq %eax,%r11
 14d:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 151:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 157:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 15d:	44 8d 58 01          	lea    0x1(%rax),%r11d
 161:	41 83 fa 01          	cmp    $0x1,%r10d
 165:	0f 84 a4 00 00 00    	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 16b:	4d 63 db             	movslq %r11d,%r11
 16e:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 172:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 178:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 17e:	44 8d 58 02          	lea    0x2(%rax),%r11d
 182:	41 83 fa 02          	cmp    $0x2,%r10d
 186:	0f 84 83 00 00 00    	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 18c:	4d 63 db             	movslq %r11d,%r11
 18f:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 193:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 199:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 19f:	44 8d 58 03          	lea    0x3(%rax),%r11d
 1a3:	41 83 fa 03          	cmp    $0x3,%r10d
 1a7:	74 66                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 1a9:	4d 63 db             	movslq %r11d,%r11
 1ac:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 1b0:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 1b6:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 1bc:	44 8d 58 04          	lea    0x4(%rax),%r11d
 1c0:	41 83 fa 04          	cmp    $0x4,%r10d
 1c4:	74 49                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 1c6:	4d 63 db             	movslq %r11d,%r11
 1c9:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 1cd:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 1d3:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 1d9:	44 8d 58 05          	lea    0x5(%rax),%r11d
 1dd:	41 83 fa 05          	cmp    $0x5,%r10d
 1e1:	74 2c                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 1e3:	4d 63 db             	movslq %r11d,%r11
 1e6:	83 c0 06             	add    $0x6,%eax
 1e9:	4f 8d 1c 99          	lea    (%r9,%r11,4),%r11
 1ed:	c4 a1 7a 59 0c 1f    	vmulss (%rdi,%r11,1),%xmm0,%xmm1
 1f3:	c4 a1 7a 11 0c 1e    	vmovss %xmm1,(%rsi,%r11,1)
 1f9:	41 83 fa 06          	cmp    $0x6,%r10d
 1fd:	74 10                	je     20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 1ff:	48 98                	cltq   
 201:	49 8d 04 81          	lea    (%r9,%rax,4),%rax
 205:	c5 fa 59 04 07       	vmulss (%rdi,%rax,1),%xmm0,%xmm0
 20a:	c5 fa 11 04 06       	vmovss %xmm0,(%rsi,%rax,1)
 20f:	49 83 e9 80          	sub    $0xffffffffffffff80,%r9
 213:	48 83 c2 04          	add    $0x4,%rdx
 217:	49 83 e8 80          	sub    $0xffffffffffffff80,%r8
 21b:	48 83 e9 80          	sub    $0xffffffffffffff80,%rcx
 21f:	49 81 f9 00 02 00 00 	cmp    $0x200,%r9
 226:	0f 85 f7 fd ff ff    	jne    23 <_ZN2ml3mlp4funcEPKfPfS2_+0x23>
 22c:	c5 f8 77             	vzeroupper 
 22f:	5b                   	pop    %rbx
 230:	41 5a                	pop    %r10
 232:	41 5c                	pop    %r12
 234:	41 5d                	pop    %r13
 236:	41 5e                	pop    %r14
 238:	5d                   	pop    %rbp
 239:	49 8d 62 f8          	lea    -0x8(%r10),%rsp
 23d:	c3                   	retq   
 23e:	66 90                	xchg   %ax,%ax
 240:	41 bc 20 00 00 00    	mov    $0x20,%r12d
 246:	41 be 04 00 00 00    	mov    $0x4,%r14d
 24c:	bb 20 00 00 00       	mov    $0x20,%ebx
 251:	45 31 ed             	xor    %r13d,%r13d
 254:	41 bb 20 00 00 00    	mov    $0x20,%r11d
 25a:	45 31 d2             	xor    %r10d,%r10d
 25d:	e9 94 fe ff ff       	jmpq   f6 <_ZN2ml3mlp4funcEPKfPfS2_+0xf6>
 262:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
 268:	31 c0                	xor    %eax,%eax
 26a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
 270:	c5 fa 59 0c 01       	vmulss (%rcx,%rax,1),%xmm0,%xmm1
 275:	c4 c1 7a 11 0c 00    	vmovss %xmm1,(%r8,%rax,1)
 27b:	48 83 c0 04          	add    $0x4,%rax
 27f:	48 3d 80 00 00 00    	cmp    $0x80,%rax
 285:	75 e9                	jne    270 <_ZN2ml3mlp4funcEPKfPfS2_+0x270>
 287:	eb 86                	jmp    20f <_ZN2ml3mlp4funcEPKfPfS2_+0x20f>
 289:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 290:	41 bb 1f 00 00 00    	mov    $0x1f,%r11d
 296:	41 ba 01 00 00 00    	mov    $0x1,%r10d
 29c:	e9 3f fe ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
 2a1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 2a8:	41 bb 1a 00 00 00    	mov    $0x1a,%r11d
 2ae:	41 ba 06 00 00 00    	mov    $0x6,%r10d
 2b4:	e9 27 fe ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
 2b9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 2c0:	41 bb 1b 00 00 00    	mov    $0x1b,%r11d
 2c6:	41 ba 05 00 00 00    	mov    $0x5,%r10d
 2cc:	e9 0f fe ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
 2d1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 2d8:	41 bb 1c 00 00 00    	mov    $0x1c,%r11d
 2de:	41 ba 04 00 00 00    	mov    $0x4,%r10d
 2e4:	e9 f7 fd ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
 2e9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 2f0:	41 bb 1d 00 00 00    	mov    $0x1d,%r11d
 2f6:	41 ba 03 00 00 00    	mov    $0x3,%r10d
 2fc:	e9 df fd ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
 301:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 308:	41 bb 1e 00 00 00    	mov    $0x1e,%r11d
 30e:	41 ba 02 00 00 00    	mov    $0x2,%r10d
 314:	e9 c7 fd ff ff       	jmpq   e0 <_ZN2ml3mlp4funcEPKfPfS2_+0xe0>
There is some vectorization happening there, but most of the code is scalar and looks like some kind of Duff's device. I played around with this and found out that the following "hint" produces the output that I want:
void func(const float *src, float *dst, const float *factors) {
    const float * __restrict__ alignedSrc = (const float *)__builtin_assume_aligned(src, 32);
    float * __restrict__ alignedDst = (float *)__builtin_assume_aligned(dst, 32);
    const float * __restrict__ unaliasedFactors = factors;

    enum {
        NUM_OUTER = 4,
        NUM_INNER = 32
    };

    for (unsigned k = 0; k < NUM_OUTER; k++) {
        const float factor = unaliasedFactors[k];

        const float * __restrict__ srcChunk = alignedSrc + k * NUM_INNER;
        float * __restrict__ dstChunk = alignedDst + k * NUM_INNER;
// <HINT>
        if (NUM_INNER % 8 == 0) { // the gcc tree vectorizer won't recognize this on its own?!?
            srcChunk = (const float *)__builtin_assume_aligned(srcChunk, 32);
            dstChunk = (float *)__builtin_assume_aligned(dstChunk, 32);
        }
// </HINT>

        for (int j = 0; j < NUM_INNER; j++)
            dstChunk[j] = srcChunk[j] * factor;
    }
}

0000000000000000 <_ZN2ml3mlp4funcEPKfPfS2_>:
   0:	48 8d 8f 00 02 00 00 	lea    0x200(%rdi),%rcx
   7:	48 8d 46 20          	lea    0x20(%rsi),%rax
   b:	c5 fa 10 02          	vmovss (%rdx),%xmm0
   f:	48 39 f8             	cmp    %rdi,%rax
  12:	76 09                	jbe    1d <_ZN2ml3mlp4funcEPKfPfS2_+0x1d>
  14:	48 8d 47 20          	lea    0x20(%rdi),%rax
  18:	48 39 f0             	cmp    %rsi,%rax
  1b:	77 43                	ja     60 <_ZN2ml3mlp4funcEPKfPfS2_+0x60>
  1d:	c4 e2 7d 18 c0       	vbroadcastss %xmm0,%ymm0
  22:	c5 fc 59 0f          	vmulps (%rdi),%ymm0,%ymm1
  26:	c5 fc 29 0e          	vmovaps %ymm1,(%rsi)
  2a:	c5 fc 59 4f 20       	vmulps 0x20(%rdi),%ymm0,%ymm1
  2f:	c5 fc 29 4e 20       	vmovaps %ymm1,0x20(%rsi)
  34:	c5 fc 59 4f 40       	vmulps 0x40(%rdi),%ymm0,%ymm1
  39:	c5 fc 29 4e 40       	vmovaps %ymm1,0x40(%rsi)
  3e:	c5 fc 59 47 60       	vmulps 0x60(%rdi),%ymm0,%ymm0
  43:	c5 fc 29 46 60       	vmovaps %ymm0,0x60(%rsi)
  48:	48 83 ef 80          	sub    $0xffffffffffffff80,%rdi
  4c:	48 83 c2 04          	add    $0x4,%rdx
  50:	48 83 ee 80          	sub    $0xffffffffffffff80,%rsi
  54:	48 39 cf             	cmp    %rcx,%rdi
  57:	75 ae                	jne    7 <_ZN2ml3mlp4funcEPKfPfS2_+0x7>
  59:	c5 f8 77             	vzeroupper 
  5c:	c3                   	retq   
  5d:	0f 1f 00             	nopl   (%rax)
  60:	31 c0                	xor    %eax,%eax
  62:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  68:	c5 fa 59 0c 07       	vmulss (%rdi,%rax,1),%xmm0,%xmm1
  6d:	c5 fa 11 0c 06       	vmovss %xmm1,(%rsi,%rax,1)
  72:	48 83 c0 04          	add    $0x4,%rax
  76:	48 3d 80 00 00 00    	cmp    $0x80,%rax
  7c:	75 ea                	jne    68 <_ZN2ml3mlp4funcEPKfPfS2_+0x68>
  7e:	eb c8                	jmp    48 <_ZN2ml3mlp4funcEPKfPfS2_+0x48>
This is more in line with what I wanted and it is actually twice as fast. In my real code, the speed difference is even bigger. Both versions produce correct output.
Note that for NUM_INNER % 8 == 0, alignedSrc + k * NUM_INNER is always 32-byte aligned iff alignedSrc is 32-byte aligned. This is something the compiler should be able to figure out on its own. Or am I missing something here?

Do you have any experience with this, or any advice on how to fix it without resorting to lots of hand crafted "hints" throughout the code? Do I really have to provide such alignment hints for every strided access that's happening?
Thanks in advance for any help or advice with this.

#5222568 C++ cant find a match for 16 bit float and how to convert 32 bit float to 16...

Posted by Ohforf sake on 11 April 2015 - 03:00 AM

For large amounts of data, there are also SIMD intrinsics that can do this:

half -> float: _mm_cvtph_ps and _mm256_cvtph_ps
float -> half: _mm_cvtps_ph and _mm256_cvtps_ph
see https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Oh, I just noticed you aren't doing this on a PC. But some ARM processors support similar conversion functions. See for example: https://gcc.gnu.org/onlinedocs/gcc/Half-Precision.html

#5222283 Relation between TFLOPS and Threads in a GPU?

Posted by Ohforf sake on 09 April 2015 - 01:13 PM

Peak performance (in FLoating point OPerations per Second = FLOPS) is the theoretical upper limit on how many computations a device can sustain per second. If a Titan X were doing nothing else than computing 1 + 2 * 3 then it could do that 3 072 000 000 000 times per second and since there are two operations in there (an addition and a multiplication) this amounts to 6 144 000 000 000 FLOPS or about 6.144 TFLOPS. But you only get that speed if you never read any data or write back any results or do anything else other than a multiply followed by an addition.


A "thread" (and Krohm rightfully warned of its use as a marketing buzzword) is generally understood to be an execution context. If a device executes a program, this refers to the current state, such as the current position in the program, the current values of the local variables, etc.


Threads and peak performance are two entirely different things!


Some compute devices (some Intel CPUs, some AMD CPUs, SUN niagara CPUs and most GPUs) can store more than one execution context aka "thread" on the chip so that they can interleave the execution of both/all of them. This sometimes falls under the term of "hardware-threads", at least for CPUs. And this is done for performance reasons. But it does not affect the theoretical peak performance of the device, only how much of that you can actually use. And the direct relationship between the maximum number of hardware threads, the used number of hardware threads, and the achieved performance ... is very complicated. It depends on lots of different factors like memory throughput, memory latency, access patterns, the actual algorithm, and so on.

So if this is what you are asking about, then you might have to look into how GPUs work and how certain algorithms make use of that.

#5203246 How can sample rate be 44000Hz?

Posted by Ohforf sake on 10 January 2015 - 06:26 AM

There is more to time-discretization than just the Nyquist theorem.
When you time-discretize a continuous signal, you essentially turn it into a stream of impulses, one impulse for each sample. To turn this stream of impulses back into a time-continuous signal, you need to low-pass filter it (at least mathematically speaking). Imagine the low-pass filter blurring out all the spikes of the impulses while keeping the general form of the signal intact.
You can show that if the highest frequencies in the original signal were below half the sampling rate, then all the additional frequencies due to the spiky impulses are above half the sampling rate. So (again mathematically speaking) the low-pass filter used for perfect reconstruction must let everything below half the sampling frequency pass undisturbed, but completely filter out everything above it. If you had such a filter (you can't build it) and if you had an infinitely long sample stream (the filter is non-causal and has an infinite response, so you need an infinitely long sample stream), then you can perfectly reconstruct everything, provided the original signal truly never exceeded half the sampling frequency. As Olof Hedman already pointed out, exactly half the sampling frequency is the point where it breaks apart: at that point, you can no longer differentiate between phase and amplitude. But if the frequency is a smidge lower, the infinite number of samples lets you perfectly reconstruct it.

In practice, you can't build a perfect low-pass filter (except, maybe, if the signal is periodic?). This means the filter actually being used will have, roughly speaking, three frequency regions: a low-frequency region which gets through undisturbed, a middle region where the amplitudes get damped, and a high-frequency region where the filter blocks. Depending on the "width" of the middle region, you must keep a margin between the highest frequencies in your original signal and half the sampling rate (essentially what Aressera already said).

Also note that sampling of a continuous signal has nothing to do with the cycles in a synchronous circuit.

#5201743 Linking Words as Conditional Statments

Posted by Ohforf sake on 04 January 2015 - 08:51 AM

You have to keep in mind that natural languages are rather universal in their purpose. They can be used to give orders, but they can also be used to explain things and transfer knowledge. (Most) programming languages only serve a single purpose: giving orders. You don't need to explain to the CPU why it should do something.

Hence, many of those linking words have no purpose in a programming language. In a language that describes knowledge (see ontology) things might be different.

#5182972 Normal map from height map

Posted by Ohforf sake on 25 September 2014 - 02:34 PM

Maybe these
tex2D(image, uv + off.xy).x
should be more along the lines of
tex2D(image, uv + off.xy * float2(1.0/textureWidth, 1.0/textureHeight)).x
at least if you are using normalized texture coordinates.

Also you need to output bump.xyz * 0.5 + 0.5 to get the colors of that image.

#5179652 And this is why you don't change names to lowercase

Posted by Ohforf sake on 11 September 2014 - 12:49 PM


Whether you're looking for a long and thin [...] or a thick dark mahogany [...] we have just the one for you.

We Specialize In Wood!

We have been hand-crafting [...] for nearly three decades and our designs have won multiple awards. From single [...] to bulk orders, virgin timber or reclaimed barn wood.

(sry, couldn't resist)

#5178576 Beginners Observation: Fundamental Lack of Source Code Examples?

Posted by Ohforf sake on 06 September 2014 - 01:32 PM

From the perspective of ~~a noob~~ pretty much anyone, actual production codes wouldn't help in learning at all. When ~~a noob~~ anyone will see the code, he won't understand half of the things going on and he will be like "what is this sorcery?!" Production codes will scare any ~~beginner~~ one away.

There I fixed that for you.

#5178343 GOTO, why are you adverse to using it

Posted by Ohforf sake on 05 September 2014 - 11:01 AM

I can skip a thousand lines of code with a single command and a single cache hit.

If you have 1000 LoC long functions, that is probably equally bad.

Something that I have seen, and I wonder if it's prejudice or actually founded, is that "older" programmers tend to distrust the compiler, specifically its optimization capabilities. It's probably because they have seen the really bad first generations of high-level language compilers. But somehow this seems to stick. I still see people trying to reuse local variables to save the compiler the trouble of expanding and reducing the stack frame (yes, authors of "Numerical Recipes", I'm looking at you). You don't need to do that anymore.
Similarly, I'm pretty sure I have seen a compiler reduce a series of break; statements into a single jump.

This means that you can actually use high level constructs like classes without having to fear immediate performance penalties. And with that comes the realization, as others pointed out, that there are a ton of different mechanisms to choose from so that GOTO simply isn't necessary anymore. I think the only place I ever used it was for error handling in pure C, similarly to what chingo wrote.

#5178277 One Buffer Vs. Multiple Buffers...

Posted by Ohforf sake on 05 September 2014 - 06:07 AM

And there are crazy things happening, f. ex. sometimes it's faster to reserve shared memory without using it.

This is actually not that uncommon. The problem is that the cores only have a limited amount of register space (64k per SMx core) which gets divided up by however many threads are running in parallel. So if you are running 1024 threads per SMx, every thread can use up to 64 registers. If you are running the maximum of 2048 threads, every thread only gets to use 32 registers. If more local variables are needed than registers are available, some registers are spilled onto the stack similarly to how it's done on the CPU. But contrary to the CPU, the GPU memory latencies are incredibly high so spilling stuff that is often needed onto the stack can increase the runtime.

Now shared memory is also a restricted resource (64KB per SMx on Kepler) but one that can't be spilled. So, if every block needs less than 2KB, you can get the maximum of 32 resident blocks per SMx. But if you increase the amount of reserved shared memory, let's say to just below 4KB, then you can only have 16 resident blocks. Now, halving the number of resident blocks also halves the total number of resident threads, so each thread has twice the number of registers at its disposal.

So, increasing the amount of reserved shared memory can decrease the number of resident blocks/threads, which increases the number of registers each thread can use, which can reduce register spilling and costly loads from the stack. I don't know about compute shaders, but for CUDA I believe the profiler can check for this.

#5178132 Beginners Observation: Fundamental Lack of Source Code Examples?

Posted by Ohforf sake on 04 September 2014 - 01:46 PM

I recently had to use SDL2 and to bootstrap things I googled for a minimal example to get things started. Turns out like 90% of all the example code out there is for SDL1 and outdated. A friend of mine, who started to work with OpenGL, had pretty much the same problem. He would search for tutorials/examples and also find them, but only later on realize that they were for OpenGL 1.2 and severely outdated.

The problem is that writing good, clean, and self-contained example code, especially for the more advanced stuff, takes a lot of time. And then a couple of years later, the world has moved on and all that work goes to waste.

#5178005 glut glew freeglut, what is diffrence?

Posted by Ohforf sake on 04 September 2014 - 03:02 AM

sorry but i have more questions. you said modern coding is about using glew library but i have worked on some project that only used glut or freeglut. and working on that was much easier. what i lose if i stop using glew.

Without glew, or any other library/code that does the same job, you "just" lose every advance in computer graphics of the last 16 years.
Often times, if you just need a quick visualization of something with a couple of lines, points, and triangles, then the stuff from 16 years ago is completely sufficient.

I think whether or not you *need* the new stuff is irrelevant. Rendering has changed significantly since 16 years ago, and if you invest precious time into learning, you might just as well not waste it on something that is dead and buried.

#5178003 my SIMD implementation is very slow :(

Posted by Ohforf sake on 04 September 2014 - 02:50 AM

Instead of using double-indirection [...]

It is actually a quadruple (4x) indirection: 1. idToIntersect[] 2. m_triBuffer[] 3. LocalTri->indiceX 4. m_vertices[]
This is incredibly bad because it "amplifies" your memory problems. CPUs always try to execute instructions either in parallel or at least overlapping if they don't depend on each other. For float addition, you need at least 3 independent (vector) additions (with AVX that is 3x8) to fully saturate an Ivy Bridge ALU. For multiplication it's 5 independent ones. The problem with pointer chasing like the above indirection is that everything depends on the result of a long chain of operations, where in turn every operation depends on the previous one. Computations cannot start until step (4), loading m_vertices[], has completed. That however can only start when the index is known, which means that loading LocalTri->indiceX must fully complete. That again can only start after loading from m_triBuffer[] has fully completed, and so on. Until this chain of operations is done, most of your CPU is idle because there is nothing to execute in parallel.
If you are lucky, everything is in the L1 cache and every load can be serviced in 4 cycles. Then it takes a total of 16 cycles before actual computations start, and remember: with AVX, 16 cycles are worth 128 floating point operations. Now let's assume that your triangle counts increase and all the stuff no longer fits into the L1 cache, but has to be loaded from the L2 cache. The L2 latency is 10 cycles, so just a 6-cycle increase; that's not a big deal. But since you have 4x indirection, you actually get 4x that 6-cycle increase. Assuming that everything is in L2, of course. If your triangle counts increase even more, you might have to load stuff from the L3. I don't know the latencies for the L3, but let's assume they are just 30 cycles. If all those loads hit the L3, then it takes you 4 x 30 cycles = 120 cycles before you even start computing. That is almost 1000 floating point operations wasted.

Of course some of those loads will probably always hit the L1, but pointer chasing / indirection is extremely bad, and it can severely amplify the effects of cache misses. Getting rid of that would be even higher on my priority list than reducing the size of the data structures.

As a side note, wouldn't it be easier to not simd-vectorize the vector operations, but instead to perform the computations for 8 or 16 rays simultaneously?

#5177655 For-loop-insanity

Posted by Ohforf sake on 02 September 2014 - 08:05 AM

Since we are trying to come up with new, obscure, and complicated ways for counting up, how about this:
#include <stdio.h>

template<typename Type>
class Range {
public:
        Range(const Type &first, const Type &last) : m_first(first), m_last(last) { }

        class const_iterator {
        public:
                const_iterator(Type curr) : m_current(curr) { }
                inline const Type &operator*() const { return m_current; }
                inline bool operator!=(const const_iterator &other) { return m_current != other.m_current; }
                inline const_iterator &operator++() { m_current++; return *this; }
        private:
                Type m_current;
        };

        inline const_iterator begin() const {
            return const_iterator(m_first);
        }
        inline const_iterator end() const {
            return const_iterator(m_last + 1);
        }
private:
        Type m_first, m_last;
};

int main()
{
    for (const auto i : Range<int>(0, 255))
        printf("%i\n", i);
}

#5177236 glut glew freeglut, what is diffrence?

Posted by Ohforf sake on 31 August 2014 - 11:50 AM

As a rule of thumb, when you want to use a static library you have to do three things:
1. Tell the compiler in which directory it can find the header files
2. Tell the linker in which directory it can find the library files
3. Tell the linker which library files to use.

Some tool chains allow 2+3 to be combined. Some allow the library files to be specified inside the header files via #pragma so that step 3 can be omitted. But again, as a rule of thumb, those are the three steps.

When the compiler complains that it can't find a header file, you missed step 1. When the linker complains that it can't find a library file, you either missed step 2 or messed up the library name. When the linker complains that it can't find certain symbols, as in your case, you probably missed step 3.

Looking at the current version of glew, the library files for VisualStudio are under lib/ in the glew-1.11.0-win32.zip file. There are different versions depending, amongst other things, on whether you are going for a 32-bit or 64-bit program.

Modern OpenGL refers to the newer versions of OpenGL. The newer versions have new API functions that you can call, but accessing them is a bit tricky and glew helps with that. Independent of which OpenGL version you are going for, using the newest version of glew is probably the best choice.