
# Matias Goldberg

Member Since 02 Jul 2006

### In Topic: A quick way to determine if 8 small integers are all equal to each other

Today, 11:05 AM

> EDIT: For the 64bit trick, the array better be alignas(uint64_t), and I'm not 100% sure if that reinterpret_cast is actually legit in C++ (it would be undefined behavior in C at least due to strict aliasing requirements). It might be safer to work around it using unions, but the generated code seems fine.

Assuming the array is originally a char*, there is no violation of the strict aliasing rule, as char* is the only datatype that is allowed to alias anything. Though you are right that we would have to worry about the compiler ruining the performance side of it; it may be faster/safer to just keep the value as a uint64_t at all times and use "& 0xFF" to get the table index, due to language shenanigans.

### In Topic: A quick way to determine if 8 small integers are all equal to each other

Yesterday, 07:24 PM

Create a table with 256 entries, use 32-bit values in x86, 64-bit in x64; 128-bit with SSE.
The entries of the table would be:

```
#define MAKE_MASK( x ) (((x) << 24u) | ((x) << 16u) | ((x) << 8u) | (x))

const uint32_t table[256] =
{
    MAKE_MASK( 0 ),
    //...
};
```

Then just do a 32/64/128-bit compare:

```
uint8_t idx = ((uint8_t*)_array)[0];
if( table[idx] == ((uint32_t*)_array)[0] && table[idx] == ((uint32_t*)_array)[1] )
{
    //All 8 values are equal
}
```

### In Topic: Matrix 16 byte alignment

21 October 2016 - 10:51 PM

> Not in my experience, _mm_loadu_ps() was only a few % slower (maybe 1 cycle at most) than _mm_load_ps() when I did the benchmarks on an Intel i7, and that extra cost is not even measurable when the address is aligned. Use aligned loads whenever you can ensure alignment, but it seems like more of a micro-optimization. You'll save more time by thinking carefully about how to lay out data for better cache utilization so that you don't pay tens of cycles each memory access.

YMMV (your mileage may vary). Expensive, power-hungry CPUs like the Intel i7 have the lowest penalty, but on certain architectures the performance hit is big (Atom, AMD CPUs). This problem also comes back to bite you if you later port to other platforms (e.g. ARM).
Furthermore, how much slower it is depends on how well the CPU masks the penalty of the unaligned access. If you're hitting certain bottlenecks (such as bandwidth limits), the CPU won't be able to mask it well, and that 1% grows.

> You'll save more time by thinking carefully about how to lay out data for better cache utilization so that you don't pay tens of cycles each memory access.

Ensuring alignment is correct is part of carefully thinking about how to lay out the data. Furthermore, ensuring correct alignment takes literally seconds of programming work, if not less, and it doesn't make things unreadable or harder to maintain either.

### In Topic: How to understand GPU profiler data and use it to trace down suspicious abnor...

19 October 2016 - 12:03 PM

> For example, my GPU profiler told me that it takes 1.5ms for a compute shader to copy a 1080p R8G8B8A8 image from CPU writable memory to a same format Texture2D on default heap every frame. Does that sound normal given that I am using GTX680m?

You need to do the math. Get the maximum data transfer specs of your system: the GPU, the PCI-E bus, system RAM, etc. (theoretical on-paper specs are nice, whether you got them online or via a tool like GPU-Z; but it's much better to work with data from a specialized benchmark tool you ran on your own system). Once you've got the transfer speeds of your system, do the math and check whether you're hitting one of the limits. FYI, a 1080p RGBA8888 image needs 8 bits × 4 channels × 1920 × 1080 = 66,355,200 bits, which is 8,294,400 bytes, 8100 KB, or 7.9 MB.

### In Topic: CPU + GPU rendering with OpenGL

15 October 2016 - 11:08 PM

You'll need to use a PBO, keeping it mapped with persistent storage and using fences for synchronization.
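A rough sketch of what that setup looks like, using the standard GL 4.4 / ARB_buffer_storage entry points (fragment only; it assumes an existing GL context, texture, and a per-frame fence variable, so it is not runnable as-is):

```
// Persistently-mapped PBO: created and mapped once, reused every frame.
GLuint pbo;
glGenBuffers( 1, &pbo );
glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbo );
const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                         GL_MAP_COHERENT_BIT;
glBufferStorage( GL_PIXEL_UNPACK_BUFFER, size, nullptr, flags );
void *ptr = glMapBufferRange( GL_PIXEL_UNPACK_BUFFER, 0, size, flags );

// Each frame: wait on the fence from a previous frame before overwriting
// the region the GPU may still be reading, then write and kick the upload.
glClientWaitSync( fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutNs );
memcpy( ptr, cpuPixels, size );
glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, width, height,
                 GL_RGBA, GL_UNSIGNED_BYTE, nullptr ); // reads from bound PBO
fence = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );
```

In practice you'd use a ring of 2–3 regions within the buffer so the CPU writes one region while the GPU reads another.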

There is one thing I don't get though:

> Only GPU rendering with fixed textures takes 4.0 ms which is okay for OpenGL but then I cannot write freely to the depth buffer unless there is an extension for that. Copying back from fake depth buffers all the time would stall the GPU while waiting for the output as the next input texture.

I assume you don't know about gl_FragDepth?

What is that depth buffer manipulation you do? Why do you need it? What are you trying to achieve?
