
Ohforf sake

Member Since 04 Mar 2008
Online Last Active Today, 04:31 PM
-----

Posts I've Made

In Topic: Physically Correct "Bloom"

Yesterday, 03:09 AM

I disagree that biological eyes and CCD cameras work the same in this regard. I think the opposite is the case: if you want photo realism or eye realism, you should make up your mind about which of the two you are after. For example, I have never observed those beautiful lens flares you get from multi-lens cameras with my single-lens eyes.

 

Photo realism has the great benefit that you can .. well .. take pictures of it. :-)

 

In case you are interested in "eye realism", take a look at this paper. They try to model what is actually going on in biological eyes. And the results actually look the way the world looks to me when I'm drunk, so they can't be that far off ^^

 

http://www.mpi-inf.mpg.de/~ritschel/Papers/TemporalGlare.pdf


In Topic: How do I use the source code I have downloaded?

20 May 2013 - 02:01 PM

You can't build large projects without the build system for which they were intended. Some require unusual pre- or post-processing steps, such as compiling Lua wrappers.

 

For small things, you could try the following:

 

do {
  compile all files
  if (missing header files)
        install the library and add its include search directories to the compiler arguments
} while (! all files compile)

do {
   link all files into a binary
   if (linker is missing function implementations)
        find the corresponding library and add it to the linker arguments
} while (! linking was successful)
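
The "if" steps in the two loops above can be partly automated. Here is a rough Python sketch, not a real build tool: it only pulls the missing header names and unresolved symbols out of the error text. It assumes the usual GCC/Clang and GNU ld message wording, and the function names are made up for illustration. Finding which library provides a header or symbol is still up to you.

```python
import re

# GCC prints:   fatal error: foo.h: No such file or directory
# Clang prints: fatal error: 'foo.h' file not found
_MISSING_HEADER = re.compile(
    r"fatal error: (?:'([^']+)' file not found"
    r"|(\S+): No such file or directory)"
)

# GNU ld prints: undefined reference to `symbol'
_UNDEFINED_SYMBOL = re.compile(r"undefined reference to [`']([^']+)'")


def missing_headers(compiler_output):
    """Return the header names the compiler could not find, in order, no duplicates."""
    found = []
    for match in _MISSING_HEADER.finditer(compiler_output):
        name = match.group(1) or match.group(2)
        if name not in found:
            found.append(name)
    return found


def undefined_symbols(linker_output):
    """Return the symbols the linker could not resolve, in order, no duplicates."""
    found = []
    for match in _UNDEFINED_SYMBOL.finditer(linker_output):
        if match.group(1) not in found:
            found.append(match.group(1))
    return found
```

You would feed these the captured stderr of the compile or link step, fix whatever they report, and run the step again until the lists come back empty.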


In Topic: CUDA function calling, what's the best approach?

14 May 2013 - 11:23 PM

Textures are actually really good in CUDA, so I agree with jms bc: you should use them if possible. They are a lot easier to handle than shared memory, and you get caching, interpolation, and handling of out-of-bounds sampling for free.

Textures aren't really read-only, but there is a catch. You cannot write to a texture's memory using a global store or surface store and expect to read that data back with a texture fetch within the same kernel call. If you read from texture1 and write to the memory region of texture2, and in the next iteration/kernel call read from texture2 and write to texture1, then you are fine. Also consider using half floats for storing the data in textures. They can cut your bandwidth needs in half if the precision is sufficient.
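
To make the texture1/texture2 swapping concrete, here is a small host-side Python sketch of the same ping-pong pattern. It is not CUDA code. The two lists stand in for the two texture memory regions, `step` stands in for one kernel call, and the 3-tap average is just a placeholder computation I made up.

```python
def step(src, dst):
    """One 'kernel call': reads only from src, writes only to dst."""
    n = len(src)
    for i in range(n):
        left = src[max(i - 1, 0)]        # clamp at the borders, like a clamped texture fetch
        right = src[min(i + 1, n - 1)]
        dst[i] = (left + src[i] + right) / 3.0


def run(initial, iterations):
    """Iterate 'step', swapping the roles of the two buffers after every call."""
    a = list(initial)        # plays the role of texture1
    b = [0.0] * len(a)       # plays the role of texture2
    for _ in range(iterations):
        step(a, b)           # read a, write b
        a, b = b, a          # next call reads what was just written
    return a
```

Within one call nothing ever reads a value that was written in that same call, which is exactly the restriction described above.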

 

About the blocks and occupancy thing:

As a rule of thumb, you should watch out for two things in CUDA: occupancy and global loads/stores. For the global loads/stores, textures are a good optimization you should try first. If that doesn't work out, you can still try manual caching in shared memory.

 

Occupancy is the number of threads that can be held in the MultiProcessor relative to its maximum. A CC 2.x MP can hold up to 1536 threads. These are not actually all running in parallel. Rather, think of it the way SMT/Hyper-Threading works on Intel CPUs: the thread contexts are held on the chip, and the chip tries to keep all of its execution units busy by fetching instructions from those threads. For example, if a couple of threads are performing a global load/store, which will take multiple cycles, all the other threads can use the ALUs to perform computations.

This means that using only 60% of those 1536 threads does not necessarily mean a decrease in performance. But it does decrease the flexibility that the warp schedulers have, and it might make global loads/stores and texture reads more expensive.

 

The number of threads that you can put on an MP is limited not only by that magical number 1536, but also by the number of registers and the amount of shared memory that the kernel needs.

A CC 2.0 MP has a total of 32768 (32 K) registers. So if your kernel (and thus each thread) needs 32 registers, you cannot have more than 1024 threads resident on the MP, an occupancy of about 67%.

Also, a CC 2.0 MP has a total of 48KB of shared memory that can be split among the blocks. If each block needs 12KB of shared memory (including whatever overhead CUDA produces), then you can have a maximum of 4 blocks resident on your MP. If each block only consists of 16 threads, then you have the horrible situation of only 64 threads being active (an occupancy of about 4%), which isn't even enough to hide the latency of the arithmetic instructions.
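
The limits above can be put into a few lines of arithmetic. This is only a back-of-the-envelope Python sketch, not NVIDIA's calculator. The default constants are the CC 2.x per-MP limits as I know them (1536 threads, 8 resident blocks, 32768 registers, 48KB shared memory), and it ignores allocation granularity and the fact that threads are scheduled in warps of 32, so treat the results as rough estimates.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=1536, max_blocks=8,
              total_regs=32768, total_smem=48 * 1024):
    """Estimate resident blocks, resident threads and occupancy for one MP.

    All limits are per multiprocessor; smem_per_block is in bytes.
    """
    by_threads = max_threads // threads_per_block
    by_regs = total_regs // (regs_per_thread * threads_per_block)
    # A kernel that uses no shared memory is not limited by it.
    by_smem = total_smem // smem_per_block if smem_per_block > 0 else max_blocks
    blocks = min(by_threads, by_regs, by_smem, max_blocks)
    resident_threads = blocks * threads_per_block
    return blocks, resident_threads, resident_threads / max_threads


if __name__ == "__main__":
    # 32 registers per thread, 256 threads per block, no shared memory
    print(occupancy(256, 32, 0))
    # 12KB shared memory per block, only 16 threads per block
    print(occupancy(16, 32, 12 * 1024))
```

The first call is limited by registers (4 blocks, 1024 threads), the second by shared memory (4 blocks, 64 threads).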

 

All of these numbers are nicely structured in tables on Wikipedia:

http://en.wikipedia.org/wiki/CUDA

 

IIRC, NVIDIA offers a spreadsheet for computing the occupancy (I never used it), and the NVIDIA Visual Profiler will also tell you everything you need to know.


In Topic: CUDA function calling, what's the best approach?

13 May 2013 - 01:16 PM

Oh, I see, I misread your original post. You are not talking about the kernel launch, but about a device function.

 

The actual block size is not as relevant because a MultiProcessor can run several blocks in parallel. The maximum amount of shared memory is actually 16KB for Compute Capability < 2.0 and 48KB otherwise. Note that cache and shared memory are two distinct things, although they are implemented by the same hardware.

 

Which hardware level are you actually targeting? How many parameters do you have, and how many "factors" are computed?

 

 

The way I interpret your code, the "parameters" are sampled from a 3D grid. Each of these grid cells is sampled 6 times by 6 different threads, so without storing them in shared memory, you have 6 times as many global loads as you would need in theory. In addition, if I interpret your snippets correctly, you have not one parameter per grid cell but many, and you read them one by one. That is, up to 6 threads may each read the same value, and you do this up to N times, where N is the number of parameters per grid cell. So you end up with 6 * N global memory loads per grid cell.

If you use shared memory, you can share those grid cells to some degree among the threads. Also, all the threads in a block can join forces to read the cell's parameters sequentially from memory, thus further reducing the number of global loads.
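
To get a feeling for the savings, here is a small Python sketch that just counts loads for one block. I am assuming that every thread reads the 6 face neighbours of its own cell, which is my guess at your access pattern, and that a block covers a box of bx * by * bz cells. The function name is made up.

```python
from itertools import product

# Assumed access pattern: each thread reads the 6 face neighbours of its cell.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def loads_per_block(bx, by, bz, params_per_cell):
    """Return (naive, shared) global load counts for one block of threads.

    naive:  every thread loads every parameter of every neighbour itself.
    shared: the block loads each distinct cell once into shared memory.
    """
    naive = 0
    distinct_cells = set()
    for x, y, z in product(range(bx), range(by), range(bz)):
        for dx, dy, dz in NEIGHBOURS:
            naive += params_per_cell
            distinct_cells.add((x + dx, y + dy, z + dz))
    shared = len(distinct_cells) * params_per_cell
    return naive, shared


if __name__ == "__main__":
    naive, shared = loads_per_block(8, 8, 8, 4)
    print(naive, shared, naive / shared)
```

For an 8 * 8 * 8 block the shared variant still has to load a one-cell halo around the box, so the saving is less than the full factor of 6, and it shrinks further for small blocks.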


In Topic: CUDA function calling, what's the best approach?

12 May 2013 - 01:42 AM

On Compute Capability < 2.0 devices, all function parameters are stored in shared memory and immediately loaded into registers. That means that each parameter has a very high cost, because it limits the number of threads that can run in parallel. I had a lot of cases where it was beneficial to store parameters in constant memory.

 

For Compute Capability >= 2.0, the spec says that parameters are stored in constant memory, but I haven't tested whether they still occupy an entire register for each parameter.

 

I would expect the single pointer to be either equally fast or faster.

 

 

PS: If you have every thread read a different value from the array and cache them in shared memory, you can significantly reduce the number of global memory accesses for the second (pointer based) variant.

