
Member Since 12 Jun 2008
Offline Last Active May 19 2013 05:53 AM

Posts I've Made

In Topic: CUDA function calling, what's the best approach?

15 May 2013 - 08:25 PM

Thanks for the tips, Ohforf sake and jms bc. By doing some research I've learned more about textures, and there's a CUDA binding called surface that allows write access to texture memory; it seems like a good solution to my problem.
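
If I understood the docs right, the write path would look roughly like this with the legacy surface reference API (meshSurf and the kernel are just a sketch, not my actual code):

    // Minimal sketch (legacy surface reference API, CC >= 2.0):
    surface<void, cudaSurfaceType1D> meshSurf;   // file-scope surface reference

    __global__ void updateMesh(int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v;
            surf1Dread(&v, meshSurf, i * sizeof(float));        // byte-addressed read
            surf1Dwrite(v * 0.5f, meshSurf, i * sizeof(float)); // in-place write
        }
    }

    // Host side: the cudaArray must be created with the cudaArraySurfaceLoadStore
    // flag and bound with cudaBindSurfaceToArray(meshSurf, cuArray);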


My only issue right now is that I've already started implementing the shared-cache version of my code, so changing yet again to another approach would cost me a lot of time (I'd have to implement a workaround for the 130MB limit on a single 1D texture). So, for now, I'll test shared memory and check the speed-up; if it's not good enough, I'll switch to texture memory.


Also, I didn't know CUDA supported half float variables (I thought that was an OpenCL-only feature), so I'll switch to half, since it could effectively double the reach of a single CUDA block into the mesh values.
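
From what I've read, the idea would be to store halves as 16-bit values and convert with the __half2float / __float2half_rn device intrinsics, something like this (the mesh array here is hypothetical):

    __global__ void halfMeshStep(unsigned short *mesh, int n)  // halves stored as 16-bit
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = __half2float(mesh[i]);   // unpack 16-bit half to float
            v *= 2.0f;                         // arithmetic stays in full float
            mesh[i] = __float2half_rn(v);      // pack back, round to nearest
        }
    }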


Thanks again, learned a lot.

In Topic: CUDA function calling, what's the best approach?

14 May 2013 - 11:08 AM

This is pretty much what CUDA's so-called texture memory is for; you should look into it. If I recall correctly, the introduction to texture memory in some of the NVIDIA docs uses a similar problem. The short of it: you'll get these neighbors into cache for fast access. Requires a bit of tuning, though...


Thanks for the suggestion, but, correct me if I'm wrong, texture memory is read-only (my mesh has to change its values at each iteration) and is a special type of global memory, so shared memory access is still going to be faster than texture memory.


Also, from looking at the simpleCubemapTexture sample, texture memory seems better suited to surfaces than to volumes like mine, and there are also memory size limitations I'd have to consider for a single texture; the meshes in my code can easily reach 100MB.
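
Just to make the read-only point concrete, as far as I can tell the texture path would have to look something like this (legacy texture reference API; meshTex and the output buffer are illustrative), with every write going to a separate array:

    texture<float, cudaTextureType1D, cudaReadModeElementType> meshTex; // read-only view

    __global__ void readMesh(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(meshTex, i);   // cached, read-only fetch
    }

    // Host: cudaBindTexture(0, meshTex, d_mesh, n * sizeof(float));
    // Writes go to a separate buffer (out), then swap/rebind each iteration.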

In Topic: CUDA function calling, what's the best approach?

14 May 2013 - 07:42 AM

The actual block size is not as relevant because a multiprocessor can run several blocks in parallel. The maximum amount of shared memory is actually 16KB for Compute Capability < 2.0 and 48KB otherwise.


Good, I didn't know that. If many blocks are allowed to run in parallel, does that mean each block will have its own shared memory?


Which hardware level are you actually targeting? How many parameters do you have, and how many "factors" are computed?


The program output says it's a Tesla 2090; its compute capability is 2.0. The kernel's parameters are float *mesh3D and int time, and there are actually two "factor functions", Laplace and Chemotaxis; their parameters are the surrounding values (top, bottom, left, right, front, back and middle).


Your guess is correct, except that there are 7 samplings. Also, each point belonging to the grid has 7 cells (the grid is actually a microscopic portion of human skin tissue), so that's 7*7*N redundant accesses.


Laplace and Chemotaxis are calculated for each cell at a mesh position, so there are lots of accesses to global values. I want to move these values to a faster memory before Laplace and Chemotaxis start accessing them.
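
Roughly what I have in mind, sketched in 1D to keep it short (the real mesh is 3D; TILE and the kernel name are made up): each block stages its tile plus a one-cell halo into shared memory, and the stencil then reads from there.

    #define TILE 256

    __global__ void stencilPass(const float *mesh, float *out, int n)
    {
        // One tile per block plus a one-cell halo on each side, so the
        // neighbor reads hit shared memory instead of global memory.
        __shared__ float tile[TILE + 2];

        int g = blockIdx.x * TILE + threadIdx.x;  // global index
        int s = threadIdx.x + 1;                  // shared index (skip left halo)

        if (g < n) {
            tile[s] = mesh[g];
            if (threadIdx.x == 0)        tile[0]        = (g > 0)     ? mesh[g - 1] : 0.0f;
            if (threadIdx.x == TILE - 1) tile[TILE + 1] = (g + 1 < n) ? mesh[g + 1] : 0.0f;
        }
        __syncthreads();

        if (g > 0 && g + 1 < n)
            out[g] = tile[s - 1] - 2.0f * tile[s] + tile[s + 1];  // 1D Laplacian
    }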

In Topic: Frustum Culling

13 May 2013 - 07:25 AM

Assuming you know the point of intersection (the corner) between PB and PL (let's call it PI), you can calculate a vector V = (P - PI). If length(V) < sphereRadius, your point is inside. Otherwise, check the dot product of normalized(V) with the PB and PL plane normals: if and only if both values are negative, the point is outside; if that comparison fails, it's inside.
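
In code, my guess would look roughly like this (Vec3 and the helpers are just for illustration):

    #include <math.h>

    struct Vec3 { float x, y, z; };

    static float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
    static Vec3  sub(Vec3 a, Vec3 b) { Vec3 r = { a.x-b.x, a.y-b.y, a.z-b.z }; return r; }

    // nB, nL: unit normals of planes PB and PL; PI: the corner between them.
    bool insideAtCorner(Vec3 P, float sphereRadius, Vec3 PI, Vec3 nB, Vec3 nL)
    {
        Vec3  V   = sub(P, PI);
        float len = sqrtf(dot(V, V));
        if (len < sphereRadius)
            return true;                       // close enough to the corner

        Vec3 Vn = { V.x / len, V.y / len, V.z / len };
        return !(dot(Vn, nB) < 0.0f && dot(Vn, nL) < 0.0f);  // outside iff behind both
    }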


Take this with a grain of salt since I've never tested or even considered this case when doing frustum culling, but this would be my first guess to solve this issue.


Hope it helped.

In Topic: CUDA function calling, what's the best approach?

12 May 2013 - 01:59 PM

I had a lot of cases where it was beneficial to store parameters in constant memory.


That's what I'm doing: before the program starts executing I load static settings (i.e. mesh size, mesh length...) as constants using cudaMemcpyToSymbol, and just send the mesh and time parameters with the kernel. I'm trying to minimize transfers from host to device as much as possible.
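
For reference, the pattern is roughly this (the c_ names and values are placeholders, not my actual settings):

    __constant__ int   c_meshSize;     // static settings live in constant memory
    __constant__ float c_meshLength;

    // Host side, done once before any kernel launch:
    int   meshSize   = 512;
    float meshLength = 1.0f;
    cudaMemcpyToSymbol(c_meshSize,   &meshSize,   sizeof(meshSize));
    cudaMemcpyToSymbol(c_meshLength, &meshLength, sizeof(meshLength));

    // The kernel then only takes the per-launch arguments:
    __global__ void step(float *mesh3D, int time)
    {
        // ... reads c_meshSize / c_meshLength directly, no extra transfer ...
    }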


PS: If you have every thread read a different value from the array and cache them in shared memory, you can significantly reduce the number of global memory accesses for the second (pointer-based) variant.


I didn't know about shared memory (I'm basically a beginner with CUDA), and from some research on the internet I learned the maximum shared memory size per thread block is around 16-32KB. My kernels run 1024 threads per block and would need 60KB, so to make it work I'd have to reduce the number of threads per block from 1024 to 512. Is this generally a good tradeoff?
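
For what it's worth, the arithmetic behind my question, as a sketch (step, d_mesh, t and numBlocks are hypothetical):

    // ~60 bytes of shared memory per thread:
    //   1024 threads/block -> ~60KB  (over the 48KB limit on CC 2.0)
    //    512 threads/block -> ~30KB  (fits)
    size_t bytesPerThread  = 60;
    int    threadsPerBlock = 512;
    size_t smemBytes = threadsPerBlock * bytesPerThread;  // ~30KB

    // 'extern __shared__ float cache[];' inside the kernel picks this up:
    step<<<numBlocks, threadsPerBlock, smemBytes>>>(d_mesh, t);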