CUDA function calling, what's the best approach?


Hello,

I'm working on a CUDA kernel and something interesting, I guess, crossed my mind. Maybe you could help me.

Say I have those two kernels:


__device__ float SimpleKernel1(float value1, float value2, float value3, ..., float valueN)
{
    return value1 + value2 + value3 + ... + valueN;
}


__device__ float SimpleKernel2(float *values)
{
    return values[0] + values[1] + values[2] + ... + values[N-1];
}

Would SimpleKernel2 run faster? I know there are lots of factors in play (e.g. memory interface, clock, number of threads), but thinking as generically as possible: the call to SimpleKernel1 passes sizeof(float)*N bytes, while the call to SimpleKernel2 only passes sizeof(float *) bytes, so maybe this would result in a big speedup. Is this right or wrong, and does it really matter?

Thanks!

Tiago.M
Web Developer - Aspiring CG Programmer
Compile it and look at the assembled function. My guess is that the second method would be slower, but only the good Lord knows what the compiler will do with it.
void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

What?! Kernel2 still needs the entire array uploaded into device memory to be able to access values[0], values[1], ... values[N-1]. It doesn't fetch them from host memory; that would be horribly slow. All data needed by a kernel has to reside in device memory. In the second case the transfer may happen before the kernel call (upon device memory allocation and data upload), but it still has to happen. That said, Kernel2 is better in the sense that it can be reused more easily, and I believe there is a limit on the number of kernel arguments you can have. Also, I believe the arguments in Kernel1 live in a different memory address space than the array in Kernel2.

This is all based on my knowledge of OpenCL, though. CUDA should be largely the same at this level.
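To make that concrete, here is a minimal host-side sketch of what the pointer variant implies (my own placeholder names such as SumKernel and d_values, nothing from your code): the array still has to be allocated on the device and uploaded before the launch, and the kernel argument itself is only the pointer.

#include <cuda_runtime.h>

__global__ void SumKernel(const float *values, int n, float *out)
{
    //Trivial single-thread sum, just to show that the kernel only receives a pointer.
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += values[i];
    *out = sum;
}

int main()
{
    const int N = 8;
    float h_values[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    float *d_values, *d_out;
    cudaMalloc((void **)&d_values, N * sizeof(float));
    cudaMalloc((void **)&d_out, sizeof(float));

    //The transfer happens here, before the launch. It doesn't go away
    //just because the kernel argument shrank to a single pointer.
    cudaMemcpy(d_values, h_values, N * sizeof(float), cudaMemcpyHostToDevice);

    SumKernel<<<1, 1>>>(d_values, N, d_out);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_values);
    cudaFree(d_out);
    return 0;
}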

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”


Those values aren't actually global memory, they are private. I'm very sorry for the confusion and will try to provide some context:

I have a 3D mesh with each position representing a float value and I use GPGPU to update this mesh T times.


//This is the function called from host to device.
__global__ void Update3DMesh(float *mesh, int time)
{
    //Retrieve the index from the thread ID that represents the position being updated.

    //Retrieve the center, top, bottom, left, right, front, back position values relative to the index.

    //Compute N factors using these surrounding position values according to time.
    float factor1 = ComputeFactor1(parameters);
    float factor2 = ComputeFactor2(parameters);
    ...
    float factorN = ComputeFactorN(parameters);

    //Update the position.
    mesh[index] = factor1 + factor2 + ... + factorN;
}

I'm struggling to figure out the best way for the ComputeFactor functions to receive their parameters. I thought that packing all the surrounding positions into a float array and passing the pointer (sizeof(float *) bytes), instead of passing all positions separately (sizeof(float)*7 bytes), might result in a faster program. I'll just have to compare both methods.
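To illustrate, here is roughly what I mean by the two options (ComputeFactorA/ComputeFactorB and the math inside them are made-up placeholders; the real functions are more involved):

//Option A: pass the seven samples individually.
__device__ float ComputeFactorA(float center, float top, float bottom,
                                float left, float right, float front, float back)
{
    return center + top + bottom + left + right + front + back; //placeholder math
}

//Option B: pack the samples into a small per-thread array and pass a pointer.
__device__ float ComputeFactorB(const float *s) //s[0..6] = center..back
{
    float sum = 0.0f;
    for (int i = 0; i < 7; ++i)
        sum += s[i]; //placeholder math
    return sum;
}

__global__ void Update3DMeshSketch(float *mesh, int time)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    //Gather the center and the six neighbours into a per-thread array.
    //(The real neighbour index math and bounds checks are omitted; this just reuses the center value.)
    float s[7];
    for (int i = 0; i < 7; ++i)
        s[i] = mesh[index];

    float a = ComputeFactorA(s[0], s[1], s[2], s[3], s[4], s[5], s[6]);
    float b = ComputeFactorB(s);

    mesh[index] = a + b;
}

One thing to keep in mind, which ties into the suggestion above to compile it and look at the assembly: as far as I know nvcc usually inlines __device__ functions, so after inlining the two options may well compile to much the same code, and neither actually copies sizeof(float)*7 bytes anywhere at run time.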

Appreciate the responses, Khatharr and Bacterius.

Tiago.M
Web Developer - Aspiring CG Programmer

On Compute Capability < 2.0 devices, all function parameters are stored in shared memory and immediately loaded into registers. That means each parameter has a very high cost, because it limits the number of threads that can run in parallel. I had a lot of cases where it was beneficial to store parameters in constant memory.

For Compute Capability >= 2.0, the spec says that parameters are stored in constant memory, but I haven't tested whether they still block an entire register for each parameter.

I would expect the single pointer to be either equally fast or faster.

PS: If you have every thread read a different value from the array and cache them in shared memory, you can significantly reduce the number of global memory accesses for the second (pointer based) variant.

I had a lot of cases where it was beneficial to store parameters in constant memory.

That's what I'm doing: before the program starts executing, I load static settings (e.g. mesh size, mesh length...) as constants using cudaMemcpyToSymbol, and I just send the mesh and time parameters with the kernel. I'm trying to minimize transfers from host to device as much as possible.
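Roughly like this, in case it helps anyone reading later (a sketch only; the names c_meshSize, c_meshLength, UploadSettings and UseSettings are made up, and my real settings have more fields):

#include <cuda_runtime.h>

//Static settings, uploaded once at startup.
__constant__ int   c_meshSize;
__constant__ float c_meshLength;

void UploadSettings(int meshSize, float meshLength)
{
    //One-time host-to-device copies into constant memory.
    cudaMemcpyToSymbol(c_meshSize,   &meshSize,   sizeof(int));
    cudaMemcpyToSymbol(c_meshLength, &meshLength, sizeof(float));
}

__global__ void UseSettings(float *mesh)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < c_meshSize)
        mesh[index] *= c_meshLength; //constants are read directly, no kernel argument needed
}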

PS: If you have every thread read a different value from the array and cache them in shared memory, you can significantly reduce the number of global memory accesses for the second (pointer based) variant.

I didn't know about shared memory (I'm basically a beginner in CUDA) and, after some research online, I learned the maximum cache size per thread block is around 16-32KB. My kernels run 1024 threads per block and would need 60KB, so to make it work I'd have to reduce the number of threads per block from 1024 to 512. Is this generally a good tradeoff?

Tiago.M
Web Developer - Aspiring CG Programmer

My kernels run 1024 threads per block and would need 60KB, so to make it work I'd have to reduce the number of threads per block from 1024 to 512. Is this generally a good tradeoff?

I don't think there is an alternative to run-time profiling of the kernels, which for me means exhausting all possible block sizes and comparing times.
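For what it's worth, the brute-force loop can be as simple as this (a sketch; Update3DMesh is the kernel from the earlier snippet, assumed to be defined elsewhere, and the range of block sizes is just an example):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void Update3DMesh(float *mesh, int time); //kernel from the earlier snippet

void ProfileBlockSizes(float *d_mesh, int numPositions)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int blockSize = 64; blockSize <= 1024; blockSize *= 2)
    {
        int gridSize = (numPositions + blockSize - 1) / blockSize;

        //Time one launch per block size with CUDA events.
        cudaEventRecord(start);
        Update3DMesh<<<gridSize, blockSize>>>(d_mesh, 0);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %4d: %.3f ms\n", blockSize, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}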

__device__ float SimpleKernel1(float value1, float value2, float value3, ..., float valueN)
{
    return value1 + value2 + value3 + ... + valueN;
}

I don't know what happens when you pass by value; I'm curious though. I'll look at the PTX when I get to work. Questions such as these should be answered on the clock.

The pointer version will simply access the thread's local memory, which I'm guessing is better/best. It never dawned on me to pass, say, 6-7 values like that instead of using an array.

I don't see right off how shared memory would provide a benefit unless you are passing data between threads within the block. I'll admit to a lot of voodoo optimization of CUDA code -- buffering in shared memory never worked for me. At best it will mimic the use of private local memory. The use of local memory and registers will force you to vary the block size also.

The Four Horsemen of Happiness have left.

Oh, I see, I misread your original post. You are not talking about the kernel launch, but about a device function.

The actual block size is not as relevant, because a multiprocessor can run several blocks in parallel. The maximum amount of shared memory is actually 16KB for Compute Capability < 2.0 and 48KB otherwise. Note that cache and shared memory are two distinct things, although they are implemented by the same hardware.

Which hardware level are you actually targeting? How many parameters do you have, and how many "factors" are computed?

The way I interpret your code, the "parameters" are sampled from a 3D grid. Each of these grid cells is sampled 6 times by 6 different threads, so without storing them in shared memory you have 6 times as many global loads as you would need in theory. In addition, if I interpret your snippets correctly, you have not one parameter per grid cell but many, and you read them one by one. That is, up to 6 threads may each read the same value, and you do this up to N times, where N is the number of parameters per grid cell. So you end up with 6 * N global memory loads per grid cell.

If you use shared memory, you can share those grid cells among the threads to some degree. Also, all the threads in a block can join forces to read the cells' parameters sequentially from memory, further reducing the number of global loads.
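In 1D the pattern looks roughly like this (only a sketch of the general idea, not your actual kernel; the 3D indexing, halo handling and the multiple parameters per cell make the real thing more involved). Launch it with a block size of BLOCK:

#define BLOCK 256

__global__ void Stencil1D(const float *in, float *out, int n)
{
    //One element per thread plus a one-element halo on each side.
    __shared__ float tile[BLOCK + 2];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1; //shifted by one because of the left halo

    if (gid < n)
        tile[lid] = in[gid]; //one global load per cell instead of three
    if (threadIdx.x == 0 && gid > 0)
        tile[0] = in[gid - 1]; //left halo
    if (threadIdx.x == blockDim.x - 1 && gid < n - 1)
        tile[BLOCK + 1] = in[gid + 1]; //right halo

    __syncthreads();

    if (gid > 0 && gid < n - 1)
        out[gid] = tile[lid - 1] + tile[lid] + tile[lid + 1]; //neighbours come from shared memory
}

Each block gets its own copy of tile, so blocks running in parallel don't interfere with each other.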

The actual block size is not as relevant, because a multiprocessor can run several blocks in parallel. The maximum amount of shared memory is actually 16KB for Compute Capability < 2.0 and 48KB otherwise.

Good, I didn't know that. If many blocks are allowed to run in parallel, does it mean each block will have its own shared memory?

Which hardware level are you actually targeting? How many parameters do you have, and how many "factors" are computed?

The program output says it's a Tesla 2090; its compute capability is 2.0. The parameters of the kernel are float *mesh3D and int time, and there are actually two "factor functions", Laplace and Chemotaxis; their parameters are the surrounding values (top, bottom, left, right, front, back and middle).

Your guess is correct, except that there are 7 samplings. Also, each point in the grid has 7 cells (the grid is actually a microscopic portion of human skin tissue), so it's 7*7*N redundant accesses.

Laplace and Chemotaxis are calculated for each cell at a mesh position, so there are lots of accesses to global values. I want to move these values to faster memory before Laplace and Chemotaxis start accessing them.

Tiago.M
Web Developer - Aspiring CG Programmer

This is pretty much what CUDA's so-called texture memory is for; you should look into it. If I recall correctly, the introduction to texture memory in some of the NVIDIA docs uses a similar problem. The short of it: you'll get these neighbors into cache for fast access. It requires a bit of tuning, though...

CUDA by Example

Chapter 7. Texture Memory - Simulating Heat Transfer
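If it helps, for a plain linear buffer the old texture reference API from that hardware generation looks roughly like this (a sketch only; meshTex, ReadThroughTexture and Launch are made-up names, the book chapter does it properly, and newer CUDA versions replace texture references with texture objects):

#include <cuda_runtime.h>

//Texture references must live at file scope.
texture<float, cudaTextureType1D, cudaReadModeElementType> meshTex;

__global__ void ReadThroughTexture(float *out, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)
        out[index] = tex1Dfetch(meshTex, index); //reads go through the texture cache
}

void Launch(float *d_mesh, float *d_out, int n)
{
    //Bind the existing linear device buffer to the texture reference, launch, unbind.
    cudaBindTexture(NULL, meshTex, d_mesh, n * sizeof(float));
    ReadThroughTexture<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(meshTex);
}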

The Four Horsemen of Happiness have left.

