Hello,
I'm working on a CUDA kernel and a question crossed my mind. Maybe you can help me.
Say I have these two device functions:
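// Takes all N values as separate by-value arguments.
__device__ float SimpleKernel1(float value1, float value2, float value3, ..., float valueN)
{
    return value1 + value2 + value3 + ... + valueN;
}
// Takes a single pointer to an array holding the same N values.
__device__ float SimpleKernel2(float *values)
{
    return values[0] + values[1] + values[2] + ... + values[N-1];
}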
Would SimpleKernel2 run faster? I know there are lots of factors in play (e.g. memory interface, clock, number of threads), but thinking as generically as possible: a call to SimpleKernel1 passes sizeof(float)*N bytes, while a call to SimpleKernel2 passes only sizeof(float *) bytes, so maybe this would result in a significant speedup. Is this right or wrong, and does it really matter?
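To make the comparison concrete, here is a minimal sketch of the two variants with N fixed at 4. The names Sum4ByValue, Sum4ByPointer and CompareKernel are hypothetical, chosen only for illustration, and the macro fallback at the top is an assumption that lets the same file compile with a plain C++ compiler when nvcc is not available.

```cpp
// Fallback so this sketch also compiles without nvcc (assumption for illustration).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// By-value variant with N fixed at 4 (hypothetical name).
__host__ __device__ inline float Sum4ByValue(float v0, float v1, float v2, float v3)
{
    return v0 + v1 + v2 + v3;
}

// By-pointer variant: the caller passes one pointer and the function
// itself reads the four floats it points to (hypothetical name).
__host__ __device__ inline float Sum4ByPointer(const float *values)
{
    return values[0] + values[1] + values[2] + values[3];
}

#ifdef __CUDACC__
// Hypothetical kernel: each thread sums its own group of 4 floats both ways,
// so the two results can be compared and the two paths timed.
__global__ void CompareKernel(const float *in, float *outByValue,
                              float *outByPointer, int numGroups)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numGroups) return;       // guard threads past the end of the data

    const float *v = in + 4 * i;      // this thread's group of 4 values
    outByValue[i]   = Sum4ByValue(v[0], v[1], v[2], v[3]);
    outByPointer[i] = Sum4ByPointer(v);
}
#endif
```

In this sketch the four floats are read from `in` either way. The only difference is whether the read happens in the caller before the call or inside the called function.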
Thanks!