Matrix Multiplication in CUDA

Hullo all, I am making a program to generate a bunch of random matrices and then solve them using Cuda. Currently I generate a bunch of random matrices on the CPU, pop them over to the device, and solve them there. This isn't really that efficient yet, as I only run the Cuda part after each matrix is generated, and I only run it for an individual matrix before sending it back to the host and repeating. What I would like to do is generate the matrices on the device, then solve them on the device, before retrieving one of the solutions to be sure it multiplies correctly. I want each matrix multiplication to be done within a block, so the total number of matrix multiplications being done would be the number of blocks.

Now my questions:

1. Is it better to do the matrix multiplication using a 1-dimensional array, where I can figure out the row/column using a % or some such, or would a 2-dimensional array be better to represent a matrix?

2. Is it a bad idea to have a kernel generate a single element of each matrix/vector, rather than have the kernel generate all elements of a given matrix/vector?

3. If I am generating matrices on the device, how do I make sure that the solution function stays in the same memory space as the kernel?

4. If I want to fire off the solving function after all the numbers in the given matrices are generated, I am assuming I would use syncthreads, but I am not sure where. Can you syncthreads on the device, or do you have to call syncthreads from the host?

I think that's all I have for now. Answers to any questions would be super.
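To make question 1 concrete, this is the kind of 1D layout I have in mind (just a rough sketch to show the indexing; the kernel name and the identity fill are made up for illustration, not my actual code):

// one thread per element of a 4x4 matrix stored as a flat 16-float array
__global__ void fillIdentity(float* M)
{
    int idx = threadIdx.x;      // 0..15, one thread per element
    int row = idx / 4;          // integer division recovers the row
    int col = idx % 4;          // remainder recovers the column
    M[idx] = (row == col) ? 1.0f : 0.0f;
}

// launched with one block of 16 threads, e.g. fillIdentity<<<1, 16>>>(M_d);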
I got another one.

If I have 3 matrices on the device A_d, B_d, C_d

I want to multiply A_d*B_d = C_d.

Assuming each has been cudaMalloc'd correctly, when I am filling A_d and B_d, or filling C_d, will each block have the memory allocated separately in its own memory space? If not, is there a way to have each matrix exist on each block? With that in mind, is there a way to know which C_d I am getting back from the device?
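To try to make that clearer, what I'm picturing is one big allocation per matrix set, with each block offsetting into its own slice by blockIdx.x (a made-up sketch; NUM_MATRICES and MATRIX_ELEMS are placeholder constants, not real code of mine):

// one allocation holds all NUM_MATRICES matrices back to back
float *A_d, *B_d, *C_d;
cudaMalloc((void**)&A_d, NUM_MATRICES * MATRIX_ELEMS * sizeof(float));
cudaMalloc((void**)&B_d, NUM_MATRICES * MATRIX_ELEMS * sizeof(float));
cudaMalloc((void**)&C_d, NUM_MATRICES * MATRIX_ELEMS * sizeof(float));

// inside the kernel, each block finds its own matrix by its block index:
//     const float* A = A_d + blockIdx.x * MATRIX_ELEMS;
// so the result for block b always sits at C_d + b * MATRIX_ELEMS,
// which would be how I know which C_d I'm getting back after copying to the host.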

Sorry, I know these are Cuda noob questions, but I didn't want to put it in For Beginners because I wouldn't really consider Cuda a beginner subject.
Questions 3, 4 and the unnumbered second post have been answered by digging through the Cuda programming guide.
Hi,

I don't really have time to concentrate on and answer all your questions, but I was just wondering if you've read the CUDA programming guide or gone through the samples in the SDK. There is a matrix multiplication example in the samples and it is explained nicely in the Programming Guide too.

You said that you would like to calculate each matrix multiplication in a separate block? Are you aware that blocks have size limits? I think the max number of threads per block is 512, meaning that your matrices would not be allowed more than 512 elements.

In the general case you wouldn't want to limit this to a single thread block; you would rather split the desired product (represented by a 2D grid) across multiple thread blocks.
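To give a rough idea of that pattern, a general-case multiplication kernel looks something like this (a simplified sketch from memory, with no shared-memory optimizations; not the actual SDK code):

__global__ void matMul(const float* A, const float* B, float* C, int N)
{
    // one thread per element of C, spread across a 2D grid of 2D blocks
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
    {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}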

I suggest taking a look at the example I mentioned and then asking any further questions you may have.

I am struggling to understand your questions though, especially in your second post. What do you mean by matrices existing in each block, exactly? And what do you mean by getting the _right_ C_d back from the device?
Oh, and I stand to be corrected, but AFAIK a kernel usually generates only one element of the result per thread (question 2).
Quote: Original post by rewolfer
[...] I am struggling to understand your questions though, especially in your second post. What do you mean by matrices existing in each block, exactly? And what do you mean by getting the _right_ C_d back from the device?

OK, I forgot a major part: these are all 4x4 matrices being multiplied by 4x1 vectors. I'm going for quantity of multiplications rather than size of matrix.

It's not a very useful program, but I'm doing it to learn Cuda for a class, so it's kind of isolated in its usability.

The one in the Cuda programming guide seems better suited to doing one huge matrix multiplication. I'm kind of approaching the opposite problem.
_______________________________________________
Here's what I'm doing right now that's different from before.

I am declaring three 1D float arrays in shared memory; then each thread in the block randomly generates a number to put in the array. Then I pass them to another device function that actually does the multiplying and stores the result in a shared 16-element array.

Then when I get the 16-element array, I add every four elements and store them in the result array. For example:

C[0] = C_shared[0] + C_shared[1] + C_shared[2] + C_shared[3];

That last bit feels less than optimal, but I was concentrating on getting the first part working before getting the rest done.
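Trimmed down, the kernel looks roughly like this (a sketch, not my exact code; the placeholder fills stand in for the random-number generation):

__global__ void genAndMultiply(float* C)
{
    __shared__ float A_shared[16];   // the 4x4 matrix
    __shared__ float B_shared[4];    // the 4x1 vector
    __shared__ float C_shared[16];   // per-element products before the row sums

    int tid = threadIdx.x;           // 16 threads per block, one per matrix element
    int row = tid / 4;
    int col = tid % 4;

    // each thread fills one element (placeholder values here; the real code
    // generates random numbers instead)
    A_shared[tid] = (float)(tid + 1);
    if (tid < 4)
        B_shared[tid] = (float)(tid + 1);
    __syncthreads();                 // make sure everything is generated before multiplying

    // one product per thread: A[row][col] * B[col]
    C_shared[tid] = A_shared[row * 4 + col] * B_shared[col];
    __syncthreads();

    // add every four products to get one element of the 4x1 result;
    // block b writes its answer at C + b*4, so I know which result is whose
    if (tid < 4)
        C[blockIdx.x * 4 + tid] = C_shared[tid * 4] + C_shared[tid * 4 + 1]
                                + C_shared[tid * 4 + 2] + C_shared[tid * 4 + 3];
}

Launched with one block per matrix and 16 threads each, something like genAndMultiply<<<numMatrices, 16>>>(C_d);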
