way2lazy2care

Matrix Multiplication in CUDA


Hullo all, I am making a program to generate a bunch of random matrices and then multiply them using CUDA. Currently I generate a bunch of random matrices on the CPU, pop them over to the device, and multiply them there. This isn't really efficient yet, as I only run the CUDA part after each matrix is generated, and I only run it for an individual matrix, then send it back to the host and repeat. What I would like to do is generate the matrices on the device, then multiply them on the device, before retrieving one of the results to be sure it multiplied correctly. I want each matrix multiplication to be done within a block, so the total number of matrix multiplications being done would be the number of blocks.

Now my questions:

1. Is it better to do the matrix multiplication using a 1-dimensional array, where I can figure out the row/column using a % or some such, or would a 2-dimensional array be a better representation of a matrix? (A sketch of what I mean by the 1D version is below.)

2. Is it a bad idea to have a kernel generate a single element of each matrix/vector, rather than have the kernel generate all elements of a given matrix/vector?

3. If I am generating matrices on the device, how do I make sure that the multiplication function stays in the same memory space as the generation kernel?

4. If I want to fire off the multiplication after all the numbers in a given matrix are generated, I am assuming I would use __syncthreads(), but I am not sure where. Can you call __syncthreads() on the device, or do you have to call it from the host?

I think that's all I have for now. Answers to any questions would be super.
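
For question 1, here's the kind of 1D indexing I mean (just a sketch I typed up, not tested; the fill value is a stand-in for the random number):

// Row-major: a 4x4 matrix is a flat 16-element array, and element
// (row, col) lives at index row * 4 + col.  Going the other way,
// a flat thread index i gives row = i / 4 and col = i % 4.
__global__ void fillMatrix(float* M)
{
    int i   = threadIdx.x;  // 16 threads per block, one per element
    int row = i / 4;
    int col = i % 4;
    M[row * 4 + col] = (float)(row + col);  // stand-in for the random value
}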

I got another one.

If I have 3 matrices on the device A_d, B_d, C_d

I want to multiply A_d*B_d = C_d.

Assuming each has been cudaMalloc'ed correctly, when I am filling A_d and B_d, or filling C_d, will each block have the memory allocated separately in its own memory space? If not, is there a way to have each matrix exist on each block? With that in mind, is there a way to know which C_d I am getting back from the device?
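
What I'm picturing is something like this (a rough sketch; the sizes are made up):

// One big cudaMalloc per operand, sized for the whole batch.  Block b
// then owns the slice starting at b * N * N floats, so the answer to
// "which C_d" would just be "which offset into C_d".
int N = 8;               // matrix dimension, placeholder
int numMatrices = 1000;  // batch size, placeholder
size_t bytes = (size_t)numMatrices * N * N * sizeof(float);
float *A_d, *B_d, *C_d;
cudaMalloc((void**)&A_d, bytes);
cudaMalloc((void**)&B_d, bytes);
cudaMalloc((void**)&C_d, bytes);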

Sorry, I know these are CUDA noob questions, but I didn't want to put this in For Beginners because I wouldn't really consider CUDA a beginner subject.

Hi,

I don't really have time to concentrate on and answer all your questions, but I was just wondering if you've read the CUDA programming guide or gone through the samples in the SDK. There is a matrix multiplication example in the samples and it is explained nicely in the Programming Guide too.

You said that you would like to calculate each matrix multiplication in a separate block? Are you aware that blocks have size limits? I think the max number of threads per block is 512, meaning that your matrices would not be allowed more than 512 elements.

In the general case you wouldn't want to limit this to a single thread block; you would rather split the desired product (represented by a 2D grid) across multiple thread blocks.
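
By a 2D grid I mean a launch configuration roughly like this (a sketch only; matMulKernel and the dimensions are placeholders, and the SDK sample has the real thing):

// Each 16x16 thread block computes one 16x16 tile of C = A * B,
// and the grid is sized to cover the whole product matrix.
__global__ void matMulKernel(float* C, const float* A, const float* B,
                             int widthA, int widthC);  // defined elsewhere
int widthA  = 1024;  // placeholder dimensions
int widthC  = 1024;
int heightC = 1024;
dim3 block(16, 16);
dim3 grid((widthC  + block.x - 1) / block.x,
          (heightC + block.y - 1) / block.y);
matMulKernel<<<grid, block>>>(C_d, A_d, B_d, widthA, widthC);  // your device pointers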

I suggest taking a look at the example I mentioned and then asking any further questions you may have.

I am struggling to understand your questions though, especially in your second post. What do you mean by matrices existing in each block, exactly? And what do you mean by getting the _right_ C_d back from the device?

Quote:
Original post by rewolfer
I am struggling to understand your questions though, especially in your second post. What do you mean by matrices existing in each block, exactly? And what do you mean by getting the _right_ C_d back from the device?


OK, I forgot a major part: these are all 4x4 matrices being multiplied by 4x1 vectors. I'm going for quantity of multiplications rather than size of matrix.

It's not a very useful program, but I'm doing it to learn CUDA for a class, so it's kind of isolated in its usability.

The one in the CUDA Programming Guide seems better suited to multiplying huge matrices once. I'm kind of approaching the opposite problem.
_______________________________________________
Here's what I'm doing right now that's different from before.

I am declaring three 1D float arrays in shared memory; each thread in the block then randomly generates a number to be put in the arrays. Then I send them to another device function that actually does the multiplying and stores the result into a shared 16-element array.

Then, when I get the 16-element array, I add every four elements and store each sum in the result array. For example:

C[0] = C_shared[0] + C_shared[1] + C_shared[2] + C_shared[3];

That last bit feels less than optimal, but I was concentrating on getting the first part working before cleaning up the rest.
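
Putting it all together, the kernel is roughly the sketch below. (Reconstructing from memory here, so treat it as a sketch: the random fill is stubbed out with a placeholder, and the multiply is inlined rather than in its separate device function.)

__global__ void multiplyBatch(float* C)
{
    __shared__ float A_shared[16];  // 4x4 matrix
    __shared__ float B_shared[4];   // 4x1 vector
    __shared__ float C_shared[16];  // one product per thread

    int i = threadIdx.x;  // 16 threads per block, one per matrix element

    // Stand-in for the random generation: each thread fills one element.
    A_shared[i] = (float)(i + 1);
    if (i < 4)
        B_shared[i] = (float)(i + 1);
    __syncthreads();  // every element written before anyone multiplies

    // Thread i computes the single product A[row][col] * B[col].
    int row = i / 4;
    int col = i % 4;
    C_shared[i] = A_shared[row * 4 + col] * B_shared[col];
    __syncthreads();  // all 16 products done before the row sums

    // First four threads each add up one row of four products and write
    // this block's 4x1 result into its own slice of C.
    if (i < 4)
    {
        int base = i * 4;
        C[blockIdx.x * 4 + i] = C_shared[base]     + C_shared[base + 1]
                              + C_shared[base + 2] + C_shared[base + 3];
    }
}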

