
Introduction to Game Programming with CUDA

By Dan Ricart | Published Nov 18 2013 01:55 PM in Game Programming
Peer Reviewed by (Dave Hunt, CRFaithMusic, Bacterius)

cuda c++ opengl direct3d

Intro to CUDA


Modern game engines have a lot going on. With so many subsystems competing for resources, multi-threading is a way of life. As multi-core CPUs have gotten cheaper, game developers have been able to take advantage of parallelism more easily. While Intel and AMD fight to bring more and more cores to the CPU, GPUs have long surpassed them in raw parallel ability. Modern GPUs contain thousands of cores, allowing tens of thousands of threads to execute code simultaneously. This presents game developers with yet another opportunity for parallelism. In separate threads, an engine may want to search or sort a large amount of data, pre-process trees, generate large amounts of random data, process an image, or perform calculations for transformations or collision detection. Any parallel computational task can be a good candidate for offloading to the GPU. This article aims to show you one possible way of harnessing that ability in a game using NVIDIA's CUDA.


CUDA is both a parallel computing platform and a programming model that allows code to run directly on the processing cores that make up modern GPUs. It was created by NVIDIA and is currently supported only on NVIDIA hardware. It is similar to OpenCL in concept but different in execution. Getting started with CUDA is as simple as having a recent NVIDIA graphics card and downloading the free toolkit from NVIDIA's developer site, which provides versions for Windows, Linux, and Mac OS X. While it is proprietary to NVIDIA, the programming model is easy to use, has bindings for languages such as C/C++, Java, and Python, and is even seeing support on ARM architectures. The CUDA syntax itself is based on C and so pairs well with games written in C or C++. The CUDA code you write is compiled to object code with NVIDIA's nvcc compiler and then linked with standard C code using gcc or Visual Studio to produce the final program. For simple programs, the same file can contain both your entry point and your CUDA function(s). After downloading and installing the toolkit, you can compile CUDA code from the command line with nvcc or through Visual Studio using the CUDA Runtime template, which makes it easy to combine standard C/C++ and CUDA code files in one project.
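For reference, a typical command-line build might look like the following (the file and program names here are just placeholders, and your CUDA library path may differ):

```shell
# single-file program: nvcc compiles both host and device code
nvcc -o add_demo add_demo.cu

# or compile device code and host code separately, then link
# (you may need -L pointing at the CUDA toolkit's library directory)
nvcc -c kernels.cu -o kernels.o
gcc -c main.c -o main.o
gcc main.o kernels.o -o game -lcudart -lstdc++
```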


To demonstrate CUDA with C, we can start with a simple addition function. All samples shown in this article were compiled with the CUDA 5.5 toolkit:


__global__ void cudaAdd(int a, int b, int *c)
{
	*c = a + b;
}

This function adds two numbers and stores the result in c. The __global__ qualifier marks it as a kernel: an entry point that runs on the GPU but can be launched from host code. Now we will see how to call it. This can be placed in the same file to create one complete program:


#include <stdio.h>
    
__global__ void cudaAdd(int a, int b, int *c)
{
	*c = a + b;
}

int main()
{
    int a = 4;
    int b = 7;
    int *c;
    int answer;
    cudaMalloc((void**)&c, sizeof(int));
    cudaAdd<<<1,1>>>(a, b, c);
    cudaMemcpy(&answer, c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(c); //release the device allocation
    printf("%d + %d = %d\n", a, b, answer);
    return 0;
}

Programs on CUDA are executed as kernels. A kernel can be run by just one thread or by thousands of threads at the same time. Since we are retrieving a result from the GPU, we first use cudaMalloc to allocate device memory for it. Next we launch our kernel, using the <<<blocks, threads>>> syntax to specify how many blocks and how many threads per block we want the kernel to use. The maximum number of threads per block depends on the specific architecture of your GPU; on Fermi GPUs you can execute up to 1024 threads per block. For this simple example we are just executing one thread on one block. Once the kernel has written the result to our c variable, we copy it back to system memory with cudaMemcpy. Finally we can display the result.


Performing a Reduce


With a simple example out of the way, we can look at a more common operation. A reduce is a parallel operation where data that exists across many threads is combined over a series of steps until a single value is held by one thread. A common example is computing a sum, where each step adds the values held by two different threads. After each step, fewer and fewer threads are used, until the final thread adds the last two values remaining and holds the sum. For this sample, we will demonstrate a program that has separate threads count the number of 5's in parts of an array and then performs a reduce to get the final total. This sample can be run over any number of blocks and threads:


__global__ void countFives(int *array, int size, int *total)
{
    int index = threadIdx.x;
    int totalThreads = blockDim.x * gridDim.x;
    int totalThreadIndex = (blockIdx.x * blockDim.x) + threadIdx.x;
    __shared__ int sharedCounts[512]; //one slot per thread; assumes blockDim.x <= 512

    //first determine how many elements each thread must count
    int chunk = (size / totalThreads);
    if (size % totalThreads > 0)
        chunk++;

    int start = totalThreadIndex * chunk;
    int end = start + chunk;
    if (end >= size)
        end = size;

    sharedCounts[index] = 0;
    //note: *total must be zeroed before launch (e.g. with cudaMemset);
    //zeroing it here from every thread would race with other blocks' atomicAdds

    //have each thread count its own elements and store in shared memory
    for (int i = start; i < end; i ++)
    {
        if (array[i] == 5)
        {
            sharedCounts[index]++;
        }
    }
    __syncthreads();
	
    //now perform a reduce to get the sum of all counts
    //the stride tells us how many elements to include at each level
    //each loop reduces the number of threads needed until only the first thread is used to capture the count
    for (int stride = 1; stride < blockDim.x; stride*=2)
    {
        int offset = index*(stride*2);
        if (offset + stride < blockDim.x)
        {
            sharedCounts[offset]+=sharedCounts[offset+stride];
        }
        //all threads must finish this level before the next one reads its results
        __syncthreads();
    }
	
    //now have the first thread of each block sum the results to global memory
    if (index == 0)
    {
        atomicAdd(total, sharedCounts[0]);
    }
}

This program has three basic steps. First we break the array into chunks and have each thread look for 5's in its own chunk. Then we perform a simple add reduction across the threads of each block, storing the running result in the block's shared memory. For the last step, the first thread of each block uses an atomic add to add its block's count to the global total; the atomic add prevents contention between blocks. The __syncthreads function used here acts as a barrier: all threads in a block must reach it before any can continue, which is what makes each level of the reduction safe to read. The example as a whole is inefficient, as only about half the threads do work at each reduction step, but it hopefully demonstrates the basic concept of a reduction. The following allocates memory for the array and calls the function:


const int size = 11;
int sourceArray[size] = { 1, 4, 5, 2, 5, 6, 8, 9, 5, 12, 5 };
int total; //stores final value we can examine
int *cudaTotal; //value to allocate for cuda to use
int *cudaArray;
cudaMalloc(&cudaTotal, sizeof(int));
cudaMalloc(&cudaArray, sizeof(int)*size);
//zero the running total on the device before the kernel runs
cudaMemset(cudaTotal, 0, sizeof(int));
//copy our source numbers to cuda before calling
cudaMemcpy(cudaArray, sourceArray, sizeof(int)*size, cudaMemcpyHostToDevice);
countFives<<<2,2>>>(cudaArray, size, cudaTotal);
//copy our result from the device to the program's memory
cudaMemcpy(&total, cudaTotal, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(cudaArray);
cudaFree(cudaTotal);

CUDA Thrust


A really useful library for common CUDA tasks is Thrust. Thrust is a C++ template library for CUDA that provides STL-like syntax to increase developer productivity. The CUDA toolkit ships with a version of Thrust ready to use from C++ code. The following demonstrates a sum reduction and a count of fives using the same array as above:


#include <thrust/device_vector.h>
#include <thrust/count.h>

    ...

    int source[11] = { 1, 4, 5, 2, 5, 6, 8, 9, 5, 12, 5 };
    //build the device vector from host data in one copy,
    //rather than one host-to-device transfer per element
    thrust::device_vector<int> thrustArray(source, source + 11);

    //compute the sum of all elements in our array
    int sum = thrust::reduce(thrustArray.begin(), thrustArray.end(), (int) 0, thrust::plus<int>());
    //get a count of just the 5's in our array
    int count = thrust::count(thrustArray.begin(), thrustArray.end(), 5);
    printf("Array sum: %d Count of fives: %d\n", sum, count);

As you can see, the syntax is very similar to the Standard Template Library and makes it very easy to call common functions, saving you lots of coding time. It also integrates well with STL vectors. For useful examples of what Thrust can do, browse the samples included with the toolkit and the Thrust project's documentation.


Integrating with OpenGL


A great feature of CUDA is its built-in ability to work with OpenGL directly. This gives a CUDA program easy access to data such as textures, pixel buffers, or vertex buffers so it can operate on them quickly. Here we will see how to use CUDA to alter data in a vertex buffer in parallel. The buffer shown here will be small and simple for demonstration purposes. I won't show all of the basic OpenGL setup or program layout; this sample will work with code from any basic OpenGL tutorial. I placed all my OpenGL code and the main game loop in one C file, and the CUDA kernel function plus a wrapper to call it in a separate file with a .cu extension.


To get started, first we need to define our simple data structures to use to create the vertex buffer:


struct vertex {
    float x;
    float y;
    float z;
};

struct VertexType
{
    vertex position;
    //texture coordinate and other information below
    ...
};

Next we want to allocate an array to use for our vertex buffer using the above structures and then generate a buffer. For this sample we will just allocate an array of four vertices to store a quad. We also need a global variable to store the ID of our vertex buffer:


    GLuint vbufferId;
    ...

    VertexType verts[4];
    verts[0].position.x = -1.0f; verts[0].position.y = 1.0f; verts[0].position.z = 0.0f;
    verts[1].position.x = 1.0f; verts[1].position.y = 1.0f; verts[1].position.z = 0.0f;
    verts[2].position.x = 1.0f; verts[2].position.y = -1.0f; verts[2].position.z = 0.0f;
    verts[3].position.x = -1.0f; verts[3].position.y = -1.0f; verts[3].position.z = 0.0f;

    //fill in texture coordinates, etc
    ...

    glGenBuffers( 1, &vbufferId );
    glBindBuffer( GL_ARRAY_BUFFER, vbufferId );
    glBufferData( GL_ARRAY_BUFFER, 4 * sizeof(VertexType), verts, GL_DYNAMIC_DRAW );

With a simple buffer created, we can now create a CUDA resource to store a pointer to our vertex buffer. We need another global variable to store our resource:


struct cudaGraphicsResource *cuda_vb_resource;

Then we can register it with our vertex buffer immediately after the glBufferData call above:


cudaGraphicsGLRegisterBuffer(&cuda_vb_resource, vbufferId, cudaGraphicsMapFlagsWriteDiscard);

The resource now holds a reference to the vertex buffer we created above, which lets us retrieve and modify the vertices using CUDA. The kernel to modify our vertices is very simple. Since we want to stretch our quad in all directions, we first get a value of +1 or -1 by dividing each coordinate by its absolute value. Then we multiply that by the elapsed time in seconds and by our desired rate of movement of .05 units per second.


__global__ void update_vb(VertexType *verts, double timeElapsed)
{
    int i = threadIdx.x;
    //divide each coordinate by its absolute value to get +1 or -1
    //(fabsf is the float version; plain abs would truncate to int)
    float valx = verts[i].position.x / fabsf(verts[i].position.x);
    float valy = verts[i].position.y / fabsf(verts[i].position.y);

    verts[i].position.x += valx * timeElapsed * .05f;
    verts[i].position.y += valy * timeElapsed * .05f;
}

I placed this code in a file separate from the main C file with the OpenGL code and gave it a .cu extension. Note that the program assumes each thread acts on exactly one vertex. It also assumes one block for simplicity, but you could easily execute this over multiple blocks if you had enough vertices. We use the index of the current thread to determine which vertex to operate on. We also use an elapsed-time variable to control how much change we apply in each loop. This keeps the movement rate constant even when frame rates vary and the elapsed-time delta keeps changing.


The last step now is to create a function to call our CUDA kernel. We can place this function in the same .cu file. The extern "C" specifier is needed because nvcc compiles .cu files as C++; it prevents name mangling so our main C program can find the function when linking.


extern "C" void cuda_kernel(VertexType *verts, double timeElapsed)
{
    update_vb<<<1,4>>>(verts, timeElapsed);
}

All the wrapper needs to do is pass along the arguments and tell CUDA how many blocks and threads to run. In this example we run 4 threads in one block, so each thread gets its own vertex. With the function in place, we can call it from the main logic loop. If you are using multiple files, declare the function's signature with extern "C" in your main C file so it can be found when linking. This code executes once per loop:


    VertexType *verts;
    size_t num_bytes;
    cudaGraphicsMapResources(1, &cuda_vb_resource, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&verts, &num_bytes, cuda_vb_resource);
    cuda_kernel(verts, timeElapsed);
    cudaGraphicsUnmapResources(1, &cuda_vb_resource, 0);

The code works by getting a pointer to the vertices in the vertex buffer that is mapped to our CUDA resource. The vertices are passed to the kernel wrapper to be modified, and then the resource is unmapped so OpenGL can use the buffer again. This sample assumes there is some code for computing the elapsed-time delta between this loop and the previous one; QueryPerformanceCounter works well for this on Windows. After clearing buffers and setting our texture, our render code looks like this:


    //bind the buffer first so the pointer offsets below are
    //interpreted relative to the buffer object
    glBindBuffer( GL_ARRAY_BUFFER, vbufferId );
    glEnableClientState( GL_VERTEX_ARRAY );
    glEnableClientState( GL_TEXTURE_COORD_ARRAY );
    glTexCoordPointer( 2, GL_FLOAT, sizeof(VertexType), (GLvoid*)offsetof( VertexType, texcoord ) );
    glVertexPointer( 3, GL_FLOAT, sizeof(VertexType), (GLvoid*)offsetof( VertexType, position ) );

    //now draw the array
    glDrawArrays(GL_QUADS, 0, 4);

    glDisableClientState( GL_TEXTURE_COORD_ARRAY );
    glDisableClientState( GL_VERTEX_ARRAY );

The last step is to free our resources:


    cudaGraphicsUnregisterResource(cuda_vb_resource);
    glBindBuffer(1, vbufferId);
    glDeleteBuffers(1, &vbufferId);

And that's it. OpenGL integration is fairly straightforward when dealing with buffers. This example can easily be extended to cover texture buffers, pixel buffers, or renderbuffers as well.


Integrating with Direct3D


Similar to its integration with OpenGL, CUDA provides the ability to tie in with Direct3D 9, 10, or 11. Here I will demonstrate the Direct3D 11 version of modifying a simple vertex buffer. Just like the OpenGL example, we will create a simple 2D quad that we resize in a game loop. We can use the same vertex structure from the OpenGL example, which lets us reuse the same CUDA kernel function as before:


    struct cudaGraphicsResource *cuda_vb_resource;
    VertexType *vertices;
    unsigned long *indices;
    ...
    D3D11_BUFFER_DESC vertexBufferDesc, indexBufferDesc;
    D3D11_SUBRESOURCE_DATA vertexData, indexData;
    HRESULT result;

    m_vertexCount = 4;
    m_indexCount = 6;

    vertices = new VertexType[m_vertexCount];
    if(!vertices)
    {
        return false;
    }
    indices = new unsigned long[m_indexCount];
    if(!indices)
    {
        return false;
    }

    vertices[0].position.x = -1.0f; vertices[0].position.y = -1.0f; vertices[0].position.z = 0.0f;
    vertices[1].position.x = -1.0f; vertices[1].position.y = 1.0f; vertices[1].position.z = 0.0f;
    vertices[2].position.x = 1.0f; vertices[2].position.y = 1.0f; vertices[2].position.z = 0.0f;
    vertices[3].position.x = 1.0f; vertices[3].position.y = -1.0f; vertices[3].position.z = 0.0f;
    //fill in other properties
    ...
    //fill in indices for 2 triangles
    indices[0] = 0; indices[1] = 1; indices[2] = 2;
    indices[3] = 0; indices[4] = 2; indices[5] = 3;

    //create a dynamic vertex buffer
    vertexBufferDesc.Usage = D3D11_USAGE_DYNAMIC;
    vertexBufferDesc.ByteWidth = sizeof(VertexType) * m_vertexCount;
    vertexBufferDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
    vertexBufferDesc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
    vertexBufferDesc.MiscFlags = 0;
    vertexBufferDesc.StructureByteStride = 0;
    vertexData.pSysMem = vertices;
    vertexData.SysMemPitch = 0;
    vertexData.SysMemSlicePitch = 0;

    result = device->CreateBuffer(&vertexBufferDesc, &vertexData, &m_vertexBuffer);
    if(FAILED(result))
    {
        return false;
    }

    //now create the index buffer
    indexBufferDesc.Usage = D3D11_USAGE_DEFAULT;
    indexBufferDesc.ByteWidth = sizeof(unsigned long) * m_indexCount;
    indexBufferDesc.BindFlags = D3D11_BIND_INDEX_BUFFER;
    indexBufferDesc.CPUAccessFlags = 0;
    indexBufferDesc.MiscFlags = 0;
    indexBufferDesc.StructureByteStride = 0;
    indexData.pSysMem = indices;
    indexData.SysMemPitch = 0;
    indexData.SysMemSlicePitch = 0;

    result = device->CreateBuffer(&indexBufferDesc, &indexData, &m_indexBuffer);
    if(FAILED(result))
    {
        return false;
    }

With the buffers created we can associate the resource and our vertex buffer like we did with OpenGL:


    cudaGraphicsD3D11RegisterResource(&cuda_vb_resource, m_vertexBuffer, cudaGraphicsRegisterFlagsNone);

Finally our rendering code looks like this:


    unsigned int stride = sizeof(VertexType); 
    unsigned int offset = 0;
    
    deviceContext->IASetVertexBuffers(0, 1, &m_vertexBuffer, &stride, &offset);
    deviceContext->IASetIndexBuffer(m_indexBuffer, DXGI_FORMAT_R32_UINT, 0);
    deviceContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);

With that up and running, we can call the update from inside a game loop just like with OpenGL. The example I wrote used the exact same kernel and external wrapper function from the OpenGL example:


    VertexType *verts;
    size_t num_bytes;

    cudaGraphicsMapResources(1, &cuda_vb_resource, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&verts, &num_bytes, cuda_vb_resource);
    cuda_kernel(verts, elapsedTime);
    cudaGraphicsUnmapResources(1, &cuda_vb_resource, 0);

Lastly we need to clean up:


    cudaGraphicsUnregisterResource(cuda_vb_resource);
    if(m_indexBuffer)
    {
        m_indexBuffer->Release();
        m_indexBuffer = 0;
    }

    if(m_vertexBuffer)
    {
        m_vertexBuffer->Release();
        m_vertexBuffer = 0;
    }

    delete [] vertices;
    delete [] indices;

Now we have seen some basic examples of how to create CUDA programs and how they can directly interact with data from OpenGL or Direct3D. These examples are pretty basic but hopefully provide a springboard to more advanced concepts. The SDK is loaded with useful samples that demonstrate the power and flexibility of the toolkit.




About the Author(s)


I am a hobbyist game programmer who also dabbles in HPC and distributed computing.

License


GDOL (Gamedev.net Open License)




Comments

Something I was wondering, are there any advantages to using CUDA vs using OpenCL?

Con: Only Nvidia cards
Pro: Easier to program with compared to OpenCL and easier to learn.

A very interesting article. Keep it up!

 

I want moar! (My way of saying that I would be deeply interested in some more.) :)

Great article - I was left wondering what aspects of Game Programming are successfully implemented utilizing GPGPU techniques (over and above cutting-edge graphics of course)...

Con: Only Nvidia cards
Pro: Easier to program with compared to OpenCL and easier to learn.

 

I actually found it easier to learn and code with OpenCL. There are other more important pros, like ability to use inline assembly (PTX) in CUDA. Another important pro is ability to use classes in CUDA (I am not sure if you can do it in OpenCL).

Nice article. I've been meaning to look at CUDA and hadn't gotten around to it. This was a good first look for me.

 

Just one minor nitpick: the singular of vertices is vertex, not vertice.

 

Another important pro is ability to use classes in CUDA (I am not sure if you can do it in OpenCL).

Not in OpenCL 1.2, though there is an AMD-only extension that lets you use C++-like classes and templates (subject to the usual restrictions, e.g. no virtual functions and so on). Perhaps there is something ready for the OpenCL 2.0 standard, though, which I have been told focuses on making better use of the GPU through its relatively generic interface (since remember OpenCL needs to work on many different types of hardware and not just NVIDIA cards, so some API concepts may not easily map 1:1 to how GPU's work under the hood, limiting the potential for optimization).

 

 


Thanks for clarification:) I was using mostly 1.1 because of destination hardware of application I was working on, but I can't wait for 2.0 :D

So how to port a CUDA game to android? lol joking :) How can I measure the time elapsed between a function's start and its end? I want to test a parallel algorithm, try variations on it, and see which variation is faster.


Thanks to this article I've been looking through the Khronos Group website (CUDA seems fun but is limited to NVIDIA hardware only, so I began researching OpenCL) and found the official C++ specification for OpenCL 1.2. It's just a C++ wrapper for the C API, but at least you get some OOP out of it. Sadly I don't see anything like that for OpenCL 2.0 for the time being. I'm posting a link below:

 

http://www.khronos.org/registry/cl/

 


2.0 isn't available yet (the planned spec is up there though) so it's normal to not see any headers for 2.0 yet - no public implementations exist at this time. Soon, hopefully :)

CUDA provides event classes that can be used to record starting and completion times -- in the docs, look for functions that start with cudaEvent... 

Very soon, I hope. The specification just went official. :)

 

https://www.khronos.org/news/press/khronos-finalizes-opencl-2.0-specification-for-heterogeneous-computing

Nice article. I learned CUDA, and I have to admit the logic of some CUDA programs can be difficult to follow; however, it's a pleasure.

Need a bigger real-world example, like dynamic collision detection (possibly including ricochet paths?) for masses of projectiles, probably against derived 3D terrain data (with dynamic parts added). It may need temporal persistence for projectiles that take many turns to transit their paths (and dynamic objects can move into/out of the way).

