
Realtime raytracing with OpenCL I

ne0_kamen


Hello,

With my university exams underway, I've been busy working on various course projects.
This semester I took a raytracing course. Raytracing is something I played with a while ago, so I thought I could get some credits without paying much attention (which matters, given my lack of time and motivation to go to lectures :D ).

Anyway, my course task is to "use OpenCL to write a real-time raytracer" - which turned out to be a lot of fun.
Since it's something I've never done before, and I believe some folks over here would find it interesting, I decided to write two short journal entries about the experience.




Introduction to OpenCL
OpenCL (Open Computing Language) is a hardware-independent language for performing computations in a massively parallel manner.
In general you can run it on anything that has drivers for it - GPU or CPU.
The language itself is very similar to C (not C++), with a few additions and restrictions.
It also provides a small standard library of functions (mainly math functions) and native vector types (float3/float4).
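To give a feel for the syntax, here is a tiny kernel that normalizes an array of float4 vectors using the built-in dot() and rsqrt() functions - purely an illustration, it is not part of the demo below:

// Illustration only: normalizes a list of float4 vectors.
// get_global_id(0) gives the index of the current work-item (explained later).
__kernel void normalize_vectors( __global float4* vectors )
{
    int i = get_global_id(0);
    float4 v = vectors[i];
    vectors[i] = v * rsqrt(dot(v, v)); // built-in math functions work on vector types
}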

In this journal entry, I will write and explain a small demo showing how to use OpenCL for general computations. The actual raytracer will be explained in the next entry.

Compiling and linking OpenCL applications
First you need to decide what hardware you will run your app on. Since I have an ATI 5650 GPU, I downloaded the AMD APP SDK from
http://developer.amd...es/default.aspx
By default this installs OpenCL.lib in C:\Program Files\AMD APP\lib\x86 and all headers in C:\Program Files\AMD APP\include\CL.
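If you build from the Visual Studio command prompt, the command might look roughly like this (assuming the source file is called main.cpp and the SDK is in the default install locations mentioned above):

cl /I"C:\Program Files\AMD APP\include" main.cpp /link "C:\Program Files\AMD APP\lib\x86\OpenCL.lib"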

Write the following application and link it against OpenCL.lib:

#include <CL/cl.h>

int main()
{
    // 1. Get a platform.
    cl_platform_id platform;

    clGetPlatformIDs( 1, &platform, NULL );
    return 0;
}


If it builds and links, proceed to the next step.

Building a minimal application
So now that you can use the OpenCL API, it's time to write a short OpenCL app.
We need two things: first, an "OpenCL kernel" - a small function written in the OpenCL language that is executed once per work-item, in parallel - and second, code that initializes the OpenCL library properly and passes data to the kernel - the data that our GPU will be processing in this case.

If you are interested in a detailed description of any OpenCL API function, visit http://www.khronos.o...ocs/man/xhtml/.
I'm going to keep this as brief as possible, but mostly this is boilerplate code you would probably want to copy/paste every time.

As you already saw, we first need to get an OpenCL platform:

// 1. Get a platform.
cl_platform_id platform;

clGetPlatformIDs( 1, &platform, NULL );

The first argument sets how many platforms we want (the second parameter can point to an array of that size), and the last parameter can be used to query how many platforms are available in total.
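For example, if you wanted to enumerate every platform on the system instead of just grabbing the first one, you could do something like this (a sketch, not used in the demo; it also needs #include <vector>):

// Sketch: first ask how many platforms exist, then fetch them all.
cl_uint numPlatforms = 0;
clGetPlatformIDs( 0, NULL, &numPlatforms );

std::vector<cl_platform_id> platforms( numPlatforms );
clGetPlatformIDs( numPlatforms, platforms.data(), NULL );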
After that we need to request a "device". A device is the actual piece of hardware we are going to use - in our case, the GPU:

// 2. Find a gpu device.
cl_device_id device;

clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

We pass our platform as the first parameter and the type of device we want as the second.
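If you want to double-check which device was picked, you can query its name with clGetDeviceInfo (a small sketch, not required for the demo):

// Sketch: print the name of the selected device.
char deviceName[256] = {0};
clGetDeviceInfo( device, CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL );
std::cout << "Using device: " << deviceName << std::endl;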
Now we need a context and a command queue.

// 3. Create a context and command queue on that device.
cl_context context = clCreateContext( NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue( context, device, 0, NULL );


If you are interested in the parameters, look at the specification.
Now we need to create the actual OpenCL program. It's advisable to store the source in a file and load it from there; let's call this file kernel.txt.
Here is how we are going to read its contents and load it:



// 4. Perform runtime source compilation and obtain the kernel entry point.
std::ifstream file("kernel.txt");
std::string source;
while(!file.eof()){
    char line[256];
    file.getline(line, 255);
    source += line;
    source += "\n"; // keep the line breaks, otherwise a // comment would swallow the next line
}
const char* str = source.c_str();
cl_program program = clCreateProgramWithSource( context, 1, &str, NULL, NULL );
cl_int result = clBuildProgram( program, 1, &device, NULL, NULL, NULL );
if ( result ){
    std::cout << "Error during compilation! (" << result << ")" << std::endl;
}


Note that I haven't done any error checking until now. However, the result of clBuildProgram should always be checked, because mistakes in the OpenCL code itself are detected here.
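When the build fails, you can also fetch the compiler's error messages with clGetProgramBuildInfo - a small sketch of how that might look:

// Sketch: print the build log when compilation fails.
size_t logSize = 0;
clGetProgramBuildInfo( program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize );

std::string log( logSize, '\0' );
clGetProgramBuildInfo( program, device, CL_PROGRAM_BUILD_LOG, logSize, &log[0], NULL );
std::cout << log << std::endl;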
Let's now load our "kernel" function - if the program compiled successfully, this should work.

cl_kernel kernel = clCreateKernel( program, "main", NULL );


OK, it's time to create the data we are going to pass into our kernel function, and the data we are going to use as the function's output.
This is an important step to remember, since it is the part you will want to modify for each application you write.
In this example I'm going to sum the contents of two arrays (of equal size) and display the result on the screen.
So I need three "OpenCL buffers".



// 5. Create data buffers.
cl_mem output  = clCreateBuffer( context, CL_MEM_WRITE_ONLY, 10*sizeof(cl_int), NULL, NULL );
cl_mem buffer1 = clCreateBuffer( context, CL_MEM_READ_WRITE, 10*sizeof(cl_int), NULL, NULL );
cl_mem buffer2 = clCreateBuffer( context, CL_MEM_READ_WRITE, 10*sizeof(cl_int), NULL, NULL );

// The argument indices match the parameter order of the kernel function.
clSetKernelArg(kernel, 0, sizeof(output),  (void*) &output);
clSetKernelArg(kernel, 1, sizeof(buffer1), (void*) &buffer1);
clSetKernelArg(kernel, 2, sizeof(buffer2), (void*) &buffer2);


Now we need to fill the input buffers that we are going to process.


// 6. Fill input data buffers.
cl_int *buffer1Ptr = (cl_int *) clEnqueueMapBuffer( queue, buffer1, CL_TRUE, CL_MAP_WRITE,
                                                    0, 10 * sizeof(cl_int), 0, NULL, NULL, NULL );

cl_int *buffer2Ptr = (cl_int *) clEnqueueMapBuffer( queue, buffer2, CL_TRUE, CL_MAP_WRITE,
                                                    0, 10 * sizeof(cl_int), 0, NULL, NULL, NULL );

for(int i = 0; i < 10; i++){
    buffer1Ptr[i] = i;
    buffer2Ptr[i] = i;
}

clEnqueueUnmapMemObject(queue, buffer1, buffer1Ptr, 0, NULL, NULL);
clEnqueueUnmapMemObject(queue, buffer2, buffer2Ptr, 0, NULL, NULL);



We pass the buffer as the second parameter to clEnqueueMapBuffer. The third parameter (CL_TRUE) means we want to "block" until the operation is complete.
This keeps things simple - otherwise we would have to use events to synchronize. Next we specify the operation we are performing on the buffer (CL_MAP_WRITE) and the offset and amount of memory we want to map.

clEnqueueMapBuffer returns a pointer through which we can modify the buffer directly (as in the for loop above); alternatively, we could memcpy data into it.
Finally we unmap the buffers, signalling that we are done writing to them.
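As a side note, another common way to get data into a buffer is clEnqueueWriteBuffer, which copies from a host array instead of mapping - a sketch of the equivalent upload:

// Sketch: upload from a plain host array instead of mapping the buffer.
cl_int hostData[10];
for(int i = 0; i < 10; i++)
    hostData[i] = i;

// CL_TRUE makes the call block until the copy has finished.
clEnqueueWriteBuffer( queue, buffer1, CL_TRUE, 0, 10 * sizeof(cl_int), hostData, 0, NULL, NULL );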

Next, we need to execute our "kernel" function:

size_t global_work_size = 10;

// 7. Execute the kernel.
clEnqueueNDRangeKernel( queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL );





We set the global work size (which is usually the same as the size of our output buffer).
Finally, we read the result and print it to the console:


// 8. Look at the results via a synchronous buffer map.
cl_int *resultBufferPtr = (cl_int *) clEnqueueMapBuffer( queue, output, CL_TRUE, CL_MAP_READ,
                                                         0, 10 * sizeof(cl_int), 0, NULL, NULL, NULL );

for(int i = 0; i < 10; i++){
    std::cout << "ptr[" << i << "] = " << resultBufferPtr[i] << std::endl;
}
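I don't bother with cleanup in this small demo, but in a real application you would unmap the result and release the OpenCL objects when you are done - roughly like this:

// Sketch: release everything once we are finished with it.
clEnqueueUnmapMemObject( queue, output, resultBufferPtr, 0, NULL, NULL );
clReleaseMemObject( output );
clReleaseMemObject( buffer1 );
clReleaseMemObject( buffer2 );
clReleaseKernel( kernel );
clReleaseProgram( program );
clReleaseCommandQueue( queue );
clReleaseContext( context );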



The full source of our program:



#include <CL/cl.h>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // 1. Get a platform.
    cl_platform_id platform;
    clGetPlatformIDs( 1, &platform, NULL );

    // 2. Find a GPU device.
    cl_device_id device;
    clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );

    // 3. Create a context and command queue on that device.
    cl_context context = clCreateContext( NULL, 1, &device, NULL, NULL, NULL );
    cl_command_queue queue = clCreateCommandQueue( context, device, 0, NULL );

    // 4. Perform runtime source compilation and obtain the kernel entry point.
    std::ifstream file("kernel.txt");
    std::string source;
    while(!file.eof()){
        char line[256];
        file.getline(line, 255);
        source += line;
        source += "\n";
    }
    const char* str = source.c_str();
    cl_program program = clCreateProgramWithSource( context, 1, &str, NULL, NULL );
    cl_int result = clBuildProgram( program, 1, &device, NULL, NULL, NULL );
    if ( result ){
        std::cout << "Error during compilation! (" << result << ")" << std::endl;
    }
    cl_kernel kernel = clCreateKernel( program, "main", NULL );

    // 5. Create data buffers.
    cl_mem output  = clCreateBuffer( context, CL_MEM_WRITE_ONLY, 10*sizeof(cl_int), NULL, NULL );
    cl_mem buffer1 = clCreateBuffer( context, CL_MEM_READ_WRITE, 10*sizeof(cl_int), NULL, NULL );
    cl_mem buffer2 = clCreateBuffer( context, CL_MEM_READ_WRITE, 10*sizeof(cl_int), NULL, NULL );
    clSetKernelArg(kernel, 0, sizeof(output),  (void*) &output);
    clSetKernelArg(kernel, 1, sizeof(buffer1), (void*) &buffer1);
    clSetKernelArg(kernel, 2, sizeof(buffer2), (void*) &buffer2);

    // 6. Fill input data buffers.
    cl_int *buffer1Ptr = (cl_int *) clEnqueueMapBuffer( queue, buffer1, CL_TRUE, CL_MAP_WRITE,
                                                        0, 10 * sizeof(cl_int), 0, NULL, NULL, NULL );
    cl_int *buffer2Ptr = (cl_int *) clEnqueueMapBuffer( queue, buffer2, CL_TRUE, CL_MAP_WRITE,
                                                        0, 10 * sizeof(cl_int), 0, NULL, NULL, NULL );
    for(int i = 0; i < 10; i++){
        buffer1Ptr[i] = i;
        buffer2Ptr[i] = i;
    }
    clEnqueueUnmapMemObject(queue, buffer1, buffer1Ptr, 0, NULL, NULL);
    clEnqueueUnmapMemObject(queue, buffer2, buffer2Ptr, 0, NULL, NULL);

    // 7. Execute the kernel.
    size_t global_work_size = 10;
    clEnqueueNDRangeKernel( queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL );

    // 8. Look at the results via a synchronous buffer map.
    cl_int *resultBufferPtr = (cl_int *) clEnqueueMapBuffer( queue, output, CL_TRUE, CL_MAP_READ,
                                                             0, 10 * sizeof(cl_int), 0, NULL, NULL, NULL );
    for(int i = 0; i < 10; i++){
        std::cout << "ptr[" << i << "] = " << resultBufferPtr[i] << std::endl;
    }
    return 0;
}


Now we need one more thing: the kernel file.
Open up your text editor and create kernel.txt with the following contents:

__kernel void main( __global int* dst, __global int* buffer1, __global int* buffer2 )
{
    dst[get_global_id(0)] = buffer1[get_global_id(0)] + buffer2[get_global_id(0)];
}



There are a couple of unfamiliar things above:
get_global_id(0) returns the id of the current work-item. Since we set our global work size to 10, the function is executed 10 times in parallel, with ids 0-9.
__kernel marks the function as a kernel that can be launched from the host.
__global means the memory is allocated in the global memory pool, visible to all work-items.
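As a teaser for the raytracer, the same mechanism extends to two dimensions: launch the kernel with work_dim = 2 and a size_t global_work_size[2] of width x height, and each work-item gets an (x, y) pair and can shade one pixel. A purely illustrative sketch, not part of this demo:

// Illustration only: one work-item per pixel.
__kernel void shade( __global float4* pixels )
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int width = get_global_size(0);

    // Write a simple gradient; the raytracer will compute a real color here.
    float fx = (float)x / (float)get_global_size(0);
    float fy = (float)y / (float)get_global_size(1);
    pixels[y * width + x] = (float4)( fx, fy, 0.0f, 1.0f );
}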

So that was it!
I hope somebody finds this helpful. I will post the raytracer code with explanations next week.