Making a realtime raytracer interactive?


Hi

For the past few days I have been writing a realtime raytracer. I get 150fps with Blinn-Phong shading, one reflection ray and a gradient sky. It drops to about 30fps with 9x anti-aliasing, which is to be expected, and I am on a really crappy laptop.

Anyway, it has made me want to turn it into a simple ray traced game. Rendering the frames is the easy part, but making it interactive slows it way down. It is fast because of OpenCL/OpenGL interop: sharing a texture between the two eliminates the reading and writing between CPU and GPU each frame. Since I can't see a way to update the position of the camera or objects via user input without sending info from the host, and sending that info slows it down, I am at a loss.

Is this the bottleneck everyone faces? Are there any workarounds? Do other platforms do things differently/better?

Curious to see what other people have done in this area.

Thanks


Since I can't see a way to update the position of the camera or objects via user input without sending info from the host, and sending that info slows it down, I am at a loss.

You can upload your data for next_frame+1 instead of for next_frame (or even next_frame+2, if you use triple buffering).

Basically it's a trade-off between frequency (FPS) and latency (lag), but usually you can get rid of any slowdown in practice without noticeable lag. (It becomes more of an issue with VR, where lag is more likely to be noticed.)
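Rough sketch of what I mean (untested; SceneParams and update_params_from_input are just placeholders, and queue / kernel / globalSize are assumed to be set up already):

typedef struct { cl_float cam[4]; cl_float sphere[4]; } SceneParams;

SceneParams hostParams[2];             /* CPU copies, one per slot               */
cl_mem      paramBuf[2];               /* two small GPU buffers (clCreateBuffer) */
cl_event    uploadDone[2] = { 0, 0 };
int         cur = 0;                   /* slot the kernel reads from this frame  */

/* upload something valid into slot 0 once, before the loop */
update_params_from_input(&hostParams[0]);
clEnqueueWriteBuffer(queue, paramBuf[0], CL_TRUE, 0, sizeof(SceneParams), &hostParams[0], 0, NULL, NULL);

for (;;)                               /* per-frame loop */
{
    int next = 1 - cur;

    /* make sure the previous upload into this slot has finished before touching it */
    if (uploadDone[next]) { clWaitForEvents(1, &uploadDone[next]); clReleaseEvent(uploadDone[next]); }

    /* build the parameters for frame N+1 and queue a non-blocking upload */
    update_params_from_input(&hostParams[next]);
    clEnqueueWriteBuffer(queue, paramBuf[next], CL_FALSE, 0,
                         sizeof(SceneParams), &hostParams[next],
                         0, NULL, &uploadDone[next]);

    /* render frame N from the slot that was uploaded last iteration */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &paramBuf[cur]);
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL, 0, NULL, NULL);

    cur = next;
}

So the camera always lags one frame behind the input, which is the latency trade-off mentioned above.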

Additionally, modern APIs allow you to upload data while both CPU and GPU are busy with other things; OpenGL probably has this too - not sure.

Hey there.

Updating positions and a couple of global/constant variables on the GPU won't be a bottleneck if you do things properly, i.e. updating everything in one go, and such. If you were writing a traditional, rasterized game, the same step (or even more!) would be done anyway.
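For example, the kernel side could look roughly like this (just a sketch, the struct layout is made up) - the host then updates the whole thing with one small write per frame:

/* one small struct holds everything that changes per frame */
typedef struct
{
    float4 camPos;
    float4 camDir;
    float4 spherePosRadius;   /* xyz = centre, w = radius */
    float  time;
    float  pad[3];            /* keep host and device struct layouts in sync */
} FrameParams;

__kernel void trace(__write_only image2d_t outImage,
                    __constant FrameParams* params)
{
    int2 pix = (int2)(get_global_id(0), get_global_id(1));

    /* ... build the primary ray from params->camPos / params->camDir,
       intersect params->spherePosRadius, shade ... */

    write_imagef(outImage, pix, (float4)(0.0f, 0.0f, 0.0f, 1.0f));
}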

Unless you are planning to create a game with a bouncing reflective sphere and a procedurally textured infinite plane, you'll have other factors that significantly slow down your refresh rate: things like updating BVHs, spending way more time tracing rays, evaluating materials, and so on.

I think you are worrying about factors that won't be a significant cause of slowdowns at later stages.

shaken, not stirred

Updating positions and a couple of global/constant variables on the GPU won't be a bottleneck if you do things properly.

I don't see how it can be done any faster.

I did a simple experiment where I ray traced one sphere: no reflections, no shading, no lighting, no anti-aliasing, just a ray-sphere intersection for each pixel per frame. Doing one read-buffer and one write-buffer call each frame in OpenCL produced this simple image at 100fps. By eliminating the read and write and using GL/CL interop, it did 1200fps.

Rendering one sphere and one plane with anti-aliasing and reflections drops it down to 30fps. Doing it with one read and one write takes it below 1fps. This I thought was good, considering the same image took over two minutes to render one frame ten years ago. Obviously realtime raytracing is future technology and computers need to improve a fair bit before we have a full RT raytracer with a million triangles, radiosity and the rest, but in the meantime this little test shows that the GPU has no problem rendering. The real problem, as far as this little example is concerned, is moving memory between global and host space. I understand that when the scene gets more complex the host-to-GPU update will become less significant, but it is the most significant factor when it comes to moving this particular camera around the red sphere above.

Rendering one sphere and one plane with anti-aliasing and reflections drops it down to 30fps. Doing it with one read and one write takes it below 1fps.

You mean this 'one read and one write' is just camera and sphere position, but it brings you down to 1 fps? No typo?

There must be something wrong.

I guess you give them as a kernel argument, or upload a buffer accessed by the kernel. Both should go far beyond 60 FPS, and even for that simple scene tracing should be the bottleneck.
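For the kernel argument route it's literally just this much per frame (sketch; camX/camY/camZ stand for whatever your input code produces):

cl_float4 camPos = { { camX, camY, camZ, 0.0f } };       /* host side      */
clSetKernelArg(kernel, 1, sizeof(cl_float4), &camPos);   /* arg 1 = float4 */
/* matching kernel signature: __kernel void trace(__write_only image2d_t img, float4 camPos) */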

I'd look for an OpenCL/OpenGL interop example with source, probably available from the AMD or NV SDKs, if you suspect the problem is in the data sharing.

Host<->Device communication is fast enough - it can be measured in gigabytes per second in some cases.
The problem is that it also has extremely high latencies... and all GL/CL commands that ask the device to do some work also have extremely high latencies (milliseconds to dozens of milliseconds).

You need to structure all of your data flow and processing so that it forms a pipeline with no stalls. Data must be sent well ahead of when it will be consumed, and care must be taken that you're not accidentally causing the API to synchronise the host and device, while still synchronising things enough to avoid a race condition between host and device...
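In OpenCL terms the difference looks roughly like this (sketch, placeholder names):

/* stalled version: a blocking write plus clFinish idles the CPU every frame */
clEnqueueWriteBuffer(queue, paramBuf, CL_TRUE, 0, sizeof(params), &params, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL, 0, NULL, NULL);
clFinish(queue);        /* full host/device sync - the stall */

/* pipelined version: non-blocking write, flush to get the GPU going, then do
   CPU work for the next frame instead of waiting (note: &params must stay
   valid until the write has completed) */
clEnqueueWriteBuffer(queue, paramBuf, CL_FALSE, 0, sizeof(params), &params, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL, 0, NULL, NULL);
clFlush(queue);         /* submit, don't wait */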

Updating positions and a couple of global/constant variables on the GPU won't be a bottleneck if you do things properly.

I don't see how it can be done any faster.

I did a simple experiment where I ray traced one sphere: no reflections, no shading, no lighting, no anti-aliasing, just a ray-sphere intersection for each pixel per frame. Doing one read-buffer and one write-buffer call each frame in OpenCL produced this simple image at 100fps. By eliminating the read and write and using GL/CL interop, it did 1200fps.

Rendering one sphere and one plane with anti-aliasing and reflections drops it down to 30fps. Doing it with one read and one write takes it below 1fps. This I thought was good, considering the same image took over two minutes to render one frame ten years ago. Obviously realtime raytracing is future technology and computers need to improve a fair bit before we have a full RT raytracer with a million triangles, radiosity and the rest, but in the meantime this little test shows that the GPU has no problem rendering. The real problem, as far as this little example is concerned, is moving memory between global and host space. I understand that when the scene gets more complex the host-to-GPU update will become less significant, but it is the most significant factor when it comes to moving this particular camera around the red sphere above.

Specify what read and write means in your case. Maybe post a few code snippets here. Using CL interop in this case is super confusing. We were talking about passing the parameters that describe the scene (i.e. the ones you have to change based on the user's input) to the kernel. You can't avoid that: user input has to be processed on the CPU, and the scene changes have to be pushed back to the GPU somehow.

shaken, not stirred

CL/GL interop just means that, in the above example, I am using an OpenGL texture, using OpenCL to change that texture, and giving it back to OpenGL to render it onto a quad that covers the screen. No user input, no GPU->CPU reading or writing; it is all done on the GPU and is super fast. The position data is stored in global GPU memory and is updated each frame by a kernel.

If I want to update the position via the keyboard, I have to send the data to the kernel each frame from the host.
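Simplified, a frame with interop plus that host-side camera update would look something like this (not my exact code; names are placeholders):

cl_float4 camPos = get_camera_from_input();             /* placeholder: keyboard/mouse */

glFinish();                                             /* make sure GL is done with the texture */
clEnqueueAcquireGLObjects(queue, 1, &clTexture, 0, NULL, NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &clTexture);
clSetKernelArg(kernel, 1, sizeof(cl_float4), &camPos);  /* the per-frame "user input" */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL, 0, NULL, NULL);

clEnqueueReleaseGLObjects(queue, 1, &clTexture, 0, NULL, NULL);
clFinish(queue);                                        /* make sure CL is done before GL samples it */

draw_fullscreen_quad();                                 /* placeholder: GL draws the textured quad */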

You mean this 'one read and one write' is just camera and sphere position, but it brings you down to 1 fps? No typo?

Correct.

clEnqueueWriteBuffer and clEnqueueReadBuffer.

These two commands have a lot of overhead and there isn't any way around them.

I just made a test in my project (I'm not using interop, and no double buffering either).

I upload a buffer (2048 bytes) with a blocking enqueue and immediately use it in a kernel, so worst case the upload cost is about 5ms per frame.

That's a lot, but your case still seems 10 times worse.

My upload looks like this:

status |= clEnqueueWriteBuffer (queues[0], buffers[bDBG], CL_TRUE, 0, bufferSize[bDBG], dst, 0, NULL, NULL); /* CL_TRUE = blocking write */

and buffer creation:

buffers[bDBG] = clCreateBuffer (factory.context, CL_MEM_READ_WRITE, bufferSize[bDBG], NULL, &status);

My hardware is an i7 930 and a Fury X.

Maybe the large update lag is indeed normal for your system.

So I suggest you try double/triple buffering to hide the latency.

Of course you need to do it non-blocking and handle events to sync upload and usage.
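That is, roughly (untested sketch; paramBuf and params are placeholders):

cl_event uploadDone;

clEnqueueWriteBuffer(queue, paramBuf, CL_FALSE, 0,      /* CL_FALSE = don't block the host */
                     sizeof(params), &params, 0, NULL, &uploadDone);

clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL,
                       1, &uploadDone, NULL);           /* kernel waits for the upload */
clReleaseEvent(uploadDone);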

Additionally, modern APIs allow you to upload data while both CPU and GPU are busy with other things

I think to do this with OpenCL, you need to create an additional queue and use it for buffer writes.
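Roughly like this (sketch; context and device are the same ones used for the main queue):

cl_command_queue transferQueue = clCreateCommandQueue(context, device, 0, &err);

cl_event uploadDone;
clEnqueueWriteBuffer(transferQueue, paramBuf, CL_FALSE, 0,
                     sizeof(params), &params, 0, NULL, &uploadDone);

/* the two queues are not ordered against each other, so the compute queue
   has to wait on the event explicitly */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL,
                       1, &uploadDone, NULL);
clReleaseEvent(uploadDone);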

Somehow I doubt this will be a huge win, especially for small data.

Why do you need the read? In your case you only need the write.

shaken, not stirred

