Realtime raytracing on a 2014 GPU


Heyo!

For the last few months I've been working on a realtime raytracer (like everyone currently), but I've been trying to make it work on my graphics card, an NVIDIA GTX 750 Ti - a good card, but not an RTX or anything :P... So I figured I'd post my results, since they're kinda cool, and I'm also interested to see if anyone has ideas on how to speed it up further :)

Here's a dreadful video showcasing some of what I have currently:

I've sped it up a tad and fixed reflections since then, but eh, it gets the gist across :P. If you're interested in trying out a demo or checking out the shader source code, I've attached a Windows build (FlipperRaytracer_2019_02_25.zip). I develop on Linux so it's not as well tested as I'd like, but it works on an iffy laptop I have, so hopefully it'll be alright XD. You can change the resolution and whether it starts in fullscreen via a config file next to the executable, and in the demo you can fly around, change the lighting setup and adjust various parameters like the frame blending (increase samples), and disable GI, reflections, etc. If anyone tests it out I'd love to know what sort of timings you get on your GPU :)

But yeah, currently I can achieve about 330 million rays a second, enough to shoot 3 incoherent rays per pixel at 1080p at 50fps - so not too bad overall. I'm really hoping to bump this up a bit further to 5 incoherent rays at 60fps... but we'll see :P

 

I'll briefly describe how it works now :). Each render loop it goes through these steps:

  • Render the scene into a 3D texture (Voxelize it)
  • Generate an acceleration structure akin to an octree from that
  • Render the GBuffer (I use a deferred rendering approach)
  • Calculate lighting by raytracing a few rays per pixel
  • Blend with previous frames to increase sample count
  • Finally output with motion blur and some tonemapping

Pretty much the most obvious way to do it all :P

So the main reason it's quick enough is the acceleration structure, which is kinda cool in how simple yet effective it is. At first I tried distance fields, which, while really efficient to step through, just can't be generated fast enough in real time (I could only get it down to 300ms for a 512x512x512 texture). Besides, I wanted voxel-accurate casting anyway (blocky artifacts look so good...), so I figured I'd start there. Doing an unaccelerated raycast against a voxel texture is simple enough: cast a ray and test against every voxel it intersects, stepping through the grid voxel by voxel with a line-stepping algorithm like DDA. The cool thing is, by generating mipmaps of the voxelized scene it's possible to take differently sized steps, by checking which is the lowest-resolution mipmap that still has empty space at the current position. This can be precomputed into a single texture, so that information costs just one sample during traversal. I've found this gives pretty similar raytracing speed to the distance fields, but can be generated in 1-2ms, ending up with a texture like this (a 2D slice):

[Image: a 2D slice of the acceleration structure texture]

It also has some nice properties: if the ray is cast directly next to and parallel to a wall, instead of moving tiny amounts each step (as with a distance field saying it's super close to something) it'll move... an arbitrary amount depending on where the wall falls on the grid :P. Still, the worst case is the same as the distance field and its best case is much better, so it's pretty neat :P
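In shader terms, the skipping looks something like this - an untested GLSL sketch of the idea rather than the demo's actual code (emptyLevelTex is a made-up name, assumed to store per voxel the size exponent of the largest guaranteed-empty block around it, with 0 meaning the voxel itself is solid):

```glsl
// Untested sketch of the mip-skip traversal idea, not the demo's exact code.
// Assumes dir is normalized with no exactly-zero components (the usual DDA caveat).
uniform sampler3D emptyLevelTex; // R8: value * 255 = exponent of largest empty block (0 = solid)

const int GRID_SIZE = 512;
const int MAX_STEPS = 256;

bool traceVoxelRay(vec3 origin, vec3 dir, out ivec3 hitVoxel)
{
    vec3 pos = origin;          // position in voxel units, [0, GRID_SIZE)
    vec3 invDir = 1.0 / dir;

    for (int i = 0; i < MAX_STEPS; ++i)
    {
        if (any(lessThan(pos, vec3(0.0))) ||
            any(greaterThanEqual(pos, vec3(GRID_SIZE))))
            return false;                               // left the grid

        ivec3 voxel = ivec3(floor(pos));
        float level = texelFetch(emptyLevelTex, voxel, 0).r * 255.0;
        if (level < 0.5)                                // nothing empty here: we hit a voxel
        {
            hitVoxel = voxel;
            return true;
        }

        // The whole 2^level block around us is empty, so jump to its far side.
        float cell = exp2(level);
        vec3 cellMin = floor(pos / cell) * cell;
        vec3 t0 = (cellMin - pos) * invDir;
        vec3 t1 = (cellMin + cell - pos) * invDir;
        vec3 tFar = max(t0, t1);
        float tExit = min(min(tFar.x, tFar.y), tFar.z);

        pos += dir * (tExit + 0.001);                   // small epsilon to cross the boundary
    }
    return false;
}
```

The nice part is that the whole "how far can I safely skip?" question is answered by that single texelFetch.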

So then for the raytracing I use some importance sampling, directing rays towards the lights. I find that just picking one random light per pixel and shooting a ray towards it looks good enough, and lets me have as many lights as I need without changing the framerate (only the noise). Then I throw a random ray to calculate GI/other lights, and a ray for reflections. The global illumination works pretty simply too: when voxelizing the scene I throw some rays out from each voxel, and since those rays trace against the voxelized scene itself, each frame I get an additional bounce of light :D. That said, I found that a bit slow, so I have an intermediate step where I actually render the objects into a low-resolution lightmap, which is where those raycasts take place, and when voxelizing I just sample the lightmap. This also theoretically gives me a fallback in case a computer can't handle raytracing every pixel, or the voxel field isn't large enough to cover an entire scene (although currently the lightmap is... iffy... I wouldn't use it for that yet XD).
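To make the "one random light per pixel" part concrete, here's a minimal sketch of what I mean (made-up names, not the demo's shader; the world-to-voxel transform and the maximum shadow-ray distance are left out for brevity):

```glsl
// One-light-per-pixel importance sampling sketch (hypothetical names).
struct PointLight { vec3 position; vec3 colour; };

layout(std430, binding = 0) buffer Lights { PointLight lights[]; };
uniform int lightCount;

bool traceVoxelRay(vec3 origin, vec3 dir, out ivec3 hitVoxel); // traversal from the sketch above

vec3 sampleOneLight(vec3 worldPos, vec3 normal, float rand)
{
    // Pick one light with probability 1/lightCount...
    int i = min(int(rand * float(lightCount)), lightCount - 1);

    vec3 toLight = lights[i].position - worldPos;
    float dist2 = dot(toLight, toLight);
    toLight = normalize(toLight);
    float ndotl = max(dot(normal, toLight), 0.0);
    vec3 unshadowed = lights[i].colour * ndotl / max(dist2, 1e-4);

    // ...one shadow ray against the voxel field...
    ivec3 hit;
    bool blocked = traceVoxelRay(worldPos, toLight, hit);

    // ...and weight by lightCount so the estimate stays unbiased. The cost per
    // pixel is constant no matter how many lights there are; only the noise grows.
    return blocked ? vec3(0.0) : unshadowed * float(lightCount);
}
```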

And yeah, then I use the usual temporal anti-aliasing technique to increase the sample count and anti-alias the image. I previously had a texture that kept track of how many samples had been taken per pixel, resetting when viewing a previously unviewed region, and used that to properly average the samples (so it converged much faster / actually did converge...) rather than using the usual exponential blending. That said, I had some issues integrating any sort of sample discarding with the anti-aliasing, so currently I just let everything smear like crazy XD. I think the fix is to keep the temporal supersampling and the temporal anti-aliasing as separate passes, so I might try that out. That should improve the smearing and noise significantly... I think XD
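The sample-counting accumulation I'm describing is basically this (a simplified sketch of what I had, not the exact code; the alpha channel of the history buffer holds the per-pixel sample count):

```glsl
// Temporal accumulation with a per-pixel sample count (sketch).
uniform sampler2D historyTex;   // rgb = accumulated colour, a = sample count
uniform sampler2D currentTex;   // this frame's noisy result
uniform float maxSamples;       // cap, e.g. 64

vec4 accumulate(vec2 uv, vec2 reprojectedUv, bool historyValid)
{
    vec3 current = texture(currentTex, uv).rgb;
    vec4 history = texture(historyTex, reprojectedUv);

    // Disocclusion / failed reprojection: restart instead of smearing.
    float sampleCount = historyValid ? min(history.a + 1.0, maxSamples) : 1.0;

    // Blend weight 1/N gives linear convergence early on, unlike a fixed
    // exponential factor which never fully converges.
    float alpha = 1.0 / sampleCount;
    return vec4(mix(history.rgb, current, alpha), sampleCount);
}
```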

 

Hopefully some of that made sense and was interesting :). Please ask any questions you have - I can probably explain it better haha. I'm curious to know what anyone thinks, and of course any ideas to speed it up or develop it further are very much encouraged :D.

Oh, I'm also working on some physics simulations, so you can create a realtime cloth in the demo by pressing C - just the usual position-based dynamics stuff. Questions on that are open too :P.

 

FlipperRaytracer_2019_02_25.zip


Super awesome!

I get 83 fps on a Fury X on a 1440p screen :O

Keep us updated... :)

You could try Spatiotemporal Variance-Guided Filtering (SVGF) to make it much 'faster'? I guess you know the Quake 2 demo is open source, but there's also a project on this site: http://cwyman.org/papers.html

Very impressive work! :D

 

Very cool, thanks for explaining how it works! I played around with some similar techniques (ray-tracing voxels using an SDF to accelerate the trace) about a year ago, and it was a fun little project. I was mainly interested in doing it in the context of real-time/interactive lightmap baking on the GPU, and it seemed to work really well for that. But yeah, even with jump flooding, building that SDF got really slow. The voxel grid could also end up using quite a bit of memory. But still very promising!

How do you voxelize a scene into a texture? I wonder how big this 3D texture is - I can't really imagine one texture handling that much detail.

13 hours ago, spike1 said:

The global illumination works pretty simply too: when voxelizing the scene I throw some rays out from each voxel, and since those rays trace against the voxelized scene itself, each frame I get an additional bounce of light :D

If you do this, you can get infinite bounces for free (not just a single one).

I do it this way using surfels instead of voxels, tracing only from the surfels and caching irradiance to get realtime lightmaps. The only downside is the spatial discretization, so for full detail at geometric frequencies below the voxel resolution you'd still need to trace from screenspace as well.

On 2/25/2019 at 3:49 AM, JoeJ said:

Super awesome!

I get 83 fps on a Fury X on a 1440p screen :O

You could try Spatiotemporal Variance-Guided Filtering (SVGF) to make it much 'faster'?

Thanks for timing it, that's awesome :D As for SVGF, I've kinda avoided using any sort of spatial denoising - I usually find the blobby noise artefacts distracting. That said, I had a go at implementing a precursor to that paper (the edge-avoiding À-Trous wavelet filter, which combined with the temporal supersampling I have is pretty close to the one you mentioned), and wow, it does a great job of smoothing the bounce lighting!

[Screenshot: bounce lighting after the À-Trous filter pass]

I think denoising that separately to the direct light might work well, if I can speed up the filter (4 passes of a 5x5 kernel at 1080p with the edge awareness costs about 15ms...yikes). Even in that paper the filter costs 10ms on a Titan X, but could be good for fully clearing up the image on higher end GPUs.
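For reference, one pass of what I implemented looks roughly like this - a sketch along the lines of the edge-avoiding À-Trous paper rather than my exact shader (run it four times with stepSize = 1, 2, 4, 8 to get the large effective kernel cheaply):

```glsl
// One edge-aware À-Trous pass (sketch).
uniform sampler2D colourTex;
uniform sampler2D normalTex;
uniform sampler2D depthTex;
uniform int stepSize;        // 1, 2, 4, 8 over successive passes
uniform float sigmaNormal;   // e.g. 32.0
uniform float sigmaDepth;    // e.g. 1.0

const float kernel[3] = float[](3.0 / 8.0, 1.0 / 4.0, 1.0 / 16.0); // B3 spline weights

vec3 atrousPass(ivec2 coord)
{
    vec3 centreNormal = texelFetch(normalTex, coord, 0).xyz;
    float centreDepth = texelFetch(depthTex, coord, 0).r;

    vec3 sum = vec3(0.0);
    float weightSum = 0.0;

    for (int y = -2; y <= 2; ++y)
    for (int x = -2; x <= 2; ++x)
    {
        ivec2 tap = coord + ivec2(x, y) * stepSize;
        float h = kernel[abs(x)] * kernel[abs(y)];

        // Edge-stopping weights: taps across depth or normal discontinuities
        // contribute little, so lighting doesn't bleed over geometry edges.
        vec3 n = texelFetch(normalTex, tap, 0).xyz;
        float wNormal = pow(max(dot(centreNormal, n), 0.0), sigmaNormal);

        float d = texelFetch(depthTex, tap, 0).r;
        float wDepth = exp(-abs(centreDepth - d) / sigmaDepth);

        float w = h * wNormal * wDepth;
        sum += texelFetch(colourTex, tap, 0).rgb * w;
        weightSum += w;
    }
    return sum / max(weightSum, 1e-4);
}
```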

On 2/25/2019 at 10:18 AM, _WeirdCat_ said:

How do you voxelize a scene into a texture? I wonder how big this 3D texture is - I can't really imagine one texture handling that much detail.

Yeah, memory is a problem with these sorts of techniques, so I do cheat a bit. I find I can store the colour data at half resolution without it affecting the diffuse lighting much (reflections, on the other hand...). On top of that, the voxels are reasonably large (4cm-ish each side, in a 512x512x512 texture), which is really just enough to make the lighting passable - there are still lots of artefacts around :P
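For a rough sense of the memory involved (back-of-envelope, assuming an 8-bit occupancy texture plus RGBA8 colour at half resolution - the exact formats may differ from what the demo uses): 512³ voxels × 1 byte ≈ 134MB for occupancy, plus 256³ × 4 bytes ≈ 67MB for colour, and that's before mipmaps - so it adds up fast.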

On 2/25/2019 at 7:09 AM, MJP said:

Very cool, thanks for explaining how it works! I played around with some similar techniques (ray-tracing voxels using an SDF to accelerate the trace) about a year ago, and it was a fun little project. I was mainly interested in doing it in the context of real-time/interactive lightmap baking on the GPU, and it seemed to work really well for that. But yeah, even with jump flooding, building that SDF got really slow. The voxel grid could also end up using quite a bit of memory. But still very promising!

It's interesting that you and JoeJ both mention lightmaps - for a while the intention behind this was also to generate realtime lightmaps. I would start off calculating at a low lightmap resolution, then adaptively subdivide to improve shadow edges and whatnot. That adaptive subdivision required the result to be noiseless though (or else you get really distracting flickering patches), which was achievable for simpler, Quake-complexity levels, but past that... yeah. I've sped up the raytracing since then though, so maybe I could have a look at that again...

On 2/25/2019 at 4:23 PM, JoeJ said:

If you do this, you can get infinite bounces for free (not just a single one).

Sorry yeah that's what I meant to say :P.


Outside of the denoising filter, I also got the sample rejection working properly again in the temporal supersampling filter, along with linear convergence for the first frames. Now I can increase the number of frames blended, and instead of smearing artefacts when looking at new areas I just get noise, which isn't too bad except for one particular case: whenever I move the camera small amounts, the edges of objects all get rejected, which looks really bad :P. Here's an exaggerated example (I let it converge fully before moving):

[Screenshot: noise along object edges after a small camera movement]

I'm hoping that once I add a second temporal anti-aliasing filter on top (which will use neighbourhood clamping rather than complete rejection) this will be less noticeable.
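By neighbourhood clamping I mean the standard TAA trick - something like this sketch (not code from the demo): clamp the reprojected history to the min/max of the current frame's 3x3 neighbourhood, so stale history snaps towards plausible values instead of being thrown away outright.

```glsl
// Neighbourhood clamping sketch (standard TAA-style history rectification).
uniform sampler2D currentTex;
uniform sampler2D historyTex;

vec3 clampedHistory(ivec2 coord, vec2 reprojectedUv)
{
    vec3 minC = vec3( 1e9);
    vec3 maxC = vec3(-1e9);
    for (int y = -1; y <= 1; ++y)
    for (int x = -1; x <= 1; ++x)
    {
        vec3 c = texelFetch(currentTex, coord + ivec2(x, y), 0).rgb;
        minC = min(minC, c);
        maxC = max(maxC, c);
    }
    vec3 history = texture(historyTex, reprojectedUv).rgb;
    return clamp(history, minC, maxC);
}
```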

The other option is to throw more samples into newly revealed areas, which works but slows things down a bit while the camera is in motion - that said, there are likely more efficient ways to implement it than a variable-length for loop :P

2 hours ago, spike1 said:

I think denoising that separately to the direct light might work well, if I can speed up the filter (4 passes of a 5x5 kernel at 1080p with the edge awareness costs about 15ms...yikes). Even in that paper the filter costs 10ms on a Titan X

I remember the timings given in the papers, and I wonder how Schied's Quake 2 demo works so well and fast now. Does anyone know the runtime of the filter there? I have no RTX yet and can't test it myself. It must be much faster than 10ms, I guess.

Personally I do compute RT too, but I update the full environment of stochastic lightmap texels, so I have no noise and no denoising experience either.

2 hours ago, spike1 said:

On top of that, the voxels are reasonably large (4cm-ish each side, in a 512x512x512 texture)

What do you think about the idea of storing voxels using only one bit? Of course that's for occlusion only, but 4x4x4 voxels in a 64-bit value seems interesting, especially if you need no filtering.

I assume only the primary hit is accurate triangles, and all secondary rays for GI trace only against voxels?

7 hours ago, JoeJ said:

I remember the timings given in the papers, and I wonder how Schied's Quake 2 demo works so well and fast now. Does anyone know the runtime of the filter there? I have no RTX yet and can't test it myself. It must be much faster than 10ms, I guess.

Personally I do compute RT too, but I update the full environment of stochastic lightmap texels, so I have no noise and no denoising experience either.

From a quick look at the source it looks like they're using a pretty full implementation of the paper, but I do notice some differences from my rushed implementation (like they use a combined depth-normals texture rather than separate ones like I do), so that may account for some of the speed. Is there anywhere I can see your work? It sounds rather interesting :)

8 hours ago, JoeJ said:

What do you think about the idea of storing voxels using only one bit? Of course that's for occlusion only, but 4x4x4 voxels in a 64-bit value seems interesting, especially if you need no filtering.

I assume only the primary hit is accurate triangles, and all secondary rays for GI trace only against voxels?

Yeah, I probably should have been clearer about the camera rays - they're just the usual triangle rasterization. If you turn on Visualize Voxels in the demo you can see how crude the voxels are haha XD.

As for binary voxelization, I've played around with it a bit in the past in some other contexts. A cool property is that you can precompute specific ray directions by storing every voxel along that direction in a single value, and then you can find the intersection with just one sample, a bit mask and a findLSB. I figured I could use this to precompute several directions and sample all of them to improve the bounce lighting quality, but I didn't end up using it (I can't remember why, to be honest - perhaps generating the directions cost too much, and without enough directions the coherency of the rays becomes obvious and looks distracting).
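To illustrate the single-sample lookup (an untested sketch with a made-up layout, not code I actually shipped): pack the occupancy of 32 voxels along one axis into each texel of an r32ui texture, and then for a ray travelling along that axis the first hit in a column is one fetch, a mask and a findLSB.

```glsl
// Bit-column lookup sketch: columnTex is r32ui sized (N, N, N/32), where bit i
// of texel (x, y, s) is the voxel at (x, y, s*32 + i).
uniform usampler3D columnTex;

// First solid voxel at/after startZ in column (x, y), or -1 if this 32-voxel
// block is empty (a full version would continue with the next slice).
int firstHitAlongZ(ivec2 xy, int startZ)
{
    int slice = startZ >> 5;
    int bit   = startZ & 31;

    uint bits = texelFetch(columnTex, ivec3(xy, slice), 0).r;
    bits &= ~((1u << uint(bit)) - 1u);   // drop voxels behind the ray origin

    return (bits == 0u) ? -1 : slice * 32 + findLSB(bits);
}
```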

But yeah, the idea you describe sounds good, and would fit into my current method pretty smoothly - I actually compute the acceleration structure at half resolution, so once I hit the finest-level cells I start sampling from the highest-resolution voxel texture, an 8-bit texture which really only needs to be 1-bit :P. If I instead treat each 8-bit texel as a 2x2x2 block of 1-bit voxels, I should be able to halve the voxel size without any memory increase - the only difficult part is generating that texture.

When voxelizing I use the "project the triangle by the axis with the largest visible surface area in a compute shader, then when rasterizing write manually into the right voxel spot using imageStore" method, and I was hoping that I'd be able to use the imageAtomicOr function to directly create this texture - unfortunately it only works on 32-bit ints.

I'm wondering if maybe texture views, or the format access qualifiers might give me a way to work around this, I don't have any experience with them so I haven't a clue :P
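One workaround I might try (just a sketch of the idea, untested): keep the binary grid as an r32ui image where each texel packs 32 voxels along X, so the write in the voxelization pass becomes a single 32-bit imageAtomicOr - at the cost of the grid being addressed in 32-voxel columns.

```glsl
// Binary voxel write via 32-bit atomics (sketch; layout is made up).
layout(r32ui, binding = 0) uniform uimage3D binaryVoxels; // size (N/32, N, N)

void writeVoxel(ivec3 voxel)
{
    ivec3 texel = ivec3(voxel.x >> 5, voxel.y, voxel.z);
    uint bit = 1u << uint(voxel.x & 31);
    imageAtomicOr(binaryVoxels, texel, bit); // works: it's a 32-bit atomic
}
```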

6 hours ago, spike1 said:

Is there anywhere I can see your work? It sounds rather interesting

No, I have no renderer at all yet, just the compute shaders to generate lightmaps (I visualize texels as quads :) for now). The surfels I'm using have similar properties to voxels, except that they form a quadtree over the surface instead of a 3D octree in space. That's promising, but hard to do. I've wasted more than a full year of work implementing quadrangulation papers and improving them, but nothing was good enough for my requirement to support LOD. I've finally solved the problem and can move on, but I feel exhausted and pretty dumb for losing so much time on this when easier alternatives would have worked as well.

So it will still take some time until I can show something. I use a Quake level, and the RTX demo looks very similar to what I have, including the specular reflections (which I get from 4x4 env maps per lightmap texel). But I can do it in 5ms on first-gen GCN, with infinite bounces.

I planned to raytrace my surfel hierarchy for sharper reflections, but now it might make more sense to use RTX for that. Not sure. RTX seems totally wrong to me: restricted to non-LOD triangles, single-threaded rays, a blackboxed BVH. That's not what I would have expected to be good for games, and I don't like it. So I should focus on the renderer and wait to see what happens with the next-gen consoles (but I guess I can't resist trying it anyway :) )

6 hours ago, spike1 said:

2x2x2 block

Yeah, I'm quite a voxel non-believer, but after another dev brought up some compression ideas to me recently I find them more interesting again. Things like this: http://jcgt.org/published/0006/02/01/paper-lowres.pdf - 0.048 bits per voxel. Streaming large worlds and unpacking them to multilevel bit grids should work well, I guess.

6 hours ago, spike1 said:

I'm wondering if maybe texture views, or the format access qualifiers might give me a way to work around this, I don't have any experience with them so I haven't a clue :P

I don't know those things either, but VK finally has support for 64-bit atomics (so 4x4x4 blocks), and I guess OpenGL does too. That could be useful when rasterization becomes inefficient at distance for the lower LOD cascades, and splatting just the vertices would suffice.

 

Impressive work! 

I implemented Voxel Cone Tracing (VCT) a while back using a technique similar to yours (see here for images and a description). Reading your description made me wonder: why didn't you take it a step further and make it into a cone tracer to get rid of all that noise? I'm very curious, because your implementation seems very close to one.
 

