Any options for affordable ray tracing?

30 comments, last by Krypt0n 10 years, 1 month ago

Maybe faster; I think it's just a matter of doing hardware acceleration for it, and that's all (probably).


Maybe faster; I think it's just a matter of doing hardware acceleration for it, and that's all (probably).

If there were any easy solutions like that, why aren't they already out there?

Maybe faster; I think it's just a matter of doing hardware acceleration for it, and that's all (probably).

If there were any easy solutions like that, why aren't they already out there?

I don't know; maybe they are not "so easy" because serious hardware would have to be built, which is hard for a small company, and the big ones may not be interested; maybe they prefer to sell what they already have for as long as they can (I really don't know the details).

ray-tracing [is] still at least 15-20 years away from becoming a generalized solution fast enough to finally replace rasterization entirely,
Either way, one thing's clear: 30 years from now nobody will be using rasterization anymore and graphics programming will be simpler in many aspects for it.

I'm sure there were people saying the same thing 10, 15, 20 & 30 years ago ;)

If real-time follows the same path as film VFX, we'll probably see micropolygon renderers / Reyes-style rendering in real-time sooner than full-blown ray-tracing.

The problem with ray-tracing, which keeps it from being "just a few more years away", is that it has terrible memory access patterns. Rasterization, on the other hand, can be very predictable in its memory access patterns. This is a big deal because memory is slow, and the main methods for speeding it up rely on coherent access patterns. Many of the tricks employed by real-time ray-tracers are focused on improving the coherency of rays, or on other techniques to reduce random memory accesses.

This problem never gets better.
Say that your program is slow because you can't deliver data from RAM to the processor fast enough. The solution: build faster RAM!
OK, so you wait a few years until RAM is 2x faster; however, in the same period of time processors have gotten another 10x faster!
After waiting those few years, the problem is actually much worse than it was before: the amount of data that you can deliver to the CPU per operation (bytes/FLOP) has decreased.

Every year, this performance figure gets lower and lower (blue line divided by red line is bytes per op):
[chart: processor operations per second (red) vs. memory bandwidth (blue) over time]

For people to completely ditch rasterisation and fully switch to ray-tracing methods, we need some huge breakthroughs in efficiency: scene traversal schemes with predictable memory access patterns for groups of rays.

Film guys can get away with this terrible efficiency because they aren't constrained by the size of their computers, the heat output, the cost to run, or the latency from input to image... If you need 100 computers using 100KW of power, inside a climate controlled data-center with 3 hours of latency, so be it!

On the other hand, the main focus of the GPU manufacturers these days seems to be decreasing the number of joules required for any operation, especially operations that physically move bytes of data from one part of the chip to another... because most consumers can't justify buying a 1KW desktop computer, and they also want 12-hour battery life from their mobile computers... so efficiency is kind of important.


But I'm completely sure that we'll see hybrid solutions, with rasterization-based engines using ray tracing for secondary effects this gen. At least on PC it's basically a certain thing in my opinion; on next-gen consoles I'm not so sure.
You're right about this though. The appearance of SSAO in prev-gen was the beginning of a huge "2.5D ray-tracing" explosion, which will continue on next-gen with screen-space reflections and the like.

Other games used more advanced 3D/world-space kinds of ray-tracing for other AO techniques on prev-gen; however, they only trace against extremely simple representations of the scene... e.g. ignoring everything except one character, and representing the character as fewer than a dozen spheroids...

Fully 3D ray-marching has shown up in other places, such as good fog rendering (shadow and fog interaction), or pre-baked specular reflection (e.g. cube-map) masking by ray-marching against a simplified version of the environment.
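To make that concrete, here's a minimal sketch of such a fixed-step ray-march (my own illustration, not from any shipped game; shadowAt() is a hypothetical stand-in for the engine's shadow test, stubbed out here):

struct Vec3 { float x, y, z; };

static Vec3 madd(Vec3 o, Vec3 d, float t)   // o + d*t
{
  return { o.x + d.x * t, o.y + d.y * t, o.z + d.z * t };
}

// Stub: 1 = lit, 0 = shadowed. A real engine would sample its shadow map here.
static float shadowAt(Vec3 /*p*/) { return 1.0f; }

// Accumulate fog along the view ray: only the lit segments in-scatter
// light, which is exactly the shadow/fog interaction mentioned above.
float marchFog(Vec3 origin, Vec3 dir, float maxDist, int steps, float density)
{
  float fog = 0.0f;
  float dt = maxDist / steps;
  for (int i = 0; i < steps; ++i)
  {
    Vec3 p = madd(origin, dir, dt * (i + 0.5f)); // midpoint of each segment
    fog += shadowAt(p) * density * dt;           // lit segments scatter
  }
  return fog;
}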

The voxel cone tracing stuff might ship in a few next-gen console games that require fancy dynamic specular reflections, but even with their modern GPUs it's a pretty damn costly technique.

The voxel cone tracing stuff might ship in a few next-gen console games that require fancy dynamic specular reflections, but even with their modern GPUs it's a pretty damn costly technique.

the problem with voxel cone tracing is exactly this:

Every year, this performance figure gets lower and lower (blue line divided by red line is bytes per op):
[chart: processor operations per second (red) vs. memory bandwidth (blue) over time]

I've been playing with voxel tracing for more than 20 years now (Comanche and later Outcast made it so appealing), but it was always memory bound. It works about 10x faster on GPUs solely because they push 300GB/s instead of the 20GB/s that the average CPU does, yet it scales badly if you extrapolate that graph into the future. I remember profiling my animated voxels ( http://twitpic.com/3rm2sa ), and literally 50% of the time when running on one core was spent in the instruction that fetched the voxel. Running it on 6 cores/12 threads with perfect algorithmic scaling just made it about 2-3x as fast; that one instruction got to 80%+ as soon as the voxel working set exceeded the cache size. (Yes, that was on the CPU, but my GPU versions are similarly sensitive to memory size.)

For people to completely ditch rasterisation and fully switch to ray-tracing methods, we need some huge breakthroughs in efficiency: scene traversal schemes with predictable memory access patterns for groups of rays.

That's one way to research, but a far more promising direction is to make better use of the rays you can already cast. In my research on real-time path tracing, even problematic materials like chrome ( http://twitpic.com/a2v276 ) ended up quite nice with about 25 samples/pixel, and mostly you get away with ~5-10 spp. Once you reach 100MRay/s+ it really becomes usable.

If someone invested the money to make a path tracing game, it would be possible today on high-end machines. Sure, it would be a purpose-made game, just like Quake, Doom 3 or Rage were, but it would work out nowadays.

The reason we don't have ray tracing games is more of a business issue than a tech issue. Games like Doom 2, Quake and Unreal ran "kind of" smoothly on high-end machines. That was small market coverage, but the tech sold them.

Nowadays, nobody would invest in a game that would sell "just" one million units to the top tier of PC users (aka the gamer elite race).

It's really sad; I'm sure I'm not the only one who could create this tech.


I'm not sure that's the way it is...

Isn't ray tracing better suited to parallelization than rasterization?

You can trace all the rays independently of one another, so there are probably no RAM write collisions (or am I wrong?); you only need shared RAM reads.

- for rasterization it seems to me this is less nice.

Also - I don't know how it looks today, but as someone said,

today you could already really do a Quake-style game on a path tracer (as far as I know a path tracer is much heavier than a simple ray tracer, so with just a ray tracer you could get much faster rendering).

(Also check how far Carmack took the optimization of the Quake 2 rasterization engine - that engine had no framebuffer pixel overwriting at all, plus a terrible amount of other crazy optimizations; it went terribly far - I'm not sure today's path tracer people go that far, though I know you people are good anyway.)

If so, this is not too far from real use; wouldn't dedicated hardware acceleration for the tracing speed it up another couple of times?

I'm not sure that's the way it is...

Isn't ray tracing better suited to parallelization than rasterization?

You can trace all the rays independently of one another, so there are probably no RAM write collisions (or am I wrong?); you only need shared RAM reads.

They're both embarrassingly parallel. A common technique to split rasterization over different "cores" and avoid conflicting writes is to use tiled rendering.
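A hedged sketch of the binning half of that idea (tile size and data layout are illustrative, not from any particular renderer): each triangle's screen bounding rect decides which tile bins it lands in, and afterwards each core rasterizes whole tiles, so no two threads ever write the same pixel.

#include <algorithm>
#include <vector>

struct Tri { float x0, y0, x1, y1, x2, y2; }; // 2D screen-space vertices

const int TILE = 64; // pixels per tile side (illustrative)

// Bin pass: append each triangle index to every tile its bounding rect
// touches. The raster pass then hands whole tiles to threads, so all
// framebuffer writes for one tile come from exactly one thread.
void binTriangles(const std::vector<Tri>& tris, int tilesX, int tilesY,
                  std::vector<std::vector<int>>& bins)
{
  bins.assign(tilesX * tilesY, {});
  for (int i = 0; i < (int)tris.size(); ++i)
  {
    const Tri& t = tris[i];
    int tx0 = std::max(0,          (int)std::min({t.x0, t.x1, t.x2}) / TILE);
    int tx1 = std::min(tilesX - 1, (int)std::max({t.x0, t.x1, t.x2}) / TILE);
    int ty0 = std::max(0,          (int)std::min({t.y0, t.y1, t.y2}) / TILE);
    int ty1 = std::min(tilesY - 1, (int)std::max({t.y0, t.y1, t.y2}) / TILE);
    for (int ty = ty0; ty <= ty1; ++ty)
      for (int tx = tx0; tx <= tx1; ++tx)
        bins[ty * tilesX + tx].push_back(i); // triangle i overlaps this tile
  }
}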

My post above wasn't about parallelisation though - it was about memory bandwidth, and alleviating the memory-bandwidth bottleneck via predictable memory access patterns.

As a practical example of the importance of memory access patterns, try this test using a very large value for LARGE_NUMBER (preferably, many times larger than your CPU's L2 or L3 cache size).


#include <algorithm> // std::random_shuffle (use std::shuffle on C++17 and later)
#include <cmath>     // sqrtf
#include <cstdio>
#include <vector>

int main()
{
  const int LARGE_NUMBER = 1 << 24; // many times larger than your L2/L3 cache

  // initialization
  std::vector<int> indicesA(LARGE_NUMBER);
  std::vector<int> indicesB(LARGE_NUMBER);
  for( int i=0; i!=LARGE_NUMBER; ++i )
  {
    indicesA[i] = i;
    indicesB[i] = i;
  }
  std::random_shuffle( indicesB.begin(), indicesB.end() );

  //indicesA contains 0,1,2,3...
  //indicesB contains 2,4,3,1... (same values as indicesA, but in a random order)

  std::vector<float> values(LARGE_NUMBER);
  for( int i=0; i!=LARGE_NUMBER; ++i )
    values[i] = (float)i; // fill values with stuff

  // test #1 -- predictable, linear access
  float totalA = 0;
  for( int i=0; i!=LARGE_NUMBER; ++i )
  {
    totalA += sqrtf(values[indicesA[i]]);
  }

  // test #2 -- random access
  float totalB = 0;
  for( int i=0; i!=LARGE_NUMBER; ++i )
  {
    totalB += sqrtf(values[indicesB[i]]);
  }

  // print the results so the compiler can't optimize the loops away
  printf("%f %f\n", totalA, totalB);
  return 0;
}

Both these tests should produce the same value -- both totalA and totalB contain the sum of the square roots of every item inside values.
The difference is that test #1 accesses the values in a predictable, linear order, whereas test #2 accesses them in a completely random order. The CPU cache will be able to optimize the memory fetches for test #1, but it will provide very minimal help for test #2. Test #1 should be much faster.

The super-high-level view of ray-tracing's memory accesses is:
* For each pixel (predictable access), test the ray for collision against the scene acceleration structure (e.g. BVH/etc) and against the scene objects themselves (random access), then write to the pixel.

* Also, neighbouring pixels may take completely different paths through the scene (low contiguity), defeating caching.

The super-high-level view of rasterization's memory accesses is:
* For each triangle in the scene (predictable access), compute the covered pixels and write to those pixels (predictable access).

* Also, each triangle generates a list of pixels that are all neighbours (high contiguity) and can share cached values about their triangle.

In other words, the Achilles' heel of ray-tracing is that ray-vs-scene collision testing does not have predictable memory access patterns. This means your algorithms are extremely likely to be bottlenecked by RAM speed rather than processor speed. Increases in processor speed do very little to speed up your algorithms; instead you depend on increases in RAM speed (and your processors spend most of their time sitting idle, waiting for data to arrive from RAM).
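To make the "random access" half of that concrete, here's a hedged sketch of an iterative BVH traversal (node layout and types are illustrative, not any particular library's). The read marked (*) is the killer: which node gets fetched next depends on the ray itself, so the prefetcher can't predict it, and neighbouring pixels walk different paths:

#include <utility> // std::swap

struct Ray  { float o[3], d[3]; };
struct AABB { float lo[3], hi[3]; };

struct BvhNode
{
  AABB bounds;
  int  left, right;        // child node indices; left < 0 marks a leaf
  int  firstTri, triCount; // triangle range for leaves
};

// Standard slab test (ignores degenerate ray directions for brevity).
bool hitAABB(const Ray& r, const AABB& b)
{
  float tmin = 0.0f, tmax = 1e30f;
  for (int a = 0; a < 3; ++a)
  {
    float inv = 1.0f / r.d[a];
    float t0 = (b.lo[a] - r.o[a]) * inv;
    float t1 = (b.hi[a] - r.o[a]) * inv;
    if (t0 > t1) std::swap(t0, t1);
    tmin = t0 > tmin ? t0 : tmin;
    tmax = t1 < tmax ? t1 : tmax;
  }
  return tmin <= tmax;
}

void traverse(const BvhNode* nodes, const Ray& ray)
{
  int stack[64];
  int top = 0;
  stack[top++] = 0; // root
  while (top > 0)
  {
    const BvhNode& n = nodes[stack[--top]]; // (*) data-dependent fetch
    if (!hitAABB(ray, n.bounds))
      continue;
    if (n.left < 0)
    {
      // leaf: test triangles [n.firstTri, n.firstTri + n.triCount) here
      continue;
    }
    stack[top++] = n.left;  // which nodes pile up here depends entirely
    stack[top++] = n.right; // on the ray -- no two rays agree for long
  }
}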

Alright, but if so, these slowdowns probably shouldn't apply to scenes that don't have such a terribly big RAM footprint and are just reasonable in size?

(Today's caches are probably about 10MB? What if a whole scene fit into RAM like that - it would be a relatively simple scene of 200k triangles or something - but shouldn't it then be ray-traced quickly, or not?)

Do ray-traced scenes have memory footprints of whole gigabytes?

Do these BVH structures have such a large RAM footprint? (I haven't done this and I'm not sure how it works - is it some kind of spatial 3D grid of boxes, with a ray/box intersection test routine and some Bresenham-like traversal through the 3D grid? Is that the stuff that consumes the RAM?)
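(For reference, since the replies below don't answer this part directly: the "Bresenham-like" grid traversal guessed at here is usually an Amanatides & Woo style 3D DDA. A minimal hedged sketch, assuming unit-sized cells and no zero components in the ray direction:)

#include <cmath>

// Step cell by cell through an N*N*N grid of unit cells, always advancing
// along the axis whose next cell boundary is nearest. The grid fetch is
// the same cache-hostile, data-dependent read discussed above.
bool marchGrid(const unsigned char* grid, int N,
               float ox, float oy, float oz,   // ray origin (inside the grid)
               float dx, float dy, float dz)   // ray direction, no zeros
{
  int cx = (int)ox, cy = (int)oy, cz = (int)oz;
  int sx = dx > 0 ? 1 : -1, sy = dy > 0 ? 1 : -1, sz = dz > 0 ? 1 : -1;
  float tdx = std::fabs(1.0f / dx); // t advance per cell along each axis
  float tdy = std::fabs(1.0f / dy);
  float tdz = std::fabs(1.0f / dz);
  float tx = ((sx > 0 ? cx + 1 : cx) - ox) / dx; // t to the next x wall
  float ty = ((sy > 0 ? cy + 1 : cy) - oy) / dy;
  float tz = ((sz > 0 ? cz + 1 : cz) - oz) / dz;

  while (cx >= 0 && cx < N && cy >= 0 && cy < N && cz >= 0 && cz < N)
  {
    if (grid[(cz * N + cy) * N + cx]) // the cache-hostile voxel fetch
      return true;                    // hit a filled cell
    if (tx < ty && tx < tz) { cx += sx; tx += tdx; }
    else if (ty < tz)       { cy += sy; ty += tdy; }
    else                    { cz += sz; tz += tdz; }
  }
  return false; // left the grid without a hit
}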

Just tracing small scenes that fit into your cache, especially the L1 cache, will probably be fast; you could fit a Quake 1 level into the cache and it would work OK.

I remember someone ray-traced Star Trek: Elite Force ages ago; there must be a vid on YouTube... there:

[embedded video]

but it's still memory limited...

You can do math with SSE on 4 elements and with AVX on 8, so in theory you could work on 8 rays at the same time; yet when you come to memory fetching, it's one fetch at a time. It's even worse: if you process 4 or 8 elements at a time, there will always be 4 or 8 reads in a row as well, so while your out-of-order CPU can hide some of the fetch latency by processing independent instructions, once you queue up 4 or 8 fetches there is just nothing for the other units to do but wait for your memory requests.

And while L1 requests are mostly hidden by instruction decoding etc., if you start queuing up 4 or 8 reads and those go to the L2 cache, with its ~12-cycle latency per hit, you have ~50 to ~100 cycles until you can work on the results.
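A hedged illustration of that point (mine, not from the Embree slides): with plain SSE the arithmetic is four-wide, but there is no gather instruction, so feeding the four lanes is still four independent scalar loads, each a potential cache miss the core can only wait on:

#include <immintrin.h>

// Four traversal lanes want values at four unrelated addresses.
// The multiply below is one SIMD op; the loads feeding it are four
// separate scalar fetches -- four potential cache misses in a row.
__m128 fetch4(const float* grid, const int idx[4])
{
  return _mm_set_ps(grid[idx[3]], grid[idx[2]],
                    grid[idx[1]], grid[idx[0]]); // 4 scalar loads
}

__m128 shade4(const float* grid, const int idx[4], __m128 weight)
{
  __m128 v = fetch4(grid, idx); // memory-bound part
  return _mm_mul_ps(v, weight); // compute part: trivially cheap by comparison
}

(AVX2 later added real gather instructions, but they still perform one cache access per lane internally, so the fundamental problem remains.)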

You can see some results about this straight from Intel:

http://embree.github.io/data/embree-siggraph-2013-final.pdf

As you can see, SSE barely does anything for the performance.

On the other side, rasterization is just a very specialized form of ray tracing. You transform the triangles into ray space, so the rays end up being 2D dots that you can evaluate before you do the actual intersection, and the intersection is done by interpolating coherent data (usually UVs and Z) and projection. You also exploit the fact that you can limit the primitive-vs-ray test to a small bounding rect, skipping most of the pixels (or 2D dots). The coherent interpolation also exploits the fact that you can keep your triangle data in registers for a long time (so you don't need to fetch data from memory), and that you don't need to fetch the rays either, as you know they're in a regular grid order and calculating their positions is natural and fast. I'm not talking about scanline rasterization, but halfspace or even homogeneous rasterization.
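A hedged, scalar sketch of that halfspace approach (illustrative, not production code): three edge functions are evaluated over the triangle's screen bounding rect, the triangle data lives in locals/registers the whole time, and the framebuffer writes march linearly along each row:

#include <algorithm>

// A pixel is covered when it lies on the inside of all three edges.
// Assumes counter-clockwise winding; testing all-negative covers CW.
void rasterTri(float* fb, int w, int h,
               float x0, float y0, float x1, float y1,
               float x2, float y2, float value)
{
  auto edge = [](float ax, float ay, float bx, float by, float px, float py)
  {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
  };

  // limit the test to the triangle's small bounding rect
  int minX = std::max(0,     (int)std::min({x0, x1, x2}));
  int maxX = std::min(w - 1, (int)std::max({x0, x1, x2}));
  int minY = std::max(0,     (int)std::min({y0, y1, y2}));
  int maxY = std::min(h - 1, (int)std::max({y0, y1, y2}));

  for (int y = minY; y <= maxY; ++y)
    for (int x = minX; x <= maxX; ++x)
    {
      float px = x + 0.5f, py = y + 0.5f;
      if (edge(x0, y0, x1, y1, px, py) >= 0 &&
          edge(x1, y1, x2, y2, px, py) >= 0 &&
          edge(x2, y2, x0, y0, px, py) >= 0)
        fb[y * w + x] = value; // contiguous writes along each row
    }
}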

If you could order rays to be grouped the way they are in rasterization, you'd get close to being as fast with tracing as you are with rasterizing. Indeed, there is quite some research going on into how to cluster rays and defer their processing until enough of them try to touch similar regions. There is also research into speculative tracing: e.g. you do triangle tests not with just one ray but with 4 or 8 at a time, and although it's bogus to test random rays, it's also free with SSE/AVX, and if your rays are somewhat coherent (e.g. shadow rays or primary rays), it ends up somewhat faster.

As I said in my first post here, there is already so much research on faster tracing, but there is really a lot of room to improve what you do with those rays. You can assume you'll get about 100-200MRay/s: that's what the Caustic RT hardware does, that's what you get with OptiX, that's what Intel can achieve with Embree. And even if you magically got 4x more -> 800MRay/s, you'd just reduce Monte Carlo path tracing noise by 50%. On the other hand, you can add importance sampling, or improve your random number or ray generator with a few lines of code (and a lot of thinking), and suddenly you get down to 5% of the noise (it really comes in steps that big).
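For instance, cosine-weighted hemisphere sampling for diffuse bounces is one of those few-lines-of-code wins. A standard sketch (hedged; rnd() is a placeholder uniform [0,1) generator): rays concentrate where the cosine term makes them count, so each one carries more signal without tracing a single extra ray.

#include <algorithm>
#include <cmath>
#include <cstdlib>

static float rnd() { return rand() / (float)(RAND_MAX + 1u); } // placeholder RNG

// Cosine-weighted sample around +Z (Malley's method): pick a point
// uniformly on the unit disk, then project it up onto the hemisphere.
// The resulting ray density is proportional to cos(theta), matching
// the diffuse BRDF, so the cosine term cancels out of the estimator.
void sampleCosHemisphere(float& x, float& y, float& z)
{
  float r   = std::sqrt(rnd());
  float phi = 6.2831853f * rnd();              // 2*pi
  x = r * std::cos(phi);
  y = r * std::sin(phi);
  z = std::sqrt(std::max(0.0f, 1.0f - r * r)); // = cos(theta)
}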

Further, ray tracing (and especially path tracing) is still quite an academic area. For rasterized games, we mostly just ignore proofs of correctness and the bits of theory that we violate. Texture sampling, mipmapping etc. were done properly 20 years ago in offline renderers, yet even the latest and greatest hardware will produce blurry or noisy results at some angles; and while we know exactly how to solve it in an academically correct way, we'd rather switch on 16x AF, which also solves the problem to some degree.

That's how game devs should also approach tracing. E.g. being unbiased is nice in theory (a theory that only holds in theory, as floats are biased by definition); you can get far better/faster results from biased algorithms like photon mapping, or from cancelling ray paths after a fixed number of bounces.

All this talk makes me want to work on my path tracer again tonight... you guys!

Use less memory. Use compressed containers. Access per thread/core within a sub-tree or coherently. Use fewer pointers.

My 10 000 USD ;-)

spinningcube

PS - oh do I miss the heydays of ompf...

This topic is closed to new replies.
