Posted by Krypt0n on 11 March 2014 - 10:51 AM
then your first source of information should be the Quake1 source code.
besides the assembler version, there is also a C version that you could experiment with. might be the best way to learn it
Posted by Krypt0n on 08 March 2014 - 04:48 PM
it sounds ok'ish, there is nothing particularly bad.
-instead of keeping a temporary copy of the transform matrices, you could group your game objects by render mesh name up front. then you'd process each group in one go, fill the InstanceData right away and, when you're done with a group, invoke the draw call (see the sketch after this list).
-it's common in games for most of the data to stay unchanged and only some objects to be dynamic. if you can figure out which objects don't change, you can just reuse the InstanceData buffer from the last frame with no update needed, etc.
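a minimal sketch of that grouping, assuming made-up placeholder types (Matrix4, GameObject) and an engine-provided drawInstanced hook:

```cpp
#include <map>
#include <string>
#include <vector>

struct Matrix4 { float m[16]; };
struct GameObject { std::string meshName; Matrix4 transform; };

// assumed engine function: uploads the instance data and issues the draw call
void drawInstanced(const std::string& meshName, const Matrix4* instances, size_t count);

void renderInstanced(const std::vector<GameObject>& objects)
{
    // group by render mesh instead of copying matrices into a temporary first
    std::map<std::string, std::vector<Matrix4>> buckets;
    for (const GameObject& obj : objects)
        buckets[obj.meshName].push_back(obj.transform);

    // one InstanceData fill and one draw call per mesh
    for (const auto& [mesh, instances] : buckets)
        drawInstanced(mesh, instances.data(), instances.size());
}
```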
Posted by Krypt0n on 08 March 2014 - 01:35 PM
I think I know why it's happening - my maximum ray depth is 5, and the rays probably get stuck in the cavities and never collect any colour
you are right, that sounds like the reason.
The way to fix it is to not limit your rays to a particular depth, but to reduce the amount of contribution on every recursion.
every surface in the real world (and I mean also mirrors, glass, air, chrome...) has some loss in the transmission of light. the easiest way to simulate this is to reduce the contribution by some amount per recursion level. every mirror ball would reduce it to e.g. 98% (you could set this value per material); it would take a while, but with increasing recursion it would end up at e.g. 1%.
and once a ray reaches 1%, you can assume it won't contribute anything. (you can calculate the maximum recursion depth by taking the log of the cancelling value divided by the log of the reduction value, e.g. log(0.01)/log(0.98) ≈ 228, to estimate the worst case, and maybe tweak both values to restrict the time consumption instead of looking up realistic values.)
that way you won't trace rays that have barely any contribution, and you'll spend the time on important rays instead of cancelling them due to some recursion limit. the reduction of contribution might also lead to a more realistic look.
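a quick standalone check of that estimate — with a 1% cutoff and 98% reflectance, the worst case is about 228 bounces:

```cpp
#include <cmath>
#include <cstdio>

// worst-case recursion depth: the contribution falls below the cutoff once
// loss^n < cutoff, i.e. n > log(cutoff) / log(loss)
int maxDepth(double cutoff, double loss)
{
    return (int)std::ceil(std::log(cutoff) / std::log(loss));
}

int main()
{
    std::printf("%d\n", maxDepth(0.01, 0.98)); // prints 228
}
```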
Posted by Krypt0n on 06 March 2014 - 10:07 AM
you might reduce the issue by dividing by abs(w), as most of the problem arises from the flip of x and y when you divide by something negative. it's still not correct, but far less noticeable, and you get away with per-vertex cost.
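a minimal sketch of that per-vertex divide, with made-up vector types:

```cpp
#include <cmath>

struct Vec4 { float x, y, z, w; };
struct Vec2 { float x, y; };

// divide by |w| instead of w so vertices behind the eye (w < 0) don't get
// their x and y mirrored. still not correct (no real clipping), but the
// artifact is far less noticeable and the cost stays per-vertex.
Vec2 projectAbsW(const Vec4& clip)
{
    float invW = 1.0f / std::fabs(clip.w);
    return { clip.x * invW, clip.y * invW };
}
```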
Posted by Krypt0n on 03 March 2014 - 08:40 PM
enable floating point exceptions; that way you'll usually hunt down most of those issues (even some hidden ones that you might not notice visually).
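a sketch of how to turn them on — this is the glibc way (feenableexcept is a GNU extension); on MSVC you'd use _controlfp_s instead:

```cpp
#define _GNU_SOURCE 1 // feenableexcept is a GNU extension
#include <fenv.h>

int main()
{
    // break into the debugger at the instruction that produces a NaN,
    // division by zero or overflow, instead of noticing it pixels later
    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);
    // ... run the tracer ...
}
```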
in my tracers I've usually added ray picking for debugging: at the place you click with the mouse cursor, I simply cast a ray. in the debugger you set a breakpoint in the trace function, and then you can easily track down the source of the problem.
Posted by Krypt0n on 25 February 2014 - 04:09 AM
one bright pixel should not blind the player; that doesn't happen in real life either. it's actually a problem, because since it doesn't blind you, your eye doesn't try to adapt to it, but it can still burn into your retina (e.g. sparks from welding, or an eclipse of the sun where tons of ppl stare straight at it and get burnt retinas, every time).
But in games (and especially in movies) you usually don't tone map a raw screen with a few bright pixels; you do your post processing first and then apply some bloom/glow, and while we fake the bloom to some specific radius (using limited kernels to save time), you'd actually define the radius based on the camera settings and the brightness.
so if you had a pixel in the night that is 65520, it would 'bleed' to maybe 20% of the screen, and if you tone map this, it would work out fine again. (you probably don't want just one bright pixel like this, as it would end up in aliasing hell; having an 8x8 block of pixels of that brightness that bleeds out to 20% of all pixels doesn't sound that crazy to me. if you apply motion blur, you extend the bounds of the bright pixel even further.)
it's really not just about tone mapping; you need a lot of post processing effects to get it somewhat right. (I mean, to get it movie-like)
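for reference, the very last step of that chain can be as simple as a Reinhard curve — everything interesting (bloom, motion blur) happens before it, in HDR; exposure is an assumed parameter here:

```cpp
// simple Reinhard tone map, applied after the HDR post processing;
// exposure is whatever your (possibly automatic) exposure logic produces
float tonemapReinhard(float hdr, float exposure)
{
    float v = hdr * exposure;
    return v / (1.0f + v); // compresses [0, inf) into [0, 1)
}
```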
Posted by Krypt0n on 23 February 2014 - 07:54 PM
nowadays the best way is probably to read the depth in the vertex shader; depending on the occlusion result, you either transform the vertex properly or set some default position, e.g. (0,0), so the flare gets culled.
that's minimal latency (occlusion query results are typically only available at least one frame later), minimal cost (it's not some per-pixel cost, just a few, e.g. 4, vertex reads per flare) and low implementation time (just bind the depth sampler, the z-texture and the proper 2d read position to the lens flare shader).
Posted by Krypt0n on 21 February 2014 - 11:07 AM
x and y are -1 to 1, mapped from the screen pixel coordinate. if you want to apply the fov, simply multiply x and y by the tangent of half the horizontal respectively vertical fov.
yes, it's that easy to start
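a minimal sketch of that mapping, assuming the fov angles are given in radians:

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// pixel -> [-1,1] -> scaled by tan(fov/2): an (unnormalized) primary ray
// direction in camera space, looking down +z
Vec3 primaryRay(int px, int py, int width, int height, float fovX, float fovY)
{
    float x = ((px + 0.5f) / width)  * 2.0f - 1.0f;
    float y = 1.0f - ((py + 0.5f) / height) * 2.0f; // screen y grows downwards
    return { x * std::tan(fovX * 0.5f), y * std::tan(fovY * 0.5f), 1.0f };
}
```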
Posted by Krypt0n on 21 February 2014 - 05:20 AM
how do you define slow?
what's the usual speed on your hardware with your opengl programs?
what hardware do you use?
with some proper data, we can give you proper replies, otherwise:
Does the huge amount of consumed memory slow my application down?
yeah, I guess so.
Posted by Krypt0n on 21 February 2014 - 04:11 AM
Rasterizing and pixel-shading them on current hardware? We could do that now by tessellating things enough... but yeah, the pixel shaders will be extremely inefficient with those microscopic polygons. It also wouldn't be a proper micropolygon renderer.
And the way hardware rasterizers work now is essentially by dispatching tiles to blocks of compute units. I'll try and find some of the documents I was reading about how the internal rasterization steps work, but the long and short of it is...
you could move the shading to the domain shader. afaik that's also the way reyes handles it, doing per-vertex shading (but iteratively during tessellation, as displacement is done simultaneously) and drawing flat shaded triangles. that leads to either a simple pass-through pixel shader or maybe even software rasterization via opencl/cuda.
We should be tessellating until we have sub-pixel sized polys, then shading per-vertex. Then for each pixel, storing a list of micropolygons that are contained within it; then, after each pixel's list has been filled, determining the coverage of each micropolygon and blending the results.
We need direct support for that, which will require dedicated hardware that is further removed from current GPUs than raytracing, I suspect. Not holding my breath for REYES any time in the vaguely predictable future of real time.
the two are kind of orthogonal. Reyes aims to create alias-free, clean images in reasonable time, while tracing mostly solves shading problems.
so orthogonal, in fact, that they could be combined: reyes rendering with per-vertex shading done by tracing. doing antialiasing the analytical way is far more efficient than tracing billions of samples per pixel. and on the other side, faking shadows etc. with depth maps leads to tons of issues, e.g. aliasing and peter panning, which is probably ok if you can tweak it per scene, but in games, tracing would be way more consistent.
Posted by Krypt0n on 20 February 2014 - 04:59 AM
Once you optimize voxel rendering to not be memory-bandwidth bound, you'll end up texture-fetch or memory-fetch bound; there are simply not enough texture units to feed all the ALUs. that's just going to get worse, simply because it's extremely expensive to ramp up the internal bandwidth (it's in the TB/s area already):
NVidia Kepler: 192 CUDA cores / 16 TMUs per SMX
NVidia Maxwell: 128 CUDA cores / 8 TMUs per SMM
no, it's not really. ram is accessed sequentially: you set an address, you read/write it, and then the next access can happen. the setup of the address is quite slow and causes most of the latency. it also has mostly electrical limitations; that's why you see RAM that doubles the frequency but at the same time nearly doubles the latency, so you end up with the same real time for the access. e.g. typical DDR SDRAM CAS latencies (see the quick arithmetic check after the list):
DDR1 200-400 : 2-3 cycles
DDR2 400 : 3-4 cycles
DDR2 800 : 4-6 cycles
DDR3 800 : 5-6 cycles
DDR3 1600 : 8-12 cycles
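a quick arithmetic check of that claim — CAS cycles divided by the memory clock give a roughly constant absolute latency across generations:

```cpp
#include <cstdio>

// DDR transfers twice per clock, so the memory clock is rate/2;
// casCycles / clockMHz yields microseconds, * 1000 yields nanoseconds
void casNs(const char* name, double transferRateMT, double casCycles)
{
    double clockMHz = transferRateMT / 2.0;
    std::printf("%s: %.1f ns\n", name, casCycles / clockMHz * 1000.0);
}

int main()
{
    casNs("DDR-400   CL3 ", 400, 3);   // ~15.0 ns
    casNs("DDR3-1600 CL10", 1600, 10); // ~12.5 ns
}
```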
you can improve the situation by splitting the memory area across several memory controllers, but that's like hiring more delivery guys: once everyone is his own delivery guy, it's fully parallel, but then your bottleneck just moves to the place where you need to control them. and memory controllers seem to eat up quite a lot of the die space of GPUs, as their size is dictated more by the external interface they have to provide.
some more on ram:
what GPUs try to do, and that's what makes them so much more efficient for gfx than cpus, is to group memory accesses and execute requests in batches. so your texture fetch for one pixel will be delayed until there are maybe 100 of those; then a big chunk of memory is read into the L1 TMU cache, and then the sampler units start doing all the interpolation. if you organize your cache in a way where one half can be streamed in simultaneously while the other half provides data to the samplers, you effectively fully parallelize the reads while at the same time being just bandwidth limited.
but those are special cases. this only works out because the rasterizer groups nearby pixels together, so they generate quite coherent UVs for texture lookups, and because textures are organized into tiles that need just one addressing operation to fetch e.g. 32x32 pixels, instead of 32 operations to fetch 32 lines of 1024 pixels of which 992 are dropped. and on top of that, you use mipmaps to end up with approximately 1 texel : 1 pixel.
for raytracing we do similar reorganizations of rays to make our access patterns more coherent, but it's nowhere close to that simplicity and efficiency, and there is quite some overhead in compute code to do this reorganization.
some basic techniques are wrapped up here: https://mediatech.aalto.fi/~timo/publications/aila2009hpg_paper.pdf
Posted by Krypt0n on 18 February 2014 - 06:06 AM
just tracing small scenes that fit into your cache, especially the L1 cache, will probably be fast; you could fit some Quake1 level into the cache and it would work ok.
I remember someone traced star trek elite force ages ago, there must be a vid on youtube... there:
but it's still memory limited...
you can do math with SSE on 4 elements and with AVX on 8 elements, so in theory you could work on 8 rays at the same time. yet once you come to memory fetching, it's one fetch at a time. it's even worse: if you process 4 or 8 elements at a time, there will always be 4 or 8 reads in a row. so while your out-of-order cpu can hide some of the fetch latency by processing independent instructions, once you queue up 4 or 8 fetches, there is just nothing for the other units to do but wait for your memory requests.
and while L1 latencies are kind of hidden by the instruction decoding etc., if you start queuing up 4 or 8 reads and those go to the L2 cache, with ~12 cycles of latency on a hit, you have ~50 or ~100 cycles until you can work on the results.
you can see some results on this straight from intel:
as you can see, SSE is barely doing anything for the performance.
on the other side, rasterization is just a very specialized form of ray tracing. you transform the triangles into the ray space, so the rays end up being 2d dots that you can evaluate before you do the actual intersection. the intersection is done by interpolating coherent data (usually UVs and Z) and projection. you also exploit the fact that you can limit the primitive vs ray test to a small bounding rect, skipping most of the pixels (or 2d dots). the coherent interpolation also exploits the fact that you can keep your triangle data in registers for a long time (so you don't need to fetch data from memory), and that you don't need to fetch the rays either, as you know they're in a regular grid order and calculating their positions is just natural and fast. I'm not talking about scanline rasterization, but halfspace or even homogeneous rasterization (see the sketch below).
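to make the 'rays as 2d dots' idea concrete, a bare-bones halfspace rasterizer sketch — the triangle stays in registers, the bounding rect skips most pixels, and the 'rays' come from the regular grid for free:

```cpp
#include <algorithm>

struct P { float x, y; };

// signed doubled area of (a, b, c); positive if c lies left of a->b
static float edge(P a, P b, P c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// visit every pixel covered by a counter-clockwise triangle
template <typename Shade>
void rasterize(P v0, P v1, P v2, int width, int height, Shade shade)
{
    // limit the per-pixel tests to the triangle's bounding rect
    int minX = std::max(0, (int)std::min({v0.x, v1.x, v2.x}));
    int maxX = std::min(width - 1, (int)std::max({v0.x, v1.x, v2.x}));
    int minY = std::max(0, (int)std::min({v0.y, v1.y, v2.y}));
    int maxY = std::min(height - 1, (int)std::max({v0.y, v1.y, v2.y}));

    for (int y = minY; y <= maxY; ++y)
        for (int x = minX; x <= maxX; ++x) {
            P p = { x + 0.5f, y + 0.5f };
            // inside iff all three halfspace tests agree
            if (edge(v0, v1, p) >= 0 && edge(v1, v2, p) >= 0 && edge(v2, v0, p) >= 0)
                shade(x, y);
        }
}
```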
if you could order rays to be grouped somewhat like in rasterization, you'd get close to being as fast with tracing as you are with rasterizing. indeed, there is quite some research going on into how to cluster rays and defer their processing until enough of them are trying to touch similar regions. there is also research into speculative tracing: e.g. you do triangle tests not with just one ray, but with 4 or 8 at a time, and although it's bogus to test random rays, it's also free with SSE/AVX, and if your rays are somewhat coherent (e.g. shadow rays or primary rays), it ends up somewhat faster.
as I said in my first post here, there is already so much research on faster tracing, but there is really a lot of room to improve what you do with those rays. you can assume you'll get about 100-200MRay/s. that's what the caustics RT hardware does, that's what you get with optix, that's what intel can achieve with embree. and even if you magically got 4x -> 800MRay/s, you'd just reduce the monte carlo path tracing noise by 50%. on the other side, you can add importance sampling, or improve your random number or ray generator with a few lines of code (and a lot of thinking), and you'll suddenly get to 5% of the noise (it really comes in those big steps).
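as an example of the 'few lines of code, a lot of thinking' kind of improvement: cosine-weighted hemisphere sampling for diffuse surfaces (Malley's method) instead of uniform sampling — same ray count, much less noise:

```cpp
#include <cmath>
#include <random>

struct Vec3 { float x, y, z; };

// importance-sample the diffuse lobe: draw a point on the unit disk and
// project it up onto the hemisphere; the pdf is then cos(theta)/pi, which
// cancels the cosine term of the rendering equation
Vec3 cosineSampleHemisphere(std::mt19937& rng)
{
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    float u1 = uni(rng), u2 = uni(rng);
    float r = std::sqrt(u1);
    float phi = 2.0f * 3.14159265f * u2;
    // z is aligned with the surface normal
    return { r * std::cos(phi), r * std::sin(phi), std::sqrt(1.0f - u1) };
}
```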
further, ray tracing (and especially path tracing) is still quite an academic area. for rasterized games, we mostly just ignore the proofs of correctness and some of the theories that we violate. texture sampling, mipmapping etc. were done properly 20 years ago in offline renderers, yet even the latest and greatest hardware will produce blurry or noisy results at some angles, and while we know exactly how to solve it academically correctly, we rather switch on 64x AF and it also solves the problem to some degree.
that's how game devs should approach tracing too. e.g. being unbiased is nice in theory (a theory that only holds in theory, as floats are biased by definition); you can get far better/faster results with biased algorithms like photon mapping, or by cancelling ray depth after a fixed number of reflections.
all this talk makes me want to work tonight on my path tracer again, .... you guys !
Posted by Krypt0n on 17 February 2014 - 05:36 PM
The voxel cone tracing stuff might ship on a few next-gen console games that require fancy specular dynamic reflections, but even with their modern GPU's it's a pretty damn costly technique.
the problem with voxel cone tracing is exactly this:
Every year, this performance figure gets lower and lower (blue line divided by red line is bytes per op):
I've been playing with voxel tracing for more than 20 years now (Comanche and later Outcast made it so appealing), but it was always memory bound. it runs like 10x faster on GPUs solely because they push 300GB/s instead of the 20GB/s the average cpu does, yet it scales badly if you extrapolate your graph into the future. I remember profiling my animated voxels ( http://twitpic.com/3rm2sa ): literally 50% of the time when running on one core was in the instruction that fetched the voxel. running it on 6 cores/12 threads with perfect algorithmic scaling just made it about 2-3x as fast; that one instruction got to 80%+ as soon as the voxel working set exceeded the cache size. (yes, that was cpu, but my GPU versions are similarly sensitive to memory size)
For people to completely ditch rasterisation and fully switch to ray-tracing methods, we need some huge breakthroughs in efficiency, in efficient scene traversal schemes with predictable memory access patterns for groups of rays.
that's one way to research, but a far more promising direction is to make better use of the rays you can already cast. in my research on realtime path tracing, even problematic materials like chrome ( http://twitpic.com/a2v276 ) ended up quite nice with about 25 samples/pixel; mostly you get away with ~5 - ~10 spp. once you reach 100MRay/s+ it really becomes usable.
if someone invested the money to make a path tracing game, it would be possible today on high end machines. sure, it would be a purpose-made game, just like quake, doom3 or rage were, but it would work out nowadays.
the reason we don't have ray tracing games is more a business issue than a tech issue. games like doom2, quake, unreal ran 'kind of' smooth on high end machines. that was a small market coverage, but tech sold it.
nowadays, nobody would invest in a game that would sell 'just' 1 million units to the top tier of PC users (aka the gamer elite race).
it's really sad, I'm sure I'm not the only one who could create this tech.
Posted by Krypt0n on 06 February 2014 - 09:04 AM
Shouldn't the scene colors converge to a certain value instead of getting full white?
Yes, it should definitely converge if you're doing it right. I would suspect that you have a bug, or that you're integrating incorrectly. Are you doing hemicube rendering?
it doesn't have to be a bug; it can converge to something that is just brighter than what represents 'white' in that particular case. in every pass energy gets distributed further, but energy doesn't 'leak' anywhere; that's why everything can only get brighter or stay the same.
I would suggest to
1. switch to float (half) lightmaps for the generation time.
2. tone map those to LDR if you want to use LDR lightmaps for visualization (you might even do that per lightmap and save the mid-tone value, so you could 'reconstruct' the actual brightness for HDR rendering).
3. calculate the average brightness in every pass and output it; see if the step size of the 'brightening' shrinks every pass. then you could even estimate how long until the error reduces to a value you want (this might be very scene dependent: sometimes 5 passes, sometimes 50). a rough sketch of this follows below.
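a rough sketch of point 3, with runRadiosityPass standing in for whatever your bake step is (an assumed function, not a real API):

```cpp
#include <vector>

// assumed bake step: distributes the energy one more bounce
void runRadiosityPass(std::vector<float>& lightmap);

float averageBrightness(const std::vector<float>& lightmap)
{
    double sum = 0.0;
    for (float texel : lightmap) sum += texel;
    return (float)(sum / lightmap.size());
}

void bake(std::vector<float>& lightmap, int maxPasses, float threshold)
{
    float prev = averageBrightness(lightmap);
    for (int pass = 0; pass < maxPasses; ++pass) {
        runRadiosityPass(lightmap);
        float avg = averageBrightness(lightmap);
        if (avg - prev < threshold) // the 'brightening' step got small enough
            break;
        prev = avg;
    }
}
```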
Posted by Krypt0n on 22 January 2014 - 10:07 AM
the basic traversal is no different than on the CPU; modern GPUs are flexible enough to handle it.
then you need to optimize it; the two important points for performance are cache coherence and thread divergence.
cache coherence means you organize data in a way that leads to as many cache hits as possible, e.g.
-you can group ray queries by position and orientation so they access similar data during traversal
-you can specialize data with regard to specific aspects, e.g. a special tree for shadow raycasts, a special tree for ambient occlusion casts, special trees depending on the ray direction (with clustered backface culling of triangles)
thread divergence: although you write code for a single thread (in opencl/cuda), those threads still run in groups of 32 or 64; if threads take different branches, a lot of threads will sit idle and reduce performance. so your optimization job is to reduce thread divergence as far as possible.
those two points are independent of the hierarchy you use, so I'd suggest also reading this paper: http://www.eng.utah.edu/~cs7940/papers09/Timo_GPU_rt_HPG09.pdf
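a tiny illustration of the first point — sorting ray queries by direction octant so threads in a group tend to traverse similar parts of the tree (Ray is a made-up type here):

```cpp
#include <algorithm>
#include <vector>

struct Vec3 { float x, y, z; };
struct Ray { Vec3 origin, dir; };

// 3-bit code from the direction signs: rays in the same octant tend to
// visit similar nodes during traversal
static int octant(const Vec3& d)
{
    return (d.x < 0 ? 1 : 0) | (d.y < 0 ? 2 : 0) | (d.z < 0 ? 4 : 0);
}

// group coherent rays before dispatching them in batches of 32/64
void groupRays(std::vector<Ray>& rays)
{
    std::sort(rays.begin(), rays.end(),
              [](const Ray& a, const Ray& b) { return octant(a.dir) < octant(b.dir); });
}
```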