# GPU Performance for Game Artists


Performance is everybody's responsibility, no matter what your role. When it comes to the GPU, 3D programmers have a lot of control over performance; we can optimize shaders, trade image quality for performance, use smarter rendering techniques... we have plenty of tricks up our sleeves. But there's one thing we don't have direct control over, and that's the game's art. We rely on artists to produce assets that not only look good but are also efficient to render. For artists, a little knowledge of what goes on under the hood can make a big impact on a game's framerate. If you're an artist and want to understand why things like draw calls, LODs, and mipmaps are important for performance, read on!

To appreciate the impact that your art has on the game's performance, you need to know how a mesh makes its way from your modelling package onto the screen in the game. That means having an understanding of the GPU - the chip that powers your graphics card and makes real-time 3D rendering possible in the first place. Armed with that knowledge, we'll look at some common art-related performance issues, why they're a problem, and what you can do about it. Things are quickly going to get pretty technical, but if anything is unclear I'll be more than happy to answer questions in the comments section.

Before we start, I should point out that I am going to deliberately simplify a lot of things for the sake of brevity and clarity. In many cases I'm generalizing, describing only the typical case, or just straight up leaving things out. In particular, for the sake of simplicity the idealized version of the GPU I describe below more closely matches that of the previous (DX9-era) generation. However when it comes to performance, all of the considerations below still apply to the latest PC & console hardware (although not necessarily all mobile GPUs). Once you understand everything described here, it will be much easier to get to grips with the variations and complexities you'll encounter later, if and when you start to dig deeper.

# Part 1: The rendering pipeline from 10,000 feet

For a mesh to be displayed on the screen, it must pass through the GPU to be processed and rendered. Conceptually, this path is very simple: the mesh is loaded, vertices are grouped together as triangles, the triangles are converted into pixels, each pixel is given a colour, and that's the final image. Let's look a little closer at what happens at each stage.

After you export a mesh from your DCC tool of choice (Digital Content Creation - Maya, Max, etc.), the geometry is typically loaded into the game engine in two pieces: a Vertex Buffer (VB) that contains a list of the mesh's vertices and their associated properties (position, UV coordinates, normal, color etc.), and an Index Buffer (IB) that lists which vertices in the VB are connected to form triangles.

Along with these geometry buffers, the mesh will also have been assigned a material to determine what it looks like and how it behaves under different lighting conditions. To the GPU this material takes the form of custom-written shaders - programs that determine how the vertices are processed, and what colour the resulting pixels will be. When choosing the material for the mesh, you will have set various material parameters (eg. setting a base color value or picking a texture for various maps like albedo, roughness, normal etc.) - these are passed to the shader programs as inputs.

The mesh and material data get processed by various stages of the GPU pipeline in order to produce pixels in the final render target (an image to which the GPU writes). That render target can then be used as a texture in subsequent shader programs and/or displayed on screen as the final image for the frame. For the purposes of this article, here are the important parts of the GPU pipeline from top to bottom:
• Input Assembly. The GPU reads the vertex and index buffers from memory, determines how the vertices are connected to form triangles, and feeds the rest of the pipeline.
• Vertex Shading. The vertex shader is executed once for every vertex in the mesh, working on a single vertex at a time. Its main purpose is to transform the vertex, taking its position and using the current camera and viewport settings to calculate where it will end up on the screen (see the example shader just after this list).
• Rasterization. Once the vertex shader has been run on each vertex of a triangle and the GPU knows where it will appear on screen, the triangle is rasterized - converted into a collection of individual pixels. Per-vertex values - UV coordinates, vertex color, normal, etc. - are interpolated across the triangle's pixels. So if one vertex of a triangle has a black vertex color and another has white, a pixel rasterized in the middle of the two will get the interpolated vertex color grey.
• Pixel Shading. Each rasterized pixel is then run through the pixel shader (although technically at this stage it's not yet a pixel but a 'fragment', which is why you'll sometimes see the pixel shader called a fragment shader). This gives the pixel a color by combining material properties, textures, lights, and other parameters, as programmed to achieve a particular look. Since there are so many pixels (a 1080p render target has over two million) and each one needs to be shaded at least once, the pixel shader is usually where the GPU spends a lot of its time.
• Render Target Output. Finally the pixel is written to the render target - but not before undergoing some tests to make sure it's valid. For example in normal rendering you want closer objects to appear in front of farther objects; the depth test can reject pixels that are further away than the pixel already in the render target. But if the pixel passes all the tests (depth, alpha, stencil etc.), it gets written to the render target in memory.
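To make the vertex shading stage more concrete, here's what a bare-bones vertex shader might look like. This is an illustrative sketch rather than code from any particular engine - the names (WorldViewProjection, VSInput and so on) are made up for the example, and a real shader would typically do more work (skinning, transforming the normal for lighting, etc.):

```hlsl
// Constants filled in by the CPU before the draw call (hypothetical names).
float4x4 WorldViewProjection;   // combined object -> world -> camera -> screen transform

// One entry in the vertex buffer: the per-vertex properties exported from the DCC tool.
struct VSInput
{
    float3 position : POSITION;
    float2 uv       : TEXCOORD0;
    float3 normal   : NORMAL0;
};

struct VSOutput
{
    float4 position : SV_Position; // where the vertex ends up on the screen
    float2 uv       : TEXCOORD0;   // passed along to be interpolated during rasterization
    float3 normal   : NORMAL0;
};

VSOutput MyVertexShader(VSInput input)
{
    VSOutput output;
    // The vertex shader's main job: project the vertex position into screen (clip) space.
    output.position = mul(float4(input.position, 1.0), WorldViewProjection);
    // Everything else is simply handed on so the rasterizer can interpolate it
    // across the triangle for the pixel shader to use.
    output.uv     = input.uv;
    output.normal = input.normal;
    return output;
}
```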
There's much more to it, but that's the basic flow: the vertex shader is executed on each vertex in the mesh, each 3-vertex triangle is rasterized into pixels, the pixel shader is executed on each rasterized pixel, and the resulting colors are written to a render target.

Under the hood, the shader programs that represent the material are written in a shader programming language such as HLSL. These shaders run on the GPU in much the same way that regular programs run on the CPU - taking in data, running a bunch of simple instructions to change the data, and outputting the result. But while CPU programs are generalized to work on any type of data, shader programs are specifically designed to work on vertices and pixels. These programs are written to give the rendered object the look of the desired material - plastic, metal, velvet, leather, etc.

To give you a concrete example, here's a simple pixel shader that does Lambertian lighting (ie. simple diffuse-only, no specular highlights) with a material color and a texture. As shaders go it's one of the most basic, but you don't need to understand it - it just helps to see what shaders can look like in general.

```hlsl
float3 MaterialColor;
Texture2D MaterialTexture;
SamplerState TexSampler;

float3 LightDirection;
float3 LightColor;

float4 MyPixelShader( float2 vUV : TEXCOORD0, float3 vNorm : NORMAL0 ) : SV_Target
{
    float3 vertexNormal = normalize(vNorm);
    float3 lighting = LightColor * dot( vertexNormal, LightDirection );
    float3 material = MaterialColor * MaterialTexture.Sample( TexSampler, vUV ).rgb;

    float3 color = material * lighting;
    float alpha = 1;

    return float4(color, alpha);
}
```

A simple pixel shader that does basic lighting. The inputs at the top like MaterialTexture and LightColor are filled in by the CPU, while vUV and vNorm are both vertex properties that were interpolated across the triangle during rasterization.

And the generated shader instructions:

```
dp3 r0.x, v1.xyzx, v1.xyzx
rsq r0.x, r0.x
mul r0.xyz, r0.xxxx, v1.xyzx
dp3 r0.x, r0.xyzx, cb0[1].xyzx
mul r0.xyz, r0.xxxx, cb0[2].xyzx
sample_indexable(texture2d)(float,float,float,float) r1.xyz, v0.xyxx, t0.xyzw, s0
mul r1.xyz, r1.xyzx, cb0[0].xyzx
mul o0.xyz, r0.xyzx, r1.xyzx
mov o0.w, l(1.000000)
ret
```

The shader compiler takes the above program and generates these instructions which are run on the GPU; a longer program produces more instructions, which means more work for the GPU to do.

As an aside, you might notice how isolated the shader steps are - each shader works on a single vertex or pixel without needing to know anything about the surrounding vertices/pixels. This is intentional and allows the GPU to process huge numbers of independent vertices and pixels in parallel, which is part of what makes GPUs so fast at doing graphics work compared to CPUs.

We'll return to the pipeline shortly to see where things might slow down, but first we need to back up a bit and look at how the mesh and material got to the GPU in the first place. This is also where we meet our first performance hurdle - the draw call.

## The CPU and Draw Calls

The GPU cannot work alone; it relies on the game code running on the machine's main processor - the CPU - to tell it what to render and how. The CPU and GPU are (usually) separate chips, running independently and in parallel. To hit our target frame rate - most commonly 30 frames per second - both the CPU and GPU have to do all the work to produce a single frame within the time allowed (at 30fps that's just 33 milliseconds per frame).
To achieve this, frames are often pipelined; the CPU will take the whole frame to do its work (process AI, physics, input, animation etc.) and then send instructions to the GPU at the end of the frame so it can get to work on the next frame. This gives each processor a full 33ms to do its work, at the expense of introducing a frame's worth of latency (delay). This may be an issue for extremely time-sensitive twitchy games like first person shooters - the Call of Duty series for example runs at 60fps to reduce the latency between player input and rendering - but in general the extra frame is not noticeable to the player.

Every 33ms the final render target is copied and displayed on the screen at VSync - the interval during which the monitor looks for a new frame to display. But if the GPU takes longer than 33ms to finish rendering the frame, it will miss this window of opportunity and the monitor won't have any new frame to display. That results in either screen tearing or stuttering and an uneven framerate that we really want to avoid. We also get the same result if the CPU takes too long - it has a knock-on effect, since the GPU doesn't get commands quickly enough to do its job in the time allowed. In short, a solid framerate relies on both the CPU and GPU performing well.
Here the CPU takes too long to produce rendering commands for the second frame, so the GPU starts rendering late and thus misses VSync.
A texture atlas from Unreal Engine's Infiltrator demo
Every mesh the engine asks the GPU to render is issued as a draw call - a command from the CPU telling the GPU what to draw and with what state. Each draw call costs CPU time to set up, so a scene with thousands of them can easily make the CPU the bottleneck. A common way to reduce draw call counts is to combine multiple textures into a texture atlas, allowing meshes that would otherwise need separate materials to be rendered together.

Many engines also support instancing, also known as batching or clustering. This is the ability to use a single draw call to render multiple objects that are mostly identical in terms of shaders and state, and only differ in a restricted set of ways (typically their position and rotation in the world). The engine will usually recognize when multiple identical objects can be rendered using instancing, so when possible it's preferable to place the same object multiple times in a scene rather than using several different objects that will each need their own draw call.

Another common technique for reducing draw calls is manually merging many different objects that share the same material into a single mesh. This can be effective, but care must be taken to avoid excessive merging, which can actually worsen performance by increasing the amount of work for the GPU. Before any draw call gets issued, the engine's visibility system will determine whether or not the object will even appear on screen. If not, it's very cheap to just ignore the object at this early stage and not pay for any draw call or GPU work (also known as visibility culling). This is usually done by checking if the object's bounding volume is visible from the camera's point of view, and that it is not completely blocked from view (occluded) by any other objects.

However, when multiple meshes are merged into a single object, their individual bounding volumes must be combined into a single large volume that is big enough to enclose every mesh. This increases the likelihood that the visibility system will be able to see some part of the volume, and so will consider the entire collection visible. That means that it becomes a draw call, and so the vertex shader must be executed on every vertex in the object - even if very few of those vertices actually appear on the screen. This can lead to a lot of GPU time being wasted because the vertices end up not contributing anything to the final image. For these reasons, mesh merging is most effective when it is done on groups of small objects that are close to each other, as they will probably be on-screen at the same time anyway.
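To give a flavour of how instancing looks from the GPU's side, here's a sketch of an instanced vertex shader, assuming a D3D11-style setup with hypothetical names. The CPU uploads one transform per copy of the mesh, and each vertex looks up the transform for its instance:

```hlsl
// Per-instance data: one world matrix per copy of the mesh, uploaded by the
// CPU in a single buffer before the draw call (hypothetical setup).
StructuredBuffer<float4x4> InstanceWorldMatrices;

float4x4 ViewProjection;

struct VSInput
{
    float3 position   : POSITION;
    uint   instanceID : SV_InstanceID; // which copy of the mesh this vertex belongs to
};

float4 InstancedVS(VSInput input) : SV_Position
{
    // Each instance differs only by its transform; the mesh, shaders and state
    // are all shared, which is what lets one draw call render every copy.
    float4 worldPos = mul(float4(input.position, 1.0), InstanceWorldMatrices[input.instanceID]);
    return mul(worldPos, ViewProjection);
}
```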
A frame from XCOM 2 as captured with RenderDoc. The wireframe (bottom) shows in grey all the extra geometry submitted to the GPU that is outside the view of the in-game camera.

# Part 2: Common GPU bottlenecks

The very first step in optimization is to identify the current bottleneck so you can take steps to reduce or eliminate it. A bottleneck refers to the section of the pipeline that is slowing everything else down. In the case above, where issuing too many draw calls takes too much CPU time, the CPU is the bottleneck. Even if we performed other optimizations that made the GPU faster, it wouldn't matter to the framerate, because the CPU would still be running too slowly to produce a frame in the required amount of time.
4 draw calls going through the pipeline, each being the rendering of a full mesh containing many triangles. The stages overlap because as soon as one piece of work is finished it can be immediately passed to the next stage (eg. when three vertices are processed by the vertex shader then the triangle can proceed to be rasterized).
You can think of the GPU pipeline as an assembly line. As each stage finishes with its data, it forwards the results to the following stage and proceeds with the next piece of work. Ideally every stage is busy working all the time, and the hardware is being utilized fully and efficiently as represented in the above image - the vertex shader is constantly processing vertices, the rasterizer is constantly rasterizing pixels, and so on. But consider what happens if one stage takes much longer than the others:
What happens here is that an expensive vertex shader can't feed the following stages fast enough, and so becomes the bottleneck. If you had a draw call that behaved like this, making the pixel shader faster is not going to make much of a difference to the time it takes for the entire draw call to be rendered. The only way to make things faster is to reduce the time spent in the vertex shader. How we do that depends on what in the vertex shader stage is actually causing the bottleneck. You should keep in mind that there will almost always be a bottleneck of some kind - if you eliminate one, another will just take its place. The trick is knowing when you can do something about it, and when you have to live with it because that's just what it costs to render what you want to render. When you optimize, you're really trying to get rid of unnecessary bottlenecks. But how do you identify what the bottleneck is?

## Profiling

Profiling tools are absolutely essential for figuring out where all the GPU's time is being spent, and good ones will point you at exactly what you need to change in order for things to go faster. They do this in a variety of ways - some explicitly show a list of bottlenecks, others let you run 'experiments' to see what happens (eg. "how does my draw time change if all the textures are tiny?", which can tell you whether you're bound by memory bandwidth or cache usage). Unfortunately this is where things get a bit hand-wavy, because some of the best performance tools are console-specific and therefore under NDA. If you're developing for Xbox or PlayStation, bug your friendly neighbourhood graphics programmer to show you these tools. We love it when artists get involved in performance, and will be happy to answer questions and even host tutorials on how to use the tools effectively.
Unity's basic built-in GPU profiler

A frame capture from PIX with the corresponding overdraw visualization mode
A pixel is overdrawn when it gets shaded more than once in a frame because multiple overlapping objects all cover it; any work spent shading a pixel that is later covered up is work thrown away.

Sometimes overdraw is unavoidable, such as when rendering translucent objects like particles or glass-like materials; the background object is visible through the foreground, so both need to be rendered. But for opaque objects, overdraw is completely unnecessary, because only the pixel visible in the buffer at the end of rendering actually needed to be processed. In this case, every overdrawn pixel is just wasted GPU time.

The GPU takes steps to reduce overdraw in opaque objects. The early depth test (which happens before the pixel shader - see the initial pipeline diagram) will skip pixel shading if it determines that the pixel will be hidden by another object. It does that by comparing the pixel being shaded to the depth buffer - a render target where the GPU stores the entire frame's depth so that objects occlude each other properly. But for the early depth test to be effective, the other object must have already been rendered so it is present in the depth buffer. That means that the rendering order of objects is very important.

Ideally every scene would be rendered front-to-back (ie. objects closest to the camera first), so that only the foreground pixels get shaded and the rest get killed by the early depth test, eliminating overdraw entirely. But in the real world that's not always possible, because you can't reorder the triangles inside a draw call during rendering. Complex meshes can occlude themselves multiple times, and mesh merging can result in many overlapping objects being rendered in the "wrong" order, causing overdraw. There's no easy answer for avoiding these cases, and in the latter case it's just another thing to take into consideration when deciding whether or not to merge meshes.

To help early depth testing, some games do a partial depth prepass. This is a preliminary pass where certain large objects that are known to be effective occluders (large buildings, terrain, the main character etc.) are rendered with a simple shader that only outputs to the depth buffer, which is relatively fast as it avoids doing any pixel shader work such as lighting or texturing. This 'primes' the depth buffer and increases the amount of pixel shader work that can be skipped during the full rendering pass later in the frame. The drawback is that rendering the occluding objects twice (once in the depth-only pass and once in the main pass) increases the number of draw calls, plus there's always a chance that the time it takes to render the depth pass itself is more than the time it saves from increased early depth test efficiency. Only profiling in a variety of cases can determine whether or not it's worth it for any given scene.
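As a sketch of what a depth prepass shader can look like - hypothetical names, and simplified (a real prepass also has to deal with things like alpha-tested geometry):

```hlsl
// Depth-only pass: project each vertex and write nothing but depth.
// No pixel shader is bound and no colour render target is attached,
// so all the expensive lighting and texturing work is skipped.
// (WorldViewProjection is a hypothetical name for this sketch.)
float4x4 WorldViewProjection;

float4 DepthOnlyVS(float3 position : POSITION) : SV_Position
{
    return mul(float4(position, 1.0), WorldViewProjection);
}
```

The rasterized depth values go straight into the depth buffer, priming it so that the early depth test can reject hidden pixels during the main pass.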
Particle overdraw visualization of an explosion in Prototype 2
One place where overdraw is a particular concern is particle rendering, given that particles are transparent and often overlap a lot. Artists working on particle effects should always have overdraw in mind when producing effects. A dense cloud effect can be produced by emitting lots of small faint overlapping particles, but that's going to drive up the rendering cost of the effect; a better-performing alternative is to emit fewer large particles, and instead rely more on the texture and texture animation to convey the density of the effect. The overall result is often more visually effective anyway, because offline software like FumeFX and Houdini can usually produce much more interesting effects through texture animation than real-time simulation of individual particles can.

The engine can also take steps to avoid doing more GPU work than necessary for particles. Every rendered pixel that ends up completely transparent is just wasted time, so a common optimization is particle trimming: instead of rendering the particle as a simple two-triangle quad, a custom-fitted polygon is generated that cuts away as much of the texture's fully transparent area as possible, so far fewer empty pixels get rasterized and shaded.
Particle 'cutout' tool in Unreal Engine 4
A 10x8 pixel buffer with 5x4 quads. The two triangles have poor quad utilization -- left is too small, right is too thin. The 10 red quads touched by the triangles need to be completely shaded, even though the 12 green pixels are the only ones that are actually needed. Overall, 70% of the GPU's work is wasted.
GPUs shade pixels in 2x2 blocks called quads, as the image above shows; if a triangle touches even a single pixel of a quad, the whole quad gets shaded and the unused results are thrown away. This 'quad overshading' is why small and thin triangles can waste a large fraction of the pixel shader's work.

(Random trivia: quad overshading is also the reason you'll sometimes see fullscreen post effects use a single large triangle to cover the screen instead of two back-to-back triangles. With two triangles, quads that straddle the shared edge would be wasting some of their work, so avoiding that edge saves a minor amount of GPU time.)

Beyond overshading, tiny triangles are also a problem because GPUs can only process and rasterize triangles at a certain rate, which is usually relatively low compared to how many pixels they can process in the same amount of time. With too many small triangles, the rasterizer can't produce pixels fast enough to keep the shader units busy, resulting in stalls and idle time - the real enemy of GPU performance.

Similarly, long thin triangles are bad for performance for another reason beyond quad utilization: GPUs rasterize pixels in square or rectangular blocks, not in long strips. Compared to a more regular-shaped triangle with even sides, a long thin triangle makes the GPU do a lot of extra unnecessary work to rasterize it into pixels, potentially causing a bottleneck at the rasterization stage. This is why it's usually recommended that meshes are tessellated into evenly-shaped triangles, even if it increases the polygon count a bit. As with everything else, experimentation and profiling will show the best balance.
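For reference, that single fullscreen triangle is typically generated in the vertex shader from nothing but the vertex index, with no vertex buffer bound at all. A common formulation looks something like this sketch:

```hlsl
// Fullscreen pass using one oversized triangle instead of two:
// draw 3 vertices with no vertex buffer; positions come from SV_VertexID.
// The vertices land at (-1,1), (3,1) and (-1,-3) in clip space, so the
// triangle covers the whole screen with no internal edge for quads to straddle.
float4 FullscreenTriangleVS(uint id : SV_VertexID) : SV_Position
{
    float2 uv = float2((id << 1) & 2, id & 2);          // (0,0), (2,0), (0,2)
    return float4(uv * float2(2, -2) + float2(-1, 1), 0, 1);
}
```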

## Memory Bandwidth and Textures

A texture on two quads, one close to the camera and one much further away
The same texture with a corresponding mipmap chain, each mip being half the size of the previous one
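It's worth quantifying how cheap mipmaps are in memory terms: each mip level is half the width and half the height of the one above it, so it holds only a quarter as many pixels, and the whole chain adds just 1/4 + 1/16 + 1/64 + ... ≈ 33% to the size of the top-level texture. In exchange, distant surfaces sample from the small mips, which use far less memory bandwidth and fit much better in the cache than the full-resolution texture would.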
Lastly, texture compression is an important way of reducing bandwidth and cache usage, in addition to the obvious memory savings from storing less texture data. Using BC formats (Block Compression, previously known as DXT compression), textures can be reduced to a quarter or even a sixth of their original size in exchange for a minor hit in quality. This is a significant reduction in the amount of data that needs to be transferred and processed, and most GPUs even keep the textures compressed in the cache, leaving more room to store other texture data and increasing overall cache efficiency.

All of the above leads to some concrete steps for reducing or eliminating bandwidth bottlenecks on the art side:
• Make sure the textures have mips and are compressed.
• Don't use heavy 8x or 16x anisotropic filtering if 2x is enough - or even trilinear or bilinear, if possible.
• Reduce texture resolution, particularly if the top-level mip is often displayed.
• Don't use material features that cause texture accesses unless the feature is really needed.
• Make sure all the data being fetched is actually used - don't sample four RGBA textures when you only need the data in each one's red channel; merge those four channels into a single texture and you've removed 75% of the bandwidth usage (see the sketch at the end of this section).

While textures are the primary users of memory bandwidth, they're by no means the only ones. Mesh data (vertex and index buffers) also needs to be loaded from memory, and you'll notice in the first GPU pipeline diagram that the final render target output is also a write to memory. All of these transfers usually share the same memory bandwidth. In normal rendering these costs typically aren't noticeable, as the amount of data is relatively small compared to the texture data, but this isn't always the case: compared to regular draw calls, shadow passes behave quite differently and are much more likely to be bandwidth bound.
A frame from GTA V with shadow maps, courtesy of Adrian Courrèges' great frame analysis
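To make the channel-merging tip from the list above concrete, here's a sketch with hypothetical texture names, assuming four greyscale masks that each started life in the red channel of a separate RGBA texture:

```hlsl
// Wasteful: four RGBA textures sampled for one useful channel each -
// four fetches' worth of bandwidth for one channel of data apiece.
Texture2D RoughnessTex, MetalnessTex, AOTex, HeightTex;  // hypothetical maps
SamplerState TexSampler;

float4 SampleMasksWasteful(float2 uv)
{
    return float4(RoughnessTex.Sample(TexSampler, uv).r,
                  MetalnessTex.Sample(TexSampler, uv).r,
                  AOTex.Sample(TexSampler, uv).r,
                  HeightTex.Sample(TexSampler, uv).r);
}

// Better: the four masks packed into one RGBA texture at author time -
// a single fetch brings in all four channels, and all of the data is used.
Texture2D PackedMaskTex; // r = roughness, g = metalness, b = AO, a = height

float4 SampleMasksPacked(float2 uv)
{
    return PackedMaskTex.Sample(TexSampler, uv);
}
```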

# And there's more...

As you've probably gathered by now, GPUs are complex pieces of hardware. When fed properly, they are capable of processing an enormous amount of data and performing billions of calculations every second. On the other hand, bad data and poor usage can slow them down to a crawl, having a devastating effect on the game's framerate. There are many more things that could be discussed or expanded upon, but what's above is a good place to start for any technically-minded artist. Having an understanding of how the GPU works can help you produce art that not only looks great but also performs well... and better performance can let you improve your art even more, making the game look better too. There's a lot to take in here, but remember that your 3D programming team is always happy to sit down with you and discuss anything that needs more explanation - as am I in the comments section below!

Note: This article was originally published on fragmentbuffer.com, and is republished here with kind permission from the author Keith O'Conor. You can read more of Keith's writing on Twitter (@keithoconor).
