Jump to content
  • Advertisement
  • 03/15/17 12:25 PM

    GPU Performance for Game Artists

    Graphics and GPU Programming


    Performance is everybody's responsibility, no matter what your role. When it comes to the GPU, 3D programmers have a lot of control over performance; we can optimize shaders, trade image quality for performance, use smarter rendering techniques... we have plenty of tricks up our sleeves. But there's one thing we don't have direct control over, and that's the game's art. We rely on artists to produce assets that not only look good but are also efficient to render. For artists, a little knowledge of what goes on under the hood can make a big impact on a game's framerate. If you're an artist and want to understand why things like draw calls, LODs, and mipmaps are important for performance, read on! To appreciate the impact that your art has on the game's performance, you need to know how a mesh makes its way from your modelling package onto the screen in the game. That means having an understanding of the GPU - the chip that powers your graphics card and makes real-time 3D rendering possible in the first place. Armed with that knowledge, we'll look at some common art-related performance issues, why they're a problem, and what you can do about it. Things are quickly going to get pretty technical, but if anything is unclear I'll be more than happy to answer questions in the comments section. Before we start, I should point out that I am going to deliberately simplify a lot of things for the sake of brevity and clarity. In many cases I'm generalizing, describing only the typical case, or just straight up leaving things out. In particular, for the sake of simplicity the idealized version of the GPU I describe below more closely matches that of the previous (DX9-era) generation. However when it comes to performance, all of the considerations below still apply to the latest PC & console hardware (although not necessarily all mobile GPUs). Once you understand everything described here, it will be much easier to get to grips with the variations and complexities you'll encounter later, if and when you start to dig deeper.

    Part 1: The rendering pipeline from 10,000 feet

    For a mesh to be displayed on the screen, it must pass through the GPU to be processed and rendered. Conceptually, this path is very simple: the mesh is loaded, vertices are grouped together as triangles, the triangles are converted into pixels, each pixel is given a colour, and that's the final image. Let's look a little closer at what happens at each stage. After you export a mesh from your DCC tool of choice (Digital Content Creation - Maya, Max, etc.), the geometry is typically loaded into the game engine in two pieces; a Vertex Buffer (VB) that contains a list of the mesh's vertices and their associated properties (position, UV coordinates, normal, color etc.), and an Index Buffer (IB) that lists which vertices in the VB are connected to form triangles. Along with these geometry buffers, the mesh will also have been assigned a material to determine what it looks like and how it behaves under different lighting conditions. To the GPU this material takes the form of custom-written shaders - programs that determine how the vertices are processed, and what colour the resulting pixels will be. When choosing the material for the mesh, you will have set various material parameters (eg. setting a base color value or picking a texture for various maps like albedo, roughness, normal etc.) - these are passed to the shader programs as inputs. The mesh and material data get processed by various stages of the GPU pipeline in order to produce pixels in the final render target (an image to which the GPU writes). That render target can then be used as a texture in subsequent shader programs and/or displayed on screen as the final image for the frame. For the purposes of this article, here are the important parts of the GPU pipeline from top to bottom:
    • Input Assembly. The GPU reads the vertex and index buffers from memory, determines how the vertices are connected to form triangles, and feeds the rest of the pipeline.
    • Vertex Shading. The vertex shader gets executed once for every vertex in the mesh, running on a single vertex at a time. Its main purpose is to transform the vertex, taking its position and using the current camera and viewport settings to calculate where it will end up on the screen.
    • Rasterization. Once the vertex shader has been run on each vertex of a triangle and the GPU knows where it will appear on screen, the triangle is rasterized - converted into a collection of individual pixels. Per-vertex values - UV coordinates, vertex color, normal, etc. - are interpolated across the triangle's pixels. So if one vertex of a triangle has a black vertex color and another has white, a pixel rasterized in the middle of the two will get the interpolated vertex color grey.
    • Pixel Shading. Each rasterized pixel is then run through the pixel shader (although technically at this stage it's not yet a pixel but 'fragment', which is why you'll see the pixel shader sometimes called a fragment shader). This gives the pixel a color by combining material properties, textures, lights, and other parameters in the programmed way to get a particular look. Since there are so many pixels (a 1080p render target has over two million) and each one needs to be shaded at least once, the pixel shader is usually where the GPU spends a lot of its time.
    • Render Target Output. Finally the pixel is written to the render target - but not before undergoing some tests to make sure it's valid. For example in normal rendering you want closer objects to appear in front of farther objects; the depth test can reject pixels that are further away than the pixel already in the render target. But if the pixel passes all the tests (depth, alpha, stencil etc.), it gets written to the render target in memory.
    There's much more to it, but that's the basic flow: the vertex shader is executed on each vertex in the mesh, each 3-vertex triangle is rasterized into pixels, the pixel shader is executed on each rasterized pixel, and the resulting colors are written to a render target. Under the hood, the shader programs that represent the material are written in a shader programming language such as HLSL. These shaders run on the GPU in much the same way that regular programs run on the CPU - taking in data, running a bunch of simple instructions to change the data, and outputting the result. But while CPU programs are generalized to work on any type of data, shader programs are specifically designed to work on vertices and pixels. These programs are written to give the rendered object the look of the desired material - plastic, metal, velvet, leather, etc. To give you a concrete example, here's a simple pixel shader that does Lambertian lighting (ie. simple diffuse-only, no specular highlights) with a material color and a texture. As shaders go it's one of the most basic, but you don't need to understand it - it just helps to see what shaders can look like in general. float3 MaterialColor; Texture2D MaterialTexture; SamplerState TexSampler; float3 LightDirection; float3 LightColor; float4 MyPixelShader( float2 vUV : TEXCOORD0, float3 vNorm : NORMAL0 ) : SV_Target { float3 vertexNormal = normalize(vNorm); float3 lighting = LightColor * dot( vertexNormal, LightDirection ); float3 material = MaterialColor * MaterialTexture.Sample( TexSampler, vUV ).rgb; float3 color = material * lighting; float alpha = 1; return float4(color, alpha); } A simple pixel shader that does basic lighting. The inputs at the top like MaterialTexture and LightColor are filled in by the CPU, while vUV and vNorm are both vertex properties that were interpolated across the triangle during rasterization. And the generated shader instructions: dp3 r0.x, v1.xyzx, v1.xyzx rsq r0.x, r0.x mul r0.xyz, r0.xxxx, v1.xyzx dp3 r0.x, r0.xyzx, cb0[1].xyzx mul r0.xyz, r0.xxxx, cb0[2].xyzx sample_indexable(texture2d)(float,float,float,float) r1.xyz, v0.xyxx, t0.xyzw, s0 mul r1.xyz, r1.xyzx, cb0[0].xyzx mul o0.xyz, r0.xyzx, r1.xyzx mov o0.w, l(1.000000) ret The shader compiler takes the above program and generates these instructions which are run on the GPU; a longer program produces more instructions which means more work for the GPU to do. As an aside, you might notice how isolated the shader steps are - each shader works on a single vertex or pixel without needing to know anything about the surrounding vertices/pixels. This is intentional and allows the GPU to process huge numbers of independent vertices and pixels in parallel, which is part of what makes GPUs so fast at doing graphics work compared to CPUs. We'll return to the pipeline shortly to see where things might slow down, but first we need to back up a bit and look at how the mesh and material got to the GPU in the first place. This is also where we meet our first performance hurdle - the draw call.

    The CPU and Draw Calls

    The GPU cannot work alone; it relies on the game code running on the machine's main processor - the CPU - to tell it what to render and how. The CPU and GPU are (usually) separate chips, running independently and in parallel. To hit our target frame rate - most commonly 30 frames per second - both the CPU and GPU have to do all the work to produce a single frame within the time allowed (at 30fps that's just 33 milliseconds per frame).
    To achieve this, frames are often pipelined; the CPU will take the whole frame to do its work (process AI, physics, input, animation etc.) and then send instructions to the GPU at the end of the frame so it can get to work on the next frame. This gives each processor a full 33ms to do its work at the expense of introducing a frame's worth of latency (delay). This may be an issue for extremely time-sensitive twitchy games like first person shooters - the Call of Duty series for example runs at 60fps to reduce the latency between player input and rendering - but in general the extra frame is not noticeable to the player. Every 33ms the final render target is copied and displayed on the screen at VSync - the interval during which the monitor looks for a new frame to display. But if the GPU takes longer than 33ms to finish rendering the frame, it will miss this window of opportunity and the monitor won't have any new frame to display. That results in either screen tearing or stuttering and an uneven framerate that we really want to avoid. We also get the same result if the CPU takes too long - it has a knock-on effect since the GPU doesn't get commands quickly enough to do its job in the time allowed. In short, a solid framerate relies on both the CPU and GPU performing well.
    missed-vsync.png Here the CPU takes too long to produce rendering commands for the second frame, so the GPU starts rendering late and thus misses VSync.
    To display a mesh, the CPU issues a draw call which is simply a series of commands that tells the GPU what to draw and how to draw it. As the draw call goes through the GPU pipeline, it uses the various configurable settings specified in the draw call - mostly determined by the mesh's material and its parameters - to decide how the mesh is rendered. These settings, called GPU state, affect all aspects of rendering, and consist of everything the GPU needs to know in order to render an object. Most significantly for us, GPU state includes the current vertex/index buffers, the current vertex/pixel shader programs, and all the shader inputs (eg. MaterialTexture or LightColor in the above shader code example). This means that to change a piece of GPU state (for example changing a texture or switching shaders), a new draw call must be issued. This matters because these draw calls are not free for the CPU. It costs a certain amount of time to set up the desired GPU state changes and then issue the draw call. Beyond whatever work the game engine needs to do for each call, extra error checking and bookkeeping cost is introduced by the graphics driver, an intermediate layer of code written by the GPU vendor (NVIDIA, AMD etc.) that translates the draw call into low-level hardware instructions. Too many draw calls can put too much of a burden on the CPU and cause serious performance problems. Due to this overhead, we generally set an upper limit to the number of draw calls that are acceptable per frame. If this limit is exceeded during gameplay testing, steps must be taken such as reducing the number of objects, reducing draw distance, etc. Console games will typically try to keep draw calls in the 2000-3000 range (eg. on Far Cry Primal we tried to keep it below 2500 per frame). That might sound like a lot, but it also includes any special rendering techniques that might be employed - cascaded shadows for example can easily double the number of draw calls in a frame. As mentioned above, GPU state can only be changed by issuing a new draw call. This means that although you may have created a single mesh in your modelling package, if one half of the mesh uses one texture for the albedo map and the other half uses a different texture, it will be rendered as two separate draw calls. The same goes if the mesh is made up of multiple materials; different shaders need to be set, so multiple draw calls must be issued. In practice, a very common source of state change - and therefore extra draw calls - is switching texture maps. Typically the whole mesh will use the same material (and therefore the same shaders), but different parts of the mesh will use different sets of albedo/normal/roughness maps. With a scene of hundreds or even thousands of objects, using many draw calls for each object will cost a considerable amount of CPU time and so will have a noticeable impact on the framerate of the game. To avoid this, a common solution is to combine all the different texture maps used on a mesh into a single big texture, often called an atlas. The UVs of the mesh are then adjusted to look up the right part of the atlas, and the entire mesh (or even multiple meshes) can be rendered in a single draw call. Care must be taken when constructing the atlas so that adjacent textures don't bleed into each other at lower mips, but these problems are relatively minor compared to the gains that can be had in terms of performance.
    atlas-infiltrator-signs.jpg A texture atlas from Unreal Engine's Infiltrator demo
    Many engines also support instancing, also known as batching or clustering. This is the ability to use a single draw call to render multiple objects that are mostly identical in terms of shaders and state, and only differ in a restricted set of ways (typically their position and rotation in the world). The engine will usually recognize when multiple identical objects can be rendered using instancing, so it's always preferable to use the same object multiple times in a scene when possible, instead of multiple different objects that will need to be rendered with separate draw calls. Another common technique for reducing draw calls is manually merging many different objects that share the same material into a single mesh. This can be effective, but care must be taken to avoid excessive merging which can actually worsen performance by increasing the amount of work for the GPU. Before any draw call gets issued, the engine's visibility system will determine whether or not the object will even appear on screen. If not, it's very cheap to just ignore the object at this early stage and not pay for any draw call or GPU work (also known as visibility culling). This is usually done by checking if the object's bounding volume is visible from the camera's point of view, and that it is not completely blocked from view (occluded) by any other objects. However, when multiple meshes are merged into a single object, their individual bounding volumes must be combined into a single large volume that is big enough to enclose every mesh. This increases the likelihood that the visibility system will be able to see some part of the volume, and so will consider the entire collection visible. That means that it becomes a draw call, and so the vertex shader must be executed on every vertex in the object - even if very few of those vertices actually appear on the screen. This can lead to a lot of GPU time being wasted because the vertices end up not contributing anything to the final image. For these reasons, mesh merging is the most effective when it is done on groups of small objects that are close to each other, as they will probably be on-screen at the same time anyway.
    xcom2-mesh-merging.png A frame from XCOM 2 as captured with RenderDoc. The wireframe (bottom) shows in grey all the extra geometry submitted to the GPU that is outside the view of the in-game camera.
    As an illustrative example take the above capture of XCOM 2, one of my favourite games of the last couple of years. The wireframe shows the entire scene as submitted to the GPU by the engine, with the black area in the middle being the geometry that's actually visible by the game camera. All the surrounding geometry in grey is not visible and will be culled after the vertex shader is executed, which is all wasted GPU time. In particular, note the highlighted red geometry which is a series of bush meshes, combined and rendered in just a few draw calls. Since the visibility system determined that at least some of the bushes are visible on the screen, they are all rendered and so must all have their vertex shader executed before determining which can be culled... which turns out to be most of them. Please note this isn't an indictment of XCOM 2 in particular, I just happened to be playing it while writing this article! Every game has this problem, and it's a constant battle to balance the CPU cost of doing more accurate visibility tests, the GPU cost of culling the invisible geometry, and the CPU cost of having more draw calls. Things are changing when it comes to the cost of draw calls however. As mentioned above, a significant reason for their expense is the overhead of the driver doing translation and error checking. This has long been the case, but the most modern graphics APIs (eg. Direct3D 12 and Vulkan) have been restructured in order to avoid most of this overhead. While this does introduce extra complexity to the game's rendering engine, it can also result in cheaper draw calls, allowing us to render many more objects than before possible. Some engines (most notably the latest version used by Assassin's Creed) have even gone in a radically different direction, using the capabilities of the latest GPUs to drive rendering and effectively doing away with draw calls altogether. The performance impact of having too many draw calls is mostly on the CPU; pretty much all other performance issues related to art assets are on the GPU. We'll now look at what a bottleneck is, where they can happen, and what we can do about them.

    Part 2: Common GPU bottlenecks

    The very first step in optimization is to identify the current bottleneck so you can take steps to reduce or eliminate it. A bottleneck refers to the section of the pipeline that is slowing everything else down. In the above case where too many draw calls are costing too much, the CPU is the bottleneck. Even if we performed other optimizations that made the GPU faster, it wouldn't matter to the framerate because the CPU is still running too slowly to produce a frame in the required amount of time.
    bottlenecks-ideal.png 4 draw calls going through the pipeline, each being the rendering of a full mesh containing many triangles. The stages overlap because as soon as one piece of work is finished it can be immediately passed to the next stage (eg. when three vertices are processed by the vertex shader then the triangle can proceed to be rasterized).
    You can think of the GPU pipeline as an assembly line. As each stage finishes with its data, it forwards the results to the following stage and proceeds with the next piece of work. Ideally every stage is busy working all the time, and the hardware is being utilized fully and efficiently as represented in the above image - the vertex shader is constantly processing vertices, the rasterizer is constantly rasterizing pixels, and so on. But consider what happens if one stage takes much longer than the others:
    What happens here is that an expensive vertex shader can't feed the following stages fast enough, and so becomes the bottleneck. If you had a draw call that behaved like this, making the pixel shader faster is not going to make much of a difference to the time it takes for the entire draw call to be rendered. The only way to make things faster is to reduce the time spent in the vertex shader. How we do that depends on what in the vertex shader stage is actually causing the bottleneck. You should keep in mind that there will almost always be a bottleneck of some kind - if you eliminate one, another will just take its place. The trick is knowing when you can do something about it, and when you have to live with it because that's just what it costs to render what you want to render. When you optimize, you're really trying to get rid of unnecessary bottlenecks. But how do you identify what the bottleneck is?


    Profiling tools are absolutely essential for figuring out where all the GPU's time is being spent, and good ones will point you at exactly what you need to change in order for things to go faster. They do this in a variety of ways - some explicitly show a list of bottlenecks, others let you run 'experiments' to see what happens (eg. "how does my draw time change if all the textures are tiny", which can tell you if you're bound by memory bandwidth or cache usage). Unfortunately this is where things get a bit hand-wavy, because some of the best performance tools available are only available for the consoles and therefore under NDA. If you're developing for Xbox or Playstation, bug your friendly neighbourhood graphics programmer to show you these tools. We love it when artists get involved in performance, and will be happy to answer questions and even host tutorials on how to use the tools effectively.
    unity-gpu-profiler.png Unity's basic built-in GPU profiler
    The PC already has some pretty good (albeit hardware-specific) profiling tools which you can get directly from the GPU vendors, such as NVIDIA's Nsight, AMD's GPU PerfStudio, and Intel's GPA. Then there's RenderDoc which is currently the best tool for graphics debugging on PC, but doesn't have any advanced profiling features. Microsoft is also starting to release its awesome Xbox profiling tool PIX for Windows too, albeit only for D3D12 applications. Assuming they also plan to provide the same bottleneck analysis tools as the Xbox version (tricky with the wide variety of hardware out there), it should be a huge asset to PC developers going forward. These tools can give you more information about the performance of your art than you will ever need. They can also give you a lot of insight into how a frame is put together in your engine, as well as being awesome debugging tools for when things don't look how they should. Being able to use them is important, as artists need to be responsible for the performance of their art. But you shouldn't be expected to figure it all out on your own - any good engine should provide its own custom tools for analyzing performance, ideally providing metrics and guidelines to help determine if your art assets are within budget. If you want to be more involved with performance but feel you don't have the necessary tools, talk to your programming team. Chances are they already exist - and if they don't, they should be created! Now that you know how GPUs work and what a bottleneck is, we can finally get to the good stuff. Let's dig into the most common real-world bottlenecks that can show up in the pipeline, how they happen, and what can be done about them.

    Shader instructions

    Since most of the GPU's work is done with shaders, they're often the source of many bottlenecks of the you'll see. When a bottleneck is identified as shader instructions (sometimes referred to as ALUs from Arithmetic Logic Units, the hardware that actually does the calculations), it's simply a way of saying the vertex or pixel shader is doing a lot of work and the rest of the pipeline is waiting for that work to finish. Often the vertex or pixel shader program itself is just too complex, containing many instructions and taking a long time to execute. Or maybe the vertex shader is reasonable but the mesh you're rendering has too many vertices which adds up to a lot of time spent executing the vertex shader. Or the draw call covers a large area of the screen touching many pixels, and so spends a lot of time in the pixel shader. Unsurprisingly, the best way to optimize a shader instruction bottleneck is to execute less instructions! For pixel shaders that means choosing a simpler material with less features to reduce the number of instructions executed per pixel. For vertex shaders it means simplifying your mesh to reduce the number of vertices that need to be processed, as well as being sure to use LODs (Level Of Detail - simplified versions of your mesh for use when the object is far away and small on the screen). Sometimes however, shader instruction bottlenecks are instead just an indication of problems in some other area. Issues such as too much overdraw, a misbehaving LOD system, and many others can cause the GPU to do a lot more work than necessary. These problems can be either on the engine side or the content side; careful profiling, examination, and experience will help you to figure out what's really going on. One of the most common of these issues - overdraw - is when the same pixel on the screen needs to be shaded multiple times, because it's touched by multiple draw calls. Overdraw is a problem because it decreases the overall time the GPU has to spend on rendering. If every pixel on the screen has to be shaded twice, the GPU can only spend half the amount of time on each pixel and still maintain the same framerate.
    pix-overdraw.png A frame capture from PIX with the corresponding overdraw visualization mode
    Sometimes overdraw is unavoidable, such as when rendering translucent objects like particles or glass-like materials; the background object is visible through the foreground, so both need to be rendered. But for opaque objects, overdraw is completely unnecessary because the pixel shown in the buffer at the end of rendering is the only one that actually needs to be processed. In this case, every overdrawn pixel is just wasted GPU time. Steps are taken by the GPU to reduce overdraw in opaque objects. The early depth test (which happens before the pixel shader - see the initial pipeline diagram) will skip pixel shading if it determines that the pixel will be hidden by another object. It does that by comparing the pixel being shaded to the depth buffer - a render target where the GPU stores the entire frame's depth so that objects occlude each other properly. But for the early depth test to be effective, the other object must have already been rendered so it is present in the depth buffer. That means that the rendering order of objects is very important. Ideally every scene would be rendered front-to-back (ie. objects closest to the camera first), so that only the foreground pixels get shaded and the rest get killed by the early depth test, eliminating overdraw entirely. But in the real world that's not always possible because you can't reorder the triangles inside a draw call during rendering. Complex meshes can occlude themselves multiple times, or mesh merging can result in many overlapping objects being rendered in the "wrong" order causing overdraw. There's no easy answer for avoiding these cases, and in the latter case it's just another thing to take into consideration when deciding whether or not to merge meshes. To help early depth testing, some games do a partial depth prepass. This is a preliminary pass where certain large objects that are known to be effective occluders (large buildings, terrain, the main character etc.) are rendered with a simple shader that only outputs to the depth buffer, which is relatively fast as it avoids doing any pixel shader work such as lighting or texturing. This 'primes' the depth buffer and increases the amount of pixel shader work that can be skipped during the full rendering pass later in the frame. The drawback is that rendering the occluding objects twice (once in the depth-only pass and once in the main pass) increases the number of draw calls, plus there's always a chance that the time it takes to render the depth pass itself is more than the time it saves from increased early depth test efficiency. Only profiling in a variety of cases can determine whether or not it's worth it for any given scene.
    particle-overdraw-p2.png Particle overdraw visualization of an explosion in Prototype 2
    One place where overdraw is a particular concern is particle rendering, given that particles are transparent and often overlap a lot. Artists working on particle effects should always have overdraw in mind when producing effects. A dense cloud effect can be produced by emitting lots of small faint overlapping particles, but that's going to drive up the rendering cost of the effect; a better-performing alternative would be to emit fewer large particles, and instead rely more on the texture and texture animation to convey the density of the effect. The overall result is often more visually effective anyway because offline software like FumeFX and Houdini can usually produce much more interesting effects through texture animation, compared to real-time simulated behaviour of individual particles. The engine can also take steps to avoid doing more GPU work than necessary for particles. Every rendered pixel that ends up completely transparent is just wasted time, so a common optimization is to perform particle trimming: instead of rendering the particle with two triangles, a custom-fitted polygon is generated that minimizes the empty areas of the texture that are used.
    unreal-particle-cutouts.jpg Particle 'cutout' tool in Unreal Engine 4
    The same can be done for other partially transparent objects such as vegetation. In fact for vegetation it's even more important to use custom geometry to eliminate the large amount of empty texture space, as vegetation often uses alpha testing. This is when the alpha channel of the texture is used to decide whether or not to discard the pixel during the pixel shader stage, effectively making it transparent. This is a problem because alpha testing can also have the side effect of disabling the early depth test completely (because it invalidates certain assumptions that the GPU can make about the pixel), leading to much more unnecessary pixel shader work. Combine this with the fact that vegetation often contains a lot of overdraw anyway - think of all the overlapping leaves on a tree - and it can quickly become very expensive to render if you're not careful. A close relative of overdraw is overshading, which is caused by tiny or thin triangles and can really hurt performance by wasting a significant portion of the GPU's time. Overshading is a consequence of how GPUs process pixels during pixel shading: not one at a time, but instead in 'quads' which are blocks of four pixels arranged in a 2x2 pattern. It's done like this so the hardware can do things like comparing UVs between pixels to calculate appropriate mipmap levels. This means that if a triangle only touches a single pixel of a quad (because the triangle is tiny or very thin), the GPU still processes the whole quad and just throws away the other three pixels, wasting 75% of the work. That wasted time can really add up, and is particularly painful for forward (ie. not deferred) renderers that do all lighting and shading in a single pass in the pixel shader. This penalty can be reduced by using properly-tuned LODs; besides saving on vertex shader processing, they can also greatly reduce overshading by having triangles cover more of each quad on average.'
    quad-coverage.png A 10x8 pixel buffer with 5x4 quads. The two triangles have poor quad utilization -- left is too small, right is too thin. The 10 red quads touched by the triangles need to be completely shaded, even though the 12 green pixels are the only ones that are actually needed. Overall, 70% of the GPU's work is wasted.
    (Random trivia: quad overshading is also the reason you'll sometimes see fullscreen post effects use a single large triangle to cover the screen instead of two back-to-back triangles. With two triangles, quads that straddle the shared edge would be wasting some of their work, so avoiding that saves a minor amount of GPU time.) Beyond overshading, tiny triangles are also a problem because GPUs can only process and rasterize triangles at a certain rate, which is usually relatively low compared to how many pixels it can process in the same amount of time. With too many small triangles, it can't produce pixels fast enough to keep the shader units busy, resulting in stalls and idle time - the real enemy of GPU performance. Similarly, long thin triangles are bad for performance for another reason beyond quad usage: GPUs rasterize pixels in square or rectangular blocks, not in long strips. Compared to a more regular-shaped triangle with even sides, a long thin triangle ends up making the GPU do a lot of extra unnecessary work to rasterize it into pixels, potentially causing a bottleneck at the rasterization stage. This is why it's usually recommended that meshes are tessellated into evenly-shaped triangles, even if it increases the polygon count a bit. As with everything else, experimentation and profiling will show the best balance.

    Memory Bandwidth and Textures

    As illustrated in the above diagram of the GPU pipeline, meshes and textures are stored in memory that is physically separate from the GPU's shader processors. That means that whenever the GPU needs to access some piece of data, like a texture being fetched by a pixel shader, it needs to retrieve it from memory before it can actually use it as part of its calculations. Memory accesses are analogous to downloading files from the internet. File downloads take a certain amount of time due to the internet connection's bandwidth - the speed at which data can be transferred. That bandwidth is also shared between all downloads - if you can download one file at 6MB/s, two files only download at 3MB/s each. The same is true of memory accesses; index/vertex buffers and textures being accessed by the GPU take time, and must share memory bandwidth. The speeds are obviously much higher than internet connections - on paper the PS4's GPU memory bandwidth is 176GB/s - but the idea is the same. A shader that accesses many textures will rely heavily on having enough bandwidth to transfer all the data it needs in the time it needs it. Shaders programs are executed by the GPU with these restrictions in mind. A shader that needs to access a texture will try to start the transfer as early as possible, then do other unrelated work (for example lighting calculations) and hope that the texture data has arrived from memory by the time it gets to the part of the program that needs it. If the data hasn't arrived in time - because the transfer is slowed down by lots of other transfers, or because it runs out of other work to do (especially likely for dependent texture fetches) - execution will stop and it will just sit there and wait. This is a memory bandwidth bottleneck; making the rest of the shader faster will not matter if it still needs to stop and wait for data to arrive from memory. The only way to optimize this is to reduce the amount of bandwidth being used, or the amount of data being transferred, or both. Memory bandwidth might even have to be shared with the CPU or async compute work that the GPU is doing at the same time. It's a very precious resource. The majority of memory bandwidth is usually taken up by texture transfers, since textures contain so much data. As a result, there are a few different mechanisms in place to reduce the amount of texture data that needs to be shuffled around. Memory bandwidth might even have to be shared with the CPU or async compute work that the GPU is doing at the same time. It's a very precious resource. The majority of memory bandwidth is usually taken up by texture transfers, since textures contain so much data. As a result, there are a few different mechanisms in place to reduce the amount of texture data that needs to be shuffled around. First and foremost is a cache. This is a small piece of high-speed memory that the GPU has very fast access to, and is used to keep chunks of memory that have been accessed recently in case the GPU needs them again. In the internet connection analogy, the cache is your computer's hard drive that stores the downloaded files for faster access in the future. When a piece of memory is accessed, like a single texel in a texture, the surrounding texels are also pulled into the cache in the same memory transfer. The next time the GPU looks for one of those texels, it doesn't need to go all the way to memory and can instead fetch it from the cache extremely quickly. This is actually often the common case - when a texel is displayed on the screen in one pixel, it's very likely that the pixel beside it will need to show the same texel, or the texel right beside it in the texture. When that happens, nothing needs to be transferred from memory, no bandwidth is used, and the GPU can access the cached data almost instantly. Caches are therefore vitally important for avoiding memory-related bottlenecks. Especially when you take filtering into account - bilinear, trilinear, and anisotropic filtering all require multiple texels to be accessed for each lookup, putting an extra burden on bandwidth usage. High-quality anisotropic filtering is particularly bandwidth-intensive. Now think about what happens in the cache if you try to display a large texture (eg. 2048x2048) on an object that's very far away and only takes up a few pixels on the screen. Each pixel will need to fetch from a very different part of the texture, and the cache will be completely ineffective since it only keeps texels that were close to previous accesses. Every texture access will try to find its result in the cache and fail (called a 'cache miss') and so the data must be fetched from memory, incurring the dual costs of bandwidth usage and the time it takes for the data to be transferred. A stall may occur, slowing the whole shader down. It will also cause other (potentially useful) data to be 'evicted' from the cache in order to make room for the surrounding texels that will never even be used, reducing the overall efficiency of the cache. It's bad news all around, and that's not to even mention the visual quality issues - tiny movements of the camera will cause completely different texels to be sampled, causing aliasing and sparkling. This is where mipmapping comes to the rescue. When a texture fetch is issued, the GPU can analyze the texture coordinates being used at each pixel, determining when there is a large gap between texture accesses. Instead of incurring the costs of a cache miss for every texel, it instead accesses a lower mip of the texture that matches the resolution it's looking for. This greatly increases the effectiveness of the cache, reducing memory bandwidth usage and the potential for a bandwidth-related bottleneck. Lower mips are also smaller and need less data to be transferred from memory, further reducing bandwidth usage. And finally, since mips are pre-filtered, their use also vastly reduces aliasing and sparkling. For all of these reasons, it's almost always a good idea to use mipmaps - the advantages are definitely worth the extra memory usage.
    minimized-texture.png A texture on two quads, one close to the camera and one much further away
    mipmap-example.jpg The same texture with a corresponding mipmap chain, each mip being half the size of the previous one
    Lastly, texture compression is an important way of reducing bandwidth and cache usage (in addition to the obvious memory savings from storing less texture data). Using BC (Block Compression, previously known as DXT compression), textures can be reduced to a quarter or even a sixth of their original size in exchange for a minor hit in quality. This is a significant reduction in the amount of data that needs to be transferred and processed, and most GPUs even keep the textures compressed in the cache, leaving more room to store other texture data and increasing overall cache efficiency. All of the above information should lead to some obvious steps for reducing or eliminating bandwidth bottlenecks when it comes to texture optimization on the art side. Make sure the textures have mips and are compressed. Don't use heavy 8x or 16x anisotropic filtering if 2x is enough, or even trilinear or bilinear if possible. Reduce texture resolution, particularly if the top-level mip is often displayed. Don't use material features that cause texture accesses unless the feature is really needed. And make sure all the data being fetched is actually used - don't sample four RGBA textures when you actually only need the data in the red channels of each; merge those four channels into a single texture and you've removed 75% of the bandwidth usage. While textures are the primary users of memory bandwidth, they're by no means the only ones. Mesh data (vertex and index buffers) also need to be loaded from memory. You'll also notice in first GPU pipeline diagram that the final render target output is a write to memory. All these transfers usually share the same memory bandwidth. In normal rendering these costs typically aren't noticeable as the amount of data is relatively small compared to the texture data, but this isn't always the case. Compared to regular draw calls, shadow passes behave quite differently and are much more likely to be bandwidth bound.
    gta-v-shadows.png A frame from GTA V with shadow maps, courtesy of Adrian Courr?ges' great frame analysis
    This is because shadow maps are simply depth buffer that represent the distance from the light to the closest mesh, so most of the work that needs to be done for shadow rendering consists of transferring data to and from memory: fetch the vertex/index buffers, do some simple calculations to determine position, and then write the depth of the mesh to the shadow map. Most of the time, a pixel shader isn't even executed because all the necessary depth information comes from just the vertex data. This leaves very little work to hide the overhead of all the memory transfers, and the likely bottleneck is that the shader just ends up waiting for memory transfers to complete. As a result, shadow passes are particularly sensitive to both vertex/triangle counts and shadow map resolution, as they directly affect the amount of bandwidth that is needed. The last thing worth mentioning with regards to memory bandwidth is a special case - the Xbox. Both the Xbox 360 and Xbox One have a particular piece of memory embedded close to the GPU, called EDRAM on 360 and ESRAM on XB1. It's a relatively small amount of memory (10MB on 360 and 32MB on XB1), but big enough to store a few render targets and maybe some frequently-used textures, and with a much higher bandwidth than regular system memory (aka DRAM). Just as important as the speed is the fact that this bandwidth uses a dedicated path, so doesn't have to be shared with DRAM transfers. It adds complexity to the engine, but when used efficiently it can give some extra headroom in bandwidth-limited situations. As an artist you generally won't have control over what goes into EDRAM/ESRAM, but it's worth knowing of its existence when it comes to profiling. The 3D programming team can give you more details on its use in your particular engine.

    And there's more...

    As you've probably gathered by now, GPUs are complex pieces of hardware. When fed properly, they are capable of processing an enormous amount of data and performing billions of calculations every second. On the other hand, bad data and poor usage can slow them down to a crawl, having a devastating effect on the game's framerate. There are many more things that could be discussed or expanded upon, but what's above is a good place to start for any technically-minded artist. Having an understanding of how the GPU works can help you produce art that not only looks great but also performs well... and better performance can let you improve your art even more, making the game look better too. There's a lot to take in here, but remember that your 3D programming team is always happy to sit down with you and discuss anything that needs more explanation - as am I in the comments section below!

    Further Technical Reading

    Note: This article was originally published on fragmentbuffer.com, and is republished here with kind permission from the author Keith O'Conor. You can read more of Keith's writing on Twitter (@keithoconor).

      Report Article

    User Feedback

    Good article. I think it would be worth mentioning Unity has Frame Debugger tool better than the mentioned general-purpose Profiler tool for GPU-related debugging.

    Share this comment

    Link to comment
    Share on other sites

    Create an account or sign in to comment

    You need to be a member in order to leave a comment

    Create an account

    Sign up for a new account in our community. It's easy!

    Register a new account

    Sign in

    Already have an account? Sign in here.

    Sign In Now

  • Advertisement
  • Advertisement
  • intellogo.png

    Are you ready to promote your game?

    Submit your game for Intel® certification by December 21, 2018 and you could win big! 

    Click here to learn more.

  • Latest Featured Articles

  • Featured Blogs

  • Advertisement
  • Popular Now

  • Similar Content

    • By CyberFlash
      First of, I think what I'm trying to do is going to be super duper simple and I'm just missing one key piece of information.... So the best way to illustrate/explain my end goal is to look at Bloons Tower Defense 3 (It probably exists in a lot of other examples also)... When you grab a tower from the menu and place it, It's 'highlighted'/'selected' and you can see the upgrade information... Okay I can do that... Mouse Left Pressed - Clicked = true; and then Draw - //Upgrade Info ...
      My question is how to deactivate that selection.. I was thinking it would be a simple If ! clicking on the tower... which makes sense.. If you're clicking but its not on the object you want to turn it off.. cool BUT what if you are attempting to click the upgrade button on the menu??? That is NOT the object but you DO want to keep it selected... This is what I'm trying to achieve.. And I don't know how to explain it...
      I was thinking that I could do a trigger like
      if clicked = true { if (!mouse_x >= 50 && !mouse_x <= 100) && (!mouse_y >= 50 && !mouse_y <= 100){ //Example Coordinates - Pretend this is upgrade button if (mouse_check_button_pressed(mb_left)){ //Clicking Left Button clicked = false; } } } And this is where I'm getting stuck... Like I want to click the object and have it trigger clicked = true; I've got that, that happens.. I now want to have it so if you are NOT clicking the object and NOT clicking the upgrade button (this is drawn via a sprite so I'm using coordinates) then I want clicked to go false....
      I just feel like all the && && && && is wrong... is there any sort of command that could be called that means like 'anything else' ? I'm thinking that might not work either though to be honest as I want to be able to click the upgrade button... 
      Maybe having the use of 'or' could apply here.. maybe having the coordinates of 'self' (this will be triggered by one tower that has been clicked not all towers at same time obviously,,,) and the coordinates of where the upgrade button will be in the same line so like
      If you are NOT clicking tower or NOT clicking upgrade button then clicked = false?? 
      I feel that would be it... I'm just not sure.. Any ideas? I'm not used to trying to share code so I might have made a mess above. But I hope this helps explain what I'm trying to achieve.
      Also I'm stuck outside for a while so I'm having a nightmare of a time just trying to think of this but not able to test it right now!!! 
    • By slayemin
      What exactly is an Octree? If you're completely unfamiliar with them, I recommend reading the Wikipedia article (read time: ~5 minutes). This is a sufficient description of what it is but is barely enough to give any ideas on what it's used for and how to actually implement one. In this article, I will do my best to take you through the steps necessary to create an octree data structure through conceptual explanations, pictures, and code, and show you the considerations to be made at each step along the way. I don't expect this article to be the authoritative way to do octrees, but it should give you a really good start and act as a good reference.
      Before we dive in, I'm going to be making a few assumptions about you as a reader:
      You are very comfortable with programming in a C-syntax-style language (I will be using C# with XNA). You have programmed some sort of tree-like data structure in the past, such as a binary search tree and are familiar with recursion and its strengths and pitfalls. You know how to do collision detection with bounding rectangles, bounding spheres, and bounding frustums. You have a good grasp of common data structures (arrays, lists, etc) and understand Big-O notation (you can also learn about Big-O in this GDnet article). You have a development environment project which contains spatial objects which need collision tests.  
      Setting the stage
      Let's suppose that we are building a very large game world which can contain thousands of physical objects of various types, shapes and sizes, some of which must collide with each other. Each frame we need to find out which objects are intersecting with each other and have some way to handle that intersection. How do we do it without killing performance?
      Brute force collision detection
      The simplest method is to just compare each object against every other object in the world. Typically, you can do this with two for loops. The code would look something like this:
      foreach(gameObject myObject in ObjList) { foreach(gameObject otherObject in ObjList) { if(myObject == otherObject) continue; //avoid self collision check if(myObject.CollidesWith(otherObject)) { //code to handle the collision } } }  
      Conceptually, this is what we're doing in our picture:
      Each red line is an expensive CPU test for intersection. Naturally, you should feel horrified by this code because it is going to run in O(N^2) time. If you have 10,000 objects, then you're going to be doing 100,000,000 collision checks (hundred million). I don't care how fast your CPU is or how well you've tuned your math code, this code would reduce your computer to a sluggish crawl. If you're running your game at 60 frames per second, you're looking at 60 * 100 million calculations per second! It's nuts. It's insane. It's crazy. Let's not do this if we can avoid it, at least not with a large set of objects. This would only be acceptable if we're only checking, say, 10 items against each other (100 checks is palatable). If you know in advance that your game is only going to have a very small number of objects (i.e., Asteroids), you can probably get away with using this brute force method for collision detection and ignore octrees altogether. If/when you start noticing performance problems due to too many collision checks per frame, consider some simple targeted optimizations:
      How much computation does your current collision routine take? Do you have a square root hidden away in there (ie, a distance check)? Are you doing a granular collision check (pixel vs pixel, triangle vs triangle, etc)? One common technique is to perform a rough, coarse check for collision before testing for a granular collision check. You can give your objects an enclosing bounding rectangle or bounding sphere and test for intersection with these before testing against a granular check which may involve a lot more math and computation time. Can you get away with calculating fewer collision checks? If your game runs at 60 frames per second, could you skip a few frames? If you know certain objects behave deterministically, can you "solve" for when they will collide ahead of time (ie, pool ball vs. side of pool table). Can you reduce the number of objects which need to be checked for collisions? A technique for this would be to separate objects into several lists. One list could be your "stationary" objects list. They never have to test for collision against each other. The other list could be your "moving" objects, which need to be tested against all other moving objects and against all stationary objects. This could reduce the number of necessary collision tests to reach an acceptable performance level. Can you get away with removing some object collision tests when performance becomes an issue? For example, a smoke particle could interact with a surface object and follow its contours to create a nice aesthetic effect, but it wouldn't break gameplay if you hit a predefined limit for collision checks and decided to stop ignoring smoke particles for collision. Ignoring essential game object movement would certainly break gameplay though (i.e., player bullets stop intersecting with monsters). So, perhaps maintaining a priority list of collision checks to compute would help. First, you handle the high priority collision tests, and if you're not at your threshold, you can handle lower priority collision tests. When the threshold is reached, you dump the rest of the items in the priority list or defer them for testing at a later time.  Can you use a faster but still simplistic method for collision detection to get away from an O(N^2) runtime? If you eliminate the objects you've already checked for collisions against, you can reduce the runtime to O(N(N+1)/2), which is much faster and still easy to implement. (technically, it's still O(N^2)) In terms of software engineering, you may end up spending more time than it's worth fine-tuning a bad algorithm & data structure choice to squeeze out a few more ounces of performance. The cost vs. benefit ratio becomes increasingly unfavourable and it becomes time to choose a better data structure to handle collision detection. Spatial partitioning algorithms are the proverbial nuke to solving the runtime problem for collision detection. At a small upfront cost to performance, they'll reduce your collision detection tests to logarithmic runtime. The upfront costs of development time and CPU overhead are easily outweighed by the scalability benefits and performance gains.  
      Conceptual background on spatial partitioning
      Let's take a step back and look at spatial partitioning and trees in general before diving into Octrees. If we don't understand the conceptual idea, we have no hope of implementing it by sweating over the code. Looking at the brute force implementation above, we're essentially taking every object in the game and comparing their positions against all other objects in the game to see if any are touching. All of these objects are contained spatially within our game world. Well, if we create an enclosing box around our game world and figure out which objects are contained within this enclosing box, then we've got a region of space with a list of contained objects within it. In this case, it would contain every object in the game.
      We can notice that if we have an object on one corner of the world and another object way on the other side, we don't really need to, or want to, calculate a collision check against them every frame. It'd be a waste of precious CPU time.
      So, let's try something interesting! If we divide our world exactly in half, we can create three separate lists of objects. The first list of objects, List A, contains all objects on the left half of the world. The second list, List B, contains objects on the right half of the world. Some objects may touch the dividing line such that they're on each side of the line, so we'll create a third list, List C, for these objects.
      We can notice that with each subdivision, we're spatially reducing the world in half and collecting a list of objects in that resulting half. We can elegantly create a binary search tree to contain these lists. Conceptually, this tree should look something like so:
      In terms of pseudo code, the tree data structure would look something like this:
      public class BinaryTree { //This is a list of all of the objects contained within this node of the tree private List m_objectList; //These are pointers to the left and right child nodes in the tree private BinaryTree m_left, m_right; //This is a pointer to the parent object (for upward tree traversal). private BinaryTree m_parent; } We know that all objects in List A will never intersect with any objects in List B, so we can almost eliminate half of the number of collision checks. We've still got the objects in List C which could touch objects in either list A or B, so we'll have to check all objects in List C against all objects in Lists A, B & C. If we continue to sub-divide the world into smaller and smaller parts, we can further reduce the number of necessary collision checks by half each time. This is the general idea behind spatial partitioning. There are many ways to subdivide a world into a tree-like data structure (BSP trees, Quad Trees, K-D trees, OctTrees, etc).
      Now, by default, we're just assuming that the best division is a cut in half, right down the middle, since we're assuming that all of our objects will be somewhat uniformly distributed throughout the world. It's not a bad assumption to make, but some spatial division algorithms may decide to make a cut such that each side has an equal amount of objects (a weighted cut) so that the resulting tree is more balanced. However, what happens if all of these objects move around? In order to maintain a nearly even division, you'd have to either shift the splitting plane or completely rebuild the tree each frame. It'd be a bit of a mess with a lot of complexity. So, for my implementation of a spatial partitioning tree, I decided to cut right down the middle every time. As a result, some trees may end up being a bit more sparse than others, but that's okay -- it doesn't cost much.
      To subdivide or not to subdivide? That is the question.
      Let's assume that we have a somewhat sparse region with only a few objects. We could continue subdividing our space until we've found the smallest possible enclosing area for that object. But is that really necessary? Let's remember that the whole reason we're creating a tree is to reduce the number of collision checks we need to perform each frame -- not to create a perfectly enclosing region of space for every object. Here are the rules I use for deciding whether to subdivide or not:
      If we create a subdivision which only contains one object, we can stop subdividing even though we could keep dividing further. This rule will become an important part of the criteria for what defines a "leaf node" in our octree. The other important criteria is to set a minimum size for a region. If you have an extremely small object which is nanometers in size (or, god forbid, you have a bug and forgot to initialize an object size!), you're going to keep subdividing to the point where you potentially overflow your call stack. For my own implementation, I defined the smallest containing region to be a 1x1x1 cube. Any objects in this teeny cube will just have to be run with the O(N^2) brute force collision test (I don't anticipate many objects anyways!). If a containing region doesn't contain any objects, we shouldn't try to include it in the tree. We can take our subdivision by half one step further and divide the 2D world space into quadrants. The logic is essentially the same, but now we're testing for collision with four squares instead of two rectangles. We can continue subdividing each square until our rules for termination are met. The representation of the world space and corresponding data structure for a quadtree would look something like this:
      If the quadtree subdivision and data structure makes sense, then an octree should be pretty straightforward as well. We're just adding a third dimension, using bounding cubes instead of bounding squares, and have eight possible child nodes instead of four. Some of you might wonder what should happen if you have a game world with non-cubic dimensions, say 200x300x400. You can still use an octree with cubic dimensions -- some child nodes will just end up empty if the game world doesn't have anything there. Obviously, you'll want to set the dimensions of your octree to at least the largest dimension of your game world.
      Octree Construction
      So, as you've read, an octree is a special type of subdividing tree commonly used for objects in 3D space (or anything with 3 dimensions). Our enclosing region is going to be a three dimensional rectangle (commonly a cube). We will then apply our subdivision logic above, and cut our enclosing region into eight smaller rectangles. If a game object completely fits within one of these subdivided regions, we'll push it down the tree into that node's containing region. We'll then recursively continue subdividing each resulting region until one of our breaking conditions is met. In the end, we should expect to have a nice tree-like data structure.
      My implementation of the octree can contain objects which have either a bounding sphere and/or a bounding rectangle. You'll see a lot of code I use to determine which is being used.
      In terms of our Octree class data structure, I decided to do the following for each tree:
      Each node has a bounding region which defines the enclosing region Each node has a reference to the parent node Contains an array of eight child nodes (use arrays for code simplicity and cache performance) Contains a list of objects contained within the current enclosing region I use a byte-sized bitmask for figuring out which child nodes are actively being used (the optimization benefits at the cost of additional complexity is somewhat debatable) I use a few static variables to indicate the state of the tree Here is the code for my Octree class outline:
      public class OctTree { BoundingBox m_region; List m_objects; /// /// These are items which we're waiting to insert into the data structure. /// We want to accrue as many objects in here as possible before we inject them into the tree. This is slightly more cache friendly. /// static Queue m_pendingInsertion = new Queue(); /// /// These are all of the possible child octants for this node in the tree. /// OctTree[] m_childNode = new OctTree[8]; /// /// This is a bitmask indicating which child nodes are actively being used. /// It adds slightly more complexity, but is faster for performance since there is only one comparison instead of 8. /// byte m_activeNodes = 0; /// /// The minumum size for enclosing region is a 1x1x1 cube. /// const int MIN_SIZE = 1; /// /// this is how many frames we'll wait before deleting an empty tree branch. Note that this is not a constant. The maximum lifespan doubles /// every time a node is reused, until it hits a hard coded constant of 64 /// int m_maxLifespan = 8; // int m_curLife = -1; //this is a countdown time showing how much time we have left to live /// /// A reference to the parent node is nice to have when we're trying to do a tree update. /// OctTree _parent; static bool m_treeReady = false; //the tree has a few objects which need to be inserted before it is complete static bool m_treeBuilt = false; //there is no pre-existing tree yet. }  
      Initializing the enclosing region
      The first step in building an octree is to define the enclosing region for the entire tree. This will be the bounding box for the root node of the tree which initially contains all objects in the game world. Before we go about initializing this bounding volume, we have a few design decisions we need to make:
      What should happen if an object moves outside of the bounding volume of the root node? Do we want to resize the entire octree so that all objects are enclosed? If we do, we'll have to completely rebuild the octree from scratch. If we don't, we'll need to have some way to either handle out of bounds objects or ensure that objects never go out of bounds. How do we want to create the enclosing region for our octree? Do we want to use a preset dimension, such as a 200x400x200 (X,Y,Z) rectangle? Or do we want to use a cubic dimension which is a power of 2? What should be the smallest allowable enclosing region which cannot be subdivided? Personally, I decided that I would use a cubic enclosing region with dimensions which are a power of 2, and sufficiently large to completely enclose my world. The smallest allowable cube is a 1x1x1 unit region. With this, I know that I can always cleanly subdivide my world and get integer numbers (even though the Vector3 uses floats). I also decided that my enclosing region would enclose the entire game world, so if an object leaves this region, it should be quietly destroyed. At the smallest octant, I will have to run a brute force collision check against all other objects, but I don't realistically expect more than 3 objects to occupy that small of an area at a time, so the performance costs of O(N^2) are completely acceptable. So, I normally just initialize my octree with a constructor which takes a region size and a list of items to insert into the tree. I feel it's barely worth showing this part of the code since it's so elementary, but I'll include it for completeness. Here are my constructors:
      /*Note: we want to avoid allocating memory for as long as possible since there can be lots of nodes.*/ /// /// Creates an oct tree which encloses the given region and contains the provided objects. /// ///The bounding region for the oct tree. ///The list of objects contained within the bounding region private OctTree(BoundingBox region, List objList) { m_region = region; m_objects = objList; m_curLife = -1; } public OctTree() { m_objects = new List(); m_region = new BoundingBox(Vector3.Zero, Vector3.Zero); m_curLife = -1; } /// /// Creates an octTree with a suggestion for the bounding region containing the items. /// ///The suggested dimensions for the bounding region. ///Note: if items are outside this region, the region will be automatically resized. public OctTree(BoundingBox region) { m_region = region; m_objects = new List(); m_curLife = -1; }  
      Building an initial octree
      I'm a big fan of lazy initialization. I try to avoid allocating memory or doing work until I absolutely have to. In the case of my octree, I avoid building the data structure as long as possible. We'll accept a user's request to insert an object into the data structure, but we don't actually have to build the tree until someone runs a query against it.
      What does this do for us? Well, let's assume that the process of constructing and traversing our tree is somewhat computationally expensive. If a user wants to give us 1,000 objects to insert into the tree, does it make sense to recompute every subsequent enclosing area a thousand times? Or, can we save some time and do a bulk blast? I created a "pending" queue of items and a few flags to indicate the build state of the tree. All of the inserted items get put into the pending queue and when a query is made, those pending requests get flushed and injected into the tree. This is especially handy during a game loading sequence since you'll most likely be inserting thousands of objects at once. After the game world has been loaded, the number of objects injected into the tree is orders of magnitude fewer. My lazy initialization routine is contained within my UpdateTree() method. It checks to see if the tree has been built, and builds the data structure if it doesn't exist and has pending objects.
      /// /// Processes all pending insertions by inserting them into the tree. /// /// Consider deprecating this? private void UpdateTree() //complete & tested { if (!m_treeBuilt) { while (m_pendingInsertion.Count != 0) m_objects.Add(m_pendingInsertion.Dequeue()); BuildTree(); } else { while (m_pendingInsertion.Count != 0) Insert(m_pendingInsertion.Dequeue()); } m_treeReady = true; } As for building the tree itself, this can be done recursively. So for each recursive iteration, I start off with a list of objects contained within the bounding region. I check my termination rules, and if we pass, we create eight subdivided bounding areas which are perfectly contained within our enclosed region. Then, I go through every object in my given list and test to see if any of them will fit perfectly within any of my octants. If they do fit, I insert them into a corresponding list for that octant. At the very end, I check the counts on my corresponding octant lists and create new octrees and attach them to our current node, and mark my bitmask to indicate that those child octants are actively being used. All of the leftover objects have been pushed down to us from our parent, but can't be pushed down to any children, so logically, this must be the smallest octant which can contain the object.
      /// /// Naively builds an oct tree from scratch. /// private void BuildTree() //complete & tested { //terminate the recursion if we're a leaf node if (m_objects.Count <= 1) return; Vector3 dimensions = m_region.Max - m_region.Min; if (dimensions == Vector3.Zero) { FindEnclosingCube(); dimensions = m_region.Max - m_region.Min; } //Check to see if the dimensions of the box are greater than the minimum dimensions if (dimensions.X <= MIN_SIZE && dimensions.Y <= MIN_SIZE && dimensions.Z <= MIN_SIZE) { return; } Vector3 half = dimensions / 2.0f; Vector3 center = m_region.Min + half; //Create subdivided regions for each octant BoundingBox[] octant = new BoundingBox[8]; octant[0] = new BoundingBox(m_region.Min, center); octant[1] = new BoundingBox(new Vector3(center.X, m_region.Min.Y, m_region.Min.Z), new Vector3(m_region.Max.X, center.Y, center.Z)); octant[2] = new BoundingBox(new Vector3(center.X, m_region.Min.Y, center.Z), new Vector3(m_region.Max.X, center.Y, m_region.Max.Z)); octant[3] = new BoundingBox(new Vector3(m_region.Min.X, m_region.Min.Y, center.Z), new Vector3(center.X, center.Y, m_region.Max.Z)); octant[4] = new BoundingBox(new Vector3(m_region.Min.X, center.Y, m_region.Min.Z), new Vector3(center.X, m_region.Max.Y, center.Z)); octant[5] = new BoundingBox(new Vector3(center.X, center.Y, m_region.Min.Z), new Vector3(m_region.Max.X, m_region.Max.Y, center.Z)); octant[6] = new BoundingBox(center, m_region.Max); octant[7] = new BoundingBox(new Vector3(m_region.Min.X, center.Y, center.Z), new Vector3(center.X, m_region.Max.Y, m_region.Max.Z)); //This will contain all of our objects which fit within each respective octant. List[] octList = new List[8]; for (int i = 0; i < 8; i++) octList = new List(); //this list contains all of the objects which got moved down the tree and can be delisted from this node. List delist = new List(); foreach (Physical obj in m_objects) { if (obj.BoundingBox.Min != obj.BoundingBox.Max) { for (int a = 0; a < 8; a++) { if (octant[a].Contains(obj.BoundingBox) == ContainmentType.Contains) { octList[a].Add(obj); delist.Add(obj); break; } } } else if (obj.BoundingSphere.Radius != 0) { for (int a = 0; a < 8; a++) { if (octant[a].Contains(obj.BoundingSphere) == ContainmentType.Contains) { octList[a].Add(obj); delist.Add(obj); break; } } } } //delist every moved object from this node. foreach (Physical obj in delist) m_objects.Remove(obj); //Create child nodes where there are items contained in the bounding region for (int a = 0; a < 8; a++) { if (octList[a].Count != 0) { m_childNode[a] = CreateNode(octant[a], octList[a]); m_activeNodes |= (byte)(1 << a); m_childNode[a].BuildTree(); } } m_treeBuilt = true; m_treeReady = true; } private OctTree CreateNode(BoundingBox region, List objList) //complete & tested { if (objList.Count == 0) return null; OctTree ret = new OctTree(region, objList); ret._parent = this; return ret; } private OctTree CreateNode(BoundingBox region, Physical Item) { List objList = new List(1); //sacrifice potential CPU time for a smaller memory footprint objList.Add(Item); OctTree ret = new OctTree(region, objList); ret._parent = this; return ret; }  
      Updating a tree
      Let's imagine that our tree has a lot of moving objects in it. If any object moves, there is a good chance that the object has moved outside of its enclosing octant. How do we handle changes in object position while maintaining the integrity of our tree structure?
      Technique 1: Keep it super simple, trash & rebuild everything.
      Some implementations of an Octree will completely rebuild the entire tree every frame and discard the old one. This is super simple and it works, and if this is all you need, then prefer the simple technique. The general consensus is that the upfront CPU cost of rebuilding the tree every frame is much cheaper than running a brute force collision check, and programmer time is too valuable to be spent on an unnecessary optimization. For those of us who like challenges and to over-engineer things, the "trash & rebuild" technique comes with a few small problems:
      You're constantly allocating and deallocating memory each time you rebuild your tree. Allocating new memory comes at a small cost. If possible, you want to minimize the amount of memory being allocated and reallocated over time by reusing memory you've already got. Most of the tree is unchanging, so it's a waste of CPU time to rebuild the same branches over and over again. Technique 2: Keep the existing tree, update the changed branches
      I noticed that most branches of a tree don't need to be updated. They just contain stationary objects. Wouldn't it be nice if, instead of rebuilding the entire tree every frame, we just updated the parts of the tree which needed an update? This technique keeps the existing tree and updates only the branches which had an object which moved. It's a bit more complex to implement, but it's a lot more fun too, so let's really get into that!
      During my first attempt at this, I mistakenly thought that an object in a child node could only go up or down one traversal of the tree. This is wrong. If an object in a child node reaches the edge of that node, and that edge also happens to be an edge for the enclosing parent node, then that object needs to be inserted above its parent, and possibly up even further. So, the bottom line is that we don't know how far up an object needs to be pushed up the tree. Just as well, an object can move such that it can be neatly enclosed in a child node, or that child's child node. We don't know how far down the tree we can go.
      Fortunately, since we include a reference to each node's parent, we can easily solve this problem recursively with minimal computation! The general idea behind the update algorithm is to first let all objects in the tree update themselves. Some may move or change in size. We want to get a list of every object which moved, so the object update method should return to us a boolean value indicating if its bounding area changed. Once we've got a list of all of our moved objects, we want to start at our current node and try to traverse up the tree until we find a node which completely encloses the moved object (most of the time, the current node still encloses the object). If the object isn't completely enclosed by the current node, we keep moving it up to its next parent node. In the worst case, our root node will be guaranteed to contain the object.
      After we've moved our object as far up the tree as possible, we'll try to move it as far down the tree as we can. Most of the time, if we moved the object up, we won't be able to move it back down. But, if the object moved so that a child node of the current node could contain it, we have the chance to push it back down the tree. It's important to be able to move objects down the tree as well, or else all moving objects would eventually migrate to the top and we'd start getting some performance problems during collision detection routines.
      Branch Removal
      In some cases, an object will move out of a node and that node will no longer have any objects contained within it, nor have any children which contain objects. If this happens, we have an empty branch and we need to mark it as such and prune this dead branch off the tree.
      There is an interesting question hiding here: When do you want to prune the dead branches off a tree? Allocating new memory costs time, so if we're just going to reuse this same region in a few cycles, why not keep it around for a bit? How long can we keep it around before it becomes more expensive to maintain the dead branch? I decided to give each of my nodes a countdown timer which activates when the branch is dead. If an object moves into this nodes octant while the death timer is active, I double the lifespan and reset the death timer. This ensures that octants which are frequently used are hot and stick around, and nodes which are infrequently used are removed before they start to cost more than they're worth.
      A practical example of this usefulness would be apparent when you have a machine gun shooting a stream of bullets. Those bullets follow in close succession of each other, so it'd be a shame to immediately delete a node as soon as the first bullet leaves it, only to recreate it a fraction of a second later as the second bullet re-enters it. And if there's a lot of bullets, we can probably keep these octants around for a little while. If a child branch is empty and hasn't been used in a while, it's safe to prune it out of our tree.
      Anyways, let's look at the code which does all of this magic. First up, we have the Update() method. This is a method which is recursively called on all child trees. It moves all objects around, does some housekeeping work for the data structure, and then moves each moved object into its correct node (parent or child).
      public void Update(coreTime time) { if (m_treeBuilt == true && m_treeReady == true) { //Start a count down death timer for any leaf nodes which don't have objects or children. //when the timer reaches zero, we delete the leaf. If the node is reused before death, we double its lifespan. //this gives us a "frequency" usage score and lets us avoid allocating and deallocating memory unnecessarily if (m_objects.Count == 0) { if (HasChildren == false) { if (m_curLife == -1) m_curLife = m_maxLifespan; else if (m_curLife > 0) { m_curLife--; } } } else { if (m_curLife != -1) { if (m_maxLifespan <= 64) m_maxLifespan *= 2; m_curLife = -1; } } List<Physical> movedObjects = new List<Physical>(m_objects.Count); //go through and update every object in the current tree node foreach (Physical gameObj in m_objects) { //we should figure out if an object actually moved so that we know whether we need to update this node in the tree. if (gameObj.Update(time) == 1) { movedObjects.Add(gameObj); } } //prune any dead objects from the tree. int listSize = m_objects.Count; for (int a = 0; a < listSize; a++) { if (!m_objects[a].Alive) { if (movedObjects.Contains(m_objects[a])) movedObjects.Remove(m_objects[a]); m_objects.RemoveAt(a--); listSize--; } } //prune out any dead branches in the tree for (int flags = m_activeNodes, index = 0; flags > 0; flags >>= 1, index++) if ((flags & 1) == 1 && m_childNode[index].m_curLife == 0) { if (m_childNode[index].m_objects.Count > 0) { //throw new Exception("Tried to delete a used branch!"); m_childNode[index].m_curLife = -1; } else { m_childNode[index] = null; m_activeNodes ^= (byte)(1 << index); //remove the node from the active nodes flag list } } //recursively update any child nodes. for (int flags = m_activeNodes, index = 0; flags > 0; flags >>= 1, index++) { if ((flags & 1) == 1) { if(m_childNode!=null && m_childNode[index] != null) m_childNode[index].Update(time); } } //If an object moved, we can insert it into the parent and that will insert it into the correct tree node. //note that we have to do this last so that we don't accidentally update the same object more than once per frame. foreach (Physical movedObj in movedObjects) { OctTree current = this; //figure out how far up the tree we need to go to reinsert our moved object //we are either using a bounding rect or a bounding sphere //try to move the object into an enclosing parent node until we've got full containment if (movedObj.EnclosingBox.Max != movedObj.EnclosingBox.Min) { while (current.m_region.Contains(movedObj.EnclosingBox) != ContainmentType.Contains) if (current._parent != null) current = current._parent; else { break; //prevent infinite loops when we go out of bounds of the root node region } } else { ContainmentType ct = current.m_region.Contains(movedObj.EnclosingSphere); while (ct != ContainmentType.Contains)//we must be using a bounding sphere, so check for its containment. { if (current._parent != null) { current = current._parent; } else { //the root region cannot contain the object, so we need to completely rebuild the whole tree. //The rarity of this event is rare enough where we can afford to take all objects out of the existing tree and rebuild the entire thing. List<Physical> tmp = m_root.AllObjects(); m_root.UnloadContent(); Enqueue(tmp);//add to pending queue return; } ct = current.m_region.Contains(movedObj.EnclosingSphere); } } //now, remove the object from the current node and insert it into the current containing node. m_objects.Remove(movedObj); current.Insert(movedObj); //this will try to insert the object as deep into the tree as we can go. } //now that all objects have moved and they've been placed into their correct nodes in the octree, we can look for collisions. if (IsRoot == true) { //This will recursively gather up all collisions and create a list of them. //this is simply a matter of comparing all objects in the current root node with all objects in all child nodes. //note: we can assume that every collision will only be between objects which have moved. //note 2: An explosion can be centered on a point but grow in size over time. In this case, you'll have to override the update method for the explosion. List<IntersectionRecord> irList = GetIntersection(new List<Physical>()); foreach (IntersectionRecord ir in irList) { if (ir.PhysicalObject != null) ir.PhysicalObject.HandleIntersection(ir); if (ir.OtherPhysicalObject != null) ir.OtherPhysicalObject.HandleIntersection(ir); } } }//end if tree built else { if (m_pendingInsertion.Count > 0) { ProcessPendingItems(); Update(time); //try this again... } } } Note that we call an Insert() method for moved objects. The insertion of objects into the tree is very similar to the method used to build the initial tree. Insert() will try to push objects as far down the tree as possible. Notice that I also try to avoid creating new bounding areas if I can use an existing one from a child node.
      /// <summary> /// A tree has already been created, so we're going to try to insert an item into the tree without rebuilding the whole thing /// </summary> /// <typeparam name="T">A physical object</typeparam> /// <param name="Item">The physical object to insert into the tree</param> private bool Insert<T>(T Item) where T : Physical { /*if the current node is an empty leaf node, just insert and leave it.*/ //if (m_objects.Count == 0 && m_activeNodes == 0) if(AllTreeObjects.Count == 0) { m_objects.Add(Item); return true; } //Check to see if the dimensions of the box are greater than the minimum dimensions. //If we're at the smallest size, just insert the item here. We can't go any lower! Vector3 dimensions = m_region.Max - m_region.Min; if (dimensions.X <= MIN_SIZE && dimensions.Y <= MIN_SIZE && dimensions.Z <= MIN_SIZE) { m_objects.Add(Item); return true; } //The object won't fit into the current region, so it won't fit into any child regions. //therefore, try to push it up the tree. If we're at the root node, we need to resize the whole tree. if (m_region.Contains(Item.EnclosingSphere) != ContainmentType.Contains) { if (this._parent != null) return this._parent.Insert(Item); else return false; } //At this point, we at least know this region can contain the object but there are child nodes. Let's try to see if the object will fit //within a subregion of this region. Vector3 half = dimensions / 2.0f; Vector3 center = m_region.Min + half; //Find or create subdivided regions for each octant in the current region BoundingBox[] childOctant = new BoundingBox[8]; childOctant[0] = (m_childNode[0] != null) ? m_childNode[0].m_region : new BoundingBox(m_region.Min, center); childOctant[1] = (m_childNode[1] != null) ? m_childNode[1].m_region : new BoundingBox(new Vector3(center.X, m_region.Min.Y, m_region.Min.Z), new Vector3(m_region.Max.X, center.Y, center.Z)); childOctant[2] = (m_childNode[2] != null) ? m_childNode[2].m_region : new BoundingBox(new Vector3(center.X, m_region.Min.Y, center.Z), new Vector3(m_region.Max.X, center.Y, m_region.Max.Z)); childOctant[3] = (m_childNode[3] != null) ? m_childNode[3].m_region : new BoundingBox(new Vector3(m_region.Min.X, m_region.Min.Y, center.Z), new Vector3(center.X, center.Y, m_region.Max.Z)); childOctant[4] = (m_childNode[4] != null) ? m_childNode[4].m_region : new BoundingBox(new Vector3(m_region.Min.X, center.Y, m_region.Min.Z), new Vector3(center.X, m_region.Max.Y, center.Z)); childOctant[5] = (m_childNode[5] != null) ? m_childNode[5].m_region : new BoundingBox(new Vector3(center.X, center.Y, m_region.Min.Z), new Vector3(m_region.Max.X, m_region.Max.Y, center.Z)); childOctant[6] = (m_childNode[6] != null) ? m_childNode[6].m_region : new BoundingBox(center, m_region.Max); childOctant[7] = (m_childNode[7] != null) ? m_childNode[7].m_region : new BoundingBox(new Vector3(m_region.Min.X, center.Y, center.Z), new Vector3(center.X, m_region.Max.Y, m_region.Max.Z)); //First, is the item completely contained within the root bounding box? //note2: I shouldn't actually have to compensate for this. If an object is out of our predefined bounds, then we have a problem/error. // Wrong. Our initial bounding box for the terrain is constricting its height to the highest peak. Flying units will be above that. // Fix: I resized the enclosing box to 256x256x256. This should be sufficient. if (Item.EnclosingBox.Max != Item.EnclosingBox.Min && m_region.Contains(Item.EnclosingBox) == ContainmentType.Contains) { bool found = false; //we will try to place the object into a child node. If we can't fit it in a child node, then we insert it into the current node object list. for(int a=0;a<8;a++) { //is the object fully contained within a quadrant? if (childOctant[a].Contains(Item.EnclosingBox) == ContainmentType.Contains) { if (m_childNode[a] != null) { return m_childNode[a].Insert(Item); //Add the item into that tree and let the child tree figure out what to do with it } else { m_childNode[a] = CreateNode(childOctant[a], Item); //create a new tree node with the item m_activeNodes |= (byte)(1 << a); } found = true; } } //we couldn't fit the item into a smaller box, so we'll have to insert it in this region if (!found) { m_objects.Add(Item); return true; } } else if (Item.EnclosingSphere.Radius != 0 && m_region.Contains(Item.EnclosingSphere) == ContainmentType.Contains) { bool found = false; //we will try to place the object into a child node. If we can't fit it in a child node, then we insert it into the current node object list. for (int a = 0; a < 8; a++) { //is the object contained within a child quadrant? if (childOctant[a].Contains(Item.EnclosingSphere) == ContainmentType.Contains) { if (m_childNode[a] != null) { return m_childNode[a].Insert(Item); //Add the item into that tree and let the child tree figure out what to do with it } else { m_childNode[a] = CreateNode(childOctant[a], Item); //create a new tree node with the item m_activeNodes |= (byte)(1 << a); } found = true; } } //we couldn't fit the item into a smaller box, so we'll have to insert it in this region if (!found) { m_objects.Add(Item); return true; } } //either the item lies outside of the enclosed bounding box or it is intersecting it. Either way, we need to rebuild //the entire tree by enlarging the containing bounding box return false; }  
      Collision Detection
      Finally, our octree has been built and everything is as it should be. How do we perform collision detection against it? First, let's list out the different ways we want to look for collisions:
      Frustum intersections. We may have a frustum which intersects with a region of the world. We only want the objects which intersect with the given frustum. This is particularly useful for culling regions outside of the camera view space, and for figuring out what objects are within a mouse selection area. Ray intersections. We may want to shoot a directional ray from any given point and want to know either the nearest intersecting object or get a list of all objects which intersect that ray (like a rail gun). This is very useful for mouse picking. If the user clicks on the screen, we want to draw a ray into the world and figure out what they clicked on. Bounding Box intersections. We want to know which objects in the world are intersecting a given bounding box. This is most useful for "box" shaped game objects (houses, cars, etc). Bounding Sphere Intersections. We want to know which objects are intersecting with a given bounding sphere. Most objects will probably be using a bounding sphere for coarse collision detection since the mathematics is computationally the least expensive and somewhat easy. The main idea behind recursive collision detection processing for an octree is that you start at the root/current node and test for intersection with all objects in that node against the intersector. Then, you do a bounding box intersection test against all active child nodes with the intersector. If a child node fails this intersection test, you can completely ignore the rest of that child's tree. If a child node passes the intersection test, you recursively traverse down the tree and repeat. Each node should pass a list of intersection records up to its caller, which appends those intersections to its own list of intersections. When the recursion finishes, the original caller will get a list of every intersection for the given intersector. The beauty of this is that it takes very little code to implement and performance is very fast. In a lot of these collisions, we're probably going to be getting a lot of results. We're also going to want to have some way of responding to each collision, depending on what objects are colliding.
      For example, a player hero should pick up a floating bonus item (quad damage!), but a rocket shouldn't explode if it hits said bonus item. I created a new class to contain information about each intersection. This class contains references to the intersecting objects, the point of intersection, the normal at the point of intersection, etc. These intersection records become quite useful when you pass them to an object and tell them to handle it. For completeness and clarity, here is my intersection record class:
      public class IntersectionRecord { readonly Vector3 m_position, m_normal; readonly Ray m_ray; readonly Physical m_intersectedObject1, m_intersectedObject2; readonly double m_distance; public class Builder { public Vector3 Position, Normal; public Physical Object1, Object2; public Ray hitRay; public double Distance; public Builder() { Distance = double.MaxValue; } public Builder(IntersectionRecord copy) { Position = copy.m_position; Normal = copy.m_normal; Object1 = copy.m_intersectedObject1; Object2 = copy.m_intersectedObject2; hitRay = copy.m_ray; Distance = copy.m_distance; } public IntersectionRecord Build() { return new IntersectionRecord(Position, Normal, Object1, Object2, hitRay, Distance); } } #region Constructors IntersectionRecord(Vector3 pos, Vector3 normal, Physical obj1, Physical obj2, Ray r, double dist) { m_position = pos; m_normal = normal; m_intersectedObject1 = obj1; m_intersectedObject2 = obj2; m_ray = r; m_distance = dist; } #endregion #region Accessors /// <summary> /// This is the exact point in 3D space which has an intersection. /// </summary> public Vector3 Position { get { return m_position; } } /// <summary> /// This is the normal of the surface at the point of intersection /// </summary> public Vector3 Normal { get { return m_normal; } } /// <summary> /// This is the ray which caused the intersection /// </summary> public Ray Ray { get { return m_ray; } } /// <summary> /// This is the object which is being intersected /// </summary> public Physical PhysicalObject { get { return m_intersectedObject1; } } /// <summary> /// This is the other object being intersected (may be null, as in the case of a ray-object intersection) /// </summary> public Physical OtherPhysicalObject { get { return m_intersectedObject2; } } /// <summary> /// This is the distance from the ray to the intersection point. /// You'll usually want to use the nearest collision point if you get multiple intersections. /// </summary> public double Distance { get { return m_distance; } } #endregion #region Overrides public override string ToString() { return "Hit: " + m_intersectedObject1.ToString(); } public override int GetHashCode() { return base.GetHashCode(); } /// <summary> /// check the object identities between the two intersection records. If they match in either order, we have a duplicate. /// </summary> /// <param name="otherRecord">the other record to compare against</param> /// <returns>true if the records are an intersection for the same pair of objects, false otherwise.</returns> public override bool Equals(object otherRecord) { IntersectionRecord o = (IntersectionRecord)otherRecord; // //return (m_intersectedObject1 != null && m_intersectedObject2 != null && m_intersectedObject1.ID == m_intersectedObject2.ID); if (otherRecord == null) return false; if (o.m_intersectedObject1.ID == m_intersectedObject1.ID && o.m_intersectedObject2.ID == m_intersectedObject2.ID) return true; if (o.m_intersectedObject1.ID == m_intersectedObject2.ID && o.m_intersectedObject2.ID == m_intersectedObject1.ID) return true; return false; } #endregion }  
      Intersection with a Bounding Frustum
      /// <summary> /// Gives you a list of all intersection records which intersect or are contained within the given frustum area /// </summary> /// <param name="frustum">The containing frustum to check for intersection/containment with</param> /// <returns>A list of intersection records with collisions</returns> private List<IntersectionRecord> GetIntersection(BoundingFrustum frustum, PhysicalType type = PhysicalType.ALL) { if (!m_treeBuilt) return new List<IntersectionRecord>(); if (m_objects.Count == 0 && HasChildren == false) //terminator for any recursion return null; List<IntersectionRecord> ret = new List<IntersectionRecord>(); //test each object in the list for intersection foreach (Physical obj in m_objects) { //skip any objects which don't meet our type criteria if ((int)((int)type & (int)obj.Type) == 0) continue; //test for intersection IntersectionRecord ir = obj.Intersects(frustum); if (ir != null) ret.Add(ir); } //test each object in the list for intersection for (int a = 0; a < 8; a++) { if (m_childNode[a] != null && (frustum.Contains(m_childNode[a].m_region) == ContainmentType.Intersects || frustum.Contains(m_childNode[a].m_region) == ContainmentType.Contains)) { List<IntersectionRecord> hitList = m_childNode[a].GetIntersection(frustum, type); if (hitList != null) ret.AddRange(hitList); } } return ret; } The bounding frustum intersection list can be used to only render objects which are visible to the current camera view. I use a scene database to figure out how to render all objects in the game world. Here is a snippet of code from my rendering function which uses the bounding frustum of the active camera:
      /// /// This renders every active object in the scene database /// /// public int Render() { int triangles = 0; //Renders all visible objects by iterating through the oct tree recursively and testing for intersection //with the current camera view frustum foreach (IntersectionRecord ir in m_octTree.AllIntersections(m_cameras[m_activeCamera].Frustum)) { ir.PhysicalObject.SetDirectionalLight(m_globalLight[0].Direction, m_globalLight[0].Color); ir.PhysicalObject.View = m_cameras[m_activeCamera].View; ir.PhysicalObject.Projection = m_cameras[m_activeCamera].Projection; ir.PhysicalObject.UpdateLOD(m_cameras[m_activeCamera]); triangles += ir.PhysicalObject.Render(m_cameras[m_activeCamera]); } return triangles; }  
      Intersection with a Ray
      /// <summary> /// Gives you a list of intersection records for all objects which intersect with the given ray /// </summary> /// <param name="intersectRay">The ray to intersect objects against</param> /// <returns>A list of all intersections</returns> private List<IntersectionRecord> GetIntersection(Ray intersectRay, PhysicalType type = PhysicalType.ALL) { if (!m_treeBuilt) return new List<IntersectionRecord>(); if (m_objects.Count == 0 && HasChildren == false) //terminator for any recursion return null; List<IntersectionRecord> ret = new List<IntersectionRecord>(); //the ray is intersecting this region, so we have to check for intersection with all of our contained objects and child regions. //test each object in the list for intersection foreach (Physical obj in m_objects) { //skip any objects which don't meet our type criteria if ((int)((int)type & (int)obj.Type) == 0) continue; IntersectionRecord ir = obj.Intersects(intersectRay); if (ir != null) ret.Add(ir); } // test each child octant for intersection for (int a = 0; a < 8; a++) { if (m_childNode[a] != null && m_childNode[a].m_region.Intersects(intersectRay) != null) { m_lineColor = Color.Red; List<IntersectionRecord> hits = m_childNode[a].GetIntersection(intersectRay, type); if (hits != null && hits.Count > 0) { ret.AddRange(hits); } } } return ret; }  
      Intersection with a list of objects
      This is a particularly useful recursive method for determining if a list of objects in the current node intersects with any objects in any child nodes (See: Update() method for usage). It's the method which will be used most frequently, so it's good to get this right and efficient. What we want to do is start at the root node of the tree. We compare all objects in the current node against all other objects in the current node for collision. We gather up any of those collisions as intersection records and insert them into a list. We then pass our list of tested objects down to our child nodes. The child nodes will then test their objects against themselves, then against the objects we passed down to them. The child nodes will capture any collisions in a list, and return that list to its parent. The parent then takes the collision list received from its child nodes and appends it to its own list of collisions, finally returning it to its caller.
      If you count out the number of collision tests in the illustration above, you can see that we conducted 29 hit tests and received 4 hits. This is much better than [11*11 = 121] hit tests.
      private List<IntersectionRecord> GetIntersection(List<Physical> parentObjs, PhysicalType type = PhysicalType.ALL) { List<IntersectionRecord> intersections = new List<IntersectionRecord>(); //assume all parent objects have already been processed for collisions against each other. //check all parent objects against all objects in our local node foreach (Physical pObj in parentObjs) { foreach (Physical lObj in m_objects) { //We let the two objects check for collision against each other. They can figure out how to do the coarse and granular checks. //all we're concerned about is whether or not a collision actually happened. IntersectionRecord ir = pObj.Intersects(lObj); if (ir != null) { //ir.m_treeNode = this; intersections.Add(ir); } } } //now, check all our local objects against all other local objects in the node if (m_objects != null && m_objects.Count > 1) { #region self-congratulation /* * This is a rather brilliant section of code. Normally, you'd just have two foreach loops, like so: * foreach(Physical lObj1 in m_objects) * { * foreach(Physical lObj2 in m_objects) * { * //intersection check code * } * } * * The problem is that this runs in O(N*N) time and that we're checking for collisions with objects which have already been checked. * Imagine you have a set of four items: {1,2,3,4} * You'd first check: {1} vs {1,2,3,4} * Next, you'd check {2} vs {1,2,3,4} * but we already checked {1} vs {2}, so it's a waste to check {2} vs. {1}. What if we could skip this check by removing {1}? * We'd have a total of 4+3+2+1 collision checks, which equates to O(N(N+1)/2) time. If N is 10, we are already doing half as many collision checks as necessary. * Now, we can't just remove an item at the end of the 2nd for loop since that would break the iterator in the first foreach loop, so we'd have to use a * regular for(int i=0;i<size;i++) style loop for the first loop and reduce size each iteration. This works...but look at the for loop: we're allocating memory for * two additional variables: i and size. What if we could figure out some way to eliminate those variables? * So, who says that we have to start from the front of a list? We can start from the back end and still get the same end results. With this in mind, * we can completely get rid of a for loop and use a while loop which has a conditional on the capacity of a temporary list being greater than 0. * since we can poll the list capacity for free, we can use the capacity as an indexer into the list items. Now we don't have to increment an indexer either! * The result is below. */ #endregion List<Physical> tmp = new List<Physical>(m_objects.Count); tmp.AddRange(m_objects); while (tmp.Count > 0) { foreach (Physical lObj2 in tmp) { if (tmp[tmp.Count - 1] == lObj2 || (tmp[tmp.Count - 1].IsStationary && lObj2.IsStationary)) continue; IntersectionRecord ir = tmp[tmp.Count - 1].Intersects(lObj2); if (ir != null) { //ir.m_treeNode = this; intersections.Add(ir); } } //remove this object from the temp list so that we can run in O(N(N+1)/2) time instead of O(N*N) tmp.RemoveAt(tmp.Count-1); } } //now, merge our local objects list with the parent objects list, then pass it down to all children. foreach (Physical lObj in m_objects) if (lObj.IsStationary == false) parentObjs.Add(lObj); //parentObjs.AddRange(m_objects); //each child node will give us a list of intersection records, which we then merge with our own intersection records. for (int flags = m_activeNodes, index = 0; flags > 0; flags >>= 1, index++) { if ((flags & 1) == 1) { if(m_childNode != null && m_childNode[index] != null) intersections.AddRange(m_childNode[index].GetIntersection(parentObjs, type)); } } return intersections; }  
      Screenshot Demos
      This is a view of the game world from a distance showing the outlines for each bounding volume for the octree. This view shows a bunch of successive projectiles moving through the game world with the frequently-used nodes being preserved instead of deleted.  
      Complete Code Sample
      I've attached a complete code sample of the octree class, the intersection record class, and my generic physical object class. I don't guarantee that they're all bug-free since it's all a work in progress and hasn't been rigorously tested yet.
    • By GameDev.net
      Does your code use one of the most popular graphics or compute APIs? Here is a map of Intel® processor series to each graphics generation to add to your dev docs.
      Developer Documents for Intel® Processor Graphics
      Intel® processor graphics provide the graphics, compute, media, and display for many of our processors including the 6th gen Intel® Core™ processors. Does your code use one of the popular graphics or compute APIs? Do you want a deeper understanding of our graphics hardware architecture? In the table, you’ll find the right documents to help you write and tune your software so it runs great on Intel processor graphics.
      If you’re developing compute applications, the compute architecture guides give foundational reading and the OpenCL™ optimization guides show you how to optimize. If your code uses the graphics APIs, read the graphics dev guides or programmers reference manuals.
      Read more
    • By GameDev.net
      Arizona Sunshine* found success in the VR space after following Intel® Guidelines for Immersive VR Experiences. See how they became the fastest-selling non-bundled virtual Reality title to date.

      With a dazzling launch in early 2017 that saw Arizona Sunshine* become the fastest-selling non-bundled virtual reality title to date, and instant recognition as the 2016 “Best Vive Game” according to UploadVR, the zombie-killer game is not just another VR shooter. Combining immersive game play with intriguing multi-player options, this game takes full advantage of VR capabilities to promote playability in both outdoor and underground environments.
      Through its association with Netherlands-based Vertigo Games and nearby indie developer Jaywalkers Interactive, Intel helped add sizzle to Arizona Sunshine by fine-tuning the CPU capabilities to provide end-to-end VR realism. The power of a strong CPU performance becomes apparent with every jaw-dropping zombie horde attack. From the resources available when a player chooses and loads a weapon, to the responsiveness of the surrounding eerie world, the immersive qualities of the VR interface make it easy to forget that it’s just a game.
      Read more
    • By GameDev.net
      Unity* 3.0 provides tools and settings that simplify game creation. This paper analyzes and troubleshoots performance with Unity* on Intel® graphics processors, and provides users with performance considerations for their games.
      Unity provides a number of tools and settings to help make games perform smoothly. For this project, we chose ones we thought could prove to be troublesome and analyzed how they affected game performance on Intel® graphics processors.
      We put ourselves in the shoes of a game developer learning how to use Unity. We wanted to stumble into performance pitfalls and then determine how to work through issues with Unity’s built-in performance mechanisms. One of Unity’s strengths is the ability to create content quickly, but when considering performance, especially on mobile and tablet devices, the developer needs to slow down and plan out how to utilize the built in performance mechanisms. This paper prepares new and existing Unity users with performance considerations when building your levels/games, and offers new ways to build.
      Read more

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!