  03/15/17 12:25 PM

    GPU Performance for Game Artists

    Graphics and GPU Programming


    Performance is everybody's responsibility, no matter what your role. When it comes to the GPU, 3D programmers have a lot of control over performance; we can optimize shaders, trade image quality for performance, use smarter rendering techniques... we have plenty of tricks up our sleeves. But there's one thing we don't have direct control over, and that's the game's art. We rely on artists to produce assets that not only look good but are also efficient to render. For artists, a little knowledge of what goes on under the hood can make a big impact on a game's framerate.

    If you're an artist and want to understand why things like draw calls, LODs, and mipmaps are important for performance, read on! To appreciate the impact that your art has on the game's performance, you need to know how a mesh makes its way from your modelling package onto the screen in the game. That means having an understanding of the GPU - the chip that powers your graphics card and makes real-time 3D rendering possible in the first place. Armed with that knowledge, we'll look at some common art-related performance issues, why they're a problem, and what you can do about them. Things are quickly going to get pretty technical, but if anything is unclear I'll be more than happy to answer questions in the comments section.

    Before we start, I should point out that I'm going to deliberately simplify a lot of things for the sake of brevity and clarity. In many cases I'm generalizing, describing only the typical case, or just straight up leaving things out. In particular, for the sake of simplicity the idealized version of the GPU I describe below more closely matches that of the previous (DX9-era) generation. However when it comes to performance, all of the considerations below still apply to the latest PC & console hardware (although not necessarily all mobile GPUs). Once you understand everything described here, it will be much easier to get to grips with the variations and complexities you'll encounter later, if and when you start to dig deeper.

    Part 1: The rendering pipeline from 10,000 feet

    For a mesh to be displayed on the screen, it must pass through the GPU to be processed and rendered. Conceptually, this path is very simple: the mesh is loaded, vertices are grouped together as triangles, the triangles are converted into pixels, each pixel is given a colour, and that's the final image. Let's look a little closer at what happens at each stage.

    After you export a mesh from your DCC tool of choice (Digital Content Creation - Maya, Max, etc.), the geometry is typically loaded into the game engine in two pieces: a Vertex Buffer (VB) that contains a list of the mesh's vertices and their associated properties (position, UV coordinates, normal, color etc.), and an Index Buffer (IB) that lists which vertices in the VB are connected to form triangles. Along with these geometry buffers, the mesh will also have been assigned a material to determine what it looks like and how it behaves under different lighting conditions. To the GPU this material takes the form of custom-written shaders - programs that determine how the vertices are processed, and what colour the resulting pixels will be. When choosing the material for the mesh, you will have set various material parameters (eg. setting a base color value or picking a texture for various maps like albedo, roughness, normal etc.) - these are passed to the shader programs as inputs.

    The mesh and material data get processed by various stages of the GPU pipeline in order to produce pixels in the final render target (an image to which the GPU writes). That render target can then be used as a texture in subsequent shader programs and/or displayed on screen as the final image for the frame. For the purposes of this article, here are the important parts of the GPU pipeline from top to bottom:
    • Input Assembly. The GPU reads the vertex and index buffers from memory, determines how the vertices are connected to form triangles, and feeds the rest of the pipeline.
    • Vertex Shading. The vertex shader gets executed once for every vertex in the mesh, running on a single vertex at a time. Its main purpose is to transform the vertex, taking its position and using the current camera and viewport settings to calculate where it will end up on the screen.
    • Rasterization. Once the vertex shader has been run on each vertex of a triangle and the GPU knows where it will appear on screen, the triangle is rasterized - converted into a collection of individual pixels. Per-vertex values - UV coordinates, vertex color, normal, etc. - are interpolated across the triangle's pixels. So if one vertex of a triangle has a black vertex color and another has white, a pixel rasterized in the middle of the two will get the interpolated vertex color grey.
    • Pixel Shading. Each rasterized pixel is then run through the pixel shader (although technically at this stage it's not yet a pixel but a 'fragment', which is why you'll see the pixel shader sometimes called a fragment shader). This gives the pixel a color by combining material properties, textures, lights, and other parameters in the programmed way to get a particular look. Since there are so many pixels (a 1080p render target has over two million) and each one needs to be shaded at least once, the pixel shader is usually where the GPU spends a lot of its time.
    • Render Target Output. Finally the pixel is written to the render target - but not before undergoing some tests to make sure it's valid. For example in normal rendering you want closer objects to appear in front of farther objects; the depth test can reject pixels that are further away than the pixel already in the render target. But if the pixel passes all the tests (depth, alpha, stencil etc.), it gets written to the render target in memory.
    There's much more to it, but that's the basic flow: the vertex shader is executed on each vertex in the mesh, each 3-vertex triangle is rasterized into pixels, the pixel shader is executed on each rasterized pixel, and the resulting colors are written to a render target.

    Under the hood, the shader programs that represent the material are written in a shader programming language such as HLSL. These shaders run on the GPU in much the same way that regular programs run on the CPU - taking in data, running a bunch of simple instructions to change the data, and outputting the result. But while CPU programs are generalized to work on any type of data, shader programs are specifically designed to work on vertices and pixels. These programs are written to give the rendered object the look of the desired material - plastic, metal, velvet, leather, etc.

    To give you a concrete example, here's a simple pixel shader that does Lambertian lighting (ie. simple diffuse-only, no specular highlights) with a material color and a texture. As shaders go it's one of the most basic, but you don't need to understand it - it just helps to see what shaders can look like in general.

```hlsl
float3 MaterialColor;
Texture2D MaterialTexture;
SamplerState TexSampler;

float3 LightDirection;
float3 LightColor;

float4 MyPixelShader(float2 vUV : TEXCOORD0, float3 vNorm : NORMAL0) : SV_Target
{
    float3 vertexNormal = normalize(vNorm);
    float3 lighting = LightColor * dot(vertexNormal, LightDirection);
    float3 material = MaterialColor * MaterialTexture.Sample(TexSampler, vUV).rgb;

    float3 color = material * lighting;
    float alpha = 1;
    return float4(color, alpha);
}
```

    A simple pixel shader that does basic lighting. The inputs at the top like MaterialTexture and LightColor are filled in by the CPU, while vUV and vNorm are both vertex properties that were interpolated across the triangle during rasterization.
    And the generated shader instructions:

```
dp3 r0.x, v1.xyzx, v1.xyzx
rsq r0.x, r0.x
mul r0.xyz, r0.xxxx, v1.xyzx
dp3 r0.x, r0.xyzx, cb0[1].xyzx
mul r0.xyz, r0.xxxx, cb0[2].xyzx
sample_indexable(texture2d)(float,float,float,float) r1.xyz, v0.xyxx, t0.xyzw, s0
mul r1.xyz, r1.xyzx, cb0[0].xyzx
mul o0.xyz, r0.xyzx, r1.xyzx
mov o0.w, l(1.000000)
ret
```

    The shader compiler takes the above program and generates these instructions, which are run on the GPU; a longer program produces more instructions, which means more work for the GPU to do.

    As an aside, you might notice how isolated the shader steps are - each shader works on a single vertex or pixel without needing to know anything about the surrounding vertices/pixels. This is intentional and allows the GPU to process huge numbers of independent vertices and pixels in parallel, which is part of what makes GPUs so fast at doing graphics work compared to CPUs.

    We'll return to the pipeline shortly to see where things might slow down, but first we need to back up a bit and look at how the mesh and material got to the GPU in the first place. This is also where we meet our first performance hurdle - the draw call.

    The CPU and Draw Calls

    The GPU cannot work alone; it relies on the game code running on the machine's main processor - the CPU - to tell it what to render and how. The CPU and GPU are (usually) separate chips, running independently and in parallel. To hit our target frame rate - most commonly 30 frames per second - both the CPU and GPU have to do all the work to produce a single frame within the time allowed (at 30fps that's just 33 milliseconds per frame).
    To achieve this, frames are often pipelined: the CPU will take the whole frame to do its work (process AI, physics, input, animation etc.) and then send instructions to the GPU at the end of the frame so it can get to work on the next frame. This gives each processor a full 33ms to do its work, at the expense of introducing a frame's worth of latency (delay). This may be an issue for extremely time-sensitive twitchy games like first person shooters - the Call of Duty series for example runs at 60fps to reduce the latency between player input and rendering - but in general the extra frame is not noticeable to the player.

    Every 33ms the final render target is copied and displayed on the screen at VSync - the interval during which the monitor looks for a new frame to display. But if the GPU takes longer than 33ms to finish rendering the frame, it will miss this window of opportunity and the monitor won't have any new frame to display. That results in either screen tearing or stuttering and an uneven framerate that we really want to avoid. We also get the same result if the CPU takes too long - it has a knock-on effect, since the GPU doesn't get commands quickly enough to do its job in the time allowed. In short, a solid framerate relies on both the CPU and GPU performing well.
    missed-vsync.png Here the CPU takes too long to produce rendering commands for the second frame, so the GPU starts rendering late and thus misses VSync.
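As a quick illustration of the frame-budget arithmetic above, here's a minimal sketch (the function name is just for illustration). Because frames are pipelined, the CPU and GPU each get the full budget, but both must fit their work inside it:

```python
# Per-frame time budget at a given target frame rate.
def frame_budget_ms(fps):
    """Milliseconds available to produce one frame at the given frame rate."""
    return 1000.0 / fps

print(round(frame_budget_ms(30), 1))  # 33.3 ms per frame at 30fps
print(round(frame_budget_ms(60), 1))  # 16.7 ms per frame at 60fps
```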
    To display a mesh, the CPU issues a draw call, which is simply a series of commands that tells the GPU what to draw and how to draw it. As the draw call goes through the GPU pipeline, it uses the various configurable settings specified in the draw call - mostly determined by the mesh's material and its parameters - to decide how the mesh is rendered. These settings, called GPU state, affect all aspects of rendering, and consist of everything the GPU needs to know in order to render an object. Most significantly for us, GPU state includes the current vertex/index buffers, the current vertex/pixel shader programs, and all the shader inputs (eg. MaterialTexture or LightColor in the above shader code example).

    This means that to change a piece of GPU state (for example changing a texture or switching shaders), a new draw call must be issued. This matters because these draw calls are not free for the CPU. It costs a certain amount of time to set up the desired GPU state changes and then issue the draw call. Beyond whatever work the game engine needs to do for each call, extra error checking and bookkeeping cost is introduced by the graphics driver, an intermediate layer of code written by the GPU vendor (NVIDIA, AMD etc.) that translates the draw call into low-level hardware instructions. Too many draw calls can put too much of a burden on the CPU and cause serious performance problems.

    Due to this overhead, we generally set an upper limit to the number of draw calls that are acceptable per frame. If this limit is exceeded during gameplay testing, steps must be taken such as reducing the number of objects, reducing draw distance, etc. Console games will typically try to keep draw calls in the 2000-3000 range (eg. on Far Cry Primal we tried to keep it below 2500 per frame). That might sound like a lot, but it also includes any special rendering techniques that might be employed - cascaded shadows for example can easily double the number of draw calls in a frame.
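To see why a draw call budget exists, here's a back-of-the-envelope sketch. The per-call CPU cost is a made-up illustrative figure - real costs vary enormously by engine, driver, and API, so profile your own game for real numbers:

```python
# Hypothetical CPU cost of issuing draw calls. PER_CALL_MS is invented
# purely for illustration.
PER_CALL_MS = 0.01       # assumed CPU time per draw call, in milliseconds

def draw_call_cpu_ms(num_calls, per_call_ms=PER_CALL_MS):
    """Total CPU time spent just issuing draw calls."""
    return num_calls * per_call_ms

# At the ~2500 calls/frame budget mentioned above:
print(draw_call_cpu_ms(2500))  # 25.0 ms - a big slice of a 33 ms frame
```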
    As mentioned above, GPU state can only be changed by issuing a new draw call. This means that although you may have created a single mesh in your modelling package, if one half of the mesh uses one texture for the albedo map and the other half uses a different texture, it will be rendered as two separate draw calls. The same goes if the mesh is made up of multiple materials; different shaders need to be set, so multiple draw calls must be issued. In practice, a very common source of state change - and therefore extra draw calls - is switching texture maps. Typically the whole mesh will use the same material (and therefore the same shaders), but different parts of the mesh will use different sets of albedo/normal/roughness maps.

    With a scene of hundreds or even thousands of objects, using many draw calls for each object will cost a considerable amount of CPU time and so will have a noticeable impact on the framerate of the game. To avoid this, a common solution is to combine all the different texture maps used on a mesh into a single big texture, often called an atlas. The UVs of the mesh are then adjusted to look up the right part of the atlas, and the entire mesh (or even multiple meshes) can be rendered in a single draw call. Care must be taken when constructing the atlas so that adjacent textures don't bleed into each other at lower mips, but these problems are relatively minor compared to the gains that can be had in terms of performance.
    atlas-infiltrator-signs.jpg A texture atlas from Unreal Engine's Infiltrator demo
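The UV adjustment involved in atlasing can be sketched as follows: a mesh's original [0,1] UVs are squeezed into the sub-rectangle of the atlas that its texture now occupies. The function name and rectangle layout are illustrative, not from any particular engine:

```python
# Remap a UV coordinate from a texture's own [0,1] space into its
# sub-rectangle of an atlas. atlas_rect = (u_offset, v_offset, width, height),
# all in normalized atlas coordinates.
def remap_uv_to_atlas(uv, atlas_rect):
    u_off, v_off, w, h = atlas_rect
    u, v = uv
    return (u_off + u * w, v_off + v * h)

# A texture packed into the top-left quarter of the atlas:
print(remap_uv_to_atlas((0.5, 0.5), (0.0, 0.0, 0.5, 0.5)))  # (0.25, 0.25)
```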
    Many engines also support instancing, also known as batching or clustering. This is the ability to use a single draw call to render multiple objects that are mostly identical in terms of shaders and state, and only differ in a restricted set of ways (typically their position and rotation in the world). The engine will usually recognize when multiple identical objects can be rendered using instancing, so it's always preferable to use the same object multiple times in a scene when possible, instead of multiple different objects that will need to be rendered with separate draw calls.

    Another common technique for reducing draw calls is manually merging many different objects that share the same material into a single mesh. This can be effective, but care must be taken to avoid excessive merging, which can actually worsen performance by increasing the amount of work for the GPU.

    Before any draw call gets issued, the engine's visibility system will determine whether or not the object will even appear on screen. If not, it's very cheap to just ignore the object at this early stage and not pay for any draw call or GPU work (also known as visibility culling). This is usually done by checking if the object's bounding volume is visible from the camera's point of view, and that it is not completely blocked from view (occluded) by any other objects.

    However, when multiple meshes are merged into a single object, their individual bounding volumes must be combined into a single large volume that is big enough to enclose every mesh. This increases the likelihood that the visibility system will see some part of the volume, and so will consider the entire collection visible. That means it becomes a draw call, and so the vertex shader must be executed on every vertex in the object - even if very few of those vertices actually appear on the screen. This can lead to a lot of GPU time being wasted because the vertices end up not contributing anything to the final image. For these reasons, mesh merging is most effective when done on groups of small objects that are close to each other, as they will probably be on-screen at the same time anyway.
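The bounding-volume problem with merging can be seen with a couple of axis-aligned bounding boxes. This is a minimal sketch - real visibility systems use more sophisticated volumes and tests:

```python
# Merging meshes means merging their bounding boxes too. Two small, distant
# boxes become one huge box that is far more likely to intersect the view.
# Boxes are ((min_x, min_y, min_z), (max_x, max_y, max_z)).
def merge_aabbs(aabbs):
    mins = tuple(min(box[0][i] for box in aabbs) for i in range(3))
    maxs = tuple(max(box[1][i] for box in aabbs) for i in range(3))
    return (mins, maxs)

a = ((0, 0, 0), (1, 1, 1))
b = ((100, 0, 0), (101, 1, 1))   # a long way from a
print(merge_aabbs([a, b]))       # ((0, 0, 0), (101, 1, 1)) - spans the whole gap
```

If any corner of that giant combined box is visible, every vertex of every merged mesh gets submitted to the vertex shader.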
    xcom2-mesh-merging.png A frame from XCOM 2 as captured with RenderDoc. The wireframe (bottom) shows in grey all the extra geometry submitted to the GPU that is outside the view of the in-game camera.
    As an illustrative example, take the above capture of XCOM 2, one of my favourite games of the last couple of years. The wireframe shows the entire scene as submitted to the GPU by the engine, with the black area in the middle being the geometry that's actually visible to the game camera. All the surrounding geometry in grey is not visible and will be culled after the vertex shader is executed - all wasted GPU time. In particular, note the highlighted red geometry, which is a series of bush meshes, combined and rendered in just a few draw calls. Since the visibility system determined that at least some of the bushes are visible on the screen, they are all rendered and so must all have their vertex shader executed before determining which can be culled... which turns out to be most of them.

    Please note this isn't an indictment of XCOM 2 in particular - I just happened to be playing it while writing this article! Every game has this problem, and it's a constant battle to balance the CPU cost of doing more accurate visibility tests, the GPU cost of culling the invisible geometry, and the CPU cost of having more draw calls.

    Things are changing when it comes to the cost of draw calls, however. As mentioned above, a significant reason for their expense is the overhead of the driver doing translation and error checking. This has long been the case, but the most modern graphics APIs (eg. Direct3D 12 and Vulkan) have been restructured to avoid most of this overhead. While this does introduce extra complexity to the game's rendering engine, it can also result in cheaper draw calls, allowing us to render many more objects than previously possible. Some engines (most notably the latest version used by Assassin's Creed) have even gone in a radically different direction, using the capabilities of the latest GPUs to drive rendering and effectively doing away with draw calls altogether.
The performance impact of having too many draw calls is mostly on the CPU; pretty much all other performance issues related to art assets are on the GPU. We'll now look at what a bottleneck is, where they can happen, and what we can do about them.

    Part 2: Common GPU bottlenecks

    The very first step in optimization is to identify the current bottleneck so you can take steps to reduce or eliminate it. A bottleneck refers to the section of the pipeline that is slowing everything else down. In the above case where too many draw calls are costing too much, the CPU is the bottleneck. Even if we performed other optimizations that made the GPU faster, it wouldn't matter to the framerate because the CPU is still running too slowly to produce a frame in the required amount of time.
    bottlenecks-ideal.png 4 draw calls going through the pipeline, each being the rendering of a full mesh containing many triangles. The stages overlap because as soon as one piece of work is finished it can be immediately passed to the next stage (eg. when three vertices are processed by the vertex shader then the triangle can proceed to be rasterized).
    You can think of the GPU pipeline as an assembly line. As each stage finishes with its data, it forwards the results to the following stage and proceeds with the next piece of work. Ideally every stage is busy working all the time, and the hardware is being utilized fully and efficiently as represented in the above image - the vertex shader is constantly processing vertices, the rasterizer is constantly rasterizing pixels, and so on. But consider what happens if one stage takes much longer than the others:
    What happens here is that an expensive vertex shader can't feed the following stages fast enough, and so becomes the bottleneck. If you had a draw call that behaved like this, making the pixel shader faster is not going to make much of a difference to the time it takes for the entire draw call to be rendered. The only way to make things faster is to reduce the time spent in the vertex shader. How we do that depends on what in the vertex shader stage is actually causing the bottleneck. You should keep in mind that there will almost always be a bottleneck of some kind - if you eliminate one, another will just take its place. The trick is knowing when you can do something about it, and when you have to live with it because that's just what it costs to render what you want to render. When you optimize, you're really trying to get rid of unnecessary bottlenecks. But how do you identify what the bottleneck is?


    Profiling tools are absolutely essential for figuring out where all the GPU's time is being spent, and good ones will point you at exactly what you need to change in order for things to go faster. They do this in a variety of ways - some explicitly show a list of bottlenecks, others let you run 'experiments' to see what happens (eg. "how does my draw time change if all the textures are tiny?", which can tell you if you're bound by memory bandwidth or cache usage). Unfortunately this is where things get a bit hand-wavy, because some of the best performance tools are console-only and therefore under NDA. If you're developing for Xbox or Playstation, bug your friendly neighbourhood graphics programmer to show you these tools. We love it when artists get involved in performance, and will be happy to answer questions and even host tutorials on how to use the tools effectively.
    unity-gpu-profiler.png Unity's basic built-in GPU profiler
    The PC already has some pretty good (albeit hardware-specific) profiling tools, which you can get directly from the GPU vendors, such as NVIDIA's Nsight, AMD's GPU PerfStudio, and Intel's GPA. Then there's RenderDoc, which is currently the best tool for graphics debugging on PC but doesn't have any advanced profiling features. Microsoft is also starting to release its awesome Xbox profiling tool PIX for Windows, albeit only for D3D12 applications. Assuming they also plan to provide the same bottleneck analysis tools as the Xbox version (tricky with the wide variety of hardware out there), it should be a huge asset to PC developers going forward.

    These tools can give you more information about the performance of your art than you will ever need. They can also give you a lot of insight into how a frame is put together in your engine, as well as being awesome debugging tools for when things don't look how they should. Being able to use them is important, as artists need to be responsible for the performance of their art. But you shouldn't be expected to figure it all out on your own - any good engine should provide its own custom tools for analyzing performance, ideally providing metrics and guidelines to help determine if your art assets are within budget. If you want to be more involved with performance but feel you don't have the necessary tools, talk to your programming team. Chances are the tools already exist - and if they don't, they should be created!

    Now that you know how GPUs work and what a bottleneck is, we can finally get to the good stuff. Let's dig into the most common real-world bottlenecks that can show up in the pipeline, how they happen, and what can be done about them.

    Shader instructions

    Since most of the GPU's work is done with shaders, they're often the source of many of the bottlenecks you'll see. When a bottleneck is identified as shader instructions (sometimes referred to as ALUs, from the Arithmetic Logic Units that actually do the calculations), it's simply a way of saying the vertex or pixel shader is doing a lot of work, and the rest of the pipeline is waiting for that work to finish. Often the vertex or pixel shader program itself is just too complex, containing many instructions and taking a long time to execute. Or maybe the vertex shader is reasonable, but the mesh you're rendering has too many vertices, which adds up to a lot of time spent executing the vertex shader. Or the draw call covers a large area of the screen touching many pixels, and so spends a lot of time in the pixel shader.

    Unsurprisingly, the best way to optimize a shader instruction bottleneck is to execute fewer instructions! For pixel shaders that means choosing a simpler material with fewer features, to reduce the number of instructions executed per pixel. For vertex shaders it means simplifying your mesh to reduce the number of vertices that need to be processed, as well as being sure to use LODs (Levels Of Detail - simplified versions of your mesh for use when the object is far away and small on the screen).

    Sometimes, however, shader instruction bottlenecks are instead just an indication of problems in some other area. Issues such as too much overdraw, a misbehaving LOD system, and many others can cause the GPU to do a lot more work than necessary. These problems can be either on the engine side or the content side; careful profiling, examination, and experience will help you figure out what's really going on.

    One of the most common of these issues - overdraw - is when the same pixel on the screen needs to be shaded multiple times, because it's touched by multiple draw calls. Overdraw is a problem because it decreases the overall time the GPU has to spend on rendering. If every pixel on the screen has to be shaded twice, the GPU can only spend half the amount of time on each pixel and still maintain the same framerate.
    pix-overdraw.png A frame capture from PIX with the corresponding overdraw visualization mode
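The cost of overdraw is straightforward arithmetic - shading work grows linearly with the average overdraw, so the per-pixel time budget shrinks by the same factor:

```python
# Total pixel-shader invocations per frame at a given average overdraw.
def shaded_pixels(width, height, avg_overdraw):
    return width * height * avg_overdraw

once  = shaded_pixels(1920, 1080, 1)   # 2,073,600 - each 1080p pixel shaded once
twice = shaded_pixels(1920, 1080, 2)   # double the work...
print(twice // once)  # 2 - ...so half the time budget per pixel
```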
    Sometimes overdraw is unavoidable, such as when rendering translucent objects like particles or glass-like materials; the background object is visible through the foreground, so both need to be rendered. But for opaque objects, overdraw is completely unnecessary, because the pixel shown in the buffer at the end of rendering is the only one that actually needs to be processed. In this case, every overdrawn pixel is just wasted GPU time.

    The GPU takes steps to reduce overdraw in opaque objects. The early depth test (which happens before the pixel shader - see the initial pipeline diagram) will skip pixel shading if it determines that the pixel will be hidden by another object. It does that by comparing the pixel being shaded to the depth buffer - a render target where the GPU stores the entire frame's depth so that objects occlude each other properly. But for the early depth test to be effective, the other object must have already been rendered so that it is present in the depth buffer. That means the rendering order of objects is very important.

    Ideally every scene would be rendered front-to-back (ie. objects closest to the camera first), so that only the foreground pixels get shaded and the rest get killed by the early depth test, eliminating overdraw entirely. But in the real world that's not always possible, because you can't reorder the triangles inside a draw call during rendering. Complex meshes can occlude themselves multiple times, or mesh merging can result in many overlapping objects being rendered in the "wrong" order, causing overdraw. There's no easy answer for avoiding these cases, and in the latter case it's just another thing to take into consideration when deciding whether or not to merge meshes.

    To help early depth testing, some games do a partial depth prepass. This is a preliminary pass where certain large objects that are known to be effective occluders (large buildings, terrain, the main character etc.) are rendered with a simple shader that only outputs to the depth buffer, which is relatively fast as it avoids doing any pixel shader work such as lighting or texturing. This 'primes' the depth buffer and increases the amount of pixel shader work that can be skipped during the full rendering pass later in the frame. The drawback is that rendering the occluding objects twice (once in the depth-only pass and once in the main pass) increases the number of draw calls, plus there's always a chance that the time it takes to render the depth pass itself is more than the time it saves through increased early depth test efficiency. Only profiling in a variety of cases can determine whether or not it's worth it for any given scene.
    particle-overdraw-p2.png Particle overdraw visualization of an explosion in Prototype 2
    One place where overdraw is a particular concern is particle rendering, given that particles are transparent and often overlap a lot. Artists working on particle effects should always have overdraw in mind when producing effects. A dense cloud effect can be produced by emitting lots of small faint overlapping particles, but that's going to drive up the rendering cost of the effect; a better-performing alternative is to emit fewer large particles, and instead rely more on the texture and texture animation to convey the density of the effect. The overall result is often more visually effective anyway, because offline software like FumeFX and Houdini can usually produce much more interesting effects through texture animation than real-time simulated behaviour of individual particles can.

    The engine can also take steps to avoid doing more GPU work than necessary for particles. Every rendered pixel that ends up completely transparent is just wasted time, so a common optimization is particle trimming: instead of rendering the particle with two triangles, a custom-fitted polygon is generated that minimizes the empty areas of the texture that are used.
    unreal-particle-cutouts.jpg Particle 'cutout' tool in Unreal Engine 4
    The same can be done for other partially transparent objects, such as vegetation. In fact for vegetation it's even more important to use custom geometry to eliminate the large amount of empty texture space, as vegetation often uses alpha testing. This is when the alpha channel of the texture is used to decide whether or not to discard the pixel during the pixel shader stage, effectively making it transparent. This is a problem because alpha testing can also have the side effect of disabling the early depth test completely (because it invalidates certain assumptions that the GPU can make about the pixel), leading to much more unnecessary pixel shader work. Combine this with the fact that vegetation often contains a lot of overdraw anyway - think of all the overlapping leaves on a tree - and it can quickly become very expensive to render if you're not careful.

    A close relative of overdraw is overshading, which is caused by tiny or thin triangles and can really hurt performance by wasting a significant portion of the GPU's time. Overshading is a consequence of how GPUs process pixels during pixel shading: not one at a time, but in 'quads' - blocks of four pixels arranged in a 2x2 pattern. It's done like this so the hardware can do things like comparing UVs between pixels to calculate appropriate mipmap levels. This means that if a triangle only touches a single pixel of a quad (because the triangle is tiny or very thin), the GPU still processes the whole quad and just throws away the other three pixels, wasting 75% of the work. That wasted time can really add up, and is particularly painful for forward (ie. not deferred) renderers that do all lighting and shading in a single pass in the pixel shader. This penalty can be reduced by using properly-tuned LODs; besides saving on vertex shader processing, they can also greatly reduce overshading by having triangles cover more of each quad on average.
    quad-coverage.png A 10x8 pixel buffer with 5x4 quads. The two triangles have poor quad utilization -- left is too small, right is too thin. The 10 red quads touched by the triangles need to be completely shaded, even though the 12 green pixels are the only ones that are actually needed. Overall, 70% of the GPU's work is wasted.
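    The arithmetic behind the figure is easy to sketch. Assuming the simple 2x2 quad model described above (real GPUs have further subtleties, but the quad cost is the core of it), the wasted fraction of pixel shader work is:

```python
def overshading_waste(useful_pixels, touched_quads):
    """GPUs shade whole 2x2 quads, so every quad a triangle touches
    costs four pixel shader invocations, even for a single useful pixel."""
    shaded_pixels = touched_quads * 4
    return 1.0 - useful_pixels / shaded_pixels

# The figure's example: 12 useful (green) pixels across 10 touched (red) quads.
waste = overshading_waste(12, 10)  # 1 - 12/40 = 0.7, i.e. 70% of the work wasted

# A triangle that fills its quads completely wastes nothing.
no_waste = overshading_waste(4, 1)  # 0.0
```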
    (Random trivia: quad overshading is also the reason you'll sometimes see fullscreen post effects use a single large triangle to cover the screen instead of two back-to-back triangles. With two triangles, quads that straddle the shared edge would waste some of their work, so avoiding that saves a minor amount of GPU time.)

    Beyond overshading, tiny triangles are also a problem because a GPU can only process and rasterize triangles at a certain rate, which is usually relatively low compared to how many pixels it can process in the same amount of time. With too many small triangles, it can't produce pixels fast enough to keep the shader units busy, resulting in stalls and idle time - the real enemy of GPU performance. Similarly, long thin triangles are bad for performance for another reason beyond quad usage: GPUs rasterize pixels in square or rectangular blocks, not in long strips. Compared to a more regular-shaped triangle with even sides, a long thin triangle makes the GPU do a lot of extra unnecessary work to rasterize it into pixels, potentially causing a bottleneck at the rasterization stage. This is why it's usually recommended that meshes are tessellated into evenly-shaped triangles, even if it increases the polygon count a bit. As with everything else, experimentation and profiling will show the best balance.
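    Returning to the fullscreen-triangle trick for a moment: the usual approach generates the triangle's corners in the vertex shader from the vertex index alone, with no vertex buffer at all. Here's the index math sketched in Python for illustration (in a real engine it would be a couple of lines of shader code). The three corners land at (-1,-1), (3,-1) and (-1,3) in clip space, so the single oversized triangle covers the whole [-1,1] screen square with no shared edge for quads to straddle:

```python
def fullscreen_triangle_vertex(vertex_id):
    # Derive a clip-space position from the vertex index (0, 1 or 2);
    # the triangle is oversized so its interior covers all of [-1,1]^2.
    x = -1.0 + 4.0 * (vertex_id & 1)         # vertex 1 extends to x = 3
    y = -1.0 + 4.0 * ((vertex_id >> 1) & 1)  # vertex 2 extends to y = 3
    return (x, y)

corners = [fullscreen_triangle_vertex(i) for i in range(3)]
# [(-1.0, -1.0), (3.0, -1.0), (-1.0, 3.0)]
```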

    Memory Bandwidth and Textures

    As illustrated in the above diagram of the GPU pipeline, meshes and textures are stored in memory that is physically separate from the GPU's shader processors. That means that whenever the GPU needs to access some piece of data, like a texture being fetched by a pixel shader, it needs to retrieve it from memory before it can actually use it as part of its calculations.

    Memory accesses are analogous to downloading files from the internet. File downloads take a certain amount of time due to the internet connection's bandwidth - the speed at which data can be transferred. That bandwidth is also shared between all downloads - if you can download one file at 6MB/s, two files only download at 3MB/s each. The same is true of memory accesses; index/vertex buffers and textures being accessed by the GPU take time to transfer, and must share memory bandwidth. The speeds are obviously much higher than internet connections - on paper the PS4's GPU memory bandwidth is 176GB/s - but the idea is the same. A shader that accesses many textures will rely heavily on having enough bandwidth to transfer all the data it needs in the time it needs it.

    Shader programs are executed by the GPU with these restrictions in mind. A shader that needs to access a texture will try to start the transfer as early as possible, then do other unrelated work (for example lighting calculations) and hope that the texture data has arrived from memory by the time it gets to the part of the program that needs it. If the data hasn't arrived in time - because the transfer is slowed down by lots of other transfers, or because the shader runs out of other work to do (especially likely for dependent texture fetches) - execution will stop and it will just sit there and wait. This is a memory bandwidth bottleneck; making the rest of the shader faster will not matter if it still needs to stop and wait for data to arrive from memory.
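    To put that bandwidth figure in per-frame terms, a quick back-of-the-envelope sketch (using the PS4's paper number quoted above; real achievable bandwidth is always somewhat lower):

```python
def bandwidth_per_frame_mib(bandwidth_gb_per_s, fps):
    """How much data can cross the memory bus during a single frame."""
    bytes_per_frame = bandwidth_gb_per_s * 1e9 / fps
    return bytes_per_frame / (1024 * 1024)

# ~2800 MiB of transfers per 60fps frame, shared by every texture fetch,
# vertex/index buffer read and render target write - it sounds like a lot,
# but it disappears quickly once overdraw and filtering multiply the fetches.
budget = bandwidth_per_frame_mib(176, 60)
```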
    The only way to optimize this is to reduce the amount of bandwidth being used, the amount of data being transferred, or both. Memory bandwidth might even have to be shared with the CPU or with async compute work that the GPU is doing at the same time. It's a very precious resource. The majority of memory bandwidth is usually taken up by texture transfers, since textures contain so much data. As a result, there are a few different mechanisms in place to reduce the amount of texture data that needs to be shuffled around.

    First and foremost is the cache. This is a small piece of high-speed memory that the GPU has very fast access to, and is used to keep chunks of memory that have been accessed recently in case the GPU needs them again. In the internet connection analogy, the cache is your computer's hard drive that stores the downloaded files for faster access in the future. When a piece of memory is accessed, like a single texel in a texture, the surrounding texels are also pulled into the cache in the same memory transfer. The next time the GPU looks for one of those texels, it doesn't need to go all the way to memory and can instead fetch it from the cache extremely quickly. This is actually often the common case - when a texel is displayed on the screen in one pixel, it's very likely that the pixel beside it will need to show the same texel, or the texel right beside it in the texture. When that happens, nothing needs to be transferred from memory, no bandwidth is used, and the GPU can access the cached data almost instantly.
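    The effect of access locality on the cache can be illustrated with a toy model. This is a deliberately simplified sketch (a small LRU cache with made-up line and capacity sizes - real GPU texture caches are more sophisticated), but the locality behaviour it shows is the real phenomenon:

```python
from collections import OrderedDict

def cache_hit_rate(texel_indices, line_texels=16, cache_lines=64):
    """Toy texture cache: each miss transfers a whole line of
    `line_texels` consecutive texels from memory; LRU eviction."""
    cache, hits = OrderedDict(), 0
    for texel in texel_indices:
        line = texel // line_texels
        if line in cache:
            hits += 1
            cache.move_to_end(line)   # mark as most recently used
        else:
            cache[line] = True        # miss: fetch the line from memory
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(texel_indices)

nearby = cache_hit_rate(range(4096))                 # adjacent texels: ~94% hits
scattered = cache_hit_rate(range(0, 4096 * 64, 64))  # distant texels: 0% hits
```

The second pattern - every access landing far from the previous one - is exactly what happens when a large texture is sampled by a distant object, which is the problem mipmapping solves below.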
    Caches are therefore vitally important for avoiding memory-related bottlenecks, especially when you take filtering into account - bilinear, trilinear, and anisotropic filtering all require multiple texels to be accessed for each lookup, putting an extra burden on bandwidth usage. High-quality anisotropic filtering is particularly bandwidth-intensive.

    Now think about what happens in the cache if you try to display a large texture (eg. 2048x2048) on an object that's very far away and only takes up a few pixels on the screen. Each pixel will need to fetch from a very different part of the texture, and the cache will be completely ineffective since it only keeps texels that were close to previous accesses. Every texture access will try to find its result in the cache and fail (called a 'cache miss'), and so the data must be fetched from memory, incurring the dual costs of bandwidth usage and the time it takes for the data to be transferred. A stall may occur, slowing the whole shader down. It will also cause other (potentially useful) data to be 'evicted' from the cache to make room for the surrounding texels that will never even be used, reducing the overall efficiency of the cache. It's bad news all around, and that's without even mentioning the visual quality issues - tiny movements of the camera will cause completely different texels to be sampled, causing aliasing and sparkling.

    This is where mipmapping comes to the rescue. When a texture fetch is issued, the GPU can analyze the texture coordinates being used at each pixel, determining when there is a large gap between texture accesses. Instead of incurring the costs of a cache miss for every texel, it instead accesses a lower mip of the texture that matches the resolution it's looking for. This greatly increases the effectiveness of the cache, reducing memory bandwidth usage and the potential for a bandwidth-related bottleneck.
Lower mips are also smaller and need less data to be transferred from memory, further reducing bandwidth usage. And finally, since mips are pre-filtered, their use also vastly reduces aliasing and sparkling. For all of these reasons, it's almost always a good idea to use mipmaps - the advantages are definitely worth the extra memory usage.
    minimized-texture.png A texture on two quads, one close to the camera and one much further away
    mipmap-example.jpg The same texture with a corresponding mipmap chain, each mip being half the size of the previous one
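    That "extra memory usage" is smaller than it might look. Each mip level has a quarter of the texels of the one above it, so a full chain converges to just one third on top of the base level - a quick sketch:

```python
def mip_chain_texels(width, height):
    """Texel count of each level in a full mip chain,
    halving each dimension until reaching 1x1."""
    levels = [width * height]
    while width > 1 or height > 1:
        width, height = max(width // 2, 1), max(height // 2, 1)
        levels.append(width * height)
    return levels

chain = mip_chain_texels(2048, 2048)    # 12 levels: 2048x2048 down to 1x1
overhead = sum(chain) / chain[0] - 1.0  # ~0.333: the mips cost ~33% extra memory
```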
    Lastly, texture compression is an important way of reducing bandwidth and cache usage (in addition to the obvious memory savings from storing less texture data). Using BC (Block Compression, previously known as DXT compression), textures can be reduced to a quarter or even a sixth of their original size in exchange for a minor hit in quality. This is a significant reduction in the amount of data that needs to be transferred and processed, and most GPUs even keep the textures compressed in the cache, leaving more room to store other texture data and increasing overall cache efficiency.

    All of the above should suggest some obvious steps for reducing or eliminating bandwidth bottlenecks on the art side. Make sure the textures have mips and are compressed. Don't use heavy 8x or 16x anisotropic filtering if 2x is enough, or even trilinear or bilinear if possible. Reduce texture resolution, particularly if the top-level mip is often displayed. Don't use material features that cause texture accesses unless the feature is really needed. And make sure all the data being fetched is actually used - don't sample four RGBA textures when you only need the data in the red channel of each; merge those four channels into a single texture and you've removed 75% of the bandwidth usage.

    While textures are the primary users of memory bandwidth, they're by no means the only ones. Mesh data (vertex and index buffers) also needs to be loaded from memory, and you'll notice in the first GPU pipeline diagram that the final render target output is a write to memory too. All these transfers usually share the same memory bandwidth. In normal rendering these costs typically aren't noticeable, as the amount of data is relatively small compared to the texture data, but this isn't always the case. Compared to regular draw calls, shadow passes behave quite differently and are much more likely to be bandwidth bound.
    gta-v-shadows.png A frame from GTA V with shadow maps, courtesy of Adrian Courrèges' great frame analysis
    This is because shadow maps are simply depth buffers that represent the distance from the light to the closest mesh, so most of the work that needs to be done for shadow rendering consists of transferring data to and from memory: fetch the vertex/index buffers, do some simple calculations to determine position, and then write the depth of the mesh to the shadow map. Most of the time, a pixel shader isn't even executed because all the necessary depth information comes from the vertex data alone. This leaves very little work to hide the overhead of all the memory transfers, and the likely bottleneck is that the shader just ends up waiting for those transfers to complete. As a result, shadow passes are particularly sensitive to both vertex/triangle counts and shadow map resolution, as they directly affect the amount of bandwidth that is needed.

    The last thing worth mentioning with regard to memory bandwidth is a special case - the Xbox. Both the Xbox 360 and Xbox One have a particular piece of memory embedded close to the GPU, called EDRAM on the 360 and ESRAM on the Xbox One. It's a relatively small amount of memory (10MB on the 360 and 32MB on the Xbox One), but big enough to store a few render targets and maybe some frequently-used textures, and with a much higher bandwidth than regular system memory (aka DRAM). Just as important as the speed is the fact that this bandwidth uses a dedicated path, so it doesn't have to be shared with DRAM transfers. It adds complexity to the engine, but when used efficiently it can give some extra headroom in bandwidth-limited situations. As an artist you generally won't have control over what goes into EDRAM/ESRAM, but it's worth knowing of its existence when it comes to profiling. The 3D programming team can give you more details on its use in your particular engine.
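    Returning to shadow passes: a rough sketch shows why shadow map resolution matters so much for bandwidth. The numbers here are hypothetical - cascade counts and depth formats vary per engine - but the quadratic growth is the point:

```python
def shadow_depth_writes_mib(resolution, cascades=4, bytes_per_depth=4):
    """Lower bound on depth data written per frame for cascaded shadow
    maps, before even counting the vertex/index buffer reads."""
    return resolution * resolution * cascades * bytes_per_depth / (1024 * 1024)

writes_2k = shadow_depth_writes_mib(2048)  # 64 MiB/frame for four 2048^2 cascades
writes_4k = shadow_depth_writes_mib(4096)  # 256 MiB/frame - 4x cost for one step up
```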

    And there's more...

    As you've probably gathered by now, GPUs are complex pieces of hardware. When fed properly, they are capable of processing an enormous amount of data and performing billions of calculations every second. On the other hand, bad data and poor usage can slow them down to a crawl, having a devastating effect on the game's framerate. There are many more things that could be discussed or expanded upon, but what's above is a good place to start for any technically-minded artist. Having an understanding of how the GPU works can help you produce art that not only looks great but also performs well... and better performance can let you improve your art even more, making the game look better too. There's a lot to take in here, but remember that your 3D programming team is always happy to sit down with you and discuss anything that needs more explanation - as am I in the comments section below!

    Further Technical Reading

    Note: This article was originally published on fragmentbuffer.com, and is republished here with kind permission from the author Keith O'Conor. You can read more of Keith's writing on Twitter (@keithoconor).

    User Feedback

    Good article. I think it would be worth mentioning that Unity's Frame Debugger tool is better suited to GPU-related debugging than the general-purpose Profiler tool.


    • By gustavo rincones
      Hi, this is my first forum and I want to do it: quick way to calculate the square root in c ++ with floating data types. These types of functions are very useful to gain some CPU time, especially when used continuously. I will show you 3 similar functions and indicate the advantages and disadvantages of each of them. The last of these three functions was written by me. If you notice that the writing is a bit out of grammar, it is because I do not speak English and I am using a support tool. My native language is Spanish. Well, let's start:

      The First method is very famous was used in the video game quake III arena and you can find a reference in Wikipedia as: :https://en.wikipedia.org/wiki/Fast_inverse_square_root.
      The Function was optimized for improvements in computing times.
      float sqrt1(const float &n) { static union{int i; float f;} u; u.i = 0x5F375A86 - (*(int*)&n >> 1); return (int(3) - n * u.f * u.f) * n * u.f * 0.5f; }  
      * When Root of 0 is calculated the function returns 0.
      * The convergence of the function is acceptable enough for games.
      * It generates very good times.
      * The Reciprocal of the root can be calculated by removing the second “n” from the third line. According to the property of: 1 / sqrt (n) * n = sqrt (n).
      * Convergence decreases when the root to be calculated is very large.
      The second method is not as famous as the first. But it does the same function calculate the root.
      float sqrt2(const float& n) { union {int i; float f;} u; u.i = 0x1FB5AD00 + (*(int*)&n >> 1); u.f = n / u.f + u.f; return n / u.f + u.f * 0.25f; }  
      * The convergence of the function is high enough to be used in applications other than games.
      * Computing times are much larger.
      * The square root of “0” is a number very close to “0” but never “0”.
      * The division operation is the bottleneck in this function. because the division operation is more expensive than the other arithmetic operations of Arithmetic Logic Units (ALU).
      The third method takes certain characteristics of the two previous methods.
      float sqrt3(const float& n) { static union {int i; float f;} u; u.i = 0x2035AD0C + (*(int*)&n >> 1); return n / u.f + u.f * 0.25f; }  
      * The convergence of the function is greater than that of the first method.
      * Generates times equal to or greater than the first method.
      * The square root of “0” is a number very close to “0” but never “0”.
      The 3 previous methods have something in common.
      They are based on the definition of the Newton-Raphson Method. according to the function of the square root > f (x) = x ^ 2 - s.
      well thanks to you for reading my forum.
      well thanks to you for reading my forum.
    • By bzt
      I was looking for a way to effectively interpolate between skeletons represented by list of bones with position vector + orientation quaternion pairs. To my surprise, I couldn't find any good solution, not to my taste that is. I can't use NLERP because I need a general solution, and NLERP is only good for small distance interpolations. SLERP would be perfect, but I couldn't find a decent implementation, only slow ones, so I spent a lot of time looking at game engines and quaternion libraries and reading academic articles on the topic. Finally, I decided to collect my findings and write my own optimized version of SLERP, which focuses on performance over precision, and which I now would like to share with you.
      Optimized cross-platform SLERP
      I've started from the well known formula, as implemented by most C++ libraries, and used all the tricks I could find on the net. Then I took it to the next level, and made some more adjustments on my own speeding up a little bit more. Finally I ended up with two solutions:
      1. one that is cross-platfrom, ANSI C solution that is easily embeddable in C++ and compatible with any C / C++ quaternion class (provided their data can be seen as 4 floats), very simple, very minimal. Use it freely if you want, MIT licensed
      2. a SIMD version, which I had to leave half-ready, because my compiler and Intel disagree on what intrinsics should be provided for SSE, moreover the compiler does not work the way its documentation say it should :-( Shit happens. I would only recommend this for experimental purposes, also MIT licensed
      Both versions are simpler and much faster than any implementations I could find, designed to be called in a loop several times. The only dependency they have is lib math (acosf, sinf). The SIMD version is half-ready, but even in this form it's almost 1.5 as fast as the other one, especially if you call it in a loop with memory prefetch on the quaternions.
      If I made any mistake, miscalculated or mistyped something, let me know. If anybody has an idea how to overcome that intrinsics blockade I have, that would be appreciated very much, because I can clearly see the path how to make the code several times faster, only if I could get rid of the library calls. I also plan to create an ARM NEON port once I have that.
    • By Ruben Torres
      [This post was originally posted with its original formatting at The Gamedev Guru's Blog]
      If you've been following me, you will probably know my interest in Unity Addressables. That is for a reason.

      Unity Addressables is a powerful Unity package that upgrades the way you and I have been tackling some of the most important challenges in Game Development: efficient, pain-free content management.
      When managing your game assets, it's hard to keep good standards that prevent our project from becoming a disgusting pile of mess. A big issue there is the coupling between the different responsibilities of our asset management systems.
      The way we store the assets in our project has way too much to do with the method we load them, and later use them.
      For instance, you may decide to store an innocent sprite in the Resources folder. This, in turn, will force Unity to build the player in a way that that sprite is put into special archive files. And the fact that it was put there, will corner you into loading it through the Resources API.
      Things get messy quicker than you can realize!
      One choice, multiple long-term consequences.
      A good system will prevent you and me from easily making sloppy mistakes like that. A great system will be that, and also easy to learn and use.
      With Unity Addressables, we separate the asset management concerns. Our goal is to remain flexible and to keep our project maintainable.
      Here are 3 proven ways Unity Addressables will help you and your games:

      1. Reduce Your Game's Memory Pressure
      When you publish your game, you'll be required on most platforms to specify the minimum hardware specifications your players must meet to buy and play your game.
      The math is easy here: the more hardware power you demand, the fewer will buy your game. Or, seen from another perspective, the better memory management you do, the higher the amount of content and fun you can offer in your game.
      Unity Addressables helps you in this regard enormously!
      To give you a brief idea, converting this kind of code:
      public class CharacterCustomization : MonoBehaviour { [SerializeField] private List<Material> _armorVariations; [SerializeField] private MeshRenderer _armorRenderer; public void ChangeArmorVariation(int variationId) { _armorRenderer.material = _armorVariations[variationId]; } } Into this other one:
      using UnityEngine.AddressableAssets; public class CharacterCustomizationV2 : MonoBehaviour { [SerializeField] private List<AssetReference> _armorVariations; [SerializeField] private MeshRenderer _armorRenderer; public IEnumerator ChangeArmorVariation(int variationId) { var loadProcess = _armorVariations[variationId].LoadAssetAsync(); yield return loadProcess; _armorRenderer.material = loadProcess.Result; } } Will bring you these results:

      Easy gains I'd say.
      -> Read more on Unity Addressables for better memory management in Unity Addressables: It's Never Too Big to Fit (opens in a new tab)

      2. Sell Your Next DLC - Quick and Easy
      The fact that Unity Addessables gives you full control over how, when and where to store and load your game assets is incredibly useful for implementing and selling Downloadable Content.
      Even if you are not thinking of releasing DLCs any time soon, just by using Unity Addressables in your project, you will have done already a big chunk of the work ahead.
      Other approaches for selling DLCs, such as Asset Bundles, are a very deprecated way of doing the same thing but at a much higher cost. Maintaining a well-functioning Asset Bundle pipeline is painfully time-consuming and requires a high degree of expensive expertise.
      There are many ways you can approach implementing DLCs in Unity, but for starters, this is a good starting point:
      public class DlcManager : MonoBehaviour { // ... public IEnumerator TryDownloadDlc() { if (_hasBoughtDlc && _dlcDownloaded == false) { var operationHandle = Addressables.DownloadDependenciesAsync("DLC-Content"); while (operationHandle.IsDone == false) { _progressText.text = $"{operationHandle.PercentComplete * 100.0f} %"; yield return null; } } } } You get the idea.
      Why would you say no to selling more entertainment for your players at a fraction of the cost?

      3. Reduce Your Iteration Times
      Using Unity Addressables will reduce the time wasted waiting in several areas.
      Tell me, how frustrating is it to be blocked for half a minute after pressing the Unity play button? And it only gets worse if you deploy your build on another platform, such as mobile or WebGL. This all starts adding minutes and minutes to your iteration times. It gets old so quickly.
      I don't like waiting either.
      But do you know what I like? Unity Addressables, my long-awaited hero. This is how Addressables will help you:
      A) Reduced Build Size
      Your game has a lot of content, I get it. Gamers love enjoying content. Developers love creating content.
      That doesn't mean, however, that every single asset you produced has to be included in the build your players will install. In fact, you should remove as much as possible.
      Players want to start playing ASAP. And they're not happy when your game steals 2GB of their data plan and 30 minutes of their gaming time. They'll just keep downloading Candy Crush kind of games that install well under 50MB.
      One strategy is to include only the assets needed to run your game up to the main menu. Then, you can progressively download the rest of your content in the background, starting, of course, downloading the first level of your game.
      It's also neat to realize that your deployment times during development will be much faster. You'll be able to iterate more times each day; this benefit quickly adds up in the long term.

      Unity Addressables - Reduced Build Sizes
      B) Reduced Load Times
      We, both as game developers and as players, hate waiting. Waiting takes us out of the zone and before you realize it, it is time to go to bed.
      Unity is working hard towards reducing the time it takes us to start playing our games, both in the Unity Editor and in the games we distribute.
      But not hard enough.
      Things look promising in the future, but not without side effects. Avoiding domain reloads in Unity 2019.3 looks promising, but as of today that's still in beta and not everyone can profit from it.
      In the mean-time, we can do better than just being frustrated.
      Let's say you're working on a medieval game. Several months ago, you implemented armor types for your game. You did a pretty damn good job and generated over 100MB of content.
      At some point, it was time to move on and right now you're working on something else, let's say sword fighting.
      Realize that, every time you press the play button to work on your features, you are loading an insane amount of data coming from all the already developed features, and loading this data takes a massive amount of time. You press play to test your sword fighting animations, and you spend 5 seconds waiting due to loading the armor features you implemented.
      The time wasted in loading is mostly spent on I/O (Input/Output), because memory bandwidth is expensive. And, on top of that, your CPU has to process it. You, as a developer, pay this time penalty while developing in the Unity Editor. But your players pay it as well in the games you are distributing.
      Knowing how big of a deal this can be, let's ask ourselves: which shortcuts can we take here?
      It turns out that Unity Addressables can help us here in two ways.
      1. Unity Addressables will reduce your Players' Loading Times
      We can alleviate some of our players' pain.
      Keeping indirect references to our assets instead of direct references will drastically improve your loading times.
      By using indirect references (AssetReference), Unity will not load everything at once but only what you tell it to. And more importantly, you have direct control over when that happens.
      2. Unity Addressables will reduce your Unity Editor Iteration Times
      How much do you know about the play mode script in the Unity Addressables Window? The play mode script defines how the Unity Editor should load the content marked as Addressable.
      With Packed Play Mode selected, Unity will directly load your pre-built addressable assets with little to no processing overhead, effectively reducing your Unity Editor iteration times
      Just do not forget to build the player content for this to work

      Unity Addressables - Build Player Content
      What if you applied these strategies to your most demanding content?
      4. Extra: Are You There Yet?
      It is true. Unity Addressables is very helpful. But this package will help only those who want to be helped.
      After reading how Addressables will help you producing better and selling more to your players, you probably want to start with it right away. However, starting in this new unknown area may be challenging.
      To make the most of your chance, answer these questions first:
      Where are you standing right now? Are you just starting, or are your skills production-ready? When to use indirect references, when to use direct references? What's the bigger picture? What is your next logical step? → Take this short quiz now to test your answers ← (opens in a new tab)

    • By Ruben Torres
      [This article was originally posted in The Gamedev Guru's Blog]
      You are leading a team of a handful of programmers and artists to port a good-looking PS4 VR game to Oculus Quest. You have six months to complete it. Well? What's your first move? Let's bring Unity Addressables to the table.
      You do realize you have to tackle quite many challenging tasks at once. I bet some might be personally more concerning than others, depending on your expertise level in each area. If you had to pick one that robbed your sacred sleep the most, which one would it be?
      My initial guess would be this: above 70% of the readers would say CPU/GPU performance is the biggest concern when porting a title to Quest. And to that I say: you can very well be right. Performance is one of the toughest areas to improve on in a VR game. For optimizations of this kind, you require in-depth knowledge about the product, which is a time intensive process. Sometimes you even cannot just optimize further, which usually ends in dropping expensive gameplay or graphics features. And not meeting people's expectations is a dangerous place to be.
      Performance, performance, performance.. The word might let some chills go down your spine. What can you expect in this regard from the Quest platform? How does it perform? The thing is, if you have had some development experience in it, you will probably know that in spite of being mobile hardware, it is astonishingly powerful.
      But Ruben! Don't give me that crap. I tell you my phone slows down significantly the time I open a second browser tab. How dare you say mobile platforms can be so performant?
      I read your mind, didn't I? The massive difference lies on Quest's active cooling system, which gives it a huge boost on its attainable CPU/GPU hardware levels that no other mobile platform can offer. It is a powerful fan that will prevent your hair from gathering dust and avoid melting the CPU together with your face (the GoT crown scene comes to mind).
      Additionally and on the side of the Quest software, the more specialized OS is better optimized for virtual reality rendering (surprise) than the generic Android variant. Mobile hardware has been catching up with standalone platforms so quickly in the last few years.
      But, at the same time, I cannot deny that our task of constantly rendering at 72 fps will prove to be challenging, especially for VR ports coming from high-end platforms. To be more precise, when we talk about the Oculus Quest, you have to picture yourself a Snapdragon 835 with a screen, a battery, four cameras and a fan.
      What could look like a disadvantage can actually be thought of as an edge. This mobile platform is a well researched piece of s*** hardware. One can say there are a myriad of known tricks you can pull off to quickly reduce the CPU and GPU load up to an acceptable point. If it is of your interest, you will be able to read about it in upcoming posts. As of now, we will take performance out of the equation for this post.
      What might catch your attention in our challenge is that, compared to the PS4, there is a single hardware characteristic literally halved on the Quest: the RAM capacity. That's right, we go from 8 to 4GB RAM. This is an approximation since, in both platforms, the operating system does not allow you to use it all so it can keep track of a few subsystems for the ecosystem to work. On the Quest you will be able to roughly use up to 2.2 GB of RAM before things get messy.
      Ruben, what exactly do you mean by messy? The thing is, proper memory handling is crucial for your game, because you face two constraints:
      - Hard memory constraint: if you go above a certain threshold, the OS will just kill your game.
      - Soft memory constraint: above another certain limit, the OS might decide to kill your game when the user minimizes it, takes the headset off or steps out of the Oculus Guardian area.
      Obviously, you do not want either of the two to happen in your game. Can you picture an angry player who just lost their last two hours of gameplay? Yes, they will go to your store page and nothing pretty will come out of their mouth.
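      To make the soft constraint a little less abstract: Unity exposes a low-memory callback on mobile platforms that fires when the OS signals memory pressure. Below is a minimal, illustrative sketch of reacting to it; the class name and the response are my own assumptions, not code from any specific project:

```csharp
using UnityEngine;

// Illustrative sketch: react to OS memory pressure by releasing
// whatever we can before the OS decides to kill the app.
public class LowMemoryWatcher : MonoBehaviour
{
    private void OnEnable()  { Application.lowMemory += OnLowMemory; }
    private void OnDisable() { Application.lowMemory -= OnLowMemory; }

    private void OnLowMemory()
    {
        // Drop assets no longer referenced by the scene;
        // an aggressive but simple first response.
        Resources.UnloadUnusedAssets();
    }
}
```

      This is not a substitute for a proper memory budget, but it buys you a chance to shed weight before the hard constraint is reached.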

      Disgruntled Player - RUN!
      The thing is, the guaranteed availability of 2.2 GB of RAM is not much, honestly. It's usually not a problem for new projects where you track your stats constantly from the beginning, but it definitely is an issue for a port to severely downgraded hardware.
      If you have dealt with similar ports in the past, you will quickly realize how extremely challenging it can be to cut your game's RAM usage in half. It largely depends on how well the game architecture was prepared for such a change, but in most cases this will temporarily transform you into a tear-producing machine.
      The most popular strategies to reduce memory pressure include tweaking asset compression settings, optimizing scripts, reducing shader variants, etc. Depending on the specifics of your project, tweaking texture import settings is your first go-to solution, but if you need to you can also resort to compressing meshes, animations and audio. The problem is that these techniques are usually complex in nature and have a ceiling.
      Not all platforms support the same import settings; targeting several devices will dramatically increase your build pipeline overhead, not to mention QA, art, design and programming complexity. Does this Android device support ASTC, or just ETC2 (if at all)? Oh, we also want to ship 64-bit builds, but keep serving the players on the 32-bit versions. How many split APKs should we create and, worse, manage and test for each update we make to the game? You want to make your life easier, so you should not rely exclusively on these techniques.
      We would like to go further. As usual, we want to keep things as simple as possible (TM), especially if you are doing a port. Redesigning the entire game for performance's sake is a worse option than just not porting it. For that matter, in today's topic I will offer you one of the biggest bangs for the buck: I will show you how to halve a project's memory budget in a matter of hours. Boom!
      Wouldn't that be awesome?
      Go ahead, go ahead... ask me: is it really possible for me? The answer is: it depends on your starting point, but in my experience, YES. Unity Addressables can be of great value here. The catch? You must be willing to put in the work and master the process. But this workflow will earn you the title of employee of the month.
      If you are interested, keep reading.
      In this post you and I will go through the journey of moving from traditional asset management to an addressables-based asset management system. To illustrate the process, we will port a simplified old-school project to the new era of Unity Addressables.
      Now, you might ask me: why don't you just show me the real-life work you did?
      In a non-competitive world I would just show you all the material I created. In the real world though, that is likely to get me canned. And worse, busted.
      What I will do instead is offer you my guidance so you and I can iterate on a project that accurately represents the difficulties you could find tomorrow in your next project. And we will do so by first welcoming Unity's Addressables to our family of suggested packages.
      In this blog post I will get you started in Addressables ASAP so you can implement your own Unity Addressables system in a matter of minutes.

      Unity Addressables: Why?
      Time to pay attention to this very important section. Our goal is to diagnose easy memory gains and implement them fast. There are several ways to do this, but one of the most powerful yet simplest methods is to pick the low-hanging fruit by loading the initial scene and opening up the profiler. Why this?
      Because an unoptimized game architecture can frequently be diagnosed at any point during gameplay, the quickest check is to profile the initial scenes. The reason for this is the common overuse of singleton-like scripts containing references to every single asset in the project, just in case you need them.
      In other words, in many games there is an almighty script causing an asset-reference hell. This component keeps every asset loaded at all times, regardless of whether it is in use at that point in time.
      How bad is this?
      It depends. If your game is likely to be constrained by memory capacity, it is a very risky solution, as your game will not scale with the amount of assets you add (e.g. think of future DLCs). If you target heterogeneous devices, such as Android, you don't have a single memory budget; each device offers a different one, so you settle for the worst case. The OS can decide to kill your app at any point if your user decides to answer a quick message on Facebook. Then the user comes back and, surprise, all their progress is gone for good.
      How fun is that?
      Zero. Unless you are overworked and sleep-deprived, a situation which might grant you a desperate laugh.
      To complicate matters further, if later on you decide (or someone decides for you) to port your game to another, less powerful platform while keeping cross-play working, good luck. You don't want to get caught in the middle of that technical challenge.
      On the other hand, is there a situation where this traditional asset management solution works just fine? The answer is yes. If you are developing for a homogeneous platform such as the PS4 and most requirements are set from the beginning, the benefits of global objects can potentially outweigh the added complexity of a better memory management system.
      Because let's face it: a plain, good old global object containing everything you need is a simple solution, if it works well enough for you. It will simplify your code and will also preload all your referenced assets.
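      For clarity, the kind of global object I am describing might look like the hypothetical sketch below; the class and field names are mine, not code from any specific project:

```csharp
using UnityEngine;

// Illustrative anti-pattern sketch: an "almighty" global object.
// Every direct reference below is loaded together with the scene
// and stays in memory for the whole session, used or not.
public class GlobalAssets : MonoBehaviour
{
    public static GlobalAssets Instance { get; private set; }

    // Direct references: Unity loads all of these (plus their
    // texture dependencies) as soon as this object is loaded.
    [SerializeField] private Material[] _allSkyboxes;
    [SerializeField] private GameObject[] _allCharacterPrefabs;
    [SerializeField] private AudioClip[] _allMusicTracks;

    private void Awake() { Instance = this; }
}
```

      Convenient to write, ruinous to scale: the memory cost grows with the total asset count, not with what is actually on screen.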
      In any case, the traditional memory management approach is not acceptable for developers seeking to push the boundaries of the hardware. And as you are reading this, I take it you want to level up your skills. So it is time to do just that.
      Say hello to Unity Addressables.
      Requirements for our Unity Addressables project
      If you are planning on just reading this blog entry, your screen will suffice. Otherwise, if you want to do this along with me, you will need:
      - Functional hands
      - An overclocked brain
      - Unity 2019.2.0f1 or similar
      - The level 1 project from GitHub (download the zip or clone through the command line)
      - Willingness to get your hands dirty with Unity Addressables
      The git repository contains three commits, one per skill level-up section in the blog (unless I messed up at some point, in which case I will commit a fix).
      Download the project in ZIP format directly from GitHub

      Level 1 developer: Traditional asset management
      We are starting with the simplest asset management method. In our case, this entails having a list of direct references to skybox materials in our component.
      If you are doing this with me, the setup is a simple 3 step process:
      1. Download the project from git
      2. Open the project in Unity
      3. Hit the damn play button!
      Good, good. You can click a few buttons and change the skybox. This is so original... and boring. I take it: no Unity Addressables yet.
      In a short moment you and I will see why we need to endure this short-lasting boredom.
      To start with, how is our project structured? It pivots around two main systems. On one side, we have our Manager game object. Its main script holds the references to the skybox materials and switches between them based on UI events. Easy enough.
      using UnityEngine;

      public class Manager : MonoBehaviour
      {
          [SerializeField]
          private Material[] _skyboxMaterials;

          public void SetSkybox(int skyboxIndex)
          {
              RenderSettings.skybox = _skyboxMaterials[skyboxIndex];
          }
      }

      The Manager offers the UI system a function to apply a specific material to the scene settings through the RenderSettings API.
      Secondly, we have our CanvasSkyboxSelector. This game object contains a canvas component rendering a collection of vertically-distributed buttons. Each button, when clicked, invokes the aforementioned Manager function to swap the rendered skybox based on the button id. Put another way, each button's OnClick event calls the SetSkybox function on the Manager. Isn't that simple?
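      If you are curious how such a selector could be wired up in code rather than through the inspector's OnClick list, a hypothetical sketch might look like this (the field names are my own assumptions, not the actual project code):

```csharp
using UnityEngine;
using UnityEngine.UI;

// Hypothetical sketch: wiring each button's onClick event to the
// Manager at runtime instead of via the inspector.
public class CanvasSkyboxSelector : MonoBehaviour
{
    [SerializeField] private Manager _manager;
    [SerializeField] private Button[] _skyboxButtons;

    private void Awake()
    {
        for (var i = 0; i < _skyboxButtons.Length; i++)
        {
            var skyboxIndex = i; // capture the index, not the loop variable
            _skyboxButtons[i].onClick.AddListener(
                () => _manager.SetSkybox(skyboxIndex));
        }
    }
}
```

      Either wiring style works; the point is simply that every click ends up in Manager.SetSkybox with the button's index.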
      Unity Addressables - Scene Hierarchy
      Once we're done daydreaming about how immersive it was to play X2, it's time to get our hands dirty. Let's launch the multisensory experience and open the profiler (ctrl/cmd + 7, or Window → Analysis → Profiler). I take it you are familiar with this tool; otherwise, you know what to do with the top record button. After a few seconds of recording, stop it and see the metrics for yourself: CPU, memory, etc. Anything of interest?

      Performance is pretty good, which is not surprising considering the scope of the project. You could literally turn this project into a VR experience and I assure you that your users will not fill any of the bile buckets that so many players filled when playing Eve: Valkyrie.

      In our case, we will be focusing on the memory section. The simple view mode will display something as depicted below:
      Level 1 Asset Management - Simple Memory Profiling
      The numbers on the texture size look pretty oversized for displaying a single skybox at any given time, don't you agree? Surprise incoming: this is the pattern you will find in many unoptimized projects you are likely to inherit. But heck, in this case it's just a collection of skyboxes. In other projects it will be characters, planets, sounds, music. You name it, I have it.
      If dealing with many assets falls under your responsibility, then I am glad you are reading this article. I will help you transition to a scalable solution.
      Time for magic. Let's switch the memory profiler to detail mode. Have a look!
      Level 1 Asset Management - Detailed Memory Profiling
      Holy crap, what happened there? All skybox textures are loaded in memory, but only one is displayed at any given time. You see what we did? This rookie architecture produced the whopping figure of 400 MB.
      This is definitely a no-go, considering this is just a small piece of a future game. Addressing this very problem is the foundation for our next section.
      Come with me!
      - Traditional asset management entails direct references
      - Therefore you keep all objects loaded at all times
      - Your project will not scale this way
      Level 2 developer: Unity Addressables workflow
      In video games you start at level 1, which is great, but once you know the gameplay rules it is time to leave the safe city walls in our quest to level up. That is exactly what this section is about.
      Grab the level 2 project now.
      As we previously saw in the profiler, we have all skyboxes loaded in memory even though only one is actively used. That is not a scalable solution, as at some point you will be limited in the number of asset variations you can offer your players. A piece of advice? Don't limit your players' fun.
      Here, let me help you. Take my shovel so we can dig the much-needed tunnel to escape the prison of traditional asset management. Let's add a fancy new tool to our toolbox: the Unity Addressables API.
      The first thing we need to do is to install the Addressables package. For that, go to Window → Package Manager, as shown below:

      Unity Package Manager - Unity Addressables
      Once it is installed, it's time to mark the materials as addressable. Select them and activate the addressable flag in the inspector window.
      Level 2 Asset Management (Unity Addressables)
      What this does is politely ask Unity to include those materials and their texture dependencies in the addressables database. This database will be used during our builds to pack the assets into chunks that can easily be loaded at any point in our game.
      I'll show you something cool now. Open Window → Asset Management → Addressables. Guess what that is? It's our database screaming to go live!
      Level 2 Asset Management (Unity Addressables) - Main Window
      My dear reader: that was the easy part. Now comes the fun part.
      I want you to pay a visit to an old friend of ours from the previous section: Sir Manager. If you check it, you will notice it is still holding direct references to our assets! We don't want that.
      We will teach our Manager to use indirect references instead, i.e. AssetReference (in Unreal Engine you might know them as soft references).
      Let's do just that and beautify our component:
      using System.Collections;
      using System.Collections.Generic;
      using UnityEngine;
      using UnityEngine.AddressableAssets;
      using UnityEngine.ResourceManagement.AsyncOperations;

      public class Manager : MonoBehaviour
      {
          [SerializeField] private List<AssetReference> _skyboxMaterials;

          private AsyncOperationHandle<Material> _currentSkyboxMaterialOperationHandle;

          public void SetSkybox(int skyboxIndex)
          {
              StartCoroutine(SetSkyboxInternal(skyboxIndex));
          }

          private IEnumerator SetSkyboxInternal(int skyboxIndex)
          {
              if (_currentSkyboxMaterialOperationHandle.IsValid())
              {
                  Addressables.Release(_currentSkyboxMaterialOperationHandle);
              }

              var skyboxMaterialReference = _skyboxMaterials[skyboxIndex];
              _currentSkyboxMaterialOperationHandle = skyboxMaterialReference.LoadAssetAsync<Material>();
              yield return _currentSkyboxMaterialOperationHandle;

              RenderSettings.skybox = _currentSkyboxMaterialOperationHandle.Result;
          }
      }

      What happens here is the following:
      - The major change happens in line 9, where we hold a list of indirect references (AssetReference) instead of direct material references. This change is key, because these materials will NOT be loaded automatically just by being referenced; their loading has to be made explicit. Afterwards, please reassign the field in the editor.
      - Line 15: since we are now in an asynchronous workflow, we favor the use of a coroutine. We simply start a new coroutine that will handle the skybox material change.
      - In lines 20-23 we check whether we hold a valid handle to a skybox material and, if so, we release the skybox we were previously rendering. Every time we do a load operation with the Addressables API, we receive a handle we should store for future operations. A handle is just a data structure containing the data needed to manage a specific addressable asset.
      - We resolve the addressable reference to a skybox material in line 25 and then call its LoadAssetAsync function in line 26, over which we yield in line 27 so we wait for the operation to finish before proceeding. Thanks to the use of generics, there's no need for sucky casts. Neat!
      - Finally, once the material and its dependencies have been loaded, we change the skybox of the scene in line 29. The material is offered in the Result field of the handle we used to load it.
      Level 2 Asset Management (Unity Addressables) - AssetReference list
      Keep in mind: this code is not production-ready. Do not use it when programming an airplane. I decided to favor simplicity over robustness to keep things easy to follow.
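      As a side note, AssetReference fields are not the only entry point: the Addressables API can also load assets by address string. Below is a small hedged sketch of that variant; "skybox_night" is a made-up address used purely for illustration:

```csharp
using UnityEngine;
using UnityEngine.AddressableAssets;
using UnityEngine.ResourceManagement.AsyncOperations;

// Sketch: load an addressable material by its address string
// and release the handle when this component goes away.
public class AddressLoadingExample : MonoBehaviour
{
    private AsyncOperationHandle<Material> _handle;

    private void Start()
    {
        // "skybox_night" is an invented address for this example.
        _handle = Addressables.LoadAssetAsync<Material>("skybox_night");
        _handle.Completed += handle => RenderSettings.skybox = handle.Result;
    }

    private void OnDestroy()
    {
        if (_handle.IsValid())
        {
            Addressables.Release(_handle);
        }
    }
}
```

      String addresses are handy for content you do not want to reference from any serialized field at all, e.g. assets chosen at runtime.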
      Enough with explanations. It is time you and I saw this in action.
      If you would be so kind as to perform the following steps:
      1. In the Addressables window, cook the content (Build Player Content)
      2. Make a build for a platform of your choice
      3. Run it and connect the (memory) profiler to it
      4. Protect your jaw from dropping
      Level 2 (Unity Addressables) - Build Player Content
        Level 2 Asset Management (Unity Addressables) - Memory Profiler
          Isn't asset cooking delicious?
      I like happy profilers. And what you just saw is the happiest profiler the world has ever seen. A satisfied profiler means several things. For one, it means happy players playing your game on a Nokia 3210. It also means happy producers. And as for you, it means a happy wallet.
      This is the power of the Addressables system.
      Addressables comes with little overhead for the team. On one side, programmers will have to support asynchronous workflows (easy-peasy with coroutines). Also, designers will have to learn the possibilities of the system, e.g. addressable groups, and gather experience to make intelligent decisions. Finally, IT will be delighted to set up the infrastructure to deliver the assets over the network, if you opt to host them online.
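      For teams that prefer async/await over coroutines, Addressables handles also expose a Task you can await. A brief sketch under that assumption (the address string is invented for illustration):

```csharp
using UnityEngine;
using UnityEngine.AddressableAssets;

// Sketch: the same load expressed with async/await instead of a
// coroutine, via the handle's Task property.
public class AsyncSkyboxLoader : MonoBehaviour
{
    public async void LoadSkybox()
    {
        // "skybox_day" is a made-up address used purely as an example.
        var handle = Addressables.LoadAssetAsync<Material>("skybox_day");
        RenderSettings.skybox = await handle.Task;
    }
}
```

      Which style you pick is a team convention; both end up awaiting the same asynchronous operation.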
      I have to congratulate you. Let me tell you what we have accomplished:
      - Appropriate memory management
      - Faster initial loading times
      - Faster install times and a reduced in-store app size
      - Higher device compatibility
      - An asynchronous architecture
      - An open door to storing this content online → decoupling data from code
      I would be proud of such a gain. It's for sure a good return on investment.
      Oh, and make sure to mention your experience with Addressables in job interviews.
      INTERMEDIATE: Instancing and reference counting. Read my blog post for more information on this topic.
      OPTIONAL: Alternative loading strategies. Read my blog post for more information on this topic.
      - Addressables-based asset management scales well
      - Addressables introduces asynchronous behavior
      - Do not forget to cook content after changes or you'll give your game a pinkish tint!

      Level 3 Asset Management (??) - Content Network Delivery
      In the previous section, we achieved the biggest bang for the buck. We leveled up our skills by moving from a traditional asset management system to an addressables-based workflow. This is a huge win, as a small time investment gave your project room to scale in assets while keeping memory usage low. That accomplishment made you step up to level 2, congrats! However, one question is yet to be answered:
      Is that it?
      No. We have barely scratched the surface of Addressables; there are further ways to improve your project with this game-changer of a package.
      Of course, you do not have to memorize every detail of Addressables, but I highly suggest you get an overview of them, because down the road you are likely to encounter further challenges and you will be thankful to have read a bit further. That's why I prepared an extra short guide for you.
      There you will learn about the following aspects:
      - The Addressables window: the details matter
      - Addressables profiling: don't let a (memory) leak ruin your day
      - Network delivery: reduce the time-to-play of the user experience
      - Build pipeline integration
      - Practical strategies: speed up your workflow, trash your 10-minute coffee breaks
      And more importantly, answer questions such as:
      - What is the hidden meaning behind Send Profiler Events?
      - How useful is the AddressableAssetSettings API?
      - How do I integrate all this with the BuildPlayerWindow API?
      - What's the difference between Fast Mode, Virtual Mode and Packed Mode?
      In order to grab the level 3 guide, check out my blog post.
    • By yaboiryan
      I am trying to make a map editor with Assembly and C...
      Everything used to work, but now, I get these errors:
      cd C:\Editor
      wmake -f C:\Editor\Editor.mk -h -e C:\Editor\Editor.exe
      wasm SOURCE\MEASM.ASM -i="C:\WATCOM/h" -mf -6r -d1 -w4 -e25 -zq
      SOURCE\MEASM.ASM(2): Error! E032: Syntax error
      SOURCE\MEASM.ASM(3): Error! E032: Syntax error
      SOURCE\prologue.mac(1): Error! E032: Syntax error
      SOURCE\prologue.mac(1): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(4): Error! E535: Procedure must have a name
      SOURCE\prologue.mac(4): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(6): Error! E523: Segment name is missing
      SOURCE\prologue.mac(6): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(11): Error! E535: Procedure must have a name
      SOURCE\prologue.mac(11): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(14): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(14): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(15): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(15): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(19): Error! E506: Block nesting error
      SOURCE\prologue.mac(19): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(22): Error! E535: Procedure must have a name
      SOURCE\prologue.mac(22): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(24): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(24): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(25): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(25): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(28): Error! E535: Procedure must have a name
      SOURCE\prologue.mac(28): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(30): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(30): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(31): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(31): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(32): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(32): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(33): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(33): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(34): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(34): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(37): Error! E535: Procedure must have a name
      SOURCE\prologue.mac(37): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(39): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(39): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(40): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(40): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(41): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(41): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\prologue.mac(42): Error! E525: Data emitted with no segment
      SOURCE\prologue.mac(42): Note! N591: included by file SOURCE\MEASM.ASM(4)
      SOURCE\MEASM.ASM(5): Error! E032: Syntax error
      SOURCE\MEASM.ASM(6): Error! E032: Syntax error
      SOURCE\MEASM.ASM(9): Error(E42): Last command making (C:\Editor\MEASM.obj) returned a bad status
      Error(E02): Make execution terminated
      Execution complete
      I haven't used assembly in a while, and I don't really remember how to fix these errors.
      Here is the main assembly file:
      IDEAL
      JUMPS
      include "prologue.mac"
      P386            ; 386 specific opcodes and allows all the necessary crap...
      P387            ; Allow 386 processor
      MASM
      .MODEL FLAT     ;32-bit OS/2 model
      .CODE
      IDEAL
      PUBLIC SetPalette2_
      PUBLIC SetVGAmode_
      PUBLIC SetTextMode_
      PUBLIC inkey_
      PUBLIC PutHex_
      ;==============================================================================
      ; void SetPalette2(unsigned char *PalBuf,short count);
      ;==============================================================================
      Proc SetPalette2_ near
              push esi
              mov esi,eax
              mov cx,dx
              mov bx,0
              cld
              mov dx,3C8H
      sp210:
              mov al,bl
              out dx,al
              inc dx
              lodsb
              out dx,al
              lodsb
              out dx,al
              lodsb
              out dx,al
              dec dx
              inc bx
              loop sp210
              pop esi
              ret
      endp
      ;==============================================================================
      ; void SetVGAmode(void);
      ;==============================================================================
      Proc SetVGAmode_ near
              push ebp
              mov ax,13h
              int 10h         ; Set 320x200x256
              pop ebp
              ret
      endp
      ;==============================================================================
      ;
      ;==============================================================================
      Proc SetTextMode_ near
              push ebp
              mov ax,3
              int 10h
              pop ebp
              ret
      endp
      ;==============================================================================
      ;
      ;==============================================================================
      Proc inkey_ near
              xor eax,eax
              mov ah,1        ;see if key available
              int 16h
              jz ink080       ;nope
              xor ax,ax
              int 16h
              jmp short ink090
      ink080:
              xor ax,ax
      ink090:
              ret
      endp
      ;==============================================================================
      ;
      ;==============================================================================
      Proc HexOut_ near
              and al,15
              cmp al,10
              jb short hex010
              add al,7
      hex010:
              add al,'0'
              stosb
              ret
      endp
      ;==============================================================================
      ; void PutHex(char *buf,UINT mCode);
      ;==============================================================================
      Proc PutHex_ near
              push edi
              mov edi,eax
              mov eax,edx
              shr al,4
              call HexOut_
              mov eax,edx
              call HexOut_
              xor al,al
              stosb
              pop edi
              ret
      endp
      end
      This file includes another file that I found on the Internet a while ago called "Prologue.MAC".
      Here is the code for it:
      P386

      Macro SETUPSEGMENT
              SEGMENT _TEXT PARA PUBLIC 'CODE'
              ASSUME CS:_TEXT
      Endm

      macro PENTER STORAGE            ;; 17 - Enter a procedue with storage space
              ;; Procedure enter, uses the 286/386 ENTER opcode
              push ebp
              mov ebp,esp
              IF STORAGE
              sub esp,STORAGE
              ENDIF
              ENDIF
      endm

      macro PLEAVE                    ;; 18 - Exit a procedure with stack correction.
              mov esp,ebp
              pop ebp
      endm

      macro PushCREGS                 ;; 19 - Save registers for C
              push es
              push ds                 ;The Kernel is responsible for maintaining DS
              push esi
              push edi
              cld
      endm

      macro PopCREGS                  ;; 20 - Restore registers for C
              pop edi
              pop esi
              pop ds                  ;The Kernel is responsible for maintaining DS
              pop es
      endm
      Can I get an idea of what I could do to fix these errors?
      I am using the OpenWatcom compiler. I *have* gotten Watcom to work before. In fact, this is one of the first times it has not worked for me...
      Thanks in advance!
