
DX11 Instancing worth it for rendering text in a 2D game?


Hey guys,

 

I have a DX11 text-rendering implementation where I have a bitmap with all the character glyphs, which I use to build a list of quads with the position and UV of each glyph corresponding to the string I need to print. I then use one draw call to render the text into a texture and can draw that texture wherever I need. Simple functions, simple shaders. I can also easily parse for special characters, do linefeeds, limit the width for text boxes, change color, etc.
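For reference, that quad-building step can be sketched on the CPU side roughly like this. This is a minimal illustration, not the poster's actual code; the `Glyph` metrics table and all names are hypothetical stand-ins for whatever the atlas generator produces:

```cpp
#include <string>
#include <vector>

// Hypothetical per-glyph metrics, built when the bitmap atlas is generated.
struct Glyph {
    float u0, v0, u1, v1;  // UV rect of the glyph in the atlas
    float width, height;   // size of the glyph quad in pixels
    float advance;         // horizontal pen advance in pixels
};

struct Vertex {
    float x, y;  // screen-space position
    float u, v;  // atlas UV
};

// Four vertices per character; a single DrawIndexed call then renders the
// whole string, as described in the post (no per-character draw calls).
std::vector<Vertex> BuildTextQuads(const std::string& text,
                                   const std::vector<Glyph>& glyphs,
                                   float penX, float penY) {
    std::vector<Vertex> verts;
    verts.reserve(text.size() * 4);
    for (unsigned char c : text) {
        const Glyph& g = glyphs[c];
        verts.push_back({penX,           penY,            g.u0, g.v0});  // top-left
        verts.push_back({penX + g.width, penY,            g.u1, g.v0});  // top-right
        verts.push_back({penX,           penY + g.height, g.u0, g.v1});  // bottom-left
        verts.push_back({penX + g.width, penY + g.height, g.u1, g.v1});  // bottom-right
        penX += g.advance;  // linefeeds and text-box width limits would hook in here
    }
    return verts;
}
```

Special-character parsing, color changes, etc. would all happen in this loop before the vertex data is uploaded.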

 

Today I sat down and decided to tackle the long-overdue task of converting the rendering to use instancing, but it just hit me that it might not provide the great gains I previously thought it would. Most characters are of different sizes, so I would either have to handle resizing the geometry, or maybe update the bitmap, draw a fixed-size quad, and then reposition it based on the real width of the character. Or something else. This seems to break the K.I.S.S. principle.

 

Anyways, in the case where I do not issue more than 1 draw call to render the text, would there be any real benefits of using instancing to render text?

 

Thanks!


There's actually a much easier way to do it. When you have your final scene ready to present, you can get the Win32 device context - apologies, but it's been a while since I used DX 11 and I can't remember the specifics. A little perusal of the various interfaces and you'll find it; there's a GetDC method. You can get the device context (DC) from DirectX, pass that to any Windows GDI calls desired to use standard font-drawing stuff - sizing, colors, you name it - then just present the scene as normal. It saves a lot of problems.


A quick test to check if any optimizations are needed - just fill the whole screen with text, and see if there is a significant slowdown.

 

Great idea, I'll definitely do that. 

 

In addition to the above advice, you're almost certainly not drawing enough text for it to be a bottleneck worth addressing.

 

Yeah, you're probably right. Besides, rendering the text into a texture means the cost of parsing that text is incurred only once. It's probably cheaper to then just render that texture until it's not needed anymore. I'll check that too.

 

@Ryan_001, thanks a lot for the link to that presentation. I found the video of that presentation on the GDC website: http://www.gdcvault.com/play/1020624/Advanced-Visual-Effects-with-DirectX along with many other great ones too! Very informative.

 

Thanks guys.

 

 


Instancing does not perform well for meshes with small polygon counts -- such as a single quad. You're better off not using instancing for rendering a list of quads.

It performs identically on the GPU and orders of magnitude faster on the CPU. How is that not better?



If you're comparing one instanced draw-call for all quads vs one draw-call for each quad, then sure there's a massive difference in CPU perf...
But you shouldn't use one draw-call per quad, you should use a single indexed draw-call for all quads, in which case the CPU performance is the same.
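The single indexed draw-call works because the index pattern for a quad simply repeats with a +4 vertex offset per quad, so the index buffer can be built once (for the maximum expected character count) and reused for every string. A sketch of that, with illustrative names:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// For N quads stored as 4 vertices each, emit the two-triangle index
// pattern {0,1,2, 2,1,3} offset by 4 per quad. Built once, reused forever;
// only the vertex buffer changes when the text changes.
std::vector<uint16_t> BuildQuadIndices(std::size_t quadCount) {
    static const uint16_t pattern[6] = {0, 1, 2, 2, 1, 3};
    std::vector<uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q)
        for (uint16_t i : pattern)
            indices.push_back(static_cast<uint16_t>(q * 4 + i));
    return indices;
}
```

With 16-bit indices this caps out at 16384 quads per buffer; switch to 32-bit indices if you need more.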
 
When I said 'does not perform well' I was referring to the GPU side -- instancing does incur a cost on the GPU, especially for meshes with a small number of vertices. The alternative I mentioned will be faster in GPU time and equal in CPU time (one draw call, one buffer of per-quad attributes).

 

Vertex Shader Tricks by Bill Bilodeau (linked above by Ryan) has the gist of it -- Drawing quads as an indexed draw-call is much faster than an instanced draw-call in terms of GPU time:

 

(Benchmark graphs from the presentation comparing instanced vs. indexed quad draw times omitted.)

I've seen this in practice too -- we saved a measurable amount of milliseconds by converting our impostor rendering system (for drawing a crowd of 100k characters) from using instanced quads to a large index list of quads -- and we didn't even do it the ideal way of having one vertex per quad (we still used the simple method of four verts per quad in the buffer and a standard VS and IA config).

Side notes from the above graphs: NV GPUs seem especially sensitive to this "small-mesh instancing overhead" (the penalty goes away for meshes with ~500 verts, IIRC), and NV GPUs are great at using the GS stage.

Edited by Hodgman


Obviously the highly specific technique for that use case is going to be faster than the generic instancing one that doesn't allow further optimization, and you should use it whenever possible, but that's not a valid comparison to show instancing overhead.


I think the flaw in the thinking here is only measuring at the front-end.  I know that I certainly used to fall into that trap years ago.

 

The way it looks in this case is: (a) you measure the number of vertices used for an indexed or non-indexed draw, and (b) you measure the number of vertices used for a GS or instanced draw.  You see that (b) is significantly lower than (a), and therefore you assume that (b) must be faster than (a).

 

The reality is that vertex counts are only part of what contributes to performance; there are other factors, and depending on one's use case, vertex counts may not even be relevant.

 

This can be counter-intuitive; there's a whole "anti-bloatware" culture based on the premise that using more memory is bad, using less must be good, and this kind of metric just flies completely in the face of it.


Obviously the highly specific technique for that use case is gonna be faster than the generic instancing one that doesn't allow further optimization and you should use it whenever possible, but that's not a valid comparison to show instancing overhead.

The statement you wanted me to clarify was: Instancing does not perform well for meshes with small polygon counts. You're better off not using instancing for rendering a list of quads.  :wink:

I assumed we were both talking about the performance in this specific situation of text rendering, and that specific bit of advice, not the general case :P

 
Rendering a list of quads (e.g. text or billboards) is the extreme case, but the same performance pitfall applies to any low-poly model. E.g. if instancing few-hundred-poly models, you may find that old-school pre-HW-instancing techniques (or modern techniques that appeared after the IA stage disappeared from HW) are actually still faster than HW instancing. At around 1k+ polys you'll likely see no real performance overhead from instancing, making it useful.


Instancing does not perform well for meshes with small polygon counts -- such as a single quad. You're better off not using instancing for rendering a list of quads.
You can minimize data by sending only the quad's center X/Y coordinate and its width/height (instead of four x/y coordinates), along with a special VS that manually reads the vertex attributes from an SRV (instead of using the IA to read them automatically); alternatively you can use the GS to convert a single vertex into four.
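In the SRV-pull variant, the shader derives both the quad index and the corner from SV_VertexID alone. Here is the arithmetic such a hypothetical VS would perform, written in C++ so it can be checked (names are illustrative, and the actual SRV fetch is elided):

```cpp
#include <cstdint>

// Per-quad record as it might be stored in the SRV: one per character.
// (The post only specifies center + size; field names are made up.)
struct QuadData { float cx, cy, halfW, halfH; };

struct Corner { float x, y; };

// With the shared index pattern {0,1,2, 2,1,3} offset by 4 per quad, the VS
// sees vertex ids of the form 4*quad + corner and expands them like so:
Corner ExpandVertex(uint32_t vertexID, const QuadData& q) {
    uint32_t quad   = vertexID / 4;  // which character -- indexes the SRV
    uint32_t corner = vertexID % 4;  // which of the 4 corners
    (void)quad;                      // SRV lookup via 'quad' elided here
    float sx = (corner & 1) ? +1.0f : -1.0f;  // corners 1,3 are right side
    float sy = (corner & 2) ? +1.0f : -1.0f;  // corners 2,3 are bottom side
    return {q.cx + sx * q.halfW, q.cy + sy * q.halfH};
}
```

The HLSL version is the same integer math with `SV_VertexID` as input and a `StructuredBuffer` load where the comment marks the elided fetch.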


I have to agree here, because I've done exactly what the OP originally did, but I felt no compulsion to change to instancing, simply because I don't gain anything from the move.

1. I actually calculate the quad in screen space and store all the geometry on the CPU. This calculation only occurs when my string is updated.
2. I don't even bother passing a point-only vertex to a GS to expand it, because the difference in data size is, yes, an order of magnitude, but it's still just bytes.
3. Instancing does give me some options, like rotating quads, etc. And adding instancing later should be trivial for this.

So in the end, I would use instancing not for instancing itself but for things like animating the quads. I found no performance benefit with such small data sets.

Anyways, in the case where I do not issue more than 1 draw call to render the text, would there be any real benefits of using instancing to render text?

 

In my renderer I have 1 draw call for text rendering (1 draw per pass: the score pass is 1 draw call, the stat pass is a second one).

I use GS for quad generation, texture has several fonts, and each letter can be in different color.

The VS passes through 1 point per character: {uv, screen coord, and color}.

I do not use rotations or other transforms yet.

I can use mono fonts and normal fonts and combine them in one draw call.

 

Just measured timings in NSight for the stat pass:

for 38 characters: 36 microseconds,

for 122 characters: 52 microseconds,

for 300 characters: 66 microseconds.

 

The other benefit for me is that I need to update the instance buffer only once per draw call, and only if my text has changed.
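That update-only-on-change policy is just a dirty check. A hypothetical wrapper to illustrate the idea (the upload counter stands in for rebuilding the quads and the Map/Unmap of the dynamic buffer):

```cpp
#include <string>

// Re-upload vertex/instance data only when the string actually differs;
// otherwise the GPU keeps drawing from the buffer uploaded last time.
class TextBatch {
public:
    // Returns true if an upload actually happened this call.
    bool SetText(const std::string& text) {
        if (text == cached_) return false;  // unchanged: reuse GPU buffer
        cached_ = text;
        ++uploads_;  // here: rebuild quads + Map/Unmap the dynamic buffer
        return true;
    }
    int UploadCount() const { return uploads_; }
private:
    std::string cached_;
    int uploads_ = 0;
};
```

For a score or stats display that changes a few times per second, this drops the per-frame CPU cost to a string compare.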

 

One more thought: probably rendering more than 1000 characters per frame is not very common, so examples with 500k sprites are not so relevant for this topic.

Edited by Happy SDE


I just did some tests while I rewrote my text box parsing and rendering. This is on a Lenovo Ideapad Y560 laptop, which has a Radeon HD5730 video card.

 

Rendering a screen full of text at 1920x1080 (external monitor):

15750 total characters for 63000 vertices,

using a single DrawIndexed call,

with a simple pixel shader that does transparency.

 

The time to render the text varies between 2 and 4 microseconds. That is with no instancing, sending all those vertices to the GPU. I only update the buffer if the text changes in some way.

 

Strangely enough, if I render 200 characters, I get the same timing, 2 to 4 microseconds. This might be an inaccuracy of the high-performance counter, but I'm not sure. What I know is that it's fast enough for my needs.



Probably you are measuring CPU time (the time to prepare your commands).

GPU timings are usually measured with GPU profilers or APIs such as NSight, D3D11 GPU timestamp queries, or GPUView.
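For reference, a D3D11 timestamp query pair (D3D11_QUERY_TIMESTAMP around the draw, inside a D3D11_QUERY_TIMESTAMP_DISJOINT bracket) gives you begin/end tick values plus the tick frequency; the query setup is omitted here, but the conversion to wall-clock time is just:

```cpp
#include <cstdint>

// beginTicks/endTicks come from two D3D11_QUERY_TIMESTAMP queries;
// ticksPerSecond is the Frequency field of D3D11_QUERY_DATA_TIMESTAMP_DISJOINT.
// Only valid when that struct's Disjoint field is FALSE.
double GpuTicksToMicroseconds(uint64_t beginTicks, uint64_t endTicks,
                              uint64_t ticksPerSecond) {
    return (endTicks - beginTicks) * 1e6 / static_cast<double>(ticksPerSecond);
}
```

This is how numbers like the 36-66 microsecond timings quoted earlier in the thread are obtained on the GPU side, as opposed to QueryPerformanceCounter on the CPU side.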


I think you're probably better off using a batching method for text. The only real difference is the geometry, which should be perfectly fine for building a massive index buffer for rendering. Even better, you can store this index data and then toss out the string if you're not going to change it.

 

Also, if you're rendering 200 characters in 2 microseconds, I think you're doing well. Remember that a microsecond is a thousandth of a millisecond.

 

Especially when you consider that if you're doing a GUI with scroll bars or something, more than likely you have a scissor rect over the window to clip the junk you don't care about.

Edited by Tangletail


 

Strangely enough, if I render 200 characters, I get the same timing, 2 to 4 microseconds. This might be an innacuracy of the highperf counter, but I'm not sure. What I know is that its fast enough for my needs.

 

Probably you are measuring CPU time (time to prepare your commands).

GPU timings are usually measured by GPU profilers such as NSight/DX GPU query/GPUView/..

 

Doh, very true! I'll give it a look-see.

 

Thanks guys. I got things working pretty fast now. If this becomes a problem, then I'll investigate.

