Hardware instancing speed check

8 comments, last by Matias Goldberg 8 years, 6 months ago

I've just implemented hardware instancing in my engine and I wanted to know whether my results are on par or not - I appreciate this can be a subjective question.

In full release/full screen mode I'm rendering 100 instances of a 30k triangle tree. With nothing else on the screen, my frame time is approximately 35ms, or roughly 30FPS. That's 3 million triangles each frame, meaning approximately 90 million triangles per second.

I'm running this on my 2015 Retina MacBook Pro at 2560 x 1440, with a 2GB NVIDIA GT 750M, 16GB RAM and a 2.8GHz quad-core Intel i7 processor.

Seems alright doesn't it...?


You don't say what your FPS is without instancing or how you manage the instances, so this is just a benchmark result with no other information.

Trees have a considerable amount of overdraw which is pixelshader and ROP abuse, so as a subjective opinion, I would say getting 35ms / 30 FPS at a reasonable resolution on a mobile GPU is quite OK. Probably no reason to complain.

In general, when you're not sure whether your draw calls are "OK", see what you get when you render to a much smaller viewport (say, 16x16, or even 1x1). If you still only get 30 FPS, your instancing (or your draw calls / batching) sucks. If you suddenly get 400-500 FPS, you're good :)
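The small-viewport test can be turned into a rough rule of thumb. The sketch below is only an illustration: the function names and the 20% / 80% thresholds are assumptions of mine, not something from the posts or from any profiler. It compares the frame time at full resolution with the frame time for the same scene in a tiny viewport and reports how much of the frame was spent on per-pixel work.

```cpp
// Fraction of the full-resolution frame time that disappears when the same
// scene is drawn into a tiny viewport. What remains is CPU submission and
// vertex work, since both are independent of viewport size.
double pixelBoundFraction(double fullViewportMs, double tinyViewportMs) {
    if (fullViewportMs <= 0.0) {
        return 0.0;
    }
    double savedMs = fullViewportMs - tinyViewportMs;
    if (savedMs < 0.0) {
        savedMs = 0.0;  // measurement noise: the tiny viewport was not faster
    }
    return savedMs / fullViewportMs;
}

enum class Verdict { NotPixelBound, Mixed, PixelBound };

// Thresholds are arbitrary illustrative cut-offs (assumption).
Verdict classify(double fullViewportMs, double tinyViewportMs) {
    const double fraction = pixelBoundFraction(fullViewportMs, tinyViewportMs);
    if (fraction < 0.2) {
        return Verdict::NotPixelBound;  // draw calls / batching / vertex work dominate
    }
    if (fraction > 0.8) {
        return Verdict::PixelBound;     // overdraw, pixel shader or ROP cost dominates
    }
    return Verdict::Mixed;
}
```

Going from 35ms to roughly 2ms (the "400-500 FPS" case) would land in `Verdict::PixelBound`, while staying near 35ms would land in `Verdict::NotPixelBound`.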

You don't say what your FPS is without instancing or how you manage the instances, so this is just a benchmark result with no other information.

Well that's a good point!

So, interestingly, I switched off instancing and drew the same 100 models with the same texture under the same circumstances, and it runs slightly faster: 30ms frame time vs 35ms. I think this is because the CPU is practically idle during this test, so doing 100 DIPs (DrawIndexedPrimitive calls) whilst nothing else is happening isn't causing any trouble. I wonder whether instancing will pay off once the CPU is under load with all the game logic.

I must admit, I was under the impression hardware instancing (I'm using the DX9 API) would dramatically improve rendering times for large batches of objects. I just tried it again with 500 trees and using instancing, it's noticeably slower than without (200ms frame time with instancing vs 125ms without instancing).

I don't think I'm doing anything wrong. I'm not filling my instance array each frame; I only do it once. Everything else is identical.

Instancing is a CPU-side optimization. If your game is GPU bound then it won't help. It might eventually be a good optimization if you become bound by the number of drawcalls on the CPU.
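A toy cost model makes this concrete. It is a deliberately simplified sketch: it assumes the CPU and GPU work in parallel so the frame takes as long as the slower of the two, and the per-draw-call cost and GPU time used with it are invented numbers, not measurements.

```cpp
#include <algorithm>

struct FrameCost {
    double cpuMsPerDrawCall;  // hypothetical CPU cost of issuing one draw call
    double gpuMs;             // hypothetical GPU time to render the whole scene
};

// Frame time under the assumption that CPU submission and GPU rendering
// overlap, so the slower side sets the pace.
double frameTimeMs(const FrameCost& cost, int drawCalls) {
    const double cpuMs = cost.cpuMsPerDrawCall * drawCalls;
    return std::max(cpuMs, cost.gpuMs);
}
```

With made-up values of 0.02ms per call and 30ms of GPU work, 100 separate draws cost 2ms of CPU time that is hidden behind the GPU, so collapsing them into a single instanced draw changes nothing. Only when the call count pushes CPU time past GPU time (3,000 calls in this model) does reducing the number of calls shorten the frame.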


I just tried it again with 500 trees and using instancing, it's noticeably slower than without (200ms frame time with instancing vs 125ms without instancing).

Instancing should be slightly more expensive for GPU, but not much. Can you use a GPU profiler to get more details to see if this is an increase in GPU time?

Are the instanced and non-instanced versions drawing trees in exactly the same spots, and the camera is in exactly the same position?

Your measuring metrics are very poor.

You're just considering number of triangles and framerate, whereas any meaningful evaluation would require:

  • Vertex size in bytes / vertex description
  • Number of vertex attributes
  • Number of vertex buffers
  • Number of vertices
  • Number of triangles
  • Complexity of vertex shader. Number of uniforms
  • Number of interpolants exported to pixel shader
  • Complexity of pixel shader
  • How many pixels the average triangle occupies
  • Whether they're rendered front to back or back to front
  • Frametime in milliseconds (rather than FPS)
  • A bit of source code to get a rough idea of some of the above (e.g. complexity of shaders, etc)
  • HW and OS you're running on

With the information provided in your post we have absolutely no idea if your numbers are good or not.

I wasn't after an exact science-based response, hence my "I appreciate this can be a subjective question" and "Seems alright doesn't it...?" lines. I think I've supplied enough information for someone to say "yes that seems fair" or "no, I can get 500 times more than that on my similar machine". I've implied that I'm just rendering 100 simple trees; I didn't want to bore everyone or waste anyone's time with reams of shader code or other info to scan through.

Therefore, I feel my measuring metrics are just fine for this level of post. Your reply appears to be copy/pasted from another post as you have stated you require the frame time in ms rather than FPS - which I have supplied (along with HW).

The responses I've received are great for what I need right now.

Are the instanced and non-instanced versions drawing trees in exactly the same spots, and the camera is in exactly the same position?

Yes, same spot, same camera position/direction, etc.

This is interesting information, thanks.

I wasn't after an exact science-based response, hence my "I appreciate this can be a subjective question" and "Seems alright doesn't it...?" lines. I think I've supplied enough information for someone to say "yes that seems fair" or "no, I can get 500 times more than that on my similar machine". I've implied that I'm just rendering 100 simple trees; I didn't want to bore everyone or waste anyone's time with reams of shader code or other info to scan through.

The problem is that you can get MAJOR differences.

Rendering 30M triangles with 7M vertices is very different from rendering 30M triangles with 90M vertices (very high vertex reuse vs 3 unique vertices per triangle). We're talking about roughly 13 times as much data to be processed by the vertex shader.

Sending the world-view-projection matrix as a single pre-multiplied matrix is not the same as sending each matrix separately and concatenating them in the shader. Likewise, sending the world, view & projection matrices per tree is not the same as sending the view & projection matrices once per camera and only the world matrix per tree.
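The two approaches give the same positions but differ in how much work is repeated per vertex. Here is a small CPU-side sketch, using a hand-rolled 4x4 matrix type and the row-vector convention (v * M). All names are illustrative, and the translation and scaling matrices are stand-ins for real world, view and projection matrices. In an instanced setup one would more likely upload the view-projection product once per camera and supply each tree's world matrix per instance.

```cpp
#include <array>

using Mat4 = std::array<std::array<float, 4>, 4>;
using Vec4 = std::array<float, 4>;

// Standard 4x4 matrix product a * b.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 result{};  // value-initialised to all zeros
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                result[i][j] += a[i][k] * b[k][j];
    return result;
}

// Row vector times matrix: v * m.
Vec4 transform(const Vec4& v, const Mat4& m) {
    Vec4 result{};
    for (int j = 0; j < 4; ++j)
        for (int k = 0; k < 4; ++k)
            result[j] += v[k] * m[k][j];
    return result;
}

// Translation matrix for row vectors (offset stored in the last row).
Mat4 translation(float x, float y, float z) {
    Mat4 m{};
    m[0][0] = m[1][1] = m[2][2] = m[3][3] = 1.0f;
    m[3][0] = x; m[3][1] = y; m[3][2] = z;
    return m;
}

// Uniform scale, used here only as a stand-in for a projection matrix.
Mat4 scaling(float s) {
    Mat4 m{};
    m[0][0] = m[1][1] = m[2][2] = s;
    m[3][3] = 1.0f;
    return m;
}

// Three matrix applications per vertex (what concatenating in the shader costs).
Vec4 transformSeparately(const Vec4& v, const Mat4& world,
                         const Mat4& view, const Mat4& proj) {
    return transform(transform(transform(v, world), view), proj);
}

// One matrix application per vertex, using a product computed once up front.
Vec4 transformCombined(const Vec4& v, const Mat4& worldViewProj) {
    return transform(v, worldViewProj);
}
```

Computing `multiply(world, multiply(view, proj))` once on the CPU and calling `transformCombined` per vertex does a third of the per-vertex matrix work of `transformSeparately` for the same result.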

Having many triangles that occupy less than a pixel can cause up to a 4x slowdown.

I'm not talking about pasting huge loads of shader code (you're right in that no one's gonna read that), but at least give us a rough idea of the complexity. Outputting a fixed colour is the simplest pixel shader, then there's a diffuse texture with a simple Blinn-Phong BRDF... and then there's a shader that does normal mapping and specular mapping, and uses a GGX BRDF with Fresnel to light it.

And last but not least, instancing at 100 instances is barely going to make a difference. If you see a noticeable performance difference compared with an older implementation that didn't use instancing, then you're (or were) doing it wrong.
Instancing will fix your CPU bottlenecks, which often begin to take a toll once you exceed about 1,000 draw calls (it depends on the API; e.g. in DX11 you can make 50k draw calls like nothing if you're careful enough; you've tagged this thread as D3D9, so I assume you're using D3D9, and that API often begins to show its problems between 1k and 3k draws).

I guess that in trying not to drag everyone through my code and various shading and rendering parameters, I may have given slightly too little information. Still, the info that has been posted back is really what I was after, and it's all extremely useful (including yours of course).

For the record, my shader is very simple, just a Gouraud-shaded diffuse texture, but the model I'm using does not make good use of vertex sharing. I'll try it again with another model.

I had a look at the links in your signature and saw the video of one of your games. That's pretty impressive if you've done all that by yourself. It was from 2012, did you complete it?
