
mind in a box

Member Since 20 Apr 2010
Offline Last Active Oct 25 2015 05:31 PM

Topics I've Started

Switching the PixelShader gains performance, even without any pixels on screen.

15 October 2015 - 04:51 AM

Hi everyone!


I've been profiling my application lately, and I ran into something weird that I hope one of you can explain to me.


In my scene, I have a large world mesh and about 5000 decoration objects, all using the same pixel shader, which simply does one texture lookup. All in all, this comes out to about 1000 draw calls and about 105fps on my laptop. There are no post-process effects or similar techniques.


The weird thing is, when I switch to a very simple pixel shader (using Intel GPA), my FPS counter ramps up to ~150, even though next to no pixels are drawn on screen at all! (A small health bar in the corner, but that's it.)

Applying a 1x1 scissor-rect doesn't give the same effect.
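
In code terms, the two experiments would look something like this (a minimal sketch; context is the immediate ID3D11DeviceContext, and in my case the shader swap was done through GPA rather than by hand):

#include <d3d11.h>

void SetupExperiments(ID3D11DeviceContext* context)
{
	// Experiment 1: a null pixel shader skips all pixel shading but
	// keeps the full vertex/raster workload.
	context->PSSetShader(nullptr, nullptr, 0);

	// Experiment 2: a 1x1 scissor rect. Note this only clips if the
	// bound rasterizer state has ScissorEnable = TRUE; otherwise the
	// rect is silently ignored.
	D3D11_RECT rect = { 0, 0, 1, 1 };
	context->RSSetScissorRects(1, &rect);
}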


How can switching the pixel shader improve performance by so much when there are hardly any rendered pixels on screen at all?


Thanks in advance!

About dynamic vertex pulling

07 October 2015 - 04:54 AM

Hi everyone!


I'm planning to implement the technique described in this article, which looks fine by itself, but I have some questions about the implementation details.


First of all, the general idea seems to be that you have one big vertex buffer and one big index buffer to work with. You then put every mesh you want rendered in there, and store the offsets and index counts in another data structure, which goes together with the instance data into another buffer.

Then all you need to do is issue a call to something like DrawInstanced with the maximum index count of any mesh in the buffer, and walk the instance-data buffer to fetch the actual vertex data from the big buffers.

If a mesh uses fewer indices than we told the draw call, the article says one should just pad with degenerate triangles and keep an eye on the vertex counts.
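
Just to check my understanding of the draw side, a minimal sketch (the names maxIndicesPerMesh and instanceCount are mine, not from the article):

#include <d3d11.h>

// One instanced draw for every mesh in the merged buffers. The vertex
// shader is expected to use SV_InstanceID/SV_VertexID to read the
// per-instance offsets and pull the actual vertex data itself.
void DrawAllMeshes(ID3D11DeviceContext* context,
                   ID3D11ShaderResourceView* instanceDataSRV,
                   UINT maxIndicesPerMesh, UINT instanceCount)
{
	context->VSSetShaderResources(0, 1, &instanceDataSRV);

	// Every instance gets the worst-case index count; meshes with
	// fewer indices pad the tail with degenerate triangles.
	context->DrawInstanced(maxIndicesPerMesh, instanceCount, 0, 0);
}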


Now, the article gives us a scenario of rendering a forest with different types of trees and LOD levels.

  • #1: Why even bother with LODs when we draw everything with the same vertex/index count anyway?
  • Idea: Use multiple instance buffers covering different ranges of vertex/index counts, and spend a few more draw calls instead of wasting time drawing padding vertices on the simple LOD levels.

The next problem is updating the instance buffer. Since we of course want frustum culling or moving objects when drawing a huge forest, we would need to update it every frame. The article suggests keeping a CPU copy of the buffer data and, if anything changes, simply copying everything over again.

  • #2: Wouldn't that have a huge performance impact if we have to copy thousands of matrices to the GPU every frame? Also, I'm pretty sure you would hit a GPU sync point doing this the naive way.
  • Idea: I haven't looked too deeply into them yet, but couldn't you update a single portion of the buffer using a compute shader, or even do the full frustum culling on the GPU? If not, are the map modes other than WRITE_DISCARD, where the data stays around, worth a shot for updating only single objects? Or do I just throw this into another thread, use double-buffering to help with the sync points (see the sketch below), and forget about it?
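
To illustrate the double-buffering idea (just a sketch; buffer creation is omitted, and InstanceData stands in for whatever the per-instance layout ends up being):

#include <d3d11.h>
#include <cstring>

struct InstanceData { float world[16]; }; // stand-in per-instance layout

// Alternate between two dynamic buffers so we never map the one the GPU
// may still be reading from, which avoids the sync point.
void UpdateInstanceBuffer(ID3D11DeviceContext* context,
                          ID3D11Buffer* buffers[2], unsigned frameIndex,
                          const InstanceData* visible, size_t count)
{
	ID3D11Buffer* target = buffers[frameIndex & 1];

	D3D11_MAPPED_SUBRESOURCE mapped;
	// WRITE_DISCARD hands back fresh memory, so the CPU never stalls.
	if (SUCCEEDED(context->Map(target, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
	{
		memcpy(mapped.pData, visible, count * sizeof(InstanceData));
		context->Unmap(target, 0);
	}
}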


The last question is regarding textures. I assume that in the article the textures are all the same size, which makes it easy to put them all into a TextureArray; that is what the author is doing, at least.

  • #3: But I don't know much about the textures I have to work with, other than that they are all sized by a power of two. I'm using D3D11 at the moment, so texture arrays are as far as I can get. The next problem is that my textures can be dynamically streamed in and out.
  • Idea: Make texture arrays for the different sizes and estimate how many slots we need for each size. For example, pre-allocate a TextureArray with 100 slots of size 1024², and if we ever break that boundary or a texture gets cached out, allocate more/fewer slots and copy the old array over (sketched below). Slow, but it would work. Then use the shader registers for the different arrays to access them.
  • The other thing I could do is allow this kind of rendering technique only for static level geometry, and try to keep its textures in memory the whole time.
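
The grow-and-copy part of that idea would look roughly like this (a sketch; a single mip level and a fixed 1024² RGBA8 format are assumed to keep it short):

#include <d3d11.h>

// Grow a texture array from oldSlices to newSlices by allocating a
// bigger array and copying the old slices over.
ID3D11Texture2D* GrowTextureArray(ID3D11Device* device,
                                  ID3D11DeviceContext* context,
                                  ID3D11Texture2D* oldArray,
                                  UINT oldSlices, UINT newSlices)
{
	D3D11_TEXTURE2D_DESC desc = {};
	desc.Width = 1024;
	desc.Height = 1024;
	desc.MipLevels = 1;
	desc.ArraySize = newSlices;
	desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
	desc.SampleDesc.Count = 1;
	desc.Usage = D3D11_USAGE_DEFAULT;
	desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;

	ID3D11Texture2D* newArray = nullptr;
	if (FAILED(device->CreateTexture2D(&desc, nullptr, &newArray)))
		return nullptr;

	// With one mip level, the subresource index is just the slice index.
	for (UINT slice = 0; slice < oldSlices; slice++)
		context->CopySubresourceRegion(newArray, slice, 0, 0, 0,
		                               oldArray, slice, nullptr);
	return newArray;
}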

Does anyone have better solutions or ideas for these problems, or can you give me some other useful input on this technique?


Thanks in advance!

Slow BSP-Tree visibility checking

24 July 2015 - 03:06 AM

Hi everyone!


I'm currently reworking my culling strategy, and I've come to a point where I just don't really know what to do from here.

My current project is to improve the renderer of an older game (an open-world RPG), and it's working very nicely so far, just not as fast as it could.

(Using D3D11, C++, if that matters)


This game basically has a static world mesh and objects stored in a BSP tree, so I already have that to work with. The BSP tree is built around the world mesh and breaks the world down into small AABBs of about 5 meters of in-game space, which then contain lists of the objects residing in that leaf.


Bigger objects may also be registered in multiple leaves, since the leaves are all roughly the same size.


That leads me to the following approach:

  • Walk the BSP tree and check visibility to cut off branches
  • When a leaf is reached, iterate over the objects in all of its lists (outdoor, indoor, small, etc.)
  • Check a flag that tells whether an object has been drawn before (big-object issue) and register it in the renderer (see the sketch below)
  • After that, copy all instance IDs to the GPU, where I remap each ID to a static world-matrix list in a structured buffer.
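
For reference, the leaf-collection step boils down to something like this (heavily simplified stand-in types; a frame-counter stamp is one way to do the drawn-before check without resetting flags every frame):

#include <vector>

struct GVobObject { unsigned lastFrameDrawn = 0; unsigned instanceID = 0; };
struct BspLeaf { std::vector<GVobObject*> Vobs; };

// Register every object in the leaf exactly once per frame. Stamping the
// object with the current frame number replaces a bool "drawn" flag, so
// nothing needs to be cleared between frames.
void CollectLeaf(BspLeaf* node, unsigned frame, std::vector<unsigned>& outIDs)
{
	for (GVobObject* vob : node->Vobs)
	{
		if (vob->lastFrameDrawn == frame)
			continue; // already registered through another leaf
		vob->lastFrameDrawn = frame;
		outIDs.push_back(vob->instanceID);
	}
}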

If I initialize everything only once, putting every object into the render list, I can draw my whole world at about 200fps, which is nice. If I enable the culling, I am down to ~100fps while not even drawing a quarter of the objects.


Profiling showed that most of the time is spent in the function that iterates through the object lists of a BSP leaf.


It basically only does this, about 5 times for the different lists:

if(nodeDistance < vobOutdoorDist)
{
	for(auto it = node->Vobs.begin(); it != node->Vobs.end(); it++)
	{
		GVobObject* vob = (*it);
		vob->DrawVob(); // Just draw
	}
}

Internally, the "DrawVob" method only pushes an index value to an std::vector after the first time it has been called.


It would be great if someone could tell me how this kind of scene is usually handled efficiently. By the way, it is still just as slow even if I completely remove all code inside the "DrawVob" method, so it's nothing in there causing the slowdown.

ConstantBuffer is leaking memory after release

12 March 2015 - 12:02 PM

Hello everyone!


I am currently fighting a memory leak in my project, and I have stripped it down to creating and releasing a ConstantBuffer.


Since these particular buffers get created during load time, I added some test code to check whether this really was the problem.


The game starts the following code with approx. 400 MB of used RAM, and eventually crashes at about 1.7 GB:

D3D11_SUBRESOURCE_DATA d;
for(int i = 0; i < INT_MAX; i++)
{
	VS_ExConstantBuffer_PerInstance dt; // Just some 64-byte sized struct

	// Init cb
	d.pSysMem = &dt;
	d.SysMemPitch = 0;
	d.SysMemSlicePitch = 0;

	// Create constantbuffer
	ID3D11Buffer* buffer;
	CD3D11_BUFFER_DESC desc(sizeof(VS_ExConstantBuffer_PerInstance), D3D11_BIND_CONSTANT_BUFFER);
	device->CreateBuffer(&desc, &d, &buffer);

	// Release it again
	buffer->Release();
}

So, the only thing this does is create and release an ID3D11Buffer over and over again, and that should keep the memory consumption at an even level, right?


I really don't know where to go from here. What could even cause such a problem?
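
One thing I still plan to try is dumping the live objects through the debug layer, something like this (this assumes the device was created with the D3D11_CREATE_DEVICE_DEBUG flag):

#include <d3d11.h>
#include <d3d11sdklayers.h>

// Prints every live D3D11 object to the debug output, which should show
// whether the buffers really are released.
void ReportLiveObjects(ID3D11Device* device)
{
	ID3D11Debug* debug = nullptr;
	if (SUCCEEDED(device->QueryInterface(__uuidof(ID3D11Debug),
	                                     (void**)&debug)))
	{
		debug->ReportLiveDeviceObjects(D3D11_RLDO_DETAIL);
		debug->Release();
	}
}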

Rendering a lot of different geometry

26 January 2015 - 03:49 PM

Hi there!


I'm currently working on replacing an old game's D3D7 renderer with a more modern D3D11 one (it's Gothic 2, in case you know that game!). The game originally does a great deal of CPU culling, because graphics cards weren't so strong back then, and that now caps it at around 40fps on a modern system in the more demanding main areas.


Well, graphics cards have caught up so much that it is now faster to just not cull anything at all.


However, now comes the problem:

I have no clue how I should render those objects in a fast way. I have about 1000 objects on screen, generally with ~3 submeshes each, while there are about 17000 objects in the whole world.


To get them, I have to walk a BSP tree and pack them into a list, which I can then sort by texture and vertex buffer.


So, drawing doesn't really take much time on my GTX 970, but I am already down to 80fps (from ~300) even when I don't enable the actual draw calls.

Debugging and profiling showed me that one part of the problem seems to be the lists I am filling with pointers to the objects.


Let's say I have 1000 objects on screen. I am on 32-bit, so a pointer is 4 bytes, which makes for a total of about 4KB that I stuff into my list every frame.

This list also gets copied in the process to sort by texture, so I have 8KB of data lying around in lists in my render function, which needs to be cleaned up at the end. That doesn't sound too good for real-time rendering, right?


I then went ahead and made the lists static vectors, whose capacity doesn't get reduced across frames, so no cleanup is needed. However, I can't do that with the sorting part, so it's still 4KB.
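
To illustrate what I mean (a simplified sketch; RenderObject stands in for my actual object type, and sortKey for whatever packs the texture/vertexbuffer order):

#include <algorithm>
#include <vector>

struct RenderObject { unsigned sortKey; }; // stand-in for the real type

// Persists across frames; clear() keeps the capacity, so no reallocation.
static std::vector<RenderObject*> s_renderList;

void BuildRenderList()
{
	s_renderList.clear();
	// ... push visible objects here during the BSP walk ...

	// This copy for the texture sort is the remaining 4KB per frame.
	std::vector<RenderObject*> sorted = s_renderList;
	std::sort(sorted.begin(), sorted.end(),
	          [](const RenderObject* a, const RenderObject* b)
	          { return a->sortKey < b->sortKey; });
}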


Everything aside, what are the best options in such a case? The objects aren't all different; in fact, I have about 500 meshes loaded right now, so I could go with instancing for the static stuff (using D3D11). But how do I handle frustum culling then, without updating all the constant buffers every frame?
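
One direction I've been thinking about, as a sketch (no idea yet whether it's viable): put all world matrices into one immutable structured buffer at load time, and per frame only upload the IDs of the visible instances, which the vertex shader remaps to matrices:

#include <d3d11.h>
#include <DirectXMath.h>

// All ~17000 world matrices go into one immutable structured buffer once
// at load time. Per frame, only a small list of visible instance IDs is
// uploaded, and the vertex shader maps SV_InstanceID -> ID -> matrix.
ID3D11Buffer* CreateWorldMatrixBuffer(ID3D11Device* device,
                                      const DirectX::XMFLOAT4X4* matrices,
                                      UINT count)
{
	D3D11_BUFFER_DESC desc = {};
	desc.ByteWidth = count * sizeof(DirectX::XMFLOAT4X4);
	desc.Usage = D3D11_USAGE_IMMUTABLE;
	desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
	desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
	desc.StructureByteStride = sizeof(DirectX::XMFLOAT4X4);

	D3D11_SUBRESOURCE_DATA init = { matrices, 0, 0 };
	ID3D11Buffer* buffer = nullptr;
	device->CreateBuffer(&desc, &init, &buffer);
	return buffer;
}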


Any help would be appreciated!