
#5074667 So, Direct3D 11.2 is coming :O

Posted by on 02 July 2013 - 01:25 AM

Why the negative vibe on this release???  It seems pretty solid to me.

*puts on professional rendering developer hat*

The release is solid, don't get me wrong; the only issue I have is which features are going to end up going back to Win7. This hasn't been addressed anywhere at all, so from a 'use it in the real world' standpoint, much like 11.1, it's a great big unknown.

To be honest I could live without most of the things, but Tiled Resources is a BIG thing because it brings feature parity with both of the upcoming consoles.

As much as I like using Win8 at home and will update to Win8.1 pretty much right away, the fact of the matter is that Win7 still represents a very large market share and will for some years yet, so unless features are also available on Win7 you will get a 'meh' reaction as they aren't overly feasible to work with/use.

#5072975 So, Direct3D 11.2 is coming :O

Posted by on 26 June 2013 - 10:04 AM

My biggest wish would be getting this stuff downstream to Win7 at least - as much as I like using Win8 at home, the fact is Win7 is still a massive market and I don't see the Win8.1 update changing that any time soon.

Given they seem to be providing hardware partially resident textures in this update, they'd better - otherwise it'll be just a useless bullet point for most developers :|

#5072482 loop counters as class members? (memory usage)

Posted by on 24 June 2013 - 06:48 AM

Also, remember that with member variables you are effectively doing 'this->var', which, unless the compiler can prove the value can't be affected by other code, forces it to produce sub-optimal code where it is likely to read-modify-write the value on every loop iteration to ensure any other code looking at it gets the up-to-date version.

The same applies to end conditions, where accessing a member is, more than likely, going to cause a reload from the 'end' variable.

This is why I prefer to structure my loops 'for(unsigned int idx = 0, end = m_count; idx < end; ++idx)' to give the compiler the best chance of removing redundant read/writes and get things into registers.

(Member variables for loop counters would also invalidate cache lines where not required, causing more memory traffic and core synchronisation to happen, and in a world where memory bandwidth and latency are the biggest issues going, this is bad voodoo.)
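To make the difference concrete, a minimal sketch (the class and names are made up for illustration);

struct Widget
{
    unsigned int m_idx;
    unsigned int m_count;
    int m_data[64];
    int m_total;

    void SumSlow()
    {
        m_total = 0;
        // Every access to m_idx and m_count goes through 'this'; unless
        // the compiler can prove nothing else aliases them, it must keep
        // the counter up to date in memory on every iteration.
        for (m_idx = 0; m_idx < m_count; ++m_idx)
            m_total += m_data[m_idx];
    }

    void SumFast()
    {
        int total = 0;
        // Locals can live entirely in registers; 'end' is read once.
        for (unsigned int idx = 0, end = m_count; idx < end; ++idx)
            total += m_data[idx];
        m_total = total;
    }
};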

#5070239 Compute Shader Fail

Posted by on 16 June 2013 - 02:09 PM

There was some confusion as to why our (very work in progress) tone mapping solution was running so slowly; on a simple scene which was GPU limited we were failing to hit 60fps on even a 470GTX.

After a couple of weeks of this I finally got ahead of work enough to have a look and very quickly found a few minor problems...

The code computes a histogram of the screen over a couple of stages and stages 1 and 2 were running very slowly and appeared to be horribly bandwidth limited (on a laptop with around 50% of the bandwidth of my card it was running 50% slower). Breaking out nSight gave us some nice timing information.

The first pass, which broke the screen up into 32x32 tiles, thus requiring 60x34 work groups (or 2088960 threads across the device) for a 1920*1080 source image, looked like this;
if(thread == first thread in group)
    for(uint i = 0; i < 128; ++i) // (1)
        localData[i] = 0;


if(thread == first across ALL the work groups) // (2)
    for(uint i = 0; i < 128; ++i)
        uav0[i] = 0;

if(thread within screen bounds)
    result = do some work;
    interlockAdd(localData[result], 1)

if(thread == first in group) // (3)
    for(uint i = 0; i < 128; ++i)
        uav1[i] = localData[i];
Total execution time on a 1920*1080 screen; 3.8ms+
Threads idle at various points;
(1) 1023
(2) 2088959
(3) 1023

Amusingly, point (2) was to clear a UAV used in a later stage, and was there to 'save a dispatch call of a 1x1x1 kernel' o.O

Even a quick optimisation of;
(1) have each of the first 32 threads write the uints; more work gets done, and bank conflicts are avoided due to the offsets at which each thread writes
(2) having 32 threads write this data (although this shouldn't need to be done)
(3) having the first 32 threads per group write the data, same reasons as (1)

Resulted in the time going from 3.8ms+ down to around 3.0ms+ or a saving of approximately 0.8ms.
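For reference, a rough HLSL sketch of what optimisation (1) looks like - the thread count and names here are illustrative, not the shipped code;

groupshared uint localData[128];

// Inside the compute shader; groupIndex is SV_GroupIndex.
// 32 threads each clear 4 bins; consecutive threads hit consecutive
// addresses, so each write lands in a different bank.
if (groupIndex < 32)
{
    [unroll]
    for (uint i = 0; i < 4; ++i)
        localData[groupIndex + i * 32] = 0;
}
GroupMemoryBarrierWithGroupSync();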

That wasn't the biggest problem... oh no!

You see, after this comes a stage where the data written per-tile is accumulated into a UAV as the complete histogram of the scene. This means in total we needed to process 60x34 (2040) tiles of data with 128 unsigned ints in each tile.

buffer<uint> uav;
buffer<uint> srv;
if(thread < totalTileCount)
    for(uint i = 0; i < 128; ++i)
        interlockAdd(uav[i], srv[threadTile + i]);
In this case each thread reads in a single uint value from its tile and then does an interlock add to the accumulation bucket.

OR, to look at it another way;
1) each thread reads a single uint, with each thread putting in a request 128 uints away from all the others around it, which is a memory fetch nightmare
2) each thread then tries to atomically add that value to the same index in a destination buffer, serialising all the writes as each one tries to complete.
3) each thread reads and writes 128*32bit, or 512 bytes, which across all the threads works out to ~1meg in each direction. (512 bytes * 2040 tiles each way, with interlock overhead.)

This routine was timed at ~4ms on a 470GTX and only got worse when memory bandwidth was reduced on the laptop.

This one I did take a proper look at and changed to a reduce style operation so;
1) Each thread group was 32 threads
2) Each thread group handled more than one tile (20 was my first guess for the first pass, reduced on later passes)
3) Source and destination buffers were changed to uint4 types
4) For the first tile each thread group reads in the whole tile at once (32xuint4 reads) and pushes it to local memory
5) Then all the rest of the tiles are added in (non-atomically, we know this is safe!)
6) Then all threads write back to the thread group's destination tile
7) Next dispatch repeats the above until we get a single buffer like before

Per-pass each thread group still reads 512 bytes per tile, but now only writes back 512 bytes per group.
1) 1Meg in, coalesced reads
2) 51kb out, coalesced writes
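A sketch of what one of those reduce passes might look like in HLSL - TILES_PER_GROUP, the buffer names and the 32-thread group size are assumptions reconstructed from the description above, not the actual code;

#define TILES_PER_GROUP 20  // first-pass guess; lower on later passes

StructuredBuffer<uint4>   srv;  // 128 uints = 32 uint4s per tile
RWStructuredBuffer<uint4> uav;

groupshared uint4 localData[32];

[numthreads(32, 1, 1)]
void ReducePass(uint3 groupId : SV_GroupID, uint idx : SV_GroupIndex)
{
    uint baseTile = groupId.x * TILES_PER_GROUP;

    // Whole first tile read in one go: 32 coalesced uint4 loads.
    localData[idx] = srv[baseTile * 32 + idx];

    // Accumulate the remaining tiles; each thread owns one slot,
    // so no atomics are needed.
    for (uint t = 1; t < TILES_PER_GROUP; ++t)
        localData[idx] += srv[(baseTile + t) * 32 + idx];

    // Coalesced write of the group's partial histogram.
    uav[groupId.x * 32 + idx] = localData[idx];
}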

Amusingly this first try, which took 3 passes, reduced the runtime from 4ms to 0.45ms in total

Moral of this story;
1) Don't let your threads idle
2) Watch your data read/write accesses!

#5069531 Writing Game Engine!

Posted by on 13 June 2013 - 02:53 PM

But I don't know for sure. I've never finished one engine. Games. But not engines. And although I'm convinced I know how the things I write now should be used, perhaps it's not actually a good way to go about it.

That's because you never finish an engine - they are always changing and evolving as you learn and find weaknesses while using them.

Which is kind of the problem with trying to develop one on its own; unless you are constantly using it you'll never know how much works and how much is flawed.

Even at work, where we have a small team developing an engine, it is being driven by customer game requirements and ideas.

#5069248 A new feature i want in my C++: Contexts

Posted by on 12 June 2013 - 04:44 PM

I much prefer the explicit parameter.  If I need a bunch of them I'll build my own struct of pointers and pass that around.

1000x this.

Hidden dependencies are the devil.

They make testing harder, they make code harder to reason about, and they obscure what is going on; a simple function can end up doing far more than you think because it's pulling in dependencies which aren't shown and doing work in them.
(Not to mention the inherent memory access issues, which make me shudder just thinking about it.)

Code is read more than it is written; being clear about what is going on is always the best goal.
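As a tiny sketch of the 'struct of pointers' approach being endorsed here (all names are hypothetical);

struct Renderer;
struct Allocator;
struct JobQueue;

// Everything the function depends on is visible at the call site.
struct FrameContext
{
    Renderer*  renderer;
    Allocator* frameAllocator;
    JobQueue*  jobs;
};

void UpdateParticles(const FrameContext& ctx, float dt);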

#5068209 single responsibility principle in practice

Posted by on 08 June 2013 - 04:41 AM

Yes, IMHO, the above is an abuse of inheritance. It's using "is a" (inheritance) where "has a" (composition) would suffice. Standard OO doctrine is to default to composition and only use inheritance where it's actually required -- i.e. where you actually need to polymorphically operate on some abstract type through an interface -- so if ModelInstance isn't implementing an interface that it's inherited from Entity, then it shouldn't inherit from it.

1000x this.

Inheritance chains more than 2 or 3 deep are likely a symptom that you've got a horribly rigid relationship graph between all your classes and, as Hodgman goes on to say, some which are 90% duplicates of the others in order to get around a 'small problem'.

Small, focused, simple classes are not only easier to understand and compose but also have the side benefit of likely being a more friendly size for CPU caches as you won't have a cascade of data items playing a game of Take Up Cache Space during various updates etc.
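In sketch form, the 'has a' version being recommended (types are illustrative);

struct ModelInstance
{
    // mesh handle, transform, material set, etc.
};

struct Entity
{
    ModelInstance model;  // Entity *has a* model...
    // ...rather than ModelInstance *being an* Entity.
};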

#5065079 Is sampling textures slow?

Posted by on 26 May 2013 - 04:49 PM

Sampling textures is something that GPUs are very specifically designed to be good at. It's cheaper than computation in a lot of cases.

Actually it is the contrary. Yes, sampling textures is very fast; however, computations are even faster. I can't find a good reference here, but the ALU : Tex ratio is getting higher, i.e. shaders should do more calculations for each texture fetch. Nowadays even look-up textures can be replaced with the actual calculations, at least in the simple cases.
Well, to be correct about this, sampling textures may be fast depending on what is going on.

If you are lucky and your texture data sits in cache then yay! fast data return and all is good.
If you are unlucky and we have to make a request out to VRAM then noooo! slow data return as while GDDR has high bandwidth the latency is pretty damned poor - plus you are now queued with other requests.

The good news is that if the GPU has enough work to do then you might never notice these stalls, as your thread group will get swapped out and someone else will get a run at the resources, covering the latency of your data fetch. If you don't have enough work, or other threads can't be swapped in, however, then the GPU will stall while it waits for data to come back so it can resume a thread group.

Note; thread group.

When thinking about branching you have to consider how it will affect all the threads running at a time in that group. The common group size on DX11 hardware (which I'll focus on, as the topic was tagged DX11) is 32 or 64 threads to a wavefront, which is all the threads working in lock-step together.

When it comes to branching the rules are simple.
if(conditionA)
    /* code block A */
else
    /* code block B */
/* code block C */
If the GPU is executing this code then ALL threads will evaluate 'conditionA'.
If all threads evaluate 'conditionA' to 'true' then they will all run 'code block A' followed by 'code block C'.
If all threads evaluate 'conditionA' to 'false' then they will all run 'code block B' followed by 'code block C'.

The fun starts when some threads evaluate to 'true' and some to 'false', at which point the GPU does the following;
- mask out all lanes which evaluated 'false'
- run 'code block A'
- mask out all lanes which evaluated 'true'
- run 'code block B'
- mask all lanes back in
- run 'code block C'

As you can see, the GPU has to execute both sides of the 'if' condition which, depending on the code in blocks A and B, could cause a major performance hit as it has to do all the work in both branches (texture fetches might not happen, depending on arch and other issues, so you can potentially save some bandwidth/texture cache).

That said, if you know your data and you know how things are going to branch on a group basis then branching can help in certain cases (even on DX9 hardware, where I've had wins doing it).

The best example of this was on a game I worked on where our roads were drawn as a mesh overlaid onto the terrain and blended in. The road texture consisted of a large section in the middle where alpha=1, a few transition texels where alpha tended towards 0, and a few texels where alpha=0 along the sides.

When applied to the terrain, large portions of the road mesh were covered by the alpha=0 part and others by the alpha=1 part; by placing an 'if' statement on the initial alpha value (sampled from a texture) large amounts of work could be skipped (the road had other textures applied to it) and the pixels discarded (saving blending).

This worked well because within a pixel quad the majority of the threads either all had alpha=0 or all had alpha=1, with only the small border section varying, and as the 'else' case was a simple 'discard' statement the cost of running both code paths in that instance was small, resulting in a large performance increase on the target hardware.
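Sketched in HLSL, the shape of that shader was roughly as follows - the resource names and the ApplyRoadTextures helper are stand-ins, not the original code;

Texture2D    roadAlphaTex   : register(t0);
Texture2D    roadDiffuseTex : register(t1);
SamplerState linearSampler  : register(s0);

float4 ApplyRoadTextures(float2 uv)
{
    // Stand-in for the 'other textures applied to the road'.
    return roadDiffuseTex.Sample(linearSampler, uv);
}

float4 RoadPS(float2 uv : TEXCOORD0) : SV_Target
{
    float alpha = roadAlphaTex.Sample(linearSampler, uv).a;

    if (alpha == 0.0f)
    {
        // Cheap path: whole quads over the transparent edges skip
        // the texturing work and the blend.
        discard;
    }

    // Expensive path: only pixels with visible road do the work.
    float4 colour = ApplyRoadTextures(uv);
    colour.a = alpha;
    return colour;
}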

The key point to get across here is that branching CAN help but you have to be aware of the data you are feeding it and the amount of work to do.

In this specific case I probably wouldn't branch as you don't really have enough work to do anyway; if the texturing statement was a bit more complex AND the colour == (0,0,0,0) case had a frequency such that there was a high chance all the threads in a group would take one path or the other, then I'd be tempted to use it.

Branching has its uses - we are beyond the days of it totally destroying your performance when used - but you still have to use it sensibly.

And yes, burning ALU time is often better than burning bandwidth because, as mentioned, the ALU:tex ratio has long been shifting in that direction; modern GPUs are ALU monsters frankly.

#5065072 Why all the gpu/ video memory?

Posted by on 26 May 2013 - 04:21 PM

A 1920*1080 screen requires
- 7.91Meg @ 32bit/pixel
- 15.8Meg @ 64bit/pixel (16bit/channel)

Throw in some AA and you can increase that by 2 or 4 times (32 to 64meg) for just a single colour buffer.
Throw in at least one depth-stencil buffer, which will take 8, 16 or 32meg depending on multi-sample settings.
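Spelled out, the arithmetic behind those figures (a quick sketch; 1 meg here is 1024 * 1024 bytes);

const double pixels = 1920.0 * 1080.0;            // 2,073,600 pixels
const double ldrMeg = pixels * 4.0 / (1 << 20);   // 32bpp -> ~7.91Meg
const double hdrMeg = pixels * 8.0 / (1 << 20);   // 64bpp -> ~15.8Meg
const double aa4Meg = hdrMeg * 4.0;               // 4x AA -> ~63.3Meg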

So, before we've done anything, a simple HDR rendering setup can eat 160meg on just the 16bpc HDR buffer, z-buffer and two final colour buffers (for double buffering support; the more buffers you want, the more RAM you take).

At which point you can throw in more off-screen targets for things like post-processing steps, depending on your render setup, and maybe some buffers to let you do tile-based tone mapping, and suddenly you can be eating a good 400meg before you've even got to shaders, constant buffers, vertex and index buffers and textures (including any and all shadow maps required).

Various buffers in there are likely to be double buffered, if not by you then by the driver, which will also increase the memory footprint nicely.

Textures, however, are the killer; higher resolutions and more colour depth, even with BC6 and BC7 compression, are going to be expensive - and that's before you throw in things like light mapping and the various buffers used for that, which can't be compressed.

Basically stuff takes up a LOT of space.

#5064974 DX11 - Lighting - Does it affect the mesh?

Posted by on 26 May 2013 - 07:23 AM

Mesh has a bounding sphere, light has a bounding sphere; do an intersection test -> if they intersect then the light affects it.
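A minimal sketch of that test (the types are illustrative);

struct Vec3 { float x, y, z; };
struct Sphere { Vec3 centre; float radius; };

bool Intersects(const Sphere& a, const Sphere& b)
{
    const float dx = a.centre.x - b.centre.x;
    const float dy = a.centre.y - b.centre.y;
    const float dz = a.centre.z - b.centre.z;
    const float r  = a.radius + b.radius;
    // Compare squared distances to avoid the sqrt.
    return (dx * dx + dy * dy + dz * dz) <= (r * r);
}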

#5062543 Why do most tutorials on making an API-agnostic renderer suggest that it'...

Posted by on 17 May 2013 - 04:28 AM

In my API, I expose textures, vertex buffers, index buffers, constant buffers, input layouts (buffer->shader binding), render states and a high-level shader object (vertex/pixel/etc all set in one go) because then you can write nearly any graphical effect on top of this API, instead of repeating work.

We do much the same, although instead of having vb, ib, cb etc. types everything is unified under a 'buffer' type where usage/construction sets up the underlying buffer type. We also don't differentiate between 'textures' and 'render targets'; all are unified under a 'surface' type. The logic being that the people creating the buffers/surfaces already know their intent and many of the operations are the same anyway. Data streaming off disk has its type information embedded in the header section, which can be directly de-serialised for the create functions when loading.
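As a rough sketch of that unified style - this is illustrative, not our actual API;

#include <cstddef>

enum class BufferUsage { Vertex, Index, Constant };

struct BufferDesc
{
    BufferUsage usage;        // selects the underlying D3D buffer type
    size_t      sizeInBytes;
    const void* initialData;  // may be null for dynamic buffers
};

struct Buffer;  // opaque handle

// One create path covers vb/ib/cb; the caller already knows its intent.
Buffer* CreateBuffer(const BufferDesc& desc);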

#5060662 Alternatives to singletons for data manager?

Posted by on 09 May 2013 - 01:04 PM

An infinitely better example would be a static-inheritance class giving access to something like the graphics device.

No, that would be another example of a massive design flaw...

#5059571 SSAO horrendous performance and graphical issue(s)

Posted by on 05 May 2013 - 04:09 PM

Question; why wasn't the first suggestion in this post to break out nSight or something like it to profile precisely what is going on?

So far this thread pretty much seems like a bunch of people trying various forms of voodoo to try and get things working...

(btw, it looks like you are rendering your post passes with a quad; try swapping it for a triangle - it might help your black line problem and would also improve performance ever so slightly, as you are no longer splitting pixels along the diagonal.)
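The usual trick for that, sketched in HLSL (a standard pattern, not code from this thread): draw 3 vertices with no vertex buffer and derive the positions from SV_VertexID;

// One oversized triangle whose clipped area covers the whole screen,
// so no pixel quads straddle a diagonal seam.
void FullscreenTriVS(uint id        : SV_VertexID,
                     out float4 pos : SV_Position,
                     out float2 uv  : TEXCOORD0)
{
    // id 0,1,2 -> uv (0,0), (2,0), (0,2)
    uv  = float2((id << 1) & 2, id & 2);
    pos = float4(uv * float2(2, -2) + float2(-1, 1), 0, 1);
}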

#5053928 Render queue texture clearing

Posted by on 16 April 2013 - 12:31 PM

So is this how your entire rendering architecture works, or is it some part put on top of everything? It doesn't seem like a bad idea, but it appears to not really be compatible with my current rendering architecture, which works by submitting a render instance (state changes + sort key + draw call), but I'm not sure...

Ah yes, I should have been a bit more clear about that; we build the same kind of list too, and it's that which is processed inside the "render each visible model" bit (by 'render' I mean 'record these draw calls into a command list to execute on the device later'). This list is just established per unique camera in the world and used with the scenes which require it.

#5052273 lots of small meshes or one big dynamic buffer?

Posted by on 11 April 2013 - 05:12 PM

and i'm trying to stick with fixed function for maximum compatibility.

Compatibility with what?
2002 - ATI releases the R300 GPU. No fixed function hardware.
2004 - NV releases the NV40. No fixed function hardware.

Heck, everything you've said you're worried about smacks of problems from nearly 10 years ago...