

Member Since 15 Dec 2001

#4952957 Disable pixel shader

Posted by phantom on 26 June 2012 - 03:22 AM

As with your last thread on the subject, the results you are going to get are not going to be accurate; as I pointed out before ( http://www.gamedev.net/topic/626047-processing-time-at-vertex-shader/page__view__findpost__p__4947729 ) the only way you are going to get decent results is to use the vendor-supplied libraries or tools to time things.

More importantly: WHAT are you trying to do with this information?
Currently you are learning nothing useful whatsoever...

#4951762 Which is faster? std::priority_queue or std::multimap

Posted by phantom on 22 June 2012 - 09:21 AM

It's not just the insert which matters, it is the whole scope of how you plan to use this data structure, as different structures have different characteristics depending on what you are going to do.

If you are looking up lists of data based on a common key then multimap might give you the best performance as it is designed for fast lookups.

On the other hand, if you just want a list of data you can iterate over from first to last, then something which allocates contiguous memory as a backing store will be faster (think std::vector).

It all comes down to usage pattern, without knowing that you can't make an informed choice as to what container you should use.

#4951178 Which would be better? for a renderlist vector or list

Posted by phantom on 20 June 2012 - 06:48 PM

The problem with a list isn't the pushing of the object - it is the cost of pulling data out of that list as you walk along it.
(Well, OK, the holding would also be a problem, as I'm pretty sure the container has to 'new' a piece of memory the size of a pointer to hold the data plus list book-keeping information for each 'push', which could cost as well.)

Memory allocated to hold elements in a list can be 'anywhere' in memory, which means that right away, accessing it is going to cost you a cache miss. You then have to use that pointer to grab another chunk of data, again 'anywhere' in memory and again going to cost you. Memory access is basically very slow compared to ALU speed - CPUs try to compensate by reading ahead, but when you bounce all over memory, as you will in this system, you don't help them one bit.

A better system would be a vector of pointers: the pointers themselves would be contiguous in memory, although you'd still take a hit to go and find the memory to work with.

The 'best' system would involve using small structures to store JUST the rendering information required - that way the rendering part isn't reading redundant data, the CPU can happily read ahead and pre-fetch things into cache for you and things should generally be faster.

However, as a stop-gap I'd go with a vector of pointers - not ideal, but it'll cost you less in memory problems in the short term and remove the constant 'new' problem, as you'd only have to 'clear' the vector rather than release all the nodes and reclaim memory as with the list method.

#4951151 Looking into how to parallelize my game engine

Posted by phantom on 20 June 2012 - 05:04 PM

If you are just throwing commands at the D3D9 device from multiple threads then yes, single threaded is going to end up faster, as the driver/runtime internally has to do a lot of state maintenance and locking - you effectively go single threaded but with more overhead.

The correct way to do this is to use multiple threads/tasks to assemble a sorted, ordered command queue using your own structures - this command queue is then processed by a main thread/task while everything else gets on with setting up for the next frame. In this case 'sorted and ordered' means doing state sorting and any other work to ensure your rendering does as few state changes as it can.

The engine used for Civ5 (LORE) uses this method on its DX9 path and they saw a speed up due to being able to take advantage of data being in cache as they were quickly re-hitting DX code rather than doing 'a bit of their code' then 'a bit of DX code' which is going to involve jumping around memory and losing data from the cache. You might not see the same speed up in .Net as they did in C++ but you should still see some improvement if you do it right.

In short;
- multiple threads/tasks to setup a command queue
- one thread to process this queue

#4950987 Win32 multiple OpenGL windows, one window fails

Posted by phantom on 20 June 2012 - 09:11 AM

Only one OpenGL window can be 'current' on a thread at a time.
In order for a new window to become 'current' the old window must first release its claim to being current (iirc you call wglMakeCurrent with null parameters); only then can the second window make itself current.

So, given two windows A & B the correct sequence is;

A::do opengl operations
A::makeCurrent(null, null)
B::do OpenGL operations
B::makeCurrent(null, null)

Make sure this is happening, as it's the most likely source of the problem.

#4948913 Processing time at vertex shader

Posted by phantom on 13 June 2012 - 02:01 PM

Yes, the tools and libraries to track this information do indeed break down by shader type and give you information about the time spent processing those shader types.

Some counters from AMD's GPUPerf lib;

Timing : ShaderBusy, ShaderBusyCS, ShaderBusyDS, ShaderBusyGS, ShaderBusyHS, ShaderBusyPS, ShaderBusyVS
Vertex Shader; VSALUBusy, VSALUEfficiency

(They have many more, covering all shader types, memory information and more. NV and Intel will cover the same.).

And of course you can still be vertex or pixel bound - if most of your shader time in a frame is being spent on vertex shader operations then you are vertex bound. Same thing if you replace 'vertex' with 'pixel', 'geometry', 'hull', 'domain' or 'compute'.

Just because there are no longer dedicated "pipes" doesn't mean you can't be bound by a particular shader type; being 'bound' by something just means it is taking the largest amount of time, so optimising other parts of the process isn't going to make a difference.

Just because the GPU can balance resources as it needs to doesn't magically mean that all shaders will execute in the same amount of time, it just means the GPU can stay as busy as possible while it has work to do.

#4948635 Epic Optimization for Particles

Posted by phantom on 12 June 2012 - 03:26 PM

The trick with fast particle simulation on the CPU is basically to use SSE and to think about your data flow and layout.

struct Particle
{
    float x, y, z;
};

That is basically the worst layout you could even consider using.

struct Emitter
{
    float * x;
    float * y;
    float * z;
};

On the other hand this, where each pointer points to a large array of components, is the most SSE-friendly way of doing things.
(You'd need other data too; it should also be in separate arrays.)

The reason you want them in separate arrays is due to how SSE works; it multiplies/adds/etc across registers rather than between components.

So, with that arrangement you could read in four 'x' and four 'x direction' chunks of data and add them together with ease; if you had them packed as per the above structure then in order to use SSE you'd have to read in more data, massage it into the right SSE friendly format, do the maths and then sort it out again to write back to memory.

You also have to be aware of cache and how it interacts with memory and CPU prefetching.

Using this method of data layout, some SSE and TBB, I have an old 2D (not fully optimised) particle simulator which, on my i7 920 using 8 threads, can simulate 100 emitters with 10,000 particles each in ~3.3ms.

#4947957 Singleton pattern abuse

Posted by phantom on 10 June 2012 - 11:10 AM

You don't solve the 'forget' problem BUT you do solve hidden dependencies, which are the problem with globals.

void func1();
void func2();
void func15();
// vs
void func1(SomeState &);
void func2(SomeState &);
void func15(SomeState &);
// or
void func1(const SomeState &);
void func2(const SomeState &);
void func15(SomeState &);

In the first group the dependencies are hidden away; there is no way, by looking at the code, to say what dependencies those functions have.
The second group makes it clear they require an instance of 'SomeState' to do work.
The third group makes it clear that two of the functions won't modify the state and the third will/can.

The other problem with globals is that they are just that: global.
They are everywhere, they get everywhere, they are hard to track dependency-wise and, most of the time, they really don't need to be global.

Often people say 'oh, but I need to do X everywhere', but when you get down to it they rarely need to do it 'everywhere' - only in a very small, select area of code.
People think 'everywhere' because they haven't put the effort into thinking about the problem beyond 'I need to do X'.

Nor is this about 'having a phd in computer science'; it is about sitting down and thinking about your design and thinking beyond the first solution which appears. Yes, a global often seems the easiest to start with, and if you want to run with it then go ahead, but don't be surprised when you run into problems or when people say you've come up with a bad solution.

Globals are, by and large, the lazy solution.
Singletons are, by and large, the wrong solution.

#4947640 #define vs const

Posted by phantom on 09 June 2012 - 06:58 AM

Some time ago, i read somewhere (i think it was on an old C++ book, not sure) that one should use const instead of #define, the reason being that #define is a preprocessor directive, and if for some reason the compiler ignored it, error would occur.

The compiler wouldn't 'ignore' it, because the compiler never saw it - and therein lies the problem.

A #define is nothing more than a global text-replace operation.
If you #define foo 1 then everywhere the preprocessor sees foo it replaces it with 1, with no concern for scope or context. This can lead to all manner of problems; a classic example is the 'max' macro in the Windows headers.

If you include windows.h and then go on to try to use std::max() you'll get some lovely compiler errors spat out at you, because 'max' is a #define provided by the windows.h header (or a descendant thereof) which replaces the 'max' portion of 'std::max' with some existing code.

This lack of scope is a double edged sword; on the one hand it does allow you to do global text replacement operations, on the other hand it stomps all over scope and has no regard for context.

In short; for constant declarations you should prefer the various C++ methods for declaring such things (const and constexpr in C++11), rather than #define, as those respect scope and are seen by the compiler.

#4947635 Singleton pattern abuse

Posted by phantom on 09 June 2012 - 05:37 AM

But a map (talking about 2D) which is rendered tile by tile is rendered differently from, say, a checkbox or any simple object that only needs to be rendered with one procedure. Mustn't the Map class know of the logic by which it'll be rendered? (nested loops etc). Otherwise, that information must be hardcoded somewhere else (renderer?).

And you've run smack into a problem of doing too much; what does the 'map' represent in this?

The information doesn't have to be 'hard coded' anywhere but it should be abstracted in some manner. A 'map' shouldn't have to care HOW it is rendered, or even what is rendered, all it needs to care about is the fact it has something which needs to be rendered.

If you step back and think about it for a moment ALL rendering is the same; you provide the GPU with a vertex and index stream, some constants, some render states (including shaders) and some textures. That is it.

So at the lowest level you have a 'renderable work packet' object which bundles this state together, and that is all your renderer cares about; give it these packets of information and it can render you the world.

It doesn't matter if you have a 2D map, a rigged 3D character or a box the basic rendering remains the same.

So then pull back out one more level; you still need something to contain and send that work down to the renderer. This is the point where you start to specialise a bit; for your 2D map example the renderer would have been told 'here is an object which needs to be rendered and will provide you with packets of work'. In this instance the 'renderable' knows how to render a 2D map and can provide the renderer with work packets to do just that.

Finally you can back up one more level to the 'map' object itself, or the logical game representation of it; this doesn't know how to render the map at all - all it knows how to do is respond to certain events and other high-level map-related details. It will more than likely have a reference to the 2DMapRenderable object which will do the rendering work, but it has no direct means of communicating with the renderer. It could talk to the map renderable depending on the game; for example, if you have a game with a map consisting of 2D 'screens' which 'flip' instead of scrolling, you might well have a function which tells the renderable which screen to send to the renderer when it is next asked.

The renderer backend itself deals with talking to the renderable, including asking it for (culled) work, and doesn't care what the game above it is doing.

Short version; all rendering is the same at the lowest end; high level logic doesn't care about how things are rendered, abstract it away.

#4947437 Visual Studio 2012 Express won't support Win32 Projects

Posted by phantom on 08 June 2012 - 12:16 PM

I don't think "given in" is the right way to put it - its not like they spent the last 6 months going 'no no no, not going to happen, live with it' - they released it, people gave feedback, they reacted to change.

In fact I find that reaction remarkable - everyone complain that MS don't listen when they do its 'giving in' rather than 'listening to customer feedback'... christ on a bike...

#4947405 Visual Studio 2012 Express won't support Win32 Projects

Posted by phantom on 08 June 2012 - 10:08 AM

I suspect you've done it wrong because I've got the VS2012 RC installed, and only the Professional Edition at that, and I've got access to a full group of C++ project options including Win32 and console applications.

#4933780 Measuring performance - overdraw

Posted by phantom on 22 April 2012 - 09:13 AM

It is performed afterwards because the pre-test isn't very fine-grained in its rejection.

Let's say you have a 4x4 grid of pixels; the pre-test might be performed on a 2x2 grid, with each "pixel" in the coarse test covering 4 pixels in a quad of the real screen. This means the z value stored in that 2x2 grid must be as conservative as possible in order not to get false rejections.

Post pixel shader, the 'real' z-test is performed at a finer-grained level to catch the small number of pixels which would pass the pre-test but not the post-test.

So, if your planes are all aligned you would get 'perfect' early rejection, but as soon as things aren't perfectly aligned some pixels will pass the early test which shouldn't make it to the output buffers.

#4931688 Holy grail of z-fighting - I thought I found it, why does it not work?

Posted by phantom on 16 April 2012 - 05:13 AM

On OFP we 'solved' the problem by only rendering decals into the 'near' scene (0.2f -> 333.3f) using the same offset method the OP had to 'raise' the polygon as the distance increased. However for roads once you got beyond the 'near' scene they became part of the 'far' scene which just had them baked directly into the texture on the terrain.

#4927301 Revival of Forward Rending?

Posted by phantom on 01 April 2012 - 04:18 PM

OK, 7970 results as promised;

(Forward, Light Tile, Total Frame Time)

256 lights:
Index Deferred:
No AA: 1.4ms, 0.196ms, 2.5ms
2xAA: 1.5ms, 0.22ms, 2.7 -> 2.9ms
4xAA: 1.7ms, 0.27ms, 3.0 -> 3.4ms

Tiled Deferred:
No AA: 0.416ms, 0.914ms, 2.3ms
2x AA: 0.49ms, 1.86ms, 3.3ms (spike 3.5ms)
4x AA: 0.55ms, 2.53ms, 4.2ms (4.4ms spike)

512 lights:
Index Deferred:
No AA: 2.2ms, 0.2ms, 3.3ms
2xAA: 2.59ms, 0.22ms, 3.8ms
4xAA: 2.85ms, 0.28ms, 4.2ms

Tiled Deferred:
No AA: 0.416ms, 1.4ms, 2.9ms
2x AA: 0.49ms, 2.65ms, 4.2ms
4x AA: 0.55ms, 3.5ms, 5.2ms

1024 lights:
Index Deferred:
No AA: 4.8ms, 0.2ms, 5.9ms
2xAA: 5.45ms, 0.25ms, 6.7ms
4xAA: 5.99ms, 0.315ms, 7.4ms

Tiled Deferred:
No AA: 0.416ms, 3.08ms, 4.5ms
2x AA: 0.49ms, 4.88ms, 6.4ms
4x AA: 0.55ms, 6.16ms, 7.8ms (8ms spike)

Side note: GPU-Z reports back;
1690MB VRAM used
420MB 'dynamic' ram used

That's.... a lot :o

So, if we try to make some sense out of this test :D

It would seem that without AA the Tiled Deferred has the edge, but as soon as you throw AA into the mix things swing towards the Index Deferred method (1024, 2xAA being the notable exception to that rule).

TD shows the normal deferred characteristic of stable G-buffer pass times, but the tile-lighting phase begins to get very expensive for it.
By contrast ID has a pretty constant lighting phase, but the forward-render phase shows the same kind of increase as TD's lighting phase.