
Hodgman

Member Since 14 Feb 2007

#5295589 responsiveness of main game loop designs

Posted by Hodgman on 08 June 2016 - 03:09 AM

>> On Windows, the Keyboard, Mouse and USB game devices use an event queue

> do you know offhand if that's hardware driven or polled? hardware driven, I'd imagine. it can't miss a keypress, can it?

AFAIK it doesn't miss events, unless you let the queue run out of RAM :)

> and the message pump is a hardware driven event queue, but with no time stamps...

> all you can say is "these things occurred at some point between now and the last time I checked the queue."

Some people work around this by running a dedicated poll+timestamp thread at 1000Hz, and then having their update thread read from this timestamped queue.

AFAIK, Quake has typically used a 500Hz input polling thread.
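To illustrate (a minimal sketch only - the names, queue structure and 1ms sleep are my own, not from Quake or any particular engine): a dedicated thread polls the device roughly every millisecond, stamps each event, and pushes it into a queue that the update thread drains once per tick:

#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

struct InputEvent
{
	int key;                                          // which key/button changed
	bool pressed;                                     // press or release
	std::chrono::steady_clock::time_point timestamp;  // when the poll observed it
};

class TimestampedInput
{
public:
	void Start() { running = true; pollThread = std::thread([this]{ PollLoop(); }); }
	void Stop()  { running = false; pollThread.join(); }

	// Called by the update thread once per tick: takes all events seen so far.
	std::vector<InputEvent> Drain()
	{
		std::lock_guard<std::mutex> lock(mutex);
		std::vector<InputEvent> out;
		out.swap(events);
		return out;
	}
private:
	void PollLoop()
	{
		while (running)
		{
			for (InputEvent e : PollDevice())  // PollDevice = platform-specific polling
			{
				e.timestamp = std::chrono::steady_clock::now();
				std::lock_guard<std::mutex> lock(mutex);
				events.push_back(e);
			}
			std::this_thread::sleep_for(std::chrono::milliseconds(1));  // ~1000Hz
		}
	}
	std::vector<InputEvent> PollDevice() { return {}; }  // stub: read the real device here

	std::thread pollThread;
	std::atomic<bool> running{false};
	std::mutex mutex;
	std::vector<InputEvent> events;
};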

> well, at least my original question was answered. nobody came up with an example of a different order at all, much less one that didn't introduce lag. I suppose that makes sense. if you have to do each step A thru E in order, then ABCDE is the shortest path - whether it's all on one thread or spread across two (one, then the other).

I gave you two examples of where other orders are used to decrease input latency.

1) When the game is GPU bottlenecked, or display-vsync bottlenecked, and you're left with enough CPU time to increase the update+poll rate above the render rate, then increasing the update+poll rate can make it feel more responsive. Worst-case input latency remains the same, but average input latency can dramatically improve.

2) When the sim is fixed at a low frequency - e.g. a 10Hz large-scale RTS - increasing the poll and render rate not only provides smooth animation, but makes the GUI much more responsive, allowing orders to be given more easily and feedback on those commands to appear immediately (well before the next sim tick).

Or 3) when the relative timing of inputs matters to gameplay, such as in a rhythm game: polling at a much higher rate than the sim reduces quantization error and allows timestamping, making inputs feel more responsive and accurate.


#5295569 use ID or Pointers

Posted by Hodgman on 07 June 2016 - 09:46 PM

What does PrimitiveBase do for you in this case?

Allows metaprograms to determine if a type T is one of these strong-typedefs with:
std::is_base_of<PrimitiveBase, T>::value

Which is useful in serialization / script bindings / etc
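For instance (rough sketch only - the byte-writer and Serialize overloads below are made-up names, and it assumes the PrimitiveBase/PrimitiveType helpers from the post further down), a serializer can select an overload via that trait and write any such strong typedef as its underlying integer:

#include <cstdint>
#include <type_traits>
#include <vector>

// Overload chosen only for PrimitiveBase-derived strong typedefs: write the raw integer.
template<class T>
typename std::enable_if<std::is_base_of<PrimitiveBase, T>::value>::type
Serialize(std::vector<uint8_t>& out, const T& value)
{
	typename T::Type raw = value;  // uses the implicit conversion operator
	const uint8_t* p = reinterpret_cast<const uint8_t*>(&raw);
	out.insert(out.end(), p, p + sizeof(raw));
}

// Overload for everything else.
template<class T>
typename std::enable_if<!std::is_base_of<PrimitiveBase, T>::value>::type
Serialize(std::vector<uint8_t>& out, const T& value)
{
	// ...generic per-member path (not shown)...
}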




#5295551 use ID or Pointers

Posted by Hodgman on 07 June 2016 - 05:44 PM

Add "no type-safety" to the "cons" list for integer handles/IDs, please.

You can add integer IDs (and void* IDs!) to the C++ type system like this, in order to maintain type safety:
#include <cstdint>
typedef uint64_t u64; typedef uint32_t u32; typedef uint16_t u16;
typedef int16_t  s16; typedef uint8_t  u8;  typedef int8_t   s8;

// Empty base class so metaprograms can detect these strong typedefs via std::is_base_of.
struct PrimitiveBase {};
// Wraps an integer T; the otherwise-unused Name parameter makes each typedef a distinct type.
template<class T, class Name> struct PrimitiveType : PrimitiveBase
{
	typedef T Type;
	PrimitiveType() : value() {}
	explicit PrimitiveType(T v) : value(v) {}
	operator const T&() const { return value; }
	operator       T&()       { return value; }
private:
	T value;
};

#define MAKE_PRIMITIVE_U64(name)					struct tag_##name; typedef PrimitiveType<u64,tag_##name> name;
#define MAKE_PRIMITIVE_U32(name)					struct tag_##name; typedef PrimitiveType<u32,tag_##name> name;
#define MAKE_PRIMITIVE_U16(name)					struct tag_##name; typedef PrimitiveType<u16,tag_##name> name;
#define MAKE_PRIMITIVE_S16(name)					struct tag_##name; typedef PrimitiveType<s16,tag_##name> name;
#define MAKE_PRIMITIVE_U8(name)						struct tag_##name; typedef PrimitiveType<u8,tag_##name> name;
#define MAKE_PRIMITIVE_S8(name)						struct tag_##name; typedef PrimitiveType<s8,tag_##name> name;
#define MAKE_PRIMITIVE_PTR(name)					struct tag_##name; typedef tag_##name* name; // opaque pointer-sized handle

MAKE_PRIMITIVE_U32( MyObjectId );
class Foo
{
public:
  MyObjectId Create();
  void Release( MyObjectId );
};
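A quick note on why the tag parameter matters (MyTextureId here is just a hypothetical second ID for illustration): two IDs built on the same underlying integer are still distinct types, so mixing them up is a compile error:
MAKE_PRIMITIVE_U32( MyTextureId );

void Example( Foo& foo, MyTextureId tex )
{
	MyObjectId obj = foo.Create();
	foo.Release( obj );       // OK
	//foo.Release( tex );     // compile error: MyTextureId is not a MyObjectId
	//foo.Release( 42u );     // compile error: the integer constructor is explicit
}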



#5295491 use ID or Pointers

Posted by Hodgman on 07 June 2016 - 07:40 AM

Pointers - most cases.

Smart pointers - resource ownership.

Integer handles (e.g. IDs) - con: no direct access. pro: actual storage mechanism is now abstract, so much more flexible. Supports relocation, defrag, serialization, networking, 'weak handles' (detect use after free - see the sketch at the end of this post) etc...

Offsets: con: requires strict memory management. pro: can be templated to work like pointers, can be smaller than pointers, can be serialized.

Perfect hashing: great for asset names - gets rid of strings from the engine. Basically a subset of integer handle methods.
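Following up the 'weak handles' point from the integer-handles item above - a minimal sketch (not from any particular engine) of an ID that packs a slot index with a generation counter, so stale handles can be detected after a slot has been freed and reused:

#include <cassert>
#include <cstdint>
#include <vector>

struct Handle { uint32_t index; uint32_t generation; };

template<class T>  // T must be default-constructible in this sketch
class Pool
{
public:
	Handle Create()
	{
		uint32_t i;
		if (!freeList.empty()) { i = freeList.back(); freeList.pop_back(); }
		else { i = (uint32_t)slots.size(); slots.push_back({}); }
		return { i, slots[i].generation };
	}
	void Release(Handle h)
	{
		assert(Get(h));                // catches double-free / stale handles
		slots[h.index].generation++;   // invalidate all outstanding handles to this slot
		freeList.push_back(h.index);
	}
	// Returns null if the handle is stale (the slot was freed or reused).
	T* Get(Handle h)
	{
		if (h.index >= slots.size()) return nullptr;
		Slot& s = slots[h.index];
		return (s.generation == h.generation) ? &s.object : nullptr;
	}
private:
	struct Slot { T object; uint32_t generation = 0; };
	std::vector<Slot> slots;
	std::vector<uint32_t> freeList;
};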




#5295488 responsiveness of main game loop designs

Posted by Hodgman on 07 June 2016 - 07:31 AM

It depends on the type of API being used -- message queues or snapshots.

On Windows, the Keyboard, Mouse and USB game devices use an event queue. So in the "short press" situation, you just get a "press" and a "release" event on the one frame. At low framerates, you may even get a whole load of individual "press" events per frame! So, this is ideal :) The best design of all would be this, along with timestamps attached to the events, so that a low-polling-rate game could still inject them into the simulation with their original spacings.

 

However, for some reason, XInput -- the standard gamepad API for Windows -- is a non-buffered snapshot API... Each time you poll it, it tells you the gamepad state at that point in time only, so, if you don't poll fast enough, yep you could miss events :o

 

There's also some snapshot APIs that avoid dropping short presses, e.g. by returning two bits per key that's polled. One bit specifies whether the key is down at the moment, and the other bit specifies whether it has been down at all during the period of time since the previous poll call.
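A toy sketch of that two-bit scheme (entirely hypothetical - this isn't Win32 or XInput): the OS event handler latches "was down since the last poll", and the poll reads both bits and re-arms the latch, so a short press can't be missed:

#include <cstdint>

class KeySnapshot
{
public:
	// Called from the OS event handler; key is assumed to be 0..255 here.
	void OnKeyDown(int key) { down[key] = true; wasDown[key] = true; }
	void OnKeyUp(int key)   { down[key] = false; }

	// bit 0 = down right now, bit 1 = was down at some point since the previous Poll().
	uint8_t Poll(int key)
	{
		uint8_t bits = (down[key] ? 1 : 0) | (wasDown[key] ? 2 : 0);
		wasDown[key] = down[key];  // re-arm the latch for the next interval
		return bits;               // a press+release between polls still reports bit 1
	}
private:
	bool down[256] = {};
	bool wasDown[256] = {};
};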

 

I'm not sure about GetAsyncKeyState - it might have XInput's flaw, or it might be 'sticky', where keys only "unpress" after being polled...?




#5295439 Is redundant state checking still a thing?

Posted by Hodgman on 07 June 2016 - 01:30 AM

The CBuffer in my example above is also 1 cbuffer for all 100k drawcalls, it's just updated with map/unmap before every drawcall.

>> So you're actually dynamically making 100k cbuffers per frame
If map/unmap reallocation is what's making them, then yes.

As far as I remember from "measuring", cbuffer binding alone is much more expensive compared to map/unmap.

 

That's what I meant before:

Side note - updating a constant buffer causes resource renaming within the driver -- your resource handle (D3D COM pointer) now points to a different memory allocation than before, which probably forces D3D to set a whole bunch of internal dirty flags that get checked on the next update.
So, actually updating the constant buffer is probably hiding the cost of a PSSetConstantBuffers call (as it's probably also just setting the same dirty flags, to be checked on next draw).

You can't edit a resource that's in use by the GPU. The GPU is one frame behind the CPU. Therefore in order to make it look like you're editing a resource, the driver is actually performing reallocation. If you update the resource 100k times per frame, you're performing 100k reallocations, and asking a garbage collector to delete them in a few frames' time when the GPU has finished using them.

 

Binding the same resource repeatedly might be cheap, but each one of your draw calls is actually binding different resources. So both of your loops have a high memory allocation cost and resource binding cost per draw call.




#5295416 Is redundant state checking still a thing?

Posted by Hodgman on 06 June 2016 - 09:32 PM

To clarify myself:

I currently draw a low-poly model (I can't tell you the primitive count right now).
I draw that model 100,000 times with different positions (this is why I call Map/Unmap on a ConstantBuffer). The resources needed to draw the object do not change (one texture and one vertex buffer), and in that case, there is no difference if I bind those once (for frame 1) vs binding them every frame.

So you're actually dynamically making 100k cbuffers per frame and handing them all to the garbage collector. In both your loops, this will be the bulk of the cost.

 

Seeing as every draw is using a different cbuffer, the driver does have to emit new resource bindings per draw.
Try pre-creating 100k static cbuffers and pre-filling them with data so you don't need to do this work per frame, and see how that affects performance.

Or just for testing, use a single static cbuffer so that the driver doesn't have to rebind resources per draw, and see how that performs.
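Roughly, the first of those experiments looks like this in D3D11 (sketch only - error handling omitted, and the PerObjectData layout is a placeholder): create the per-object cbuffers once at load time with initial data, so the per-draw loop only binds and draws:

#include <d3d11.h>
#include <vector>

// Placeholder per-object constants; in practice this would be the world matrix etc.
struct PerObjectData { float world[16]; };

// Create one small immutable constant buffer per object, once, at load time.
// Immutable buffers never need renaming/reallocation by the driver.
std::vector<ID3D11Buffer*> CreatePerObjectCBs(ID3D11Device* device,
                                              const std::vector<PerObjectData>& data)
{
	std::vector<ID3D11Buffer*> cbs(data.size(), nullptr);
	for (size_t i = 0; i < data.size(); ++i)
	{
		D3D11_BUFFER_DESC desc = {};
		desc.ByteWidth = sizeof(PerObjectData);  // must be a multiple of 16 bytes
		desc.Usage     = D3D11_USAGE_IMMUTABLE;
		desc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;

		D3D11_SUBRESOURCE_DATA init = {};
		init.pSysMem = &data[i];

		device->CreateBuffer(&desc, &init, &cbs[i]);
	}
	return cbs;
}

// Per frame: no Map/Unmap at all, just bind a pre-made cbuffer and draw.
void DrawAll(ID3D11DeviceContext* context, const std::vector<ID3D11Buffer*>& cbs, UINT indexCount)
{
	for (size_t i = 0; i < cbs.size(); ++i)
	{
		ID3D11Buffer* cb = cbs[i];
		context->VSSetConstantBuffers(0, 1, &cb);
		context->DrawIndexed(indexCount, 0, 0);
	}
}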




#5295414 Do you usually prefix your classes with the letter 'C' or something e...

Posted by Hodgman on 06 June 2016 - 09:26 PM

 

Hey All.
I used to put C for class; the reason was so I could use the actual name of the class as the variable name.
Like CActor Actor; //works for me

as opposed to Actor actor?

 

But then your capitalization is ruined. So instead of C for class, you can use O for object.

Actor OActor;

:D




#5295413 Get the amount of draw calls in runtime

Posted by Hodgman on 06 June 2016 - 09:24 PM

Increment an integer every time you call Draw :P
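e.g. a thin wrapper (hypothetical names) that all your draw calls go through:

#include <d3d11.h>

// Reset the counter at the start of every frame, read it at the end.
static unsigned g_drawCallsThisFrame = 0;

void DrawIndexedCounted(ID3D11DeviceContext* context, UINT indexCount)
{
	++g_drawCallsThisFrame;
	context->DrawIndexed(indexCount, 0, 0);
}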




#5295152 Is redundant state checking still a thing?

Posted by Hodgman on 05 June 2016 - 05:00 PM

Yeah you should still avoid redundant state setting, if you can do so cheaply yourself.

 

Many D3D/GL functions could be as simple as storing some pointers and setting a dirty flag -- with the real cost occurring in the next draw call.

Other functions can have quite a bit of validation overhead. I remember recently measuring an OMSetRenderTargets call as high as 300μs :(
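If you track the last-set state yourself, that redundancy check is just a pointer compare. A rough sketch of such a filter (names are mine, not from any engine):

#include <d3d11.h>

// Minimal redundant-state filter: remember what was last bound and early-out.
class StateCache
{
public:
	void SetPixelShader(ID3D11DeviceContext* context, ID3D11PixelShader* ps)
	{
		if (ps == currentPS)
			return;                          // redundant set: skip the driver call entirely
		currentPS = ps;
		context->PSSetShader(ps, nullptr, 0);
	}
private:
	ID3D11PixelShader* currentPS = nullptr;
};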

 

Side note - updating a constant buffer causes resource renaming within the driver -- your resource handle (D3D COM pointer) now points to a different memory allocation than before, which probably forces D3D to set a whole bunch of internal dirty flags that get checked on the next update.

So, actually updating the constant buffer is probably hiding the cost of a PSSetConstantBuffers call (as it's probably also just setting the same dirty flags, to be checked on next draw).

 

You should use the correct usage hints where it's feasible for you to do so. Immutable allows the driver to greatly simplify memory management for a resource - which could mean CPU time, GPU time, CPU space and/or GPU space savings.




#5294936 responsiveness of main game loop designs

Posted by Hodgman on 04 June 2016 - 06:33 AM

actually the time to be measured is from the end of present, followed by a poll, an update, a render, and a present.

that is only the same as poll-to-photon time if you do nothing between present and polling.

so it's really the time from one present to the next that includes the results of the polling after the first present. and that is the same as input -> photon only if you poll immediately after present.

Nah that's too conservative to cover input->photon time

Given two frames (which poll, update, draw, present), and four keyboard inputs, A, B, C and D:

| Frame 1          | Frame 2          |
| A       B        |CD                |
| Pol,Up,Drw,Prsnt | Pol,Up,Drw,Prsnt |

The user pressed C immediately before a Poll, so it spends close enough to zero time in the hardware and OS processing queues before being picked up by the poll. So C has no extra delay.
B spends about half a frame waiting in an OS queue.
A spends about a whole frame waiting in an OS queue.
D occurred just momentarily after C, but also just missed the Poll, so it will have to wait around for a whole frame (like A did).

So input->photon latency must also include a variable factor of between zero and one frames, or from 0 to 1000/PollingHz milliseconds.

If you just count present->present, you're also not including any GPU processing time whatsoever. The GPU and CPU do not run in close synchronization - and usually have at least one frame of latency between them. Graphics drivers deliberately introduce one frame of latency to ensure that no pipeline stalls can occur and throughput is maintained.
LCDs also buffer their input for at least one frame.
So assuming a decent graphics driver (and rendering code), and a decent LCD, your timeline looks like:

| CPU Frame 0       | CPU Frame 1       | CPU Frame 2       |
| Pol,Up,Drw,Prsnt0 | Pol,Up,Drw,Prsnt1 | Pol,Up,Drw,Prsnt2 |
+-------------------+-------------------+-------------------+
|                   | GPU Frame 0       | GPU Frame 1       |
|                   | Render,    Prsnt0 | Render,    Prsnt1 |
+-------------------+-------------------+-------------------+
|                   |                   | LCD Frame 0       |
|                   |                   | Buffer,    Prsnt0 |

^^ Just to be clear, this is what the timeline of your game basically looks like right now ^^ three different processors, handling the frame in a serial pipeline
 
So just measuring the CPU's present->present timeframe will give you a value that's potentially 3x smaller than the real value.
When you add the effect of input polling causing events to linger in a buffer, your actual input->photon latency is between 3x and 4x the numbers you're calculating.

 

The exception to this "at least three frames" rule is when the CPU/GPU/LCD update rates are all very different.

e.g. if your CPU framerate is 15Hz, GPU framerate is 30Hz, and LCD framerate is 60Hz, then you get:

Max time an event can linger in a queue before being picked up by a Poll: 15Hz / up to 66.7ms

CPU present->present time: 15Hz / 66.7ms

GPU present->present time: 30Hz / 33.3ms

LCD buffering time: 60Hz / 16.7ms

Total: from 116.7ms to 183.3ms, or from 1.75x to 2.75x the CPU frame time (instead of the 3x to 4x for the general rule of thumb).
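A trivial sanity check on that arithmetic (the frame rates are the ones from this example):

#include <cstdio>

int main()
{
	const double cpuFrameMs = 1000.0 / 15.0;  // poll+update+draw+present
	const double gpuFrameMs = 1000.0 / 30.0;
	const double lcdFrameMs = 1000.0 / 60.0;

	// An event lingers in the OS queue for between 0 and one CPU frame before a Poll sees it.
	double minMs = 0.0        + cpuFrameMs + gpuFrameMs + lcdFrameMs;
	double maxMs = cpuFrameMs + cpuFrameMs + gpuFrameMs + lcdFrameMs;
	printf("input->photon: %.1f ms to %.1f ms\n", minMs, maxMs);  // ~116.7 to ~183.3
	return 0;
}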

 

You shouldn't just calculate these values and trust the theory though; get a 240Hz camera and film your keyboard+screen while you strike a key, and count the 240Hz frames that tick by between your finger first touching the keyboard and the LCD showing a response.
On a regular 60Hz game, it should be at least 11 frames in the 240Hz footage (somewhere around 50ms). 
On a 15Hz game, it should be at least 35 frames in the 240Hz footage (somewhere around 150ms).
 
If you fix the caveman download links, I can do some empirical tests with a 240Hz camera for you.

Humans do not act "at 5Hz".  
I can push a button for less than 10 ms and routinely demonstrate how HMI's cannot handle button presses that quick (and we show on an oscilloscope that the button was indeed pressed for 7~12 ms).

Again, 5Hz only got brought up as the rate of high level cognition / conscious experience, and is also approximately the human conscious reaction rate. If you had no idea how far away the button was, forcing you to actually think about whether you've touched it yet and should now release (instead of using muscle memory), you'd end up with much longer press/hold times due to that large thinking/reaction delay.
This is off topic, but striking a button can be controlled by a conscious decision-making process at 1Hz for all it matters, and still achieve a 10ms contact time, as long as you don't have to think too hard about the process itself once it's begun  :P




#5294733 How to retrieve a sampler state in SM5/D3D12

Posted by Hodgman on 02 June 2016 - 06:47 PM

D3D9/SM3 had samplers/textures as a single object.
D3D10/SM4 split them into two separate objects, which must be configured from C++.


#5294732 Depth Buffer in D3D9 for Volumetric Fog

Posted by Hodgman on 02 June 2016 - 06:43 PM

On the page I linked to, he lists a few different FOURCC codes.
INTZ will work on every Dx10 compatible card (Intel/amd/nvidia).

The other ones will work on older (pre-dx10) cards, but are vendor-specific.


#5294634 Depth Buffer in D3D9 for Volumetric Fog

Posted by Hodgman on 02 June 2016 - 07:35 AM

D3D9 originally didn't support reading from the depth buffer as a texture, but there's an extension that works on D3D10-capable GPUs. You have to create an INTZ texture, which you can use as a depth buffer, and then read from as a texture.
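Roughly, the INTZ path looks like this (D3D9 sketch only - error handling omitted, and device/sizes/the original depth surface are placeholders from your own setup):

#include <d3d9.h>

// Create a depth texture using the INTZ FOURCC "format hack".
IDirect3DTexture9* CreateIntzDepthTexture(IDirect3DDevice9* device, UINT width, UINT height)
{
	IDirect3DTexture9* depthTex = NULL;
	device->CreateTexture(width, height, 1, D3DUSAGE_DEPTHSTENCIL,
	                      (D3DFORMAT)MAKEFOURCC('I','N','T','Z'),
	                      D3DPOOL_DEFAULT, &depthTex, NULL);
	return depthTex;
}

void Example(IDirect3DDevice9* device, IDirect3DTexture9* depthTex,
             IDirect3DSurface9* originalDepth)
{
	// Bind the INTZ texture's surface as the depth buffer while rendering the scene...
	IDirect3DSurface9* depthSurface = NULL;
	depthTex->GetSurfaceLevel(0, &depthSurface);
	device->SetDepthStencilSurface(depthSurface);
	// ...draw the scene...

	// ...then restore the original depth buffer and read the same resource as a texture.
	device->SetDepthStencilSurface(originalDepth);
	device->SetTexture(0, depthTex);
	depthSurface->Release();
}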

 

Alternatively, what he seems to be suggesting is to make a regular color texture with a float32 format... That's a weird article though. It's a mixture of advice from 2005 to 2015.




#5294610 How to retrieve a sampler state in SM5/D3D12

Posted by Hodgman on 02 June 2016 - 03:43 AM

That syntax within the {} section, initializing the sampler state members, actually isn't HLSL code. It's Microsoft FX code.

If you're using the old FX compiler, it gets used.
If you use the plain HLSL compiler, it gets ignored, even in DX9!

IMHO, you shouldn't use the FX system, even on DX9, which means you need to initialize sampler states yourself via the API.
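On D3D11, for example, that means something like the following in place of the FX sampler block (sketch only - slot 0 is assumed to match register(s0) in the shader):

#include <d3d11.h>

void BindLinearWrapSampler(ID3D11Device* device, ID3D11DeviceContext* context)
{
	// What the FX "sampler_state { ... }" block used to do, done explicitly via the API.
	D3D11_SAMPLER_DESC desc = {};
	desc.Filter         = D3D11_FILTER_MIN_MAG_MIP_LINEAR;
	desc.AddressU       = D3D11_TEXTURE_ADDRESS_WRAP;
	desc.AddressV       = D3D11_TEXTURE_ADDRESS_WRAP;
	desc.AddressW       = D3D11_TEXTURE_ADDRESS_WRAP;
	desc.ComparisonFunc = D3D11_COMPARISON_NEVER;
	desc.MaxLOD         = D3D11_FLOAT32_MAX;

	ID3D11SamplerState* sampler = NULL;
	device->CreateSamplerState(&desc, &sampler);
	context->PSSetSamplers(0, 1, &sampler);  // s0 in the pixel shader
}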



