
#5150823 Multithreading vs variable time per frame

Posted by Hodgman on 01 May 2014 - 05:20 PM

Just use mutexes to control the handing over of one "frame" of data from one thread to another. Volatile is only needed if you're trying to reinvent mutexes yourself -- in most projects, use of the volatile keyword is simply banned: if you try to use it, the team will just assume that your code is buggy, because it most likely is (unless you're the kind of person who understands all the OS, compiler and hardware specific details required to properly implement your own synchronization primitives, and has written multiple unit tests that have been running for six months across different hardware configurations just to be sure).


You might want to have storage for three frames' worth of data. Say the renderer is reading one of those buffers, and the updater is writing one. If the updater finishes its work, it can't start a new frame because there are no free buffers (it must wait for the renderer to finish reading one). A third buffer lets the updater start a new frame immediately. YMMV.

Make an array of structures that describe which thread owns each buffer, and on which frame it was generated. Make a mutex that protects these structures.
When the render thread starts a frame, lock the mutex and loop through the structure to find the most recent frame that isn't owned by any thread at the moment. Mark it as owned by the renderer, unlock the mutex and render from it. When you're done, lock the mutex, mark it as non-owned and unlock the mutex.
When the update thread starts a frame, lock the mutex, find the oldest non-owned buffer, mark it as owned by the updater, unlock, write a new frame to it, lock, mark as non-owned, unlock.
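That scheme can be sketched in C++ roughly like this. This is a minimal illustration, not production code; all the names (TripleBuffer, acquireForRender, publish, etc.) are made up for the example:

```cpp
#include <cstdint>
#include <mutex>

// Ownership states for each buffer slot.
enum class Owner { None, Updater, Renderer };

struct BufferSlot {
    Owner owner = Owner::None;
    uint64_t frame = 0;          // frame number this buffer's data was generated on
};

struct TripleBuffer {
    BufferSlot slots[3];
    std::mutex mutex;            // protects the ownership structures

    // Render thread: take the most recent frame not owned by any thread.
    int acquireForRender() {
        std::lock_guard<std::mutex> lock(mutex);
        int best = -1;
        for (int i = 0; i < 3; ++i)
            if (slots[i].owner == Owner::None &&
                (best < 0 || slots[i].frame > slots[best].frame))
                best = i;
        if (best >= 0) slots[best].owner = Owner::Renderer;
        return best;             // -1: no frame available yet
    }

    // Update thread: take the oldest un-owned buffer to overwrite.
    int acquireForUpdate() {
        std::lock_guard<std::mutex> lock(mutex);
        int best = -1;
        for (int i = 0; i < 3; ++i)
            if (slots[i].owner == Owner::None &&
                (best < 0 || slots[i].frame < slots[best].frame))
                best = i;
        if (best >= 0) slots[best].owner = Owner::Updater;
        return best;
    }

    // Renderer is done reading: just mark the slot un-owned.
    void releaseRender(int i) {
        std::lock_guard<std::mutex> lock(mutex);
        slots[i].owner = Owner::None;
    }

    // Updater is done writing: mark un-owned and stamp the new frame number.
    void publish(int i, uint64_t newFrame) {
        std::lock_guard<std::mutex> lock(mutex);
        slots[i].owner = Owner::None;
        slots[i].frame = newFrame;
    }
};
```

The mutex is only held for the few instructions needed to scan/update the ownership table, never while actually reading or writing a frame's worth of data.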

#5150526 why unreal engine wait the gpu to finish rendering after every Present?

Posted by Hodgman on 30 April 2014 - 06:24 AM

It seems to be waiting for an event that occurred at the beginning of the previous frame.

Whenever you send commands to the GPU, they are actually just being written into a command buffer. The GPU may be executing D3D/GL calls 1,2,3+ frames after the CPU makes those calls.

I would assume that this is there as a hack to ensure that the GPU is only ever 1 frame behind the CPU, and no more than that. It basically seems like a hack to disable triple buffering, or any other kind of excessive buffering that a driver might automatically be performing. I guess this might help reduce input-to-screen latency on some systems?

#5150485 Multithreading vs variable time per frame

Posted by Hodgman on 29 April 2014 - 10:54 PM

Read these

^^ That.
You can have different update and rendering frequencies without the need for any threads at all. Games are (almost) real-time systems, with a period measured in milliseconds. So as long as the time periods of your periodic systems are also in a similar range (e.g. 5-50ms), then there's no need to complicate things by using extra threads for that.

If you want the update code to run at some fixed frequency, while the rendering just runs as often as it can, then you'd use a main loop something like this:

lastTime = getTime()
loop forever:
  now = getTime()
  deltaTime = now - lastTime
  lastTime = now

  accumulator += deltaTime
  while( accumulator >= updateStepSize )
    accumulator -= updateStepSize
    update( updateStepSize )

  render()
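Here's a minimal, runnable sketch of that accumulator logic in C++. Time is simulated (you feed in each frame's duration) so the behaviour is deterministic; in a real game you'd read a clock instead. The FixedStepLoop name and millisecond units are just choices for the example:

```cpp
// Fixed-timestep update with free-running render, as in the pseudocode above.
struct FixedStepLoop {
    double accumulator = 0.0;
    double updateStepSize;       // e.g. 10ms per update tick
    int updatesRun = 0;
    int framesRendered = 0;

    explicit FixedStepLoop(double stepMs) : updateStepSize(stepMs) {}

    // One pass of the main loop: deltaTimeMs is this frame's measured duration.
    void frame(double deltaTimeMs) {
        accumulator += deltaTimeMs;
        while (accumulator >= updateStepSize) {
            accumulator -= updateStepSize;
            ++updatesRun;        // update( updateStepSize ) would go here
        }
        ++framesRendered;        // render() would go here
    }
};
```

With a 10ms step and two 16ms frames, the first frame runs one update and the second runs two -- rendering happens once per frame regardless.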

The real use of threads within a game engine is just to unlock the extra computational power of multi-core CPUs. Inside the update/render functions above, the main thread would cooperate with all the other threads in your pool to complete those tasks in parallel.

Data-parallel works very well in a lot of cases where thread pools and task DAGs lose horribly due to Amdahl's law.

All the DAG/Job systems that I've had experience with have been data-parallel.
The two solutions I've seen are, either:

  • data-parallel systems will put many jobs into the queue, representing chunks of their data that can be executed in parallel,
    • Usually there's another tier in here - e.g. A system has 100000 items - it might submit a "job list", which might contain 100 sub-jobs that each update 100 of the system's items. An atomic counter in the job-list is used to allow every thread in the pool to continually pop different sub-jobs. Once all the sub-jobs have been popped, the job-list itself is removed from the pool's queue. The DAG is actually one of job-lists, not jobs.
  • or, every job is assumed to be data parallel, with a numItems constant associated with it, and begin/end input variables in the job entry point. Many threads can execute the same job, specifying different, non-overlapping begin/end ranges within begin=0 to end=numItems. Once the job has been run such that every item in that full range has been executed exactly once, then it's marked as being complete. Non-scalable / non-data-parallel jobs have numItems=1.
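The second scheme can be sketched like this -- an atomic counter hands out non-overlapping ranges of the job's items to whichever threads call in. All names are illustrative:

```cpp
#include <algorithm>
#include <atomic>

// One data-parallel job covering numItems items. Worker threads repeatedly
// claim [begin, end) chunks via an atomic counter until every item in the
// full range has been executed exactly once.
struct DataParallelJob {
    const int numItems;
    std::atomic<int> nextItem{0};

    explicit DataParallelJob(int n) : numItems(n) {}

    // Each worker calls this in a loop; returns false once nothing is left.
    template <typename Fn>
    bool runChunk(int chunkSize, Fn&& perItem) {
        int begin = nextItem.fetch_add(chunkSize);  // claim a range atomically
        if (begin >= numItems) return false;
        int end = std::min(begin + chunkSize, numItems);
        for (int i = begin; i < end; ++i)
            perItem(i);                             // the job's actual work
        return true;
    }
};
```

In a real pool, every worker thread would spin on `runChunk` for the same job object; since the claimed ranges never overlap, no locking is needed around the per-item work.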

Are you saying that a data-parallel system wouldn't have a fixed pool of threads, or maybe you're using connotations of the phrase "thread pool" that I'm missing?

#5150448 Depth-VR , A new Virtual-Reality device needs your suggestion!

Posted by Hodgman on 29 April 2014 - 06:24 PM

Can it be used as a regular head-tracker, as an alternative to TrackIR or FreeTrack?

#5150328 Multithreading vs variable time per frame

Posted by Hodgman on 29 April 2014 - 07:25 AM

If you end up with something like a mutex per object, which is there to allow multiple threads to randomly share it.... Then you're completely on the wrong track ;-)

The standard model in use by engines these days is not to have a fixed number of threads that each run their own system. Instead you have a pool of threads (sized depending on the number of cores in the system) which are all pretty much equal. The game is then decomposed into a directed-acyclic-graph of small "jobs". A job is a function that reads some inputs and writes to some outputs (no global shared state allowed!). If the output of one job is the input of another, that's a dependency that affects scheduling (the dependent job cannot start until its input actually exists). From there a schedule or dependency graph can be constructed, so each job can tell whether it's allowed to start yet or not. Every thread then runs the same loop, trying to pop jobs from a job-queue and executing them (if allowed).

There's no error-prone mutex locking required, there's no shared state to create potential race-conditions, and it takes full advantage of 1, 2, 4, 8, or more core CPUs.

That's the ideal version. Often engines will still use one of those threads as a "main thread", still in a typical single-threaded style, but anything computationally expensive will be diced up into jobs and pushed into the job queue.
Some platforms have restrictions like you mention, so you may have to restrict all "rendering API jobs" to a particular thread too.
But overall, the "job queue" or "job graph" (or flow-based) approach has become the de facto standard in game engines, rather than mutexes (shared state) or message-passing styles of concurrency.
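The scheduling idea can be shown in a single-threaded sketch: each job carries a count of unfinished dependencies and a list of dependents, and becomes runnable when the count hits zero. A real engine would have every thread in the pool popping from the ready queue; the bookkeeping is the same. Names here are illustrative:

```cpp
#include <functional>
#include <queue>
#include <vector>

// A job in the DAG: work to run, how many inputs are still pending, and
// which jobs are waiting on this one's output.
struct Job {
    std::function<void()> run;
    int unfinishedDeps = 0;
    std::vector<int> dependents;   // indices of jobs waiting on this one
};

// Execute the whole graph in dependency order.
inline void executeGraph(std::vector<Job>& jobs) {
    std::queue<int> ready;
    for (int i = 0; i < (int)jobs.size(); ++i)
        if (jobs[i].unfinishedDeps == 0)
            ready.push(i);                  // no inputs pending: runnable now
    while (!ready.empty()) {
        int j = ready.front(); ready.pop();
        jobs[j].run();
        for (int d : jobs[j].dependents)    // this job's outputs now exist
            if (--jobs[d].unfinishedDeps == 0)
                ready.push(d);
    }
}
```

Note there's no shared mutable state between jobs -- the only synchronization point in a threaded version would be the ready queue and the dependency counters.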

Also, message-passing should always be a default choice over shared-state too ;-)

#5150243 Do modern pc hardware based arcade cabinet games run on an os?

Posted by Hodgman on 28 April 2014 - 10:02 PM

I worked in a company that built similar products (big boxes with buttons and a screen that you play "games" on). We had some that used Windows XP Embedded, and some that used Linux.

It's absolutely not worth developing your own OS when other options already exist.


You'd actually be surprised how many other devices around the place are using common OS's like Linux -- e.g. your network router is a very likely candidate.

#5150236 loop break

Posted by Hodgman on 28 April 2014 - 08:55 PM

Moreover, it seems like you're arguing my method, rather than my result.

Exactly. I'm not attacking you or arguing the results. I'm honestly interested in what the results are. No need to get super defensive.
...I was just pointing out that by using volatile, you've forced the compiler's hand, so you may as well have just hand-written the assembly; it's not a test of the compiler, so seeing that the method is artificial, the results are also artificially achieved.

You can't use a forced example to demonstrate the optimizer's actual behavior!

Then that nullifies all of your suggested modifications to my exercise, and we're at a stalemate.

Not quite. I was suggesting that your example forces the compiler to only produce very specific assembly -- that your use of volatile has explicitly told the compiler exactly how it's allowed to construct the loops, and that it's not allowed to make any changes/optimizations.
Real production code will not use the volatile keyword like this (in 99.99% of cases) so it's not at all natural code. Write it naturally and perhaps the compiler can convert a forwards loop into a superior, optimal backwards loop?

If you're testing whether the compiler can transform between up/down

I'm not.

Oh, you threw me off with this, where you tried to prove that the up/down versions produced different code.

You do however realize that an optimizing compiler will produce exactly the same code whether you use the obfuscated version or not?

[up.c ... down.c ...] They aren't identical, and the counting down version has one fewer instruction (compiled with gcc -O3 -S). Yes, cleaner code is better. However, it is all too easy to say that they compile to the same instructions.

#5150212 loop break

Posted by Hodgman on 28 April 2014 - 06:28 PM

...Isn't that the point, that counting down doesn't fetch on every iteration? It is cached into a local variable; if you look at the ASM, it mov's the value from an offset from the stack pointer. That sounds local to me. It's doing exactly what it would if there were too many values held in the loop, and the counter spilled to RAM.

Counting up wouldn't have fetched each iteration if you'd cached the volatile into a local. If you want to simulate register pressure, then use a lot of variables in the loop body rather than disabling the optimizer via volatile.

I posted the ASM. You can see exactly what happened.

...and it's not necessarily the same thing that the optimizer would've actually done in real code. You've written an example that does exactly what you assume the compiler will do, but it can't possibly show us what the compiler would actually do in a real situation.
You can't use a forced example to demonstrate the optimizer's actual behavior! ;-)

In other words, allow the compiler to see that it's an unchanging number and inline the constant limit (I've tried that).

You can use a global, or a parameter from another function, or get the value from another translation unit, etc. That will stop the compiler from hard-coding the loop limit, so it handles the limit the same way it would in real code -- rather than forcing its hand with code that gives it zero leeway.

...In other words, just don't use signed values, so that I can feel good about counting up all the time? I fail to see the point of actually modifying the use case to suit the algorithm, rather than the other way around.

Unsigned values work for both up and down. If you want to compare those two loops fairly then you've got to make them as equivalent as possible -- if one uses "not zero", the other should use "not limit", etc. Neither uses negative values, so signed/unsigned is neutral here, but it would be interesting to see if it affects the codegen (many code bases use uint over int by default).
IMO it's also more common to write !=limit and !=0 rather than <limit and >0...

I don't mean to be abrasive, but did you read the assembly output? It clearly moves zero to each of the array elements, exactly as the C code says (array is volatile qualified). It's guaranteed to have not been optimized out, because it says right in the resulting code that it is doing it.

If you're testing whether the compiler can transform between up/down, then the use of volatile on the output array completely voids your test -- volatile writes can't be reordered, so you've told the compiler explicitly not to modify your iteration order!
So to test this behavior of the optimizer, you need to remove volatile, and then add an alternative method of stopping the code from being optimized away entirely.
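Putting those suggestions together, a fairer test might look something like this: no volatile anywhere, the limit comes in as a parameter (so it can't be inlined as a constant), both loops use a != condition, and the result is returned to the caller so neither loop can be optimized away entirely. This is a sketch of the testing approach being suggested, not the code from the thread:

```cpp
#include <cstddef>

// Count-up version: the optimizer is free to transform this however it likes.
unsigned sumUp(const unsigned* data, std::size_t limit) {
    unsigned total = 0;
    for (std::size_t i = 0; i != limit; ++i)   // != rather than <
        total += data[i];
    return total;
}

// Count-down version: terminates on "not zero", the claimed-cheaper condition.
unsigned sumDown(const unsigned* data, std::size_t limit) {
    unsigned total = 0;
    for (std::size_t i = limit; i != 0; --i)
        total += data[i - 1];
    return total;
}
```

Compiling both with e.g. `gcc -O3 -S` and diffing the assembly would then show what the optimizer actually does with natural code, including whether it converts one loop direction into the other.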

#5150195 loop break

Posted by Hodgman on 28 April 2014 - 04:55 PM

Ectara, you don't cache the limit variable into a local in the up case, so it has to be fetched every iteration, which doesn't happen in your down case.
The use of volatile is also going to do god knows what to the optimizer - you can't compare code that touches that keyword to any other code.

Remove volatile, add that local, maybe also use unsigneds so the compiler doesn't have to reason about negative numbers, make the up condition a != rather than a <, and they'll be more equivalent. Then add a loop that passes the output array's values into a printf or something to ensure they're not optimized out.

#5150098 Texture2DArray

Posted by Hodgman on 28 April 2014 - 06:24 AM

Can you just put all 250 items into the array at the start, and not bother about updating it?

#5150097 Using same graphics framework across different projects?

Posted by Hodgman on 28 April 2014 - 06:20 AM

Just use branches/tags/whatever-your-VCS-calls them for each game/project.
When you update the engine for a new project, the old projects will still be on their old engine branch so will keep working. If you want to go back and update your old projects, you can merge/rebase/etc the latest branch over their old branch (and then manually fix the old code if there were any breaking changes introduced by the update).

#5150057 Handling shaders without D3DX

Posted by Hodgman on 28 April 2014 - 02:05 AM

If you're comfortable with D3D9 and D3DX then stick with them. D3D9 isn't being updated any more, but neither is D3DX.


To answer your question, you can either compile your shaders using D3DXCompileShader, or using fxc.exe ahead of time.

If you use D3DX to compile your shaders, you get a ID3DXConstantTable which tells you which indices match up with which names.

There's also options in fxc.exe to get it to tell you this information.


If you're not using D3DX's effects, then you should use this information to build your own name->index map structures for each of your shaders. 


Alternatively, you can specify the indices to use when writing your actual shader code - e.g.

float myData : register(c0);//constant register #0

sampler myTexture : register(s0);//a texture in sampler register #0


And yes, in D3D9, "samplers" are texture slots.

#5150055 Instance data in Vertex Buffer vs CBuffer+SV_InstanceID

Posted by Hodgman on 28 April 2014 - 01:56 AM

Also, if you want to use per-instance data within a pixel shader, the InstanceID+buffer method probably makes far more sense than reading it from a vertex-stream in the VS, then passing it down to the PS via an interpolator.


On modern hardware, there isn't really any dedicated vertex-fetch / input-assembler hardware left. Your input layout structure gets compiled into shader code (to fetch the vertex data from the vertex buffers) and this code is appended onto the front of your VS program by the driver!

#5150032 DX11 render to texture issue...

Posted by Hodgman on 27 April 2014 - 11:35 PM

All I am asking is "is there a single blend state method that will work for both RTT and render to back buffer?".....
Yes, there is no difference between the back buffer and RTT (every render-target works the same way), so your problem is elsewhere.


With back-buffer:

Set back-buffer as render-target

Set blend mode

Draw scene


With RTT:

Set texture as render-target

set blend mode

Draw scene

SET NEW BLEND STATE (ensure blending is disabled)

Set back-buffer as render-target

Draw quad using texture


With your RTT method, are you doing the bolded step?

#5150025 Preventing code duplication

Posted by Hodgman on 27 April 2014 - 10:43 PM

You could just as well express the same functionality as free functions working on typed data structures, and if you do that, the linker is better at stripping out the parts you don't use.

By the time it gets to the linker, it's exactly the same either way (assuming the equivalent code is written in both styles)...

Yeah, there are some counter-examples, such as virtual functions, which won't be discarded... but this is a straw-man, because the same would occur with free functions if you wrote the equivalent code that manually created a table of pointers to functions (i.e. the functions whose addresses you capture will also no longer be discarded).

I thought of ECS, but isn't that a bit overkill, or is that the way to go in this scenario? Does anyone know another, preferably OO, way of dealing with this problem?

The core of ECS is "composition" -- the core of OO is also "composition" (if you look up "inheritance vs composition" you'll find that OO teaches that you should default to using composition, and only use inheritance where necessary, which is not often). Under both those paradigms, you break your code up into small pieces which each only solve a single problem at a time, and then build complex parts by composing simple parts together.
Many ECS articles compare themselves against "incorrect" OO styles, usually "deep inheritance" entity hierarchies where inheritance has been completely over-used, and then offer ECS as a solution to that problem. I've never seen an ECS article that compares ECS against proper composition-based OO, though.
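For a concrete picture of what composition-based OO means here, consider this tiny sketch (all class names are made up for illustration): rather than `MovingEntity : Entity`, the entity simply owns the parts it needs, and parts it doesn't need are just absent.

```cpp
#include <memory>

// Each part solves exactly one problem.
struct Position { float x = 0, y = 0; };

struct Mover {                       // one job: integrate velocity
    float vx = 0, vy = 0;
    void step(Position& p, float dt) { p.x += vx * dt; p.y += vy * dt; }
};

// An entity is composed from parts -- no inheritance hierarchy required.
struct Entity {
    Position pos;
    std::unique_ptr<Mover> mover;    // optional: not every entity moves

    void update(float dt) {
        if (mover) mover->step(pos, dt);
    }
};
```

Complex behaviours are then built by adding more small parts, which is the same structural idea ECS is built on.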

Do you have any concrete ideas about separating the update and drawing code here?

This is pretty vague, but move all the drawing code out into classes that only do drawing stuff. Don't have game logic and graphics structures intertwined.
You can actually have two completely different worlds/scenes/collections of objects -- one list of game objects, and another list of graphics objects. The update part of your game loop can do stuff to the first list, and the draw part of your game loop can do stuff to the second list. The server can just not create or use the second list. On the client, the game objects obviously need to do stuff to the graphics objects (such as move them around, etc), so the items in the first list can contain pointers into their 'partners' in the second list, but on the server these pointers can just be NULL.