Sqeaky

  1. Honestly, for the thread count question, that is more or less the response I was expecting, but I wanted to hear differently. If it is super simple I might experiment with it, but otherwise I will leave that alone.

[quote]These are tough questions which are very specific to whatever threading system you end up using. Assume we get a 32-core chip in the near future (not unlikely); I would probably never schedule more than 8 threads to look at/wait on a single work queue, due to hardware-level problems which would likely come up when pulling data off the queue. Now, each of those threads could be its own little "team" of threads to further subdivide any large chunks of work, so in answer it would really end up being a question of your design, the way you use it, and actual hardware performance.[/quote]

There will definitely be novel ways to scale as the hardware improves this way. I would want to measure and see what would happen. Though, I honestly can't see the "work queue" in my team's design being the bottleneck for a while. The atomic variable in each callback will only be written once per callback. Even on a collision (by collision I mean when two threads try to start the same callback at almost the same time), the thread that doesn't get the callback will not write to any shared resource, so I don't think it will force any caching issues beyond checking for changes in other caches. I really think the bottleneck will be determined by how finely the work can be subdivided. If it isn't divided up enough then we won't have work for each thread. Considering the caching issue and syncing, at some point gains diminish as work/callbacks are made smaller, because of the increase in atomic operations. I just think that with good-sized work chunks and the reported performance of atomic variables, the natural limit will be higher than 8 cores.

After thinking about it for a while, I realised there is another point to the design that is odd. Most of the "work queue" designs I have seen are a single point that needs to be updated atomically (I usually see a mutex ensuring this). They really do behave like containers that hold work, and items come out of the container to represent that the work is in a thread somewhere. But if you distribute the atomic portion around, and keep the smallest possible atomic portion on each of the entries storing how complete that work is, then you can have multiple threads searching through or even adjusting the members of the "work queue" (see the sketch below). The trade-off I see is that, without removing items from the queue, each thread has to search through the whole queue. In a bandwidth-limited scenario that would seem desirable (assuming the queue was small enough to be cached). Another drawback is that the work queue must be relatively static: if lots of threads are reading from it, then even one change to the queue will wind up serialising a lot of parallel work.

I really would like to test this. Does anyone have a spare 4-socket board loaded with quad cores I could test on for a while? I'm not picky; if you can only spare hex- or octo-core chips those would work too. ;)
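Roughly, the contrast I mean looks like this (names made up, heavily simplified); the distributed version's only shared writes are one atomic flag per entry, while the usual container shape funnels every pop through the same mutex:

[code]
#include <atomic>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

// The usual "container of work" shape: one lock guards the whole structure,
// so every thread taking work serialises on the same mutex.
struct CentralQueue
{
    std::mutex Guard;
    std::queue<std::function<void()>> Work;

    bool PopAndRun()
    {
        std::function<void()> Job;
        {
            std::lock_guard<std::mutex> Lock(Guard);
            if(Work.empty())
                return false;
            Job = std::move(Work.front());
            Work.pop();
        }
        Job();
        return true;
    }
};

// The distributed shape: the list itself never changes during a frame, and
// the only shared writes are one atomic flag per entry.
struct FlaggedWork
{
    std::function<void()> Work;
    std::atomic<bool> Claimed{false};
};

void ScanAndRun(std::vector<FlaggedWork*>& Entries)
{
    for(FlaggedWork* Entry : Entries)
    {
        bool Expected = false;
        if(Entry->Claimed.compare_exchange_strong(Expected, true))
            Entry->Work();  // we won the claim; a losing thread writes nothing shared
    }
}
[/code]

Nothing in the second version ever removes an entry, which is exactly why the list has to stay static for the duration of a frame.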
  2. That makes me feel a lot better, that other people have limitations too. I still have tons of questions though. When isn't it best to simply use the number of threads the hardware claims to support? I assume that it would be good all the time, but how often is it optimal? Would it be worthwhile to try running more threads on a single-threaded machine? Does it make sense to run fewer threads on a multicore machine with a heavyweight OS/other apps?

An expanded version of the algorithm we came up with would attempt to change the number of threads the game spawns based on performance. It would track the length of the pause before reaching 1/60 of a second (or the time spent in excess), and it would attempt to maximize the length of that pause by changing the thread count. The first algorithm used a rolling average of the past ten frames, compared it to the average from 10 to 20 frames ago, and used that to increase or decrease the number of threads spawned. The reasons we thought this would help were twofold. If the system had a significant load from the OS or another source, this would reduce the thread count to minimise contention with it. Conversely, if a long time was spent waiting on cache misses, disk loading, or other asynchronous work, then extra threads would be scheduled by the OS during those times to completely saturate computing resources. (A rough sketch of this adjustment loop follows below.) Have you ever tried anything like this? We have only anecdotal evidence supporting that this would help, things like software compiles on single-core machines finishing 2.5x as fast with 3 threads as with 1. I know compiling is very different from most game workloads, but might some games in some environments perform better with +1 or -1 from the number of hardware-supported threads?

-----

Just as an aside, I played around more with removing physics as a callback and running it in a more dedicated fashion, and it looks promising. It will scale a little better and waste less time on existing systems, and it gives us room to make it scale better in the future. Here is a smattering of what I was playing around with; these all let physics monopolise the hardware first, then run through the list of callbacks sorted by dependants, then previous execution time: This clearly shows how this won't scale arbitrarily well, but it does improve with the 'very expanded' callback set up to 7 cores (well, I guess there is a little bit of physics running on that last core). This could be similar to what might actually run, but I had to lengthen a large number of the callbacks disproportionately so their names would fit. Otherwise it is just a bunch more of what is running around in my head, and another example of how more, smaller callbacks scale better.
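The adjustment loop, very roughly (names made up, and the up/down rule is a naive simplification of what we actually discussed):

[code]
#include <cstddef>
#include <deque>
#include <numeric>

// Hypothetical tuner. SlackMicroseconds is the time left over after a frame,
// before the 1/60 second target (negative if we overran the budget).
class ThreadCountTuner
{
    std::deque<double> Slack;   // newest sample at the back, at most 20 kept
    std::size_t Threads;
public:
    explicit ThreadCountTuner(std::size_t Initial) : Threads(Initial) {}

    std::size_t RecordFrame(double SlackMicroseconds)
    {
        Slack.push_back(SlackMicroseconds);
        if(Slack.size() > 20)
            Slack.pop_front();
        if(Slack.size() == 20)
        {
            // Average slack from frames 10-20 ago versus the last 10 frames.
            double Older = std::accumulate(Slack.begin(), Slack.begin() + 10, 0.0) / 10.0;
            double Newer = std::accumulate(Slack.begin() + 10, Slack.end(), 0.0) / 10.0;
            // Very naive rule: more slack lately means the current count is
            // doing well, so try one more thread; less slack means back off one.
            if(Newer > Older)
                ++Threads;
            else if(Threads > 1)
                --Threads;
        }
        return Threads;
    }
};
[/code]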
  3. I had not thought about such an involved scenario as your seeker example, but doing so definitely highlights weaknesses in the design. After your explanation I definitely understand what you mean by false sharing. You also make it clear that this will not scale to 8+ cores well without the addition of many more callbacks. It also looks like this design will suffer greatly with a high number of weak cores (I currently know of no such platform, but it could exist some day).

The 'staged' idea definitely sounds like it would work well if we had the work already decomposed into smallish threadsafe chunks with very similar execution times. However, currently we don't. But I do see a solution that I think will work in the real world and move us toward a system like you describe. When the callbacks are small and numerous enough, my idea really does look like your idea. I am sure with enough revisions and optimizations our two ideas would converge.

Completely overhauling the game code is infeasible at this point. Even though we are using only libraries that give us source code access, it would simply be too much work to parallelize each library we depend on. I have to work with the reality that I am on a small team and we are using millions of lines of code from other sources. Our glue, our mezzanine, for holding these various libraries and the game code together is less than 70 KLOC (thousands of lines of code). For reference, Bullet physics is over 500 KLOC, Ogre3D is about 5,500 KLOC, and we use about 6 other libraries I haven't counted (but I think they are all bigger than 10 KLOC).

We can break our largest callbacks into smaller ones in a piecemeal fashion, though. This way we could shrink the execution time of the critical path, and gradually, as future versions come out, it would run better with each revision. Eventually we would have the work split up to the point where one missile was in one callback and the other missile was in another (a tiny sketch of this kind of split follows below). Bullet also provides some facilities for multithreading. A hybrid design might be feasible: if physics wasn't a callback (so nothing could depend on it), and after (or before) all the other callbacks run it was run in parallel across the optimum number of threads, this would be possible with only minimal changes to the existing codebase and would increase scalability.

Does progressively dissecting callbacks into smaller ones seem reasonable? Does the design where physics is pulled out of the callbacks and run on all threads seem reasonable? What flaws exist in these that would keep them from working? Is there something I am not seeing that would make any of these plans worse than just leaving the code single threaded? Is there some way to parallelize existing code that I might not be aware of that would make the 'staged' plan more workable?

For reference, currently we have no central queue. I am aware of the issues with repeated use of the atomic variables with respect to cache consistency across multiple cores. Currently we have a smallish list of dependencies, but we also haven't performed the complete audit of the existing code that would be required to know the exact dependencies. We just have an understanding that there is very little shared data between managers currently. I expect that a few more dependencies would pop up if we did a complete audit, but I truly expect the full dependency graph would be simple at present.

edit: Grammar fixes and commented on the central queue.
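The piecemeal split, roughly (names made up; ownership and the real update logic are omitted); one monolithic missile-update callback becomes two callbacks that share the same prerequisite list, so the scheduler can hand them to different threads:

[code]
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical minimal callback record: just work plus prerequisites.
struct Callback
{
    std::function<void()> Work;
    std::vector<Callback*> Prereqs;
};

// Before: one callback updates every missile. After: the same range is split
// into two callbacks with identical prerequisites (whatever produces the
// targeting data), so they can run on different threads.
std::vector<Callback> SplitMissileUpdate(
        std::function<void(std::size_t, std::size_t)> UpdateRange,
        std::size_t MissileCount,
        const std::vector<Callback*>& SharedPrereqs)
{
    std::vector<Callback> Pieces;
    std::size_t Half = MissileCount / 2;
    Pieces.push_back(Callback{ [=]{ UpdateRange(0, Half); }, SharedPrereqs });
    Pieces.push_back(Callback{ [=]{ UpdateRange(Half, MissileCount); }, SharedPrereqs });
    return Pieces;
}
[/code]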
  4. Sorry for the double post, but now I have pictures! First let's start with all the callbacks. Their width represents their execution time: the wider, the longer to execute. Here is a smallish amount, about the smallest reasonable amount that works with this algorithm. I guess it could work with fewer, but it wouldn't really make sense. The sizes are fairly arbitrary, but I made sure to make physics and rendering the largest, because they currently are.

In normal serialized operation every task would be run in whatever order met dependencies, so any shrinking of the execution time is a gain. However, we can determine the critical path by looking at the dependencies and finding the longest dependency chain. In this example it is Events, Post Events, Physics, Rendering. So if the algorithm results in exactly this ordering in one thread every frame, then it cannot be improved without changing the structure of the rest of the game/engine (I hate calling it an engine; it's like 50k lines, it's so small).

While doing this I found a flaw in the algorithm, and it had to do with how I was prioritizing the callbacks by execution time. This isn't bad in many situations, but if there are callbacks that have no dependencies and take longer to run than something in the critical path, there will be a needless slowdown. I have an amended algorithm that prioritises better. Here are the callbacks sorted by a running average of execution time (ties are broken by dependency count; more dependencies results in higher priority):

The job/callback/task/whatever-we-want-to-call-it scheduler is a simple function that starts at the top of this list and works down it looking for a task to run (see the sketch at the end of this post). First it checks whether the prereqs/dependencies are completed, then it attempts to run the task after setting an atomic flag. If it fails to set the flag, that means another thread is already running the callback, and it will go find another callback. The algorithm starts by launching the optimal number of threads; for this example let's assume 4. Each of those threads calls the job scheduler function, and once a callback finishes it calls the job scheduler again. Each row represents a thread, the left represents the start of the frame, and a gap represents unused CPU resources. Here is what the result looks like when using the flawed prioritising algorithm (by average execution time):

When comparing this to single-threaded execution there is an improvement. However, that improvement is not as large as it could be. 8 callbacks were parallelized, but 2 were still run in the same thread as the critical path. When I was playing around with this I thought: what if each callback also stored a listing of what depended on it? Then we could easily get a count of the unique callbacks that depend on it. If we wanted, we could even use this information to determine the average execution time of the critical path in an ideal scenario. That seems unneeded; if we prioritise items in our work queue first by dependant count, then execution time, then dependency count for any tie breakers, we will have a much more robust scheduler. Here is a picture of the items grouped by dependencies:

It is easy to see that 'Events' has the most dependants, so it really must go first. This picture also demonstrates that there are no cycles and that the dependency graph is fairly simple. In real-world use there could be many more callbacks and a much more complex tree. However, there should never be cycles or circular dependencies. To break those, the game coder must decide what will use the previous frame's data (or use a different algorithm). For reference, here is the count of dependants each callback has:

Events: 3
User task 4: 1
User task 2: 1
Post Events: 2
Physics: 1
AI: 1
The rest: 0

Here is what the work queue looks like when sorted by dependant count, then exec time, then ties broken with dependency count:

I am sure there are many scenarios that would result in this algorithm not executing the critical path alone in its own thread, but I think it will do better than the last version that ignores dependant count. Here are the results of running a frame with the new version:

The critical path winds up in its own thread by itself. In this example it couldn't possibly be faster. If the threads persist across frames there are 0 context switches, or if they do not persist there are exactly 6, the minimum amount. Callbacks never begin executing at around the same time, so it is unlikely there was any issue with the atomic check before starting a callback, except when initially launching. There are no mutexes or interthread communication issues of any kind that I am aware of; each callback that produces data for another is finished before the consuming callback begins. There are no blocking threads (in this example), although it would be possible to create a situation where one thread would not be able to perform work because there are unmet dependencies. My experiments so far indicate that having unmet dependencies in the middle of a frame is unlikely. The chances of unmet dependencies can be reduced if the callbacks are more numerous and smaller. Shrinking the size of callbacks and increasing their number also reduces the unused time near the end of the frame. For reference, the overhead on each black line between callbacks is only a function call, a few pointer dereferences, a hardware-level atomic write, and the function call that starts the next callback.

I am sure this is not perfect, but I am sure it has many advantages. It is plain to see this is better than running a service in a thread, waiting on mutexes (or waiting at all), and capping the scalability to the number of services the game needs. This can also be improved with a better job scheduler, I am sure, but the current one is simple. I bet there is more room for improvement in prioritising the tasks/callbacks in the beginning too. Another possible improvement is finding a way to split up physics, but that would involve changes to the Bullet source code (I think).
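The scheduler and the new ordering, roughly, in simplified C++ (names made up; timing/rolling-average updates omitted):

[code]
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <vector>

// Simplified states: 0 = NotRun, 1 = Running, 2 = Finished.
struct Callback
{
    std::function<void()> Work;
    std::vector<Callback*> Prereqs;     // must all be Finished before we start
    std::size_t DependantCount = 0;     // how many callbacks list us as a prereq
    double AverageMicroseconds = 0.0;   // rolling average of past executions
    std::atomic<int> State{0};
};

// Sort once per frame: most dependants first, then longest average run time,
// then most prerequisites as the final tie-breaker.
void SortWorkQueue(std::vector<Callback*>& Queue)
{
    std::sort(Queue.begin(), Queue.end(), [](const Callback* A, const Callback* B)
    {
        if(A->DependantCount != B->DependantCount)
            return A->DependantCount > B->DependantCount;
        if(A->AverageMicroseconds != B->AverageMicroseconds)
            return A->AverageMicroseconds > B->AverageMicroseconds;
        return A->Prereqs.size() > B->Prereqs.size();
    });
}

// Each thread calls this repeatedly: walk the sorted list, skip anything whose
// prerequisites are not Finished, and claim the first runnable entry with a
// single compare-exchange. Returns false when nothing could be claimed.
bool RunOneCallback(std::vector<Callback*>& Queue)
{
    for(Callback* Task : Queue)
    {
        bool Ready = true;
        for(const Callback* Pre : Task->Prereqs)
            if(Pre->State.load() != 2)
                { Ready = false; break; }
        if(!Ready)
            continue;
        int Expected = 0;
        if(Task->State.compare_exchange_strong(Expected, 1))
        {
            Task->Work();
            Task->State.store(2);
            return true;
        }
    }
    return false;
}
[/code]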
  5. @thatguyfromthething I am not aware of any locks in what I described. The member function std::atomic::compare_exchange_strong is the only interthread mechanism I really use. I am relying on things that produce data being completed before things that consume that data run, so I guess waiting could occur if there weren't a large number of tasks/callbacks. There is a job manager; it is just a single function that each thread calls to start a callback. This function is what determines what can be run when. I am not sure that I see the benefit of lumping these callbacks into groups. If a given thread doesn't run a callback the first time, it goes through the entire list of unrun callbacks until it runs one, determines they have all been run (this thread is done for the frame), or determines they all have dependencies that are still running (waiting occurs). What reasons do you have to not use stuff in std? I have heard several, but I would like to hear yours.

@AllEightUp Could you please share some examples of false sharing problems? I think, but I have not tested in running code yet, that a list of dependencies would prevent that. Since I am only using one atomic per callback, and I only check it when starting a callback, of which by default there will be 10 to 30, I am not seeing this being a major source of memory-bandwidth draw. Using std::atomic I can even turn off all the unneeded memory barriers. Hopefully, this scheme will minimise context switching, which has at least some memory bandwidth cost. Rather than locking my dependencies, I try to ensure that callbacks that write data are listed as dependencies on callbacks that consume that data. This sounds like what you describe. In the algorithm, rather than blocking all threads, when a thread finishes running a callback it will go find another callback that depends only on callbacks that have finished. So I do not use an atomic check or a mutex for communication, except to guarantee that I do not accidentally start the same callback twice.

How much CPU time is wasted waiting to synchronise in the algorithm you described? I can't imagine it would be much if you split up the work evenly, but the more tasks you have to run the more blocking would occur. I would also imagine that splitting the work into smaller chunks for more cores would gradually cause the total wasted time to increase, because the threads may not finish at exactly the same time and each thread is dormant until the slowest one finishes. I can't help but think that I explained this poorly; I will make a picture that will explain it more clearly.
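For anyone else following along, my rough understanding of what the false sharing concern would look like here, in simplified code (64-byte cache lines are an assumption about the hardware, and the names are made up):

[code]
#include <atomic>

// Packed tightly, Flags[0] and Flags[1] almost certainly share a 64-byte cache
// line, so two threads claiming different callbacks still invalidate each
// other's caches even though they never touch the same variable.
struct PackedFlags
{
    std::atomic<int> Flags[8];
};

// Padding each flag out to a full cache line keeps writes for different
// callbacks on different lines, at the cost of extra memory.
struct PaddedFlag
{
    alignas(64) std::atomic<int> State;
};

struct SpreadFlags
{
    PaddedFlag Flags[8];    // one cache line per flag
};
[/code]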
  6. My first post describes in detail what we thought we wanted to go with. In short, we break the work into a number of smallish tasks/callbacks, each with a listing of the callbacks that provide data they will need. This should form an acyclic graph that represents the order everything needs to run in. Then we try to prioritise callbacks that take longer to run and start running them in as many threads as have been determined to be optimal. I was hoping this would get shot to pieces after I posted it, so we could make it even better, but I think I explained it poorly.

We are building a simple puzzle game that really won't be pushing any hardware too hard. We are separating game logic from library interaction logic, to simplify future game development. This first game will be released on phones (iPhone, Android) and PC platforms (Windows, Linux, Mac OS X). We want to make shooters, RPGs, MMOs, and other games, but we know not to bite off more than we can chew, and we don't want to do the same coding twice. Currently we have development machines that range from 1 to 8 cores (and more cores will be arriving soon). So scaling is a concern; maybe I am focusing too much on it now, but I really would rather do it once and correctly.

As for the atomic compare and swap, std::atomic has compare_exchange_strong and compare_exchange_weak. It seems that if I use one of those to check when updating a callback from NotRun to Running, then I guarantee that a callback is run only once (a minimal version of that check is sketched below). If the atomic set succeeds, then the thread is free to start the callback; if it fails, then another thread has pre-empted this one and started the callback already, so this thread should go find another callback. If each thread is finding its own callbacks to run and it can be guaranteed that no two threads can run the same callback, then wouldn't this algorithm be lock-free and wait-free?

Kimmi, if we use std::thread in C++0x is there any good way to implement your thread pool idea? I do not see how to give an existing thread work; I only see how to create a thread with work to do. Maybe std::thread is too limited?

edit - grammar fixes
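The check I mean, roughly (names made up):

[code]
#include <atomic>

enum class CallbackState { NotRun, Running, Finished };

// Returns true only in the one thread whose exchange succeeded, so the body
// of the callback can never be entered twice in the same frame.
bool TryClaim(std::atomic<CallbackState>& State)
{
    CallbackState Expected = CallbackState::NotRun;
    // On failure compare_exchange_strong leaves Expected holding the actual
    // state (Running or Finished), this returns false, and the calling thread
    // should go look for a different callback.
    return State.compare_exchange_strong(Expected, CallbackState::Running);
}
[/code]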
  7. @Ashaman73 I will definitely read that carefully, thank you.

@hodgman I have seen a number of things like "compare_and_swap"; I guess I need to look into them a little closer. After a cursory look at Ashaman's link it appears it at least touches on this. I see what you mean about missing the optimal frame-rate. I was under the impression that if a frame was missed and vsync was enabled, that frame would either be skipped or rendered on the next vsync. So it still results in 60 fps, but a couple of frames might be rendered twice and others skipped. Either way that is still an unintended failure; is there anything specific you would suggest to work around the issue? Could we just have the default target framerate be 61 fps?

Currently data is stored/managed by the managers, and that is where the dependencies come from. I should have explained this more clearly. We have a pretty strong idea of what accesses/changes what. Each manager has a callback associated with it (more if any are added for historical reasons). In general, the callback associated with the manager does the bulk of the writing of its data. Anyone making a callback will be responsible for making a good list of dependencies. Ideally this would prevent two callbacks from accessing shared data simultaneously. I appreciate having this glaring omission pointed out; it makes me wonder if we have all the required guarantees in place.

@Peter We ditched the one-thread-per-subsystem idea a long time ago because it doesn't scale: it performs worse on low-end equipment than single-threaded code, it introduces complex issues like deadlock, threads run when the OS tells them to even if they would just idle/sleep, and time is spent waiting on mutexes/resources, all in exchange for a modest increase in performance on a small subset of equipment (frequently that subset is machines with exactly two cores). A lockless design using atomic operations has the downside that significant forethought and planning must be invested. However, it avoids deadlock (or at least makes it easily detectable), atomic operations are not significantly slower than non-atomic operations (and where they are slower they are not numerous enough to make an impact), it runs efficiently on one core, tasks spread evenly across cores so as to average out CPU time, and it will not idle outside of the designated idle time. A job system is very close to what I described, except I avoid mutexes/semaphores completely and try to use atomic operations and pre-assembled knowledge to avoid race conditions. This design also avoids having one thread spawn off tasks in another thread, and instead has threads find more work for themselves (it seemed simpler to design it this way).

We already have a sound library picked out and in operation; actually, we are using a large number of open source components to reduce our workload. Ogre does our rendering, Bullet our physics, cAudio/OpenAL our sound, and the list continues. Frequently, the managers I described wrap a subset of the functionality an external library provides. I will definitely read that article, thank you. I was trying to think of a good way to not close each thread each frame. The std::thread class does not seem to provide a facility to do anything like this; I would think it would need some non-constructor function that accepted a functor or function pointer. Maybe I am not reading it correctly.

Would it be prudent to split the list of dependencies into two lists, one for writing and another for reading? This way it could be more easily verified at runtime that no data contention occurred (roughly as sketched below). If so, what changes to the algorithm would you make? I will look into atomic compare and swap functions; they seem like they will let me do what I need to.
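Roughly what I am imagining for the read/write split, in simplified code (names made up; resources identified by name only for illustration):

[code]
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record of what a callback touches.
struct CallbackAccess
{
    std::vector<std::string> Reads;
    std::vector<std::string> Writes;
};

bool Overlaps(const std::vector<std::string>& A, const std::vector<std::string>& B)
{
    for(const std::string& Name : A)
        if(std::find(B.begin(), B.end(), Name) != B.end())
            return true;
    return false;
}

// Debug-only check before starting Next while the callbacks in Running are
// still executing: writes may not collide with anyone else's reads or writes.
void VerifyNoContention(const CallbackAccess& Next,
                        const std::vector<const CallbackAccess*>& Running)
{
    for(const CallbackAccess* Other : Running)
    {
        assert(!Overlaps(Next.Writes, Other->Writes));
        assert(!Overlaps(Next.Writes, Other->Reads));
        assert(!Overlaps(Next.Reads,  Other->Writes));
    }
}
[/code]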
  8. I have a game/engine that currently has a single-threaded main loop, and I want it to perform work in multiple threads. I have a detailed description of the current design and a potential new design. I want to completely avoid mutexes for performance reasons, and I think it can be accomplished using only atomic operations. I am using C++0x, but I try to be high-level enough with my descriptions that the core concept could be ported to other object-oriented languages. I would like to know if this algorithm sucks or is a good idea, or just in general any thoughts or concerns about it.

Current algorithm for the main loop:

There are a number of managers, each with their own work. Optionally, a game developer can assign a function (currently using C-style function pointers) to be run before or after any manager's work. Each manager has a priority. The managers (pointers to them, actually) sit in a container that sorts them automatically. Each frame the managers are run in priority order. Before this, a microsecond timer is started. For each manager, the pre-work callback is checked and run if present, then the work is actually performed, and finally the post-work callback is checked and run if present. The timer is checked and stopped. If the timer has gone beyond 1/60 of a second, then we know there will be a framerate drop; if it is less than 1/60 of a second, we simply pause here until 1/60 of a second is reached, then begin the next frame. The framerate is configurable by the game developer, but 60 fps is the default. For reference, each pre-work, work, and post-work function can end the main loop with its return value.

There are currently several managers:

Physics – Steps the physics simulation the required amount each frame
Graphics – Does the rendering; requires that the physics data be present
Events – Aggregates events from the OS, user input, and a few other places
Sound – Plays and stops sounds
Actor – Doesn't do anything during the main loop right now; a glorified container
Camera – Doesn't do anything during the main loop right now; a glorified container
Resource – Provides access to file resources efficiently and potentially asynchronously
Scene – Provides settings for sky domes and lighting; does little during the main loop
Timer – Checks timers and performs work related to timers finishing
UI – Updates the user interface

The multi-threaded algorithm:

The first change is to replace the C-style function pointers with actual functors; these are currently named callbacks. Next, the priority value will be moved from the manager to each callback. Additionally, each callback will have a collection (we actually use a vector) of pointers to all the callbacks that must be finished before this one can start. Each callback will store its current running state in an std::atomic for easy thread-safe access. There are 4 running states a callback can have. 'Failed' indicates that normal execution is not possible, an exception is being thrown, and the thread reading this should shut down gracefully and not throw more exceptions; a callback can go to this from any state, and when setting itself to the Failed state it must throw an exception. 'NotRun' indicates it simply hasn't started yet; this is only set once a frame has completed. 'Running' is the state for a callback that has started but hasn't finished, and can only be set from the 'NotRun' state. Finally, the 'Finished' state indicates that a callback has finished, and this can only be set from the 'Running' state.

The states are important because a callback can check whether its prerequisite callbacks have completed or not. The main task scheduler can also perform such checks easily. The order of the running states, and the exception when going to the Failed state, guarantee that a callback will never be started twice, even if a thread is pre-empted. The prerequisites will prevent resource contention and race conditions by guaranteeing that only one task that needs a given resource will be running at a time.

There is a compile-time option to enable/disable main loop threading. If it is off, then the callbacks' priority is simply the order they run in (prereqs will still be checked in debug builds). If main loop multi-threading is enabled, then the priority is automatically set by the callback class using a timer. The priority is the rolling average of the amount of time this callback took to execute over the past several frames.

By default each manager will have its work callback and the pre- and post-work callbacks. When a pre-work callback is added to the manager, it is added to the work callback as a prereq; likewise, when a post-work callback is added, the manager adds the work function as a prerequisite. This will help maintain the simplicity of the existing interface.

The callback will also have a function to start the callback work. When called, this must atomically check and set the running state. First, it must check if it is already running; if it is not, then it must set that it is running in a functionally atomic way. If the callback is already running, then it means this thread was pre-empted and the callback was started, so we just exit and continue thread execution as if a callback ended. By doing this atomic check we can be certain that there is no way two threads are running the callback at the same time. (I need some help with this part; I am not certain how memory barriers/fences work or if they are even the right tool.)

The last change made before execution begins is what goes into the sorted container. Rather than putting each manager in, we will insert up to 3 callbacks. Each manager will have its own work to insert, but rather than checking each loop to see if we should run the pre- and post-work functions, we check beforehand and only insert callbacks for pre- and post-work if they exist. Additionally, callbacks with no or very different prereqs can be inserted directly into the main task scheduler.

In the simple version of the multithreaded algorithm we will simply have a number of threads to create each frame. This can be arbitrary, or a hint/setting can be used. The main task scheduler will have a function to start a callback. Once started, it will launch one fewer new threads than are allowed, calling this function in each. After it has started these threads, its own thread will start a callback. The function for starting a callback will start with the highest-priority callback, checking whether all of its prerequisites have finished. If it has not started and its prereqs are met, it will start that callback. If its prereqs are not met, then the main task scheduler will move to the callback with the next highest priority, until it reaches the end or runs a callback. Whenever a thread finishes a callback, that thread will call the function on the task scheduler to pick and run a fresh callback in the same thread. If it cannot start a new callback, it ends and joins with the main task scheduler thread. If it does successfully start a callback and the current count of running threads is less than the allowed amount, then a new thread will be spawned that attempts to start a callback. On some platforms with expensive thread creation and destruction this will minimize that cost (ideally creating and destroying a number of threads equal to exactly the limit minus one). Additionally, once finished, a callback will update its priority for the next frame.

To avoid situations where the scheduler is deadlocked because a callback has itself listed as a prereq, the implementation can either wait for all the threads to complete running and check that every callback yet to be run has unmet dependencies, or it can recursively check a callback's prereqs for an issue. Once all the threads have joined at the end, if less than 1/60 of a second has passed then we wait until it has; if more has elapsed then we know that the frame rate is suffering. In theory, work that is known to take a long time can be split into multiple callbacks at development time.

There is an extension to this algorithm that relies on the abundance of timers to change the number of threads being run for optimal performance. I can try to describe that as well. Thanks for reading all of that. Please let me know what you think.
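To make that more concrete, here is a heavily simplified sketch of the main pieces (all names made up; the rolling-average priority update and the priority sort are omitted or stubbed, and a worker that finds nothing runnable just returns instead of waiting):

[code]
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

enum class RunState { Failed, NotRun, Running, Finished };

struct Callback
{
    std::function<void()> Work;
    std::vector<Callback*> Prereqs;
    std::atomic<RunState> State{RunState::NotRun};
    double AverageMicroseconds = 0.0;   // rolling-average priority, update omitted

    bool PrereqsFinished() const
    {
        for(const Callback* Pre : Prereqs)
            if(Pre->State.load() != RunState::Finished)
                return false;
        return true;
    }

    // Atomically claim NotRun -> Running; only the winning thread runs the work.
    bool StartWork()
    {
        RunState Expected = RunState::NotRun;
        if(!State.compare_exchange_strong(Expected, RunState::Running))
            return false;               // pre-empted: someone else already started it
        try
            { Work(); }
        catch(...)
            { State.store(RunState::Failed); throw; }
        State.store(RunState::Finished);
        return true;
    }
};

// Each worker keeps scanning the priority-sorted list until it can claim nothing.
void WorkerLoop(std::vector<Callback*>& Sorted)
{
    try
    {
        bool RanSomething = true;
        while(RanSomething)
        {
            RanSomething = false;
            for(Callback* Task : Sorted)
                if(Task->PrereqsFinished() && Task->StartWork())
                    { RanSomething = true; break; }
        }
    }
    catch(...)
    {
        // A Failed callback re-threw; end this worker quietly so the main
        // thread can notice the Failed state after the join.
    }
}

// One frame: reset states, launch ThreadCount - 1 helpers, work in this thread
// too, join, then pause out the rest of the 1/60 second budget.
void DoOneFrame(std::vector<Callback*>& Sorted, unsigned ThreadCount)
{
    auto FrameStart = std::chrono::steady_clock::now();
    for(Callback* Task : Sorted)
        Task->State.store(RunState::NotRun);

    std::vector<std::thread> Helpers;
    for(unsigned Index = 1; Index < ThreadCount; ++Index)
        Helpers.emplace_back(WorkerLoop, std::ref(Sorted));
    WorkerLoop(Sorted);
    for(std::thread& Helper : Helpers)
        Helper.join();

    // If we beat the target we wait out the remainder; if not, the frame rate
    // is suffering and a real implementation would record/report that.
    std::this_thread::sleep_until(FrameStart + std::chrono::microseconds(16667));
}
[/code]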