About All8Up
  1. Definitely not; it's just the free-threaded nature of the implementation you described which is not the best way to go. Typically in games you perform several steps in order rather than all at once. In that model you put 'all' threads to work updating the scene graph, and only when that completes do you let the render thread walk the graph to start rendering. Or, more commonly, after the update a cull pass runs which issues the rendering commands, and the render thread picks those up. Basically, everything in that more common model is a 'push' towards worker threads rather than multiple consumers trying to process shared data.
  2. To be honest, this is probably the better point. The OP did suggest a functional threading model in the description, but even with data-flow controls, scene graphs are a real problem in distributed systems, and this counter approach usually ends up in both functional and job-based solutions as part of the method that lets multiple threads traverse the scene graph efficiently without duplicating work. You could do this with an OS mutex and probably perform about the same, but I've found many uses for the counter approach in areas beyond simple update and rendering, not to mention its lower memory footprint. The thing I see most often is folks pushing nodes into a job system, with each child pushed as an individual job. In most cases I've seen, that approach definitely utilizes all the cores, unfortunately REALLY poorly. The overhead is usually just horrible in such solutions and you'd be better off single threaded. Tree structures are always difficult to get effective parallel utilization from, and the counter trick is actually one of the better ways to partition the tree. We haven't even touched on the fact that most scene graphs have to be traversed twice: once to update all the transforms and bounds at the leaves, and again to accumulate the bounds back up to the root. That's actually fairly simple to solve with the approach suggested; I did forget to mention it, of course. Basically you have to choose which direction you intend to perform updates in and follow that direction consistently. Say we choose depth-first priority: if an item like the gun is locked by the renderer, and the renderer steps to the gun's parent and finds it currently locked, it unlocks the gun and spin-waits for the atomic to become 'updated'. Any thread locking children assumes it will eventually get the lock, so it waits. Given that contention is exceptionally rare in most cases, this doesn't hurt performance much and just works.
This is a matter of balance and purity. If you want the render thread to do nothing but render, you have to either remove it from scene graph traversal while others are updating the graph, or block it when it runs into something not yet updated. On the other hand, if you don't mind turning all of your threads into a more generic 'worker' concept, this approach is actually an optimization: something has to do the work, and the render thread would be blocked anyway, so why not let it do the work to unblock itself? Going back to Hodman's point, the approach I'm suggesting is not horrible, but there are other solutions which will perform better. Though, as I mentioned, I've ended up using this as part of those other solutions anyway, so I don't think it would be wasted work. Tagging nodes in trees is a very common approach to parallel updates; it shows up in generic concurrent tree types and in many algorithms with complex traversals such as this.
  3. Believe it or not, the solution I suggested is exactly this; it just skips over all the multithreading issues you run into when implementing it. Threading throws a serious wrench into many obvious, simple solutions. For instance, the naive implementation of your suggestion is deadlock prone in a threaded codebase. Say the rendering thread starts to render the gun: it locks the gun, then decides it needs the gun's parent, so it tries to lock that. Unfortunately another thread has the parent locked while updating it, and then it tries to lock the gun. Oops, dreaded deadlock. As I said, there are many possible solutions, even to the deadlock as described. The problem is that many of the obvious ones are worse than giving up on threads and running single threaded. For a scene graph, the best solution is usually the eventual-consistency model I describe, simply because it avoids most cases of contention and doesn't need non-uniform work-ownership rules. Amdahl's law is unforgiving: the more complexity you put into deciding what work to perform, the less you will be able to utilize the threads. As such, what conceptually seems simple here is actually really hard to implement, and eventually a single-threaded approach would be better.
  4. There are a number of ways to approach this problem; personally I like the most simplistic variation. There is a simple trick involving dependency chains and work ownership. In each scene graph node, add an int32_t or anything usable with std::atomic. Every frame, update a global counter and tell the threads about that update, such that each thread has a copy of the old and new values of the counter. The obvious move at this point is to perform an atomic compare/exchange on the counter in each node; if it succeeds, that node needs to be updated before any of its data is used. Of course there is a problem: if one thread succeeds in updating the counter and then starts updating the node's data, other threads could read the data mid-update. So things get a little tricky, though still pretty simple. To fix it, reserve some value of the counter (I usually use -1) to represent a 'locked' state. Rather than the compare/exchange updating straight to the new value, it updates to the locked state first, does the work needed to update the node, and then stores the new counter value when done. Adding a little spin wait on the locked state is trivial, and while fancy exponential backoff on the spin may help, 'usually' it is a waste of time since contention in a fair-sized scene graph is minimal as long as there is a decent amount of work for each thread to perform. For all intents and purposes, this approach turns the threads accessing the scene graph into a work-stealing job system. The reason for that description: say your render thread wants to render the gun. It checks the counter, and if it wins the swap it knows it needs to do the matrix updates. To do that it walks to the parent and checks its counter; if it swaps that one too, it gets to perform the matrix updates for both items, i.e.
stealing the work from your matrix-update thread. If the matrix-update thread sees already-updated or locked nodes, it just ignores them, since obviously some other thread is busy updating them. So long as all threads in the scene graph respect the update counter, dependencies get solved properly and everything is thread safe. At the end of the day, this system works quite well and utilizes the threads reasonably.
  5. As others have stated, that is probably not a 64-bit result and you are losing the X position. Change the code as follows to guarantee things (note the mask on Y; without it, a negative _mouseY would sign-extend and overwrite the X bits): int64_t Input::GetMouseXY() { return (int64_t(_mouseX) << 32) | (int64_t(_mouseY) & 0xFFFFFFFF); } Alternatively, and harder to get wrong or mess up later, use a combination of a union and bit fields: union MousePosition { int64_t Data; struct { int32_t X : 32; int32_t Y : 32; } Position; }; Then you can: int64_t Input::GetMouseXY() { MousePosition result; result.Position.X = _mouseX; result.Position.Y = _mouseY; return result.Data; } The bit field makes the code more verbose but also more understandable; it does the magic of bit shifting and compositing the result for you automatically in a safe manner. (Note that in the above case, if you really want 32-bit values, you don't need the bit field; I put it in since I usually pack more data into that structure.) Unpacking the data is then: MousePosition pos; pos.Data = theInt64ReturnedAbove; int32_t x = pos.Position.X; int32_t y = pos.Position.Y;
  6. Nested namespaces useless typing?

    I tend to agree with most here in that 'it depends'. Take your path example; oddly enough that is exactly a case I run into very often. IO::Path::* can be appropriate given that you might also have AI::Path::* or DAG::Path::*. Yes, I could rely on prefixing such as "IOPath", "AIPath" or "DAGPath" and avoid the namespaces, but to me the purpose of the namespace here is to allow each API to use the most appropriate and obvious names without stomping on the others. Once out of headers and into the .cpp files where you use such things, simply typing 'using namespace IO/AI/DAG' generally takes care of the extra typing involved. The rare exceptions where you need combinations (serialization comes to mind) are the only places you usually need to keep typing the prefixes.

    As to the depth of the namespaces: I tend to believe 2-3 deep is appropriate, with the only exceptions usually being things like an internal namespace that pushes into the 4+ range. 2-3 may seem a bit much, but it all comes out of usability needs. Probably the most common header name in nearly any library you get: '#include "Math.h[pp]"'. This is a common problem if you are using 3rd-party libraries: whose Math file are you including with that? So, to make sure I don't have that problem, there is *always* a prefix directory in my libraries: '#include "LIB/Math.hpp"', which guarantees I get *my* math header file. Out of this flows the rule I use that including a file should be as close to the namespace of the object it provides as possible, i.e. '#include "LIB/IO/Path.hpp"' gives me LIB::IO::Path ready for use.

    While my rules don't work for everyone and are a bit excessive for normal everyday work, there is one rule I suggest anyone follow: "Be Consistent". If you adopt a set of rules and styles, apply it everywhere with as few exceptions as possible. Consistency trumps individual decisions folks may dislike pretty much every time.
  7. Yes you can. This was actually the first approach I tried, but it is very bad for the GPU in that it severely under-utilizes parallelism. As a quick and dirty 'get it running' solution it is fine though. If you read that tutorial again you might catch that they talk about there being implicit 'subpasses' in each render pass; it's kinda 'pre', 'your pass', 'post'. Setting dependencies on those hidden subpasses is how you can control renderpass-to-renderpass dependency, or at least ordering. Nope, I still expose quite a bit of that for things not directly within a renderpass, since it is common to all three APIs, or a no-op where Metal does some of it for you. The most specific case is the initial upload of a texture, where you still have to transition to CPU visible, copy the data, then transition to GPU visible and copy to GPU-only memory. While I have considered hiding all these items, the goal is to keep the adapter as close to the underlying APIs as possible and maintain the free-threaded, externally synchronized model of DX12/Vulkan and, to a lesser degree, Metal. Hiding transitions and such would mean building more thread handling and synchronization into that low level than I desire. This would be detrimental since I'm generating command buffers from 32 threads (Threadripper is fun), which would really suck if the lowest-level adapter introduced CPU-side synchronization just to automate hiding a couple of barrier and transition calls. Long story short, the middle-layer 'rendering engine' will probably hide all those details; I just haven't bothered much and hand-code a lot of it, since I'm more focused on game systems than on rendering right now.
  8. The short answer is that you are likely to have problems and will need to introduce dependencies between render passes. The longer answer is that this is very driver specific and you might get away with it, at least for a while. The problem is that, unlike DX12, Vulkan does do a little driver-level work to help you out, which you usually want to duplicate in DX12 yourself anyway. Basically, if you issue 10 render passes, the Vulkan driver does not have to heed the order you sent them to the queue unless you have supplied some form of dependency information or explicit synchronization. Vulkan is allowed to, and probably will, issue the render passes out of order, in parallel, or completely ass backwards depending on the drivers involved. Obviously this means the draw commands associated with each begin/next subpass execute out of order as well. When I implemented my wrapper, I ended up duplicating the entire concept of render/sub passes in DX12; it was the most appropriate solution I could find that solved the various issues on the three primary rendering APIs I wanted to support. The primary reason for putting the work into this was exactly the problems you are asking about. At least in my case, when I really dug in and understood render passes, I realized I had basically been doing the same things, just scattered through my rendering code. Pushing it down into an abstraction cleaned things up quite a lot and made for a much cleaner API. Additionally, at a later time it will make it considerably easier to optimize for better parallelism and GPU utilization, since all the transitions and issuance are in one place that I can get clever with, without reorganizing large swaths of code. So, yup, I'd suggest you consider implementing the subpass portion, because it has a lot of benefits and solves the problem you are asking about.
  9. Engine Core Features

    Hardware/OS-specific items should still have an interface layer in your core so that the API stays consistent no matter how you implement the backend. Along those lines, you really should add input device handling (keyboard, mouse & joystick) and window management items. Depending on goals, a lot of folks roll window management into the rendering engine, but I tend to think it should be separated for quite a few reasons. Just my $0.02.
  10. I guess I didn't explain it well enough, or our terminology is getting mixed up. What you are describing is a multi-step process merged into a single step, which is inherently going to carry redundant data and cause this sort of problem. Break the pipeline down into several steps; this is what I usually end up with: Intermediate (my 'all' format, or Collada, Obj, whatever) -> Basic prepared data (all data, but in a GPU-usable form, just not optimized, including data that various materials will ignore) -> Material-bound data (unused data removed and, optionally, full vertex cache optimization) -> GPU ready. What I was describing is that the offline asset processing only does the first two steps; the material binding that reduces the vertex elements to only what is needed is what I was describing as the runtime cache portion. There are a number of reasons to leave material binding until runtime, primary among them that the runtime is the only thing which actually knows what makes sense. For instance, if I use a mesh with a pipeline that expects UVs and another pipeline that is the same except it doesn't use UVs, it is generally best to just reuse the same mesh and ignore the UVs in the second case. All said and done, this is a case of premature optimization until you actually understand what the game needs and which combinations make the most sense. So a little runtime cost is well worth it until later.
  11. I look at this a little differently. I tend to break this into two separate items: what you have, and another class which represents what the graphics API expects. The general idea is that from Max/Maya I spit out an intermediate structure containing all the data available. This is a 'slow' item since it is bloated and not formatted in a manner usable by the graphics APIs. Then I create the low-level immutable graphics representation from the intermediate data, which has done all the copies and interleaving you are mentioning. This does mean that when I ask to render a mesh I load the big bloated data and perform the conversion step. That probably sounds like what you are already doing, but there is a trick here. When I make a request, I generate a unique hash that represents the intermediate mesh, the target graphics API and any custom settings involved in the intermediate-to-graphics conversion. I then look that key up in a cache: if I have already performed the conversion, I just grab the immutable cached version and use it; if not, I perform the one-time conversion and store it in the cache, potentially even saving it to disk for the next run. Later in development, or whenever it becomes appropriate, you can generate all these variations offline, remove the intermediates and 'only' use the final graphics data. This split eventually becomes your asset pipeline, and if you leave the intermediate handling in the engine, you can still use it for fallbacks to older API capabilities as needed. A one-time startup and processing overhead is not too much to ask of the end user, so long as it is not hours of processing, of course.
  12. The cost of virtual functions is usually greatly exaggerated in posts on the subject. That is not to say they are free, but assuming they are evil is simply short-sighted. Basically, you should only concern yourself with the overhead if the function in question is going to be called, say, >10000 times a frame. An example: say I have the two API calls "virtual void AddVertex(Vector3& v);" and "virtual void AddVertices(Vector<Vector3>& vs);". If you add 100000 vertices with the first call, the overhead of the indirection and the lack of inlining is going to kill your performance. On the other hand, if you fill the vector with the vertices (where the additions can be inlined and optimized by the compiler) and then use the second call, there is very little overhead to be concerned with. So, given that the 3D APIs no longer supply individual vertex get/set functions and everything is in bulk containers such as vertex buffers and index buffers, there is almost nothing to be worried about regarding usage of virtual functions. My API wrapper around DX12, Vulkan & Metal sits behind a bunch of pure virtual interfaces, and performance does not change when I compile the DX12 lib statically and remove the interface layer. As such, I'm fairly confident you will have no problems unless you do something silly like the above example. Just keep in mind there are many caveats involved here due to CPU variations, memory speed, cache hits/misses based on usage patterns, etc., and the only true way to get numbers is to profile something working. Consider my comments rule-of-thumb safety in most cases, though.
  13. While folks are correct that this is the poster-child use case for mutable, keep in mind that the contract for mutable changed in C++11 if this code is ever to be multithreaded. As of C++11, the contract for mutable also includes a statement of thread safety: const member functions are generally expected to be safe to call concurrently. A use case such as this in a multithreaded engine will likely fail pretty miserably, and you need to protect the cacheResult_ value. I'm only pointing this out 'in case' you intend to multithread any of this code; if not, it doesn't affect you.
  14. In a general way, that is fairly close to a very simplistic solution. Unfortunately, at this level it is really all about how clever the drivers get when they solve the path through the DAG generated by the subpasses. They could do the very simplistic thing of issuing a vkCmdPipelineBarrier with top- and bottom-of-pipe flags set between subpasses with dependencies, or they could look at the subpass attachments in detail and figure out a more refined approach. Since this is all just a state-transition chain, building a simple DAG allows for a much more optimized mix of pipeline and memory barriers. I can't find the article I remember that describes some of this.
  15. In the subpass descriptions you have arrays of VkAttachmentReference, which is a uint plus a layout. The uint is a 0-based index into the VkRenderPassCreateInfo structure's pAttachments array, where you listed all of the attachments for the render pass. So, effectively, what I'm saying with those is: // assume you have pRenderPass and pSubPass pointers to the respective Vk structures. theImageWeWantToMessWith = pRenderPass->pAttachments[ pSubPass->pInputAttachments[ i ].attachment ] That is effectively what goes on behind the scenes to figure out which image to call memory barriers on. So, when I said attachment 0 and 1, I was talking about indices into the VkRenderPassCreateInfo structure's pAttachments array. Note that the render pass info does not separate inputs/outputs etc.; it just takes one big list, and only the subpasses care about usage. Hope that clarifies things.