

#5307703 Low level serialisation strategies

Posted by on Yesterday, 03:22 PM

I have my own engine for my indie game, and I contract on another game engine at the moment. Both make extensive use of this strategy for data files that are compiled by the asset pipeline (so they're automatically rebuilt if the format changes).

Often we don't even fix up pointers on load. I have a template library containing pointer-as-relative-offset, pointer-as-absolute-offset, fixed array, fixed hash table, fixed string (with an optional length prepended to the chars, and the hash appearing next to the offset to the chars), etc.
Instead of doing a pass over the data to deserialize these, they're just left as-is, and pointers are computed on the fly in operator->, etc. This can be a big win where you know a 2-byte offset is enough, but a pointer would be 8 bytes.

As above, I find that this KISS solution often costs less stress and time than more over-engineered solutions.
Also as above, I don't spend too much time debugging this stuff at all, and it either works fine or breaks spectacularly. Leaving a few unnecessary offsets to strings in the data can be useful if you do have to debug something.
I usually generate my data with just a C# BinaryWriter (plus extensions to make writing things like offsets and fixed-size primitives clearer), and use assertions when writing structures that the number of bytes written equals some hard-coded magic number. The C++ code also contains static assertions that sizeof(struct) equals a magic number. If you update a structure and forget to update these assertions, the compiler reminds you very quickly.

Save game files, user generated content, and online data tend to use more heavyweight serialisation systems/databases that can deal with version/schema changes, as these don't go through the asset compiler.

#5307609 Bound but unused vertex buffers affecting the input layout?

Posted by on Yesterday, 07:35 AM

1. Nope, the Input Layout says which attributes to load, and which 'slots' to load them from. If a buffer happens to be bound to an unused slot, no big deal.

Note that this isn't true in other APIs. In certain other APIs, it's a big performance pitfall to accidentally leave extra attributes bound to the pipeline like this :)

2. There's no need to unbind buffers, no. As a debugging tool, I unbind buffers in debug builds, but not in shipping builds.

However... When you bind a resource to the device, the device holds a reference to that resource until it is unbound. If you release a buffer that happens to still be bound to the infrequently used slot #15, then the memory won't actually be released until you do unbind it. I generally call ID3D11DeviceContext::ClearState at the beginning of every frame to avoid long-term bindings like this.

#5307511 I have an Idea..........

Posted by on 23 August 2016 - 07:21 PM

The game is supposed to be an open world game like Skyrim. My question is: are there any easier engines that I can use to make this game?
You could make it as a mod for Skyrim. 

#5307505 Horizon:zero Dawn Cloud System

Posted by on 23 August 2016 - 05:56 PM

@Hodgman you planning on sharing an application with your findings ?

Sure. The quick prototype that picture is from is a shadertoy :D 
https://www.shadertoy.com/view/Mt33zj The code is terrible and terribly inefficient though!
I started by stealing this atmospheric scattering shader by valentingalea (which is based on this article), stealing some noise functions from iq, and adding a cloud layer based on what I could remember from reading this thread.

#5307362 Game architecture

Posted by on 23 August 2016 - 05:07 AM

Determine which procedures in which systems need to load media, and pass a reference to the media loading system into those procedures.

#5307301 Easy way to get static AABB for animated model

Posted by on 22 August 2016 - 07:00 PM

If you assume that bones can only rotate, not translate/scale:

Pick a bone to be the center. Walk outwards from there along all paths to leaf bones. Along the way, sum the distances between successive bones, giving the total distance travelled from the center bone. Take the maximum distance produced by any of the leaves. A sphere of that radius, centered on the center bone, will always enclose all bones.

Vertices form a shell around the bones though, so for each bone, iterate through every vertex that is skinned to that bone (has a weight > 0 for it) and compute the distance between the bone and that vertex. Keep track of the maximum distance between any bone and an associated vertex. Add this distance to the sphere generated earlier. Now you've got a sphere that's guaranteed to always bound every vertex in the model, no matter how its bones are rotated.


At runtime, you can cull using these "conservative" spheres before doing animation. After animation, you can generate a slightly tighter bounding volume if required, by taking the min/max of the animated bones and adding the "shell size" that was computed above.

If you happen to use software skinning, then you can compute the min/max vertex position during skinning to get an even tighter bounding volume still -- but not many people use software skinning any more :)

#5306942 Texture Units

Posted by on 20 August 2016 - 07:47 PM

In your performance test, there's 14ms difference between the two tests. To me, that sounds like a bug, and I'd keep drilling into that situation to find out why :o
They really should be quite similar...
The array case should have less CPU overhead, as there are fewer API calls to bind the resources, and it requires less set-up time per pixel, as there's only one resource descriptor to load instead of three. Perhaps there's more GPU overhead due to having to add a slice index multiplied by the slice stride... but that's like one instruction, which would require a hell of a lot of pixels to be drawn to add up to 14ms!

As for texture units, there used to be a fixed number of registers that would hold texture descriptors (the guts of an SRV), which set a hard limit, and mapped perfectly to these "slot based" APIs.
These days, GPUs are "bindless", where fixed descriptor registers have been replaced with general purpose registers. Descriptors are stored anywhere in RAM, and the first thing that a shader does is fetch the descriptors from RAM and into some GPRs. When performing a texture fetch, the addresses to read pixel data from are computed by reading the GPRs that hold the texture descriptor and the GPRs that hold the texture coordinates. It's mostly just general purpose memory fetching.
In this new model, the number of textures bound at once is essentially limitless. Only a finite number of descriptors (SRVs) will fit into the register space at once (there's a finite number of GPRs available), but you could always fetch a new descriptor from memory right before doing the texture fetch, removing this hard limit. The API limit of 128 on D3D11 is completely arbitrary.

On both old and new hardware, binding the same texture twice is useless - it just adds extra CPU overhead, and on modern GPUs it forces them to fetch these extra descriptors from RAM per wavefront.
The actual texture-fetching hardware (confusingly, also sometimes called a texture unit) is unrelated to and decoupled from these API binding points / slots / SRVs. A GPU compute core will automatically make use of all of its available memory channels, which may include one or more dedicated texture-fetching and filtering HW units.

#5306919 Sensible places to look for work with 1 year Unity experience

Posted by on 20 August 2016 - 06:03 PM

In my experience, every entry level job asks for three years experience and three shipped titles, which is a stupid catch 22 :lol:
If you can manage to get to an interview anyway and show that you know what you're talking about, you can still get hired for these jobs. The requirements are more a wish list given to HR/recruiting.

#5306845 Background Command Lists Generated On Execute?

Posted by on 20 August 2016 - 12:01 AM

In general, something working with WARP should be valid, right?

I've written a lot of (buggy) code so far that has worked fine on WARP, but has crashed either the NV or AMD drivers (usually just one or the other!).

#5306719 I'm considering engine switch after current project is done. Which one sh...

Posted by on 19 August 2016 - 07:37 AM

I wouldn't recommend CryEngine to anybody, having known people who worked for Crytek.

Can you tell any 'tellable' explanations about this ? Is the API/SDK/code bad in some ways ? Code spaghetti maybe ?

My only experience with CryEngine was that I took a small job to add PSVR support to it one time... I thought, "this should be easy, they already have Oculus support!".
Oh was I wrong.
For starters, their Oculus support was added by someone who had no idea what they were doing, or alternatively, simply didn't care what they were doing -- they took the head-tracking data and converted it into mouse input events, and fed it into a massively complicated, XML-routed event system, which eventually ended up modifying the camera rotation somehow. I found the part of the code that generated this message fairly quickly, but it took hours of wading through spaghetti before I found the part of the code that received the message... Backing up for a second though, mouse inputs are two dimensional, but head-tracking is 6 dimensional, so positional offsets and "roll" of the rotation were simply ignored!
What I thought would've been a simple half a day task turned into a multi-week epic task, because of the unnecessary complexity in their code-base. In the end, this game simply gave up on supporting PSVR, because adding actually decent VR support into Cry (and hitting their performance targets) was too much effort.
Another example -- every engine at some level has a cross-platform wrapper around D3D/etc (e.g. RHI), so that you can port your graphics code to multiple platforms easily... I was very confused when I found PS4-specific code that was calling D3D11 functions! It turns out that their cross-platform graphics layer was such a mess, that part of it is a portable Cry API (like RHI) and other parts just use D3D11 directly... Of course you can't run D3D11 code on PS4... so some bright spark has decided that rather than clean up their code base, they'll just add another layer of complexity to it by emulating select parts of the D3D11 API as GNM (PS4) wrappers :o
You can really, really tell that it's written by 100 different people who simply do not get along with each other at all... and having known a bunch of people who have worked there, this apparently makes sense: internal cliques, egos and office politics are a big problem.


tl;dr, it does produce some pretty pictures, but the quality of the engineering is embarrassing.

#5306472 Engine design v0.3, classes and systems. thoughts?

Posted by on 17 August 2016 - 10:44 PM

For example, where would you store a mesh's vtx buffer? In my case it's the Cd3dMesh class, because CMesh is API/platform independent and just an IO/data thing.
This depends where you want to draw a line through your code base between code that's implicitly cross-platform, and code that you will re-implement once per platform.


If you implement a "gpu buffer" (index buffers, vertex buffers, structured buffers are all the same) class once per platform (BufferD3D11, BufferVulkan, etc), then your Mesh class becomes portable. It can own a Buffer (which is a BufferD3D11/etc, hidden behind an interface).

Alternatively, Mesh can be an interface for MeshD3D11, MeshVulkan, etc, and you can implement the entire mesh class once per platform.


Personally, I like to create a low-level renderer that's implemented once per platform, and then build the high level renderer (meshes, materials, etc) on top of it (and without using any platform specific code).

#5306446 DirectX 12 by example. A blogpost.

Posted by on 17 August 2016 - 06:40 PM

Nice. I really like the flow, how you start with the shader that you want to run and then treat the D3D code as a solution to this problem.

#5306298 dx11 drawIndexed cost overhead time

Posted by on 17 August 2016 - 01:10 AM

You should use Visual Studio's debug builds when you want to use breakpoints to step through your code line by line, and otherwise use release builds.

You should use the D3D11_CREATE_DEVICE_DEBUG flag to check for errors in your usage of the D3D API, and otherwise disable that flag.

i.e. switch back and forth between these 4 different modes depending on your current task.

#5306289 dx11 drawIndexed cost overhead time

Posted by on 16 August 2016 - 11:28 PM

I have not made any product, so all my programs are DEBUG versions.

Debug builds will always be extremely slow - never use them for testing performance.
Specifying the D3D11_CREATE_DEVICE_DEBUG flag when creating a D3D device will ruin performance too.

How do you paste code like yours?


I ran your code and it crashed.

Where does it crash, and what kind of crash?

#5306060 render huge amount of objects

Posted by on 15 August 2016 - 08:00 PM

Lower-end GPUs should handle 500-1000 no problem, mid-range 1500-3000, and high-end can hit 8,000+.

My computer runs your demo at 30fps with 1000+ total render objects. Can you show your code for how you update transforms 1000+ times?

What is your hardware? I got to 60k+ objects before the fps dropped to 30 :o

@hodgman, interesting and tidy approach, but does it end up being more efficient than a normal tree traversal? I guess it depends how much changes from frame to frame; if nothing does, then a full tree traversal for transform updating only is pointless. But sorting arrays sounds slow also. I was aiming for a solution that only touches the minimal set of nodes to respond to a change but also scales well from zero changes to changes in every object in the scene. Me wants cake and eating it!

In this case, I would guess that transferring two matrices from RAM and writing one back to RAM will take a lot more time than the clock cycles of computing a matrix multiplication, meaning the CPU will be idle for most of this process... Therefore, the bottleneck is memory bandwidth, not processing power, and you should optimize for that. Keeping the data in a contiguous array and iterating through it in a predictable pattern allows the hardware prefetcher to automatically scan ahead and start fetching future elements of the array before they're required, reducing cache misses / observed memory latency. There are also random accesses to parent data, but these are typically small jumps backwards in the array, which are almost guaranteed to still be present in the L1 cache, so they incur no extra bandwidth.

This difference in memory access patterns could mean that using a linear array and processing every element takes the same amount of time as using a randomly allocated tree (e.g. every node is allocated individually with new) and only processing 10% of the elements! :o
Moreover, it's often important in games to optimize for the worst case behaviour of an algorithm, not the average case or best case.
e.g. if you're making an FPS with 60 players, each with 100 bones, that's a worst-case of 6000 character nodes per frame to recompute. If no one is on screen, it's a best case of 0 character nodes :)
It's all well and good to make sure that your framerate is amazing when there's no players on screen, but the most important thing is to make sure that the framerate is acceptable when every player pops up on your screen at once.
You can also add a level of hierarchy on top of this by allocating one array per character. In the above example, that would mean the character models would use 60 individual arrays of 100 nodes each, rather than one global array with 6000 nodes. You can then cull a model if they're off-screen, which saves you from processing 100 nodes in one go.
If different models can be attached to each other, then you just need to sort your models so that attachments update after their parents -- sorting an array of 60 objects is so cheap you almost won't be able to measure it. Within each sub-array, you'll rarely have to re-sort the data -- e.g. the ankle is always a child of the knee, and you're not going to change that at runtime. So this data gets sorted when creating the model file, and doesn't require re-sorting at runtime.

There are 2 things at play with transforms the way I see it, the local update of a matrix when it is changed... then the re-combining of all the child matrices - this is where I am struggling to see the optimal solution.

Yep. The local matrices are the outputs of your animation, physics and specialized gameplay systems -- different game objects will get their local matrix data from different sources.
Once all the local matrix data has been computed, you can recompute all the world matrices by walking down your hierarchy and propagating parent to child.
You can try to skip work here by using dirty flags, etc - only updating nodes that have been modified (and nodes whose parent has been modified!)... or you can just recompute all of them :)