Optimizing memory layout (and reducing load times)

18 comments, last by Zipster 6 years, 10 months ago

Disclaimer: I am writing my own thing, so this does not pertain to UE, Unity or any other professional engine.

I'm taking a critical look at my actor code and trying to figure out how best to serialize/deserialize the data for disk and network streaming. The problem right now is that there are inherently a lot of allocations happening for any given actor.

Initializing the "RPG" component of a pawn actor alone necessitates a slew of allocations as there are quite a few subcomponents that have a metric crapton of fields/arrays that do not have fixed sizes. I'd like to not only reduce load times (which in all fairness aren't an issue at this point), but also memory fragmentation as I haven't really implemented any manner of a compactification in my memory manager.

What are the common approaches and things to streamline in this case, and - more importantly - are they worth it?

The most obvious ones I can think of are:

1) convert dynamic strings to fixed-length char buffers
2) convert any vector-style arrays to fixed-size arrays
3) use field packing and relative offset pointers to coalesce several small chunks of data into a fixed-size pool (kind of the way uniform buffers work) - see the sketch after this list
4) resolve internal pointers of referenced objects to indexes into an array or hashes into a hashmap (this probably requires an additional solution to cache the actual pointers)
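
To illustrate item 3, a self-relative pointer is roughly what I have in mind - just a sketch, not finalized code, and the names are my own:

#include <cstdint>

// A "pointer" stored as a byte offset from its own location, so a whole
// pool of data can be copied or memory-mapped without pointer fix-ups.
template <typename T>
struct rel_ptr {
   int32_t offset; // offset from this field to the target; 0 means null

   T* get() {
      return offset ? reinterpret_cast<T*>(reinterpret_cast<char*>(this) + offset)
                    : nullptr;
   }

   void set(T* target) {
      offset = target ? static_cast<int32_t>(reinterpret_cast<char*>(target) -
                                             reinterpret_cast<char*>(this))
                      : 0;
   }
};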

The issue here is that I'm working with abstract objects that can be extended fairly arbitrarily. That being said, while I have reservations about capping objects that contain arrays nested within arrays within arrays, I don't really see the harm in setting some fixed size/length caps on certain variables at the expense of storage. My memory allocator is generously aligned anyway, so there's already some inherent waste.

The reason I'm asking is because I'm at a point where I don't have a real-world use case in my hands and I can't really tell whether worrying about this (or to what extent) is worth it. Rewriting this stuff is a lot of work, but I'd rather do it before I compound the issue further, and I'd also like to have the infrastructure in place since I want to write some form of serialization solution in the near future.


When it comes to performance there is a simple rule:

** MEASURE EVERYTHING **

What is slow for serialization in your game may be extremely fast for serialization in my game.

In general terms, data conversions and processing slow down serialization. Being able to use the data directly from disk is generally fastest, but not always. Part of why games use S3TC texture formats (the DXTn family, usually stored as DDS files) is that they can be loaded directly to the card, instead of formats like jpeg, gif or png that need to be decoded, expanded, and then sent to the card.

However, just because it makes sense in general does not mean it makes sense for your data. Sometimes compressed data is faster than raw data since less time is spent at (slow) disk transfer rates and more at (fast) memory transfer rates. Sometimes a more complex scheme that memory-maps elements into their final location and uses an offset table to access everything is actually more overhead than just loading the data the stupid, direct way.

Many times the things you think are slow aren't actually slow, and the things you think are fast aren't actually fast. That is why you need to measure for your specific situation, then address the issues based on measurements. Make a change, and measure again to ensure the changes helped.

I am aware of the basic rule of premature optimization :).

Hence, like I mentioned on more than one occasion, my concern isn't so much about speed as it is about practicality and memory fragmentation. I'm looking for ideas, possible input on how a commercial engine might do this and trying to figure out a middle ground, which would allow me to serialize my data without too much hassle while not bending over backwards to then deserialize it.

In this case versioning my code so that I can compare microseconds is the last thing on my mind (okay, maybe not the last), because of the amount of work involved. I don't just want to dig into it and start profiling - instead, I want to write good architecture that doesn't need to be redesigned in a couple of months.

I'd argue it's a bit strange to worry about memory fragmentation, while possibly arguing for fixed-length arrays (where most of the space is unused) and talking about how your memory allocator has 'generous' alignment. Aren't you just moving the fragmentation around?

Personally, I don't recommend trying to make everything into flat data structures for fread/fwrite - I've spent literally weeks debugging problems that arose from code that did this, but which failed to anticipate some future change (to the data structure, to alignment, or padding, or the shift from 32 to 64 bits), and silently trashed or lost some data as a result. But, I know some people will disagree and say that it's worth the effort. Still, the big problem for me is that when you change your format, all your data is dead. A careful versioning system can help.

One thing that is worth doing, if practical, is replacing dynamic length strings with string IDs, hashes, or indices. Unreal has this via the FName concept. Your structures that contain strings end up containing small fixed-sized IDs/hashes/indices, while the actual text lives in a hash table somewhere, and can be stored quite efficiently as one long contiguous string at the end of a file. Sometimes the actual text doesn't need to be loaded at all (e.g. if you use hashes, and just check for equality). Different approaches have different pros and cons.
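
A minimal sketch of the table-plus-ID idea - not Unreal's actual FName implementation, just the general shape of it:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical interned-string table: structs store a small fixed-size id,
// the characters live once in a shared table that can be written to the
// end of a file as one contiguous block.
class StringTable {
public:
   uint32_t intern(const std::string& s) {
      auto it = lookup.find(s);
      if (it != lookup.end())
         return it->second;
      uint32_t id = static_cast<uint32_t>(strings.size());
      strings.push_back(s);
      lookup.emplace(s, id);
      return id;
   }

   const std::string& resolve(uint32_t id) const { return strings[id]; }

private:
   std::vector<std::string> strings;
   std::unordered_map<std::string, uint32_t> lookup;
};

struct ActorRecord {
   uint32_t nameId; // fixed-size handle instead of a heap-allocated string
};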

In that case, your four items are the best options I know of for the general case.

#1 and #2 avoid allocations and reduce memory time. #3 and #4 (often done with an offset table) can help minimize disk time. Do all four and you reduce both disk and memory time. The more packing you do up front the more fragile the system is to modification and extension, so it is a balance.

You mention network, so the extremely high overhead of data transfer is well worth additional processing. Memory fragmentation will be specific to the game; level-based games where EVERYTHING can be loaded at once and EVERYTHING can be discarded at once fit the model of a large bundle, but they aren't particularly extensible. Longer-running systems need more handling of object lifetimes and object pools. While there is a ton of literature on various scenarios, they still boil down to special cases of the options you listed: minimize allocations and transforms, and minimize expensive I/O.

I'd argue it's a bit strange to worry about memory fragmentation, while possibly arguing for fixed-length arrays (where most of the space is unused) and talking about how your memory allocator has 'generous' alignment. Aren't you just moving the fragmentation around?

I'm aligning allocations to 16 bytes by default. Overall that's a whole lot of wasted space. That's what I meant by generous.

What I'm arguing for here is "clumping" allocations more compactly by coalescing as many of them as possible. The only place where this matters presently in practice is when loading a new level. I'm sorta structuring stuff with world streaming in mind, though, so that's kind of the bigger general concern on my mind.

Personally, I don't recommend trying to make everything into flat data structures for fread/fwrite - I've spent literally weeks debugging problems that arose from code that did this, but which failed to anticipate some future change (to the data structure, to alignment, or padding, or the shift from 32 to 64 bits), and silently trashed or lost some data as a result. But, I know some people will disagree and say that it's worth the effort. Still, the big problem for me is that when you change your format, all your data is dead. A careful versioning system can help.

Good points all around. I was thinking of sidestepping architecture/compiler-specific stuff (32/64-bit, struct member ordering and endianness specifics) by writing a strongly typed serializer that breaks everything down to basic types and fills in each field individually. So while the data is loaded from disk in one big chunk, the deserializer still walks through each field individually and does any necessary conversions. This SHOULD avoid data loss on format change.

Basically do something like this:

#include <cstdint>

// fixed-width aliases used below
typedef int32_t int32;
typedef int16_t int16;

struct rect_t {
   int32 lft, top, rgt, btm;
};

struct mytype_t {
   rect_t field1;
   int32  field2;
   int16  field3;
};

// IDataSerializer and DEF_SERIALIZED_TYPE come from my (hypothetical) serializer framework
IDataSerializer ds;
mytype_t mt;

// do runtime reflection that determines field order and sets up offsets/sizes/types
DEF_SERIALIZED_TYPE(ds, mytype_t, field1, field2, field3);
DEF_SERIALIZED_TYPE(ds, rect_t, lft, top, rgt, btm);

// do the same for basic types - int32, int16, etc

// use the definitions above to write the object tree to disk,
// breaking everything down to basic types
ds.serialize(stream, mt);

// if the order of fields is not changed, fill in the entire structure
// one basic-type member at a time
ds.deserialize(stream, mt);

I couldn't really describe how I would implement versioning in a robust way right off the bat.

One thing that is worth doing, if practical, is replacing dynamic length strings with string IDs, hashes, or indices. Unreal has this via the FName concept. Your structures that contain strings end up containing small fixed-sized IDs/hashes/indices, while the actual text lives in a hash table somewhere, and can be stored quite efficiently as one long contiguous string at the end of a file. Sometimes the actual text doesn't need to be loaded at all (e.g. if you use hashes, and just check for equality). Different approaches have different pros and cons.

Indeed, many of my strings have a hash and I've gone to some lengths to avoid data duplication by only storing a HASH-type variable where needed.

You mention network, so the extremely high overhead of data transfer is well worth additional processing.

This is a very good point. I did mention networking somewhat prematurely, though, as I don't really have any real multiplayer at this time. Worth keeping in mind, though.

You say, "the deserializer still walks through each field individually and does any necessary conversions"

Frob says, "data conversions and processing slow down serialization".

Sounds like you're trading away one problem only to get another. Without having measured anything yet, I don't think you're in a position to make that decision either way.

There is also no decent way to do versioning for arbitrarily-nested objects. You can do it at the level of individual struct types where those structs contain basic primitives, but that's about it.

concern isn't so much about speed as it is about practicality and memory fragmentation

For 64-bit builds, fragmentation is nowhere near as scary as it used to be. Modern address space is big!
Having a shitload of tiny objects isn't necessarily bad for fragmentation either. Having a shitload of tiny objects that are allocated at the same time but have wildly different lifetimes is bad for fragmentation - it's those tiny stray objects with long lifetimes that clog up your address space. If you create 10k objects at the start of a level, and then destroy all 10k objects at the end of the level, then there's no problem.
Many level/scene based games that I've worked on have enforced this constraint upon us -- any allocation created during a level must be free'd at the end of that level, or the game will forcibly crash to loudly let you know that you've broken the contract.
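
A rough sketch of how that contract can be enforced with a per-level arena allocator - all names here are hypothetical:

#include <cassert>
#include <cstddef>
#include <cstdlib>

// Everything allocated during a level comes out of this arena, and
// reset() at the end of the level reclaims it all at once. A debug build
// could also track live allocations and assert if any survive the level.
class LevelArena {
public:
   explicit LevelArena(size_t capacity)
      : base(static_cast<char*>(std::malloc(capacity))), size(capacity), used(0) {}

   ~LevelArena() { std::free(base); }

   void* alloc(size_t bytes, size_t align = 16) {
      size_t aligned = (used + align - 1) & ~(align - 1); // align must be a power of two
      assert(aligned + bytes <= size && "level arena exhausted");
      used = aligned + bytes;
      return base + aligned;
   }

   void reset() { used = 0; } // end of level: discard everything in one go

private:
   char*  base;
   size_t size;
   size_t used;
};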

The issue here is that I'm working with abstract objects that can be extended fairly arbitrarily ... objects containing arrays stored within an array stored within an array

You can go back a few steps in the design and make less of an object soup in the first place ;)
As much as I love to shit on the ECS-anti-pattern, the idea of sorting and segregating all objects by type so that they can be treated as massive buckets of bits that get streamed about the place is a very good feature.

1) convert dynamic strings to fixed-length char buffers 2) convert any vector-style arrays to fixed-size arrays

That sounds extreme and, as Kylotan pointed out, it's basically deliberately creating fragmentation in order to avoid it :wink:
Instead of this, I'd strive to get to a situation where as many objects of the same type with similar lifetime are allocated at the same time/place. e.g. in a serialization format, this could mean that when someone asks to write a string to the file, you instead add it to a buffered collection of strings and write a placeholder word which will be patched up with a string id/offset at the end. Then at the end of the serialization process, go ahead and write all of the strings into the file end-to-end, and then go back and patch up those id's/offsets. Then when deserializing, don't create one allocation per string, load that entire chunk-of-strings as a single object and then give out non-owning const char*'s to the objects that need them.
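
A rough sketch of that placeholder-and-patch idea - the writer class and its interface are made up for illustration:

#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Strings are deferred into one pooled block that gets appended at the
// end, and the 4-byte placeholders written earlier are patched with
// offsets into that block.
class PooledWriter {
public:
   void writeString(const std::string& s) {
      patches.push_back({ buffer.size(), pool.size() });
      writeRaw(uint32_t(0));                       // placeholder, patched later
      pool.insert(pool.end(), s.begin(), s.end()); // pooled characters
      pool.push_back('\0');
   }

   template <typename T>
   void writeRaw(const T& value) {
      const char* bytes = reinterpret_cast<const char*>(&value);
      buffer.insert(buffer.end(), bytes, bytes + sizeof(T));
   }

   void finish() {
      uint32_t poolStart = static_cast<uint32_t>(buffer.size());
      buffer.insert(buffer.end(), pool.begin(), pool.end());
      for (const Patch& p : patches) {
         uint32_t offset = poolStart + static_cast<uint32_t>(p.poolOffset);
         std::memcpy(&buffer[p.placeholderPos], &offset, sizeof(offset));
      }
   }

private:
   struct Patch { size_t placeholderPos; size_t poolOffset; };
   std::vector<char>  buffer;
   std::vector<char>  pool;
   std::vector<Patch> patches;
};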
This is really easy to achieve if the objects that you're loading from the file all have the exact same lifetime, such as when loading a level containing a bunch of objects that will exist for the duration of the level. In this case, you can create a single, massive allocation big enough to contain all of those objects and then placement-create all the objects into it. Even better if you can design a serialization system that lets you deserialize in place. i.e. if the constructed version of an object will be no bigger than the serialized version of it, then you can placement-create it over the top of the serialized version. This way you stream a file into a buffer, perform the deserialization process, and now that buffer is full of valid C++ objects without any further allocations being made.
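
And a bare-bones sketch of the in-place case for a trivially copyable type - the struct is made up, and real code would also need to check alignment and byte order:

#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// The serialized layout matches the in-memory layout, so the loaded
// buffer can simply be reinterpreted without any further allocations.
// Non-trivial types would instead placement-construct over the bytes.
struct MaterialData {
   float    roughness;
   float    metalness;
   uint32_t textureNameId; // index into a string table, not a pointer
};

MaterialData* deserializeInPlace(std::vector<char>& fileBuffer, size_t offset) {
   static_assert(std::is_trivially_copyable<MaterialData>::value,
                 "must be safe to treat raw bytes as the object");
   return reinterpret_cast<MaterialData*>(fileBuffer.data() + offset);
}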
FWIW, I go through the trouble of designing things this way for core engine systems -- e.g. models, materials, textures, animations, background scenes... but don't usually put in the effort for higher level gameplay systems.

I couldn't really describe how I would implement versioning in a robust way right off the bat.

In my use-case, I simply don't. I compile my core game assets into this form, and if the data structures change, I recompile them.
I don't use it for anything that can't simply be recompiled from source assets, such as a user's save game -- that kind of data uses a more bloated serialization system that can tolerate schema changes.

I did mention networking somewhat prematurely, though, as I don't really have any real multiplayer at this time.

While you could re-use disk based serialization for multiplayer networking, it's optimal to not do that. Network code really cares about the number of bits used in the encoding. Disk is slow, but still going to be 500-2000x faster than a crappy DSL line. Network data also doesn't have to persist - if there's a version mismatch then one player is unpatched or cheating and shouldn't be playing! Plus as you're repeatedly sending similar data, you can rely on things like delta encoding against previous gameplay states.
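
For example, a dirty-bit delta against the last acknowledged state - the field layout here is made up - might look like:

#include <cstdint>
#include <vector>

// Both ends keep the previously acknowledged copy of the state, so only
// the fields that changed since then go on the wire, behind a 1-byte mask.
struct PlayerState {
   float    x, y, z;
   uint16_t health;
};

void writeDelta(const PlayerState& prev, const PlayerState& curr, std::vector<uint8_t>& out) {
   uint8_t dirty = 0;
   if (curr.x != prev.x || curr.y != prev.y || curr.z != prev.z) dirty |= 1;
   if (curr.health != prev.health)                               dirty |= 2;

   out.push_back(dirty);
   if (dirty & 1) {
      const uint8_t* p = reinterpret_cast<const uint8_t*>(&curr.x);
      out.insert(out.end(), p, p + 3 * sizeof(float)); // x, y, z are contiguous
   }
   if (dirty & 2) {
      const uint8_t* p = reinterpret_cast<const uint8_t*>(&curr.health);
      out.insert(out.end(), p, p + sizeof(curr.health));
   }
}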

This is a follow-up post after a few days of mulling things over. You guys brought up several crucial things, which I tried to take into consideration. Thanks!

Anyway, I took some time to go over my engine code and assess where and how much it would be feasible to start streamlining stuff. For now I've settled my approach to memory allocations on what Hodgman said - not worrying about allocations themselves so much as making sure nearby allocations have similar lifetimes. One of the primary things I was worried about was an I/O mechanism, so I started by automating the serialization code. I haven't been able to do any profiling, because a) I don't have an actual real-world dataset to work with yet and b) for now I'm neck deep in debug mode anyway. That being said, I'm using std::is_standard_layout<> to automatically coalesce trivial fields and bin whole arrays of POD-like objects. A lot more preparatory work can likely be done at compile time, and I'm hoping to offload some additional infrastructure to the compiler when I finally upgrade to one that supports constexpr.
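
Roughly along these lines - simplified, and the per-field fallback is assumed to exist elsewhere:

#include <cstdio>
#include <type_traits>
#include <vector>

// assumed to exist elsewhere: the slower per-field serialization path
template <typename T>
void serializeFields(std::FILE* f, const T& item);

// Arrays of standard-layout, trivially copyable elements get binned with
// a single fwrite; anything else falls back to the per-field path.
template <typename T>
typename std::enable_if<std::is_standard_layout<T>::value &&
                        std::is_trivially_copyable<T>::value>::type
writeArray(std::FILE* f, const std::vector<T>& items) {
   std::fwrite(items.data(), sizeof(T), items.size(), f);
}

template <typename T>
typename std::enable_if<!(std::is_standard_layout<T>::value &&
                          std::is_trivially_copyable<T>::value)>::type
writeArray(std::FILE* f, const std::vector<T>& items) {
   for (const T& item : items)
      serializeFields(f, item);
}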

As for runtime, using a few of my actual data structures as a baseline, simple field reordering to prioritize grouping of simple/POD-like members and taking a few steps back to critically assess how I store the stuff in the first place yields results where I can coalesce about 60-80% of the data read from disk into a single mem copy operation. This does assume some consistency across platforms, though: my current solution to versioning would be to store hashes of object types in the data file's header - I'm thinking of calculating each as a CRC of one long, whitespace-stripped string obtained by concatenating all field types of the serialized object. At that point I can still decide to read the data as-is even if the hashes don't match, to partially load an object, or - as suggested above - rebuild the dataset.
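
Roughly what I have in mind - crc32 here stands in for whatever CRC routine I end up using:

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

uint32_t crc32(const char* data, size_t len); // assumed CRC routine, defined elsewhere

// Concatenate the (already whitespace-stripped) field type names of a
// serialized object and hash the result; the hash goes into the data
// file's header and is compared on load.
uint32_t layoutHash(const std::vector<std::string>& fieldTypeNames) {
   std::string signature;
   for (const std::string& typeName : fieldTypeNames)
      signature += typeName; // e.g. "rect_t" "int32" "int16"
   return crc32(signature.data(), signature.size());
}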

This does assume some consistency across platforms, though

Yeah, that's exactly the kind of assumption that cost our team about a month of coder time to fix up. Not only do you have to consider type sizes, but you have to consider alignment, and padding. And changing compiler defaults. And new platforms.

Take a simple struct of `int a; char b; vector3 c;`. This will give you a single unambiguous CRC hash. But, given any arbitrary platform or compiler setting, do you really know what sizeof(a), offsetof(b), or offsetof(c) are?

You can smash some of this with #pragma pack. Just hope you don't have any objects that have picky alignment requirements.
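
One cheap guard if you go that route: pin the layout with compile-time asserts so a compiler or platform change fails the build instead of silently corrupting data. The expected numbers below are only what a typical compiler with natural alignment produces - which is exactly the part you can't take for granted:

#include <cstddef>
#include <cstdint>

struct Example {
   int32_t a;
   char    b;
   float   c[3]; // stand-in for a 12-byte vector3
};

// Typical natural-alignment layout: b at 4, 3 bytes of padding, c at 8.
// If any platform or compiler setting disagrees, the build breaks loudly.
static_assert(sizeof(Example) == 20,     "Example layout changed");
static_assert(offsetof(Example, b) == 4, "Example::b moved");
static_assert(offsetof(Example, c) == 8, "Example::c moved");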

