Boost Serialization for Resource loading is extremely slow

18 comments, last by Hodgman 10 years, 9 months ago

Now that there is code, it doesn't seem to do anything suspicious.

  • A 26MB file is large, but is it bloated compared to the source assets it was created from? Did you inspect it?
  • Are textures compressed? In the file, they should be. The Boost Serialization library allows you to read and write arbitrary byte arrays, so integrating libjpeg, libpng, or other mature libraries to encode and decode image data shouldn't be difficult.
  • Maybe std::vector isn't the most lightweight choice. Did you benchmark arrays of primitive types with their special wrapper objects (http://www.boost.org/doc/libs/1_53_0/libs/serialization/doc/wrappers.html#arrays)?

Omae Wa Mou Shindeiru


Are textures compressed? In the file, they should be. The Boost Serialization library allows you to read and write arbitrary byte arrays, so integrating libjpeg, libpng, or other mature libraries to encode and decode image data shouldn't be difficult.

Yes and no. Yes, they should be using a compressed format, and no they should not be those formats and they should require absolutely zero decoding.

Games don't use JPEG or PNG. Those formats require quite a lot of processing to turn into formats that the game can actually use. Game artists don't use those formats either; they use PSD.

Games use DXT1-5 for images because they can be fed directly to the video card in both OpenGL and DirectX and they are nicely compressed. Some mobile GPUs support DXT-format images as well, although many rely on other block-compressed formats such as PVRTC or ETC.

If your game is using jpeg images, that's an issue right there. Jpeg is great for web pages and photos, but horrible for just about everything else.

As an update, doing a release build (it took a while to get the projects set up) drastically improved the situation, going from 30-40 seconds to some 400-ish ms. Like night and day.

I like the idea of a "light-debug" build with _HAS_ITERATOR_DEBUGGING set to 0. Are there any other configurations that could improve debug-build speeds?

There are all kinds of flags. Pulling from my current project (not the latest version of Visual Studio) I get these.

/O2 /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "WIN32_LEAN_AND_MEAN" /D "VC_EXTRALEAN" /D "NOMINMAX" /D "_CRT_SECURE_NO_DEPRECATE" /D "_CRT_NONSTDC_NO_DEPRECATE" /D "_SCL_SECURE_NO_DEPRECATE" /D "_SECURE_SCL=0" /D "_MBCS" /FD /MD /GS- /arch:SSE /fp:fast /GR- /W4 /WX /Zi

There are a few more that I left out, such as precompiled headers, include directories, and output file names. But that should give you an idea of what kinds of options go to the compiler for a faster debug build.

/D "NDEBUG"

Are you usually in the habit of defining NDEBUG for debug builds?

It's a little odd to turn off assertions in favour of speed...

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

FWIW, actually, a lot of games (Quake 3, Xonotic, Minecraft, ...) have used formats like JPEG and PNG.

Rage uses HD Photo / JPEG-XR.

...

OTOH, lots of games have also used raw DXT, such as via DDS or VTF.

generally, we want DXT on the GPU though, either via conversion on load, or using a DXT-based format for storage.

a minor downside of on-disk storage of raw DXTn (DXT1 or DXT5) is that it does often take more space than a JPEG version of the texture.

some additional filtering and compression can help here, making it much more competitive space-wise with JPEG (basically, we want an algorithm to merge and eliminate similar looking pixel-blocks, as well as possibly perform some additional entropy coding, *1).

*1: in my case, I am using a custom format here, which uses "block reduction" and a combination of custom LZ77 and Deflate. I am mostly left considering this as an option for video-mapped textures (vs the use of a customized JPEG variant...).

FWIW: JPEG -> DXTn conversion can be sped up some by making a specialized decoder which decodes directly to DXTn.

/D "NDEBUG"

Are you usually in the habit of defining NDEBUG for debug builds?

It's a little odd to turn off assertions in favour of speed...

Oh, quite. We have a custom assertion system. Yeah, unless you have your own custom assertion library, you probably want that one left on for a mixed debug/release build.
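
For anyone wondering what "a custom assertion system" could look like, here is a minimal sketch. The names GAME_ASSERT, GAME_ASSERTS_ENABLED and assertHandler are invented for this example and are not taken from anyone's codebase in this thread. The only point is that the check has its own on/off switch, so defining NDEBUG for speed does not silence it.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical switch, deliberately independent of NDEBUG, so an optimized
// "light-debug" build can define NDEBUG and still keep these checks.
#ifndef GAME_ASSERTS_ENABLED
#define GAME_ASSERTS_ENABLED 1
#endif

// Signature of the function that is called when an assertion fails.
using AssertHandler = void (*)(const char* expr, const char* file, int line);

// Default behaviour: report the failed expression and abort.
inline void defaultAssertHandler(const char* expr, const char* file, int line) {
    std::fprintf(stderr, "%s(%d): assertion failed: %s\n", file, line, expr);
    std::abort();
}

// Replaceable handler, e.g. so tools can log a failure instead of aborting.
inline AssertHandler& assertHandler() {
    static AssertHandler handler = defaultAssertHandler;
    return handler;
}

#if GAME_ASSERTS_ENABLED
#define GAME_ASSERT(expr) \
    ((expr) ? static_cast<void>(0) : assertHandler()(#expr, __FILE__, __LINE__))
#else
#define GAME_ASSERT(expr) static_cast<void>(0)
#endif
```

Usage is the same as with the standard macro, e.g. GAME_ASSERT(index < count); building with /D "GAME_ASSERTS_ENABLED=0" (again, a made-up name) would compile the checks out.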

a minor downside of on-disk storage of raw DXTn (DXT1 or DXT5) is that it does often take more space than a JPEG version of the texture.


Isn't that because JPEG is lossy, though? Or is it still smaller with the quality maxed out?

Edit - nvm, I'm looking at DXT now and I see that it's also lossy.

Question - Do adapters convert DXT textures to table/map when loading, or do they just store in DXT format and do the calculations per-render?

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

a minor downside of on-disk storage of raw DXTn (DXT1 or DXT5) is that it does often take more space than a JPEG version of the texture.


Isn't that because JPEG is lossy, though? Or is it still smaller with the quality maxed out?

Edit - nvm, I'm looking at DXT now and I see that it's also lossy.

Question - Do adapters convert DXT textures to table/map when loading, or do they just store in DXT format and do the calculations per-render?

AFAIK: it is done per-render.

namely, DXT is as it is to save video RAM vs the use of raw RGBA textures.

a lot of JPEG's small size is due to the way the DCT / quantization / entropy coding work. JPEG will often end up using a moderately small number of bits for each block (as a lot of values become zero and are simply skipped), so on average the images come out fairly small.

OTOH, DXT reserves bits for each pixel in every block, as well as storing explicit per-block color information, ...
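
To make that fixed per-block cost concrete, here is a rough decoder for a single DXT1 block. It follows the commonly documented layout (two RGB565 endpoint colors followed by sixteen 2-bit palette indices, 8 bytes per 4x4 block). The type and function names are made up for this sketch, and the exact rounding of the interpolated colors varies between implementations, so treat it as an illustration rather than a reference decoder.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Rgba { std::uint8_t r, g, b, a; };

// Expand a packed RGB565 value to 8 bits per channel (opaque alpha).
inline Rgba expand565(std::uint16_t c) {
    const int r5 = (c >> 11) & 0x1F, g6 = (c >> 5) & 0x3F, b5 = c & 0x1F;
    return Rgba{ static_cast<std::uint8_t>((r5 << 3) | (r5 >> 2)),
                 static_cast<std::uint8_t>((g6 << 2) | (g6 >> 4)),
                 static_cast<std::uint8_t>((b5 << 3) | (b5 >> 2)),
                 255 };
}

// Weighted average of two colors; wa + wb is the divisor.
inline Rgba blend(Rgba a, Rgba b, int wa, int wb) {
    const int d = wa + wb;
    return Rgba{ static_cast<std::uint8_t>((a.r * wa + b.r * wb) / d),
                 static_cast<std::uint8_t>((a.g * wa + b.g * wb) / d),
                 static_cast<std::uint8_t>((a.b * wa + b.b * wb) / d),
                 255 };
}

// Decode one 8-byte DXT1 block into 16 pixels (row-major 4x4).
inline std::array<Rgba, 16> decodeDxt1Block(const std::uint8_t* block) {
    assert(block != nullptr);
    // Bytes 0-3: the two endpoint colors, stored little-endian.
    const std::uint16_t c0 = static_cast<std::uint16_t>(block[0] | (block[1] << 8));
    const std::uint16_t c1 = static_cast<std::uint16_t>(block[2] | (block[3] << 8));

    Rgba palette[4];
    palette[0] = expand565(c0);
    palette[1] = expand565(c1);
    if (c0 > c1) {  // four-color mode: two interpolated colors
        palette[2] = blend(palette[0], palette[1], 2, 1);
        palette[3] = blend(palette[0], palette[1], 1, 2);
    } else {        // three-color mode: midpoint plus transparent black
        palette[2] = blend(palette[0], palette[1], 1, 1);
        palette[3] = Rgba{0, 0, 0, 0};
    }

    // Bytes 4-7: one byte per row, 2 bits per pixel, leftmost pixel in the low bits.
    std::array<Rgba, 16> out{};
    for (int i = 0; i < 16; ++i) {
        const int index = (block[4 + i / 4] >> ((i % 4) * 2)) & 0x3;
        out[i] = palette[index];
    }
    return out;
}
```

Every block costs 8 bytes no matter how flat the image is, which is the contrast with JPEG described above.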

also, if using DXT5 and mipmaps, these can make the image around 2.7x larger than if using DXT1 with no mipmaps, which is also an issue.

IME, for many images, DXT1 (with no mipmaps) will be roughly the same size as a 90-95% quality JPEG, but using DXT5 and/or mipmaps makes it a lot bigger (whereas turning down the quality will make a JPEG a lot smaller).

for JPEG images, it is more common to generate the mipmaps dynamically (so they are not stored). (DXTn doesn't leave as much room for efficiently generating mipmaps dynamically).

standard JPEG also lacks alpha-channel support; this can be hacked on, though, and the result will usually compress down pretty well.

whereas, using DXT5 (with a full alpha-channel) effectively doubles the image size.
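
The size relationships are easy to check with a little arithmetic. A sketch, assuming the standard block sizes (8 bytes per 4x4 block for DXT1, 16 bytes for DXT5) and a full mip chain down to 1x1; dxtBytes is a name made up for this example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Storage needed for a block-compressed texture.
// bytesPerBlock: 8 for DXT1, 16 for DXT3/DXT5.
inline std::size_t dxtBytes(std::size_t width, std::size_t height,
                            std::size_t bytesPerBlock, bool withMipmaps) {
    assert(width > 0 && height > 0);
    std::size_t total = 0;
    for (;;) {
        // Each mip level is stored as whole 4x4 blocks, rounded up.
        const std::size_t blocksX = (width + 3) / 4;
        const std::size_t blocksY = (height + 3) / 4;
        total += blocksX * blocksY * bytesPerBlock;
        if (!withMipmaps || (width == 1 && height == 1)) break;
        width  = std::max<std::size_t>(1, width / 2);
        height = std::max<std::size_t>(1, height / 2);
    }
    return total;
}
```

For a 1024x1024 texture this gives 512 KiB for DXT1 without mipmaps and 1 MiB for DXT5 without mipmaps; adding a full mip chain grows either figure by about a third, so DXT5 with mipmaps ends up at roughly 2.7 times the plain DXT1 size.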

...

but, granted, some of this is why secondary compression can squeeze down DXTn images somewhat.

for the direct JPEG -> DXT conversion option, this mostly means shaving off a lot of the upper end of the JPEG decoding logic: once the DCT blocks are decoded, we convert the YUV macroblocks directly into DXT (more directly converting each 16x16 macroblock into DXT blocks).

granted, this still costs time, so textures stored directly as DXT blocks remain faster to load.
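
As a rough illustration of the last step of such a conversion, here is a very simple DXT1 block encoder. This is not the encoder described above: it assumes the 4x4 pixels are already available as RGB (so any YUV handling is left out), picks the darkest and brightest pixel as endpoints, which is a crude heuristic, and all names are invented for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

struct Rgb { std::uint8_t r, g, b; };

// Pack 8-bit channels into RGB565 by dropping the low bits.
inline std::uint16_t pack565(Rgb c) {
    return static_cast<std::uint16_t>(((c.r >> 3) << 11) | ((c.g >> 2) << 5) | (c.b >> 3));
}

// Expand RGB565 back to 8 bits per channel.
inline Rgb expand565(std::uint16_t c) {
    const int r5 = (c >> 11) & 0x1F, g6 = (c >> 5) & 0x3F, b5 = c & 0x1F;
    return Rgb{ static_cast<std::uint8_t>((r5 << 3) | (r5 >> 2)),
                static_cast<std::uint8_t>((g6 << 2) | (g6 >> 4)),
                static_cast<std::uint8_t>((b5 << 3) | (b5 >> 2)) };
}

// Weighted average of two colors; the weights are expected to sum to 3.
inline Rgb blend(Rgb a, Rgb b, int wa, int wb) {
    return Rgb{ static_cast<std::uint8_t>((a.r * wa + b.r * wb) / 3),
                static_cast<std::uint8_t>((a.g * wa + b.g * wb) / 3),
                static_cast<std::uint8_t>((a.b * wa + b.b * wb) / 3) };
}

inline int distSq(Rgb a, Rgb b) {
    const int dr = a.r - b.r, dg = a.g - b.g, db = a.b - b.b;
    return dr * dr + dg * dg + db * db;
}

// Encode 16 RGB pixels (row-major 4x4) into one 8-byte DXT1 block.
inline void encodeDxt1Block(const Rgb* pixels, std::uint8_t* out) {
    assert(pixels != nullptr && out != nullptr);
    // Endpoints: darkest and brightest pixel by channel sum.
    Rgb lo = pixels[0], hi = pixels[0];
    for (int i = 1; i < 16; ++i) {
        const int sum = pixels[i].r + pixels[i].g + pixels[i].b;
        if (sum < lo.r + lo.g + lo.b) lo = pixels[i];
        if (sum > hi.r + hi.g + hi.b) hi = pixels[i];
    }
    std::uint16_t c0 = pack565(hi), c1 = pack565(lo);
    if (c0 < c1) std::swap(c0, c1);  // c0 > c1 selects four-color mode

    out[0] = static_cast<std::uint8_t>(c0 & 0xFF);
    out[1] = static_cast<std::uint8_t>(c0 >> 8);
    out[2] = static_cast<std::uint8_t>(c1 & 0xFF);
    out[3] = static_cast<std::uint8_t>(c1 >> 8);
    out[4] = out[5] = out[6] = out[7] = 0;
    if (c0 == c1) return;  // flat block: every pixel uses palette entry 0

    const Rgb e0 = expand565(c0), e1 = expand565(c1);
    const Rgb palette[4] = { e0, e1, blend(e0, e1, 2, 1), blend(e0, e1, 1, 2) };
    for (int i = 0; i < 16; ++i) {
        int best = 0;  // index of the closest palette color
        for (int k = 1; k < 4; ++k)
            if (distSq(pixels[i], palette[k]) < distSq(pixels[i], palette[best])) best = k;
        out[4 + i / 4] = static_cast<std::uint8_t>(out[4 + i / 4] | (best << ((i % 4) * 2)));
    }
}
```

Even this naive version does a palette search for every pixel, which gives a feel for why pre-compressed DXT data on disk loads faster than anything converted at load time.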

They are read-only once they are deserialized.

That solution looks awesome. If I can skip the serialization library I'm all for it, as I can then see no reason for it. Can you give a brief explanation of how this solution would work?

The function I posted shows that there is no work to be done in deserialization (it's just a cast of bytes to your structure type). The magic behind that is in the changes to your structures included above the function. The header that I include at the top is open-source (MIT license), or you can implement your own pretty easily by copying what the header does.

Pointers have been replaced with Offset<T>'s, which are basically just a 32-bit integer that holds a relative offset value.
To convert an Offset<T> into a T* at runtime, the operation is:


int& offset = ...;//given an integer offset variable
char* bytes = (char*)&offset;//cast the variable's address to a byte address
bytes += offset;//increment the byte pointer by the value of the integer
T* object = (T*)bytes;//cast the new address to the desired object type

In that header, this is done inside Offset<T>::operator->, so that you can use these offset variables as if they were pointers!
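
For readers who want to roll their own, a stripped-down version of the idea might look like the sketch below. RelativePtr, Blob and makeBlob are names made up for this example; this is not the code from the header mentioned above. Note that computing an address outside the member itself goes beyond what the C++ standard strictly guarantees, even though it behaves as expected on mainstream compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for Offset<T>: a 32-bit byte distance from this
// field to the object it refers to.
template <typename T>
struct RelativePtr {
    std::int32_t offset;

    T* get() {
        char* bytes = reinterpret_cast<char*>(&offset);  // address of the field itself
        return reinterpret_cast<T*>(bytes + offset);     // step forward by the stored value
    }
    const T* get() const {
        const char* bytes = reinterpret_cast<const char*>(&offset);
        return reinterpret_cast<const T*>(bytes + offset);
    }
    T* operator->() { return get(); }
    const T* operator->() const { return get(); }
    T& operator*() { return *get(); }
    const T& operator*() const { return *get(); }
};

struct Vec2 { float x, y; };

// Example blob: a link, one unrelated word, then the object the link refers to.
struct Blob {
    RelativePtr<Vec2> position;
    std::int32_t      unused;
    Vec2              target;
};

inline Blob makeBlob() {
    Blob b{};
    b.target = Vec2{1.5f, -2.0f};
    // Distance in bytes from the `position` field to `target`.
    b.position.offset =
        static_cast<std::int32_t>(offsetof(Blob, target) - offsetof(Blob, position));
    assert(b.position.offset > 0);
    return b;
}
```

Because the stored value is relative, a byte-for-byte copy of the blob still resolves correctly: after Blob copy = makeBlob();, copy.position->x reads copy.target.x with no fix-up step.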

Now that we're not using pointers (which are absolute memory addresses), and we're instead using relative/local memory offsets, we're able to save them to disk. The same offset value can be stored on disk and used in RAM, so no processing is required on load -- instead, every time that you use the Offset<T> variable as a pointer, you pay the cost of an extra addition instruction, which is negligible. Also, performance may be improved despite that additional addition, as now the locality of all your data is guaranteed to be great (it's all in a single contiguous allocation, whereas the std::vectors are at the mercy of wherever new feels like storing your data).

The other change to your structures is I've replaced the std::vector<T>'s with List<T>'s from that header. This is a variable-sized structure, which begins with an int32 "header" containing the length of the array, and then the header is followed by the actual array data.

Because List<T> is a variable-sized struct, you can't embed it as a member easily: its size isn't known at compile time, so your member variable is only big enough to hold the header. The actual array data will overlap with the other members. To fix that, I use offsets to lists in the modified version of your code:


struct Foo_Broken
{
  List<Bar> a;//header followed by data
  List<Bar> b;//uh oh, a's data will overwrite b's header
};
struct Foo_Fixed
{
  Offset<List<Bar>> a;//the list is somewhere else, not right here, no overflow
  List<Bar> b;//this one doesn't *have* to use an offset.
  //Just be aware that now Foo_Fixed is a variable-sized structure, because it will be followed by b's data!
};
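
A bare-bones version of such a variable-sized list could look like this. InlineList is a made-up name rather than the List<T> from the header, and the sketch assumes T needs no stricter alignment than the 32-bit count in front of it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for List<T>: a 32-bit element count, immediately
// followed in memory by that many elements.
template <typename T>
struct InlineList {
    std::int32_t count;  // the "header"

    // The elements start right after the header.
    const T* begin() const { return reinterpret_cast<const T*>(this + 1); }
    const T* end() const { return begin() + count; }
    std::int32_t size() const { return count; }
    const T& operator[](std::int32_t i) const {
        assert(i >= 0 && i < count);
        return begin()[i];
    }
};

// Stand-in for bytes loaded from a file: a count of 3, then three values.
alignas(4) const std::int32_t kWords[] = { 3, 10, 20, 30 };

// "Deserialization" is just a cast of the loaded bytes.
inline const InlineList<std::int32_t>& exampleList() {
    return *reinterpret_cast<const InlineList<std::int32_t>*>(kWords);
}
```

sizeof(InlineList<T>) only covers the header, which is exactly why embedding two of them back to back goes wrong in Foo_Broken.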


So that's the "deserialization"/runtime part covered, which is easy. The tricky part is the serialization routines to get data into this format.

Personally, I generate all my data from some C# tools, so I've made some extensions to C#'s BinaryWriter to help with this task... but the same ideas would work with any kind of "binary file writer" class that can write data of different sizes, can tell you its position in the file, and lets you jump back and forth in the file.

Say I wanted to write some data to match the C++ struct of:


struct Header
{
  Offset<List<float>> data1;
  Offset<List<u8>> data2;
};

I'd use some C# code like this:


List<float> data1 = ...
List<byte> data2 = ...
//^^ inputs
BinaryWriter writer = ...
//^^ output file

//first write some placeholder data for the header structure (two 32-bit offsets), but remember their positions
long headerData1pos = writer.WriteTemp32();
long headerData2pos = writer.WriteTemp32();

//now to write the List<float> data1 member
//first, rewind to headerData1pos, and overwrite it with the offset from there to here, then fast-forward back to here
writer.OverwriteTemp32(headerData1pos, writer.RelativeOffset(headerData1pos));
writer.Write32( data1.Count );//write the list header - 32bit array size
foreach( var data in data1 )
  writer.WriteFloat( data );//write the array contents

//now to write the List<u8> data2 member
//again, rewind to the header, write the actual offset value in it, then fast-forward back to the end of the file
writer.OverwriteTemp32(headerData2pos, writer.RelativeOffset(headerData2pos));
writer.Write32( data2.Count );//write the list header - 32bit array size
foreach( var data in data2 )
  writer.Write8( data );//write the array contents

The resulting file's bytes can then be loaded into RAM in your C++ app, cast to a Header*, and it will just work.
To save space on disc, your file system / file loader can implement some kind of compression on storing/loading files if you want, like ZLIB/GZ/LZMA/etc...
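
If your tools are in C++ rather than C#, the same placeholder-and-patch approach can be sketched with a small in-memory writer. BlobWriter and writeHeaderBlob are names made up for this example; the method names only mirror the C# extensions used above, whose internals are not shown in this thread, so their behaviour here is an assumption based on the comments. Values are written in the host's native byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-memory writer that can append values and patch earlier ones.
class BlobWriter {
public:
    // Append a 32-bit placeholder and return its byte position.
    std::size_t writeTemp32() {
        const std::size_t pos = bytes_.size();
        write32(0);
        return pos;
    }
    // Distance in bytes from an earlier position to the current end of the data.
    std::int32_t relativeOffset(std::size_t from) const {
        return static_cast<std::int32_t>(bytes_.size() - from);
    }
    // Overwrite a previously written placeholder with its final value.
    void overwriteTemp32(std::size_t pos, std::int32_t value) {
        assert(pos + sizeof value <= bytes_.size());
        std::memcpy(&bytes_[pos], &value, sizeof value);
    }
    void write32(std::int32_t value) { append(&value, sizeof value); }
    void writeFloat(float value) { append(&value, sizeof value); }
    void write8(std::uint8_t value) { bytes_.push_back(value); }
    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

private:
    void append(const void* data, std::size_t size) {
        const std::uint8_t* p = static_cast<const std::uint8_t*>(data);
        bytes_.insert(bytes_.end(), p, p + size);
    }
    std::vector<std::uint8_t> bytes_;
};

// Writes the Header layout: two offsets, then each list (count + items).
inline std::vector<std::uint8_t> writeHeaderBlob(const std::vector<float>& data1,
                                                 const std::vector<std::uint8_t>& data2) {
    BlobWriter w;
    const std::size_t data1Pos = w.writeTemp32();  // placeholder for Header::data1
    const std::size_t data2Pos = w.writeTemp32();  // placeholder for Header::data2

    w.overwriteTemp32(data1Pos, w.relativeOffset(data1Pos));  // patch offset to "here"
    w.write32(static_cast<std::int32_t>(data1.size()));       // list header
    for (float f : data1) w.writeFloat(f);                    // list contents

    w.overwriteTemp32(data2Pos, w.relativeOffset(data2Pos));
    w.write32(static_cast<std::int32_t>(data2.size()));
    for (std::uint8_t b : data2) w.write8(b);
    return w.bytes();
}
```

The returned bytes can be written to a file as they are; a real tool would also want to pad each list to the alignment its element type needs.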

If the inputs to the above routine were 1 float with hex value 0x12345678, and 4 bytes with the values 1, 2, 3 and 4, the output file would look like this (when interpreted as groups of 32-bit integers expressed as hex):


0: 0x00000008 // data1 offset - jump forward 8 bytes to line #2
1: 0x0000000C // data2 offset - jump forward 12 bytes to line #4
2: 0x00000001 // data1 list header - 1 item in array
3: 0x12345678 // our float value
4: 0x00000004 // data2 list header - 4 items in array
5: 0x04030201 // our 4 byte values (in little endian order, the right hand byte is written/read before the left hand byte).
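
To double-check the dump above, the offsets can be walked by hand. This sketch spells the six words out as little-endian bytes and resolves each offset as "address of the field plus the value stored in it". kFile, readLE32, ListRef and resolveList are names invented for the example, and it reads bytes explicitly instead of casting to structs so that it does not depend on the host's byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The six 32-bit words from the dump, byte by byte in little-endian order.
const std::uint8_t kFile[24] = {
    0x08, 0x00, 0x00, 0x00,  // word 0: data1 offset
    0x0C, 0x00, 0x00, 0x00,  // word 1: data2 offset
    0x01, 0x00, 0x00, 0x00,  // word 2: data1 list header (1 item)
    0x78, 0x56, 0x34, 0x12,  // word 3: float bits 0x12345678
    0x04, 0x00, 0x00, 0x00,  // word 4: data2 list header (4 items)
    0x01, 0x02, 0x03, 0x04,  // word 5: the four byte values
};

// Decode a little-endian 32-bit value.
inline std::uint32_t readLE32(const std::uint8_t* p) {
    return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
           (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

struct ListRef {
    std::uint32_t count;        // value of the list header
    const std::uint8_t* items;  // first byte of the list contents
};

// Resolve the offset field at byte position fieldPos (forward offsets only).
inline ListRef resolveList(const std::uint8_t* file, std::size_t size, std::size_t fieldPos) {
    assert(fieldPos + 4 <= size);
    const std::size_t listPos = fieldPos + readLE32(file + fieldPos);
    assert(listPos + 4 <= size);
    return ListRef{ readLE32(file + listPos), file + listPos + 4 };
}
```

Resolving the field at byte 0 lands on word 2, and resolving the field at byte 4 lands on word 4, matching the comments in the dump.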


Question - Do adapters convert DXT textures to table/map when loading, or do they just store in DXT format and do the calculations per-render?

GPUs have dedicated hardware to perform DXT decompression on pixels at the last possible moment (right when the shader asks for a pixel). This greatly improves performance because it means that the texture is still compressed even in the texture cache, which means more data can be cached at once, and less bandwidth is required per texture-fetch.

also, if using DXT5 and mipmaps, these can make the image around 2.7x larger than if using DXT1 with no mipmaps, which is also an issue.

To use this as an excuse to expand on the above statement on DXT performance - mipmaps are also extremely important for performance (regardless of texture format), because they improve locality of texture fetches / reduce bandwidth of fetches (during minification scenarios). So you should use them in most cases. As mentioned by cr88192, with DXT textures, mipmaps are usually saved on disc, but with other formats like JPEG, they're often generated on-load.

