Recently, I put together a benchmark for general-purpose lossless data compression. While I was doing so it occurred to me that the standard corpora aren't very representative of the type of data people actually use data compression for these days, and part of what is missing is data from games.
Lossless compression usage
I created a C++ lossless bit-compression stream a couple of years back as a quick prototype.
It's something I might use later for a game I'm making.
It probably needs an RLE pass before the actual bit packing; maybe something to add another day.
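That RLE pre-pass could be sketched roughly like this (a minimal byte-oriented version; a real bitstream implementation would pack the counts into fewer bits):

```python
def rle_encode(data: bytes) -> bytes:
    """Collapse runs of repeated bytes into (count, value) pairs.
    Counts are capped at 255 so each pair fits in two bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def rle_decode(packed: bytes) -> bytes:
    """Expand (count, value) pairs back into the original byte stream."""
    out = bytearray()
    for count, value in zip(packed[::2], packed[1::2]):
        out += bytes([value]) * count
    return bytes(out)
```

RLE only wins when the input actually contains runs; on run-free data this doubles the size, which is why it belongs as an optional pre-pass rather than a default.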
Sorry, to be clear I'm not looking for more implementations of compression algorithms—well, I am, but that's not what this thread is about. What I am trying to find is the type of data that game developers typically need to compress so I can include it in a set of data I'm putting together to test different algorithms.
In addition to being used by developers to judge which compression algorithm(s) might be appropriate for them, people developing codecs will almost certainly use this data to help tune their compressors, so getting something like the type of data you want to compress into this corpus should help make codecs better at handling that type of data—meaning better compression for games. AFAIK none of the currently used corpora (such as the Canterbury Corpus, Silesia Corpus, Calgary Corpus, etc.) include any game data.
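The kind of comparison such a corpus enables might look like the following harness, using Python's stdlib codecs as stand-ins for whatever algorithms a real benchmark would test:

```python
import lzma
import time
import zlib

def benchmark(name, compress, decompress, data):
    """Report compression ratio and timing for one codec on one corpus file."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    assert decompress(packed) == data  # lossless round trip
    return {"codec": name,
            "ratio": len(data) / len(packed),
            "compress_s": t1 - t0}

# Stand-in for a real corpus file.
sample = b"example corpus data " * 500

results = [
    benchmark("zlib", zlib.compress, zlib.decompress, sample),
    benchmark("lzma", lzma.compress, lzma.decompress, sample),
]
for r in results:
    print(f"{r['codec']}: {r['ratio']:.1f}x in {r['compress_s'] * 1e3:.2f} ms")
```

The interesting numbers only emerge once `sample` is replaced by representative data, which is exactly why the corpus contents matter.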
- image data, mainly textures
(I use lossless compression: PNG/TGA files, or DXT3/5 in DX9 format)
- audio, sound effects and music
(wav->ogg for sounds, wav->mp3 for music)
- meshes/3D model data, can be any format/compression.
(Most of the time a standard export from e.g. Max or Maya, processed through a pipeline, resulting in data fitting the game engine.)
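For context on the texture entry above: DXT-compressed textures are already fixed-ratio compressed, which affects how much a general-purpose lossless pass can squeeze out afterwards. The sizes follow directly from the fixed block layout (4x4 texel blocks; 8 bytes per block for DXT1, 16 for DXT3/5):

```python
def dxt_size(width: int, height: int, fmt: str) -> int:
    """Size in bytes of a DXT-compressed texture.
    Textures are stored as 4x4 texel blocks, rounded up on each axis;
    DXT1 stores 8 bytes per block, DXT3/DXT5 store 16."""
    blocks = ((width + 3) // 4) * ((height + 3) // 4)
    bytes_per_block = 8 if fmt == "DXT1" else 16
    return blocks * bytes_per_block

print(dxt_size(1024, 1024, "DXT5"))  # 1048576 -- one byte per texel
```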
Just a quick note -- when you're talking about what kind of files are used in games, you're really talking about two sets of files -- the ones used during development to iterate on, and the ones delivered in the published product, which are derived from the development files. You wouldn't normally have cause to compress the development files, but they might be interesting nonetheless -- game assets are huge and getting huger, which ought to be clear from the fact that some large games weigh in at 70GB or more as they sit on store shelves. But even if development files are not interesting, at the very least you'll want people to keep straight which type of files (development or production) they're speaking about in this thread.
You'll also find it common that many games rely heavily on bespoke file formats that might even vary from platform to platform for a single game. Game developers sometimes like to be able to stream on-disk content straight into memory, so that the file contents don't need to be parsed. This means that platform-specific compatibility considerations such as padding and alignment (or other, performance-impacting considerations) get built into the on-disk formats for each given platform. Sometimes this 'raw' form of the platform-specific data is compressed -- though I think this will almost universally be LZW, because at least one of the current platforms has hardware that can essentially DMA a file into memory and do the decode in transit -- I believe both current consoles also do this for JPEG as well.
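The "stream straight into memory" idea described above amounts to writing records with exactly the layout the engine expects, so a loaded (or mapped) file can be indexed without a parsing pass. A toy illustration, with an entirely hypothetical record layout rather than any real engine's format:

```python
import struct

# A fixed-layout record: u32 id, then f32 x/y/z. At 16 bytes it stays
# 16-byte aligned, so records can be indexed in the raw blob directly.
RECORD = struct.Struct("<I3f")  # 16 bytes total

def write_records(records) -> bytes:
    """Serialize (id, x, y, z) tuples into a flat, fixed-stride blob."""
    return b"".join(RECORD.pack(i, x, y, z) for i, x, y, z in records)

def read_record(blob: bytes, index: int):
    """Direct indexing: an offset computation, not a per-field parse."""
    return RECORD.unpack_from(blob, index * RECORD.size)

blob = write_records([(1, 0.0, 1.0, 2.0), (2, 3.0, 4.0, 5.0)])
print(read_record(blob, 1))  # (2, 3.0, 4.0, 5.0)
```

The catch, as the post notes, is that the "right" padding and alignment differ per platform, so the on-disk format ends up platform-specific.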
Good point—in this case I'm definitely more interested in the published files. The development files could be an interesting data set, but the published files are important to a lot more people. I'm not trying to create a corpus with every different type of data available, I'm trying to create something which will be relevant for the vast majority of people.
As for memory-mapped files, I think the only reasonable thing to do is ignore the issue. Either such cases cannot benefit from compression or the algorithm is constrained by the hardware, so there isn't really anything we can do here but focus on the use cases which could be served by compression.
The fact that LZW is used for that is interesting—thanks for that. I've been generally uninterested in adding LZW support to Squash, but I'll have to rethink that. Any chance you could share the names of some of the platforms which can decode LZW like that?
The asset pipeline basically has the following (and more) purposes:
- strip the data to only the data your game/engine needs
- optimize the data structure, to decrease loading times and improve ingame performance
An example setup may be:
1. Artist creates 3d model and stores it in network folder X
2. A batch job processes the stored asset in a standard modelling-program format (like Max); the processed asset is stored in the engine/game-specific (optimized) format
3. The engine's asset preview tool or a build of the game uses the converted asset
Another option could be that you create/use a custom modelling-tool export plugin (which in my case is out of my league, knowledge-wise :))
Note: the same or a similar process can be in place for audio and/or texture data/assets.
The fact that LZW is used for that is interesting—thanks for that. I've been generally uninterested in adding LZW support to Squash, but I'll have to rethink that. Any chance you could share the names of some of the platforms which can decode LZW like that?
Yes, I had to check whether it's been revealed on a non-NDA basis first. It's the Xbox One. This public PowerPoint from Microsoft/AMD reveals that the asynchronous 'Move engines' (a fancier DMA unit, basically) can both compress and decompress LZ data -- I'm not sure how inclusive of all the different LZ compression formats that is, though. It can also decompress JPEG, swizzle textures, and perform memset-type operations. I believe that the PS4 does at least JPEG, and might do LZ as well, but I'm not certain.
I don't know if you want to ignore the memory-mapped files -- they are platform specific, but in practice the two platforms (Xbox One and PS4) are similar enough that there might be common ground. But I'm speculating; I don't know, and probably couldn't say if I did.
I was indeed talking about compressing the wav's to ogg/mp3 (after they're "ready" for the game).
wav->mp3/ogg/opus is outside the scope of what I'm working on (which is lossless, general purpose compression). Honestly I think lossy audio compression is more interesting, but it's not what the corpus is for. I've filed an issue about 3D models, and I'll start trying to find some information about the formats different engines use—hopefully add some of the output from one of them to the corpus. My current idea is to try to get a 3D model from one of the Blender foundation's open movie projects, then figure out how to get that exported as an asset from Unity or something.
Yes, I had to check whether it's been revealed on a non-NDA basis first. It's the Xbox One. This public PowerPoint from Microsoft/AMD reveals that the asynchronous 'Move engines' (a fancier DMA unit, basically) can both compress and decompress LZ data -- I'm not sure how inclusive of all the different LZ compression formats that is, though. It can also decompress JPEG, swizzle textures, and perform memset-type operations. I believe that the PS4 does at least JPEG, and might do LZ as well, but I'm not certain.
I suspected an NDA was why you didn't just come out and say it in your initial post; thanks for taking the time to research this. I have a feeling a lot of people in the compression community weren't aware—maybe we can get some better encoders.
According to http://www.redgamingtech.com/xbox-one-sdk-leak-part-3-move-engines-memory-bandwidth-performance-tech-tribunal/ the LZ is LZ77. It also seems to imply that it is either DEFLATE, zlib, or gzip (though I'm not sure how reliable that article is…) which would make sense. If that's true, I would definitely be looking at zopfli for things you only have to compress when building a release but decompress regularly.
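The appeal of zopfli in that scenario is that it spends far more encoder time to emit a smaller stream that any standard DEFLATE decoder can still read. zopfli itself isn't in Python's standard library, but the same asymmetry shows up between zlib effort levels, which makes a rough stand-in for the build-time vs. load-time trade-off:

```python
import zlib

data = bytes(range(256)) * 256  # 64 KiB of sample payload

fast = zlib.compress(data, level=1)  # cheap encode, typically larger output
best = zlib.compress(data, level=9)  # slower encode, typically smaller output

# Both streams decode with the same decompressor. The decoder doesn't
# care how much effort the encoder spent -- which is exactly why a
# zopfli-built release still works on a stock DEFLATE decoder.
assert zlib.decompress(fast) == data
assert zlib.decompress(best) == data
print(len(fast), len(best))
```

For a game shipping compressed assets, the expensive encode happens once at build time while the cheap decode happens on every load, so trading encoder time for stream size is almost always a win.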
I don't know if you want to ignore the memory-mapped files -- they are platform specific, but in practice the two platforms (Xbox One and PS4) are similar enough that there might be common ground. But I'm speculating; I don't know, and probably couldn't say if I did.
Right, but I don't really think much can be done about the issue. Using that feature will obviously restrict people to the supported codecs, and people will probably be more interested in writing improved implementations of the relevant codecs, but AFAICT from the perspective of developing a corpus nothing changes; you are still using the same type of data.