Lossless compression usage

14 comments, last by Waterlimon 8 years, 12 months ago

Recently, I put together a benchmark for general-purpose lossless data compression. While I was doing so it occurred to me that the standard corpora aren't very representative of the type of data people actually use data compression for these days, and part of what is missing is data from games.

To address this I'm putting together a new corpus and I would like to make sure it includes data relevant for game developers. Since I have virtually no experience with game development I was hoping some of the developers around here could tell me what kind of data they usually use compression for, as well as how they use it (i.e., bundled with other content and the entire container is compressed, or is each piece of data compressed individually), so I can include something like it in the corpus.
For a list of things I'm already thinking about, please take a look at the project's issue tracker. If you have any suggestions, thoughts on the existing items, ideas for data sources, etc., I would definitely like to know.

I created a C++ lossless bit compression stream a couple of years back as a quick prototype.

It's something I might use later for a game I'm making?

It probably needs to be RLE before the actual packing; maybe something to add another day?

https://github.com/cole-anstey/bit-compression-stream/wiki
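For anyone unfamiliar with the RLE step mentioned above, here is a minimal Python sketch of byte-oriented run-length encoding. It is not taken from the linked project; the (count, value) pair layout, the 255-byte run cap, and the function names rle_encode and rle_decode are assumptions chosen for illustration.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (run_length, byte_value) pairs; runs are capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte repeats and the count fits in one byte.
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Invert rle_encode: expand each (run_length, byte_value) pair."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        out += bytes([encoded[k + 1]]) * encoded[k]
    return bytes(out)


sample = b"aaaabbc"
packed = rle_encode(sample)
restored = rle_decode(packed)
```

Note that this layout doubles the size of data with no repeated bytes, which is one reason RLE is usually only a pre-pass before a stronger packer.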

Sorry, to be clear I'm not looking for more implementations of compression algorithms—well, I am, but that's not what this thread is about. What I am trying to find is the type of data that game developers typically need to compress so I can include it in a set of data I'm putting together to test different algorithms.

In addition to being used by developers to judge which compression algorithm(s) might be appropriate for them, people developing codecs will almost certainly be using this data to help tune their compressors, so getting something like the type of data you want to compress into this corpus should help make codecs better at handling that type of data, which means better compression for games. AFAIK none of the currently used corpora (such as the Canterbury Corpus, Silesia Compression Corpus, Calgary Corpus, etc.) include any game data.

My share:
- image data, mainly textures
(I use lossless compressed, png/tga files or dxt3/5 dx9 format)
- audio, sound effects and music
(wav->ogg for sounds, wav->mp3 for music)
- meshes / 3D model data, can be any format/compression.
(Most of the time a standard export from e.g. Max or Maya, processed through a pipeline, resulting in fitting data for the game engine)

Crealysm game & engine development: http://www.crealysm.com

Looking for a passionate, disciplined and structured producer? PM me

Compressed textures are definitely something I'm interested in including, though I'm a bit unsure about the format… it seems to me (very much an outsider) that things are moving away from S3TC towards something like ETC2 or ASTC; perhaps it would be better to use one of those?
As for audio, do you compress the OGG/MP3 (or, trying to look towards the future again, Opus), or are you just talking about the compression being the WAV->OGG/MP3 encoding process? If you do compress the OGG/MP3, is that just for the installer or also for your installed data?
3D model data seems like it would be a very good idea, but I don't understand the "processed through a pipeline, resulting in fitting data for the game engine)" part. Does each game engine have its own format, and that is what you compress?
Finally, for all this stuff, is it all lumped together in some sort of container file which is then all compressed/decompressed at once, or is each file compressed independently?
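To make the container-versus-individual question concrete, here is a small Python sketch that compresses the same set of files both ways with zlib and compares the total sizes. The file names and contents are made-up stand-ins, not real game assets, and the helper names are hypothetical.

```python
import zlib


def compress_individually(files):
    """Compress each file as its own zlib stream; return the total compressed size."""
    return sum(len(zlib.compress(data, 9)) for data in files.values())


def compress_as_container(files):
    """Concatenate all files into one blob and compress it once; return the size."""
    blob = b"".join(files[name] for name in sorted(files))
    return len(zlib.compress(blob, 9))


# Hypothetical stand-in assets: 50 small, similar config-style files.
files = {
    f"level{i}.cfg": (f"name=level{i}\n" + "spawn=enemy,x=10,y=20\n" * 20).encode()
    for i in range(50)
}

individual = compress_individually(files)
container = compress_as_container(files)
print("individual:", individual, "container:", container)
```

With many small, similar files the single container tends to come out smaller, because the compressor can reuse matches across file boundaries and pays the per-stream overhead only once. The trade-off is that reading one file means decompressing everything stored before it.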

Just a quick note -- when you're talking about what kind of files are used in games, you're really talking about two sets of files -- the ones used during development to iterate on, and the ones that are delivered in the published product that are derived from the development files. You wouldn't normally have cause to compress the development files, but it might be interesting nonetheless -- game assets are huge and getting huger, which ought to be clear from the fact that some large games weigh in at 70GB or more as they sit on store shelves. But even if development files are not interesting, at the very least you'll want people to keep straight which type of files (development or production) they're speaking about in this thread.

You'll also find it common that many games rely heavily on bespoke file formats that might vary from platform to platform, even for a single game. Game developers sometimes like to be able to stream on-disk content straight into memory, so that the file contents don't need to be parsed. This means that platform-specific compatibility considerations such as padding and alignment (or other performance-impacting considerations) get built into the on-disk formats for each given platform. Sometimes this 'raw' form of the platform-specific data is compressed -- though I think this will almost universally be LZW, because at least one of the current platforms has hardware that can essentially DMA a file into memory and do the decode in transit -- I believe both current consoles do this for JPEG as well.
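As a rough illustration of what a "no parsing needed" on-disk layout looks like, here is a Python sketch using the standard struct module. The vertex layout (three floats plus a one-byte material id, padded to 16 bytes) and the names VERTEX, pack_vertices and vertex_at are invented for this example; real engines choose their own layouts per platform.

```python
import struct

# Hypothetical fixed-size record: 3 little-endian floats (12 bytes),
# 1 unsigned byte, then 3 explicit pad bytes so each record is 16 bytes.
VERTEX = struct.Struct("<3fB3x")


def pack_vertices(vertices):
    """Write vertices back to back; the result can be loaded as-is."""
    return b"".join(VERTEX.pack(x, y, z, m) for (x, y, z, m) in vertices)


def vertex_at(blob, index):
    """Read record `index` directly by offset, with no parsing pass over the file."""
    return VERTEX.unpack_from(blob, index * VERTEX.size)


vertices = [(1.0, 2.0, 3.0, 1), (4.0, 5.0, 6.0, 2)]
blob = pack_vertices(vertices)
second = vertex_at(blob, 1)
```

The pad bytes are written as zeros, so this kind of data carries regular runs of filler that a general-purpose compressor sees over and over.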

throw table_exception("(ノ ゜Д゜)ノ ︵ ┻━┻");

Good point—in this case I'm definitely more interested in the published files. The development files could be an interesting data set, but the published files are important to a lot more people. I'm not trying to create a corpus with every different type of data available, I'm trying to create something which will be relevant for the vast majority of people.

As for memory-mapped files, I think the only reasonable thing to do is ignore the issue. Either such cases cannot benefit from compression or the algorithm is constrained by the hardware, so there isn't really anything we can do here but focus on the use cases which could be served by compression.

The fact that LZW is used for that is interesting—thanks for that. I've been generally uninterested in adding LZW support to Squash, I'll have to rethink that. Any chance you could share the names of some of the platforms which can decode LZW like that?

I was indeed talking about compressing the wav's to ogg/mp3 (after they're "ready" for the game).

The asset pipeline basically has the following (and more) purposes:
- strip the data to only the data your game/engine needs
- optimize the data structure, to decrease loading times and improve ingame performance

An example setup may be:
1. Artist creates 3d model and stores it in network folder X
2. A batch job processes the stored asset, which is in a standard modelling program format (like Max); the processed asset is stored in the engine/game-specific (optimized) format
3. The engine's asset preview tool or a build of the game uses the converted asset
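The batch-job step above could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the JSON source format, the field names, the binary layout, and the function bake_mesh are all hypothetical and do not correspond to any particular engine or modelling tool.

```python
import json
import struct


def bake_mesh(source_json: str) -> bytes:
    """Strip a (hypothetical) JSON model down to positions and indices
    and write them in a compact, fixed binary layout."""
    model = json.loads(source_json)
    positions = model["positions"]  # list of [x, y, z]
    indices = model["indices"]      # list of vertex indices

    # Header: vertex count and index count as little-endian uint32.
    header = struct.pack("<II", len(positions), len(indices))
    verts = b"".join(struct.pack("<3f", *p) for p in positions)
    idx = struct.pack(f"<{len(indices)}H", *indices)  # uint16 indices
    return header + verts + idx


# Source asset with authoring data the game does not need at runtime.
source = json.dumps({
    "author": "artist",
    "history": ["extrude", "bevel"],
    "positions": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "indices": [0, 1, 2],
})
baked = bake_mesh(source)
```

The baked output is what would ship with the game, so it is the kind of data that is relevant for a corpus of published files.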

Another option could be that you create/use a custom modelling tool export plugin (which in my case is out of my league on knowledge level :))

Note: the same or a similar process can be in place for audio and/or texture data/assets.



The fact that LZW is used for that is interesting—thanks for that. I've been generally uninterested in adding LZW support to Squash, I'll have to rethink that. Any chance you could share the names of some of the platforms which can decode LZW like that?

Yes, I had to check whether it's been revealed on a non-NDA basis first. It's the Xbox One. This public PowerPoint from Microsoft/AMD reveals that the asynchronous 'Move engines' (a fancier DMA unit, basically) can both compress and decompress LZ data -- I'm not sure how inclusive of all the different LZ compression formats that is, though. It can also decompress JPEG, swizzle textures, and perform memset-type operations. I believe that the PS4 does at least JPEG, and might do LZ as well, but I'm not certain.

I don't know if you want to ignore the memory-mapped files -- they are platform-specific, but in practice the two platforms (Xbox One and PS4) are similar enough that there might be common ground. But I'm speculating; I don't know, and probably couldn't say if I did.


I was indeed talking about compressing the wav's to ogg/mp3 (after they're "ready" for the game).

wav->mp3/ogg/opus is outside the scope of what I'm working on (which is lossless, general-purpose compression). Honestly I think lossy audio compression is more interesting, but it's not what the corpus is for. I've filed an issue about 3D models, and I'll start trying to find some information about the formats different engines use, and hopefully add some of the output from one of them to the corpus. My current idea is to try to get a 3D model from one of the Blender Foundation's open movie projects, then figure out how to get that exported as an asset from Unity or something.

Yes, I had to check whether it's been revealed on a non-NDA basis first. It's the Xbox One. This public PowerPoint from Microsoft/AMD reveals that the asynchronous 'Move engines' (a fancier DMA unit, basically) can both compress and decompress LZ data -- I'm not sure how inclusive of all the different LZ compression formats that is, though. It can also decompress JPEG, swizzle textures, and perform memset-type operations. I believe that the PS4 does at least JPEG, and might do LZ as well, but I'm not certain.

I suspected an NDA was why you didn't just come out and say it in your initial post; thanks for taking the time to research this. I have a feeling a lot of people in the compression community weren't aware; maybe we can get some better encoders.

According to http://www.redgamingtech.com/xbox-one-sdk-leak-part-3-move-engines-memory-bandwidth-performance-tech-tribunal/ the LZ is LZ77. It also seems to imply that it is either DEFLATE, zlib, or gzip (though I'm not sure how reliable that article is…) which would make sense. If that's true, I would definitely be looking at zopfli for things you only have to compress when building a release but decompress regularly.
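For context on why "DEFLATE, zlib, or gzip" are grouped together: all three carry the same LZ77-plus-Huffman compressed stream and differ only in the wrapper around it. The Python sketch below shows this with the standard zlib module, selecting the wrapper through the wbits parameter. The sample data and the helper name deflate are made up for the example, and nothing here says which wrapper the console hardware actually expects.

```python
import zlib


def deflate(data, wbits, level=9):
    """Compress with a 32 KiB window; the value of wbits selects the wrapper."""
    c = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return c.compress(data) + c.flush()


data = b"texture atlas row " * 200

raw = deflate(data, -15)  # raw DEFLATE, no header or checksum
zl = deflate(data, 15)    # zlib: 2-byte header + 4-byte Adler-32 trailer
gz = deflate(data, 31)    # gzip: 10-byte header + CRC-32 and length trailer

print(len(raw), len(zl), len(gz))
```

Zopfli produces output in these same formats, so an existing decoder (hardware or software) can read it; it just spends far more time on the compression side. It is a separate third-party library and is not used in this sketch.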

I don't know if you want to ignore the memory-mapped files -- they are platform-specific, but in practice the two platforms (Xbox One and PS4) are similar enough that there might be common ground. But I'm speculating; I don't know, and probably couldn't say if I did.

Right, but I don't really think much can be done about the issue. Using that feature will obviously restrict people to the supported codecs, and people will probably be more interested in writing improved implementations of the relevant codecs, but AFAICT from the perspective of developing a corpus nothing changes; you are still using the same type of data.

