nemequ

Lossless compression usage


Recently, I put together a benchmark for general-purpose lossless data compression.  While I was doing so it occurred to me that the standard corpora aren't very representative of the type of data people actually use data compression for these days, and part of what is missing is data from games.

 
To address this I'm putting together a new corpus and I would like to make sure it includes data relevant for game developers. Since I have virtually no experience with game development I was hoping some of the developers around here could tell me what kind of data they usually use compression for, as well as how they use it (i.e., bundled with other content and the entire container is compressed, or is each piece of data compressed individually), so I can include something like it in the corpus.
 
For a list of things I'm already thinking about, please take a look at the project's issue tracker.  If you have any suggestions, thoughts on the existing items, ideas for data sources, etc., I would definitely like to know.


I created a C++ lossless bit compression stream a couple of years back as a quick prototype. 

It's something I might use later for a game I'm making?

It probably needs an RLE pass before the actual packing; maybe something to add another day?

 

https://github.com/cole-anstey/bit-compression-stream/wiki
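For illustration, here is a minimal sketch of the kind of RLE pre-pass mentioned above. It is not taken from the linked project; it just shows the idea of collapsing runs before bit packing.

#include <cstdint>
#include <vector>

// Minimal byte-oriented RLE encoder: emits (run length, value) pairs.
// Illustrative only; not from the bit-compression-stream project above.
std::vector<uint8_t> rle_encode(const std::vector<uint8_t>& in) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i < in.size(); ) {
        size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255)
            ++run;
        out.push_back(static_cast<uint8_t>(run)); // run length (1..255)
        out.push_back(in[i]);                     // repeated byte value
        i += run;
    }
    return out;
}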

Edited by collie


Sorry, to be clear I'm not looking for more implementations of compression algorithms—well, I am, but that's not what this thread is about.  What I am trying to find is the type of data that game developers typically need to compress so I can include it in a set of data I'm putting together to test different algorithms.

 

In addition to being used by developers to judge which compression algorithm(s) might be appropriate for them, people developing codecs will almost certainly be using this data to help tune their compressors, so getting something like the type of data you want to compress into this corpus should help make codecs better at handling that type of data—meaning better compression for games.  AFAIK none of the currently used corpora (such as the Canterbury Corpus, the Silesia Compression Corpus, the Calgary Corpus, etc.) include any game data.

Compressed textures are definitely something I'm interested in including, though I'm a bit unsure about the format… it seems to me (very much an outsider) that things are moving away from S3TC towards something like ETC2 or ASTC; perhaps it would be better to use one of those?
 
As for audio, do you compress the OGG/MP3 (or, trying to look towards the future again, Opus), or are you just talking about the compression being the WAV->OGG/MP3 encoding process?  If you do compress the OGG/MP3, is that just for the installer or also for your installed data?
 
3D model data seems like it would be a very good idea, but I don't understand the "processed through a pipeline, resulting in fitting data for the game engine" part.  Does each game engine have its own format, and that is what you compress?
 
Finally, for all this stuff, is it all lumped together in some sort of container file which is then all compressed/decompressed at once, or is each file compressed independently?


Just a quick note -- when you're talking about what kind of files are used in games, you're really talking about two sets of files -- the ones used during development to iterate on, and the ones delivered in the published product, which are derived from the development files. You wouldn't normally have cause to compress the development files, but they might be interesting nonetheless -- game assets are huge and getting huger, which ought to be clear from the fact that some large games weigh in at 70GB or more as they sit on store shelves. But even if development files are not interesting, at the very least you'll want people to keep straight which type of files (development or production) they're speaking about in this thread.

 

You'll also find it common that many games rely heavily on bespoke file formats that might vary from platform to platform, even for a single game. Game developers like to be able to stream on-disk content straight into memory sometimes, so that the file contents don't need to be parsed. This means that platform-specific compatibility considerations such as padding and alignment (or other, performance-impacting considerations) get built into the on-disk formats for each given platform. Sometimes this 'raw' form of the platform-specific data is compressed -- though I think this will almost universally be LZW, because at least one of the current platforms has hardware that can essentially DMA a file into memory and do the decode in transit -- I believe both current consoles do this for JPEG as well.
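To make the "stream straight into memory" idea concrete, here is a hypothetical sketch; the struct layout and field names are invented for illustration and are not any real engine's format.

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical on-disk record designed to be loaded without parsing.
// Padding/alignment decisions are baked in when the file is built.
#pragma pack(push, 1)
struct MeshHeader {
    uint32_t magic;          // format/version identifier
    uint32_t vertex_count;
    uint32_t index_count;
    uint32_t vertex_offset;  // byte offset of vertex data within the blob
    uint32_t index_offset;   // byte offset of index data within the blob
};
#pragma pack(pop)

std::vector<uint8_t> load_blob(const char* path) {
    std::vector<uint8_t> buf;
    if (std::FILE* f = std::fopen(path, "rb")) {
        std::fseek(f, 0, SEEK_END);
        buf.resize(static_cast<size_t>(std::ftell(f)));
        std::fseek(f, 0, SEEK_SET);
        std::fread(buf.data(), 1, buf.size(), f);
        std::fclose(f);
    }
    return buf;
}

// "Parsing" is then just pointer arithmetic over the loaded block:
//   auto* hdr   = reinterpret_cast<const MeshHeader*>(blob.data());
//   auto* verts = blob.data() + hdr->vertex_offset;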


Good point—in this case I'm definitely more interested in the published files.  The development files could be an interesting data set, but the published files are important to a lot more people.  I'm not trying to create a corpus with every different type of data available; I'm trying to create something which will be relevant for the vast majority of people.

 

As for memory-mapped files, I think the only reasonable thing to do is ignore the issue.  Either such cases cannot benefit from compression or the algorithm is constrained by the hardware, so there isn't really anything we can do here but focus on the use cases which could be served by compression.

 

The fact that LZW is used for that is interesting—thanks for that.  I've been generally uninterested in adding LZW support to Squash; I'll have to rethink that.  Any chance you could share the names of some of the platforms which can decode LZW like that?

I was indeed talking about compressing the WAVs to OGG/MP3 (after they're "ready" for the game).

The asset pipeline basically has the following (and more) purposes:
- strip the data down to only the data your game/engine needs
- optimize the data structure, to decrease loading times and improve in-game performance

An example setup may be:
1. An artist creates a 3D model and stores it in network folder X
2. A batch job processes the stored asset in a standard modelling program format (like max); the processed asset is stored in the engine/game-specific (optimized) format
3. The engine's asset preview tool or a build of the game uses the converted asset

Another option could be that you create/use a custom modelling tool export plugin (which in my case is out of my league on knowledge level :))

Note: the same or a similar process can be in place for audio and/or texture data/assets.


The fact that LZW is used for that is interesting—thanks for that.  I've been generally uninterested in adding LZW support to Squash; I'll have to rethink that.  Any chance you could share the names of some of the platforms which can decode LZW like that?

 

Yes, I had to check whether it's been revealed on a non-NDA basis first. It's the Xbox One. This public PowerPoint from Microsoft/AMD reveals that the asynchronous 'Move engines' (a fancier DMA unit, basically) can both compress and decompress LZ data -- I'm not sure how inclusive of all the different LZ compression formats that is, though. It can also decompress JPEG, swizzle textures, and perform memset-type operations. I believe that the PS4 does at least JPEG, and might do LZ as well, but I'm not certain.

 

I don't know if you want to ignore the memory-mapped files -- they are platform-specific, but in practice the two platforms (Xbox One and PS4) are similar enough that there might be common ground. But I'm speculating; I don't know, and probably couldn't say if I did.

Edited by Ravyne


I was indeed talking about compressing the WAVs to OGG/MP3 (after they're "ready" for the game).

 

WAV->MP3/OGG/Opus is outside the scope of what I'm working on (which is lossless, general-purpose compression).  Honestly, I think lossy audio compression is more interesting, but it's not what the corpus is for.  I've filed an issue about 3D models, and I'll start trying to find some information about the formats different engines use—hopefully I can add some of the output from one of them to the corpus.  My current idea is to try to get a 3D model from one of the Blender Foundation's open movie projects, then figure out how to get that exported as an asset from Unity or something.

 

Yes, I had to check whether it's been revealed on a non-NDA basis first. It's the Xbox One. This public PowerPoint from Microsoft/AMD reveals that the asynchronous 'Move engines' (a fancier DMA unit, basically) can both compress and decompress LZ data -- I'm not sure how inclusive of all the different LZ compression formats that is, though. It can also decompress JPEG, swizzle textures, and perform memset-type operations. I believe that the PS4 does at least JPEG, and might do LZ as well, but I'm not certain.

 

I suspected an NDA was why you didn't just come out and say it in your initial post; thanks for taking the time to research this.  I have a feeling a lot of people in the compression community weren't aware; maybe we can get some better encoders.

 

According to http://www.redgamingtech.com/xbox-one-sdk-leak-part-3-move-engines-memory-bandwidth-performance-tech-tribunal/ the LZ is LZ77.  It also seems to imply that it is either DEFLATE, zlib, or gzip (though I'm not sure how reliable that article is…) which would make sense.  If that's true, I would definitely be looking at zopfli for things you only have to compress when building a release but decompress regularly.
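As a rough sketch of what that would look like at build time, here is zopfli producing a zlib-framed DEFLATE stream; whether the console hardware accepts exactly this framing is an assumption based on the article above.

#include <cstddef>
#include <cstdlib>
#include "zopfli/zopfli.h"

// Compress an asset once at build time into a DEFLATE/zlib-compatible
// stream; spend extra CPU now for a smaller stream to decode later.
void compress_asset(const unsigned char* data, size_t size,
                    unsigned char** out, size_t* out_size) {
    ZopfliOptions options;
    ZopfliInitOptions(&options);
    options.numiterations = 15;  // more iterations: smaller output, slower build
    *out = nullptr;
    *out_size = 0;
    ZopfliCompress(&options, ZOPFLI_FORMAT_ZLIB, data, size, out, out_size);
    // Caller frees *out with free().
}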

 

I don't know if you want to ignore the memory-mapped files -- they are platform-specific, but in practice the two platforms (Xbox One and PS4) are similar enough that there might be common ground. But I'm speculating; I don't know, and probably couldn't say if I did.

 

Right, but I don't really think much can be done about the issue.  Using that feature will obviously restrict people to the supported codecs, and people will probably be more interested in writing improved implementations of the relevant codecs, but AFAICT, from the perspective of developing a corpus, nothing changes; you are still using the same type of data.



I would like to make sure it includes data relevant for game developers.

The problem is that games use practically everything there is to store.

 

You've got the basics of audio, video, and still images, and each of these may already have its own compression, including MP4 or H.264 or VP9 or Bink Video for video, or WAV or MP3 or OGG or something else for audio, or JPG or PNG or DXT or S3 or PVR or something else for stills. You've got saves that can be row/column-style data and are sometimes implemented as an actual SQL data file from SQLite or similar; other times it's hash tables or other data dictionaries. You've got plain text files, you've potentially got encrypted archives, you've potentially got content already compressed with zip or bz2 or 7z or other formats. And you've got the general-purpose struct dump.

 

 

So that leads me to the ever-important question of WHY you are doing this?

 

All you've mentioned is that you want "a general corpus" for "lossless, general purpose compression".  

 

The difficulty is that the industry really doesn't need that at the moment.

 

Like so many algorithms and systems out there, compression is a method of exchanging compute time for storage space.  Locally stored save files are usually small and the player does not want to wait a long time for them, so games and players are typically accepting of large space requirements.  Installation files traveling the Internet are often large, but they can be heavily compressed with algorithms like 7z that are extremely slow to compress but build a highly efficient data dictionary or Markov-chain-driven adaptive probability model for decompression -- trading an enormous compute effort up front in exchange for extremely tight compression to minimize what is transferred.

 

Similarly, specialized lossless encoding for audio, video, images, and databases already exist that tend to far exceed general purpose algorithms in their specialized domain.

 

That gets back to WHY.  What specific problem are you attempting to solve?

 

The world does not really need "yet another general purpose compression algorithm". Collectively we've reached the point where even the advancements of experts in the field struggle to reach even a fraction of a percent improvement in storage.

 

What are you trying to solve with this? Your GitHub Tracker makes it look like you are trying to be everything to everybody, from databases to pre-compressed textures, from plain text and floating point arrays to pre-compressed office documents.  It feels like you are in search of a problem to solve.

When profiling compression algorithms, you can be bottlenecked by IO speed.
Do you load the entire input data into RAM first, and then run the algorithm? Or do you run the algorithm directly from/to disk?

I'm going off topic from your search for a corpus... but one of the issues gamedevs face is slow IO speeds causing bottlenecks that otherwise wouldn't appear. DVD and BluRay are ridiculously slow compared to HDD, which is slow compared to SSD...

Also, gamedevs generally don't care about compression costs (as we do this once, before shipping the game), so it's all about the decompression algorithm - and we're generally decompressing directly into RAM so we can use the data immediately.

On a console game (with traditional loading screens - not in-game streaming), you're doing most of your loading from optical disk, so you generally end up using a compression algorithm with very high ratios, because you're so heavily IO-bound that you can afford to waste a lot of CPU cycles on decompression. That is to say, beyond a certain point, a faster decompression routine offers no benefit, because once it's faster than IO, making it any faster than that does nothing to reduce your critical path.
The optical disk read speed is identical from customer to customer, so you can tweak this algorithm to balance IO vs CPU time very carefully.

On top of this, you can generally rely on a small amount of either HDD or flash, which may have a variable read speed from SKU to SKU. Some customers might have a slow HDD, others might have fast flash... so this is less predictable. In either case though, it's going to be way faster than optical. You might be able to choose a percentage of your data to "install" to the user's HDD/flash -- and you want to be smart when making that choice. Certain files will gain the most benefit from this -- possibly ones that you want to stream with a lower CPU impact?

Games that do continuous streaming (instead of traditional loading screens) throw a spanner in the works -- this is where you want a low-CPU-usage algorithm, or something that maps well to the hardware -- e.g. parallelizable, or the new HW LZ77 decoder, etc. :)

On prev-gen consoles, available memory is EXTREMELY tight, so you also want an algorithm with low memory usage.
A file load operation probably looks something like this (rough code sketch after the list):
1) Load the decompressed file size,
2) have the user allocate that much memory,
3) stream in a SMALL chunk of compressed memory,
4) decompress that small chunk directly into the user's buffer (using small internal working buffers),
5) go to (3) if not finished.
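Something like the following rough sketch, with zlib's inflate standing in for whatever codec is actually used; the size-prefix layout and chunk size are made up for illustration.

#include <cstdint>
#include <cstdio>
#include <vector>
#include <zlib.h>

// Chunked load-and-decompress loop as described above, with zlib's
// inflate standing in for the real codec. File layout (a size prefix
// followed by the compressed stream) is invented for illustration.
bool load_compressed(std::FILE* f, std::vector<uint8_t>& out) {
    uint64_t decompressed_size = 0;                        // (1) read decompressed size
    if (std::fread(&decompressed_size, sizeof decompressed_size, 1, f) != 1) return false;
    out.resize(decompressed_size);                         // (2) allocate destination

    z_stream strm{};
    if (inflateInit(&strm) != Z_OK) return false;
    strm.next_out  = out.data();
    strm.avail_out = static_cast<uInt>(out.size());

    uint8_t chunk[16 * 1024];                              // small working buffer
    int ret = Z_OK;
    while (ret != Z_STREAM_END) {
        strm.avail_in = static_cast<uInt>(std::fread(chunk, 1, sizeof chunk, f));  // (3) stream a chunk
        if (strm.avail_in == 0) break;                     // unexpected end of file
        strm.next_in = chunk;
        ret = inflate(&strm, Z_NO_FLUSH);                  // (4) decode into the user's buffer
        if (ret != Z_OK && ret != Z_STREAM_END) break;     // (5) loop until done
    }
    inflateEnd(&strm);
    return ret == Z_STREAM_END;
}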


So that leads me to the ever-important question of WHY you are doing this?

 

Because the existing corpora suck—they don't include most of the things people are actually compressing these days, which makes them of limited use for benchmarking.  I want a corpus that most people can use to get a decent idea of what codecs might be appropriate for their use case, and in order to do that I need a reasonably diverse corpus which is likely to contain some data similar to what they need to compress.

 

Take a look at the Squash Benchmark, which will probably be the first user of the new corpus—it's also probably the most exhaustive compression benchmark around (others have more codecs, but aren't run on nearly as many data sets or machines).  It currently contains data from three corpora (Silesia, Canterbury, and enwik8), plus some extra data from Snappy's tests because the standard corpora didn't cover it.  That's 28 data sets in all, and it's still missing very common things like log files, RPC requests/responses, ODF/OOXML files, and data from games like compressed textures and 3D models.  It has a lot of plain text, though almost all of it is in English, but how often do you need to compress a plain text file, relatively speaking (not JSON, XML, or anything like that… things like the collected works of Charles Dickens)?

 

The difficulty is that the industry really doesn't need that at the moment.

 

Like so many algorithms and systems out there, compression is a method of exchanging compute time for storage space.  Locally stored save files are usually small and the player does not want to wait a long time for them, so games and players are typically accepting of large space requirements.  Installation files traveling the Internet are often large, but they can be heavily compressed with algorithms like 7z that are extremely slow to compress but build a highly efficient data dictionary or Markov-chain-driven adaptive probability model for decompression -- trading an enormous compute effort up front in exchange for extremely tight compression to minimize what is transferred.

 

Similarly, specialized lossless encoding for audio, video, images, and databases already exist that tend to far exceed general purpose algorithms in their specialized domain.

 

That gets back to WHY.  What specific problem are you attempting to solve?

 

The world does not really need "yet another general purpose compression algorithm". Collectively we've reached the point where even the advancements of experts in the field struggle to reach even a fraction of a percent improvement in storage.

 

What are you trying to solve with this? Your GitHub Tracker makes it look like you are trying to be everything to everybody, from databases to pre-compressed textures, from plain text and floating point arrays to pre-compressed office documents.  It feels like you are in search of a problem to solve.

 

It seems like you're assuming here that I'm trying to create a new compression codec; I'm not.  That said, I disagree with a lot of the stuff here, so…

 

First off, there are definitely trade-offs between compression speed, decompression speed, ratio, and memory usage, but it's not like there is some equation where you plug in the compression ratio and decompression speed and get back the compression speed.  The type of data you're trying to compress plays a huge role in what codec to choose, as does the architecture.  Going back to the Squash Benchmark, look at some different pieces of data.  The performance varies wildly based on the type of data.  Check out density's speed for enwik8 on x86_64, then look at what happens when you feed it binary data like x-ray, or already-compressed data like fireworks.jpeg.

 

You mention 7zip, which is basically LZMA.  Compare that to LZHAM—similar ratios and compression speeds, but LZHAM has much better decoding speed.  LZHAM was actually designed for games, so they could achieve LZMA-like compression but still be fast enough to decompress without too big of a performance hit.  Brotli is in much the same situation as LZHAM, though performance is a bit more variable.  Obviously there is room for improvement over LZMA in terms of speed.  Now look at ZPAQ and bsc—they both dominate LZMA in terms of ratio, and bsc isn't prohibitively slow.  Again, plenty of room for improvement over LZMA.

 

Sticking with games, take a look at RAD Game Tools' Oodle.  They have plenty of customers, so obviously there is a demand for good compression codecs from game studios.

 

Your GitHub Tracker makes it look like you are trying to be everything to everybody, from databases to pre-compressed textures, from plain text and floating point arrays to pre-compressed office documents.

 

It kind of depends on your definition of "everybody".  You might want to take a look at the README in that project—basically I'm trying to get something that 99% of people will find useful, at least as a starting point.  I'm ignoring plenty of use cases, like DNA databases, geological data, oil exploration, etc.  These are actually big users of compression who pay for a lot of the development, but like I mentioned, my goal isn't to create a new codec; it's to help most people decide on one.

 

It feels like you are in search of a problem to solve.

 

That would be nice—then I could just abandon the project without remorse.  Unfortunately it's not the case :(


When profiling compression algorithms, you can be bottlenecked by IO speed.
Do you load the entire input data into RAM first, and then run the algorithm? Or do you run the algorithm directly from/to disk?

 

I run everything directly from disk (using memory-mapped files), but I use CPU time, not wall clock, so I/O shouldn't really be a big factor.  Loading everything into memory isn't really feasible since several of the machines don't have enough memory to do it—even when using mmap for the input and output, several of them end up thrashing pretty badly (which kills wall-clock time but doesn't significantly alter CPU time).  I think trying to get I/O out of the equation is really the only sane thing to do, since the costs vary so much.  Once you get to that point you should really be doing your own benchmarks (which, BTW, Squash makes very easy).
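For illustration, a minimal sketch of that kind of measurement (POSIX mmap plus CPU time via clock()); this is not the actual Squash benchmark code, and compress_fn is a placeholder for whatever codec is being timed.

#include <cstddef>
#include <ctime>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Memory-map the input and measure the codec with CPU time rather than
// wall-clock time, so disk speed (mostly) drops out of the measurement.
double cpu_seconds_to_compress(const char* path,
                               void (*compress_fn)(const void* data, size_t size)) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1.0;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1.0; }
    void* data = mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { close(fd); return -1.0; }

    std::clock_t start = std::clock();   // process CPU time, not wall clock
    compress_fn(data, static_cast<size_t>(st.st_size));
    std::clock_t end = std::clock();

    munmap(data, static_cast<size_t>(st.st_size));
    close(fd);
    return double(end - start) / CLOCKS_PER_SEC;
}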

 

I'm going off topic from your search for a corpus... but one of the issues gamedevs face is slow IO speeds causing bottlenecks that otherwise wouldn't appear. DVD and BluRay are ridiculously slow compared to HDD, which is slow compared to SSD...


Also, gamedevs generally don't care about compression costs (as we do this once, before shipping the game), so it's all about the decompression algorithm - and we're generally decompressing directly into RAM so we can use the data immediately.

On a console game (with traditional loading screens - not in-game streaming), you're doing most of your loading from optical disk, so you generally end up using a compression algorithm with very high ratios, because you're so heavily IO-bound that you can afford to waste a lot of CPU cycles on decompression. That is to say, beyond a certain point, a faster decompression routine offers no benefit, because once it's faster than IO, making it any faster than that does nothing to reduce your critical path.
The optical disk read speed is identical from customer to customer, so you can tweak this algorithm to balance IO vs CPU time very carefully.

 

Yep, you might want to take a look at LZHAM and Brotli—they are both designed to be very asymmetric, with much faster decompression than compression.  Brotli is more of a work in progress right now, but in time it could be pretty nice.


It may be challenging to make a corpus that is very suitable for games in general.

 

The motivation for using compression in games is first and foremost to preserve GPU memory (for this, fixed-rate encodings that are directly supported in hardware and allow for trivial random access are used almost exclusively, although there have been experimental implementations of other techniques), and to reduce load times. Special encoders exist that use slightly less than optimal color fidelity (not really noticeable) but create output which is much better suited for lossless general-purpose compressors. Not all encoders work that way, though, and in general compressed textures don't necessarily re-compress all that well.
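As a concrete illustration of the "trivial random access" point: with a fixed-rate format such as BC1/DXT1 (8 bytes per 4x4 block), finding any texel's block is simple arithmetic. This assumes a plain linear block layout; real GPU layouts are often swizzled or tiled per platform.

#include <cstddef>

// Byte offset of the BC1/DXT1 block containing texel (x, y), assuming a
// linear (non-swizzled) block layout. 8 bytes per 4x4 block, fixed rate.
size_t bc1_block_offset(size_t x, size_t y, size_t width_in_texels) {
    const size_t blocks_per_row = (width_in_texels + 3) / 4;
    const size_t block_index    = (y / 4) * blocks_per_row + (x / 4);
    return block_index * 8;
}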

 

Yes, there may be other motives, such as a smaller overall memory footprint or reducing PCIe volume, but they don't apply that generally (even consoles have reasonable amounts of memory nowadays, on PC it's pretty abundant, and PCIe bandwidth is higher than what you can compress/decompress). I leave audio data out of consideration, too, since you can decompress OGG/MP3 just fine on the fly if your audio API doesn't already accept those formats natively, and OGG/MP3 don't really compress that well with another lossless compressor pass anyway.

 

The thing about reducing load times is that you have to consider two effects: there's the time it takes to load data from disk, and there's the time it takes to decompress. The time it takes to load from disk depends, for large data, on how well the compression works (in terms of minimizing size). The time it takes to decompress depends on how fast the decompressor runs. For small data, seek times may very well be the dominating factor (game developers therefore try hard not to load small things individually).

 

So, to be useful, a compression algorithm must not only offer good compression, it must also offer a good trade-off between compressed size and decompression speed (compression speed is largely uninteresting; it usually happens just once, offline).

 

If you load 10MB and compression can save 5MB (= 50%) on disk but the decompression runs at only 30MB/s (so it takes 0.33 seconds for 10MB), it is better not to compress at all and just read in an extra 5MB at 100MB/s on a conventional hard disk or at 400MB/s on an SSD (which takes 0.1 or 0.025 seconds altogether, respectively). As a bonus, if you don't compress at all, you could just memory-map the file and read everything from the mapping.

On the other hand, if decompression runs at roughly 1.5GB/s (like lz4 or such) but only saves, say, 20% in size, then it is still very much worth doing. It's less data to read, and decompression runs much faster than the disk can deliver.
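The same break-even arithmetic as a small helper, assuming the read and the decompression are not overlapped (the numbers in the comments match the example above):

// Break-even arithmetic from the example above, assuming the read and
// the decompression are not overlapped:
//   total = compressed_bytes / disk_speed + uncompressed_bytes / decode_speed
double load_seconds(double uncompressed_mb, double fraction_saved,
                    double disk_mb_per_s, double decode_mb_per_s) {
    const double compressed_mb = uncompressed_mb * (1.0 - fraction_saved);
    return compressed_mb / disk_mb_per_s + uncompressed_mb / decode_mb_per_s;
}
// load_seconds(10.0, 0.5, 100.0, 30.0)   ~= 0.38s vs. 0.10s for reading 10MB raw
// load_seconds(10.0, 0.2, 100.0, 1500.0) ~= 0.09s, a win over 0.10s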

Edited by samoth



I was hoping some of the developers around here could tell me what kind of data they usually use compression for, as well as how they use it

 

Back in the day, a 24-bit TGA was converted to a palettized 256-color bitmap, header info was stripped down to width and height, and then it was RLE compressed into the game's resource (WAD) file. The read function from the resource file API read from disk and decompressed into memory.

WAV files were mixed down from CD-quality stereo to 8-bit mono.

LZH (LHarc) was used to create a self-extracting archive of the demo for downloading.

Nowadays, with hard drive space relatively plentiful, faster load times would be the primary motivator for me to use compression.

The only "compression" I use is binary vs. text .x files in final release mode, and any compression used by Inno Setup. Textures are 24-bit or 32-bit (with alpha), uncompressed. WAVs are CD quality. I can get away with this due to heavy reuse of assets, and thus lower-than-usual RAM requirements, which means I can load it all at game start, with no load screens or streaming required.

I develop only for the PC, so I don't have the issues one sometimes has in console development. So for me, faster downloads and faster disk reads are the only reasons to compress. And all my users have HDDs or flash drives, so faster I/O via compression is unlikely (?).

In general, in games, compression is seen as a sometimes necessary evil. It saves space, but takes time, and anything that takes time is bad, especially anything that takes the user's time. Only under weird circumstances, like "read and decompress" being faster than "read uncompressed", would it not be considered a necessary evil; in that case it would instead be considered a clever optimization.

The place we really need compression is across the bus between RAM and cache! Real-time, hardware-accelerated, zero-overhead, lossless compression. One can at least dream of such a thing...

