How to avoid slow loading problems in games

34 comments, last by cr88192 11 years, 3 months ago

While resource loading is probably the biggest hit, another viable thing to consider is in-place loading for code "resources".

i.e.:

http://entland.homelinux.com/blog/2007/02/21/fast-file-loading-ii-load-in-place/


while the basic strategy works, there are a few potential drawbacks:
data-portability: done poorly, one will end up with files specific to a particular target architecture (such as 32-bit or 64-bit x86), and, depending, a person still needs to do things like address fixups (see the sketch below);
adjustments to address the above and improve portability, such as defined widths and endianness for values, or using fixed base-relative addresses, may risk hurting overall performance (if done poorly);
unless the data is also deflated or similar, disk I/O is still likely the notable bottleneck.
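
for example, a minimal sketch of the fixup step, assuming a hypothetical format which stores pointers as 32-bit image-relative offsets (not any particular real format):

#include <stdint.h>

/* hypothetical on-disk entry: pointers stored as image-relative offsets */
typedef struct {
    uint32_t name_ofs;   /* offset of the name string within the image */
    uint32_t data_ofs;   /* offset of the payload within the image */
    uint32_t data_size;
} RawEntry;

/* the fixed-up, usable form */
typedef struct {
    char    *name;
    void    *data;
    uint32_t data_size;
} Entry;

/* after fread()ing the whole image into 'base', turn the base-relative
 * offsets into real pointers (this is the "address fixup" step; a
 * byteswap would also go here if the file's endianness differs) */
static void fixup_entry(uint8_t *base, const RawEntry *raw, Entry *out)
{
    out->name      = (char *)(base + raw->name_ofs);
    out->data      = base + raw->data_ofs;
    out->data_size = raw->data_size;
}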


the solution then, in the design of a binary format, is typically to make a compromise:
the basic format may be loaded as an "image", maybe with the contents deflated;
unpacking the data into a usable form is made fairly trivial.

(while Deflate's compression process can be slow, decompression can be fairly fast).


loading individual contents may then look like:
fetch the relevant data-lump from the image;
inflate the data-lump if needed;
decode the data with a lightweight "unpacking" process, e.g. building index structures and (potentially, if needed) doing an endianness swap of any values (which can usually be skipped).

so, index-structures would contain any pointers or similar, and the actual data will consist mostly of simple arrays and tables.
(say, we have some tables, and a few structs with pointers into these tables).
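
a rough sketch of the fetch/inflate/unpack step (the mesh-lump layout here is made up for illustration; assumes zlib-wrapped deflate lumps and zlib's uncompress()):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>   /* for uncompress(); assumes zlib-wrapped deflate lumps */

/* hypothetical index structure: pointers into the unpacked data */
typedef struct {
    uint32_t  num_verts;
    float    *verts;      /* points into the inflated blob, no extra copies */
    uint32_t  num_tris;
    uint16_t *tris;
} MeshIndex;

/* fetch+inflate+unpack for a made-up mesh lump laid out as:
 * [u32 num_verts][u32 num_tris][xyz floats...][u16 indices...] */
static int unpack_mesh(const uint8_t *lump, uint32_t packed_size,
                       uint32_t raw_size, MeshIndex *mi, uint8_t **blob_out)
{
    uLongf dst_len = raw_size;
    uint8_t *blob = malloc(raw_size);
    if (!blob)
        return -1;
    if (uncompress(blob, &dst_len, lump, packed_size) != Z_OK) {
        free(blob);
        return -1;
    }

    /* the lightweight "unpacking": just build the index structure
     * (an endianness swap of the counts would go here if needed) */
    memcpy(&mi->num_verts, blob + 0, 4);
    memcpy(&mi->num_tris,  blob + 4, 4);
    mi->verts = (float *)(blob + 8);
    mi->tris  = (uint16_t *)(blob + 8 + mi->num_verts * 12);

    *blob_out = blob;   /* caller frees when the mesh is discarded */
    return 0;
}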

for example, in my case, for a few things (mostly VM related) I am using a format originally based fairly closely on the Quake WAD2 format (which I call "ExWAD"), except that it adds support for deflate (and slightly changes the main header, mostly with a larger magic value). a subsequent version diverged further from WAD2, mostly by expanding the name field to 32 bytes (with a 64-byte directory entry) and adding support for directly representing directory trees (more like in a filesystem).
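
(for illustration, a 64-byte directory entry with a 32-byte name could look something like the struct below; the specific fields here are guesses, not the actual ExWAD layout:)

#include <stdint.h>

/* illustrative only; the real ExWAD entry layout may differ */
typedef struct {
    char     name[32];     /* lump/directory name, NUL-padded */
    uint32_t offset;       /* where the lump data lives in the file */
    uint32_t packed_size;  /* size as stored (possibly deflated) */
    uint32_t raw_size;     /* size after inflating */
    uint8_t  type;         /* e.g. plain lump vs. directory node */
    uint8_t  compression;  /* e.g. 0 = stored, 1 = deflate */
    uint8_t  pad[18];      /* reserved; pads the entry to 64 bytes */
} ExWadDirEntry;           /* 32+12+2+18 = 64 bytes */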

I can see how DXT encoding is nice for textures you want to use for a longer time, but is it really needed to do all that encoding on the CPU for a video when you display each frame only once and then throw it away? Also, why not skip the YUV->RGB step on the CPU and do it in a shader instead, as there is probably not much else going on on the GPU when just watching a video?

wintertime, on 08 Feb 2013 - 07:49, said:
I can see how DXT encoding is nice for textures you want to use for a longer time, but is it really needed to do all that encoding on the CPU for a video when you display each frame only once and then throw it away?

Video is not usually DXT-compressed, but encoded with other methods (which not only compress a single frame but also use deltas between frames). It is of course possible to use DXT to compress individual video frames, and since that is a simple solution it might just work for some cases.

As to why one would want to do such a silly amount of work for a frame that one only watches for a split second, the answer is that the amount of data you need to stream in for video is forbidding if you ever want to do anything else at the same time. It may be forbidding overall too, depending on resolution and frame rate.

Let's say you want to play a 480x270 video (1/16 the area of 1080p) at 25fps. Assuming 32bpp, that's about 12.4MiB/s of data. That may not sound like much, but try and find a harddisk that consistently delivers at such a rate while also seeking around for everything else (my SSD has no trouble doing that, but sadly not every consumer has a SSD just yet). And now imagine you want a somewhat larger video: every doubling in each dimension quadruples the rate, so full 1080p is already close to 200MiB/s, and doubling that once more exceeds the theoretical maximum of SATA-600.

Compression will reduce this immense bandwidth pressure by a factor of maybe 500 or 1000, which is just what you need.
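
As a quick back-of-the-envelope check of these numbers:

#include <stdio.h>

int main(void)
{
    /* bytes/sec = width * height * bytes-per-pixel * fps */
    double base = 480.0 * 270 * 4 * 25;      /* the small video   */
    double hd   = 1920.0 * 1080 * 4 * 25;    /* full 1080p        */
    printf("480x270: %6.1f MiB/s uncompressed\n", base / (1024 * 1024));
    printf("1080p:   %6.1f MiB/s uncompressed\n", hd / (1024 * 1024));
    printf("1080p at 500:1 compression: %.1f KiB/s\n", hd / 500 / 1024);
    return 0;
}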

About doing YUV->RGB on the GPU, there is no objection to that.
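
For reference, the per-pixel conversion (full-range BT.601, the JPEG-style variant) is only a small matrix multiply, whether done on the CPU or in a shader; an integer sketch:

#include <stdint.h>

static uint8_t clamp_u8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

/* full-range (JPEG-style) BT.601 YCbCr -> RGB, fixed-point with >>8 */
void yuv_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                uint8_t *r, uint8_t *g, uint8_t *b)
{
    int d = cb - 128, e = cr - 128;
    *r = clamp_u8(y + ((359 * e) >> 8));           /* + 1.402 * Cr          */
    *g = clamp_u8(y - ((88 * d + 183 * e) >> 8));  /* - 0.344*Cb - 0.714*Cr */
    *b = clamp_u8(y + ((454 * d) >> 8));           /* + 1.772 * Cb          */
}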

I was already assuming the video is compressed on the hdd as MPEG/MJPEG/whatever. What seemed weird to me was that, after decompressing it, people wanted to use additional CPU time to recompress it as DXT just to speed up the way from main memory to GPU memory, which is much faster than the hdd anyway.

samoth, on 08 Feb 2013 - 09:22, said:

wintertime, on 08 Feb 2013 - 07:49, said:
I can see how DXT encoding is nice for textures you want to use for a longer time, but is it really needed to do all that encoding on the CPU for a video when you display each frame only once and then throw it away?

Video is not usually DXT-compressed, but encoded with other methods (which not only compress a single frame but also use deltas between frames). [...]

About doing YUV->RGB on the GPU, there is no objection to that.

my video is typically 256x256 @10fps.
512x512 @10fps is also possible, but not currently used.

the problem with NPOT resolutions (480x270, for example) is that the GPU doesn't like them (and resampling is expensive).
generally, since these are going into textures and mapped onto geometry, we want power-of-two sizes.

the downside of YUV->RGB in shaders is mostly that it would require dedicated shaders, vs being able to use the videos in roughly all the same ways as a normal texture (as input to various other shaders, ...). this would probably be fine for cutscenes or similar though (ironically, not really using video for cutscenes at present...).


otherwise, video-mapping is a similar idea to id Software's RoQ videos.
however, RoQ was using a different compression strategy (Vector Quantization).
for example, Doom3 used RoQ for lots of little things in-game (decorative in-game video-displays, fire effects, ...).

theoretically, a DXT-based VQ codec could be derived (and could potentially decode very quickly, since you could unpack directly to DXT). I had started looking into this, but didn't like my initial design results (too complicated for what it was doing).


as for MJPEG:
I was mostly using code I already had at the time (*1), and the format has a few advantages for animated-texture use, namely that it is possible to decode frames in arbitrary order and to easily skip frames (also, it is not legally encumbered). additionally, for animated textures, raw compression rate is less important, and there is often less opportunity for effective use of motion compensation.

the downside is mostly the narrow time-window to decode frames during rendering (at least, with the current single-threaded design).


*1: originally for texture loading, and also I had some AVI handling code around (years earlier, I had written some more-generic video-playback stuff). (and, the JPEG loader was originally written due to frustration with libjpeg).

also, it is possible to view the basic animated textures in normal video players (like Media Player Classic), but granted, this is only a minor detail.

admittedly, when I first tried using video mapping (on a Radeon 9000), it was too slow to be worthwhile. some years later, hardware is faster, so now it works basically ok.

wintertime, on 08 Feb 2013 - 09:41, said:
I was already assuming the video is compressed on the hdd as MPEG/MJPEG/whatever. What seemed weird to me was that, after decompressing it, people wanted to use additional CPU time to recompress it as DXT just to speed up the way from main memory to GPU memory, which is much faster than the hdd anyway.

well, yes, there is little to say that there would actually be benefit in doing so.
the speed of my "DXT1F" encoder doesn't seem to be "that" bad though, so there is at least a small chance it could derive some benefit.

as noted, it mostly depends on the ability to shim it into the JPEG decoding process, so it isn't so much "encoding DXTn" as "skipping fully decoding to RGB" (using only about 1/4 as much arithmetic as the full conversion).

theoretically, this route "could" actually outperform the use of uncompressed RGBA for the final steps of decoding the JPEG images. (and, otherwise, you still need to put the DCT block-planes into raster order one way or another...).

as for time-usage, if we are rendering at 30fps but the video framerate is 10fps, then each video-frame will be on-screen for roughly 3 rendered frames (or 4-5 if rendering at 40-50fps).


(decided to leave out some stuff)

For those interested in fast CPU side texture compression, say for streaming JPG->DXT etc., check out the code and article here: http://software.intel.com/en-us/vcsource/samples/dxt-compression

Some searching should also help find a host of other approaches. I'd not recommend this in general for loading textures, as the better quality/performance ratios likely come from simply loading packed (Zip etc.) pre-compressed textures, but if you have huge amounts of detailed textures you need to stream, this is a good approach.


felt curious, tested my own "DXT1F" encoder (with MSVC):

I am getting 112Mp/s compiled with optimizations turned on ("/O2"), and 42Mp/s in a debug build ("/Z7").

granted, this is single-threaded scalar code.
and it also assumes only 2 colors.


the version written to shim into the JPEG decoder (by itself) is actually pulling off 314Mp/s (180Mp/s debug).
still single-threaded scalar code.

the main difference is that it works on planar YUV input, assumes 4:2:0, and requires less math than for RGBA (only the min and max Y are needed; UV is a simple average).
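
roughly, the per-block idea is something like the sketch below (just an illustration of the approach, not the actual DXT1F code; it reuses the yuv_to_rgb() from the conversion snippet earlier in the thread):

#include <stdint.h>

/* yuv_to_rgb() as in the earlier BT.601 snippet */
void yuv_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                uint8_t *r, uint8_t *g, uint8_t *b);

static uint16_t pack565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* encode one 4x4 block: y[16] holds the block's luma, u/v the block's
 * averaged chroma (4:2:0 makes this cheap), out gets the 8-byte DXT1 block */
static void encode_dxt1_block(const uint8_t y[16], uint8_t u, uint8_t v,
                              uint8_t out[8])
{
    static const uint8_t map[4] = { 1, 3, 2, 0 }; /* lerp step -> DXT1 index */
    int i, ymin = 255, ymax = 0;
    uint8_t r, g, b;
    uint16_t c0, c1;
    uint32_t idx = 0;

    for (i = 0; i < 16; i++) {
        if (y[i] < ymin) ymin = y[i];
        if (y[i] > ymax) ymax = y[i];
    }

    /* only two YUV->RGB conversions per block (vs. 16 for a full decode);
     * with constant chroma, higher Y raises every channel, so c0 >= c1 */
    yuv_to_rgb((uint8_t)ymax, u, v, &r, &g, &b); c0 = pack565(r, g, b);
    yuv_to_rgb((uint8_t)ymin, u, v, &r, &g, &b); c1 = pack565(r, g, b);

    if (c0 != c1) {   /* c0 > c1 here, i.e. DXT1's 4-color mode */
        for (i = 0; i < 16; i++) {
            int t = (y[i] - ymin) * 3 / (ymax - ymin);
            idx |= (uint32_t)map[t] << (i * 2);
        }
    }
    /* else: flat block, idx = 0 selects color0 everywhere */

    out[0] = (uint8_t)(c0 & 255);          out[1] = (uint8_t)(c0 >> 8);
    out[2] = (uint8_t)(c1 & 255);          out[3] = (uint8_t)(c1 >> 8);
    out[4] = (uint8_t)(idx & 255);         out[5] = (uint8_t)((idx >> 8) & 255);
    out[6] = (uint8_t)((idx >> 16) & 255); out[7] = (uint8_t)((idx >> 24) & 255);
}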

(EDIT/ADD: after changing it to take on more of the work, basically working on raw IDCT block output, rather than precooked output, it is 170Mp/s optimized).

not entirely sure what a SIMD or multithreaded version would do here.

granted, this version would be used on the front-end of a JPEG decoder, which would make it slower.

(EDIT/ADD:
http://pastebin.com/emDK9jwc
http://pastebin.com/EyEY5W9P
)

CPU speed (my case) = 2.8 GHz.

or such...

wintertime, on 08 Feb 2013 - 09:41, said:
What seemed weird to me was that, after decompressing it, people wanted to use additional CPU time to recompress it as DXT just to speed up the way from main memory to GPU memory, which is much faster than the hdd anyway.

This may still be a valid reason.

Although you typically have about 8 GiB/s of bandwidth over PCIe, several hundred megabytes per second are still a non-negligible share. If you do nothing else, that's no problem, but if you possibly have other stuff to transfer, it may be.

Transfers also have a fixed overhead and cause a complete GPU stall on typical present-day consumer hardware (the driver will either let the GPU render, or it will do a PCIe transfer, not both at the same time), so transferring a frame at a time is not an efficient solution. Transferring many frames at a time results in much better parallelism; however, it takes forbidding amounts of GPU memory. DXT compression alleviates that.
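
To put rough numbers on it (assuming 512x512 frames for illustration; RGBA8 is 4 bytes/pixel, DXT5 is 16 bytes and DXT1 8 bytes per 4x4 block):

#include <stdio.h>

int main(void)
{
    int w = 512, h = 512, frames = 30;       /* ~3s of buffered 10fps video */
    int rgba = w * h * 4;                    /* 4 bytes per pixel           */
    int dxt5 = w * h;                        /* 1 byte per pixel            */
    int dxt1 = w * h / 2;                    /* half a byte per pixel       */
    printf("RGBA8: %4d KiB/frame, %6d KiB for %d buffered frames\n",
           rgba / 1024, rgba / 1024 * frames, frames);
    printf("DXT5:  %4d KiB/frame, %6d KiB\n", dxt5 / 1024, dxt5 / 1024 * frames);
    printf("DXT1:  %4d KiB/frame, %6d KiB\n", dxt1 / 1024, dxt1 / 1024 * frames);
    return 0;
}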


Transferring many frames at a time results in much better parallelism; however, it takes forbidding amounts of GPU memory. DXT compression alleviates that.


pretty much.

nothing here says this is actually efficient (traditional animated textures are still generally a better solution in most cases, though my engine supports both).


another motivation is something I had suspected already (partly confirmed in benchmarks):
the direct YUV to DXTn route is apparently slightly faster than going all the way to raw RGB.

still need more code though to confirm that everything is actually working, probably followed by more fine-tuning.

(EDIT/ADD: sadly, it turns out my JPEG decoder isn't quite as fast as I had thought I had remembered, oh well...).


(EDIT/ADD 2: above, as in, the current JPEG->DXT5 transcoding route pulls off only about 38Mp/s (optimized "/O2", ~20Mp/s debug), whereas previously I had thought I remembered things being faster. (granted, am getting tempted to use SIMD intrinsics for a few things...).

note that current RGBA video frames have both an RGB(YUV) image and an embedded Alpha image (mono, also encoded as a JPEG).
both layers are decoded/transcoded and recombined into a composite DXT5 image.


while looking around online, did run across this article though:
http://www.nvidia.com/object/real-time-ycocg-dxt-compression.html
nifty idea... but granted this would need special shaders...).
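
as an aside, the YCoCg transform itself is cheap; a sketch (the decode side is what would live in the special shader):

/* YCoCg <-> RGB, the transform behind the linked YCoCg-DXT scheme
 * (float version for clarity; Co/Cg are signed, usually stored biased) */
typedef struct { float r, g, b; } RGBf;
typedef struct { float y, co, cg; } YCoCgf;

static YCoCgf rgb_to_ycocg(RGBf c)
{
    YCoCgf o;
    o.y  =  0.25f * c.r + 0.5f * c.g + 0.25f * c.b;
    o.co =  0.5f  * c.r              - 0.5f  * c.b;
    o.cg = -0.25f * c.r + 0.5f * c.g - 0.25f * c.b;
    return o;
}

static RGBf ycocg_to_rgb(YCoCgf c)   /* shader-side inverse */
{
    RGBf o;
    o.r = c.y + c.co - c.cg;
    o.g = c.y        + c.cg;
    o.b = c.y - c.co - c.cg;
    return o;
}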
