Nicholas Kong

How to avoid slow loading problems in games

35 posts in this topic

cr88192, on 06 Feb 2013 - 03:30, said:

feeling like it, I went and wrote a "vaguely fast" DXT1 encoder (DXT1F, where the F means "Fast").

the goal here was mostly to convert reasonably quickly, with image quality not necessarily being as high a priority.
as-is, I don't know of any good ways to make it notably faster (I know a few possible ways, but they aren't pretty).

Nice work :)

If you still find yourself interested in DXT1F:

* The code looks like it could be ported over to use the SSE registers entirely, instead of the general-purpose integer ones, which might reduce the instruction count a lot.

* You could also make it so that the user can use multiple threads to perform the processing -- e.g. instead of (or as well as) having an API that encodes a whole image at once, you could add two extra parameters: the row to begin working from, and the row to end on (which should both be multiples of 4, or whatever the block size is). The user could then call that function multiple times with different start/end parameters on different threads to produce different rows of blocks concurrently (a rough sketch follows).
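
To illustrate, here is a minimal sketch of what such a row-range API could look like; EncodeDXT1Rows and the per-block helper are hypothetical names, not from the posted encoder:

/* Hypothetical per-block encoder, assumed to be provided elsewhere:
 * reads the 4x4 RGBA block at (x, y) and writes 8 bytes of DXT1. */
void EncodeDXT1Block(const unsigned char *rgba, int width,
                     int x, int y, unsigned char *out);

/* Encode only the block rows in [rowStart, rowEnd); both bounds must
 * be multiples of 4 (the DXT1 block height).  Each thread gets a
 * disjoint row range, so no synchronization is needed on the output. */
void EncodeDXT1Rows(const unsigned char *rgba, int width, int height,
                    unsigned char *blocks, int rowStart, int rowEnd)
{
    int x, y;
    (void)height; /* bounds are expressed via rowStart/rowEnd */
    for (y = rowStart; y < rowEnd; y += 4)
        for (x = 0; x < width; x += 4)
        {
            /* each 4x4 block becomes 8 bytes of DXT1 output */
            unsigned char *out =
                blocks + (((y / 4) * (width / 4)) + (x / 4)) * 8;
            EncodeDXT1Block(rgba, width, x, y, out);
        }
}

Two worker threads could then call EncodeDXT1Rows(img, w, h, out, 0, h/2) and EncodeDXT1Rows(img, w, h, out, h/2, h) (with h/2 rounded to a multiple of 4) without any shared writes.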

Edited by Hodgman

I am writing mostly from personal experience, which tends to be that HDDs aren't very fast.

More precisely, one should say:
Harddisks are fast. Fast enough for pretty much everything. SSDs are faster, in fact ridiculously fast. They're becoming more and more common (and thus cheaper, which is a positive feedback loop), and I'm hoping that in maybe 3-5 years from now you'll be able to assume an SSD as "standard" in every computer. Harddisk seeks are slow, but if you pay only a little attention (e.g. not ten thousand files), not so much that they're ever an issue.

DVDs are... oh well... about 1/3 to 1/5 the speed of a harddisk, but you normally need not worry. DVD seeks, on the other hand, are the devil.

Which boils down to something as simple as: Try and avoid seeks, and you're good. If you ship on DVD, avoid them as if your life depended on it.

A slow harddisk (some 3-4 year old 7200 rpm disk) realistically delivers upwards of 50MiB/s if you don't have too many seeks. That isn't so bad. With NCQ, you often don't even notice a few dozen seeks at all, and even without, the impact is very small if you have several concurrent reads.

A SSD delivers so fast that decompression can be the actual bottleneck. Seriously. You just don't care about its speed or about seek times.

Now consider a DVD. A typical, standard 16x drive will, in the best case, deliver something close to 20 MiB/s, so between 1/3 and 1/2 of what you get from a slow harddisk. In the worst case (sectors close to the middle), you'll have about half that.
Now throw in 30 seeks, which on the average take around 150ms each on a DVD drive. That's a full 4.5 seconds spent in seeking. Compared to these 4.5 seconds during which nothing at all happens (nothing useful from your point of view, anyway), pretty much every other consideration regarding speed becomes insignificant.

samoth, on 06 Feb 2013 - 09:17, said:

I am writing mostly from personal experience, which tends to be that HDDs aren't very fast.

More precisely, one should say:
Harddisks are fast. Fast enough for pretty much everything. SSDs are faster, in fact ridiculously fast. They're becoming more and more common (and thus cheaper, which is a positive feedback loop), and I'm hoping that in maybe 3-5 years from now you'll be able to assume an SSD as "standard" in every computer. Harddisk seeks are slow, but if you pay only a little attention (e.g. not ten thousand files), not so much that they're ever an issue.


seeks are basically why data gets packaged up: a person can then read a single larger package in place of a lot of little files (a minimal sketch of the idea is below).
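
Roughly, the single-big-read approach looks like this; the pack format itself is left abstract, and individual lumps would be sliced out of the returned buffer by whatever directory the package carries (load_package is a made-up name):

#include <stdio.h>
#include <stdlib.h>

/* Read an entire package file in one sequential pass (at most a seek
 * or two total), rather than opening hundreds of small files, each of
 * which can cost a seek or several. */
unsigned char *load_package(const char *path, long *size_out)
{
    FILE *f = fopen(path, "rb");
    unsigned char *buf = NULL;
    long size;
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);      /* find the file size */
    size = ftell(f);
    fseek(f, 0, SEEK_SET);
    buf = (unsigned char *)malloc((size_t)size);
    if (buf && fread(buf, 1, (size_t)size, f) != (size_t)size)
    {
        free(buf);              /* short read: fail cleanly */
        buf = NULL;
    }
    fclose(f);
    if (buf) *size_out = size;
    return buf;
}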

as for HDD speed, it depends:
I am currently using 5400RPM 1TB drives, which have a max I/O speed of roughly 100MB/s or so (though 50-70MB/s is typical for OS file I/O).

although, yes, 7200RPM drives are also common, they are more expensive for the size, and SSDs are still rather expensive (like $140 for a 128GB drive), vs. something like $89 for a 2TB 5400RPM drive (or around $109 for a similar-size 7200RPM drive).

Quote
DVDs are... oh well... about 1/3 to 1/5 the speed of a harddisk, but you normally need not worry. DVD seeks, on the other hand, are the devil.

Which boils down to something as simple as: Try and avoid seeks, and you're good. If you ship on DVD, avoid them as if your life depended on it.

A slow harddisk (some 3-4 year old 7200 rpm disk) realistically delivers upwards of 50MiB/s if you don't have too many seeks. That isn't so bad. With NCQ, you often don't even notice a few dozen seeks at all, and even without, the impact is very small if you have several concurrent reads.

4-year-old 5400RPM drives (3.5-inch internal), in my case...

things are slow enough that screen capture from games stalls when using uncompressed video, and so help you if you try to do much of anything involving the HDD while a virus scan is running...

better capture generally comes from using a simple codec, like MJPEG or the FFDShow codec, then transcoding into something better later.


I have a 10 year old laptop drive (60GB 4800RPM), which apparently pulls off a max read speed of around 12MB/s, and a newer laptop drive (320GB 5400RPM) which pulls off around 70MB/s (or around 30-40MB/s inside an OS).

Quote
A SSD delivers so fast that decompression can be the actual bottleneck. Seriously. You just don't care about its speed or about seek times.

Now consider a DVD. A typical, standard 16x drive will, in the best case, deliver something close to 20 MiB/s, so between 1/3 and 1/2 of what you get from a slow harddisk. In the worst case (sectors close to the middle), you'll have about half that.
Now throw in 30 seeks, which on the average take around 150ms each on a DVD drive. That's a full 4.5 seconds spent in seeking. Compared to these 4.5 seconds during which nothing at all happens (nothing useful from your point of view, anyway), pretty much every other consideration regarding speed becomes insignificant.

my experience is pulling in a few hundred MB of data over the course of a few seconds or so.
more time is generally spent reading the data than decompressing it.


like, with deflate:
I have previously measured decompression speeds of several GB/s or so (and in a few past tests, namely with readily compressible data, hitting the RAM speed limit), mostly because the actual "work" of inflating such data consists largely of memory copies and RLE-style flood fills.


implemented sanely, decoding a Huffman symbol is mostly a matter of something like:

v = hufftab[(win >> pos) & 32767];  /* peek up to 15 bits, map to a symbol */
pos += lentab[v];                   /* consume that symbol's code length   */
while (pos >= 8)                    /* refill the 32-bit window bytewise   */
    { pos -= 8; win = (win >> 8) | ((*cs++) << 24); }

(the refill needs to be a loop rather than a single "if": with codes up to 15 bits, the bit position can advance past two byte boundaries in one step.)

which isn't *that* expensive (and, more so, it is only a small part of the activity, except with poorly-compressible data).


( EDIT/ADD: more so, the poorly-compressible data edge-case is generally handled because the deflate encoder is like "hey, this data doesn't compress for crap" and falls back to "store" encoding, which amounts to essentially a straight memory copy in the decoder.

side note: modern CPUs also include dedicated silicon to optimize special memory-copying and flood-fill related instruction sequences, so these are often fairly fast. )


granted, I am using a custom inflater rather than zlib (it makes a few optimizations under the assumption of being able to compress/decompress an entire buffer at once, rather than working piecewise on a stream).

this doesn't necessarily apply to things like LZMA, which involves a much more costly encoding and IME has a hard time breaking 60-100MB/s decoding speeds; it is more specific to deflate (which can be pretty fast on current HW).


for things like PNG, most of the time goes into running the filters over the decompressed buffers, which has led to some amount of optimization (generally, dedicated functions which apply a single filter over a scanline with a fixed format). this is why the Paeth optimization trick became important: the few conditionals inside the predictor became a large bottleneck, because (unlike with deflate) some logic has to execute for every pixel (like adding the prior pixel, or a prediction based on the adjacent pixels).

JPEG is also something that would normally be expected to be dead-slow, but isn't actually all that bad (more so, as my codec also assumes the ability to work on a whole image at once), and most operations boil down to a "modest" amount of fixed-point arithmetic (like in the DCT: no cosines or floating point is actually involved, just a gob of fixed-point). sort of like with PNG, the main time-waster tends to be the final colorspace transformation (YUV -> RGB), but this can be helped along in a few ways (a fixed-point sketch follows the list):
writing logic specifically for configurations like 4:2:0 and 4:4:4, which allows more fixed-form logic;
generally transforming the image in a block-by-block manner (related to the above: often an 8x8 or 16x16 macroblock is broken down into 2x2 or 4x4 sub-blocks for the color conversion, partly to allow reusing math between nearby pixels);
using special logic to largely skip over costly checks (like individually range-clamping pixels, ..., which can be done per-block rather than per-pixel);
for faster operation, one can also skip over time-wasters like rounding and similar;
...
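
As a concrete illustration of the "fixed point, no floats per pixel" idea, here is a minimal per-pixel YUV->RGB conversion using BT.601-style coefficients scaled by 256; a generic sketch, not the exact math the decoder described here uses:

static unsigned char clamp255(int v)
{
    /* the range clamp that, per the list above, can often be hoisted
     * out and done per-block instead of per-pixel */
    return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

static void yuv_to_rgb(int y, int u, int v,
                       unsigned char *r, unsigned char *g, unsigned char *b)
{
    int d = u - 128, e = v - 128;
    *r = clamp255(y + ((359 * e) >> 8));           /* 1.402 * 256 ~= 359  */
    *g = clamp255(y - ((88 * d + 183 * e) >> 8));  /* 0.344, 0.714 scaled */
    *b = clamp255(y + ((454 * d) >> 8));           /* 1.772 * 256 ~= 454  */
}

All of the per-pixel work is integer adds, multiplies, and shifts; the byte<->float round trips warned about below never happen.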

with a lot of tricks like this, one can get JPEG decoding speeds of ~ 100 Mpix/s or so (somehow...), ironically not too far off from a PNG decoder (or, FWIW, a traditional video codec).

getting stuff like this fed into OpenGL efficiently is a bit more of an issue, but the driver-supplied texture compression was at least moderately fast, and in my own experiments I was able to write faster texture-compression code.


the main (general) tricks mostly seem to be:
avoid conditionals where possible (straight-through arithmetic is often faster, as pipeline stalls will often cost more than the arithmetic);
if at all possible, avoid using floating point in per-pixel calculations (the conversions between bytes and floats can kill performance, so straight fixed-point is usually better here).

SIMD / SSE can also help here, but has to be balanced with its relative ugliness and reduced portability.



as for an SSD: dunno...

I suspect at-present, SSDs are more of a novelty though...

even if the SSD is very fast, the speed of the SATA bus will still generally limit it to about 400MB/s or so, so using compression for speedups still isn't completely ruled out (though, granted, it will make much less of a difference than it does with a 50MB/s or 100MB/s disk I/O speed, and avoiding "costly" encodings may make more sense).


granted, it seems that with the newly optimized PNG Paeth filter and faster DXT encoding, the main time-waster (besides disk I/O) during loading is now... apparently... the code for scrolling the console buffer...
(this happens whenever a console print message prints a newline character, which involves moving the entire console up by one line, and is basically just a memory copy). (side note: this console operation just naively uses "for()" loops to copy memory...)
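
For what it's worth, that scroll is the textbook use case for memmove(): one call that handles the whole overlapping copy via the optimized memory-copy paths mentioned above. A sketch, assuming a flat w-by-h character buffer (the engine's actual console layout is not shown in the thread):

#include <string.h>

/* Scroll a w x h character console up one line and blank the last row.
 * memmove() is required (not memcpy) because source and destination
 * overlap; libc implementations typically use the fast copy paths. */
static void console_scroll(char *buf, int w, int h)
{
    memmove(buf, buf + w, (size_t)w * (size_t)(h - 1));
    memset(buf + (size_t)w * (size_t)(h - 1), ' ', (size_t)w);
}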


and, shaved a few seconds off the startup time (down to about 3 seconds to start up the engine, and around 9 seconds to load the world).

Edited by cr88192

Hodgman, on 06 Feb 2013 - 07:28, said:

cr88192, on 06 Feb 2013 - 03:30, said:
feeling like it, I went and wrote a "vaguely fast" DXT1 encoder (DXT1F, where the F means "Fast").
the goal here was mostly to convert reasonably quickly, with image quality not necessarily being as high a priority.
as-is, I don't know of any good ways to make it notably faster (I know a few possible ways, but they aren't pretty).

Nice work :)
If you still find yourself interested in DXT1F:
* The code looks like it could be ported over to use the SSE registers entirely, instead of the general-purpose integer ones, which might reduce the instruction count a lot.
* You could also make it so that the user can use multiple threads to perform the processing -- e.g. instead of (or as well as) having an API that encodes a whole image at once, you could add two extra parameters: the row to begin working from, and the row to end on (which should both be multiples of 4, or whatever the block size is). The user could then call that function multiple times with different start/end parameters on different threads to produce different rows of blocks concurrently.

it is possible.
SSE works, but the use of the compiler intrinsics is a little ugly, and would require some use of #ifdef's.
if I were writing it in ASM, I would probably consider this a lot more.


the downside IME with multithreaded encoders/decoders is that the overhead of fine-grained thread synchronization will often outweigh any performance gains. typically I use threads for more coarse-grained operations (in this case, it would probably amount to entire frames).

in this case, a possible scenario looks something like (sketched in code after the list):
the main (renderer) thread requests a frame-decode (adds a job entry to a work queue);
a worker thread comes along, fetches and then executes the work item (decoding the video frame, and probably DXT-encoding it);
when done, the worker marks the job as completed;
the next frame, the renderer may check the job, see that the frame was decoded, and then upload the compressed image(s) to OpenGL (I have not yet researched using OpenGL from multiple threads).
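
A minimal sketch of the job record that scenario implies, using C11 atomics for the completion flag; the field names and layout are hypothetical, not from the engine:

#include <stdatomic.h>

/* One frame-decode job, handed from the renderer to a worker. */
typedef enum { JOB_PENDING, JOB_RUNNING, JOB_DONE } JobState;

typedef struct DecodeJob {
    _Atomic int       state;    /* one of the JobState values          */
    const void       *src;      /* compressed (e.g. M-JPEG) frame data */
    unsigned char    *dxtOut;   /* DXT-encoded result buffer           */
    struct DecodeJob *next;     /* intrusive work-queue link           */
} DecodeJob;

/* renderer, once per frame:
 *     if (atomic_load(&job->state) == JOB_DONE)
 *         ... upload job->dxtOut via glCompressedTexSubImage2D ...   */

The single atomic state transition is the only synchronization point, which keeps the overhead coarse-grained, in line with the concern above.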


currently video-mapping is single-threaded though (all done inline in the main thread), and works mostly by trying to keep the decoding reasonably fast (to hopefully avoid impacting frame-rate).

another possible solution would be moving all of the video-map decoding to a single big thread, then using flags to indicate when each texture buffer has been updated (and possibly a lock to avoid tearing). this should work ok, except that having too many video-maps active at once would likely cause a drop in update rates for the animated textures (say, a person has 30 different video-map textures visible on-screen at once, and their animated textures start skipping frames or similar).


calculating... actually, as-is, just with the current M-JPEG decoder speed, and assuming each stream was 256x256 (and a flat image), it could take closer to around 150 concurrent video streams to saturate the decoder (150 streams x 256x256 pixels x 10fps is roughly 98 Mpix/s, about the ~100 Mpix/s decode rate noted above), though in practice the limit would probably be a little lower... (and I have nowhere near this many video-maps) so, probably no immediate need for work queues.

(as-is, it mostly just has to fit in a fairly small time-window to avoid being annoying; 1ms is a long time when the whole frame is ideally under 30ms).

( EDIT/ADD: although 150 streams may seem like a lot, considering the low resolutions and frame-rates involved, it is vaguely comparable to the cost of decoding a 1080p video in a pixels-per-second sense. granted, real-world performance may be worse, as these results were mostly seen while decoding the same image in a tight loop, so likely a lot more stuff was in cache, and adding DXT encoding would add another stage to this process. I would need more realistic tests, and more actual code, to get a better estimate of how many video streams it would take to bog it down. )


as-is, I also have video recording built into the engine, but in this case there is a single dedicated encoder thread (so the main thread reads the screen contents, then passes them to the video encoder via a shared context). this generally works ok, but as-is I limit the recording frame-rate to 16Hz (with typical rendering resolutions of 800x600 or 1024x768), partly to keep the encoder from bogging down (and to avoid using too much HDD space during recording).

granted, the encoder does more work than strictly needed (*1), but most other "simple" options tend to use large amounts of HDD space. granted, with video-capture, I guess it is sort of expected that capture will chew through large amounts of HDD space.

*1: basically, as-is it is closer to a full JPEG encoder, but I could probably hard-code tables or similar to speed it up some.
maybe doing 1024x768 @24Hz recording could be a goal though.

don't really want to deal with using multiple threads for encoding though.


or such...


EDIT / ADD:
for the video-mapping case, an alternate scenario could be using a video-map stored in a DXT-based format, thus avoiding the whole issue of needing to transcode at render time. more consideration is needed here (when/where/how). the current "most likely practical" solution is a simple deflated TLV format containing DXT frames, probably stored in an AVI.

the format is still too much under mental debate; leaning towards something vaguely JPEG-like WRT file structure.
in any case, the "decoding" would probably be a loop+switch making the relevant OpenGL calls (roughly sketched below).

Edited by cr88192
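
Such a loop+switch over a TLV stream might look something like this; the record layout (1-byte tag, 3-byte little-endian length) and the tag values are entirely made up for illustration, since the format was still undecided:

#include <stdint.h>

/* Walk a decoded (already-inflated) TLV frame and dispatch each record. */
static void decode_frame_tlv(const uint8_t *p, const uint8_t *end)
{
    while (p + 4 <= end)
    {
        uint8_t  tag = p[0];
        uint32_t len = p[1] | ((uint32_t)p[2] << 8) | ((uint32_t)p[3] << 16);
        const uint8_t *payload = p + 4;
        if (len > (uint32_t)(end - payload))
            break;              /* truncated record: stop */
        switch (tag)
        {
        case 0x01: /* DXT1 color blocks: glCompressedTexSubImage2D(...) */
            break;
        case 0x02: /* DXT5 alpha+color blocks */
            break;
        default:   /* unknown tag: skip over it */
            break;
        }
        p = payload + len;
    }
}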

While resource loading is probably the biggest hit, another viable thing to consider is load-in-place for code "resources".

ie:
http://entland.homelinux.com/blog/2007/02/21/fast-file-loading-ii-load-in-place/

while the basic strategy works, there are a few potential drawbacks:
data portability: done poorly, one will end up with files specific to a particular target architecture (such as 32-bit x86 or 64-bit x86), and, depending, a person still needs to do things like address fixups;
adjustments to address the above and improve portability (such as defined width and endianness for values, using fixed base-relative addresses, ...) may risk hurting overall performance if done poorly;
unless the data is also deflated or similar, disk I/O is still likely the main bottleneck.


the solution then, in the design of a binary format, is typically to make a compromise:
the basic format may be loaded as an "image", maybe with the contents deflated;
unpacking the data into a usable form is kept fairly trivial.

(while Deflate's compression process can be slow, decompression can be fairly fast).


loading individual contents then looks like:
fetch the relevant data lump from the image;
inflate the data lump if needed;
decode the data with a lightweight "unpacking" process, i.e. building index structures and (potentially, if needed) doing an endianness swap of any values (which can usually be skipped).

so, the index structures contain any pointers or similar, and the actual data consists mostly of simple arrays and tables.
(say, we have some tables, and a few structs with pointers into these tables.)

for example, in my case, for a few things (mostly VM-related) I am using a format originally based fairly closely on the Quake WAD2 format (which I call "ExWAD"), except that it adds support for deflate (and slightly changes the main header, mostly with a larger magic value). a subsequent version diverged from WAD2, mostly by expanding the name field to 32 bytes (with a 64-byte directory entry) and adding support for directly representing directory trees (more like in a filesystem); a rough sketch of such an entry is below.
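
For reference, Quake's WAD2 directory entry is a 32-byte record along these lines, and the 64-byte variant described above would mostly widen the name field; the exact ExWAD layout isn't given in the thread, so the second struct is an educated guess rather than the real format:

#include <stdint.h>

/* Quake WAD2 lump directory entry (32 bytes). */
typedef struct {
    int32_t filepos;      /* offset of the lump within the file */
    int32_t disksize;     /* size on disk (possibly compressed) */
    int32_t size;         /* uncompressed size                  */
    char    type;         /* lump type                          */
    char    compression;  /* 0 = none                           */
    char    pad1, pad2;
    char    name[16];     /* null-terminated lump name          */
} wad2_dirent_t;

/* Hypothetical ExWAD-style 64-byte entry: name widened to 32 bytes,
 * remaining bytes left as reserved padding. */
typedef struct {
    uint32_t filepos;
    uint32_t disksize;
    uint32_t size;
    uint8_t  type;
    uint8_t  compression; /* e.g. 0 = store, 1 = deflate */
    uint8_t  pad[18];     /* reserved; brings the entry to 64 bytes */
    char     name[32];
} exwad_dirent_t;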

I can see how DXT encoding is nice for textures you want to use for a longer time, but is it really needed to do all that encoding on the CPU for a video when you display each frame only once and then throw it away? Also, why not skip the YUV->RGB step on the CPU and do it in a shader instead, as there is probably not much else going on on the GPU when just watching a video?


wintertime, on 08 Feb 2013 - 07:49, said:
I can see how DXT encoding is nice for textures you want to use for a longer time, but is it really needed to do all that encoding on the CPU for a video when you display each frame only once and then throw it away?

Video is not usually DXT-compressed, but uses other methods (which not only compress a single frame but also exploit deltas between frames). It is of course possible to use DXT to compress individual video frames, and since that is a simple solution it might just work for some cases.

As to why one would want to do such a silly amount of work for a frame that one only watches for a split second, the answer is that the amount of data you would need to stream in for uncompressed video is forbidding if you possibly ever want to do anything else. It may be forbidding overall too, depending on resolution and frame rate.

Let's say you want to play a 480x270 video (1/16 the area of 1080p) at 25fps. Assuming 32bpp, that's already around 12.4MiB/s of data, sustained. Scale it up to full 1080p and you're at nearly 200MiB/s; try to find a harddisk that consistently delivers at such a rate (my SSD has no trouble doing that, but sadly not every consumer has an SSD just yet). Make it twice as large again in every dimension and you exceed the theoretical maximum of SATA-600.

Compression will reduce this immense bandwidth pressure by a factor of maybe 500 or 1000, which is just what you need.

About doing YUV->RGB on the GPU, there is no objection to that.

I was assuming the video is already compressed on the HDD as MPEG/MJPEG/whatever. What seemed weird to me was that, after decompressing it, people would want to spend additional CPU time recompressing it as DXT just to speed up the path from main memory to GPU memory, which is much faster than the HDD anyway.


samoth, on 08 Feb 2013 - 09:22, said:

wintertime, on 08 Feb 2013 - 07:49, said:
I can see how DXT encoding is nice for textures you want to use for a longer time, but is it really needed to do all that encoding on the CPU for a video when you display each frame only once and then throw it away?

Video is not usually DXT-compressed, but uses other methods (which not only compress a single frame but also exploit deltas between frames). It is of course possible to use DXT to compress individual video frames, and since that is a simple solution it might just work for some cases.

As to why one would want to do such a silly amount of work for a frame that one only watches for a split second, the answer is that the amount of data you would need to stream in for uncompressed video is forbidding if you possibly ever want to do anything else. It may be forbidding overall too, depending on resolution and frame rate.

Let's say you want to play a 480x270 video (1/16 the area of 1080p) at 25fps. Assuming 32bpp, that's already around 12.4MiB/s of data, sustained. Scale it up to full 1080p and you're at nearly 200MiB/s; try to find a harddisk that consistently delivers at such a rate (my SSD has no trouble doing that, but sadly not every consumer has an SSD just yet). Make it twice as large again in every dimension and you exceed the theoretical maximum of SATA-600.

Compression will reduce this immense bandwidth pressure by a factor of maybe 500 or 1000, which is just what you need.

About doing YUV->RGB on the GPU, there is no objection to that.

my video is typically 256x256 @10fps.
512x512 @10fps is also possible, but not currently used.

the problem with NPOT resolutions (480x270, for example) is that the GPU doesn't like them (and resampling is expensive).
generally, since these are going into textures and mapped onto geometry, we want power-of-two sizes.

the downside of YUV->RGB in shaders is mostly that it would require dedicated shaders, vs. being able to use these in roughly all the same ways as a normal texture (as input to various other shaders, ...). it would probably be fine for cutscenes or similar, though (ironically, I'm not really using video for cutscenes at present...).


otherwise, video-mapping is a similar idea to id Software's RoQ videos.
however, RoQ was using a different compression strategy (Vector Quantization).
for example, Doom3 used RoQ for lots of little things in-game (decorative in-game video-displays, fire effects, ...).

theoretically, a DXT-based VQ codec could be derived (and could potentially decode very quickly, since you could unpack directly to DXT). I had started looking into this, but didn't like my initial design results (too complicated for what it was doing).


as for MJPEG:
I was mostly using code I already had at the time (*1), and the format has a few advantages for animated-texture use, namely that it is possible to decode frames in arbitrary order and to easily skip frames (also, it is not legally encumbered). in addition, for animated textures, raw compression rate is less important, and there is often less opportunity for effective use of motion compensation.

the downside is mostly the narrow time-window to decode frames during rendering (at least, with the current single-threaded design).


*1: originally for texture loading; I also had some AVI-handling code around (years earlier, I had written some more-generic video-playback stuff). (and the JPEG loader was originally written out of frustration with libjpeg.)

also, it is possible to view the basic animated textures in normal video players (like Media Player Classic), though granted, this is only a minor detail.

admittedly, when I first tried using video mapping (on a Radeon 9000), it was too slow to be worthwhile. some years later, hardware is faster, so now it works basically ok.

wintertime, on 08 Feb 2013 - 09:41, said:
I was assuming the video is already compressed on the HDD as MPEG/MJPEG/whatever. What seemed weird to me was that, after decompressing it, people would want to spend additional CPU time recompressing it as DXT just to speed up the path from main memory to GPU memory, which is much faster than the HDD anyway.

well, yes, there is little to say that there would actually be a benefit in doing so.
the speed of my "DXT1F" encoder doesn't seem to be "that" bad though, so there is at least a small chance it could derive some benefit.

as noted, it mostly depends on the ability to shim it into the JPEG decoding process, so it isn't so much "encoding DXTn" as "skipping the full decode to RGB" (using only about 1/4 as much arithmetic as the full conversion).

theoretically, this route "could" actually outperform using uncompressed RGBA for the final steps of decoding the JPEG images (and, otherwise, you still need to put the DCT block-planes into raster order one way or another...).

as for time usage: if we are rendering at 30fps but the video frame-rate is 10fps, then each video frame will be on-screen for roughly 3 rendered frames (or 4-5 frames if rendering at 40-50fps).


(decided to leave out some stuff)

For those interested in fast CPU side texture compression, say for streaming JPG->DXT etc., check out the code and article here: http://software.intel.com/en-us/vcsource/samples/dxt-compression

 

Some searching should also help find a host of other approaches. I'd not recommend this in general for loading textures in most situations, as the better quality/performance ratios likely come from simply loading packed (zip etc.) pre-compressed textures, but if you have huge amounts of detailed textures you need to stream, this is a good approach.


Quote
For those interested in fast CPU side texture compression, say for streaming JPG->DXT etc., check out the code and article here: http://software.intel.com/en-us/vcsource/samples/dxt-compression
Some searching should also help find a host of other approaches. I'd not recommend this in general for loading textures in most situations, as the better quality/performance ratios likely come from simply loading packed (zip etc.) pre-compressed textures, but if you have huge amounts of detailed textures you need to stream, this is a good approach.

feeling curious, I tested my own "DXT1F" encoder (with MSVC):

I am getting 112Mp/s if compiled with optimizations turned on ("/O2"), and 42Mp/s debug ("/Z7").

granted, this is single-threaded scalar code.
and it also assumes only 2 colors.


the version written to shim into the JPEG decoder (by itself) actually pulls off 314Mp/s (180Mp/s debug).
still single-threaded scalar code.

the main difference is that it works on planar YUV input and assumes 4:2:0, and requires less math than for RGBA (we only need the min and max Y; UV = simple average). (a sketch of that endpoint selection is below.)
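
Roughly, that endpoint selection for one 4x4 block looks like the following; the function name and exact rounding are illustrative rather than lifted from DXT1F (converting the two (Y,U,V) endpoints to RGB565 would reuse the same kind of fixed-point math sketched earlier in the thread):

/* For a 4x4 block of 4:2:0 planar YUV: 16 luma samples and a single
 * 2x2 group of chroma samples.  Endpoints become (minY, avgU, avgV)
 * and (maxY, avgU, avgV), far less work than a full RGB search. */
static void block_endpoints_yuv420(const unsigned char y[16],
                                   const unsigned char u[4],
                                   const unsigned char v[4],
                                   int *yMin, int *yMax,
                                   int *uAvg, int *vAvg)
{
    int i, mn = 255, mx = 0;
    for (i = 0; i < 16; i++)
    {
        if (y[i] < mn) mn = y[i];   /* track min/max luma */
        if (y[i] > mx) mx = y[i];
    }
    *yMin = mn;
    *yMax = mx;
    *uAvg = (u[0] + u[1] + u[2] + u[3] + 2) >> 2;  /* rounded mean */
    *vAvg = (v[0] + v[1] + v[2] + v[3] + 2) >> 2;
}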

(EDIT/ADD: after changing it to take on more of the work, basically working on raw IDCT block output rather than precooked output, it runs at 170Mp/s optimized.)

not entirely sure what a SIMD or multithreaded version would do here.

granted, this version would be used on the front-end of a JPEG decoder, which would make it slower.

(EDIT/ADD:
http://pastebin.com/emDK9jwc
http://pastebin.com/EyEY5W9P
)

CPU speed (my case) = 2.8 GHz.

or such...

Edited by cr88192

wintertime, on 08 Feb 2013 - 09:41, said:
What seemed weird to me was that, after decompressing it, people would want to spend additional CPU time recompressing it as DXT just to speed up the path from main memory to GPU memory, which is much faster than the HDD anyway.

This may still be a valid reason.

Although you typically have about 8 GiB/s of bandwidth over PCIe, several hundreds of megabytes per second are still a non-negligible share. If you do nothing else, that's no problem, but if you possibly have other stuff to transfer, it may be.

Transfers also have a fixed overhead and cause a complete GPU stall on typical present-day consumer hardware (the driver will either let the GPU render or do a PCIe transfer, not both at the same time), so transferring a frame at a time is not an efficient solution. Transferring many frames at a time results in much better parallelism; however, it takes forbidding amounts of GPU memory. DXT compression alleviates that (DXT1 is 1/8 the size of raw 32bpp data, DXT5 is 1/4).


Quote
What seemed weird to me was that, after decompressing it, people would want to spend additional CPU time recompressing it as DXT just to speed up the path from main memory to GPU memory, which is much faster than the HDD anyway.

This may still be a valid reason.

Although you typically have about 8 GiB/s of bandwidth over PCIe, several hundreds of megabytes per second are still a non-negligible share. If you do nothing else, that's no problem, but if you possibly have other stuff to transfer, it may be.

Transfers also have a fixed overhead and cause a complete GPU stall on typical present-day consumer hardware (the driver will either let the GPU render or do a PCIe transfer, not both at the same time), so transferring a frame at a time is not an efficient solution. Transferring many frames at a time results in much better parallelism; however, it takes forbidding amounts of GPU memory. DXT compression alleviates that (DXT1 is 1/8 the size of raw 32bpp data, DXT5 is 1/4).


pretty much.

nothing here says this is actually efficient (traditional animated textures are still generally a better solution in most cases; my engine supports both).


another motivation is something I had suspected already (partly confirmed in benchmarks):
the direct YUV-to-DXTn route is apparently slightly faster than going all the way to raw RGB.

still need more code though to confirm that everything is actually working, probably followed by more fine-tuning.

(EDIT/ADD: sadly, it turns out my JPEG decoder isn't quite as fast as I remembered, oh well...)


(EDIT/ADD 2: as in, the current JPEG->DXT5 transcoding route pulls off only about 38Mp/s (optimized "/O2", ~20Mp/s debug), whereas I had remembered things being faster. (granted, I am getting tempted to use SIMD intrinsics for a few things...).

note that current RGBA video frames have both an RGB (YUV) image and an embedded alpha image (mono, also encoded as a JPEG);
both layers are decoded/transcoded and recombined into a composite DXT5 image.


while looking around online, I did run across this article though:
http://www.nvidia.com/object/real-time-ycocg-dxt-compression.html
nifty idea... but granted, this would need special shaders...)

Edited by cr88192
