

How to avoid slow loading problems in games



#21 Hodgman   Moderators   -  Reputation: 31786


Posted 03 February 2013 - 07:27 PM

block compress the texture first, offline, then compress that (e.g. with Zlib). This will give around another 50% saving, and the only work to do on loading is decompression, direct to the final format.

Indeed, this is standard practice these days.
Instead of using standard compression on them though, another option is the crunch library, which offers two options:
* a rate/distortion-optimised DXT compressor, which reduces quality slightly, but produces files that standard compression algorithms can compress much better.
* its own compressed format, "CRN", which is also a lossy block-based format, but has the ability to be directly (and efficiently) transcoded from CRN to DXT, for small on-disk sizes and fast loading.
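With the "standard compression on DXT" route, the load path collapses to read, inflate, upload. A minimal sketch in C, assuming zlib and the GL_EXT_texture_compression_s3tc token; the flat file layout here (just a deflated DXT1 payload of known dimensions) is made up for illustration:

#include <stdlib.h>
#include <zlib.h>
#include <GL/gl.h>
#include <GL/glext.h>   /* for GL_COMPRESSED_RGB_S3TC_DXT1_EXT */

/* inflate a zlib-compressed DXT1 payload and hand it straight to GL --
   no per-texel work is done at load time */
int load_dxt1_texture(const unsigned char *comp, unsigned long complen,
                      int w, int h, GLuint tex)
{
    uLongf rawlen = (uLongf)(w / 4) * (h / 4) * 8;   /* 8 bytes per 4x4 DXT1 block */
    unsigned char *raw = (unsigned char *)malloc(rawlen);
    if (!raw || uncompress(raw, &rawlen, comp, complen) != Z_OK) {
        free(raw);
        return -1;
    }
    glBindTexture(GL_TEXTURE_2D, tex);
    glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                           w, h, 0, (GLsizei)rawlen, raw);
    free(raw);
    return 0;
}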

My other question would be I'm sure the programmers know about the slow loading times but why not fix it?

Fixing things takes time, and time is money... which makes that a question for the business managers, not the engineers. :P

For example, in my case (on a PC), loading textures is a good part of the startup time, and a lot of this is due to resampling the (generally small) number of non-power-of-2 textures to power-of-2 sizes. this is then followed by the inner loops for doing the inverse-filtering for PNG files
...
parsing text files

Just say no! Don't perform any image filtering/re-sampling/transcoding or parsing at load-time; move that work to build-time!
As phantom mentioned, DXT compression is very slow, so if you want fast texture fetching and low VRAM usage, then you'll also be wasting a lot of load-time recompressing the image data that you just decompressed from PNG too!

during development, a disadvantage of ZIP though is that it can't be readily accessed by the OS or by "normal" apps

For the past 3 engines I've used, we've gone with ZIP-like archives for final builds, and just loose files in the OS's file-system for development builds, because building/editing the huge archive files is slow.

However, the above issue (that your content tools can't write to your archive directly) isn't actually an issue, because even when we're using the OS's file-system, the content tools can't write to those files either, because they've been compiled into runtime-efficient formats!

The data flow looks something like:
[Content Tools] --> Content Source Repository  --> [Build tools] --> Data directory --> [Build tools] --> Archive file
                                                                            |                                  |
                                                                           \|/                                \|/
                                                                    In-Development game                    Retail game
Just as we don't manually compile our code any more -- everyone uses an IDE or at least a makefile -- you should also be using an automated system for building the data that goes into your game. The three engines that I mentioned above all used a workflow similar to the diagram, where, when an artist saves a new/edited "source" art file, the build system automatically compiles that file and updates the data directory and/or the "ZIP archive".

For example, if someone saves out a NPOT PNG file, the build tools will automatically load, decode, filter, resample that data, then compress it using an expensive DXT compression algorithm, then save it in the platform specific format (e.g. DDS) in the data directory for the game to use. Then at load-time, the game has no work to do, besides streaming in the data.
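To make that "automatically compiles that file" step concrete, a hypothetical build rule (shown in Makefile form; nvcompress is the offline DXT/BC1 compressor from the NVIDIA Texture Tools, and the directory names are made up):

# any out-of-date DDS is rebuilt from its source PNG by the data build
data/%.dds: source/%.png
	nvcompress -bc1 $< $@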


#22 BGB   Crossbones+   -  Reputation: 1554


Posted 03 February 2013 - 08:31 PM

phantom, on 03 Feb 2013 - 17:28, said:

Endurion, on 03 Feb 2013 - 00:51, said:
being nice to all kind of mangled formats/dimensions is useful during production, but once you hit the finish line, store the stuff in the exact format you need in memory.


We don't even allow that.

EVERY resource which is loaded, during every stage of development, is a processed one. All images are converted first to DDS then to our custom format (basically a smaller header with a platform specific payload attached which is in the hardware format required) so they can be directly streamed in as fast as possible.

The only difference to development vs final build is that for the final build the various packaged files are consolidated into larger volumes which are compressed using zlib.

There is no good reason to be dicking about at runtime with custom formats. If you aren't giving the GPU DXTn/BCn compressed data to work with, then you are Doing It Wrong™; and a GOOD DXTn/BCn compressor can take quite some time to run, so if you try to compress at run-time to any decent quality, there is no way you'll get fast loading times.

not entirely sure here, but OpenGL seems able to compress textures relatively quickly.
or, is the idea that the built-in texture-compressor provided by OpenGL isn't "good", or isn't very fast, or something else?...


granted, yes, the cost of compressing textures can be observed somewhat when streaming video into textures, which is generally a reason I use uncompressed textures for video-mapping (there is less of a framerate drop in this case).

I guess a relevant experiment could be creating a "compressed texture cache" (basically, like a large WAD file for deflated+DXTn-compressed images), and maybe also evaluate the use of deflated DXTn textures for video-maps (though, another experiment could be trying to convert directly from DCT-macroblocks into DXTn, rather than decoding an MJPG AVI frame into RGBA frames and then handing these to OpenGL). (note: video-maps exist alongside more traditional animated textures).


or such...

#23 Hodgman   Moderators   -  Reputation: 31786


Posted 03 February 2013 - 09:23 PM

not entirely sure here, but OpenGL seems able to compress textures relatively quickly.
or, is the idea that the built-in texture-compressor provided by OpenGL isn't "good", or isn't very fast, or something else?...

Generally, if a DXT compressor is very fast, then it's probably producing low quality results.

One other downside of asking GL to compress your data for you is that (on Windows) this is implemented in the graphics driver code, which is likely to be different on each of your users' PCs. This means that maybe one user's driver has a slow DXT compressor, while another's is fast. Maybe one user gets really bad quality textures, while others get decent quality? It's hard to ensure a consistent experience when you outsource some behaviour of your game to an unknown 3rd-party component like this.

For an example of objectively measuring the quality of different compression approaches, see L. Spiro's DXT compression blog post here, where he talks about measuring signal-to-noise ratios; the blog post that LS links to also has some good visual examples of how different the results of different algorithms can look.

As for video, there's probably not much point in DXT compressing individual frames (unless, yes, you somehow could directly transcode from the video format to DXT blocks!). The time spent performing the DXT compression would probably outweigh the theoretical benefits, which include:

* quicker time to transfer the frame to VRAM (but this time is probably already small compared to the MPEG/etc decoding time)

* faster pixel shader execution due to faster texture fetching (but pixel shading isn't likely a bottleneck)

* reduced VRAM usage (which isn't that important as you only need a frame at a time)


Edited by Hodgman, 03 February 2013 - 09:36 PM.


#24 BGB   Crossbones+   -  Reputation: 1554


Posted 04 February 2013 - 12:13 AM


not entirely sure here, but OpenGL seems able to compress textures relatively quickly.
or, is the idea that the built-in texture-compressor provided by OpenGL isn't "good", or isn't very fast, or something else?...

Generally, if a DXT compressor is very fast, then it's probably producing low quality results.
One other downside of asking GL to compress your data for you is that (on Windows) this is implemented in the graphics driver code, which is likely to be different on each of your users' PCs. This means that maybe one user's driver has a slow DXT compressor, while another's is fast. Maybe one user gets really bad quality textures, while others get decent quality? It's hard to ensure a consistent experience when you outsource some behaviour of your game to an unknown 3rd-party component like this.


interesting.

will have to look into this.

generally, it seems moderately fast, and has "tolerable" quality, at least on the cards I have typically used (recent ATI and NVIDIA cards), though it sometimes does introduce a slight banded/patchy look.

I had generally used it because it seems to help with the framerate.

For an example of objectively measuring the quality of different compression approaches, see L. Spiro's DXT compression blog post here, where he talks about measuring signal-to-noise ratios; the blog post that LS links to also has some good visual examples of how different the results of different algorithms can look.
As for video, there's probably not much point in DXT compressing individual frames (unless, yes, you somehow could directly transcode from the video format to DXT blocks!). The time spent performing the DXT compression would probably outweigh the theoretical benefits, which include:
* quicker time to transfer the frame to VRAM (but this time is probably already small compared to the MPEG/etc decoding time)
* faster pixel shader execution due to faster texture fetching (but pixel shading isn't likely a bottleneck)
* reduced VRAM usage (which isn't that important as you only need a frame at a time)

(edit: add, after a quick skim of the blog post (will probably read more):
ok, so I guess the idea is that the common patchy/banded look of DXT-compressed textures isn't an inherent property, but rather a side effect of quick/dirty encoders not really doing any dithering? nifty... well, I guess this gives more reason to look more into these matters.)


fair enough...


for the codecs I am using (Motion-JPEG and Motion-BTJ), converting from macroblocks to DXT blocks could be possible, but admittedly I don't know if it would save that much over going the full YUV (blocks) -> RGB -> DXT route.

and, it could very well still be slower than the (current) strategy of using uncompressed textures for video, or maybe not really make a big difference (since, as noted, decoding video frames isn't entirely free).


note (going off original topic, mostly general information):

in Motion-JPEG, each frame is basically an independent JPEG image (the video format is, essentially, just playing a series of JPEG images).

for Motion-BTJ, it is basically similar to Motion-JPEG, except that BTJ supports an alpha-channel, lossless coding, normal-maps, luminance and specular maps, layer stacks, embedded shader-info files, ... so it can be used for some more elaborate effects (BTJ is essentially a JPEG containing a collection of other modified-format JPEG images inside a makeshift TLV container format). the "BTJ" basically means "BGBTech JPEG", but I now call it BTJ mostly as "it isn't really JPEG anymore..." (it has since broken strict backwards compatibility).

in both cases, an AVI texture is slightly abnormal (only works correctly if drawn via the "shader system", compare: Quake 3 "shaders" or Doom 3 "materials").


note: since BTJ descended directly out of my JPEG codec, there is the side effect that a few basic BTJ features (such as alpha-channels) work with JPEG images (with a ".jpg" extension), and also with the "MJPG" FOURCC, but this is technically non-standard. however, since that point, the codecs were forked (mostly as I had reason to have both a "sane" JPEG codec, and also a "highly customized mutant format").


wandering further off original topic / aside:

BTJ was originally developed, mostly because AVI didn't provide any good way to provide this stuff otherwise, and potentially using a stack of parallel AVIs was not desirable (and I didn't feel like switching to a different container format), so it seemed preferable to basically just unleash some serious hacks on the JPEG format.

the analogy is basically if something like RIFF were shoved inside of a JPEG image, and in-turn contained more JPEG images.

so, decoding a frame generally consists of decoding the base JPEG image, along with any "component layers" (such as alpha-channel or normal map), followed by any contained "tag-layers" (essentially independent images). a shader-info file or script can refer to these layers, treating them like images (the video then is basically like an animated layer stack). (note that any images contained in a given frame will be uploaded to their respective GL textures).

currently, the AVIs are "compiled" typically from a pile of PNG (or BTJ) images, and some number of control-files (such as shader-info files and a frame-list).

BTJ images have some use as standalone images as well, basically as a feature for "compound" or "layered" images; there is a Paint.NET plugin, and the format supports many of the same features as the native PDN format. the engine then basically treats each layer as if it were its own image, for example "textures/base_foo/bar.btj::Background" or "textures/base_foo/bar.btj::Foreground", and may refer to components like "textures/base_foo/bar.btj::Background:Normal".

like with AVI videos, BTJ images are currently only really usable via the shader system.


thus far, I haven't done a whole lot "notable" with all this, apart from making a few random animation videos and putting them on my YouTube channel.

Edited by cr88192, 04 February 2013 - 01:08 AM.


#25 BGB   Crossbones+   -  Reputation: 1554


Posted 06 February 2013 - 03:24 AM

feeling like it, went and wrote a "vaguely fast" DXT1 encoder (DXT1F, where the F means "Fast").

part of the logic has been put here:
http://pastebin.com/emDK9jwc

(pastebin used as my webserver has apparently catastrophically failed...).


the goal here was mostly to convert reasonably quickly, with image quality not necessarily being as much of a high priority.

as-is, I don't know of any good ways to make it notably faster (I know a few possible ways, but they aren't pretty).

existing, but omitted, are alternate encoders for more normal DXT1 and DXT5 (which use a variation of range-fit).
these encoders still seem to be a bit faster, but a little worse-looking, than the ones the OpenGL driver uses. (this was unexpected, actually; I kind of expected my code would be slower, but apparently not...).

either way, little seems particularly slow here.

#26 Hodgman   Moderators   -  Reputation: 31786


Posted 06 February 2013 - 07:22 AM

feeling like it, went and wrote a "vaguely fast" DXT1 encoder (DXT1F, where the F means "Fast").

the goal here was mostly to convert reasonably quickly, with image quality not necessarily being as much of a high priority.
as-is, I don't know of any good ways to make it notably faster (I know a few possible ways, but they aren't pretty).

Nice work! :)

If you still find yourself interested in DXT1F:

* The code looks like it would be possible to port over to entirely use the SSE registers instead of general purpose int ones, which might reduce the instruction count a lot.

* You could also make it so that the user can use multiple threads to perform the processing -- e.g. Instead of (or as well as) having an API that encodes a whole image at once, you could add two extra parameters -- the row to begin working from, and the row to end on (which should both be multiples of 4, or whatever the block size is). The user could then call that function multiple times with different start/end parameters on different threads to produce different rows of blocks concurrently (see the sketch below).
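To make the row-range idea concrete, a minimal sketch in C with pthreads. DXT1F_EncodeRows is a hypothetical wrapper with the suggested start/end parameters, not a function from the posted code:

#include <stddef.h>
#include <pthread.h>

/* hypothetical row-range entry point: encodes pixel rows [row_start, row_end) */
void DXT1F_EncodeRows(const unsigned char *rgba, int w, int h,
                      unsigned char *out, int row_start, int row_end);

typedef struct {
    const unsigned char *rgba;
    unsigned char *out;
    int w, h, row_start, row_end;
} Slice;

static void *encode_slice(void *p)
{
    Slice *s = (Slice *)p;
    DXT1F_EncodeRows(s->rgba, s->w, s->h, s->out, s->row_start, s->row_end);
    return NULL;
}

/* split the image into two bands of whole 4-pixel block rows */
void encode_two_threads(const unsigned char *rgba, int w, int h, unsigned char *out)
{
    int mid = ((h / 2) / 4) * 4;   /* the split point must be a multiple of the block size */
    Slice a = { rgba, out, w, h, 0, mid };
    Slice b = { rgba, out, w, h, mid, h };
    pthread_t t;
    pthread_create(&t, NULL, encode_slice, &a);
    encode_slice(&b);              /* do the second band on the calling thread */
    pthread_join(t, NULL);
}

Since each band of blocks writes to a disjoint part of the output, the two threads need no synchronization beyond the final join.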


Edited by Hodgman, 06 February 2013 - 07:24 AM.


#27 samoth   Crossbones+   -  Reputation: 5032


Posted 06 February 2013 - 09:11 AM

I am writing mostly from personal experience, which tends to be that HDD's aren't very fast.

More precisely, one should say:
Harddisks are fast. Fast enough for pretty much everything. SSDs are faster, in fact ridiculously fast. They're becoming more and more common (and thus cheaper, which is a positive feedback loop), and I'm hoping that in maybe 3-5 years from now you'll be able to assume an SSD as "standard" in every computer. Harddisk seeks are slow, but with only a little attention (e.g. not scattering data across ten thousand files) they're rarely ever an issue.

DVDs are... oh well... about 1/3 to 1/5 the speed of a harddisk, but you normally need not worry. DVD seeks, on the other hand, are the devil.

Which boils down to something as simple as: Try and avoid seeks, and you're good. If you ship on DVD, avoid them as if your life depended on it.

A slow harddisk (some 3-4 year old 7200 rpm disk) realistically delivers upwards of 50MiB/s if you don't have too many seeks. That isn't so bad. With NCQ, you often don't even notice a few dozen seeks at all, and even without, the impact is very small if you have several concurrent reads.

A SSD delivers so fast that decompression can be the actual bottleneck. Seriously. You just don't care about its speed or about seek times.

Now consider a DVD. A typical, standard 16x drive will, in the best case, deliver something close to 20 MiB/s, so between 1/3 and 1/2 of what you get from a slow harddisk. In the worst case (sectors close to the middle), you'll have about half that.
Now throw in 30 seeks, which on the average take around 150ms each on a DVD drive. That's a full 4.5 seconds spent in seeking. Compared to these 4.5 seconds during which nothing at all happens (nothing useful from your point of view, anyway), pretty much every other consideration regarding speed becomes insignificant.

#28 BGB   Crossbones+   -  Reputation: 1554


Posted 06 February 2013 - 02:28 PM

samoth, on 06 Feb 2013 - 09:17, said:

I am writing mostly from personal experience, which tends to be that HDD's aren't very fast.

More precisely, one should say:
Harddisks are fast. Fast enough for pretty much everything. SSDs are faster, in fact ridiculously fast. They're becoming more and more common (and thus cheaper, which is a positive feedback loop), and I'm hoping that in maybe 3-5 years from now you'll be able to assume an SSD as "standard" in every computer. Harddisk seeks are slow, but with only a little attention (e.g. not scattering data across ten thousand files) they're rarely ever an issue.


seeks are basically why data can be packaged up, then a person can read a single larger package in place of a lot of little files.

as for HDD speed, it depends:
I am currently using 5400RPM 1TB drives, which have a max IO speed of roughly 100MB/s or so (though, 50-70MB/s is typical for OS file-IO).

although, yes, 7200RPM drives are also common, they are more expensive for the size, and SSDs are still rather expensive (like $140 for a 128GB drive), vs like $89 for a 2TB 5400RPM drive (or like $109 for a similar-size 7200RPM drive).

DVDs are... oh well... about 1/3 to 1/5 the speed of a harddisk, but you normally need not worry. DVD seeks, on the other hand, are the devil.

Which boils down to something as simple as: Try and avoid seeks, and you're good. If you ship on DVD, avoid them as if your life depended on it.

A slow harddisk (some 3-4 year old 7200 rpm disk) realistically delivers upwards of 50MiB/s if you don't have too many seeks. That isn't so bad. With NCQ, you often don't even notice a few dozen seeks at all, and even without, the impact is very small if you have several concurrent reads.

4 year old 5400RPM drives (3.5 inch internal), in my case...

things are slow enough to stall when doing screen-capture from games with uncompressed video, or so help you if you try to do much of anything involving the HDD while a virus scan is running...

better capture is generally gained via a simple codec, like MJPEG or the FFDShow codec, then transcoding into something better later.


I have a 10 year old laptop drive (60GB 4800RPM), which apparently pulls off a max read speed of around 12MB/s, and a newer laptop drive (320GB 5400RPM) which pulls off around 70MB/s (or around 30-40MB/s inside an OS).

A SSD delivers so fast that decompression can be the actual bottleneck. Seriously. You just don't care about its speed or about seek times.

Now consider a DVD. A typical, standard 16x drive will, in the best case, deliver something close to 20 MiB/s, so between 1/3 and 1/2 of what you get from a slow harddisk. In the worst case (sectors close to the middle), you'll have about half that.
Now throw in 30 seeks, which on the average take around 150ms each on a DVD drive. That's a full 4.5 seconds spent in seeking. Compared to these 4.5 seconds during which nothing at all happens (nothing useful from your point of view, anyway), pretty much every other consideration regarding speed becomes insignificant.

my experience is roughly pulling in a few hundred MB of data over the course of a few seconds or so.
more time is generally spent reading the data than decompressing it.


like, with deflate:
I have previously tested decompression speeds of up to several GB/s or so (and in a few past tests, namely with readily-compressible data, hitting the RAM speed limit), mostly as the actual "work" of inflating data consists largely of memory-copies and RLE flood-fills.


implemented sanely, decoding a Huffman symbol is mostly a matter of something like:
v = hufftab[(win>>pos)&32767];                          /* peek 15 bits, look up the symbol */
pos += lentab[v];                                       /* consume that symbol's bit length */
while(pos>=8) { pos-=8; win=(win>>8)|((*cs++)<<24); }   /* refill the 32-bit window bytewise */

(note: the refill wants a while rather than an if, since a long symbol can consume more than one byte at a time)

which isn't *that* expensive (and, more so, it is only a small part of the activity, except with poorly-compressible data).


( EDIT/ADD: more so, the poorly-compressible data edge-case is generally handled because the deflate encoder is like "hey, this data doesn't compress for crap" and falls back to "store" encoding, which amounts to essentially a straight memory copy in the decoder.

side note: modern CPUs also include dedicated silicon to optimize special memory-copying and flood-fill related instruction sequences, so these are often fairly fast. )


granted, I am using a custom inflater, rather than zlib (it makes a few optimizations under the assumption of being able to compress/decompress an entire buffer at once, rather than working piecewise on a stream).

this doesn't necessarily apply to things like LZMA, which involves a much more costly encoding and has a hard time breaking 60-100MB/s for decoding speeds IME; this is more specific to deflate (which can be pretty fast on current HW).


for things like PNG, most of the time goes into running the filters over the decompressed buffers, which has led to some amount of optimization (generally, dedicated functions which apply a single filter over a scanline with a fixed format). this is why the Paeth optimization trick became important: the few conditionals inside the predictor became a large bottleneck, because (unlike deflate) it is necessary to execute some logic for every pixel (like adding the prior pixel, or a prediction based on the adjacent pixels).
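For reference, the predictor in question: the spec version branches per pixel, and one common branch-free reformulation (not necessarily the exact trick used here) replaces the selection with masks:

#include <stdlib.h>   /* abs() */

/* PNG Paeth predictor as specified: a=left, b=above, c=upper-left */
static int paeth(int a, int b, int c)
{
    int p  = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* branch-free variant: comparisons yield 0/1, negation turns them into
   all-zero/all-one masks, and the result is blended with AND/OR */
static int paeth_nobranch(int a, int b, int c)
{
    int p  = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    int ma = -((pa <= pb) & (pa <= pc));   /* mask: 'a' selected            */
    int mb = -(pb <= pc) & ~ma;            /* mask: 'b' selected, 'a' wasn't */
    return (a & ma) | (b & mb) | (c & ~(ma | mb));
}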

JPEG is also something that would normally be expected to be dead-slow, but isn't actually all that bad (more so as my codec also assumes the ability to work on a whole image at once), and most operations boil down to a "modest" amount of fixed-point arithmetic (like in the DCT: no cosines or floating-point is actually involved, just a gob of fixed-point). sort of like with PNG, the main time-waster tends to be the final colorspace transformation (YUV -> RGB), but this can be helped along in a few ways:
writing logic specifically for configurations like 4:2:0 and 4:4:4, which allows more fixed-form logic;
generally transforming the image in a block-by-block manner (related to the above, often it is an 8x8 or 16x16 macroblock which is broken down into 2x2 or 4x4 sub-blocks for the color-conversion, partly to allow reusing math between nearby pixels);
using special logic to largely skip over costly checks (like individually range-clamping pixels, ..., which can be done per-block rather than per-pixel);
for faster operation, one can also skip over time-wasters like rounding and similar;
...

with a lot of tricks like this, one can get JPEG decoding speeds of ~ 100 Mpix/s or so (somehow...), ironically not too far off from a PNG decoder (or, FWIW, a traditional video codec).
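For reference, the fixed-point flavour of that final YUV -> RGB step looks something like the following sketch (standard JPEG/BT.601 coefficients scaled by 2^16; this is an illustration, not the engine's actual code):

/* clamp to [0,255]; in the optimized paths above this can be hoisted per-block */
static unsigned char clamp_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* JPEG YCbCr -> RGB: 1.402->91881, 0.344136->22554, 0.714136->46802, 1.772->116130 */
void ycbcr_to_rgb(int y, int cb, int cr, unsigned char *rgb)
{
    cb -= 128; cr -= 128;
    rgb[0] = clamp_u8(y + ((91881 * cr) >> 16));
    rgb[1] = clamp_u8(y - ((22554 * cb + 46802 * cr) >> 16));
    rgb[2] = clamp_u8(y + ((116130 * cb) >> 16));
}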

getting stuff like this fed into OpenGL efficiently is a little bit more of an issue, but, the driver-supplied texture compression was at least moderately fast, and in my own experiments I was able to write faster texture compression code.


the main (general) tricks mostly seem to be:
avoid conditionals where possible (straight-through arithmetic is often faster, as pipeline stalls will often cost more than the arithmetic);
if at all possible, avoid using floating point in per-pixel calculations (the conversions between bytes and floats can kill performance, so straight fixed-point is usually better here).

SIMD / SSE can also help here, but has to be balanced with its relative ugliness and reduced portability.



as for an SSD: dunno...

I suspect at-present, SSDs are more of a novelty though...

even if the SSD is very fast, the speed of the SATA bus will still generally limit them to about 400MB/s or so, so using compression for speedups still isn't completely ruled out (though, granted, it will make much less of a difference than it does with a 50MB/s or 100MB/s disk-IO speed, and avoidance of "costly" encodings may make more sense).
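(as a rough worked example, with made-up but plausible numbers: 200MB of assets off a 100MB/s disk takes ~2.0s raw; at a 2:1 deflate ratio only 100MB is read, so ~1.0s plus a little inflate time, nearly halving the load. at 400MB/s the same data is 0.5s raw vs ~0.25s compressed plus inflate time, so the win shrinks toward the speed of the decompressor itself.)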


granted, it seems that with the newly-optimized PNG Paeth filter, and faster DXT encoding, the main time-waster (besides disk-IO) during loading is now... apparently... the code for scrolling the console buffer...
(this happens whenever a console print message prints a newline character, which involves moving the entire console up by 1 line, and is basically just a memory copy). (side note: this console operation just naively uses "for()" loops to copy memory...).


and, shaved a few seconds off the startup time (down to about 3 seconds to start up the engine, and around 9 seconds to load the world).

Edited by cr88192, 06 February 2013 - 04:15 PM.


#29 BGB   Crossbones+   -  Reputation: 1554


Posted 06 February 2013 - 06:57 PM

Hodgman, on 06 Feb 2013 - 07:28, said:


cr88192, on 06 Feb 2013 - 03:30, said:
feeling like it, went and wrote a "vaguely fast" DXT1 encoder (DXT1F, where the F means "Fast").

the goal here was mostly to convert reasonably quickly, with image quality not necessarily being as much of a high priority.
as-is, I don't know of any good ways to make it notably faster (I know a few possible ways, but they aren't pretty).

Nice work! :)
If you still find yourself interested in DXT1F:
* The code looks like it would be possible to port over to entirely use the SSE registers instead of general purpose int ones, which might reduce the instruction count a lot.
* You could also make it so that the user can use multiple threads to perform the processing -- e.g. Instead of (or as well as) having an API that encodes a whole image at once, you could add two extra parameters -- the row to begin working from, and the row to end on (which should both be multiples of 4, or whatever the block size is). The user could then call that function multiple times with different start/end parameters on different threads to produce different rows of blocks concurrently.

it is possible.
SSE works, but the use of the compiler intrinsics is a little ugly, and would require some use of #ifdef's.
if I were writing it in ASM, I would probably consider this a lot more.


the downside IME with multithreaded encoders/decoders is generally that the overhead of fine-grained thread synchronization will often outweigh any performance gains. typically I use threads for more coarse-grained operations (in this case, it would probably amount to entire frames).

in this case, a possible scenario looks something like:
main (renderer) thread requests a frame-decode (adds a job-entry to a work queue);
worker thread comes along, fetches and then executes the work item (decoding the video frame, and probably DXT encoding it);
when done, the worker marks the job as completed;
the next frame, the renderer may check the job, see that the frame was decoded, and then upload the compressed image(s) to OpenGL (I haven't yet researched using OpenGL from multiple threads). a rough sketch of this handoff follows below.
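A minimal sketch of that handoff in C (names and the codec call are hypothetical; the completion flag is polled by the renderer, as described above):

#include <GL/gl.h>
#include <GL/glext.h>   /* for GL_COMPRESSED_RGB_S3TC_DXT1_EXT */

typedef struct {
    volatile int done;     /* set by the worker once the frame is ready */
    int frame;             /* which video frame to decode               */
    unsigned char *dxt;    /* decoded + DXT1-encoded output blocks      */
    int w, h, size;
} DecodeJob;

/* hypothetical codec call: decode frame 'n' and DXT1-encode into 'out' */
void decode_and_dxt_encode(int n, unsigned char *out);

/* worker thread: do the heavy work, then flag completion */
void worker_run(DecodeJob *job)
{
    decode_and_dxt_encode(job->frame, job->dxt);
    job->done = 1;
}

/* renderer, once per frame: upload only if the worker has finished */
void poll_job(DecodeJob *job, GLuint tex)
{
    if (!job->done) return;
    glBindTexture(GL_TEXTURE_2D, tex);
    glCompressedTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, job->w, job->h,
                              GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                              job->size, job->dxt);
    job->done = 0;
}

(a volatile flag is enough for a sketch, but a real implementation would want a proper memory barrier or atomic.)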


currently video-mapping is single-threaded though (all done inline in the main thread), and works mostly by trying to keep the decoding reasonably fast (to hopefully avoid impacting frame-rate).

another possible solution though would be just moving all of the video-map decoding to a single big thread, then using flags to indicate when each texture-buffer has been updated (and possibly a lock to avoid tearing). this should work ok, except that having too many video-maps active at once would likely cause a drop in update-rates for animated textures (say, a person has 30 different video-map textures visible on-screen at once or something, and then their animated textures start skipping frames or similar).


calculating... actually, as-is, just with the current M-JPEG decoder speed, assuming each stream was 256x256 (and a flat image), it could take closer to around 150 concurrent video streams to bog things down, though in practice the number would probably be a little lower... (and I have nowhere near this many video-maps) so, probably no immediate need for work-queues.
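(checking that estimate: one 256x256 @10fps stream is 256*256*10 ≈ 0.66 Mpix/s, so the ~100 Mpix/s JPEG-decode figure from earlier gives roughly 100/0.66 ≈ 150 concurrent streams before decoding alone saturates the thread.)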

(as-is, mostly just has to fit in a fairly small time-window to avoid being annoying, where say 1ms is a long time when the whole frame is ideally under 30ms).

( EDIT/ADD: although 150 streams may seem like a lot, considering the low resolutions and frame-rates involved, it is vaguely comparable to the cost of decoding a 1080p video in a pixels-per-second sense. granted, real-world performance may be worse, as these results were mostly seen while decoding the same image in a tight loop, so likely a lot more stuff was in cache, and adding DXT encoding would add another stage to this process. I would need more realistic tests, and more actual code, to get a better estimate of how many video streams it would take to bog it down. )


as-is, I also have video-recording built into the engine, but in this case there is a single dedicated encoder thread (so, the main thread reads the screen contents, then passes them to the video encoder via a shared context). this generally works ok, but as-is I limit the record frame-rate to 16Hz (with typical rendering resolutions of 800x600 or 1024x768), partly to keep the encoder from bogging down (and partly to avoid using too much HDD space during recording).

granted, the encoder does more work than strictly needed (*1), but most other "simple" options tend to use large amounts of HDD space. though, with video-capture, I guess it is sort of expected that it will chew through large amounts of HDD space.

*1: basically, as-is it is closer to a full JPEG encoder, but I could probably hard-code tables or similar to speed it up some.
maybe doing 1024x768 @24Hz recording could be a goal though.

don't really want to deal with using multiple threads for encoding though.


or such...


EDIT / ADD:
for the video-mapping case, an alternate scenario could be using a video-map stored in a DXT-based format, thus avoiding the whole issue of needing to transcode at render-time. more consideration is needed here (when/where/how). the current "most likely practical" solution is: a simple deflated TLV format containing DXT frames, probably stored in an AVI.

format still too much under mental debate. leaning towards something vaguely JPEG-like WRT file-structure.
in any case, the "decoding" would probably be a loop+switch making the relevant OpenGL calls.

Edited by cr88192, 07 February 2013 - 04:34 PM.


#30 zfvesoljc   Members   -  Reputation: 442


Posted 07 February 2013 - 06:40 AM

While resource loading is probably the biggest hit, another viable thing to consider is in-place loading for code "resources".

i.e.:

http://entland.homelinux.com/blog/2007/02/21/fast-file-loading-ii-load-in-place/



#31 BGB   Crossbones+   -  Reputation: 1554


Posted 07 February 2013 - 02:43 PM

While resource loading is probably the biggest hit, another viable thing to consider is in-place loading for code "resources".
i.e.:
http://entland.homelinux.com/blog/2007/02/21/fast-file-loading-ii-load-in-place/

while the basic strategy works, there are a few potential drawbacks:
data-portability: done poorly, one will end up with files specific to a particular target architecture (such as 32-bit x86 or 64-bit x86), and, depending, a person still needs to do things like address fixups;
adjustments to address the above and improve portability (such as defined width and endianness for values, using fixed base-relative addresses, ...) may risk hurting overall performance (if done poorly);
unless the data is also deflated or similar, disk IO is still likely the main bottleneck.


the solution then, in the design of a binary format, is typically to make a compromise:
the basic format may be loaded as an "image", maybe with the contents deflated;
unpacking the data into a usable form is made fairly trivial.

(while Deflate's compression process can be slow, decompression can be fairly fast).


loading in individual contents may be like:
fetch the relevant data-lump from the image;
inflate the data-lump if needed;
decode the data with a lightweight "unpacking" process: building index structures and (potentially, if needed) doing an endianness swap of any values (which can usually be skipped).

so, index-structures would contain any pointers or similar, and the actual data will consist mostly of simple arrays and tables.
(say, we have some tables, and a few structs with pointers into these tables).
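as a small sketch of that idea (the layout is illustrative, not the actual ExWAD format): the on-disk side stores fixed-width offsets, and "unpacking" just resolves them against the loaded image:

/* on-disk lump: fixed-width, offset-based, no pointers */
typedef struct {
    unsigned int verts_ofs, verts_count;
    unsigned int tris_ofs,  tris_count;
} MeshLumpDisk;

/* in-memory index structure built at load time */
typedef struct {
    float          *verts; unsigned int verts_count;
    unsigned short *tris;  unsigned int tris_count;
} MeshIndex;

/* the whole "unpacking" step: resolve offsets into the (inflated) image */
void unpack_mesh(unsigned char *image, const MeshLumpDisk *d, MeshIndex *m)
{
    m->verts       = (float *)(image + d->verts_ofs);
    m->tris        = (unsigned short *)(image + d->tris_ofs);
    m->verts_count = d->verts_count;
    m->tris_count  = d->tris_count;
}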

for example, in my case, for a few things (mostly VM-related) I am using a format originally based fairly closely on the Quake WAD2 format (which I call "ExWAD"), except that it added support for deflate (and slightly changed the main header, mostly with a larger magic value). a subsequent version diverged further from WAD2, mostly by expanding the name field to 32 bytes (with a 64-byte directory entry), and adding support for directly representing directory trees (more like in a filesystem).

#32 wintertime   Members   -  Reputation: 1862


Posted 08 February 2013 - 07:43 AM

I can see how DXT encoding is nice for textures you want to use for a longer time, but is it really needed to do all that encoding on CPU for a video when you display each frame only once and then throw it away? Also, why not skip the YUV->RGB step on CPU and do it in a shader instead as there is probably not much else going on in the GPU when just watching a video?



#33 samoth   Crossbones+   -  Reputation: 5032


Posted 08 February 2013 - 09:16 AM

I can see how DXT encoding is nice for textures you want to use for a longer time, but is it really needed to do all that encoding on CPU for a video when you display each frame only once and then throw it away?

Video is not usually DXT-compressed, but compressed with other methods (which not only compress a single frame, but also use deltas between frames). It is of course possible to use DXT to compress individual video frames, and since that is a simple solution it might just work for some cases.

As to why one would want to do such a silly amount of work for a frame that one only watches for a split second, the answer is that the amount of data you would need to stream in for uncompressed video is forbidding if you possibly ever want to do anything else. It may be forbidding overall, too, depending on resolution and frame rate.

Let's say you want to play a 480x270 video (1/16 the area of 1080p) at 25fps. Assuming 32bpp, that's about 12MiB/s of data; manageable on its own, but scale the same stream up to 1080p and you're near 200MiB/s. Try to find a harddisk that consistently delivers at such a rate (my SSD has no trouble doing that, but sadly not every consumer has an SSD just yet). Make it only twice as large again in every dimension, and you exceed the theoretical maximum of SATA-600.

Compression will reduce this immense bandwidth pressure by a factor of maybe 500 or 1000, which is just what you need.

About doing YUV->RGB on the GPU, there is no objection to that.

#34 wintertime   Members   -  Reputation: 1862


Posted 08 February 2013 - 09:35 AM

I was assuming already the video is compressed on hdd as MPEG/MJPEG/whatever. What seemed weird to me was that after decompressing it people wanted to use additional CPU time to recompress it as DXT just for speeding up the way from main memory to gpu memory which is much faster than the hdd anyway.



#35 BGB   Crossbones+   -  Reputation: 1554


Posted 08 February 2013 - 03:02 PM

samoth, on 08 Feb 2013 - 09:22, said:

wintertime, on 08 Feb 2013 - 07:49, said:
I can see how DXT encoding is nice for textures you want to use for a longer time, but is it really needed to do all that encoding on CPU for a video when you display each frame only once and then throw it away?

Video is not usually DXT-compressed, but compressed with other methods (which not only compress a single frame, but also use deltas between frames). It is of course possible to use DXT to compress individual video frames, and since that is a simple solution it might just work for some cases.

As to why one would want to do such a silly amount of work for a frame that one only watches for a split second, the answer is that the amount of data you would need to stream in for uncompressed video is forbidding if you possibly ever want to do anything else. It may be forbidding overall, too, depending on resolution and frame rate.

Let's say you want to play a 480x270 video (1/16 the area of 1080p) at 25fps. Assuming 32bpp, that's about 12MiB/s of data; manageable on its own, but scale the same stream up to 1080p and you're near 200MiB/s. Try to find a harddisk that consistently delivers at such a rate (my SSD has no trouble doing that, but sadly not every consumer has an SSD just yet). Make it only twice as large again in every dimension, and you exceed the theoretical maximum of SATA-600.

Compression will reduce this immense bandwidth pressure by a factor of maybe 500 or 1000, which is just what you need.

About doing YUV->RGB on the GPU, there is no objection to that.

my video is typically 256x256 @10fps.
512x512 @10fps is also possible, but not currently used.

the problem with NPOT resolutions (480x270, for example) is that the GPU doesn't like them (and resampling is expensive).
generally, since these are going into textures and mapped onto geometry, we want power-of-two sizes.

the downside of YUV->RGB in shaders is mostly that it would require dedicated shaders, vs being able to use the video in roughly all the same ways as a normal texture (as input to various other shaders, ...). this would probably be fine for cutscenes or similar though (ironically, not really using video for cutscenes at present...).


otherwise, video-mapping is a similar idea to id Software's RoQ videos.
however, RoQ was using a different compression strategy (Vector Quantization).
for example, Doom3 used RoQ for lots of little things in-game (decorative in-game video-displays, fire effects, ...).

theoretically, a DXT-based VQ codec could be derived (and could potentially decode very quickly, since you could unpack directly to DXT). I had started looking into this. I didn't like my initial design results though (too complicated for what it was doing).


as for MJPEG:
I was mostly using code I already had at the time (*1), and the format has a few advantages for animated-texture use, namely that it is possible to decode frames in arbitrary order and also easily skip frames (and, also it is not legally encumbered). also, for animated textures, raw compression rate is less important and there is often less opportunity for effective use of motion compensation.

the downside is mostly the narrow time-window to decode frames during rendering (at least, with the current single-threaded design).


*1: originally for texture loading, and also I had some AVI handling code around (years earlier, I had written some more-generic video-playback stuff). (and, the JPEG loader was originally written due to frustration with libjpeg).

also, it is possible to see the basic animated textures in normal video players (like Media Player Classic), but granted, this is only a minor detail.

admittedly, when I first tried using video mapping (on a Radeon 9000), it was too slow to be worthwhile. some years later, hardware is faster, so now it works basically ok.

wintertime, on 08 Feb 2013 - 09:41, said:
I was assuming already the video is compressed on hdd as MPEG/MJPEG/whatever. What seemed weird to me was that after decompressing it people wanted to use additional CPU time to recompress it as DXT just for speeding up the way from main memory to gpu memory which is much faster than the hdd anyway.

well, yes, there is little to say that there would actually be benefit in doing so.
the speed of my "DXT1F" encoder doesn't seem to be "that" bad though, so there is at least a small chance it could derive some benefit.

as noted, it mostly depends on the ability to shim it into the JPEG decoding process, so it isn't so much "encoding DXTn" as much as "skipping fully decoding to RGB" (and only using about 1/4 as much arithmetic as the full conversion).

theoretically, this route "could" actually outperform the use of uncompressed RGBA for the final steps of decoding the JPEG images. (and, otherwise, you still need to put the DCT block-planes into raster order one way or another...).

as for time-usage, if we are rendering at 30fps, but the video framerate is 10fps, then each video-frame will be on-screen for roughly 3 frames (or 4-5 frames if rendering at 40-50fps).


(decided to leave out some stuff)

#36 dougbinks   Members   -  Reputation: 489


Posted 09 February 2013 - 04:18 AM

For those interested in fast CPU side texture compression, say for streaming JPG->DXT etc., check out the code and article here: http://software.intel.com/en-us/vcsource/samples/dxt-compression


Some searching should also help find a host of other approaches. I'd not recommend this in general for loading textures in most situations, as better quality/performance ratios likely come from simply loading packed (Zip etc.) pre-compressed textures; but if you have huge amounts of detailed textures you need to stream, this is a good approach.



#37 BGB   Crossbones+   -  Reputation: 1554


Posted 09 February 2013 - 09:52 AM

For those interested in fast CPU side texture compression, say for streaming JPG->DXT etc., check out the code and article here: http://software.intel.com/en-us/vcsource/samples/dxt-compression
Some searching should also help find a host of other approaches. I'd not recommend this in general for loading textures in most situations, as better quality/performance ratios likely come from simply loading packed (Zip etc.) pre-compressed textures; but if you have huge amounts of detailed textures you need to stream, this is a good approach.

felt curious, tested my own "DXT1F" encoder (with MSVC):

I am getting 112Mp/s if compiled with optimizations turned on ("/O2"), and 42Mp/s debug ("/Z7").

granted, this is single-threaded scalar code.
and it also assumes only 2 colors.


the version written to shim into the JPEG decoder (by itself) is actually pulling off 314Mp/s (180Mp/s debug).
still single-threaded scalar code.

the main difference is that it works on planar YUV input, and assumes 4:2:0, and requires less math than for RGBA (and we only need the min and max Y, UV=simple average).
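as a sketch of that "min/max Y, average UV" endpoint selection for one 4x4 block (names and layout illustrative; ycbcr_to_rgb is the fixed-point converter sketched earlier in the thread):

void ycbcr_to_rgb(int y, int cb, int cr, unsigned char *rgb);

/* derive the two DXT1 color endpoints for one 4x4 block of planar 4:2:0 input */
void block_endpoints(const unsigned char y[16],  /* 4x4 luma           */
                     const unsigned char cb[4],  /* 2x2 chroma (4:2:0) */
                     const unsigned char cr[4],
                     unsigned char rgb_min[3], unsigned char rgb_max[3])
{
    int i, ymin = 255, ymax = 0, cbs = 0, crs = 0;
    for (i = 0; i < 16; i++) {
        if (y[i] < ymin) ymin = y[i];
        if (y[i] > ymax) ymax = y[i];
    }
    for (i = 0; i < 4; i++) { cbs += cb[i]; crs += cr[i]; }
    /* only two color conversions per block, vs 16 for a full RGB decode */
    ycbcr_to_rgb(ymin, cbs / 4, crs / 4, rgb_min);
    ycbcr_to_rgb(ymax, cbs / 4, crs / 4, rgb_max);
}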

(EDIT/ADD: after changing it to take on more of the work, basically working on raw IDCT block output, rather than precooked output, it is 170Mp/s optimized).

not entirely sure what a SIMD or multithreaded version would do here.

granted, this version would be used on the front-end of a JPEG decoder, which would make it slower.

(EDIT/ADD:
http://pastebin.com/emDK9jwc
http://pastebin.com/EyEY5W9P
)

CPU speed (my case) = 2.8 GHz.

or such...

Edited by cr88192, 09 February 2013 - 11:45 AM.


#38 samoth   Crossbones+   -  Reputation: 5032


Posted 09 February 2013 - 03:35 PM

What seemed weird to me was that after decompressing it people wanted to use additional CPU time to recompress it as DXT just for speeding up the way from main memory to gpu memory which is much faster than the hdd anyway.

This may still be a valid reason.

Although you typically have about 8 GiB/s of bandwidth over PCIe, several hundreds of megabytes per second are still a non-negligible share. If you do nothing else, that's no problem, but if you possibly have other stuff to transfer, it may be.

Transfers also have a fixed overhead and cause a complete GPU stall on typical present-day consumer hardware (the driver will either let the GPU render, or it will do a PCIe transfer, not both at the same time), so transferring a frame at a time is not an efficient solution. Transferring many frames at a time results in much better parallelism, however, it takes forbidding amounts of GPU memory. DXT compression alleviates that.

#39 BGB   Crossbones+   -  Reputation: 1554


Posted 09 February 2013 - 11:50 PM


What seemed weird to me was that after decompressing it people wanted to use additional CPU time to recompress it as DXT just for speeding up the way from main memory to gpu memory which is much faster than the hdd anyway.

This may still be a valid reason.

Although you typically have about 8 GiB/s of bandwidth over PCIe, several hundreds of megabytes per second are still a non-negligible share. If you do nothing else, that's no problem, but if you possibly have other stuff to transfer, it may be.

Transfers also have a fixed overhead and cause a complete GPU stall on typical present-day consumer hardware (the driver will either let the GPU render, or it will do a PCIe transfer, not both at the same time), so transferring a frame at a time is not an efficient solution. Transferring many frames at a time results in much better parallelism, however, it takes forbidding amounts of GPU memory. DXT compression alleviates that.


pretty much.

nothing here says this is actually efficient (traditional animated textures are still generally a better solution in most cases; my engine supports both).


another motivation is something I had suspected already (partly confirmed in benchmarks):
the direct YUV to DXTn route is apparently slightly faster than going all the way to raw RGB.

still need more code though to confirm that everything is actually working, probably followed by more fine-tuning.

(EDIT/ADD: sadly, it turns out my JPEG decoder isn't quite as fast as I had thought I had remembered, oh well...).


(EDIT/ADD 2: above, as-in, the current JPEG->DXT5 transcoding route pulls off only about 38Mp/s (optimized "/O2", ~ 20Mp/s debug), whereas previously I had thought I had remembered things being faster. (granted, am getting tempted to use SIMD intrinsics for a few things...).

note that current RGBA video frames have both an RGB(YUV) image and an embedded Alpha image (mono, also encoded as a JPEG).
both layers are decoded/transcoded and recombined into a composite DXT5 image.


while looking around online, did run across this article though:
http://www.nvidia.com/object/real-time-ycocg-dxt-compression.html
nifty idea... but granted this would need special shaders...).

Edited by cr88192, 10 February 2013 - 08:13 PM.




