#Actual cr88192

Posted 09 February 2013 - 11:45 AM

For those interested in fast CPU-side texture compression, say for streaming JPG->DXT etc., check out the code and article here: http://software.intel.com/en-us/vcsource/samples/dxt-compression
 
Some searching should also turn up a host of other approaches. I wouldn't recommend this in general for loading textures: in most situations the better quality/performance trade-off is likely to be simply loading pre-compressed textures from a packed archive (Zip etc.). But if you have huge amounts of detailed textures that you need to stream, this is a good approach.

felt curious, so I tested my own "DXT1F" encoder (built with MSVC):

I am getting 112 Mp/s when compiled with optimizations turned on ("/O2"), and 42 Mp/s in a debug build ("/Z7").

granted, this is single-threaded scalar code.
and it also assumes only 2 colors.
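
To make the "only 2 colors" idea concrete, here is a minimal sketch of a two-endpoint DXT1 block encoder. It is not the DXT1F code from the post. It reads "assumes only 2 colors" as "only the two endpoint colors are used, never the interpolated ones", which is an assumption. The function names, the cheap luma weighting, and the midpoint threshold are all made up for illustration.

```c
#include <stdint.h>

/* Pack 8-bit R,G,B (first three bytes at p) into RGB565. */
static uint16_t pack565(const uint8_t *p)
{
    return (uint16_t)(((p[0] >> 3) << 11) | ((p[1] >> 2) << 5) | (p[2] >> 3));
}

/* Encode one 4x4 RGBA block into 8 bytes of DXT1.
   stride is the distance between rows in bytes.
   Only indices 0 (color0) and 1 (color1) are emitted; those two
   indices mean the same thing in both DXT1 modes, so the endpoint
   order does not matter here. Alpha is ignored. */
void dxt1_encode_block_2color(const uint8_t *rgba, int stride, uint8_t out[8])
{
    const uint8_t *px[16];
    int luma[16];
    int lo = 0, hi = 0;            /* darkest / brightest pixel so far */
    uint16_t c0, c1;
    uint32_t bits = 0;
    int x, y, i;

    for (y = 0; y < 4; y++) {
        for (x = 0; x < 4; x++) {
            i = y * 4 + x;
            px[i] = rgba + y * stride + x * 4;
            /* cheap luma approximation: R + 2G + B */
            luma[i] = px[i][0] + 2 * px[i][1] + px[i][2];
            if (luma[i] < luma[lo]) lo = i;
            if (luma[i] > luma[hi]) hi = i;
        }
    }

    c0 = pack565(px[hi]);          /* bright endpoint */
    c1 = pack565(px[lo]);          /* dark endpoint */

    /* Pixels below the luma midpoint take index 1 (dark endpoint);
       everything else keeps index 0. Pixel 0 sits in the low bits. */
    for (i = 0; i < 16; i++) {
        if (2 * luma[i] < luma[hi] + luma[lo])
            bits |= (uint32_t)1 << (2 * i);
    }

    /* DXT1 block layout: color0, color1, then index bits, little-endian. */
    out[0] = (uint8_t)(c0 & 0xFF);  out[1] = (uint8_t)(c0 >> 8);
    out[2] = (uint8_t)(c1 & 0xFF);  out[3] = (uint8_t)(c1 >> 8);
    out[4] = (uint8_t)(bits & 0xFF);
    out[5] = (uint8_t)((bits >> 8) & 0xFF);
    out[6] = (uint8_t)((bits >> 16) & 0xFF);
    out[7] = (uint8_t)((bits >> 24) & 0xFF);
}
```

Calling this once per 4x4 tile of an RGBA image (with stride = image width * 4) fills one 8-byte block of the output. Skipping the interpolated colors costs quality on gradients but keeps the inner loop to a compare and a shift per pixel.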


the version written to shim into the JPEG decoder is (measured by itself) actually pulling off 314 Mp/s (180 Mp/s debug).
still single-threaded scalar code.

the main difference is that it works on planar YUV input, assumes 4:2:0 subsampling, and requires less math than the RGBA path (we only need the min and max Y; U and V are each a simple average).
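
A rough sketch of the planar-YUV idea, under stated assumptions: this is not the code from the post. It assumes 8-bit JFIF-style YCbCr, one 4x4 luma tile with its matching 2x2 chroma samples, and made-up function names. The endpoints are built from (max Y, average U, average V) and (min Y, average U, average V), and each pixel's index is chosen from its Y value alone.

```c
#include <stdint.h>

static int clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* JFIF-style YCbCr -> RGB565, fixed point with 8 fractional bits
   (359/256 ~ 1.402, 88/256 ~ 0.344, 183/256 ~ 0.714, 454/256 ~ 1.772). */
static uint16_t ycc_to_565(int y, int cb, int cr)
{
    int u = cb - 128, v = cr - 128;
    int r = clamp255(y + (359 * v) / 256);
    int g = clamp255(y - (88 * u + 183 * v) / 256);
    int b = clamp255(y + (454 * u) / 256);
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Encode one 4x4 tile from planar 4:2:0 data into 8 bytes of DXT1.
   yp points at the tile's top-left luma sample, up/vp at its
   top-left chroma sample (the tile covers 2x2 chroma samples). */
void dxt1_encode_block_yuv420(const uint8_t *yp, int ystride,
                              const uint8_t *up, const uint8_t *vp,
                              int cstride, uint8_t out[8])
{
    int ymin = 255, ymax = 0;
    int u, v, x, y, s;
    uint16_t c0, c1;
    uint32_t bits = 0;

    /* Only the luma range is needed, not a per-pixel color. */
    for (y = 0; y < 4; y++) {
        for (x = 0; x < 4; x++) {
            s = yp[y * ystride + x];
            if (s < ymin) ymin = s;
            if (s > ymax) ymax = s;
        }
    }

    /* Chroma: rounded average of the 2x2 samples under this tile. */
    u = (up[0] + up[1] + up[cstride] + up[cstride + 1] + 2) / 4;
    v = (vp[0] + vp[1] + vp[cstride] + vp[cstride + 1] + 2) / 4;

    c0 = ycc_to_565(ymax, u, v);   /* bright endpoint */
    c1 = ycc_to_565(ymin, u, v);   /* dark endpoint */

    /* Index 1 (dark endpoint) for pixels below the luma midpoint. */
    for (y = 0; y < 4; y++) {
        for (x = 0; x < 4; x++) {
            s = yp[y * ystride + x];
            if (2 * s < ymin + ymax)
                bits |= (uint32_t)1 << (2 * (y * 4 + x));
        }
    }

    out[0] = (uint8_t)(c0 & 0xFF);  out[1] = (uint8_t)(c0 >> 8);
    out[2] = (uint8_t)(c1 & 0xFF);  out[3] = (uint8_t)(c1 >> 8);
    out[4] = (uint8_t)(bits & 0xFF);
    out[5] = (uint8_t)((bits >> 8) & 0xFF);
    out[6] = (uint8_t)((bits >> 16) & 0xFF);
    out[7] = (uint8_t)((bits >> 24) & 0xFF);
}
```

Compared with an RGBA path, the per-pixel work here is just a min/max over one byte and a threshold test, and the color conversion runs only twice per block instead of once per pixel, which is one plausible reason for the higher throughput.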

(EDIT/ADD: after changing it to take on more of the work, basically operating on raw IDCT block output rather than precooked output, it runs at 170 Mp/s optimized.)

not entirely sure how much a SIMD or multithreaded version would gain here.

granted, this version would be used as the front-end of a JPEG decoder, and the decoding work itself would make the whole thing slower.

(EDIT/ADD:
http://pastebin.com/emDK9jwc
http://pastebin.com/EyEY5W9P
)

CPU speed (my case) = 2.8 GHz.
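
For a rough sense of scale, the throughput figures can be turned into cycles per pixel, assuming "Mp/s" means 10^6 pixels per second and the core actually runs at the stated 2.8 GHz during the test:

```latex
\[
\frac{2.8\times10^{9}\ \text{cycles/s}}{112\times10^{6}\ \text{pixels/s}} = 25\ \text{cycles/pixel},
\qquad
\frac{2.8\times10^{9}}{314\times10^{6}} \approx 8.9,
\qquad
\frac{2.8\times10^{9}}{170\times10^{6}} \approx 16.5
\]
```

So the RGBA encoder spends about 25 cycles per pixel, the planar-YUV version about 9, and the version working on raw IDCT output about 16.5.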

or such...
