felt curious, tested my own "DXT1F" encoder (with MSVC):
For those interested in fast CPU side texture compression, say for streaming JPG->DXT etc., check out the code and article here: http://software.intel.com/en-us/vcsource/samples/dxt-compression
Some searching should also help find a host of other approaches. I wouldn't recommend this in general for loading textures: in most situations the better quality/performance ratio likely comes from simply loading pre-compressed textures from a packed archive (Zip etc.). But if you have huge amounts of detailed textures that you need to stream, this is a good approach.
I am getting 112Mp/s if compiled with optimizations turned on ("/O2"), and 42Mp/s debug ("/Z7").
granted, this is single-threaded scalar code.
and it also assumes only 2 colors.
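to give an idea of what a "2 colors only" scalar encoder can look like, here is a minimal sketch. this is not the actual "DXT1F" code; the function names, the cheap (R+2G+B)/4 luma, and picking the darkest and brightest pixels as endpoints are all assumptions made for illustration. it writes a standard 8-byte DXT1 block (two RGB565 endpoints, then 16 2-bit indices).

```c
#include <stdint.h>

/* pack 8-bit R, G, B into a 16-bit RGB565 value */
static uint16_t pack565(int r, int g, int b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* hypothetical helper: encode one 4x4 block of RGBA pixels
   (row-major, 64 bytes) as 8 bytes of DXT1; alpha is ignored */
void dxt1_encode_block_2color(const uint8_t rgba[64], uint8_t out[8])
{
    /* luma step (0 = darkest .. 3 = brightest) -> DXT1 index,
       when color0 is the bright endpoint and color1 the dark one */
    static const uint8_t idx_from_q[4] = { 1, 3, 2, 0 };
    int y[16], i, lo = 0, hi = 0, range;
    uint16_t c0, c1, t;
    uint32_t bits = 0, flip = 0;

    /* find the darkest and brightest pixel by approximate luma */
    for (i = 0; i < 16; i++) {
        y[i] = (rgba[i*4] + 2 * rgba[i*4+1] + rgba[i*4+2]) >> 2;
        if (y[i] < y[lo]) lo = i;
        if (y[i] > y[hi]) hi = i;
    }
    c0 = pack565(rgba[hi*4], rgba[hi*4+1], rgba[hi*4+2]);
    c1 = pack565(rgba[lo*4], rgba[lo*4+1], rgba[lo*4+2]);
    range = y[hi] - y[lo];

    /* 4-color mode needs color0 > color1; swapping the endpoints
       means swapping index 0<->1 and 2<->3, which is "index ^ 1" */
    if (c0 < c1) { t = c0; c0 = c1; c1 = t; flip = 1; }

    /* if both endpoints are equal, all indices stay 0 (= color0) */
    if (c0 != c1 && range > 0) {
        for (i = 0; i < 16; i++) {
            int q = ((y[i] - y[lo]) * 3 + range / 2) / range;
            bits |= (uint32_t)(idx_from_q[q] ^ flip) << (2 * i);
        }
    }

    /* little-endian output: color0, color1, then the index bits */
    out[0] = (uint8_t)(c0 & 255);  out[1] = (uint8_t)(c0 >> 8);
    out[2] = (uint8_t)(c1 & 255);  out[3] = (uint8_t)(c1 >> 8);
    out[4] = (uint8_t)(bits & 255);
    out[5] = (uint8_t)((bits >> 8) & 255);
    out[6] = (uint8_t)((bits >> 16) & 255);
    out[7] = (uint8_t)((bits >> 24) & 255);
}
```

a version tuned for speed would likely replace the per-pixel divide with a multiply by a precomputed reciprocal; the divide is kept here to keep the sketch short.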
the version written to shim into the JPEG decoder (by itself) is actually pulling off 314Mp/s (180Mp/s debug).
still single-threaded scalar code.
the main difference is that it works on planar YUV input and assumes 4:2:0, which requires less math than RGBA (we only need the min and max Y, and U and V are each just a simple average over the block).
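a rough sketch of why the planar YUV case is cheaper, under stated assumptions: this is not the real shim code, the function names are made up, and the fixed-point constants are the usual full-range JPEG/JFIF YCbCr ones scaled by 256. a 4x4 luma block covers a 2x2 patch of chroma in 4:2:0, so the chroma is averaged once, and only two YUV->RGB conversions are needed per block (one for min Y, one for max Y). since both endpoints share the same chroma, the max-Y endpoint never packs to a smaller RGB565 value than the min-Y one, so no endpoint swap is needed.

```c
#include <stdint.h>

/* clamp an 8.8 fixed-point value to an integer in 0..255 */
static int clamp_fix8(int v)
{
    if (v < 0) return 0;
    v >>= 8;
    return v > 255 ? 255 : v;
}

/* convert Y plus centered chroma (cb, cr in -128..127) to RGB565 */
static uint16_t yuv_to_565(int y, int cb, int cr)
{
    int r = clamp_fix8((y << 8) + 359 * cr + 128);
    int g = clamp_fix8((y << 8) - 88 * cb - 183 * cr + 128);
    int b = clamp_fix8((y << 8) + 454 * cb + 128);
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* hypothetical helper: encode a 4x4 luma block plus its 2x2 U and V
   samples (4:2:0) as 8 bytes of DXT1 */
void dxt1_encode_block_yuv420(const uint8_t y[16], const uint8_t u[4],
                              const uint8_t v[4], uint8_t out[8])
{
    /* luma step (0 = darkest .. 3 = brightest) -> DXT1 index */
    static const uint8_t idx_from_q[4] = { 1, 3, 2, 0 };
    int i, ymin = 255, ymax = 0, range, cb, cr;
    uint16_t c0, c1;
    uint32_t bits = 0;

    /* only min and max Y are needed for the endpoints */
    for (i = 0; i < 16; i++) {
        if (y[i] < ymin) ymin = y[i];
        if (y[i] > ymax) ymax = y[i];
    }
    /* simple rounded average of the four chroma samples */
    cb = ((u[0] + u[1] + u[2] + u[3] + 2) >> 2) - 128;
    cr = ((v[0] + v[1] + v[2] + v[3] + 2) >> 2) - 128;

    c0 = yuv_to_565(ymax, cb, cr);   /* bright endpoint */
    c1 = yuv_to_565(ymin, cb, cr);   /* dark endpoint, c1 <= c0 */
    range = ymax - ymin;

    /* if both endpoints are equal, all indices stay 0 (= color0) */
    if (c0 != c1 && range > 0) {
        for (i = 0; i < 16; i++) {
            int q = ((y[i] - ymin) * 3 + range / 2) / range;
            bits |= (uint32_t)idx_from_q[q] << (2 * i);
        }
    }

    out[0] = (uint8_t)(c0 & 255);  out[1] = (uint8_t)(c0 >> 8);
    out[2] = (uint8_t)(c1 & 255);  out[3] = (uint8_t)(c1 >> 8);
    out[4] = (uint8_t)(bits & 255);
    out[5] = (uint8_t)((bits >> 8) & 255);
    out[6] = (uint8_t)((bits >> 16) & 255);
    out[7] = (uint8_t)((bits >> 24) & 255);
}
```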
(EDIT/ADD: after changing it to take on more of the work, basically working on raw IDCT block output, rather than precooked output, it is 170Mp/s optimized).
not entirely sure what a SIMD or multithreaded version would do here.
granted, this version would be used on the front-end of a JPEG decoder, which would make it slower.
CPU speed (my case) = 2.8 GHz.
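for comparing against other machines, the throughput numbers can be turned into cycles per pixel. a trivial sketch, assuming "Mp/s" means megapixels per second and that the clock actually stays at the stated 2.8 GHz (the function name is made up):

```c
/* cycles spent per pixel, given clock in GHz and throughput in Mpixels/s */
double cycles_per_pixel(double clock_ghz, double mpix_per_s)
{
    return (clock_ghz * 1000.0) / mpix_per_s;
}
```

with the numbers above: 112Mp/s works out to 25 cycles per pixel, 314Mp/s to about 8.9, and 170Mp/s to about 16.5.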