feeling like it, went and wrote a "vaguely fast" DXT1 encoder (DXT1F, where the F means "Fast").
the goal here was mostly to convert reasonably quickly, with image quality being a lower priority.
as-is, I don't know of any good ways to make it notably faster (I know a few possible ways, but they aren't pretty).
If you still find yourself interested in DXT1F:
* The code looks like it could be ported to run entirely in the SSE registers instead of the general-purpose integer ones, which might reduce the instruction count a lot.
* You could also let the user spread the work across multiple threads. Instead of (or as well as) an API that encodes a whole image at once, you could add two extra parameters: the row to begin working from and the row to end on (both should be multiples of 4, which is the DXT1 block size). The user could then call that function several times with different start/end parameters on different threads to produce different rows of blocks concurrently.
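
To make the row-range idea concrete, here is a rough sketch in C. It is not the DXT1F code: the name `dxt1f_encode_rows`, its parameter order, and the very crude min/max block encoder inside it are all made up for illustration, and the input is assumed to be tightly packed RGBA with a width that is a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack 8-bit RGB into a 16-bit RGB565 value. */
static uint16_t pack565(int r, int g, int b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Hypothetical stand-in for the real block encoder: uses the per-channel
   max and min of the 4x4 block as the two endpoints, then picks each
   pixel's index from where its r+g+b sum falls between them. */
static void encode_block(const uint8_t *px, int stride, uint8_t out[8])
{
    /* Position 0..3 from the low endpoint to the high one, mapped to the
       DXT1 index order (0 = c0, 1 = c1, 2 = 2/3 c0, 3 = 1/3 c0). */
    static const uint8_t map[4] = {1, 3, 2, 0};
    int lo[3] = {255, 255, 255}, hi[3] = {0, 0, 0};

    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            for (int c = 0; c < 3; c++) {
                int v = px[y * stride + x * 4 + c];
                if (v < lo[c]) lo[c] = v;
                if (v > hi[c]) hi[c] = v;
            }

    uint16_t c0 = pack565(hi[0], hi[1], hi[2]);
    uint16_t c1 = pack565(lo[0], lo[1], lo[2]);
    out[0] = (uint8_t)(c0 & 0xFF); out[1] = (uint8_t)(c0 >> 8);
    out[2] = (uint8_t)(c1 & 0xFF); out[3] = (uint8_t)(c1 >> 8);

    int slo = lo[0] + lo[1] + lo[2];
    int range = hi[0] + hi[1] + hi[2] - slo;

    for (int y = 0; y < 4; y++) {
        uint8_t bits = 0;
        for (int x = 0; x < 4; x++) {
            /* If both endpoints quantize to the same colour, keep index 0
               so we never select the transparent entry. */
            if (c0 != c1) {
                const uint8_t *p = px + y * stride + x * 4;
                int s = p[0] + p[1] + p[2];
                int t = ((s - slo) * 3 + range / 2) / range;
                bits |= (uint8_t)(map[t] << (2 * x));
            }
        }
        out[4 + y] = bits;
    }
}

/* Hypothetical API: encode pixel rows [row_start, row_end) of a
   width x height RGBA image. `out` points at the start of the buffer for
   the WHOLE image; each call only writes the blocks for its own rows.
   Returns 0 on success, -1 if the arguments are not block-aligned. */
int dxt1f_encode_rows(const uint8_t *rgba, int width, int height,
                      int row_start, int row_end, uint8_t *out)
{
    if (width % 4 || row_start % 4 || row_end % 4) return -1;
    if (row_start < 0 || row_end > height || row_start > row_end) return -1;

    int stride = width * 4;          /* bytes per pixel row */
    int blocks_per_row = width / 4;  /* 8 output bytes per block */

    for (int y = row_start; y < row_end; y += 4) {
        uint8_t *dst = out + (size_t)(y / 4) * blocks_per_row * 8;
        for (int x = 0; x < width; x += 4) {
            encode_block(rgba + (size_t)y * stride + (size_t)x * 4, stride, dst);
            dst += 8;
        }
    }
    return 0;
}
```

Because each call only reads the shared input and only writes the output bytes belonging to its own rows of blocks, calls with non-overlapping row ranges can run on separate threads without any locking. For an image of height 512, for example, one thread could take rows 0 to 256 and another rows 256 to 512, both passing the same `out` pointer.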