I'm now going to work on a few optimizations. Even though my blur is 500 times faster than ATI's, there's still room for improvement. A few "if"s can be hoisted out of the loops, and I can generate the blur in monochromatic space instead of blurring a full RGB image. I'm aiming for a few tens of seconds at most to generate a single 1024x1024x6 cube starfield (it still takes a couple of minutes right now, but that's better than the 2 or 3 days required with ATI's blur :)).
By the way, if you are wondering, I'm using a 2-pass blur with an O(n^2) loop, while ATI's code seems to use a brute-force blur with an O(n^4) loop. This is only possible because I'm using a box filter (with a Gaussian filter I'd have to use an O(n^3) loop).
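For readers who want to see why a box filter gets away with so little work, here is a minimal sketch of a 2-pass box blur on a monochrome image. It is not the actual code from this project: the function names, the list-of-lists image layout and the clamp-to-edge border handling are all assumptions made for illustration. The trick is that a box filter can be separated into a horizontal and a vertical pass, and each pass can keep a running sum, so the cost per pixel does not depend on the blur radius.

```python
def box_blur_1d(row, radius):
    """Blur one row with a box filter of width 2*radius+1 using a running sum.

    Border handling is clamp-to-edge (an assumption, not taken from the post).
    """
    n = len(row)
    width = 2 * radius + 1
    # Sum of the window centred on index 0, with out-of-range samples clamped.
    total = sum(row[min(max(i, 0), n - 1)] for i in range(-radius, radius + 1))
    out = [0.0] * n
    for x in range(n):
        out[x] = total / width
        # Slide the window one step: add the entering sample, drop the leaving one.
        entering = row[min(x + radius + 1, n - 1)]
        leaving = row[max(x - radius, 0)]
        total += entering - leaving
    return out


def box_blur_2d(image, radius):
    """Two-pass box blur of a monochrome image stored as a list of rows."""
    # Pass 1: blur every row horizontally.
    rows = [box_blur_1d(r, radius) for r in image]
    # Pass 2: blur every column by transposing, blurring rows, transposing back.
    cols = [box_blur_1d(list(c), radius) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


if __name__ == "__main__":
    # A single bright "star" in the middle of a 5x5 black image.
    img = [[0.0] * 5 for _ in range(5)]
    img[2][2] = 9.0
    for line in box_blur_2d(img, 1):
        print(line)
```

With a Gaussian kernel the weights differ per tap, so the running-sum shortcut no longer applies and each pass has to loop over the kernel as well, which is where the extra factor of n comes from.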
In German, this would be a "Wink mit dem Zaunpfahl" (a very broad hint) ;-)