FWIW I also use a system similar to what is presented in that Battlefield paper. On my Xeon W3550, my SSE-optimized code gets through 16k sphere-frustum checks in about 0.12 ms (1 thread), which is plenty fast for me. It didn't take much effort to rework the data structures so the spheres are accessed linearly in memory, and the code really ended up considerably simpler (IMO).
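To make the "accessed linearly in memory" part concrete, here is a rough sketch of one possible layout. It is not the code from this post or from the Battlefield paper: the names (`SphereSoA`, `Plane`, `cullSpheres`) are made up for illustration, the planes are assumed to have normals pointing into the frustum, and the loop is plain scalar C++ rather than SSE so it stays portable. With one array per component, loading four spheres at a time with SSE becomes straightforward.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical structure-of-arrays layout: one contiguous array per component,
// so the culling loop walks memory front to back.
struct SphereSoA {
    std::vector<float> x, y, z, radius;
};

// Plane n.p + d = 0, with the normal (nx, ny, nz) assumed to point into the frustum.
struct Plane {
    float nx, ny, nz, d;
};

// Sets visible[i] to 1 if sphere i is not entirely behind any of the six planes,
// otherwise 0. This is the usual conservative test: a sphere near a frustum
// corner can be reported visible even though it is just outside.
void cullSpheres(const SphereSoA& s,
                 const std::array<Plane, 6>& planes,
                 std::vector<std::uint8_t>& visible) {
    const std::size_t n = s.x.size();
    assert(s.y.size() == n && s.z.size() == n && s.radius.size() == n);

    visible.assign(n, 1);
    for (const Plane& p : planes) {
        for (std::size_t i = 0; i < n; ++i) {
            // Signed distance from the sphere center to the plane.
            const float dist = p.nx * s.x[i] + p.ny * s.y[i] + p.nz * s.z[i] + p.d;
            // Entirely on the outside of this plane: the sphere is culled.
            if (dist < -s.radius[i]) {
                visible[i] = 0;
            }
        }
    }
}
```

The inner loop has no pointer chasing and no per-object virtual calls, which is a large part of why this kind of rewrite tends to end up simpler as well as faster.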
And just to emphasize: even if, as you said, you aren't so concerned with SIMD optimization right now, such data-oriented design is still important because of cache performance. E.g. if your particle struct packs position and color info together but for a certain effect you never update the color, you are just wasting time loading that color information and polluting your cache with it.
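A minimal sketch of the particle case, with made-up names (`PackedParticle`, `ParticleMotion`, `ParticleColor`, `ParticleSystem`, `integrate`) and an assumed float RGBA color: splitting the data the update loop touches from the data it doesn't means each cache line fetched during the update carries only positions and velocities.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Packed layout: updating the position also pulls the color through the cache.
struct PackedParticle {
    float px, py, pz;
    float vx, vy, vz;
    float r, g, b, a;
};

// Split layout (hypothetical): data touched every frame...
struct ParticleMotion {
    float px, py, pz;
    float vx, vy, vz;
};

// ...kept apart from data this effect never updates.
struct ParticleColor {
    float r, g, b, a;
};

// The update loop streams fewer bytes per particle with the split layout.
static_assert(sizeof(ParticleMotion) < sizeof(PackedParticle),
              "split motion data should be smaller than the packed struct");

struct ParticleSystem {
    std::vector<ParticleMotion> motion;  // index i matches color[i]
    std::vector<ParticleColor> color;
};

// Advances every particle by dt seconds; the color array is never read here.
void integrate(ParticleSystem& ps, float dt) {
    for (ParticleMotion& m : ps.motion) {
        m.px += m.vx * dt;
        m.py += m.vy * dt;
        m.pz += m.vz * dt;
    }
}
```

Whether the split actually pays off depends on how often the two halves are used together, so it is worth measuring on the real effect before committing to it.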