The same question was posted on Stack Overflow recently. The top answer makes an interesting suggestion: avoid dependent texture reads by calculating the texture coordinates in the vertex shader rather than in the fragment shader. This allows the GPU to optimize the texture fetches in the fragment shader, for example through caching.
I'd already looked into separable filters, but the other two methods seem promising, especially summed area tables, which I think are the same thing as integral images. I was also wondering: since the texture fetches are offset and the same fetches are repeated for every texel, is there any caching technique we can make use of? (I agree it's tough, because the operations happen in parallel.)
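
To check my understanding of summed area tables, here is a minimal CPU-side sketch in Python. It is not shader code, and the function names (summed_area_table, box_sum, box_blur) are my own, purely for illustration. It assumes a single-channel image stored as a list of rows and a plain box average as the filter. The point is that once the table is built, each output pixel needs only four lookups regardless of the filter radius.

```python
def summed_area_table(img):
    """Return an (h+1) x (w+1) table where sat[y][x] is the sum of all
    pixels in img with row index < y and column index < x."""
    h = len(img)
    w = len(img[0]) if h else 0
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, x0, y0, x1, y1):
    """Sum of pixels in the half-open rectangle [x0, x1) x [y0, y1),
    computed from four table lookups."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


def box_blur(img, radius):
    """Box-average every pixel over a (2*radius+1) square window,
    clamping the window at the image borders."""
    h, w = len(img), len(img[0])
    sat = summed_area_table(img)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            area = (x1 - x0) * (y1 - y0)
            out[y][x] = box_sum(sat, x0, y0, x1, y1) / area
    return out


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    print(box_blur(image, 1))
```

On a GPU the table would presumably live in a texture and the four lookups would become four texture fetches per fragment, but I have not tried that part yet.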