Of course, the problem is the round trip: the data has to go from memory across the PCI-e bus, get processed, and then come back up the PCI-e bus and into memory again.
One thing that I believe you're overlooking is that modern processors can also process 4 float operations at once (SIMD, e.g. SSE), so that probably accounts for quite a bit of it as well. The CPU will be limited by memory bandwidth in this scenario too.
Between those two issues, the CPU could finish the work before the data has even been transferred down the PCI-e bus.
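
To put rough numbers on it, here is a back-of-the-envelope sketch in Python. The bandwidth and clock figures are assumptions I made up for illustration, not measurements, and the function names are hypothetical; plug in the numbers for your own hardware.

```python
def transfer_seconds(n_bytes, pcie_bytes_per_s):
    """Time to send the data down the PCI-e bus and bring the result back."""
    return 2.0 * n_bytes / pcie_bytes_per_s  # two trips: down and back up


def cpu_seconds(n_floats, mem_bytes_per_s, flops_per_s, bytes_per_float=4):
    """CPU time, bounded by the slower of memory streaming and arithmetic."""
    mem_time = 2.0 * n_floats * bytes_per_float / mem_bytes_per_s  # read + write
    compute_time = n_floats / flops_per_s  # one float operation per element
    return max(mem_time, compute_time)


# Illustrative assumptions only (not measured values):
n = 16 * 1024 * 1024   # number of floats to process
pcie = 4e9             # assumed PCI-e throughput, bytes/s
mem = 10e9             # assumed main memory bandwidth, bytes/s
flops = 4 * 2.5e9      # assumed 4-wide SIMD at 2.5 GHz, float ops/s

t_pcie = transfer_seconds(n * 4, pcie)
t_cpu = cpu_seconds(n, mem, flops)

print(f"PCI-e round trip alone: {t_pcie * 1e3:.2f} ms")
print(f"CPU doing the whole job: {t_cpu * 1e3:.2f} ms")
```

With these assumed figures the CPU is memory-bound rather than compute-bound, and it still finishes before the PCI-e transfers alone would. That only holds for simple per-element work; the more arithmetic you do per float, the more the balance shifts back toward the GPU.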