What you're doing is slow because it just stresses bus bandwidth without reaping the benefits of using a GPU. Btw, check how the surfaces are created: they may not have been created with the dynamic flag, and your locks may not be using the discard flag. If that's the case, adding those flags should help performance a lot.
The real question is what you mean by "arbitrary image data".

If "arbitrary image data" means (for example) that you're using libcairo to render nice and complex 2D graphics and then sending the result to multiple D3D surfaces, then that's not going to be fast. You're wasting your time trying to use the GPU. Just combine the images on the CPU and send the final result to a single D3D surface.

If by "arbitrary image data" you mean loading a few icons or pictures from a file, then you should do the upload only once, not every frame.

If by "arbitrary image data" you mean images created through compositing (e.g. static images or rectangles layered on top of each other with different alpha blending operations, such as Photoshop-like blend modes), then your method is not the right way to do it. You should upload the static data once, and then use pixel shaders to perform the operations you were doing on the CPU to achieve the same result.
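
For the first case (CPU-rendered layers), combining everything on the CPU before a single upload could look roughly like the sketch below. It is only an illustration: `Pixel`, `Image`, `composite_over` and `flatten` are made-up names, not part of D3D or libcairo, and it assumes 8-bit RGBA with straight (non-premultiplied) alpha, an opaque destination, and layers of the same size as the output.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pixel type: 8 bits per channel, straight (non-premultiplied) alpha.
struct Pixel { std::uint8_t r, g, b, a; };

// Hypothetical CPU-side image, stored row by row.
struct Image {
    int width;
    int height;
    std::vector<Pixel> pixels;
    Image(int w, int h, Pixel fill)
        : width(w), height(h),
          pixels(static_cast<std::size_t>(w) * static_cast<std::size_t>(h), fill) {}
};

// Blend one channel: src over dst, weighted by the 8-bit source alpha (rounded).
static std::uint8_t blend_channel(std::uint8_t src, std::uint8_t dst, std::uint8_t alpha) {
    return static_cast<std::uint8_t>((src * alpha + dst * (255 - alpha) + 127) / 255);
}

// Composite src over dst in place. dst is assumed to be opaque, so its alpha
// is left untouched. Returns false (and does nothing) if the sizes differ.
bool composite_over(Image& dst, const Image& src) {
    if (dst.width != src.width || dst.height != src.height) return false;
    for (std::size_t i = 0; i < dst.pixels.size(); ++i) {
        const Pixel& s = src.pixels[i];
        Pixel& d = dst.pixels[i];
        d.r = blend_channel(s.r, d.r, s.a);
        d.g = blend_channel(s.g, d.g, s.a);
        d.b = blend_channel(s.b, d.b, s.a);
    }
    return true;
}

// Flatten all layers, bottom to top, onto an opaque black canvas.
// Layers whose size does not match the canvas are skipped.
Image flatten(const std::vector<Image>& layers, int width, int height) {
    Image out(width, height, Pixel{0, 0, 0, 255});
    for (const Image& layer : layers) {
        composite_over(out, layer);
    }
    return out;
}
```

The buffer returned by `flatten` is then the only thing you copy into your single D3D surface each frame (the upload itself is not shown here), instead of locking and filling one surface per layer.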