To read back results on the CPU you have to create two buffers of the same size. Create the first with D3D11_USAGE_DEFAULT and use it as the output of your compute shader. Create the second with D3D11_USAGE_STAGING and CPU read access (D3D11_CPU_ACCESS_READ). After you run your compute shader, use CopyResource to copy the data from the GPU buffer to the staging buffer. You can then call Map on the staging buffer to read its contents. Just be aware that doing this will cause the CPU to flush its command buffers and wait around while the GPU finishes executing commands, which will kill parallelism and hurt performance. You can alleviate this by waiting as long as possible after calling CopyResource before calling Map.
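Here's a rough sketch of those steps, assuming you already have a device, an immediate context, and the D3D11_USAGE_DEFAULT buffer your compute shader wrote to. The helper names BeginReadback and FinishReadback are made up for illustration, and error handling is minimal. The readback is split in two so you can issue the copy early and only call Map once you actually need the data.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

#ifdef _WIN32  // Direct3D 11 is only available on Windows
#include <d3d11.h>

// Hypothetical helper: creates a staging buffer matching gpuBuffer and
// queues a GPU-side copy into it. Returns nullptr if creation fails.
ID3D11Buffer* BeginReadback(ID3D11Device* device,
                            ID3D11DeviceContext* context,
                            ID3D11Buffer* gpuBuffer)
{
    assert(device && context && gpuBuffer);

    // Start from the source description so the sizes match exactly.
    D3D11_BUFFER_DESC desc = {};
    gpuBuffer->GetDesc(&desc);
    desc.Usage = D3D11_USAGE_STAGING;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
    desc.BindFlags = 0;  // staging resources can't be bound to the pipeline
    desc.MiscFlags = 0;

    ID3D11Buffer* staging = nullptr;
    if (FAILED(device->CreateBuffer(&desc, nullptr, &staging)))
        return nullptr;

    // Only queues the copy; the CPU doesn't wait here.
    context->CopyResource(staging, gpuBuffer);
    return staging;
}

// Hypothetical helper: maps the staging buffer, copies its bytes out, and
// releases it. Map is where the CPU may stall waiting for the GPU, so call
// this as late as you can. Returns an empty vector if Map fails.
std::vector<unsigned char> FinishReadback(ID3D11DeviceContext* context,
                                          ID3D11Buffer* staging)
{
    assert(context && staging);

    D3D11_BUFFER_DESC desc = {};
    staging->GetDesc(&desc);

    std::vector<unsigned char> bytes;
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(context->Map(staging, 0, D3D11_MAP_READ, 0, &mapped)))
    {
        bytes.resize(desc.ByteWidth);
        std::memcpy(bytes.data(), mapped.pData, desc.ByteWidth);
        context->Unmap(staging, 0);
    }
    staging->Release();
    return bytes;
}
#endif  // _WIN32
```

You'd call BeginReadback right after your Dispatch, do other CPU work (or even wait a frame), and then call FinishReadback. If you'd rather poll than block, Map also accepts D3D11_MAP_FLAG_DO_NOT_WAIT, in which case it returns DXGI_ERROR_WAS_STILL_DRAWING while the GPU is still busy.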
Also just so you're aware, while global atomics are the most straightforward way to do this, they're definitely not the fastest. Running a multi-pass parallel reduction is likely to be much faster.
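To show what "multi-pass" means here, this is a CPU-side model of the pass structure, not GPU code. It assumes the operation is a sum, and the names ReducePass, MultiPassReduce, and groupSize are made up for illustration. Each call to ReducePass stands in for one compute shader dispatch, where every thread group reduces its chunk of the input to a single partial result, and the passes repeat until one value is left.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models one dispatch: each "thread group" sums up to groupSize consecutive
// inputs and writes one partial sum to the output.
std::vector<std::uint64_t> ReducePass(const std::vector<std::uint64_t>& input,
                                      std::size_t groupSize)
{
    assert(groupSize >= 2);  // a group of 1 would never shrink the data

    std::vector<std::uint64_t> output;
    output.reserve((input.size() + groupSize - 1) / groupSize);

    for (std::size_t start = 0; start < input.size(); start += groupSize)
    {
        const std::size_t end = std::min(start + groupSize, input.size());
        std::uint64_t sum = 0;
        for (std::size_t i = start; i < end; ++i)
            sum += input[i];
        output.push_back(sum);
    }
    return output;
}

// Runs passes until a single value remains. If passCount is non-null it
// receives the number of passes, i.e. the number of dispatches modeled.
std::uint64_t MultiPassReduce(std::vector<std::uint64_t> data,
                              std::size_t groupSize,
                              std::size_t* passCount = nullptr)
{
    std::size_t passes = 0;
    while (data.size() > 1)
    {
        data = ReducePass(data, groupSize);
        ++passes;
    }
    if (passCount)
        *passCount = passes;
    return data.empty() ? 0 : data.front();
}
```

With 1000 inputs and a groupSize of 256, this takes two passes (1000 values, then 4 partial sums, then 1 result). On the GPU the groups within a pass run in parallel and only write their own output slot, so nothing has to contend on a single global address the way atomics do.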
Edited by MJP, 02 August 2013 - 12:13 PM.