Copy from UAV to cbuffer in DX11 without Map/Unmap

Hi guys,
is it possible to copy from a RWStructuredBuffer<float2x4> to a cbuffer of the same size using the CopyResource function?
According to MSDN, if the size, format, etc. are the same, it should work.
There is a note "You can't use an Immutable resource as a destination." - I guess by immutable they mean D3D11_USAGE_IMMUTABLE, so I used D3D11_USAGE_DEFAULT instead.

The RWStructuredBuffer<float2x4> is created like this:

        D3D11_BUFFER_DESC desc = {}; // zero-initialize so unset fields are not garbage
        desc.ByteWidth = 2048; //64 lights * size of float2x4
        desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
        desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED; // needed for a structured buffer
        desc.StructureByteStride = 32; //size of float2x4
        desc.Usage = D3D11_USAGE_DEFAULT;
        hr = m_p_device->CreateBuffer(&desc, 0, &sourceBuffer);

        D3D11_UNORDERED_ACCESS_VIEW_DESC uavd = {};
        uavd.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
        uavd.Format = DXGI_FORMAT_UNKNOWN; // structured buffer views use UNKNOWN
        uavd.Buffer.NumElements = 64;
        hr = m_p_device->CreateUnorderedAccessView(sourceBuffer, &uavd, &sourceBufferView);

        // generating 64 lights and store them in the sourceBuffer

Then the cbuffer is created like this:

        D3D11_BUFFER_DESC desc = {}; // zero-initialize so unset fields are not garbage
        desc.ByteWidth = 2048; //64 lights * size of float2x4
        desc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
        desc.Usage = D3D11_USAGE_DEFAULT;
        hr = m_p_device->CreateBuffer(&desc, 0, &destinationBuffer);

Then the copy is done via a deferred context:

        m_p_deferred_context->CopyResource(destinationBuffer, sourceBuffer);

        // call the final lighting shader

In my lighting shader, I have 64 lights, float4 for color, float4 for position in view space, therefore float2x4.
The colors and positions of the lights are generated in another shader on the fly, so I store them in RWStructuredBuffer<float2x4>.
Then in my final lighting shader, I have to read all 64 lights per pixel, so I could just read the data again from RWStructuredBuffer<float2x4>.
However, since I'm doing tons of other texture reading, I think it totally breaks the texture cache, because I get a huge fps drop.
So I tried to move the RWStructuredBuffer<float2x4> data into a cbuffer and I got almost double performance.
The problem is, it appears that the data layout of these buffers is somehow different.
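
To make the 32-byte stride and the 2048-byte width concrete, here is a minimal CPU-side sketch of one light as described above. The struct name GpuLight and the constant names are made up for illustration and are not from my actual code; it only checks the arithmetic for the tightly packed structured buffer.

```cpp
#include <cstddef>

// Hypothetical CPU-side mirror of one light (names are illustrative only).
struct GpuLight {
    float color[4];         // float4 color
    float viewPosition[4];  // float4 position in view space
};

constexpr std::size_t kLightCount = 64;
constexpr std::size_t kLightStride = sizeof(GpuLight);                 // StructureByteStride
constexpr std::size_t kLightBufferBytes = kLightCount * kLightStride;  // ByteWidth

// Compile-time checks of the sizes used in the buffer descriptions.
static_assert(kLightStride == 32, "one light is 8 floats = 32 bytes");
static_assert(kLightBufferBytes == 2048, "64 lights * 32 bytes");
```

This says nothing about how the shader compiler lays out the same data inside a cbuffer, which is where the two sides can disagree.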

For debugging, I divided the screen into 8x8=64 squares, and every square displays the color of one light from the RWStructuredBuffer<float2x4>.
If I read it as RWStructuredBuffer<float2x4>, everything is correct, with a few red, green and white lights:


However, if I now read it from the copied cbuffer, I get this; the color channels are somehow messed up.
Obviously, some data was copied and even the pattern was preserved:


Any idea what could have happened, and how to do it correctly?

I could just do Map/Unmap, but since it's a deferred context, it's a bit tricky. Moreover, I'd like to avoid any CPU communication and another staging buffer, so I'd like to just use CopyResource.


float2x4 stuff[64];   - is not 2048 bytes with the default column-major packing, it's 4096 bytes, as each 'register' in a constant buffer is padded to float4.

No such padding will occur with a StructuredBuffer, so perhaps you're copying a 2048 byte structured buffer into the first half of a constant buffer that the compiler is expecting to be 4096? You probably wanted float4x2 stuff[64] instead?
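
As a rough sketch of that arithmetic (not anything from the D3D API; the function and enum names are hypothetical), the footprint of a matrix array in a constant buffer can be estimated by counting one 16-byte register per column under column-major packing, or one per row under row-major packing:

```cpp
#include <cstddef>

// Which matrix dimension gets its own 16-byte constant register.
enum class MatrixPacking { ColumnMajor, RowMajor };

// Bytes taken in a cbuffer by `count` array elements of floatRxC,
// rounded up to whole float4 registers. Assumes rows and cols are 1..4.
constexpr std::size_t CbufferMatrixArrayBytes(std::size_t rows, std::size_t cols,
                                              std::size_t count, MatrixPacking packing) {
    // Column-major: one register per column. Row-major: one register per row.
    return count * ((packing == MatrixPacking::ColumnMajor) ? cols : rows) * 16;
}
```

With these assumptions, float2x4 stuff[64] comes out at 4096 bytes when column-major, while float4x2 stuff[64] column-major (or float2x4 row-major) comes out at 2048 bytes, which is the size of the structured buffer.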

Can you show me your cbuffer layout so we can be sure that that's the problem? I expect either you've only got half the data in the right place or it has been transposed between float2x4 and float4x2.

It seems you are right.
My cbuffer looks like what you wrote:

cbuffer GILights : register(b2)
{
    float2x4 GIColorViewPosition[64];
};

But when I change it to float4x2, I run into a problem when I try to read this:

float4 color = GIColorViewPosition[ i ][ 0 ];

The compiler complains that it cannot convert float2 to float4. Perhaps it's related to the fact that I compile the shader with D3DCOMPILE_PACK_MATRIX_ROW_MAJOR.

Is it really the case that this flag affects not just the square matrix types, but all the float#x# types and all the related int, bool, etc. versions of them?

When I store the lights in a RWStructuredBuffer<float4x2> and then read them from a RWStructuredBuffer<float2x4>, I get exactly the same broken image, so it must be the problem you just described.

float4x2 and float2x4 are every bit as much a 'matrix' as float4x4 for the purposes of packing.

/Zpr (row-major packing) affects float2x4 and float4x2 as well: depending on whether that flag is set, a 64-element array of one of them takes 4096 bytes and the other 2048, and vice versa.

This shader, when compiled with /Zpr, has a 2048-byte constant buffer and reads float4s:

cbuffer B
{
    float2x4 stuff[64];
};

float4 main(uint i : I) : SV_TARGET
{
    return stuff[i][0] + stuff[i][1];
}


