[solved][gpgpu] Is there a faster way to copy a texture back to memory?

Started by
4 comments, last by sirob 16 years, 6 months ago
Thanks, Sirob Yes.
//By the way, I forgot Jack's great suggestion to use memcpy_s instead of the nested for loops: http://www.gamedev.net/community/forums/topic.asp?topic_id=446082

Hi,
[dx10, vs2005, geforce8800]
The code below is a routine for harvesting the GPGPU result. It is slow: it actually takes the majority of the total time, about 7x the drawScene time. How can I speed it up, even if only a little? Any ideas would be much appreciated. Thanks.

//stage pTexRT0 to Mem
static void ReadBack( Int2* Mem, int rLen, ID3D10Texture2D* pTexRT0, int width0, int height0 )
{
	//tex staging
	ID3D10Texture2D* pTexStage = NULL;
	D3D10_TEXTURE2D_DESC descTex;

	descTex.Width = width0;
	descTex.Height = height0;
	descTex.ArraySize = 1;
	descTex.SampleDesc.Count = 1;
	descTex.SampleDesc.Quality = 0;
	descTex.Format = DXGI_FORMAT_R32G32_SINT;
	descTex.Usage = D3D10_USAGE_STAGING;
	descTex.BindFlags = 0;
	descTex.CPUAccessFlags = D3D10_CPU_ACCESS_READ;
	descTex.MiscFlags = 0;
	descTex.MipLevels = 1;
	V(g_pd3dDevice->CreateTexture2D(&descTex, NULL, &pTexStage));
	////////////////////////

	g_pd3dDevice->CopyResource(pTexStage, pTexRT0);

	SAFE_RELEASE(pTexRT0);	

	D3D10_MAPPED_TEXTURE2D mappedTex;
	pTexStage->Map(0, D3D10_MAP_READ, NULL, &mappedTex);
	INT* pTexels = (INT*)mappedTex.pData;
	int idx = rLen;

	int h = descTex.Height;
	int w = descTex.Width;

	for(int row = h-1; row >= 0; row--)
	{
		UINT rowStart = row * mappedTex.RowPitch / sizeof(UINT);	//RowPitch is in Byte
		for(int col = w-1; col >= 0 ; col--)
		{
			UINT colStart = col * 2;
			if(idx <= 0)
			{
				break;
			}
			else
			{
				Mem[idx-1].x = pTexels[rowStart + colStart + 0];	//r
				Mem[idx-1].y = pTexels[rowStart + colStart + 1];	//g
				idx--;
			}
		}
	}
	pTexStage->Unmap(0);
	SAFE_RELEASE(pTexStage);
}
[Edited by - yk_cadcg on October 10, 2007 9:49:36 PM]
First of all, you should not recreate the staging texture every time you want data back from the GPU. Create it in advance and reuse it.

I don't know how you measured the times, but if you only measured on the CPU, your results can be wrong. Commands issued from the CPU are pushed to a queue and executed asynchronously. As soon as you want data back, you force a sync. In your case this should happen at the Map call.

It is common to use multiple buffers for the read back and use them round robin.
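The round-robin idea can be sketched as slot bookkeeping. This is only a minimal sketch under assumptions: `ReadbackRing` is a hypothetical helper that is not part of Direct3D, and each slot index is assumed to stand for one pre-created staging texture. Each frame, the copy goes into the write slot, and the slot that was written N-1 frames earlier is the one to map, since its copy has had the most time to finish.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a Direct3D type): tracks which of N staging
// slots receives this frame's copy and which older slot should be mapped.
template <std::size_t N>
class ReadbackRing
{
public:
    // Slot that should receive this frame's copy.
    std::size_t WriteSlot() const { return frame_ % N; }

    // True once a slot written N-1 frames ago exists.
    bool HasReadSlot() const { return frame_ >= N - 1; }

    // Oldest slot still in flight: the one written N-1 frames ago.
    std::size_t ReadSlot() const
    {
        assert(HasReadSlot());
        return (frame_ + 1) % N;
    }

    // Call once per frame after issuing the copy and (if any) the read.
    void EndFrame() { ++frame_; }

private:
    std::size_t frame_ = 0;
};
```

In terms of the code in the first post, this would presumably mean calling CopyResource into the staging texture at WriteSlot() and calling Map on the one at ReadSlot(). The cost is that results arrive N-1 frames late, so it only helps when the result is not needed immediately.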
Thanks Demirug,
1. Yes, reusing the staging texture is a must. I do so in the real code.
2. I ran the code to find out the pitch beforehand, and then used a single memcpy(). (Sure, it's unsafe; I just want to study the performance.) But it takes the same time as the two for loops. Why?
Thanks!
Quote:Original post by yk_cadcg
2. I ran the code to find out the pitch beforehand, and then used a single memcpy(). (Sure, it's unsafe; I just want to study the performance.) But it takes the same time as the two for loops. Why?
Thanks!
Because it takes maybe 1,000 cycles to do the for loop, 750 for a memcpy(), and 1,000,000,000,000 cycles to do the Map() call (completely made-up numbers). In short - the for loop isn't your bottleneck, it's the Map() call.
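For reference, a pitch-aware copy can be sketched as below. This is a minimal sketch, not the poster's code: `CopyRowsFromPitched` is a hypothetical helper, the `Int2` struct is a stand-in for the thread's type, a plain byte buffer stands in for `mappedTex.pData`, and the destination is assumed to hold exactly width x height texels in row-major order. It issues one memcpy per row so that any padding implied by RowPitch is skipped, which avoids the unsafe single-memcpy assumption. As the reply above says, the copy is not the bottleneck, so this would not be expected to change the total time much.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for the thread's Int2 (two 32-bit signed ints, as in R32G32_SINT).
struct Int2 { std::int32_t x, y; };

// Hypothetical helper: copies width x height texels from a surface whose rows
// start rowPitch bytes apart into a tightly packed destination array.
void CopyRowsFromPitched(Int2* dst, const void* mappedData, std::size_t rowPitch,
                         std::size_t width, std::size_t height)
{
    const std::size_t rowBytes = width * sizeof(Int2);
    assert(rowPitch >= rowBytes);  // pitch may include padding, never less than one row

    const unsigned char* src = static_cast<const unsigned char*>(mappedData);
    for (std::size_t row = 0; row < height; ++row)
    {
        // One memcpy per row; the padding bytes at the end of each source row are skipped.
        std::memcpy(dst + row * width, src + row * rowPitch, rowBytes);
    }
}
```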
Thanks a lot! That's a very impressive explanation.
Would you please give some more intuition on the following:
1. Does the mapped size matter, and to what degree? Is fetching 1 byte from GPU RAM to system RAM almost as costly as fetching 1M bytes, with both being mostly API time?
2. You mentioned bounded buffers, but that hides the Map() latency only if I don't need the texture's result immediately and render multiple times. Right now I want to render only once, so maybe the only way out is to insert some CPU code between Map() and the copy?
3. You seem to say that g_pd3dDevice->CopyResource(pTexStage, pTexRT0) costs much less than pTexStage->Map(0, D3D10_MAP_READ, NULL, &mappedTex). I can't understand this: CopyResource goes from GPU RAM to system RAM, while Map stays in system RAM, so why is the latter slower than the former?
Thanks!

[Edited by - yk_cadcg on October 10, 2007 3:21:23 AM]
Quote:Original post by yk_cadcg
Thanks a lot! That's a very impressive explanation.
Would you please give some more intuition on the following:
1. Does the mapped size matter, and to what degree? Is fetching 1 byte from GPU RAM to system RAM almost as costly as fetching 1M bytes, with both being mostly API time?
2. You mentioned bounded buffers, but that hides the Map() latency only if I don't need the texture's result immediately and render multiple times. Right now I want to render only once, so maybe the only way out is to insert some CPU code between Map() and the copy?
3. You seem to say that g_pd3dDevice->CopyResource(pTexStage, pTexRT0) costs much less than pTexStage->Map(0, D3D10_MAP_READ, NULL, &mappedTex). I can't understand this: CopyResource goes from GPU RAM to system RAM, while Map stays in system RAM, so why is the latter slower than the former?
Thanks!

1) You seem to be missing the point of what a Map call actually does. Other than some memory address trickery, what a Map call is really there to do is provide synchronization, so that you don't write/read data using the CPU while the GPU is reading/writing. As such, your problem is with the synchronization, and not with the amount of data being mapped.

2) Map is probably taking a while to return, so placing code after it wouldn't do much good. If anything, you'd want to place CPU-based code between the render and the lock, to reduce the amount of time you spend idling while waiting for the rendering to finish.

3) Profile, Profile, Profile. We don't know function speeds by heart, and there are too many factors to calculate anything worthwhile. Writing a small test-case to compare these methods shouldn't take long at all, and will provide you with all the answers you're looking for. You should be the one providing us this information, not us providing you.
Sirob Yes.» - status: Work-O-Rama.
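A small timing harness along the lines suggested above might look like this. It is a minimal sketch using only the C++ standard library: `TimeMs` is a hypothetical helper name, and the stages it would wrap (CopyResource, Map, and the copy loop from the first post) are left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper: runs fn once and returns the elapsed wall-clock
// time in milliseconds, measured on the CPU with a monotonic clock.
template <typename Fn>
double TimeMs(Fn&& fn)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping each stage in its own TimeMs call would show where the wall-clock time lands. Bear in mind the earlier caveat in this thread: because the GPU executes commands asynchronously, a CPU-side timer around Map largely measures the wait for previously queued work rather than the cost of Map itself.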

