• Content count

  • Joined

  • Last visited

Community Reputation

357 Neutral

About DayWall

  • Rank
  1. That's some great ideas, thx for sharing! It'll definitely come in handy! ) B.t.w, does anyone know, how exactly Umbra works? I've heard, that Witcher 3 actually use their OC solution
  2. Introduction Recently I came across a awesome presentation from Crytek named Secrets of CryEGNINE 3 Graphics Technology authored by Nickolay Kasyan, Nicolas Schulz and Tiago Sousa. In this paper I've found a brief description of a technique called Coverage Buffer. You can find whole presentation HERE. This technology was presented as main occlusion culling method, actively used since Crysis 2. And, since there was no detailed paper about this technology, I decided to dig into this matter myself. [hr] Coverage Buffer - Occlusion Culling technique Overview Main idea of method was clearly stated by Crytek presentation I mentioned before: Get depth buffer from previous frame Make reprojection to current frame Softwarely rasterize BBoxes of objects to check, whether they can be seen from camera perspective or not - and, based on this test, make a decision to draw or not to draw. There's, of course, nothing revolutionary about this concept. There's another very similar method called Software Occlusion Culling. But there are few differences between these methods, and a crucial one is that in SOC we must separate objects into two different categories - occluders and occludee, which is not always can be done. Let's see some examples. If we have FPS game level, like Doom 3, we have corridors, which are perfect occluders, and objects - barrels, ammo, characters, which are, in turn, perfect occludees. In this case we have a clear approach - to test objects' BBoxes agains corridors. But what if we have, let's say, massive forest. Every tree can be both occluder - imagine large tree right in front of camera, which occludes all other world behind it, - and occludee - when some other tree occludes it. In case of forest we cannot use SOC in its pure form, it'd be counterproductive. So, summarizing cons and pros of Coverage Buffer: PROS: we need not to separate objects into occluders/occludee we can use already filled depth buffer from previous frame, we need not to rasterize large occluder's BBoxes CONS: small artifacts, caused by 1 frame delay (even reprojection doesn't completely solve it) small overhead if there's no occlusion happened (that, I guess, is common for all OC methods I know, but still) [hr] Choice When I started to investigate this matter, it wasn't out of pure academic interest. On a existing and live project there was a particular problem, which needed to be solved. Large procedural forest caused giant lags because of overdraw issue (Dx9 alpha-test stage was disabled due to other issues, which are not discussed in this article, and in Dx11 alpha test kills Early-Z, which also causes massive overdraw). Here's a short summary of initial problem: We need to draw an island, full of different, procedural trees. (Engine used is Torque3D) Engine by default offers nice batching system, which batches distant trees into... well, batches, but decisions about "draw/no draw" is taken based on frustum culling results only. Also, distant trees are rendering as billboards-imposters, which is also a nice optimization. But this approach is not so effective, when we deal with large forest with thousands of trees. In this case there's a lot of overdraw done: batches behind mountains, batches behind the wall, batches behind other trees and so on. All of these overdraws cause FPS to drop gravely: even if we look through wall in the direction towards the center of an island, drawing of invisible trees took about 20-30ms. As a result, players got a dramatic FPS drop by just looking towards the center of an isle. To solve this particular issue it's been decided to use Coverage Buffer. I cannot say that I did not have doubts about this decision, but Crytek recommendations overruled all my other suggestions. Besides, CB fits into this particular issue like a glove - why not try it? Implementation Let's proceed to technical details and code. Obtaining Depth Buffer. First task was to obtain depth buffer. In Dx11 it's no difficult task. In Dx9 it's also not so difficult, there's a certain hack (found in Aras Pranckevi?ius blog, it's a guy, who runs render in Unity3D). Here's link: It appears, that one CAN obtain depth buffer, but only with special format - INTZ. According to official NVidia and AMD papers, most of videocards since 2008 support this feature. For earlier cards there's RAWZ - another hacky format. Links to papers: Usage code is trivial, but I'll put it here - just in case: #define FOURCC_INTZ ((D3DFORMAT)(MAKEFOURCC('I','T','N','Z'))) // Determine if INTZ is supported HRESULT hr; hr = pd3d->CheckDeviceFormat(AdapterOrdinal, DeviceType, AdapterFormat, D3DUSAGE_DEPTHSTENCIL, D3DRTYPE_TEXTURE, FOURCC_INTZ); BOOL bINTZDepthStencilTexturesSupported = (hr == D3D_OK); // Create an INTZ depth stencil texture IDirect3DTexture9 *pINTZDST; pd3dDevice->CreateTexture(dwWidth, dwHeight, 1, D3DUSAGE_DEPTHSTENCIL, FOURCC_INTZ, D3DPOOL_DEFAULT, &pINTZDST, NULL); // Retrieve depth buffer surface from texture interface IDirect3DSurface9 *pINTZDSTSurface; pINTZDST->GetSurfaceLevel(0, &pINTZDSTSurface); // Bind depth buffer pd3dDevice->SetDepthStencilSurface(pINTZDSTSurface); // Bind depth buffer texture pd3dDevice->SetTexture(0, pINTZDST); Next step is processing depth buffer, so we could use it. Processing depth buffer. downscale to low resolution (I picked 256x128) reprojection These steps are trivial. Downscale is performed with operator max - we're taking the elowest distance to camera, so we wouldn't occlude any of actually visible objects. Reprojection is performed by applying inverted ViewProjection matrix of previous frame and applying ViewProjection matrix of current frame to results. Gaps are filled with maxValue to prevent artificial occlusion. Here's some useful parts of code for reprojection: float3 reconstructPos(Texture2D depthTexture, float2 texCoord, float4x4 matrixProjectionInverted ) { float depth = 1-depthTexture.Sample( samplerDefault, texCoord ).r; float2 cspos = float2(texCoord.x * 2 - 1, (1-texCoord.y) * 2 - 1); float4 depthCoord = float4(cspos, depth, 1); depthCoord = mul (matrixProjectionInverted, depthCoord); return / depthCoord.w; } Projection performed trivially. Software rasterization This topic is well known and already implemented a lot of times. Best info which I could find was here: But, just to gather all eggs in one basket, I'll provide my code, which was originally implemented in plain c++, and later translated to SSE, after which it became approximately 3 times faster. My SSE is far from perfect, so, if you find any mistakes or places for optimization - please tell me =) static const int sBBIndexList[36] = { // index for top 4, 8, 7, 4, 7, 3, // index for bottom 5, 1, 2, 5, 2, 6, // index for left 5, 8, 4, 5, 4, 1, // index for right 2, 3, 7, 2, 7, 6, // index for back 6, 7, 8, 6, 8, 5, // index for front 1, 4, 3, 1, 3, 2, }; __m128 SSETransformCoords(__m128 *v, __m128 *m) { __m128 vResult = _mm_shuffle_ps(*v, *v, _MM_SHUFFLE(0,0,0,0)); vResult = _mm_mul_ps(vResult, m[0]); __m128 vTemp = _mm_shuffle_ps(*v, *v, _MM_SHUFFLE(1,1,1,1)); vTemp = _mm_mul_ps(vTemp, m[1]); vResult = _mm_add_ps(vResult, vTemp); vTemp = _mm_shuffle_ps(*v, *v, _MM_SHUFFLE(2,2,2,2)); vTemp = _mm_mul_ps(vTemp, m[2]); vResult = _mm_add_ps(vResult, vTemp); vResult = _mm_add_ps(vResult, m[3]); return vResult; } __forceinline __m128i Min(const __m128i &v0, const __m128i &v1) { __m128i tmp; tmp = _mm_min_epi32(v0, v1); return tmp; } __forceinline __m128i Max(const __m128i &v0, const __m128i &v1) { __m128i tmp; tmp = _mm_max_epi32(v0, v1); return tmp; } struct SSEVFloat4 { __m128 X; __m128 Y; __m128 Z; __m128 W; }; // get 4 triangles from vertices void SSEGather(SSEVFloat4 pOut[3], int triId, const __m128 xformedPos[]) { for(int i = 0; i < 3; i++) { int ind0 = sBBIndexList[triId*3 + i + 0]-1; int ind1 = sBBIndexList[triId*3 + i + 3]-1; int ind2 = sBBIndexList[triId*3 + i + 6]-1; int ind3 = sBBIndexList[triId*3 + i + 9]-1; __m128 v0 = xformedPos[ind0]; __m128 v1 = xformedPos[ind1]; __m128 v2 = xformedPos[ind2]; __m128 v3 = xformedPos[ind3]; _MM_TRANSPOSE4_PS(v0, v1, v2, v3); pOut.X = v0; pOut.Y = v1; pOut.Z = v2; pOut.W = v3; //now X contains X0 x1 x2 x3, Y - Y0 Y1 Y2 Y3 and so on... } } bool RasterizeTestBBoxSSE(Box3F box, __m128* matrix, float* buffer, Point4I res) { //TODO: performance LARGE_INTEGER frequency; // ticks per second LARGE_INTEGER t1, t2; // ticks double elapsedTime; // get ticks per second QueryPerformanceFrequency(&frequency); // start timer QueryPerformanceCounter(&t1); //verts and flags __m128 verticesSSE[8]; int flags[8]; static Point4F vertices[8]; static Point4F xformedPos[3]; static int flagsLoc[3]; // Set DAZ and FZ MXCSR bits to flush denormals to zero (i.e., make it faster) // Denormal are zero (DAZ) is bit 6 and Flush to zero (FZ) is bit 15. // so to enable the two to have to set bits 6 and 15 which 1000 0000 0100 0000 = 0x8040 _mm_setcsr( _mm_getcsr() | 0x8040 ); // init vertices Point3F center = box.getCenter(); Point3F extent = box.getExtents(); Point4F vCenter = Point4F(center.x, center.y, center.z, 1.0); Point4F vHalf = Point4F(extent.x*0.5, extent.y*0.5, extent.z*0.5, 1.0); Point4F vMin = vCenter - vHalf; Point4F vMax = vCenter + vHalf; // fill vertices vertices[0] = Point4F(vMin.x, vMin.y, vMin.z, 1); vertices[1] = Point4F(vMax.x, vMin.y, vMin.z, 1); vertices[2] = Point4F(vMax.x, vMax.y, vMin.z, 1); vertices[3] = Point4F(vMin.x, vMax.y, vMin.z, 1); vertices[4] = Point4F(vMin.x, vMin.y, vMax.z, 1); vertices[5] = Point4F(vMax.x, vMin.y, vMax.z, 1); vertices[6] = Point4F(vMax.x, vMax.y, vMax.z, 1); vertices[7] = Point4F(vMin.x, vMax.y, vMax.z, 1); // transforms for(int i = 0; i < 8; i++) { verticesSSE = _mm_loadu_ps(vertices); verticesSSE = SSETransformCoords(&verticesSSE, matrix); __m128 vertX = _mm_shuffle_ps(verticesSSE, verticesSSE, _MM_SHUFFLE(0,0,0,0)); // xxxx __m128 vertY = _mm_shuffle_ps(verticesSSE, verticesSSE, _MM_SHUFFLE(1,1,1,1)); // yyyy __m128 vertZ = _mm_shuffle_ps(verticesSSE, verticesSSE, _MM_SHUFFLE(2,2,2,2)); // zzzz __m128 vertW = _mm_shuffle_ps(verticesSSE, verticesSSE, _MM_SHUFFLE(3,3,3,3)); // wwww static const __m128 sign_mask = _mm_set1_ps(-0.f); // -0.f = 1
  3. I'm afraid it won't be an easy task, our build of T3D is very much different from current, official version. It won't be easy to port this thing to vanilla torque. Besides, now it's a hot season for our project - release is coming, and there's a lot of work and optimization to be done yet. Maybe after release =)
  4. Well, I guess it's time for me to surrender =) If you're the guy, who invented the very method I struggled to implement, you're obviously much more experienced and knowledgeable in this area than myself =)
  5. This step is quite trivial. You get depth buffer from previous frame, and then softwarely rasterize BBoxes of objects you want to test for occlusion,   bool RasterizeTestBBoxSSE(Box3F box, __m128* matrix, float* buffer, Point4I res) This function accepts bounding box, depth buffer as plain float array and parameters of this buffer. Then it represents BBox as set of triangles and rasterize it, testing with buffer values. I honestly don't know what's more to say. If you want to know more about rasterization process, It's rather trivial - you can find a lot of other sources in internet, - and, more importantly, not the point of this particular article.
  6. That's not exactly correct. In our case, we have our command buffer executed, and all we need to do is to fetch Z-Buffer to CPU. It will cause GPU stall, of course, but a small one. Staging latency is rather small in this case - check for yourself. As for Occlusion Query, delayed one is unreliable - I think we can agree on that, and forcing one to implement immediately causes large GPU stall, much larger than fetching z-buffer, not to mention performing Occlusion Query for a complex scene with a lot of objects is a costly operation - it's basically another full scene pass, and if we can replace some simple objects with BBoxes, we cannot replace complex objects, for example - terrain, with BBox. And, besides that, using Z-Buffer gives much better quality of culling than using not accurate BBoxes in Occlusion Queries. And it's much more flexible - we don't need to adjust BBoxes, as we should do for OQ: That is, in fact, a great article, that gives a lot of points about why you shouldn't use OQ as main scene occlusion technique. I'm not saying, that C-Buffer is always better than Occlusion Queries, but in some cases it surely is. Btw, I wonder if anyone actually uses Occlusion Queries in actual projects or engines,..
  7. Yes, occlusion queries gives serious delay - sometimes up to 3 or 4 frames, thus you cannot rely on it completely, it can cause serious artificial cullings, which in turn leads to objects, appearing out of a thin air. Using Z-Buffer from previous frame combined with reprojection gives, basically, zero delay and completely reliable.