Hyunkel

SM5.0 Dynamic Branching Performance

8 posts in this topic

I'm currently trying to implement some form of triplanar multitexturing for my procedural planet renderer.
I'm storing all of my terrain diffuse textures in a DXT1-compressed texture array.
Somehow I was under the impression that it would be horribly inefficient to do conditional sampling on each texture array slice due to dynamic branching.
I couldn't really figure out a good way to calculate the 2 texture array indices I need to sample from ahead of time though, so I decided to use conditional sampling as a temporary solution until I figure out something more efficient.

And to my surprise it worked brilliantly, which made me realize that dynamic branching does not have the performance issues I thought it would have.
At least not on modern cards.

Here's what I'm currently doing:
[CODE]
float3 CDiffuse = 0.0;
[unroll]
for(int i = 0; i < 5; i++)
{
    if(TIntensity[i] > 0.0)
        CDiffuse += TIntensity[i] * GetTriplanarSample(BlendWeights, coord1, coord2, coord3, i, TADiffuse, TSampler);
}
[/CODE]

TIntensity[] are the texture intensities. They are calculated within the same shader.
GetTriplanarSample() gets 3 texture samples and blends them together using triplanar texturing from array slice i.
There are always only 2 TIntensity values higher than 0.0 at any given time.

And to my surprise, this has the same performance as doing:
[CODE]
CDiffuse += 0.5 * GetTriplanarSample(BlendWeights, coord1, coord2, coord3, 0, TADiffuse, TSampler);
CDiffuse += 0.5 * GetTriplanarSample(BlendWeights, coord1, coord2, coord3, 1, TADiffuse, TSampler);
[/CODE]

On the other hand if I remove the conditional branch:
[CODE]
float3 CDiffuse = 0.0;
[unroll]
for(int i = 0; i < 5; i++)
{
    //if(TIntensity[i] > 0.0)
    CDiffuse += TIntensity[i] * GetTriplanarSample(BlendWeights, coord1, coord2, coord3, i, TADiffuse, TSampler);
}
[/CODE]

Sampling takes about 2ms longer on my GTX 580 in this situation, which suggests that the card is branching with no noticeable performance loss.

Is dynamic branching in shaders really this powerful nowadays, or is this a special case where it performs well?

Cheers,
Hyu

The GPU probably doesn't do any branching here. It just clamps TIntensity[i] so that negative values are clamped to zero (which can be done for free on ALUs), which results in an equivalent operation. You could try disassembling your shader to see what is actually happening and be sure whether it's really branching.

Although yes, in general dynamic branching is pretty efficient on modern GPUs.

The actual cost of the condition/branch instructions is quite low -- as cheap as a few ALU operations. Even on old SM3 cards, branches would range between 2-12 ALU ops in cost, so as long as you were [i]on average[/i] skipping more than 12 ALU ops with your branch, then they were 'fast'.

They only become really expensive when neighbouring pixels take opposite paths. For example, pixel (10,42) is true, but pixel (10,43) is false -- the GPU processes pixels in batches ([i]usually a 2*2px quad[/i]), and if different paths are taken inside a batch, then every pixel in that batch has to take [i]both[/i] paths ([i]and discard the unwanted one[/i]).

You should check the assembly; I think Bacterius is right about the compiler not actually using branch instructions. You can't use normal Sample instructions inside of a branch, so if you're using Sample then it's definitely not branching. Or at least, it's not executing the actual Sample inside the branch. Sometimes the compiler will move a Sample outside of a branch and emit a warning (always make sure that you output those when you compile your shaders).

As for dynamic branching perf, it's definitely really good on modern hardware, as long as the branching is pretty coherent. Typically the granularity of branching coherency is a warp or a wavefront, so either 32 or 64 threads.
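
To make the difference between a real branch and a flattened one visible in the disassembly, HLSL offers the [branch] and [flatten] attributes. The following is only a sketch: ExpensiveTerm, ShadeBranched and ShadeFlattened are hypothetical names that are not part of the shader discussed in this topic, and the work inside is pure ALU so that no gradient-based Sample is involved.
[CODE]
// Hypothetical stand-in for any ALU-heavy work (no gradient-based sampling).
float3 ExpensiveTerm(float3 n)
{
    float3 acc = 0.0;
    [unroll]
    for (int k = 1; k <= 8; k++)
        acc += sin(n * k) / k;
    return acc;
}

float3 ShadeBranched(float3 n, float weight)
{
    float3 result = 0.0;
    // [branch] asks the compiler to emit an actual dynamic branch,
    // so the work can be skipped when the condition is false.
    [branch]
    if (weight > 0.0)
        result += weight * ExpensiveTerm(n);
    return result;
}

float3 ShadeFlattened(float3 n, float weight)
{
    float3 result = 0.0;
    // [flatten] asks the compiler to evaluate the body unconditionally
    // and then select the result, so nothing is skipped.
    [flatten]
    if (weight > 0.0)
        result += weight * ExpensiveTerm(n);
    return result;
}
[/CODE]
Compiling both variants and comparing the listings should show an if_nz/endif pair around the work only in the branched version. Whether the branch is a win still depends on how coherent the condition is across a warp or wavefront.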

That's really interesting, I seem to have had some misconceptions about this.

Here's the asm (with the loop reduced to 3 iterations, i.e. i < 3):
[CODE]
//
// Generated by Microsoft (R) HLSL Shader Compiler 9.29.952.3111
//
//
// Buffer Definitions:
//
// cbuffer $Globals
// {
//
// bool UseCubeBlending; // Offset: 0 Size: 4
// float4x4 WorldViewProjection; // Offset: 16 Size: 64 [unused]
// float4x4 WorldView; // Offset: 80 Size: 64 [unused]
// float4x4 World; // Offset: 144 Size: 64 [unused]
// float3 CameraPosition; // Offset: 208 Size: 12
// float FarPlane; // Offset: 220 Size: 4 [unused]
// float Radius; // Offset: 224 Size: 4
// float SlopeIntensity; // Offset: 228 Size: 4
// float BumpIntensity; // Offset: 232 Size: 4 [unused]
// float Temperature; // Offset: 236 Size: 4
//
// }
//
//
// Resource Bindings:
//
// Name Type Format Dim Slot Elements
// ------------------------------ ---------- ------- ----------- ---- --------
// AnisoSampler sampler NA NA 0 1
// AnisoClampSampler sampler NA NA 1 1
// DiffuseCube texture float4 cube 0 1
// NormalCube texture float4 cube 1 1
// DiffuseArray texture float4 2darray 2 1
// $Globals cbuffer NA NA 0 1
//
//
//
// Input signature:
//
// Name Index Mask Register SysValue Format Used
// -------------------- ----- ------ -------- -------- ------ ------
// SV_POSITION 0 xyzw 0 POS float
// TEXCOORD 0 xyz 1 NONE float xyz
// TEXCOORD 2 w 1 NONE float w
// TEXCOORD 1 xyz 2 NONE float xyz
// TEXCOORD 3 w 2 NONE float w
//
//
// Output signature:
//
// Name Index Mask Register SysValue Format Used
// -------------------- ----- ------ -------- -------- ------ ------
// SV_TARGET 0 xyzw 0 TARGET float xyzw
// SV_TARGET 1 xyzw 1 TARGET float xyzw
// SV_TARGET 2 xyzw 2 TARGET float xyzw
//
ps_4_0
dcl_constantbuffer cb0[15], immediateIndexed
dcl_sampler s0, mode_default
dcl_sampler s1, mode_default
dcl_resource_texturecube (float,float,float,float) t0
dcl_resource_texturecube (float,float,float,float) t1
dcl_resource_texture2darray (float,float,float,float) t2
dcl_input_ps linear v1.xyz
dcl_input_ps linear v1.w
dcl_input_ps linear v2.xyz
dcl_input_ps linear v2.w
dcl_output o0.xyzw
dcl_output o1.xyzw
dcl_output o2.xyzw
dcl_temps 10
dp3 r0.x, v1.xyzx, v1.xyzx
rsq r0.y, r0.x
mul r1.xyz, r0.yyyy, v1.xyzx
dp3 r0.y, v2.xyzx, v2.xyzx
rsq r0.y, r0.y
mul r0.yzw, r0.yyyy, v2.xxyz
add r2.xyz, |r0.yzwy|, l(-0.200000, -0.200000, -0.200000, 0.000000)
mul r2.xyz, r2.xyzx, l(7.000000, 7.000000, 7.000000, 0.000000)
max r2.xyz, r2.xyzx, l(0.000000, 0.000000, 0.000000, 0.000000)
add r2.w, r2.y, r2.x
add r2.w, r2.z, r2.w
div r2.xyz, r2.xyzx, r2.wwww
mul r3.xyz, v1.zxxz, l(2.000000, 2.000000, 2.000000, 0.000000)
lt r4.xyz, l(0.000000, 0.000000, 0.000000, 0.000000), r0.yzwy
lt r5.xyz, r0.yzwy, l(0.000000, 0.000000, 0.000000, 0.000000)
iadd r4.xyz, r5.xyzx, -r4.xyzx
itof r4.xyz, r4.xyzx
mul r5.xz, r3.xxyx, r4.xxyx
mul r3.x, r3.z, -r4.z
sqrt r0.x, r0.x
div r0.x, r0.x, cb0[14].x
add r0.x, r0.x, l(-1.000000)
mad r2.w, cb0[14].w, l(4500.000000), l(4500.000000)
mad r4.x, -|cb0[14].w|, l(1000.000000), l(1000.000000)
max r4.x, r4.x, l(40.000000)
max r4.y, cb0[14].w, l(0.000000)
dp3 r4.z, r0.yzwy, r1.xyzx
log r4.z, |r4.z|
mul r4.z, r4.z, cb0[14].y
exp r4.z, r4.z
add r4.z, -r4.z, l(1.000000)
max r4.z, r4.z, l(0.000000)
mul r4.zw, r4.zzzz, l(0.000000, 0.000000, 3000.000000, 500.000000)
mad r4.z, r0.x, l(6371000.000000), r4.z
mad r4.y, r4.y, l(3000.000000), l(2000.000000)
add r4.y, -r4.y, r4.z
mul_sat r4.y, r4.y, l(0.000500)
add r4.z, -r4.y, l(1.000000)
mad r0.x, r0.x, l(6371000.000000), -r4.w
add r4.w, r2.w, -r4.x
add r2.w, r2.w, r4.x
add r0.x, r0.x, -r4.w
add r2.w, -r4.w, r2.w
div_sat r0.x, r0.x, r2.w
add r2.w, -r0.x, r4.z
add r4.x, -r0.x, r4.y
lt r4.y, l(0.000000), r2.w
mul r5.y, v1.y, l(-2.000000)
mov r5.w, l(0)
sample r6.xyzw, r5.xywx, t2.xyzw, s0
mov r7.xz, r5.zzwz
mul r7.y, v1.z, l(-2.000000)
sample r8.xyzw, r7.xyzx, t2.xyzw, s0
mul r3.y, v1.y, l(-2.000000)
mov r3.zw, l(0,0,0,1.000000)
sample r9.xyzw, r3.xyzx, t2.xyzw, s0
if_nz r4.y
mul r4.yzw, r2.yyyy, r8.xxyz
mad r4.yzw, r6.xxyz, r2.xxxx, r4.yyzw
mad r4.yzw, r9.xxyz, r2.zzzz, r4.yyzw
mul r4.yzw, r2.wwww, r4.yyzw
else
mov r4.yzw, l(0,0,0,0)
endif
lt r2.w, l(0.000000), r4.x
mov r5.zw, l(0,0,1.000000,2.000000)
sample r6.xyzw, r5.xyzx, t2.xyzw, s0
mov r7.w, l(1.000000)
sample r8.xyzw, r7.xywx, t2.xyzw, s0
sample r9.xyzw, r3.xywx, t2.xyzw, s0
if_nz r2.w
mul r8.xyz, r2.yyyy, r8.xyzx
mad r6.xyz, r6.xyzx, r2.xxxx, r8.xyzx
mad r6.xyz, r9.xyzx, r2.zzzz, r6.xyzx
mad r4.yzw, r4.xxxx, r6.xxyz, r4.yyzw
endif
lt r2.w, l(0.000000), r0.x
sample r5.xyzw, r5.xywx, t2.xyzw, s0
mov r7.z, l(2.000000)
sample r6.xyzw, r7.xyzx, t2.xyzw, s0
mov r3.z, l(2.000000)
sample r3.xyzw, r3.xyzx, t2.xyzw, s0
if_nz r2.w
mul r6.xyz, r2.yyyy, r6.xyzx
mad r2.xyw, r5.xyxz, r2.xxxx, r6.xyxz
mad r2.xyz, r3.xyzx, r2.zzzz, r2.xywx
mad r4.yzw, r0.xxxx, r2.xxyz, r4.yyzw
endif
if_nz cb0[0].x
mov r1.w, -r1.z
sample r2.xyzw, r1.xywx, t1.xyzw, s1
sample r1.xyzw, r1.xywx, t0.xyzw, s1
mad r2.xyz, r2.xyzx, l(2.000000, 2.000000, 2.000000, 0.000000), -r0.yzwy
add r3.xyz, -v1.xyzx, cb0[13].xyzx
dp3 r0.x, r3.xyzx, r3.xyzx
sqrt r0.x, r0.x
div r0.x, r0.x, cb0[14].x
add r0.x, r0.x, l(-0.015000)
mul_sat r0.x, r0.x, l(19.985001)
add r2.xyz, r2.xyzx, l(-1.000000, -1.000000, -1.000000, 0.000000)
mad r2.xyz, r0.xxxx, r2.xyzx, r0.yzwy
dp3 r1.w, r2.xyzx, r2.xyzx
rsq r1.w, r1.w
mul r0.yzw, r1.wwww, r2.xxyz
add r1.xyz, -r4.yzwy, r1.xyzx
mad r4.yzw, r0.xxxx, r1.xxyz, r4.yyzw
endif
add r0.x, v1.w, l(0.001000)
mul_sat r0.x, r0.x, l(1000.000000)
add r1.xyz, r4.yzwy, l(-0.000000, -0.000000, -1.000000, 0.000000)
mad o0.xyz, r0.xxxx, r1.xyzx, l(0.000000, 0.000000, 1.000000, 0.000000)
mov o0.w, l(1.000000)
mov o1.xyz, r0.yzwy
mov o1.w, l(1.000000)
mov o2.x, v2.w
mov o2.yzw, l(0,0,0,1.000000)
ret
// Approximately 117 instruction slots used
[/CODE]

I'm not really sure how to interpret this.
t2 is the texture array in question, so I guess this part is responsible for sampling:
[CODE]
lt r4.y, l(0.000000), r2.w
mul r5.y, v1.y, l(-2.000000)
mov r5.w, l(0)
sample r6.xyzw, r5.xywx, t2.xyzw, s0
mov r7.xz, r5.zzwz
mul r7.y, v1.z, l(-2.000000)
sample r8.xyzw, r7.xyzx, t2.xyzw, s0
mul r3.y, v1.y, l(-2.000000)
mov r3.zw, l(0,0,0,1.000000)
sample r9.xyzw, r3.xyzx, t2.xyzw, s0
if_nz r4.y
mul r4.yzw, r2.yyyy, r8.xxyz
mad r4.yzw, r6.xxyz, r2.xxxx, r4.yyzw
mad r4.yzw, r9.xxyz, r2.zzzz, r4.yyzw
mul r4.yzw, r2.wwww, r4.yyzw
else
mov r4.yzw, l(0,0,0,0)
endif
lt r2.w, l(0.000000), r4.x
mov r5.zw, l(0,0,1.000000,2.000000)
sample r6.xyzw, r5.xyzx, t2.xyzw, s0
mov r7.w, l(1.000000)
sample r8.xyzw, r7.xywx, t2.xyzw, s0
sample r9.xyzw, r3.xywx, t2.xyzw, s0
if_nz r2.w
mul r8.xyz, r2.yyyy, r8.xyzx
mad r6.xyz, r6.xyzx, r2.xxxx, r8.xyzx
mad r6.xyz, r9.xyzx, r2.zzzz, r6.xyzx
mad r4.yzw, r4.xxxx, r6.xxyz, r4.yyzw
endif
lt r2.w, l(0.000000), r0.x
sample r5.xyzw, r5.xywx, t2.xyzw, s0
mov r7.z, l(2.000000)
sample r6.xyzw, r7.xyzx, t2.xyzw, s0
mov r3.z, l(2.000000)
sample r3.xyzw, r3.xyzx, t2.xyzw, s0
if_nz r2.w
mul r6.xyz, r2.yyyy, r6.xyzx
mad r2.xyw, r5.xyxz, r2.xxxx, r6.xyxz
mad r2.xyz, r3.xyzx, r2.zzzz, r2.xywx
mad r4.yzw, r0.xxxx, r2.xxyz, r4.yyzw
endif
[/CODE]

It indeed doesn't seem to branch around the sample instructions: all three samples for each slice are issued before the corresponding if_nz.
I know that I'm bottlenecked on texture sampling (bandwidth) though, which isn't a big surprise because I do way too much texture sampling.
When I add the if statements the performance increases massively, so I'm assuming the GPU is somehow able to skip the sample instructions if I add these conditions?

While it can't branch over Sample() calls it can do so over [url="http://msdn.microsoft.com/en-us/library/windows/desktop/bb509698%28v=vs.85%29.aspx"]SampleGrad()[/url] calls as long as ddx and ddy are calculated outside of the if tests.

You might also find http://developer.amd.com/tools/shader/Pages/default.aspx useful - the dx assembly gets compiled again by the driver and that will show you the result for AMD cards.
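
A minimal sketch of the SampleGrad() approach, assuming a single planar projection instead of the full triplanar blend (a triplanar version would need one derivative pair per projection). SampleSliceGrad and BlendTerrain are hypothetical names; TADiffuse, TSampler and TIntensity follow the naming in the first post. Only SampleGrad, ddx and ddy are HLSL built-ins, and the code is meant to be called from a pixel shader.
[CODE]
Texture2DArray TADiffuse;
SamplerState   TSampler;

// Hypothetical helper: samples one array slice with explicit derivatives.
float3 SampleSliceGrad(float2 uv, float2 dx, float2 dy, int slice)
{
    return TADiffuse.SampleGrad(TSampler, float3(uv, slice), dx, dy).rgb;
}

float3 BlendTerrain(float2 uv, float TIntensity[5])
{
    // Derivatives are computed once, outside of any flow control,
    // so every pixel in the quad takes part in the calculation.
    float2 dx = ddx(uv);
    float2 dy = ddy(uv);

    float3 CDiffuse = 0.0;
    [unroll]
    for (int i = 0; i < 5; i++)
    {
        // With explicit derivatives the sample may stay inside the branch.
        [branch]
        if (TIntensity[i] > 0.0)
            CDiffuse += TIntensity[i] * SampleSliceGrad(uv, dx, dy, i);
    }
    return CDiffuse;
}
[/CODE]
Checking the disassembly of a shader written this way should show sample_d instructions inside the if_nz blocks rather than in front of them.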

Thanks, this is very helpful!

I've looked into the generated assembly code for ATI cards, and each iteration of my loop roughly maps to:
[CODE]
09 JUMP POP_CNT(1) ADDR(13) VALID_PIX
10 ALU: ADDR(288) CNT(9)
63 x: MOV R2.x, R14.x
y: MOV R2.y, R14.y
z: MOV R2.z, R9.z
64 x: MOV R1.x, R13.x
y: MOV R1.y, R13.y
z: MOV R1.z, R20.z
65 x: MOV R0.x, R15.x
y: MOV R0.y, R15.y
z: MOV R0.z, R10.z
11 TEX: ADDR(694) CNT(3) VALID_PIX
66 SAMPLE R0.xyz_, R0.xyzx, t2, s0 UNNORM(Z)
67 SAMPLE R1.xyz_, R1.xyzx, t2, s0 UNNORM(Z)
68 SAMPLE R2.xyz_, R2.xyzx, t2, s0 UNNORM(Z)
12 ALU_POP_AFTER: ADDR(297) CNT(12)
69 y: MUL_e ____, R6.z, R0.z
z: MUL_e ____, R6.z, R0.y
w: MUL_e ____, R6.z, R0.x
70 x: MULADD_e ____, R1.x, R6.y, PV69.w
z: MULADD_e ____, R1.z, R6.y, PV69.y
w: MULADD_e ____, R1.y, R6.y, PV69.z
71 x: MULADD_e ____, R2.y, R3.w, PV70.w
y: MULADD_e ____, R2.x, R3.w, PV70.x
w: MULADD_e ____, R2.z, R3.w, PV70.z
72 x: MULADD_e R17.x, R1.w, PV71.y, R17.x
y: MULADD_e R17.y, R1.w, PV71.x, R17.y
z: MULADD_e R17.z, R1.w, PV71.w, R17.z
[/CODE]

I don't know what all of these instructions do but
[CODE]
09 JUMP POP_CNT(1) ADDR(13) VALID_PIX
[/CODE]
looks like a conditional jump to address 13, skipping both texture sampling and related calculations for this loop iteration.

Not all hardware can physically implement such behavior, so the D3D compiler produces more generic assembly that can be used by all D3D-compatible drivers.

The hardware manufacturer knows exactly the capabilities of their cards, so they can further analyze and modify the logic like this on the driver to optimize it for specific chips.

The sampling restriction inside conditionals is in place because the hardware needs to be able to calculate the derivative of the texture coordinate in order to select the mip levels from which to sample, and conditionals that can affect said texture coordinate in any way make that derivative undefined. If you calculate the derivatives on your own, this ceases to be a problem. In addition, the HLSL compiler will actually try to move the sample instruction out of the conditionals, if there are no circular dependencies on the derivatives that prevent this.
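
To illustrate how the mip level follows from the derivatives, here is a rough sketch that picks the level by hand and then samples with SampleLevel() inside a branch. ApproxMipLevel and SampleSliceIfNeeded are hypothetical names, and the formula is only the common isotropic approximation; real hardware (especially with anisotropic filtering) does something more elaborate, so SampleGrad() is usually the better fit when filtering quality matters.
[CODE]
Texture2DArray TADiffuse;   // naming follows the first post
SamplerState   TSampler;

// Hypothetical helper: isotropic mip estimate from texel-space derivatives.
float ApproxMipLevel(float2 dx, float2 dy, float2 texSize)
{
    float2 dxTexels = dx * texSize;
    float2 dyTexels = dy * texSize;
    // Squared length of the longer footprint axis, in texels.
    float rhoSq = max(dot(dxTexels, dxTexels), dot(dyTexels, dyTexels));
    // 0.5 * log2(rho^2) == log2(rho); clamp so magnification uses level 0.
    return max(0.0, 0.5 * log2(rhoSq));
}

float3 SampleSliceIfNeeded(float2 uv, int slice, float weight)
{
    uint width, height, slices;
    TADiffuse.GetDimensions(width, height, slices);

    // Derivatives are taken outside the branch, where they are well defined.
    float lod = ApproxMipLevel(ddx(uv), ddy(uv), float2(width, height));

    float3 colour = 0.0;
    [branch]
    if (weight > 0.0)
        colour = weight * TADiffuse.SampleLevel(TSampler, float3(uv, slice), lod).rgb;
    return colour;
}
[/CODE]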

Yeah, for some reason I thought that the D3D generated assembly would be final when I started this topic.
Obviously this makes no sense because every graphics card has different capabilities and optimizations.

While I now assume that modern ATI and Nvidia cards are perfectly capable of skipping sampling instructions in similar situations, I guess I'll go the SampleGrad() route to be on the safe side.
Because if I do this for multiple texture arrays containing ~10 textures each, and some random card is lacking this capability, things will go horribly wrong.

Thanks for helping me better understand how these things work!
Cheers,
Hyu