Oogst

Performance difference between scissor and viewport?


I would like to make certain objects render only in certain parts of the screen. Basically exactly what the scissor rect can do. However, the same thing can also be achieved by setting the viewport to a smaller region and compensating the matrices for the offset. Does this give better performance? If I understand the documentation correctly, the scissor test is done per fragment while the viewport is applied per vertex. This suggests that using the viewport instead of the scissor rect could save a significant amount of fillrate. Is this true, or am I misunderstanding it? The scissor rect is a lot easier to implement, so I only want to use the viewport for this if I am actually gaining something that way.

 

(Note that this is for our 2D game, in which fillrate is currently the main performance bottleneck. The fragment shaders are extremely simple, so basically most performance goes into large amounts of overdraw of partially transparent objects.)
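
For context, here is a minimal sketch of the two approaches being compared, assuming an OpenGL-style API; the rectangle values and the function names around the calls are made up for illustration:

```cpp
// Minimal sketch of the two approaches, assuming an OpenGL-style API.
// The rectangle values and the function names are illustrative only.
#include <GL/gl.h>

void drawRegionWithScissor()
{
    // Scissor test: fragments outside the rect are discarded per fragment;
    // the matrices stay untouched.
    glEnable(GL_SCISSOR_TEST);
    glScissor(100, 100, 640, 360);               // x, y, width, height in pixels
    // ... draw the objects that belong to this region ...
    glDisable(GL_SCISSOR_TEST);
}

void drawRegionWithViewport(int screenWidth, int screenHeight)
{
    // Smaller viewport: the vertex transform maps into this region, so the
    // projection (or an extra offset matrix) has to be compensated to keep
    // the objects in the same place on screen.
    glViewport(100, 100, 640, 360);
    // ... adjust the matrices for the offset, then draw ...
    glViewport(0, 0, screenWidth, screenHeight); // restore the full viewport
}
```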


Okay, I'll just happily use scissors then. Thanks! :)

 

Scissors would probably be better. The scissor test prevents the renderer from filling anything out of bounds.

I think the viewport still fills.

 

From how I understand it, the viewport might sometimes fill outside its bounds, for example when doing a wireframe render with thick lines, because viewports operate at the vertex level. But viewports don't simply fill all of an object's pixels that fall outside the viewport bounds: that only happens in rare edge cases like that wireframe one.



(Note that this is for our 2D game, in which fillrate is currently the main performance bottleneck. The fragment shaders are extremely simple, so basically most performance goes into large amounts of overdraw of partially transparent objects.)

Have you tried screen-space tiling, such that all frame buffer reads and writes result in an FB/DB (frame buffer / depth buffer) cache hit?


What is "screen space tiling"? If I Google for this I get hits for tile based renderers but as far as I know that is a type of graphics hardware, so I suppose you mean something else?


Sorry, I might be using my own terminology, not sure. The basic idea for 2D (roughly sketched in code after the list) is:

1. Figure out how big a tile can fit in the GPU ROP caches (I think that is the proper term).

2. Divide the screen into tiles of that size or smaller. Create a bin per tile.

3. For all your alpha-blended quads (say), find out which tiles they affect and add them to those bins.

4. Set the scissor rect for a tile and draw all the geometry in that tile's bin.

5. Repeat for all tiles.
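
Here is a hedged sketch of that loop in C++, assuming an OpenGL-style scissor call; the Quad struct, drawQuad() and the tile size are placeholders rather than anything from this thread:

```cpp
// Rough sketch of the binning loop, assuming an OpenGL-style scissor call.
// Quad, drawQuad() and the tile size are placeholders, not part of any real engine.
#include <GL/gl.h>
#include <algorithm>
#include <vector>

struct Quad { float x0, y0, x1, y1; };          // screen-space bounds of a sprite

void drawQuad(const Quad& q);                    // engine-specific, assumed to exist

void drawTiled(const std::vector<Quad>& quads,
               int screenW, int screenH,
               int tileW, int tileH)             // e.g. 128x128, tuned to the ROP caches
{
    const int tilesX = (screenW + tileW - 1) / tileW;
    const int tilesY = (screenH + tileH - 1) / tileH;

    // Step 2: one bin of quad indices per tile.
    std::vector<std::vector<int>> bins(tilesX * tilesY);

    // Step 3: add each quad to every tile its bounds overlap
    // (off-screen quads are assumed to have been culled already).
    for (int i = 0; i < (int)quads.size(); ++i)
    {
        const Quad& q = quads[i];
        const int minTx = std::max(0, (int)(q.x0 / tileW));
        const int maxTx = std::min(tilesX - 1, (int)(q.x1 / tileW));
        const int minTy = std::max(0, (int)(q.y0 / tileH));
        const int maxTy = std::min(tilesY - 1, (int)(q.y1 / tileH));
        for (int ty = minTy; ty <= maxTy; ++ty)
            for (int tx = minTx; tx <= maxTx; ++tx)
                bins[ty * tilesX + tx].push_back(i);
    }

    // Steps 4 and 5: per tile, set the scissor rect and draw that tile's bin.
    glEnable(GL_SCISSOR_TEST);
    for (int ty = 0; ty < tilesY; ++ty)
        for (int tx = 0; tx < tilesX; ++tx)
        {
            const std::vector<int>& bin = bins[ty * tilesX + tx];
            if (bin.empty())
                continue;
            glScissor(tx * tileW, ty * tileH, tileW, tileH);
            for (int idx : bin)
                drawQuad(quads[idx]);
        }
    glDisable(GL_SCISSOR_TEST);
}
```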



That explodes the number of rendercalls, unless the GPU ROP caches are quite big. Is this actually a good idea when rendering 500 objects per frame?

Yes, it increases the number of draw calls, but depending on what type of 2D game you are making this might not be an issue. Then of course there is always instancing with texture atlases, or indirect draw calls with texture atlases. ROP cache sizes vary depending on the hardware, but you can disable the technique based on detected hardware, a benchmark, or a user option. I hear DX11 can hit 10,000 draw calls a frame normally... but like I said, there is always instancing or indirect draw calls. How is your performance looking now, and on what hardware?
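
As a rough illustration of the instancing idea (a sketch only, assuming a GL 3.3-style API loaded through GLEW rather than the D3D9 path; SpriteInstance, vao, instanceVbo and the shader that reads the per-instance attributes are all hypothetical names):

```cpp
// Sketch of the instancing idea: one draw call for many sprites, with the
// atlas sub-rectangle supplied per instance. Assumes a GL 3.3-style API loaded
// through GLEW; the VAO is assumed to have been set up at init time with
// glVertexAttribDivisor(attr, 1) for the per-instance attributes.
#include <GL/glew.h>
#include <vector>

struct SpriteInstance
{
    float posScale[4];   // x, y, width, height in screen space
    float atlasRect[4];  // u0, v0, u1, v1 into the atlas texture
};

void drawSprites(const std::vector<SpriteInstance>& sprites,
                 GLuint vao, GLuint instanceVbo)
{
    // Upload this frame's per-instance data.
    glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
    glBufferData(GL_ARRAY_BUFFER,
                 sprites.size() * sizeof(SpriteInstance),
                 sprites.data(), GL_STREAM_DRAW);

    // One quad (4 vertices as a triangle strip), one instance per sprite.
    glBindVertexArray(vao);
    glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, (GLsizei)sprites.size());
}
```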


Performance is not so bad that I need to try drastic solutions like that one right now, but it is an interesting thought. What kind of order of magnitude are we talking about for how many tiles one would need for a 1920x1080 screen? Order of 10x6 tiles? You mentioned it varies per hardware, but right now I don't even know whether we are talking about 2x2 tiles or 100x100 tiles.

 

(The game is Awesomenauts by the way, and it is in DirectX9 and OpenGL.)


I only have one data point for you... on AMD GCN 1 there is 4 KB of depth cache and 16 KB of frame buffer cache per ROP partition (not per ROP). I think it's 4 ROPs per ROP partition, and I think the product lineup is 8, 16, 32, and 64 ROPs. So a 32-ROP card has 8 ROP partitions... 32 KB of depth cache and 128 KB of FB cache. Math-wise that's a 64x128 tile for Z and a 128x128 tile for the frame buffer. Normally I would go with the smaller number, but someone who actually experimented said they were fine with 128x128 tiles and attributed it to Z-buffer compression. Anyway, using the 64x128 tile size, a 1080p resolution comes out to about 254 tiles. BTW check my math, I did it all in my head and it's been a while.
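
As a quick sanity check of that last figure: a 64x128 tile covers 8,192 pixels and a 1920x1080 screen is 2,073,600 pixels, so 2,073,600 / 8,192 ≈ 253, which matches the "about 254 tiles" estimate (rounding each axis up to whole tiles pushes it slightly higher).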

 

BTW I got Awesomenauts in a Humble Bundle, but I gave it away since I don't play MOBAs. Congrats!

 

edit - I might be wrong on the product lineup; I didn't look it up, just did it from scattered memories.



That's a lot of tiles... Of course not every object needs to render on every tile, but this would still explode the number of rendercalls. I can see how this can work, but I think I'd need to rebuild half the engine to use this efficiently (including switching to DirectX11).


Of course not every object needs to render on every tile, but this would still explode the number of rendercalls.

That's why I waited for DX12 before I even considered prototyping a test; I'm in the middle of learning DX12 now. But like I said before, instancing with a texture atlas and indirect draws with a texture atlas can reduce draw calls.

I can see how this can work, but I think I'd need to rebuild half the engine to use this efficiently (including switching to DirectX11).

Well think of it as something to consider prototyping for the future... DX9 is getting long in the tooth anyway.

 

edit - BTW you could use compute shaders as well. For example, the GPU particles sample here http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/#downloadsamples uses compute shader shared memory to the same effect.


