
How does this SSAA look?


19 replies to this topic

#1 DwarvesH   Members   -  Reputation: 459

Posted 10 March 2014 - 04:05 AM

I've been developing a brand new set of shaders that can accept both traditional and physically based parameters. I'm testing a lot without diffuse textures to see how the lighting looks, so I'm only rendering with a normal map, a gray texture for the diffuse map, and an AO map.

 

But I've been noticing some pretty bad shader aliasing:

 

http://dl.dropboxusercontent.com/u/45638513/rs05.png

 

So as an experiment I decided to implement full surface super-sampling antialiasing (SSAA) in the lighting shader. Obviously, since this is surface supersampling, post-processing AA like FXAA/SMAA is not needed, and neither is Toksvig AA or any other specular AA solution. But geometry borders are not affected by surface supersampling, so for best results traditional MSAA should be used as well.
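
To make the idea concrete, here is a minimal CPU-side sketch in Python. It is not the shader from this post: `shade` is a hypothetical stand-in for a lighting function with a tight specular lobe and high-frequency normal detail, and `pixel_width` is an assumed footprint of one pixel in texture space (in HLSL that footprint would presumably come from the `ddx`/`ddy` derivatives of the texture coordinate). The only point is that the whole lighting result is evaluated at several positions inside the pixel and then averaged.

```python
import math


def shade(u: float) -> float:
    """Hypothetical lighting function: a sharp highlight driven by
    rapidly varying 'normal map' detail along one surface axis."""
    normal_tilt = math.sin(u * 400.0)      # high-frequency normal detail
    return max(0.0, normal_tilt) ** 32     # tight specular lobe, range [0, 1]


def shade_supersampled(u: float, pixel_width: float, samples: int) -> float:
    """Average `shade` over `samples` evenly spaced positions
    inside the pixel footprint centered on `u`."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    total = 0.0
    for i in range(samples):
        # Offsets are centered on the pixel: for one sample the offset is 0.
        offset = ((i + 0.5) / samples - 0.5) * pixel_width
        total += shade(u + offset)
    return total / samples


if __name__ == "__main__":
    for u in (0.300, 0.305, 0.310):
        print(u, shade(u), shade_supersampled(u, 0.02, 4))
```

With one sample this degenerates to plain shading, which is why the cost scales roughly linearly with the sample count.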

 

Here is the same shot with ridiculously high SSAA:

 

http://dl.dropboxusercontent.com/u/45638513/rs06.png

 

For real games something like this might work on dual Titans or whatnot, but an average consumer-level GPU will tank.

 

Here is a more normal example: Before:

 

http://dl.dropboxusercontent.com/u/45638513/rs07.png

 

After, with SSAA 4x:

 

http://dl.dropboxusercontent.com/u/45638513/rs08.png

 

For these screenshots I did not SSAA the AO map. I did implement this later but found that there is almost zero difference, so I wouldn't recommend it.

 

As said before, SSAA does not need specularity AA. Before with diffuse:

 

http://dl.dropboxusercontent.com/u/45638513/rs09.png

 

After:

 

http://dl.dropboxusercontent.com/u/45638513/rs10.png

 

Closeup before: http://dl.dropboxusercontent.com/u/45638513/rs11.png

Closeup after: http://dl.dropboxusercontent.com/u/45638513/rs12.png

Closeup after with SMAA: http://dl.dropboxusercontent.com/u/45638513/rs13.png

 

I'll do some heavy duty tests with a batch count of about 2500 and no instancing, comparing no AA against SSAA 4x only, to see if this is doable in real time.

 



#2 cozzie   Members   -  Reputation: 1585

Posted 10 March 2014 - 03:37 PM

Not sure if I understand the question, but I think 4x SSAA looks fine, and for sure acceptable.



#3 DwarvesH   Members   -  Reputation: 459

Posted 11 March 2014 - 09:48 AM

Not sure if I understand the question, but I think 4x SSAA looks fine, and for sure acceptable.

 

Well, the questions:

1. How does it look, obviously. With supersampling there is the danger of high frequency detail being lost or made too soft.

2. Why is the aliasing so bad with this texture? Is the texture too high contrast, or am I doing something wrong? Could someone test the Slate Tile found in https://dl.dropboxusercontent.com/u/69879086/Blog/PBR_blog.zip and see if those textures are prone to aliasing? Especially the normal map.

3. Can this be done in real time on good but not great hardware? This is a question for me to answer. This is where the 2500 batch count scene comes in, giving some rough performance numbers. I need to find out the cost of 3x-11x SSAA and see how many objects can be rendered with it. I also need to figure out the best way to tune it. SSAA can be done on each component. Doing it on env mapped objects will probably not be possible.



#4 cozzie   Members   -  Reputation: 1585

Posted 11 March 2014 - 02:56 PM

If you want to benchmark, you can compile a 'special' version of your test (running automatically) that writes some useful data to a log file (e.g. rendered frames, frame times, number of triangles). I can then run it to give input.

 

That way you know performance on my machine with 'medium' specs, I5 2320, GTX660 2GB (EVGA).

Maybe someone else can do the same on his I7 with GTX780 or something :)

 

That way you could compare the logs.
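
As a rough illustration of what such a log could contain, here is a small Python sketch. The function names, the log line layout, and the sample numbers are all made up for the example; no benchmark tool from this thread is being described. It only shows how per-frame times collected during an automatic run could be reduced to a few comparable figures.

```python
import statistics


def summarize_frame_times(frame_times_ms):
    """Return basic statistics for a list of per-frame times in milliseconds."""
    if not frame_times_ms:
        raise ValueError("need at least one frame time")
    ordered = sorted(frame_times_ms)
    avg = statistics.fmean(ordered)
    return {
        "frames": len(ordered),
        "avg_ms": avg,
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "avg_fps": 1000.0 / avg,
    }


def format_log_line(label, stats, triangles):
    """Format one run as a single log line (hypothetical layout)."""
    return (
        f"{label}: frames={stats['frames']} "
        f"avg={stats['avg_ms']:.2f}ms min={stats['min_ms']:.2f}ms "
        f"max={stats['max_ms']:.2f}ms fps={stats['avg_fps']:.1f} "
        f"triangles={triangles}"
    )


if __name__ == "__main__":
    # Invented sample data, only to show the output shape.
    run = summarize_frame_times([16.1, 16.9, 17.4, 33.0, 16.5])
    print(format_log_line("ssaa_4x", run, 250000))
```

Logging minimum and maximum frame times next to the average makes stutter visible that an FPS average alone would hide.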


Edited by cozzie, 11 March 2014 - 02:56 PM.


#5 KoldGames   Members   -  Reputation: 222

Posted 11 March 2014 - 04:23 PM

If you want to benchmark, you can compile a 'special' version of your test (running automatically) that writes some useful data to a log file (e.g. rendered frames, frame times, number of triangles). I can then run it to give input.
 
That way you know performance on my machine with 'medium' specs, I5 2320, GTX660 2GB (EVGA).
Maybe someone else can do the same on his I7 with GTX780 or something :)
 
That way you could compare the logs.


I don't know if I'd be any help but I'm also willing to run it on my PC and post the results here. :)

i7 2600k @ 3.4 GHZ
XFX HD 7850 2GB DD OC
16GB of 1600MHZ Ram

#6 fir   Members   -  Reputation: -452

Posted 11 March 2014 - 05:19 PM

 

If you want to benchmark, you can compile a 'special' version of your test (running automatically) that writes some useful data to a log file (e.g. rendered frames, frame times, number of triangles). I can then run it to give input.

That way you know performance on my machine with 'medium' specs, I5 2320, GTX660 2GB (EVGA).
Maybe someone else can do the same on his I7 with GTX780 or something :)
 
That way you could compare the logs.


I don't know if I'd be any help but I'm also willing to run it on my PC and post the results here. :)

i7 2600k @ 3.4 GHZ
XFX HD 7850 2GB DD OC
16GB of 1600MHZ Ram

 

If you can help with testing, you could run my small RAM-set benchmark

 

https://www.dropbox.com/s/d0epr8d1drsa4bs/ramset.zip

 

and say what your result is... (no malware, it just sets 1 MB of RAM to zero)



#7 DwarvesH   Members   -  Reputation: 459

Posted 12 March 2014 - 04:58 AM

OK, thanks!

 

I'll prepare a bench-marker.

 

Just give a few days to finish the shaders and make sure the gold material does not go into a hissy fit with extra diffuse lights on top of it:

 

As for the "ramset", it starts off with an average of about 120000 but eventually goes up to 190000. It must be caching.



#8 DwarvesH   Members   -  Reputation: 459

Posted 12 March 2014 - 07:26 AM

Holy shit, I need to optimize the hell out of these shaders. I did manage to squeeze 4 lights into one pass, but with 4x SSAA, fxc reports this about rendering dielectrics:


// approximately 481 instruction slots used (12 texture, 469 arithmetic)

And for metallic objects:


// approximately 628 instruction slots used (16 texture, 612 arithmetic)

BTW, there is no need for two different shaders, one can handle metallic and dielectric, but because of the performance difference I decided to split them up.

 

Just as an experiment I did squeeze in the "traditional" 8 lights in one pass, and metallic objects with 4x SSAA use 1132 instructions.

 

This is going to be one hardcore bench-marking endeavor. I also need to develop a middle-ground between no SSAA and 4x SSAA.


					
					

#9 KoldGames   Members   -  Reputation: 222

Posted 12 March 2014 - 03:20 PM


If you can help with testing, you could run my small RAM-set benchmark

https://www.dropbox.com/s/d0epr8d1drsa4bs/ramset.zip

and say what your result is... (no malware, it just sets 1 MB of RAM to zero)

 

As soon as I started it up, it was in the range of 190,000 - ~200,000.  And after a little bit, this showed up:

 

4izb.png

 


OK, thanks!
 
I'll prepare a bench-marker.

 

Cool! And no problemo dude. Happy to help. :D


Edited by KoldGames, 12 March 2014 - 03:22 PM.


#10 cozzie   Members   -  Reputation: 1585

Posted 13 March 2014 - 03:47 PM

OK, as soon as it's ready to go let us know



#11 KoldGames   Members   -  Reputation: 222

Posted 24 March 2014 - 05:48 PM

Hey! How is bench-marker coming along?



#12 DwarvesH   Members   -  Reputation: 459

Posted 26 March 2014 - 09:32 AM

Hey! How is bench-marker coming along?

 

Really really bad! :)

 

Not because of the SSAA, which seems to be pretty well behaved, but because of the other 10 billion tasks that take up my time.

 

I spent some time finalizing my permutations system with custom binary indexed shader blobs.

 

And this week, with all the groundwork done, I started the heavy duty porting of my game from XNA to SharpDX. I ported almost 200 KiB of code this week and now I need to test everything thoroughly before moving on to the rest.

 

And I couldn't even do that, because I needed a new texture manager GUI so I've spent today writing this:

http://dl.dropboxusercontent.com/u/45638513/mat05.png

 

With the bulk of that time being spent on updating the ListBox control to support multi-column modes and custom item rendering.


Edited by DwarvesH, 26 March 2014 - 09:33 AM.


#13 fir   Members   -  Reputation: -452

Posted 26 March 2014 - 09:53 AM

As for the "ramset", it starts off with an average of about 120000 but eventually goes up to 190000. It must be caching.

 

All right, that is pretty quick (I mean 0.12-0.19 milliseconds).

On my old Pentium 4 it was about 1.2 ms; on a Core 2 Duo it was about 0.26 ms (0.22-0.3).



#14 fir   Members   -  Reputation: -452

Posted 26 March 2014 - 09:58 AM

 


If you can help with testing, you could run my small RAM-set benchmark

https://www.dropbox.com/s/d0epr8d1drsa4bs/ramset.zip

and say what your result is... (no malware, it just sets 1 MB of RAM to zero)

 

As soon as I started it up, it was in the range of 190,000 - ~200,000.  And after a little bit, this showed up:

 

4izb.png

 

 

You must have accidentally pressed the 'a' key. I had some timer hooked up to that (I don't even remember what it measured), so this number is not relevant. 0.19-0.20 is the ramset number (not much better than my old Core 2 Duo here, at 0.22-0.3, mean about 0.26). This would be my 4 GB/s against your 5 GB/s and DwarvesH's 5-8 GB/s.
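
The GB/s figures follow from dividing the buffer size by the time per fill. Here is that arithmetic as a small Python sketch, under two assumptions that the thread does not state outright: that "1mb" means 1 MiB (1024 * 1024 bytes), and that the times are in milliseconds per fill as read in the posts above.

```python
BUFFER_BYTES = 1024 * 1024  # assumption: "1mb" taken as 1 MiB


def bandwidth_gb_per_s(elapsed_ms: float, buffer_bytes: int = BUFFER_BYTES) -> float:
    """Bytes written per second, in units of 10^9 bytes (GB/s)."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    seconds = elapsed_ms / 1000.0
    return buffer_bytes / seconds / 1e9


if __name__ == "__main__":
    # Times quoted in the thread, in milliseconds per fill.
    for label, ms in (("Core 2 Duo mean", 0.26), ("0.19 ms", 0.19), ("0.12 ms", 0.12)):
        print(f"{label}: {bandwidth_gb_per_s(ms):.1f} GB/s")
```

Under those assumptions 0.26 ms works out to roughly 4 GB/s, 0.19 ms to about 5.5 GB/s, and 0.12 ms to between 8 and 9 GB/s, in line with the rounded figures in the post.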



#15 DwarvesH   Members   -  Reputation: 459

Posted 02 April 2014 - 06:07 AM

OK, I am almost at the point where I can start the final bench-marking and assess whether the engine has adequate performance or not.

 

But before that I had to write and finish the most powerful forward renderer I was able to.

 

I created this test scene: 

http://dl.dropboxusercontent.com/u/45638513/l01.png

 

Since this is forward rendering, using multiple point lights is problematic if a lot of them affect a single object. So I'm rendering 100 floor tiles there, and 100 lights. The lights are chosen to be problematic, i.e. they are sized so that each one affects a surprisingly large radius around it.

 

For simplicity, point lights are blended on top of the ambient + directional result, so if an object is affected by at least 1 point light, it gets drawn again. So the 100 floor tiles will result in at least 200 draw calls.

 

All the lights are 100% dynamic and so are the objects. Very powerful optimizations can be achieved for static scenes, but I don't care about those. So for dynamic scenes, a spatial partitioning scheme tells me every frame which lights affect which objects, and the engine takes care of batching and blending. The partitioning scheme is pretty fast and should handle thousands of lights spread over realistic levels. Since there is overlap in the lighting, the 200 draw calls become 267. Most floor tiles are affected by at least 9 point lights, sometimes more.

 

The initial version of my scheme used 1268 draw calls to render 100 objects using this lighting setup. 267 performs a lot better.

 

One large compromise that was needed to allow this was introducing a clip radius for point lights. Beyond that radius no pixels are affected. This is not physically based, but it is needed to optimize dynamic lights.
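
The post does not say which falloff is used, so the following Python sketch is only one possible way to get a light that reaches exactly zero at a clip radius. The function name, the inverse-square-style falloff, and the smooth window term are all assumptions chosen for illustration.

```python
def attenuation(distance: float, light_radius: float, clip_radius: float) -> float:
    """Illustrative point light falloff that is exactly 0 at and beyond clip_radius.

    falloff: inverse-square-style term scaled by light_radius (assumed form).
    window:  smooth term that fades to 0 at clip_radius, so there is no
             visible hard edge where the light is cut off.
    """
    if light_radius <= 0 or clip_radius <= 0:
        raise ValueError("radii must be positive")
    if distance >= clip_radius:
        return 0.0  # nothing beyond the clip radius is lit
    falloff = 1.0 / (1.0 + (distance / light_radius) ** 2)
    window = (1.0 - (distance / clip_radius) ** 2) ** 2
    return falloff * window


if __name__ == "__main__":
    for d in (0.0, 2.5, 5.0, 7.5, 10.0):
        print(d, round(attenuation(d, light_radius=5.0, clip_radius=10.0), 4))
```

Because the result is guaranteed to be zero past the clip radius, any object whose bounds lie entirely outside that sphere can skip the light without changing the image, which is what makes the culling safe.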

 

Now to test with some real life scenes, some rooms and corridors.

 

So my question is: can one achieve a better result without exponentially more effort put into it? These results look pretty good to me. Forward rendering will never have such a batch count as deferred (o + l), but at least I'm not in o * l territory. o + l gives 200.
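
For reference, the two bounds mentioned here can be written out for the test scene. This Python sketch only restates the arithmetic from the post (100 objects, 100 lights); the function names are made up.

```python
def draw_calls_object_times_light(objects: int, lights: int) -> int:
    """Worst case when every object is redrawn once for every light: o * l."""
    return objects * lights


def draw_calls_object_plus_light(objects: int, lights: int) -> int:
    """Each object drawn once plus one draw per light: o + l."""
    return objects + lights


if __name__ == "__main__":
    o, l = 100, 100
    print("o * l =", draw_calls_object_times_light(o, l))
    print("o + l =", draw_calls_object_plus_light(o, l))
```

The reported 267 draw calls sit close to the o + l figure of 200 and far below the o * l worst case of 10000.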



#16 Hodgman   Moderators   -  Reputation: 29514

Posted 02 April 2014 - 06:30 AM


These results look pretty good to me. Forward rendering will never have such a batch count as deferred (o + l)
Forward+ has one less (o) than tiled-deferred (o + 1), which has many less than deferred (o + l) ;)

#17 DwarvesH   Members   -  Reputation: 459

Posted 03 April 2014 - 06:20 AM

 


These results look pretty good to me. Forward rendering will never have such a batch count as deferred (o + l)
Forward+ has one less (o) than tiled-deferred (o + 1), which has many less than deferred (o + l) ;)

 

 

Well I'm still on DirectX 9 so Forward+ is out of the question.

 

Anyway, that is far too complicated. And far too little documentation on the subject. I'll probably use something like that when it's as common and well documented as physically based rendering with optional material/BRDF layering.

 

And with the way I'm trying to render things, even deferred becomes far too complicated.

 

I'm just trying to create the best possible forward renderer under the circumstances that handles simple but fully dynamic and flexible scenes. I could push the render calls to "o" levels, as in rendering every single object once with ambient layers, directional and any number of point lights all using a single pixel shader, but that is much more work and I'm not sure it is worth it.



#18 Mona2000   Members   -  Reputation: 590

Posted 03 April 2014 - 06:45 AM

 

Forward+ has one less (o) than tiled-deferred (o + 1), which has many less than deferred (o + l) ;)

 

Isn't that notation misleading? Forward+ is actually o * 2 (depth pass + main pass).



#19 Hodgman   Moderators   -  Reputation: 29514

Posted 03 April 2014 - 07:22 AM

Isn't that notation misleading? Forward+ is actually o * 2 (depth pass + main pass).

On Dx9 you can't use a compute shader to build the per-tile light lists (which would use a depth pass as input), so I'd build them on the CPU (like here) and then just render the scene as usual with forward rendering. If you're doing a full 11 version, then yep, I misspoke :)
 
You could do a z-pass first to see if it helps reduce overdraw, but that's an optional optimization (you can do the same optimization for deferred if your g-buffer/attribute pass is expensive due to overdraw, e.g. if everything is parallax mapped).
 
Regular forward is the same though -- you build per object light lists and then draw every object once (or twice if you decide to do a z-pre-pass).

I could push the render calls to "o" levels, as in rendering every single object once with ambient layers, directional and any number of point lights all using a single pixel shader, but that is much more work and I'm not sure it is worth it.

What do you do at the moment - one pass per light per object? Or is there some amount of looping to do multiple lights per draw?


Edited by Hodgman, 03 April 2014 - 07:35 AM.


#20 DwarvesH   Members   -  Reputation: 459

Posted 03 April 2014 - 08:51 AM

DwarvesH, on 03 Apr 2014 - 3:20 PM, said:
I could push the render calls to "o" levels, as in rendering every single object once with ambient layers, directional and any number of point lights all using a single pixel shader, but that is much more work and I'm not sure it is worth it.
What do you do at the moment - one pass per light per object? Or is there some amount of looping to do multiple lights per draw?

 

Currently I draw every object at least once. The first pass has ambient and directional lights, the things that are constant.

 

Point lights are drawn in another pass. When lights move in the world, a spatial partitioning scheme is updated. Then each object can easily consult the spatial partitioning scheme to determine potential light sources. These potential light sources are then culled based on an intersection test between the object's bounding box and the light's bounding sphere.
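
A bounding sphere versus axis-aligned bounding box test of this kind can be sketched as follows. This is a generic Python version of the standard closest-point test, not the engine's code. The dictionary layout for lights and the use of the clip radius as the sphere radius are assumptions, and the list of candidates stands in for whatever the spatial partitioning scheme returns.

```python
def sphere_intersects_aabb(center, radius, box_min, box_max):
    """True if a sphere touches or overlaps an axis-aligned box.

    Clamps the sphere center to the box to find the closest point on the
    box, then compares the squared distance to the squared radius.
    """
    dist_sq = 0.0
    for c, lo, hi in zip(center, box_min, box_max):
        nearest = min(max(c, lo), hi)
        dist_sq += (c - nearest) ** 2
    return dist_sq <= radius * radius


def lights_affecting(box_min, box_max, candidate_lights):
    """Filter candidate lights down to those whose sphere reaches the box.

    Each light is a dict with 'position' (x, y, z) and 'clip_radius';
    this layout is hypothetical.
    """
    return [
        light
        for light in candidate_lights
        if sphere_intersects_aabb(light["position"], light["clip_radius"], box_min, box_max)
    ]


if __name__ == "__main__":
    tile_min, tile_max = (0.0, 0.0, 0.0), (1.0, 0.1, 1.0)
    candidates = [
        {"position": (0.5, 1.0, 0.5), "clip_radius": 2.0},   # reaches the tile
        {"position": (9.0, 1.0, 9.0), "clip_radius": 2.0},   # too far away
    ]
    print(len(lights_affecting(tile_min, tile_max, candidates)))
```

Comparing squared distances avoids a square root per test, which matters when the test runs for every object and candidate light each frame.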

 

One pass has an arbitrary maximum number of point lights, currently 10. If an object is lit by more than 10 point lights the engine will use a third pass. Any number of lights is supported this way, but I'm hoping that in practice most objects will be lit by at most a handful of lights. There are basically no point lights outdoors except in special places, and indoors each room is lit separately.

 

There is no looping; each light setup has a loop-less pixel shader. For each render pass, two pseudo constant buffers (DirectX 9 pixel shader constants) are set: one float4 packing light position and light radius, and one float4 packing light color and light clip radius.

 

So basically one pass + one pass for every 10 point lights. All passes set the vertex shader and pixel shader constant buffers only once, except for the second pass, which sets two extra vectors.
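
The pass count and the packing described above can be sketched on the CPU side like this. It is a Python illustration under assumptions: the dictionary keys and function names are invented, and the real engine uploads the data as DirectX 9 pixel shader constants rather than building Python lists. Only the limit of 10 lights per pass and the two float4 layouts come from the post.

```python
import math

MAX_POINT_LIGHTS_PER_PASS = 10  # limit stated in the post


def pass_count(num_point_lights: int) -> int:
    """One ambient + directional pass, plus one pass per started group of 10 point lights."""
    return 1 + math.ceil(num_point_lights / MAX_POINT_LIGHTS_PER_PASS)


def pack_light_batches(lights):
    """Split lights into per-pass batches and pack each light into two float4 tuples.

    Returns a list of (pos_radius, color_clip) pairs, one pair per point light pass:
      pos_radius[i] = (x, y, z, radius)
      color_clip[i] = (r, g, b, clip_radius)
    """
    batches = []
    for start in range(0, len(lights), MAX_POINT_LIGHTS_PER_PASS):
        chunk = lights[start:start + MAX_POINT_LIGHTS_PER_PASS]
        pos_radius = [(*light["position"], light["radius"]) for light in chunk]
        color_clip = [(*light["color"], light["clip_radius"]) for light in chunk]
        batches.append((pos_radius, color_clip))
    return batches


if __name__ == "__main__":
    lights = [
        {"position": (float(i), 1.0, 0.0), "radius": 2.0,
         "color": (1.0, 0.9, 0.8), "clip_radius": 6.0}
        for i in range(23)
    ]
    print(pass_count(len(lights)), [len(b[0]) for b in pack_light_batches(lights)])
```

Packing the radius into the fourth component of the position and the clip radius into the fourth component of the color keeps each light at two constant registers.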

 

So basically pretty complicated, but it gives far better results than the things I tried before.

 

Potential future directions:

  • add a few extra permutations to handle common things like ambient + 1 directional + up to 3-5 point lights in one pass. Each lighting setup change requires a pixel shader change.
  • add all permutations and render all lights in one pass, with a maximum global point light count. Each lighting setup change requires a pixel shader change.
  • replace the permutations with a loop and some sort of dynamic branching/break. One single pixel shader per material type.




