# DirectX 11 porting & other changes

It's been a long time since my last post and therefore I will dedicate my first paragraph to coming up with lame excuses for it.

My only excuse, actually, is that I have had very little spare time since I moved to the US with my wife, and I tried to spend what I had on the development of my engine.

I focused mainly on porting everything from XNA to C++ / DirectX 11 and on optimizations. There are not many new features in the new version, except for the chromatic effect of shallow water caused by light scattering inside the water body. The main difference is that the new version is much faster thanks to a lot of optimizations. The previous XNA version ran at about 21-23 fps in the most GPU-intensive scenes, whereas the new DX11 version easily does 50-55 fps in the same scenes. It is now mostly CPU bound (the GPU is about 40% idle), which means there is a lot more headroom on the GPU for other things in the future. BTW, my dev machine is a laptop with a GeForce GTX 460M, so it's not exactly a top-of-the-line GPU.

In the remainder of this post I will describe some of the major changes that I've made and some of the challenges that I encountered during the porting process.

# Terrain generation

In the new version of my engine I moved a lot of calculations that were previously done with full-screen quads to compute shaders. One of them is the procedural terrain generation. I also thought I could take advantage of the integer math operations that are new in DX11 to compute the pseudo-random numbers directly in the compute shader instead of sampling a texture of precalculated values. I did that, but I didn't notice any big improvement. I have not run a thorough test yet, so this is not a final verdict, but I suspect that calculating a random number is not much faster than sampling a texture with filtering disabled.
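To illustrate what computing pseudo-random numbers with integer math can look like, here is a minimal CPU-side C++ sketch. It uses the well-known Wang integer hash as a stand-in; the post does not say which hash the engine uses, and the names `lattice_hash` and `hash_to_unit_float` are made up for this example. The same operations (shifts, XORs, 32-bit multiplies) are the kind of integer math a DX11 compute shader can do.

```cpp
#include <cstdint>

// Wang integer hash: scrambles a 32-bit value using only integer operations.
// Assumption: a stand-in for whatever hash the engine really uses.
inline std::uint32_t wang_hash(std::uint32_t seed) {
    seed = (seed ^ 61u) ^ (seed >> 16);
    seed *= 9u;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2du;
    seed = seed ^ (seed >> 15);
    return seed;
}

// Hypothetical helper: hash a 2D lattice coordinate plus a seed by
// chaining the 1D hash.
inline std::uint32_t lattice_hash(std::uint32_t x, std::uint32_t y,
                                  std::uint32_t seed) {
    return wang_hash(x ^ wang_hash(y ^ wang_hash(seed)));
}

// Hypothetical helper: map a hash to a float in [0, 1) by keeping the
// top 24 bits, which a 32-bit float can represent exactly.
inline float hash_to_unit_float(std::uint32_t h) {
    return static_cast<float>(h >> 8) * (1.0f / 16777216.0f);
}
```

A value for a terrain lattice point would then be `hash_to_unit_float(lattice_hash(x, y, seed))`, which replaces one texture fetch with a handful of ALU instructions. That trade is consistent with not seeing a big difference either way.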

# The Ocean

While porting the ocean code I also noticed that the FFTs could be done in a compute shader, which should be a lot faster than the pixel shader approach that Bruneton used in his code and which I also used in the XNA version of the engine. Googling around, I stumbled upon the code from the FFT ocean demo in the NVIDIA SDK 11, which implements a 2D radix-8 FFT. That means it can only transform 2D maps whose width and height are both powers of 8, for example 64x64, 512x512, 4096x4096 etc. The problem was that I was using a 256x256 wave spectrum, and 256 is not a power of 8, so it could not be transformed with the NVIDIA code. So I had the option to either move to a 512x512 spectrum or use a radix-4 or radix-2 FFT. I searched the web for a compute shader implementation of a radix-2 or radix-4 transform but couldn't find anything. In conclusion, if I wanted to stick to the 256x256 spectrum, I had to write my own FFT code, and I was in no mood to do that. I tried it once and, after many days of headaches, I managed to write a 1D radix-2 FFT, but it was not easy. Going from one dimension to two makes an FFT implementation a lot more complicated, so I decided to move to a 512x512 map and use the NVIDIA code. I figured that if it proved to be too slow, I would move to a 256x256 map later.
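For reference, this is roughly what a 1D radix-2 FFT looks like on the CPU. It is a minimal sketch of the textbook iterative Cooley-Tukey algorithm in C++, not the code from the engine and not the NVIDIA shader; the name `fft_radix2` is made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Complex = std::complex<double>;

// In-place forward FFT (iterative radix-2 Cooley-Tukey).
// The length of `data` must be a power of 2.
void fft_radix2(std::vector<Complex>& data) {
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0);

    // Reorder the input into bit-reversed index order.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            std::swap(data[i], data[j]);
        }
    }

    // log2(n) butterfly stages; each stage doubles the sub-transform length.
    const double pi = std::acos(-1.0);
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double angle = -2.0 * pi / static_cast<double>(len);
        const Complex w_len(std::cos(angle), std::sin(angle));
        for (std::size_t start = 0; start < n; start += len) {
            Complex w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const Complex u = data[start + k];
                const Complex v = data[start + k + len / 2] * w;
                data[start + k] = u + v;
                data[start + k + len / 2] = u - v;
                w *= w_len;
            }
        }
    }
}
```

Each iteration of the outer `len` loop corresponds to one stage of the transform, which is what a GPU implementation typically turns into one pass over the data.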

There was actually another option as well. I noticed there is a new interface in the DX11 SDK called ID3DX11FFT. However, it seems that it can only transform one spectrum at a time and I have 6 of them. This means I would need to issue 6 transform commands, whereas the NVIDIA FFT code can easily be modified to transform all 6 of them in one step. The NVIDIA FFT also has the advantage of using a radix-8 algorithm, which means it only needs 6 Dispatch calls of 512x512 for a 512x512 spectrum (3 passes per dimension), whereas a radix-2 FFT (like the one Bruneton used and which, I suspect, the ID3DX11FFT interface also uses) would require 18 Dispatch calls of the same size (9 passes per dimension). I could also be wrong, and the DX11 interface could be smarter than that and use a different radix for different spectrum sizes, but I couldn't find anything on the web that describes how it works internally. It also appears that no one has ever used it, and that's just weird.
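The size restriction and the pass counts can be checked with a few lines of C++. This is a small sketch that assumes one pass per radix stage per dimension; the function names are made up for this example, and real implementations may merge or split passes differently.

```cpp
#include <cassert>

// Number of radix-`radix` stages needed for a 1D transform of length n,
// or -1 if n is not a power of `radix`.
int fft_stage_count(unsigned n, unsigned radix) {
    assert(radix >= 2);
    if (n == 0) {
        return -1;
    }
    int stages = 0;
    while (n > 1) {
        if (n % radix != 0) {
            return -1;  // n is not a power of the radix
        }
        n /= radix;
        ++stages;
    }
    return stages;
}

// A 2D transform of a size x size map runs the 1D stages once along
// the rows and once along the columns.
int fft2d_pass_count(unsigned size, unsigned radix) {
    const int stages = fft_stage_count(size, radix);
    return stages < 0 ? -1 : 2 * stages;
}
```

Under that assumption, `fft2d_pass_count(512, 8)` gives 6, while `fft2d_pass_count(256, 8)` gives -1 because 256 is not a power of 8. A 256x256 map would work with radix 4 (`fft2d_pass_count(256, 4)` gives 8) or with radix 2.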

Bottom line: my new version uses a 512x512 spectrum transformed with a radix-8 FFT compute shader instead of a 256x256 spectrum transformed with radix-2 pixel shader code, and the new one is a lot faster. For the future, it would be interesting to experiment a bit with the DX11 FFT interface to see if it computes a 256x256 FFT faster than the NVIDIA code computes a 512x512 one. I don't really need a 512x512 map, since the gain in visual quality is negligible, so I would prefer a 256x256 transform even if it's only 10% faster. I would also like to write my own radix-4 FFT code one day, just for the sake of it and to prove to myself that I can do it. On the other hand, I fear I might waste too much valuable time doing it.

# Deferred rendering (shading)

There is not much to say about this except for the fact that I'm using it now. If you don't know what deferred rendering is, read this article to get the basic idea. I initially implemented it because I wanted to avoid running all the expensive atmospheric scattering and water shading computations for pixels that eventually get occluded anyway. Later, I came to realize that this problem is already mostly taken care of by early-z rejection and by sorting the objects front to back before rendering. However, I am giving deferred rendering another chance because it might prove useful later, when I will need to render scenes with multiple small lights, like indoor scenes. For the moment, I only have outdoor scenes with a single big light source.
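The trade-off can be illustrated with a toy C++ model that only counts how often the expensive shading would run. This is a deliberately simplified sketch, not how a GPU or the engine works internally: the "forward" path shades every fragment that passes the depth test at the moment it is drawn, and the "deferred" path shades each covered pixel once after all geometry has been written. All names are made up for this example.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct Fragment {
    std::size_t pixel;  // index of the pixel this fragment covers
    float depth;        // smaller means closer to the camera
};

struct ShadeCounts {
    std::size_t forward;   // shading runs with a depth test before shading
    std::size_t deferred;  // shading runs in a deferred lighting pass
};

// Fragments are processed in the order given, which models draw order.
ShadeCounts count_shading(const std::vector<Fragment>& fragments,
                          std::size_t pixel_count) {
    const float k_far_depth = std::numeric_limits<float>::infinity();
    std::vector<float> depth(pixel_count, k_far_depth);
    ShadeCounts counts{0, 0};

    for (const Fragment& f : fragments) {
        if (f.depth < depth[f.pixel]) {
            depth[f.pixel] = f.depth;
            ++counts.forward;  // passed the depth test, so it gets shaded
        }
    }

    // Deferred: one shading run per pixel that received any geometry.
    for (float d : depth) {
        if (d < k_far_depth) {
            ++counts.deferred;
        }
    }
    return counts;
}
```

With three overlapping fragments on one pixel drawn back to front, the forward count is 3 and the deferred count is 1. Drawn front to back, both counts are 1, which is why sorting plus early-z already removes most of the wasted shading.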

# Occlusion culling

For those who do not know what occlusion culling is, it's exactly what the name says: culling (not rendering) objects that are occluded by other objects in the scene.

I always wanted to give this a try and I finally did. It took me a lot of work, but I am really pleased with the results. In some scenes, the frame rate almost doubled. Basically, I use hardware occlusion queries on OBBs (oriented bounding boxes), which are calculated for each terrain node. I ran into some interesting problems during the implementation of this feature, which I will describe in more detail in my next post (which will be soon, I promise).
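As a small illustration of the geometry involved, here is a C++ sketch that computes the 8 corners of an OBB, which is the kind of proxy box one would draw inside an occlusion query. The representation (center, three orthonormal axes, half extents) and all names are assumptions made for this example; the post does not describe how the engine stores its boxes.

```cpp
#include <array>
#include <cstddef>

struct Vec3 {
    float x, y, z;
};

inline Vec3 operator+(Vec3 a, Vec3 b) {
    return Vec3{a.x + b.x, a.y + b.y, a.z + b.z};
}

inline Vec3 operator*(Vec3 v, float s) {
    return Vec3{v.x * s, v.y * s, v.z * s};
}

// Hypothetical OBB layout: a center, three orthonormal axes and the
// half size of the box along each axis.
struct Obb {
    Vec3 center;
    std::array<Vec3, 3> axes;
    Vec3 half_extents;
};

// Returns the 8 corners. Bit 0, 1 and 2 of the corner index select the
// negative or positive side along axes 0, 1 and 2.
std::array<Vec3, 8> obb_corners(const Obb& box) {
    std::array<Vec3, 8> corners{};
    for (std::size_t i = 0; i < 8; ++i) {
        const float sx = (i & 1u) ? 1.0f : -1.0f;
        const float sy = (i & 2u) ? 1.0f : -1.0f;
        const float sz = (i & 4u) ? 1.0f : -1.0f;
        corners[i] = box.center
                   + box.axes[0] * (sx * box.half_extents.x)
                   + box.axes[1] * (sy * box.half_extents.y)
                   + box.axes[2] * (sz * box.half_extents.z);
    }
    return corners;
}
```

Rendering that box with color and depth writes disabled between the query's begin and end calls lets the GPU report whether any of its pixels passed the depth test. If none did, the terrain node behind it can be skipped.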

In the meantime, here is a video of my latest version:
