


#4987026 ShaderResourceView w/ D3D11_TEX2D_SRV

Posted by MJP on 05 October 2012 - 12:24 AM

Texture arrays are intended for cases where the shader needs to select a single texture from an array at runtime, using an index. Usually this is for the purpose of batching. For instance, if you had 5 textured meshes and you wanted to draw them all in one draw call, you could use instancing and then select the right texture from an array using the index of the instance.

In your case for a tetris game, I don't think it would be necessary. You probably won't ever need to batch with instancing, in which case texture arrays won't give you any performance advantage. You should be fine with just creating a bunch of textures, and then switching textures between each draw call.

#4987016 Beginner Question: Why do we use ZeroMemory macro for Swap Chain object ?

Posted by MJP on 04 October 2012 - 11:25 PM

It's just a way of initializing the structure data, since the struct doesn't have a constructor. This has been considered the idiomatic way to initialize Win32 structures for as long as I can remember. You don't have to do it if you don't want to; you just need to make sure that you set all of the members of the struct.

#4985758 Can you use tessellation for gpu culling?

Posted by MJP on 01 October 2012 - 08:22 AM

Geometry shaders in general are typically not very fast, and stream out can make it worse because of all the memory traffic. IMO it's a dead end if you're interested in scene traversal/culling on the GPU. Instead I would recommend trying a compute shader that performs the culling, and then fills out a buffer with "DrawInstancedIndirect" or "DrawIndexedInstancedIndirect" arguments based on the culling results. I'd suspect that could actually be really efficient if you're already using a lot of instancing.

In general you don't want to just draw broad conclusions like "the CPU is better than the GPU for frustum culling", because it's actually a complex problem with a lot of variables. Whether or not it's worth it to try doing culling on the GPU will depend on things like:
  • Complexity of the scene in terms of number of meshes and materials
  • What kind of CPU/GPU you have
  • How much frame time is available on the CPU vs. GPU
  • What feature level you're targeting
  • How much instancing you use
  • Whether or not you use any spatial data structures that could possibly accelerate culling
  • How efficiently you implement the actual culling on the CPU or GPU
One thing that can really tip the scales here is that currently even with DrawInstancedIndirect there's no way to avoid the CPU overhead of draw calls and binding states/textures if you perform culling on the GPU. This is why I mentioned that it would probably be more efficient if you use a lot of instancing, since your CPU overhead will be minimal. Another factor that can play into this heavily is if you wanted to perform some calculations on the GPU that determine the parameters of a view or projection used for rendering, for instance something like Sample Distribution Shadow Maps. In that case performing culling on the GPU would avoid having to read back results from the GPU to the CPU.

#4985385 Updating engine from dx9 to dx11

Posted by MJP on 30 September 2012 - 10:09 AM

The initial port probably won't be too hard for you. It's not too hard to spend a week or two and get a DX9 renderer working on DX11. What's harder is actually making it run fast (or faster), and integrating the new functionality that DX11 offers you. Constant buffers are usually the biggest performance problem for a quick and dirty port, since using them to emulate DX9 constant registers can actually be slower than doing the same thing in DX9. Past that you may need a lot of re-writing for things like handling structured buffers instead of just textures everywhere, having textures/buffers bound to all shader stages, changing shaders to use integer ops or more robust branching/indexing, and so on.

#4984654 What is the current trend for skinning?

Posted by MJP on 28 September 2012 - 01:59 AM

With DX11.1 you can output to a buffer from a vertex shader, which is even better than using a compute shader or stream out since you can just output the skinned verts while rasterizing your first pass.

#4983758 Problem sending vertex data to shader

Posted by MJP on 25 September 2012 - 03:53 PM

DXGI_FORMAT_R8G8B8A8_UINT is not equal to uint4.

That's not true. The UINT suffix specifies that each component should be interpreted as an 8-bit unsigned integer, and there are 4 values so uint4 is the appropriate type to use in this case.

#4983755 Performance of geometry shaders for sprites instead of batching.

Posted by MJP on 25 September 2012 - 03:48 PM

I must be missing something... why would you want to stream out your sprite vertices? Stream-out is generally only useful in the case where you want to do some heavy per-vertex work in the vertex shader, then "save" the results so that you can re-use them later. The more common case for a geometry shader is to expand a single point into a quad, so that you can send less data to the GPU. It's mostly used for particles, and probably not as useful for a more flexible sprite renderer that might need to handle more complex transformations that are directly specified on the CPU side of things. Either way you need to be careful with the GS: it's implemented sub-optimally on a lot of hardware, particularly first-generation DX10 GPUs, and using it can easily degrade GPU performance to the point where it's not worth it. Using a compute shader is often a preferred alternative to GS stream-out.

Another possible option is to use instancing, where you'd set up vertices + indices for a single quad and then pass all of the "extra" data in the instance buffer (or in a StructuredBuffer or constant buffer). This can allow you to possibly pass less data to the GPU and/or do more work on the GPU, while still batching.

#4983043 Powerful GPU - Send more do less, send less do more?

Posted by MJP on 23 September 2012 - 04:52 PM

There's no simple answer to this question because in reality the situation is very complicated, and thus depends on the specifics of the hardware and what you're doing.

One way to look at CPU/GPU interaction is the same way you'd look at two CPU cores working concurrently. In the case of two CPUs you achieve peak performance when both processors are working concurrently without any communication or synchronization required between the two. For the most part this applies to CPU/GPU as well, since they're also parallel processors. So in general, reducing the amount of communication/synchronization between the two is a good thing. However, in reality a GPU is incapable of operating completely independently from the CPU, which is unfortunate. The GPU always requires the CPU to, at minimum, submit a buffer (or buffers) containing a stream of commands for the GPU to execute. These commands include draw calls, state changes, and other things you'd normally perform using a graphics API.

The good news is that the hardware and driver are somewhat optimized for the case of CPU-to-GPU data flow, and thus can handle it in most cases without requiring stalling/locking for synchronization. The hardware enables this by being able to access CPU memory across the PCI-e bus, and/or by allowing the CPU write access to a small section of dedicated memory on the video card itself. However, in general read or write speeds for either the CPU or GPU will be diminished when reading or writing to these areas, since data has to be transferred across the PCI-e bus. For the command buffer itself the hardware will typically use some sort of FIFO setup where the driver can be writing commands to one area of memory while the GPU trails behind, executing commands from a different area of memory. This allows the GPU and CPU to work independently of each other, as long as the CPU is running fast enough to stay ahead of the GPU.

As for the drivers, they will also use a technique known as buffer renaming to enable the CPU to send data to the GPU without explicit synchronization. It's primarily used when you have some sort of "dynamic" buffer where the CPU has write access and the GPU has read access, for instance when you create a buffer with D3D11_USAGE_DYNAMIC in D3D11. What happens with these buffers is that the driver doesn't explicitly allocate memory for them when you create them; it defers the allocation until the point when you lock/map the buffer. At that point it allocates some memory that the GPU isn't currently using, and allows the CPU to write its data there. The GPU then reads the data when it executes a command that uses the buffer, which is typically some time later (perhaps even as much as a frame or two). If the CPU locks/maps the buffer again, the driver will allocate a different area of memory than the last time it was locked/mapped, so the CPU is again writing to an area of memory that's not currently in use by the GPU. This is why such buffers require the DISCARD flag in D3D: the buffer is using a new piece of memory, so it won't have the same data that you previously filled it with. By using such buffers you can typically avoid stalls, but you may still pay some penalties in terms of access speeds or driver overhead. It's also possible that the driver will run out of memory to allocate, in which case it will be forced to stall. Another technique employed by drivers as an alternative to buffer renaming is to store data in the command buffer itself; this is how the old "DrawPrimitiveUP" functionality was implemented in D3D9, and it can be slower than dynamic buffers depending on how the command buffers are set up. In some cases the driver will let you update a non-renamed buffer without an explicit sync as long as you "promise" not to write any data that the GPU is currently using. This is exposed by the WRITE_NO_OVERWRITE pattern in D3D.

For going the other way and having the GPU provide data to the CPU, you don't have the benefit of these optimizations. In all such cases (reading back render targets, getting query data, etc.) the CPU will be forced to sync with the GPU and flush all pending commands, and then wait for them to execute. The only way to avoid the stall is to wait long enough for the GPU to finish before requesting access to the data.

So getting back to your original question, whether or not it's better to pre-calculate on the CPU depends on a few things. For instance, how much data does the CPU need to send? Will doing so require the GPU to access an area of memory that's slower than its primary memory pool? How much time will the CPU spend computing the data, and writing the data to an area that's GPU-accessible? Is it faster for the GPU to compute the result on the fly, or to access the result from memory? Like I said before, these things can all vary depending on the exact architecture and your algorithms.

#4982578 Occlusion queries for lens flare

Posted by MJP on 21 September 2012 - 11:22 PM

In my last game, we used an alternative to occlusion queries for this problem, but it requires that you're able to bind your depth-buffer as a readable texture -- I don't know if XNA allows that, but it's possible in D3D9, so maybe.

It doesn't. You'd have to manually render depth to a render target.

#4982536 Dxt1 textures no mip levels

Posted by MJP on 21 September 2012 - 06:31 PM

You're trying to make it a dynamic texture, and dynamic textures don't support mip maps. Do you actually need it to be dynamic?

FYI if you create the device with the DEBUG flag it will output messages about errors like this. It will output them to the native debugging stream, so you need to have native debugging enabled or use a program like DebugView to see the messages in a managed app.

#4982516 can't initialize directx 11

Posted by MJP on 21 September 2012 - 03:51 PM

You need to specify FEATURE_LEVEL_10_1 instead of 11, because your GPU doesn't support that feature level.

#4981940 Heat Distortion Effect

Posted by MJP on 20 September 2012 - 12:35 AM

If you render out the distortion amount first, it lets you use a cheaper shader that doesn't sample a render target. Then you just have one pass where you sample a render target. This might be a big deal if there was a lot of overdraw in their particles. Plus it would have allowed them to accumulate the distortion at a lower resolution if they'd wanted to.

#4981938 Adding lights to a scene

Posted by MJP on 20 September 2012 - 12:32 AM

A tone mapping curve like x / (1 + x) is not enough on its own to turn HDR values into something suitable for a display. You also need to calibrate the image, so that the "important" range of intensities end up visible and not crushed or clipped. Typically this is done with some sort of simple exposure simulation, which is usually just a scale value that you multiply your pixel value by before applying tone mapping. A lower exposure value lets you see details in higher intensities, while high exposure values let you see details in lower intensities. It's directly analogous to manipulating settings on a camera that would adjust exposure, like shutter speed and f-stop. It's also analogous to how your eye adapts to different lighting conditions by dilating or constricting the iris. As you've already experienced, you can't just use one exposure value for all scenes. At the very least exposure needs to be adjustable, so that you can change it depending on how bright or dark the scene is. Even better yet is to have some sort of "auto exposure" routine, where you examine the current distribution of intensities that were rendered and automatically compute an exposure value. Such a routine was proposed by Reinhard in his original paper, and you'll see it in a lot of HDR tutorials and samples.

#4981879 AMD APU 6520g fails on texture map from size

Posted by MJP on 19 September 2012 - 06:49 PM

I'm not super-familiar with SlimDX so I wouldn't know for sure, but perhaps you need to account for the pitch of the mapped texture when writing the new data? This is what you normally have to do in native DX. The pitch can change for different hardware/drivers, which would account for the strange behavior you're experiencing. There shouldn't be any limitations regarding power-of-2 textures or dimensions being a multiple of 32, so I'm thinking that wasn't the actual problem.

#4981791 Why are most games not using hardware tessellation?

Posted by MJP on 19 September 2012 - 01:33 PM

It's complicated, it's performance-heavy, and depending on what you do with it it can have a major impact on the content pipeline. And of course after all of that, only a fraction of your userbase will have hardware that supports it (especially if you factor in consoles). In light of that it shouldn't be that surprising that games aren't bursting with tessellation, and the ones that do use it do it for a subset of assets and/or with techniques that require minimal impact on content authoring (PN triangles and detail displacement mapping for the most part).