
Member Since 29 Mar 2007
Offline Last Active Yesterday, 02:10 PM

#4988612 How many textures can be bound at once in DX11?

Posted by MJP on 09 October 2012 - 11:15 PM

The max number of shader resource slots is 128, and for feature level 11 the max texture array size is 2048.

#4988548 CryENGINE 3 Irradiance Volumes vs Frostbite 2 Tiled Lighting

Posted by MJP on 09 October 2012 - 06:40 PM

What I meant was that they aren't doing anything that's radically new at a fundamental level...radiosity has been around for a very long time, and there has been a lot of research devoted to optimizing it. Most of what Enlighten offers is a framework for processing scenes and handling the data in a way that's optimized for their algorithms. I'm sure they've devoted a lot of time to optimizing the radiosity solve, but I don't think that's really necessary for understanding what they're doing at a higher level.

What they're doing isn't magic...their techniques only work on static geometry, so a lot of the heavy lifting can be performed in a pre-process. They also require you to work with proxy geometry with a limited number of patches, which limits quality. They also limit the influence of patches to zones, and only update a certain number of zones at a time (you can see this if you've ever watched a video of their algorithm where the lighting changes quickly).

I don't mean to sound like I'm trivializing their tech or saying it's "bad" in any way (I'm actually a big fan of their work); my point was just that their techniques stem from an area of graphics that's been around for a long time and is well-documented.

#4988439 Are GPU drivers optimizing pow(x,2)?

Posted by MJP on 09 October 2012 - 01:10 PM

When I've cared enough to check the assembly in the past, the HLSL compiler has replaced pow(x, 2) with x * x. I just tried a simple test case and it also worked:

Texture2D MyTexture;
float PSMain(in float4 Position : SV_Position) : SV_Target0
{
    return pow(MyTexture[Position.xy].x, 2.0f);
}

dcl_globalFlags refactoringAllowed
dcl_resource_texture2d (float,float,float,float) t0
dcl_input_ps_siv linear noperspective v0.xy, position
dcl_output o0.x
dcl_temps 1
ftou r0.xy, v0.xyxx
mov r0.zw, l(0,0,0,0)
ld_indexable(texture2d)(float,float,float,float) r0.x, r0.xyzw, t0.xyzw
mul o0.x, r0.x, r0.x
// Approximately 5 instruction slots used

I wouldn't be surprised if the HLSL compiler got tripped up every once in a while, but there's also the JIT compiler in the driver too. So you'd have to check the actual microcode to know for sure, if you have access to that.

#4987440 Questions about Intel Sample

Posted by MJP on 06 October 2012 - 10:59 AM

Also the main function of the Pixel Shader(GBufferPS) returns a struct:

struct GBuffer
{
	float4 normal_specular : SV_Target0;
	float4 albedo : SV_Target1;
	float2 positionZGrad : SV_Target2;
};

Instead of a float4. How does the GPU even work with this? I mean a pixel shader can only return a color, right? Not a whole struct.

This feature is called "multiple render targets" (MRT), and like the name suggests it allows the GPU to output to up to 8 render targets simultaneously. Honestly it's a pretty basic GPU/D3D feature, and if you're not familiar with such things yet I would stick to simpler material before looking at the Intel sample (which is quite advanced!).

#4987181 Precompiled effect files and macros

Posted by MJP on 05 October 2012 - 11:13 AM

It's been a very long time, but I think you can assign a sampler state a value from a global int declared in your .fx file. Then at runtime you can set the int value with ID3DXEffect::SetInt. Otherwise you could always just call SetSamplerState yourself.

#4987026 ShaderResourceView w/ D3D11_TEX2D_SRV

Posted by MJP on 05 October 2012 - 12:24 AM

Texture arrays are intended for cases where the shader needs to select a single texture from an array at runtime, using an index. Usually this is for the purpose of batching. For instance, if you had 5 textured meshes and you wanted to draw them all in one draw call, you could use instancing and then select the right texture from an array using the index of the instance.

In your case for a tetris game, I don't think it would be necessary. You probably won't ever need to batch with instancing, in which case texture arrays won't give you any performance advantage. You should be fine with just creating a bunch of textures, and then switching textures between each draw call.

#4987016 Beginner Question: Why do we use ZeroMemory macro for Swap Chain object ?

Posted by MJP on 04 October 2012 - 11:25 PM

It's just a way of initializing the structure data, since the struct doesn't have a constructor. This has always been considered the idiomatic way to initialize Win32 structures for as long as I can remember. You don't have to do it if you don't want to; you just need to make sure that you set all of the members of the struct.
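For illustration, here's the same idea with a stand-in struct (the real DXGI_SWAP_CHAIN_DESC is declared in dxgi.h; the struct and function names here are made up): zeroing with memset, which is all the ZeroMemory macro expands to, versus the equivalent C++ value-initialization.

```cpp
#include <cstring>

// Stand-in for a Win32-style POD struct with no constructor
// (the real DXGI_SWAP_CHAIN_DESC lives in dxgi.h).
struct SwapChainDescLike {
    unsigned width;
    unsigned height;
    unsigned bufferCount;
    int      windowed;
};

// The classic Win32 idiom: zero everything, then set the members you need.
SwapChainDescLike MakeDescWithMemset() {
    SwapChainDescLike desc;
    std::memset(&desc, 0, sizeof(desc)); // what the ZeroMemory macro does
    desc.width = 1280;
    desc.height = 720;
    return desc;
}

// The C++ alternative: value-initialization zeroes all members too.
SwapChainDescLike MakeDescWithBraces() {
    SwapChainDescLike desc = {};
    desc.width = 1280;
    desc.height = 720;
    return desc;
}
```

Either way, every member you didn't set explicitly ends up as zero instead of garbage, which is the whole point.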

#4985758 Can you use tessellation for gpu culling?

Posted by MJP on 01 October 2012 - 08:22 AM

Geometry shaders in general are typically not very fast, and stream out can make it worse because of all the memory traffic. IMO it's a dead end if you're interested in scene traversal/culling on the GPU. Instead I would recommend trying a compute shader that performs the culling, and then fills out a buffer with "DrawInstancedIndirect" or "DrawIndexedInstancedIndirect" arguments based on the culling results. I'd suspect that could actually be really efficient if you're already using a lot of instancing.

In general you don't want to draw broad conclusions like "the CPU is better than the GPU for frustum culling", because it's actually a complex problem with a lot of variables. Whether or not it's worth it to try doing culling on the GPU will depend on things like:
  • Complexity of the scene in terms of number of meshes and materials
  • What kind of CPU/GPU you have
  • How much frame time is available on the CPU vs. GPU
  • What feature level you're targeting
  • How much instancing you use
  • Whether or not you use any spatial data structures that could possibly accelerate culling
  • How efficiently you implement the actual culling on the CPU or GPU
One thing that can really tip the scales here is that currently, even with DrawInstancedIndirect, there's no way to avoid the CPU overhead of draw calls and binding states/textures if you perform culling on the GPU. This is why I mentioned that it would probably be more efficient if you use a lot of instancing, since your CPU overhead will be minimal. Another factor that can play into this heavily is if you wanted to perform some calculations on the GPU that determine the parameters of a view or projection used for rendering, for instance something like Sample Distribution Shadow Maps. In that case performing culling on the GPU would avoid having to read results back from the GPU onto the CPU.
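To sketch the idea, here's CPU-side C++ standing in for what the compute shader would do (the plane/sphere types and function names are made up for illustration): a sphere-vs-frustum-plane test per instance, plus filling out the five UINTs that DrawIndexedInstancedIndirect reads from the args buffer.

```cpp
#include <cstdint>
#include <vector>

// Five UINTs, in the order D3D11 reads them from the indirect args buffer.
struct DrawIndexedInstancedIndirectArgs {
    uint32_t IndexCountPerInstance;
    uint32_t InstanceCount;
    uint32_t StartIndexLocation;
    int32_t  BaseVertexLocation;
    uint32_t StartInstanceLocation;
};

struct Plane  { float nx, ny, nz, d; };  // n.p + d >= 0 means "inside"
struct Sphere { float x, y, z, radius; };

// One compute shader thread would perform this test per instance.
bool SphereInsideFrustum(const Sphere& s, const Plane* planes, int planeCount) {
    for (int i = 0; i < planeCount; ++i) {
        float dist = planes[i].nx * s.x + planes[i].ny * s.y +
                     planes[i].nz * s.z + planes[i].d;
        if (dist < -s.radius)
            return false;  // completely behind one plane -> culled
    }
    return true;
}

// Cull a batch of instances and fill the args the GPU would consume via
// DrawIndexedInstancedIndirect. In the real compute shader the surviving
// instance transforms would also be compacted into a StructuredBuffer.
DrawIndexedInstancedIndirectArgs CullBatch(const std::vector<Sphere>& bounds,
                                           const Plane* planes, int planeCount,
                                           uint32_t indexCountPerInstance) {
    DrawIndexedInstancedIndirectArgs args = {};
    args.IndexCountPerInstance = indexCountPerInstance;
    for (const Sphere& s : bounds)
        if (SphereInsideFrustum(s, planes, planeCount))
            ++args.InstanceCount;  // the CS would do this with an atomic add
    return args;
}
```

The GPU then executes the indirect draw with whatever InstanceCount the culling pass wrote, so the CPU never has to read the results back.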

#4985385 Updating engine from dx9 to dx11

Posted by MJP on 30 September 2012 - 10:09 AM

The initial port probably won't be too hard for you. It's not too hard to spend a week or two and get a DX9 renderer working on DX11. What's harder is actually making it run fast (or faster), and integrating the new functionality that DX11 offers you. Constant buffers are usually the biggest performance problem for a quick and dirty port, since using them to emulate DX9 constant registers can actually be slower than doing the same thing in DX9. Past that you may need a lot of re-writing for things like handling structured buffers instead of just textures everywhere, having textures/buffers bound to all shader stages, changing shaders to use integer ops or more robust branching/indexing, and so on.

#4984654 What is the current trend for skinning?

Posted by MJP on 28 September 2012 - 01:59 AM

With DX11.1 you can output to a buffer from a vertex shader, which is even better than using a compute shader or stream out since you can just output the skinned verts while rasterizing your first pass.

#4983758 Problem sending vertex data to shader

Posted by MJP on 25 September 2012 - 03:53 PM

DXGI_FORMAT_R8G8B8A8_UINT is not equal to uint4.

That's not true. The UINT suffix specifies that each component should be interpreted as an 8-bit unsigned integer, and there are 4 values so uint4 is the appropriate type to use in this case.

#4983755 Performance of geometry shaders for sprites instead of batching.

Posted by MJP on 25 September 2012 - 03:48 PM

I must be missing something...why would you want to stream out your sprite vertices? Stream-out is generally only useful in the case where you want to do some heavy per-vertex work in the vertex shader, then "save" the results so that you can re-use them later. The more common case for a geometry shader is to expand a single point into a quad, so that you can send less data to the GPU. That's mostly used for particles, and probably not as useful for a more flexible sprite renderer that might need to handle more complex transformations that are directly specified on the CPU side of things. Either way you need to be careful with the GS: it's implemented sub-optimally on a lot of hardware, particularly first-generation DX10 GPUs, and using it can easily degrade GPU performance to the point where it's not worth it. Using a compute shader is often a preferred alternative to GS stream out.

Another possible option is to use instancing, where you'd set up vertices + indices for a single quad and then pass all of the "extra" data in the instance buffer (or in a StructuredBuffer or constant buffer). This can allow you to possibly pass less data to the GPU and/or do more work on the GPU, while still batching.
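As a rough sketch of that data layout (all struct names and fields made up for illustration), the win comes from writing one small per-instance record instead of four fully expanded vertices per sprite:

```cpp
#include <cstddef>
#include <cstdint>

// Four corners of a unit quad, shared by every sprite, plus the index list
// for its two triangles. These live in static vertex/index buffers.
struct QuadVertex { float x, y, u, v; };
static const QuadVertex kQuadVerts[4] = {
    {0.0f, 0.0f, 0.0f, 0.0f}, {1.0f, 0.0f, 1.0f, 0.0f},
    {0.0f, 1.0f, 0.0f, 1.0f}, {1.0f, 1.0f, 1.0f, 1.0f},
};
static const uint16_t kQuadIndices[6] = {0, 1, 2, 2, 1, 3};

// Everything that varies per sprite goes in the instance buffer (or a
// StructuredBuffer), written once per sprite instead of once per corner.
struct SpriteInstance {
    float posX, posY;      // screen position
    float scaleX, scaleY;  // size
    float rotation;        // radians
    uint32_t textureIndex; // e.g. index into a texture array, for batching
};

// What you'd have to write per sprite if the CPU expanded the quad itself.
struct ExpandedVertex { float x, y, u, v; uint32_t textureIndex; };

// Bytes the CPU must write per sprite with each approach.
size_t BytesPerSpriteInstanced() { return sizeof(SpriteInstance); }
size_t BytesPerSpriteExpanded()  { return 4 * sizeof(ExpandedVertex); }
```

With D3D11 the instance array would go into a DYNAMIC buffer and the whole batch would be submitted with a single DrawIndexedInstanced call (6 indices, N instances).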

#4983043 Powerful GPU - Send more do less, send less do more?

Posted by MJP on 23 September 2012 - 04:52 PM

There's no simple answer to this question because in reality the situation is very complicated, and thus depends on the specifics of the hardware and what you're doing.

One way to look at CPU/GPU interaction is the same way you'd look at two CPU cores working concurrently. In the case of two CPUs you achieve peak performance when both processors are working concurrently, without any communication or synchronization required between the two. For the most part this applies to the CPU/GPU pair as well, since they're also parallel processors. So in general, reducing the amount of communication/synchronization between the two is a good thing. However, in reality a GPU is incapable of operating completely independently from the CPU, which is unfortunate. The GPU always requires the CPU to, at minimum, submit a buffer (or buffers) containing a stream of commands for the GPU to execute. These commands include draw calls, state changes, and other things you'd normally perform using a graphics API.

The good news is that the hardware and driver are somewhat optimized for the case of CPU-to-GPU data flow, and thus can handle it in most cases without requiring stalling/locking for synchronization. The hardware enables this by being able to access CPU memory across the PCI-e bus, and/or by allowing the CPU write access to a small section of dedicated memory on the video card itself. However, in general read or write speeds for either the CPU or GPU will be diminished when reading or writing to these areas, since data has to be transferred across the PCI-e bus. For the command buffer itself the hardware will typically use some sort of FIFO setup, where the driver can be writing commands to one area of memory while the GPU trails behind executing commands from a different area of memory. This allows the GPU and CPU to work independently of each other, as long as the CPU is running fast enough to stay ahead of the GPU.

As for the drivers, they will also use a technique known as buffer renaming to enable the CPU to send data to the GPU without explicit synchronization. It's primarily used when you have some sort of "dynamic" buffer where the CPU has write access and the GPU has read access, for instance when you create a buffer with D3D11_USAGE_DYNAMIC in D3D11. What happens with these buffers is that the driver doesn't explicitly allocate memory for them when you create them; it defers the allocation until the point when you lock/map the buffer. At that point it allocates some memory that the GPU isn't currently using, and allows the CPU to write its data there. The GPU later reads the data when it executes a command that uses the buffer, which is typically some time later on (perhaps even as much as a frame or two). Then if the CPU locks/maps the buffer again, the driver will allocate a different area of memory than the last time it was locked/mapped, so the CPU is again writing to an area of memory that's not currently in use by the GPU. This is why such buffers require the DISCARD flag in D3D: the buffer is using a new piece of memory, so it won't have the same data that you previously filled it with. By using such buffers you can typically avoid stalls, but you may still pay some penalties in terms of access speeds or in the form of driver overhead. It's also possible that the driver may run out of memory to allocate, in which case it will be forced to stall.

Another technique employed by drivers as an alternative to buffer renaming is to store the data in the command buffer itself. This is how the old "DrawPrimitiveUP" stuff was implemented in D3D9. This can be slower than dynamic buffers depending on how the command buffers are set up. In some cases the driver will let you update a non-renamed buffer without an explicit sync as long as you "promise" not to write over any data that the GPU is currently using. This is exposed through the WRITE_NO_OVERWRITE pattern in D3D.
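Here's a toy simulation of the renaming scheme (not any real driver's code; a frame index stands in for the GPU fences a real driver would track, and the class name is made up):

```cpp
#include <array>

// Toy model of driver-side buffer renaming for a DYNAMIC buffer: a small
// pool of backing allocations, and each Map-with-DISCARD hands the CPU a
// copy the GPU is not currently reading.
class RenamedBuffer {
public:
    static constexpr int kCopies = 3;  // CPU can run ~2 frames ahead of the GPU

    // Map-with-DISCARD: pick a backing copy by frame, wipe it, return it.
    // The previous contents are gone, which is exactly why D3D requires the
    // DISCARD flag for this usage pattern.
    std::array<unsigned char, 256>& MapDiscard(unsigned frameIndex) {
        current_ = static_cast<int>(frameIndex % kCopies);
        copies_[current_].fill(0);
        return copies_[current_];
    }

    // Which backing copy this frame's GPU commands will read from.
    int CurrentCopy() const { return current_; }

private:
    std::array<unsigned char, 256> copies_[kCopies] = {};
    int current_ = 0;
};
```

Each map returns a different allocation than the one the GPU may still be reading, so the CPU never has to wait; once the pool wraps around, the oldest copy gets recycled (and its old contents discarded).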

For going the other way and having the GPU provide data to the CPU, you don't have the benefit of these optimizations. In all such cases (reading back render targets, getting query data, etc.) the CPU will be forced to sync with the GPU, flush all pending commands, and then wait for them to execute. The only way to avoid the stall is to wait long enough for the GPU to finish before requesting access to the data.

So getting back to your original question, whether or not it's better to pre-calculate on the CPU depends on a few things. For instance, how much data does the CPU need to send? Will doing so require the GPU to access an area of memory that's slower than its primary memory pool? How much time will the CPU spend computing the data, and writing the data to an area that's GPU-accessible? Is it faster for the GPU to compute the result on the fly, or to access the result from memory? Like I said before, these things can all vary depending on the exact architecture and your algorithms.

#4982578 Occlusion queries for lens flare

Posted by MJP on 21 September 2012 - 11:22 PM

In my last game, we used an alternative to occlusion queries for this problem, but it requires that you're able to bind your depth-buffer as a readable texture -- I don't know if XNA allows that, but it's possible in D3D9, so maybe.

It doesn't. You'd have to manually render depth to a render target.

#4982536 Dxt1 textures no mip levels

Posted by MJP on 21 September 2012 - 06:31 PM

You're trying to make it a dynamic texture, and dynamic textures don't support mip maps. Do you actually need it to be dynamic?

FYI if you create the device with the DEBUG flag it will output messages about errors like this. It will output them to the native debugging stream, so you need to have native debugging enabled or use a program like DebugView to see the messages in a managed app.