Brian Klamik

  1. Hi MJP,

     Minor clarification here: apps can fill out these footprint structures themselves, but they must follow the documented alignment restrictions. The samples use GetCopyableFootprints because its conciseness avoids distracting from each sample's true intention. Apps should take care to prevent GetCopyableFootprints from becoming a bottleneck. Once the restrictions are understood, applications can simplify the logic that fills out these structures. As an example, planar-format complications can be avoided, and the structure data could even be baked to disk.

     -Brian Klamik
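     A minimal sketch of what "filling out the footprint yourself" might look like, assuming a simple single-mip, non-planar R8G8B8A8 texture; the alignment constants come from d3d12.h, but the helper function itself is illustrative:

        #include <d3d12.h>

        // Compute the copyable footprint directly from the documented alignment
        // rules instead of calling GetCopyableFootprints at runtime.
        D3D12_PLACED_SUBRESOURCE_FOOTPRINT ComputeFootprint(UINT width, UINT height)
        {
            const UINT bytesPerPixel = 4; // DXGI_FORMAT_R8G8B8A8_UNORM
            D3D12_PLACED_SUBRESOURCE_FOOTPRINT footprint = {};
            footprint.Offset = 0; // must be a multiple of D3D12_TEXTURE_DATA_PLACEMENT_ALIGNMENT (512)
            footprint.Footprint.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
            footprint.Footprint.Width = width;
            footprint.Footprint.Height = height;
            footprint.Footprint.Depth = 1;
            // Row pitch must be a multiple of D3D12_TEXTURE_DATA_PITCH_ALIGNMENT (256).
            footprint.Footprint.RowPitch =
                (width * bytesPerPixel + D3D12_TEXTURE_DATA_PITCH_ALIGNMENT - 1) &
                ~(D3D12_TEXTURE_DATA_PITCH_ALIGNMENT - 1);
            return footprint;
        }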
  2. Oetker,

     In D3D12, CopyBufferRegion should complain about an UPLOAD destination resource, just like CopyResource does. UPLOAD resources are always in the GENERIC_READ state and cannot be transitioned out of it. Perhaps the errors have already been muted elsewhere?

     Are you sure you actually have to copy over the contents of the old buffer? The CopySubresourceRegion call is actually discarded in the scenario you describe. Map in D3D11 requires WRITE_DISCARD to be passed when called on USAGE_DYNAMIC vertex buffers, meaning the current data in the buffer becomes undefined.

     I hope you don't have to copy over the data at all, and your scenario is much simpler than you originally believed. But if you actually must address the issue, there is another option besides reserved resources to consider: you can resort to a CUSTOM heap type. UPLOAD is just an abstraction that can be removed if it gets in the way. Using a CUSTOM heap will allow you to transition the destination resource to COPY_DEST for CopyBufferRegion, then back to GENERIC_READ for further normal usage. See https://msdn.microsoft.com/en-us/library/windows/desktop/dn770374(v=vs.85).aspx for more info.

     Given the choice between reserved resources and a CUSTOM heap, the reserved resource technique should result in better runtime efficiency: overall, it avoids the copy and a much larger spike in residency (the spike occurs because both buffers must stay alive for a while). Don't forget you'll likely need an aliasing barrier when using reserved resources. There are less obvious downsides to reserved resources, though: good D3D tool support for reserved resources will likely take longer to come along than for CUSTOM heaps, and reserved resources are currently a less thoroughly tested code path in drivers.

     -Brian Klamik
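     A rough sketch of the CUSTOM heap idea, configured to mimic UPLOAD (CPU-writable, write-combined, system memory) but without the GENERIC_READ state restriction; the device and bufferSize variables are assumed to exist:

        // CUSTOM heap that behaves like UPLOAD, but whose resources can be
        // transitioned to COPY_DEST and back.
        D3D12_HEAP_PROPERTIES heapProps = {};
        heapProps.Type = D3D12_HEAP_TYPE_CUSTOM;
        heapProps.CPUPageProperty = D3D12_CPU_PAGE_PROPERTY_WRITE_COMBINE;
        heapProps.MemoryPoolPreference = D3D12_MEMORY_POOL_L0; // system memory

        D3D12_RESOURCE_DESC desc = {};
        desc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
        desc.Width = bufferSize;
        desc.Height = 1;
        desc.DepthOrArraySize = 1;
        desc.MipLevels = 1;
        desc.Format = DXGI_FORMAT_UNKNOWN;
        desc.SampleDesc.Count = 1;
        desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

        ID3D12Resource* buffer = nullptr;
        device->CreateCommittedResource(
            &heapProps, D3D12_HEAP_FLAG_NONE, &desc,
            D3D12_RESOURCE_STATE_GENERIC_READ, // initial state; transition as needed later
            nullptr, IID_PPV_ARGS(&buffer));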
  3. It's not clear what you think the problem is. That info message is telling you the object is being destroyed; did you expect it to be destroyed sooner?

     You can figure out which component is creating this object by using the ID3D11InfoQueue interface: make the debugging infrastructure break on all ClassLinkage object creations by breaking on the message D3D11_MESSAGE_ID_CREATE_CLASSLINKAGE, then examine the call stack. See the following documentation for how to do that: http://msdn.microsoft.com/en-us/library/windows/desktop/ff476538(v=vs.85).aspx
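     For reference, a small sketch of setting that break (this assumes the device was created with D3D11_CREATE_DEVICE_DEBUG; the helper function name is illustrative):

        #include <d3d11.h>
        #include <d3d11sdklayers.h>

        // Break into the debugger whenever a ClassLinkage object is created,
        // so the call stack shows which component creates it.
        void BreakOnClassLinkageCreation(ID3D11Device* device)
        {
            ID3D11InfoQueue* infoQueue = nullptr;
            if (SUCCEEDED(device->QueryInterface(__uuidof(ID3D11InfoQueue),
                                                 reinterpret_cast<void**>(&infoQueue))))
            {
                infoQueue->SetBreakOnID(D3D11_MESSAGE_ID_CREATE_CLASSLINKAGE, TRUE);
                infoQueue->Release();
            }
        }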
  4. Trouble creating swap chain

    I suspect the issue is the flags you pass. Initialize D3d_Device_Flags to 0 first. The following line is very suspect, because D3d_Device_Flags is declared and read within the same statement: UINT D3d_Device_Flags = D3d_Device_Flags | D3D11_CREATE_DEVICE_DEBUG; //device flags

    Then, make sure the SDK is installed on the computer you run the program on, or remove the D3D11_CREATE_DEVICE_DEBUG flag; CreateDevice will fail with that flag if the SDK is not installed on that computer.
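    A corrected sketch of that flag setup, under the assumption that the debug layer is only wanted in debug builds:

        UINT D3d_Device_Flags = 0; // device flags: start from a known value
        #if defined(_DEBUG)
        // Requires the SDK layers to be installed on the machine running the program.
        D3d_Device_Flags |= D3D11_CREATE_DEVICE_DEBUG;
        #endif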
  5. DX11 How do you multithread in Directx 11?

    Did you try MSDN? 'Introduction to Multithreading in Direct3D11' (http://msdn.microsoft.com/en-us/library/windows/desktop/ff476891(v=vs.85).aspx) and all the pages it references?
  6. "Cost" of various DirectX calls

    We published a rough list in the Appendix of "Accurately Profiling Direct3D API Calls". It's available at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/programmingguide/advancedtopics/ProfilingDirect3D.asp
  7. OpenGL DirectX slower than OGL?

    Your first picture suggests to me that it is CPU-limited: a small number of vertices and a small number of pixels filled, but a large number of discrete models. Your second picture suggests to me that it could be closer to GPU-limited: few discrete models, a large number of pixels. Since the CPU-limited one is the one puzzling you, I'd suggest you pay close attention to what the CPU is doing in the for( ; i < m_vMaterials.size(); ) loop. Do the rendering techniques really compare exactly to OpenGL? Do you also switch material and texture in OpenGL per "m_vMaterial"? You may be looking at an increase in CPU overhead due to DrawSubset. Can you use a more direct approach to working with the API, which is more efficient than DrawSubset?
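    To illustrate what "a more direct approach" could look like, here is a hypothetical D3D9 sketch that draws per-material ranges with DrawIndexedPrimitive instead of ID3DXMesh::DrawSubset; the Subset structure and the variable names are made up for the example, and it assumes the buffers and ranges were extracted once at load time:

        #include <d3d9.h>

        // Hypothetical per-material range extracted from the mesh at load time.
        struct Subset
        {
            IDirect3DTexture9* texture;
            UINT vertexCount;
            UINT startIndex;
            UINT triCount;
        };

        void DrawMesh(IDirect3DDevice9* device, IDirect3DVertexBuffer9* vb,
                      IDirect3DIndexBuffer9* ib, UINT stride, DWORD fvf,
                      const Subset* subsets, size_t subsetCount)
        {
            // Bind the shared buffers once, then issue one draw per material range.
            device->SetStreamSource(0, vb, 0, stride);
            device->SetIndices(ib);
            device->SetFVF(fvf);
            for (size_t i = 0; i < subsetCount; ++i)
            {
                device->SetTexture(0, subsets[i].texture);
                device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                             subsets[i].vertexCount,
                                             subsets[i].startIndex,
                                             subsets[i].triCount);
            }
        }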
  8. Multithreaded use of D3D

    The Meltdown presentation shows actual results using the method detailed in the SDK documentation. What is happening at the beginning and end (for small batches) is:

      Start: IF = (implicit flush)
      CPU: |---calling 1000 Draw-----|IF, calling 1000 Draw--------|IF, calling ....
      GPU: ----------Idle------------|--Drawing--|----Idle---------|--Drawing--|---Idle-

      End: PGD = (Poll GetData with explicit flush, returned S_FALSE)
      CPU: ----|PGD, PGD, ...-|GetData returned S_OK...
      GPU: Idle|--Drawing ----|---Idle-----

    "Keeping GPU work negligible" means CPU-limited rendering. Therefore, the length of time the GPU takes to execute is less than the length of time it takes the CPU to even make the calls. That's why the GPU is idle so often and finishes the work before the CPU can submit another buffer.

    The numbers for driver costs come from actual costs discovered with the method detailed in the SDK documentation. They are roughly representative of the cost to expect, but naturally, drivers change and there are different vendors, so only you can know how much your current driver will cost.
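    For reference, a short sketch of the poll-GetData pattern shown in the diagram, using a D3D9 EVENT query (this follows the technique from the profiling article; the surrounding device variable and Draw calls are assumed):

        // Issue the batch being measured, then poll until the GPU has consumed it.
        IDirect3DQuery9* query = NULL;
        device->CreateQuery(D3DQUERYTYPE_EVENT, &query);

        // ... submit the 1000 Draw calls here ...

        query->Issue(D3DISSUE_END);

        // S_FALSE means the GPU hasn't finished yet (the "PGD" phase above).
        while (query->GetData(NULL, 0, D3DGETDATA_FLUSH) == S_FALSE)
        {
            // busy-wait; a profiler records the time spent here
        }
        query->Release();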
  9. Multithreaded use of D3D

    It seems like you're still struggling with the asynchronous/parallel processing aspect of the GPU. If you have experience with multi-threaded programming, you should use that as an analogy. Imagine the GPU is another CPU thread, except the GPU is more of a slave than a full-fledged sibling. Now, how do multiple threads communicate?

    Sticking with the analogy, imagine that CPU thread 1 builds up a command buffer. When the buffer is full, CPU thread 1 can hand off the whole buffer to CPU thread 2 (the GPU) for consumption. Handing off the whole buffer is analogous to an API command buffer flush. If that flush does not happen, the commands will never get acted upon ('cuz they are not in the GPU's IN box yet). A BAD thing to do related to this concept would be to call Draw(), then Sleep( 1000 ). An app should call Draw(), then flush, before Sleep( 1000 ). Otherwise, the GPU will never make progress on the Draw. DO NOT confuse this and assume that you must litter your application with lots of flushing (just in case the GPU is idle). If you look at the big picture, the GPU will be working on frame N, while the CPU is working on frame N + 1. Flushing will happen automatically for the app when needed (like during Present).

    When dealing with CPU/GPU synchronization, the most popular reasons for the CPU "busy-waiting" for the GPU to finish are: in response to a Lock() call (GPU not done with the Resource yet), when too many command buffers are submitted to the GPU (i.e. the GPU is way too far behind the CPU), and the imposed restriction on Present where the driver must ensure the GPU is within 3 frames. All these places are implemented with a busy-wait, so you will not see the CPU thread yield. Naturally, there is also GetData (which explicitly exposes this type of busy-wait to the app). An application can reclaim the time lost to busy-waiting itself, either by using GetData with EVENT Queries or the DONOTWAIT flag for Lock.

    BTW, I should make it clear that the MULTITHREAD flag takes a crit-sec per Device object. So, each Device object owns its own crit-sec when passed the MULTITHREAD flag. There is another crit-sec in the kernel to prevent multiple CPU threads from entering the kernel-mode driver. That crit-sec can never be avoided anyway, because it's system-wide. [Edited by - Brian Klamik on April 5, 2005 7:14:10 PM]
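    A small D3D9 sketch of the DONOTWAIT option mentioned above; it assumes readbackSurface is a lockable system-memory surface previously filled with GetRenderTargetData:

        // Non-blocking lock: with D3DLOCK_DONOTWAIT, LockRect returns
        // D3DERR_WASSTILLDRAWING instead of busy-waiting while the GPU is
        // still producing the surface contents.
        D3DLOCKED_RECT lr;
        HRESULT hr = readbackSurface->LockRect(&lr, NULL, D3DLOCK_READONLY | D3DLOCK_DONOTWAIT);
        if (hr == D3DERR_WASSTILLDRAWING)
        {
            // GPU not done yet; do other CPU work and try again next frame.
        }
        else if (SUCCEEDED(hr))
        {
            // ... read lr.pBits ...
            readbackSurface->UnlockRect();
        }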
  10. Multithreaded use of D3D

    Quote (original post by Coder): "Have you read Accurately Profiling Direct3D API Calls?"

    If people are having trouble wrapping their heads around the idea of command buffering and the CPU aspect of the graphics pipeline, might I also suggest some content we put forth for Meltdown 2004, titled The CPU Aspect of the D3D Pipeline? I've also dropped the speech script here for reference. I've seen good presentations from ATI and NVidia related to this topic linked to in other threads.

    As for efficiently adding CPU threads to the graphics pipeline, that's a notoriously hard problem. As highlighted by other people here, just setting the MULTITHREAD flag typically "just adds overhead" (of about 100 clocks per API call to acquire and release a crit-sec). The app has to add its own crit-sec, since there's no way to hold D3D's crit-sec across multiple API calls: Set, Set, Draw.

    Interesting techniques for adding multiple CPU threads tend to gravitate toward adding more frames of latency and extending the graphics pipeline across more than one CPU (have CPU 1 work on visibility culling for frame n, while CPU 2 works on rendering the culled frame n-1 by calling the API). Or give the other CPU thread another Device Context to work with and share Resources across Device Contexts. Unfortunately, the second method can't be done easily with D3D9. Realistically, multiple CPU threads are good at managing Resource IO load, as others have mentioned. That's probably the more proven area where CPU threads are useful.
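    To illustrate the crit-sec point, a hypothetical sketch of an app-level lock held across a Set/Set/Draw sequence (D3D's own crit-sec from the MULTITHREAD flag is only held per call; the function and globals are made up):

        #include <windows.h>
        #include <d3d9.h>

        CRITICAL_SECTION g_renderLock; // initialized elsewhere with InitializeCriticalSection

        void DrawBatch(IDirect3DDevice9* device, IDirect3DTexture9* tex,
                       IDirect3DVertexBuffer9* vb, UINT stride, UINT triCount)
        {
            // Hold the app's own lock for the whole sequence so another thread
            // cannot interleave its own Set calls between these.
            EnterCriticalSection(&g_renderLock);
            device->SetTexture(0, tex);
            device->SetStreamSource(0, vb, 0, stride);
            device->DrawPrimitive(D3DPT_TRIANGLELIST, 0, triCount);
            LeaveCriticalSection(&g_renderLock);
        }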
  11. how d3d talks to the driver / GDI

    D3D has to enter the kernel from time to time. To accomplish that, it does use some thunks that are located in GDI32.dll.

    Here's a hypothetical analogy: imagine that someone wrote an OpenGL extension that exposed Direct3DCreate9 as an OGL extension. You would see D3D applications interact with opengl32.dll. However, there would be no guarantee that the Direct3D objects created through this extension could interact with any OpenGL rendering routines. One would be accurate to say that, even with such a design, D3D bypasses OGL.

    D3D leeches off the general Windows graphics driver model architecture for basic needs, which means it can re-use GDI code to talk to drivers. These are all internal architecture details; apps shouldn't need to care.
  12. I'd say that the driver does buffer up commands, because that is the type of interface most hardware has. I'd say this results in a more efficient design, for reasons similar to why the runtime uses a command buffer... On your average graphics command, the driver should not be actively deciding whether to buffer or not. I don't even know how the driver could choose not to "buffer", because, again, the driver has no choice when the interface to the hardware tends to be command-buffer based.

    Due to the design of command buffers, and the driver not keeping track of how far ahead the CPU was getting, there were issues in the earlier DX7/8(?) timeframe when running heavily GPU-bound applications like the DX samples. The GPU would be so far behind the CPU that a lag of a few seconds between the mouse and the visuals was noticeable. It was most noticeable when one pushed the X (Close Window) button and still had to watch a few seconds of animation before the window would close.

    In DX8/9, we put in a rule that the driver could not buffer up more than 3 frames. This means the driver only has to actively monitor the progress of the hardware when responding to Present commands. In the future, due to the presence of the Event Query, you may see this rule get relaxed, allowing an application to enforce whatever frame-buffering cap it wants.
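    As an illustration of that last point, a hypothetical sketch of an app-enforced frame-buffering cap using D3D9 EVENT queries (the constant, array, and function names are made up):

        // Cap how far the CPU may run ahead of the GPU at MAX_FRAME_LATENCY frames.
        const UINT MAX_FRAME_LATENCY = 2;
        IDirect3DQuery9* g_frameQueries[MAX_FRAME_LATENCY] = { NULL, NULL };
        UINT g_frameIndex = 0;

        void EndFrame(IDirect3DDevice9* device)
        {
            if (g_frameQueries[g_frameIndex] == NULL)
                device->CreateQuery(D3DQUERYTYPE_EVENT, &g_frameQueries[g_frameIndex]);

            g_frameQueries[g_frameIndex]->Issue(D3DISSUE_END); // fence for this frame
            device->Present(NULL, NULL, NULL, NULL);

            // Before reusing the oldest slot, wait until the GPU has consumed that
            // frame's commands. This is the app's own frame-buffering cap.
            g_frameIndex = (g_frameIndex + 1) % MAX_FRAME_LATENCY;
            if (g_frameQueries[g_frameIndex] != NULL)
            {
                while (g_frameQueries[g_frameIndex]->GetData(NULL, 0, D3DGETDATA_FLUSH) == S_FALSE)
                {
                    // busy-wait (or do other useful CPU work)
                }
            }
        }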
  13. Keeping state changes to a minimum...

    We've tried to help answer similar questions about how expensive state changes are with some published content. First, we added an article to the DX SDK documentation about this, "Accurately Profiling Direct3D API Calls", which has an appendix with estimates. It's available on MSDN: http://msdn.microsoft.com/archive/default.asp?url=/archive/en-us/directx9_c_Summer_04/directx/graphics/programmingguide/advancedtopics/ProfilingDirect3D.asp

    Second, we presented some Meltdown 2004 content addressing similar issues, complete with pretty graphs, titled "The CPU Aspect of the D3D Pipeline": http://www.microsoft.com/downloads/details.aspx?FamilyID=00600351-4c8f-43cd-b3e3-a9975ecda0ce&displaylang=en
  14. State Changing

    With D3D8/9, our philosophy was this: if the device was created with PURE, the runtime does not track state for a lot of stuff (i.e. it doesn't support most of the Get calls). Therefore the runtime does not have the state necessary to filter out duplicates. So, we expect applications which set the PURE device flag to filter redundant state themselves, i.e. to already be sufficiently optimized. However, there are exceptions. If the state is represented by an interface (e.g. PixelShader), the runtime is storing the interface and performing AddRef/Release. Therefore, we find it cheap enough to filter out duplicates for the Set*** functions that take interfaces.

    If the application doesn't set PURE, then the runtime has a bunch of state lying around in order to support all the Get functions. Therefore, as long as the state is cheap to filter redundantly, the runtime has the capability to do so, and does. Some exceptions would be stuff like SetViewport or Set*ConstantF... The memcmp tax required to filter this stuff doesn't justify the possible reward, in our opinion, even though we have some of this state lying around.

    From the results of Microsoft's perf labs, it is our impression that popular drivers do not filter redundant states. However, what is true about drivers today can easily change, and could naturally differ between vendors.

    The debug spew from the debug runtime should be viewed as a perf warning. It would have been ideal to have a system that raised the priority of these warnings whenever the runtime did not filter the redundant state itself. However, the internal architecture of the debug runtime and the PURE/non-PURE device split made it hard to accomplish exactly what would be perfect for developers to experience. We should be able to do much better in the future, especially with the mechanism of debug spew reporting. I do believe all these warnings are relatively low priority, meaning you won't be missing much else if you find the level that turns them off.

    Keep in mind, it is cheaper for the app to not even invoke the runtime API if it can find out "quickly" whether the state change is redundant. Depending on the creation parameters of the Device, we might be taking a crit-sec to filter state. Even if the runtime is not taking a crit-sec, the app still has to incur the cost of invoking the function, which is not as cheap as some would like to believe. The biggest gain for the app is not in the simple if check that the runtime can easily implement. The biggest gain comes from the app introducing global knowledge to completely avoid whole groups of if checks when it knows it's never changing that state for a certain duration.

    If your app is in a heavily dynamic scenario, then decide whether the current code is a hot path for the CPU. If it is, and you're running with PURE, throw in the if check. If it's not a hot path, or you'd like the runtime to filter, don't use PURE and don't bother with the if checks. Naturally, we're talking about optimizing CPU-limited frames here. If you're consistently GPU-limited, paying attention to stuff like redundant state filtering ain't going to make a difference.
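    For illustration, a hypothetical sketch of the app-side "if check" described above, for a PURE D3D9 device where the runtime does not filter for you (the cache structure and helper name are made up):

        #include <d3d9.h>

        // Cache of last-set values, so redundant SetRenderState calls never
        // even reach the runtime. The array size and initialization policy
        // are illustrative only.
        struct CachedState
        {
            DWORD renderState[256]; // indexed by D3DRENDERSTATETYPE; seed with known values at startup
        };
        CachedState g_cache;

        void SetRenderStateFiltered(IDirect3DDevice9* device,
                                    D3DRENDERSTATETYPE state, DWORD value)
        {
            if (g_cache.renderState[state] != value) // skip redundant calls entirely
            {
                g_cache.renderState[state] = value;
                device->SetRenderState(state, value);
            }
        }

    As noted above, the bigger win is skipping whole groups of these checks entirely when global knowledge says the state can't have changed for a certain duration.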