Matias Goldberg

Member
  • Content count

    1697

Community Reputation

9609 Excellent

1 Follower

About Matias Goldberg

  • Rank
    Contributor

Personal Information

Social

  • Twitter
    matiasgoldberg
  1. Semi Fixed TimeStep Question.

    Suppose you minimize the application so everything stops; 30 seconds later the application is restored. The counter will now read that you're lagging 30 seconds behind and need to catch up. Without the std::min, and depending on how your code works, either you end up calling update( 30 seconds ), which will screw up your physics completely; or you end up calling for( int i=0; i<30 seconds * 60fps; ++i ) update( 1 / 60 );, which will make your process unresponsive for a while as it simulates 30 seconds' worth of physics at maximum CPU power in fast forward. Either way, it won't be pretty. The std::min prevents these problems by clamping the elapsed time to some arbitrary value (MAX_DELTA_TIME), usually smaller than a second. In other words: pretend the last frame didn't take longer than MAX_DELTA_TIME, even if it did. It's a safety measure; see the sketch below. Edit: Just implement the loops described by gafferongames in a simple sample, and play with whatever you don't understand. Take the std::min out, for example, and see what happens. That will be far more educational than anything we can tell you.
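    A minimal C++ sketch of the clamped loop (update(), render() and the constant values are illustrative; the structure follows gafferongames' fixed timestep):

        #include <algorithm>
        #include <chrono>

        static void update( double /*dt*/ ) { /* fixed-rate physics step */ }
        static void render() { /* draw the current state */ }

        int main()
        {
            const double FIXED_DT = 1.0 / 60.0; // simulation step
            const double MAX_DELTA_TIME = 0.25; // safety clamp, well below 1 second

            double accumulator = 0.0;
            auto prev = std::chrono::steady_clock::now();
            bool running = true;

            while( running )
            {
                const auto now = std::chrono::steady_clock::now();
                const double frameTime = std::chrono::duration<double>( now - prev ).count();
                prev = now;

                // Pretend the last frame never took longer than MAX_DELTA_TIME,
                // even if the app was minimized for 30 seconds.
                accumulator += std::min( frameTime, MAX_DELTA_TIME );

                while( accumulator >= FIXED_DT )
                {
                    update( FIXED_DT );
                    accumulator -= FIXED_DT;
                }
                render();
            }
            return 0;
        }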
  2. what is the appeal of fps games?

    Mass Effect was a shooter, but it wasn't a first person shooter. It's a Third Person Shooter, like Tomb Raider, Just Cause, GTA, Hitman, etc. Not quite the same beast. Yup. It's worth pointing out because the user base usually takes great pride in the "realism" of such games (is it the users? the marketing? I don't know), when the reality is that "realism" isn't fun. In reality a bullet travels at supersonic speed: you get hit first, then hear the sound arrive. In a fun game the sound is simultaneous with the impact, so the player knows where the bullet came from in order to shoot back.
  3. DX11 Resolving MSAA - Depth Buffer to Non MSAA

    In my experience sampling from an MSAA surface can be very slow if your sampling pattern is too random (i.e. I attempted this with SSAO and SSR; it didn't go well). I'm not sure why, whether it's because of cache effects or the GPU simply not being fast at it. But for your scenario that's likely not a problem. If you're doing this approach, using a regular colour buffer (instead of a depth buffer) would be better. Depth buffers have additional baggage you don't need (Z compression, early Z) and will only waste memory and cycles. Btw, if you're going to be resolving the MSAA surface, remember an average of Z values isn't always the best idea. That won't work because the water still needs the depth buffer to avoid being rendered on top of opaque objects that should be occluding the water. However, as MJP pointed out, you can have the depth buffer bound and sample from it at the same time as long as it's marked read-only (IIRC this only works on D3D10.1 HW onwards); see the sketch below. This is probably the best option if all you'll be doing is measuring the depth of the ocean at the given pixel. Cheers
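    A minimal D3D11 sketch of the read-only depth idea (device, context, rtv and depthTex are assumed to already exist; depthTex must be created with DXGI_FORMAT_R24G8_TYPELESS and both D3D11_BIND_DEPTH_STENCIL and D3D11_BIND_SHADER_RESOURCE):

        // Read-only DSV: depth testing still works, but depth writes are off,
        // so the same texture may simultaneously be bound as an SRV.
        D3D11_DEPTH_STENCIL_VIEW_DESC dsvDesc = {};
        dsvDesc.Format = DXGI_FORMAT_D24_UNORM_S8_UINT;
        dsvDesc.ViewDimension = D3D11_DSV_DIMENSION_TEXTURE2D;
        dsvDesc.Flags = D3D11_DSV_READ_ONLY_DEPTH; // needs D3D10.1 HW onwards
        ID3D11DepthStencilView *readOnlyDsv = 0;
        device->CreateDepthStencilView( depthTex, &dsvDesc, &readOnlyDsv );

        D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
        srvDesc.Format = DXGI_FORMAT_R24_UNORM_X8_TYPELESS;
        srvDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2D;
        srvDesc.Texture2D.MipLevels = 1;
        ID3D11ShaderResourceView *depthSrv = 0;
        device->CreateShaderResourceView( depthTex, &srvDesc, &depthSrv );

        // Bind both at once: depth-test the water against the opaque scene
        // while sampling scene depth in the pixel shader.
        context->OMSetRenderTargets( 1, &rtv, readOnlyDsv );
        context->PSSetShaderResources( 0, 1, &depthSrv );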
  4. DX11 Trying to find bottlenecks in my renderer

    You're still not using the result. printf() rData[0] through rData[3] so it is used. And source your input from something unknown, like argv from main or by reading from a file; otherwise the optimizer may perform the computation at compile time and hardcode everything (since everything could otherwise be resolved at compile time rather than calculated at runtime); see the sketch below. Sure. I can save you some time by telling you warp and wavefront are synonyms. "Warp" is what NVIDIA's marketing dept calls them; "wavefront" is what AMD's marketing dept calls them. You can also check my Where do I start Graphics Programming post for resources. Particularly the "Very technical about GPUs" section. You may find the one that says "Latency hiding in GCN" relevant. As for bank conflicts... you got it quite right. Memory is subdivided into banks. If all threads in a wavefront access the same bank, all is OK. If each thread accesses a different bank, all is OK. But if only some of the threads access the same bank, things get slower. I was just trying to point out there's always a more efficient way to do things. But you're making a videogame. At some point you have to say "STOP!" to yourself, or else you'll end up in an endless spiral of constantly reworking things and never finishing your game. There's always a better way. You need to know when something is good enough. I suggest you go for the vertex shader option I gave you (inPos * worldMatrix[vertexId / 6u]). If that gives you enough performance for what you need, move on. Compute Shader approaches also restrict your target audience (e.g. older GPUs won't be able to run them), which is a terrible idea for a 2D game.
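    A small sketch of the "make the input unknown, consume the output" advice (the loop body is just a placeholder for whatever you're actually benchmarking):

        #include <cstdio>
        #include <cstdlib>

        int main( int argc, char *argv[] )
        {
            // Seed comes from argv, so the optimizer cannot resolve the
            // computation at compile time.
            const float seed = argc > 1 ? (float)atof( argv[1] ) : 1.0f;

            float rData[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
            for( int i = 0; i < 1000000; ++i )
            {
                // Placeholder workload standing in for the real matrix math.
                rData[0] += seed * 0.25f;
                rData[1] += rData[0] * seed;
                rData[2] += rData[1] * 0.5f;
                rData[3] += rData[2] * seed;
            }

            // printf() the results so the work above is observable and cannot
            // be dead-code-eliminated.
            printf( "%f %f %f %f\n", rData[0], rData[1], rData[2], rData[3] );
            return 0;
        }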
  5. DX11 Trying to find bottlenecks in my renderer

    I got confirmation from an AMD driver engineer himself. Yes, it's true. However, don't count on it. The driver can only merge your instances into the same wavefront if several conditions are met. I don't know the exact conditions, but they're all HW-limitation related. i.e. if the driver cannot 100% guarantee the GPU can merge your instances without rendering artifacts, then it won't (even if it would be completely safe given the data you're going to feed it; the driver doesn't know that a priori, or it would take a considerable amount of CPU cycles to determine). When it comes to AMD, access to global memory may have channel and bank conflict issues; NVIDIA implements it as a huge register file. So there's always a reason...
  6. DX11 Trying to find bottlenecks in my renderer

    You don't have to issue one draw call per sprite/batch! Create a const buffer with the matrices and index it via SV_VertexID (assuming 6 vertices per sprite):

        cbuffer Matrices : register(b0)
        {
            float4x4 modelMatrix[1024]; //65536 bytes max CB size / 64 bytes per matrix = 1024
            // Alternatively:
            //float4x3 modelMatrix[1365]; //65536 bytes max CB size / 48 bytes per matrix = 1365.33
        };

        uint idx = svVertexId / 6u;
        outVertex = mul( modelMatrix[idx], inVertex );

    That means you only need a DrawPrimitive call every 1024 sprites (or every 1365 sprites if you use affine matrices); see the sketch below. You could make it a single draw call by using a texture buffer instead (which doesn't have the 64kb limit). This will yield much better performance. Even then it's not ideal, because accessing a different matrix every 6 threads in a wavefront will lead to bank conflicts. A more optimal path would be to update the vertices using a compute shader that processes all 6 vertices in the same thread, so each thread in a wavefront accesses a different bank (i.e. one thread per sprite). Instancing will not lead to good performance, as each sprite will very likely be given its own wavefront unless you're lucky (on an AMD GPU you'd be using 9.4% of processing capacity while the rest is wasted!). See Vertex Shader Tricks by Bill Bilodeau.
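    A hypothetical C++ side of the trick (UpdateMatrixConstantBuffer(), numSprites and spriteWorldMatrices are illustrative; the helper stands in for a Map(DISCARD) + memcpy of 'count' world matrices into the cbuffer at b0):

        #include <algorithm>

        const UINT kSpritesPerBatch = 1024u; // 64KB max CB / 64 bytes per float4x4
        for( UINT first = 0u; first < numSprites; first += kSpritesPerBatch )
        {
            const UINT count = std::min( numSprites - first, kSpritesPerBatch );
            UpdateMatrixConstantBuffer( &spriteWorldMatrices[first], count );
            // SV_VertexID restarts at 0 every draw, so modelMatrix[0..count)
            // lines up with this batch's sprites.
            context->Draw( count * 6u, 0u );
        }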
  7. DX11 Trying to find bottlenecks in my renderer

    I skimmed through this very long thread only to find that just the last post (Mekamani's) points it out: you're multiplying each vertex by a matrix on the CPU, with no threading. Taking 40ms to do 40k matrix multiplications per frame on a single core sounds about right. That's your problem.
  8. Unreal State of Custom and Commercial Engines as of 2017/2018

    If we go back in time, you'll find a lot of games were made with Unreal Engine 3. Nowadays there are more UE4 games (especially indie) because the license price went down from more than $50,000 to "free", until you sell enough that they take a 5% cut of gross sales. There were also a lot of games using RenderWare. The names have changed, but the practices haven't. Many games use canned engines; some games still use their own home-grown engine. It's just that games made with canned engines have a recognizable stamp on them, while home-made engines just won't get mentioned unless the studio is doing heavy marketing around it or plans on selling the engine to others. You rarely hear that, for example, Divinity: Original Sin 1 & 2 were made with home-grown engines. The Witcher also uses an in-house engine. Also, powerful engines like UE4 & Unity becoming "free" (they're not really free, but still easily accessible) rather than costing tens of thousands of dollars made them more popular among users who would otherwise have been unable to make a game at all.
  9. Adding to that, you have the costs of clothing, make-up, lighting, and photo shoot sessions. If something needs to be changed or added, you need to shoot lots of pictures again. If during that time the actor changed shape (e.g. got fatter / more fit), you need to reshoot everything. Lost a prop? Reshoot everything again. Midway was the reference point when it comes to HD live-action photo shoots (Mortal Kombat, and also their lesser-known Batman Forever; that style for that kind of game... let's say it didn't work out well). There's a reason they don't do that anymore: it does not scale. Mortal Kombat 3 Ultimate & MK Trilogy were already pushing it a lot with their endless palette swaps of Scorpion and Sub-Zero (plus the cyborg palette swaps). Also, actors suing the company didn't help (it doesn't matter whether they won or not, or whether they were right; either way it was a lot of legal trouble). The most common lawsuit claim was that the actors had signed for their look-alike to appear in Mortal Kombat 1, but not in the subsequent games. The TL;DR of this thread is: you can do it, but it's a terrible idea.
  10. Yes. Everyone has cleared up that this is a HW limitation. But I don't think anybody has hinted at the obvious: you can create more than one view. The most common scenario is manual memory management: creating a large pool, then having different meshes / const buffers / structured buffers live as views into subregions of it. You just can't access all of it at once. For example, if you have a 6GB buffer, you could create 3 views of 2GB each and bind all 3 to the same shader; see the sketch below.
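    A minimal D3D11 sketch of the "views into one big pool" idea (bigBuffer, firstElement, numElements and the view names are illustrative; assumes a structured buffer created with D3D11_RESOURCE_MISC_BUFFER_STRUCTURED):

        // Each view exposes a different subregion of the same pool.
        D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
        srvDesc.Format = DXGI_FORMAT_UNKNOWN; // structured buffers use UNKNOWN
        srvDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
        srvDesc.Buffer.FirstElement = firstElement; // start of this subregion
        srvDesc.Buffer.NumElements = numElements;   // size of this subregion
        ID3D11ShaderResourceView *view = 0;
        device->CreateShaderResourceView( bigBuffer, &srvDesc, &view );

        // e.g. three views over one pool, all bound to the same shader:
        ID3D11ShaderResourceView *views[3] = { view0, view1, view2 };
        context->PSSetShaderResources( 0, 3, views );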
  11. You need to use R16G16B16A16_SNORM. SINT is when you use the raw signed integer values, and you must declare your variable as int4; the values will be in the range [-32768;32767] since they're integers. SNORM is when the integers are mapped from the range [-32768;32767] to the range [-1.0;1.0], and your variable must be declared as float4. See the sketch below.
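    A tiny sketch of the SNORM mapping (this follows D3D's conversion rule, where both -32768 and -32767 map to -1.0):

        #include <algorithm>
        #include <cstdint>

        // How the GPU presents one R16_SNORM channel to the shader as a float.
        float snorm16ToFloat( int16_t v )
        {
            // Divide by 32767, clamping so -32768 also maps to -1.0.
            return std::max( v / 32767.0f, -1.0f );
        }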
  12. ACES and Uncharted Inverse Tone Mapping

    Eye adaptation happens after step 3. Steps 1-3 are not about tonemap correctness; they're about correct AA. After step 3 you have antialiased colour data that you can tonemap and eye-adapt as you like.
  13. ACES and Uncharted Inverse Tone Mapping

    You cheat (see the sketch below):
    1. Apply a trivial reversible tonemap operator before AA.
    2. Resolve AA.
    3. Apply the reverse of the tonemap operator from step 1.
    4. Now apply the tonemap you wanted (Uncharted, ACES, whatever).
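    A small sketch of the idea. The specific operator is an assumption on my part (the post only asks for a "trivial reversible" one); this colour / (1 + max(colour)) choice is a common pick:

        #include <algorithm>

        struct Float3 { float r, g, b; };

        // Step 1: compress HDR into [0;1) before the AA resolve.
        Float3 tonemapForResolve( Float3 c )
        {
            const float m = std::max( c.r, std::max( c.g, c.b ) );
            const float s = 1.0f / ( 1.0f + m );
            return Float3{ c.r * s, c.g * s, c.b * s };
        }

        // Step 3: exact inverse of the operator above (valid since max < 1).
        Float3 inverseTonemapAfterResolve( Float3 c )
        {
            const float m = std::max( c.r, std::max( c.g, c.b ) );
            const float s = 1.0f / ( 1.0f - m );
            return Float3{ c.r * s, c.g * s, c.b * s };
        }

        // Step 4 then applies the real tonemapper (Uncharted, ACES, ...) to
        // the recovered HDR value.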
  14. DX11 Binding buffers and then updating them

    The example you posted is fine. What I meant is that you cannot do the following:

        //Draw a Cube
        graphicsDevice->deviceContext->Draw(cube.vertexCount, 0);
        UpdateBufferWithCubeData(); //Update the cube used by the draw above^

    This is not valid in D3D11, but it is possible (with certain care taken) in D3D12 and Vulkan. No, I meant what is explained here and here. Basically, the following is preferred:

        //Draw a Cube
        void *data = constBuffer->Map( DISCARD );
        memcpy( data, ... );
        bindVertexBuffer( constBuffer );
        graphicsDevice->deviceContext->Draw(cube.vertexCount, 0);

        //Draw a Sphere
        data = constBuffer->Map( DISCARD );
        memcpy( data, ... );
        graphicsDevice->deviceContext->Draw(sphere.vertexCount, 0);

    over the following:

        //Draw a Cube
        void *data = constBuffer0->Map( DISCARD );
        memcpy( data, ... );
        bindVertexBuffer( constBuffer0 );
        graphicsDevice->deviceContext->Draw(cube.vertexCount, 0);

        //Draw a Sphere
        data = constBuffer1->Map( DISCARD ); //Notice it's constBuffer1, not constBuffer0
        memcpy( data, ... );
        bindVertexBuffer( constBuffer1 );
        graphicsDevice->deviceContext->Draw(sphere.vertexCount, 0);

    This difference matters when we're talking about lots of const buffer DISCARDs per frame (e.g. 20k per frame). It makes no difference if you have around 20 per frame. Btw, I personally never have 20k const buffer discards, as I prefer to keep large data (such as world matrices) in texture buffers.

    This pattern is used with D3D11_USAGE_DYNAMIC buffers. These buffers are visible to both CPU and GPU. The actual memory is either stored in GPU RAM, with your CPU writes going directly through the PCIE bus, or stored in CPU RAM, with GPU reads fetching directly via the PCIE bus. Which of the two happens is controlled by the driver, though D3D11_CPU_ACCESS_READ and D3D11_CPU_ACCESS_WRITE probably provide good hints (a buffer that needs read access will likely end up CPU side, a buffer with no read access will likely end up GPU side, but this is not a guarantee!).

    The intermediate place you're describing must be managed by hand via staging buffers. Create the buffer with D3D11_USAGE_STAGING instead of DYNAMIC. Staging buffers are visible to both CPU and GPU, but the GPU can only use them in copy operations. The idea is that you copy from the CPU to the staging area, then copy from the staging area to the final GPU RAM that is only visible to the GPU (i.e. the final buffer was created with D3D11_USAGE_DEFAULT). Or vice versa (copy from GPU to the staging area, then read from the CPU). There's a gotcha: with staging buffers you can't use D3D11_MAP_WRITE_NO_OVERWRITE nor D3D11_MAP_WRITE_DISCARD, but you do have the D3D11_MAP_FLAG_DO_NOT_WAIT flag. If you get DXGI_ERROR_WAS_STILL_DRAWING when mapping the staging buffer with this flag, the GPU is not yet done copying from/to it and you must use another one (i.e. create a new one, or reuse an old one from a pool); see the sketch below.

    What's the difference between the STAGING and DYNAMIC approaches? The PCIE has lower bandwidth than the GPU's dedicated memory (and probably higher latency). If the CPU writes the data once and the GPU reads it once, use DYNAMIC. But if the GPU will read the data over and over again, you may end up fetching it multiple times from CPU RAM through the PCIE; in that case use the STAGING approach to perform the transfer through the PCIE once, after which the data is kept in the fastest RAM available. This advice holds for dedicated GPUs. On integrated GPUs, using staging aggressively may hurt, since there is no PCIE; you'd just be burning CPU RAM bandwidth on useless copies. And for reading GPU -> CPU you have no choice but to use staging. So it's a good idea to write a system that can switch between strategies based on what's faster on each system.
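    A minimal sketch of the staging path (stagingBuf, defaultBuf, srcData and stagingPool are illustrative; stagingBuf was created with D3D11_USAGE_STAGING and D3D11_CPU_ACCESS_WRITE, defaultBuf with D3D11_USAGE_DEFAULT and identical size):

        // Write into the staging buffer without stalling the CPU.
        D3D11_MAPPED_SUBRESOURCE mapped;
        HRESULT hr = context->Map( stagingBuf, 0, D3D11_MAP_WRITE,
                                   D3D11_MAP_FLAG_DO_NOT_WAIT, &mapped );
        if( hr == DXGI_ERROR_WAS_STILL_DRAWING )
        {
            // GPU is still copying from/to this one; take another from a pool
            // (hypothetical helper) instead of waiting.
            stagingBuf = stagingPool.acquire();
            hr = context->Map( stagingBuf, 0, D3D11_MAP_WRITE, 0, &mapped );
        }
        memcpy( mapped.pData, srcData, numBytes );
        context->Unmap( stagingBuf, 0 );

        // One transfer through the PCIE; afterwards the data lives in GPU RAM
        // and repeated GPU reads stay on dedicated memory.
        context->CopyResource( defaultBuf, stagingBuf );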
  15. DX11 Binding buffers and then updating them

    You can do that. What you cannot do is issue Draw commands (or compute dispatches) and update the buffers later, which is something you could do with D3D12 as long as the command buffer hasn't been submitted. As for performance, if you use D3D11_MAP_WRITE_NO_OVERWRITE and then issue one D3D11_MAP_WRITE_DISCARD when bigBufferIsNotFull is false (do not forget to reset this bool! the pseudocode you posted doesn't reset it!), you'll be fine; see the sketch below. Also, allocating everything dynamic in one big pool is fine. Just a few caveats to be aware of:
    • With texture buffers you cannot use D3D11_MAP_WRITE_NO_OVERWRITE unless you're on D3D11.1 on Windows 8 or higher; you always have to issue D3D11_MAP_WRITE_DISCARD.
    • Discarding more than 4MB per frame overall will cause stalls on AMD drivers. And while NVIDIA drivers can handle more than 4MB, it will likely break in really bad ways (I've seen HW bugs pop up).
    In Ogre3D 2.1 we do the following on plain D3D11 systems (i.e. not D3D11.1 on Win 8):
    • Dynamic vertex & index buffers in one big dynamic pool, with the no_overwrite / then discard pattern.
    • Dynamic const buffers separately; one API const buffer per "buffer" in our representation. Though the ideal in D3D11 is to reuse the same const buffer over and over with MAP DISCARD; we do not use many const buffers, though.
    • Dynamic texture buffers also separately; one API tex buffer per "buffer" in our representation.
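    A sketch of the no_overwrite / then discard pattern (poolBuf, poolOffset, poolCapacity, vertexData and bytesNeeded are illustrative; note the offset reset, which is the caveat called out above):

        D3D11_MAPPED_SUBRESOURCE mapped;
        if( poolOffset + bytesNeeded <= poolCapacity )
        {
            // Still room in the pool: append without GPU synchronisation.
            context->Map( poolBuf, 0, D3D11_MAP_WRITE_NO_OVERWRITE, 0, &mapped );
        }
        else
        {
            // Pool exhausted: orphan the old memory and start over. Do not
            // forget to reset the offset (the "reset the bool" caveat).
            poolOffset = 0;
            context->Map( poolBuf, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped );
        }
        memcpy( (char *)mapped.pData + poolOffset, vertexData, bytesNeeded );
        context->Unmap( poolBuf, 0 );
        poolOffset += bytesNeeded;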