
Member Since 15 Dec 2001

#5157817 Metal API .... wait what

Posted by phantom on 03 June 2014 - 08:23 AM

We are now in a fun situation where three APIs (D3D12, Mantle and Metal) look largely the same, with OpenGL the odd one out. While this won't "kill" OpenGL, the general feeling outside of those who have a vested interest in it is that the other three are going to murder it on CPU performance, thanks to lower overhead, explicit control and the ability to set up work across multiple threads.

It'll be interesting to see what reply, if any, Khronos has to this direction of development, because aside from the N-APIs problem the shape of the thing is what devs have been asking for (and, on consoles, using) for a while now.

#5157386 How much time does a game company give programmers to solve something

Posted by phantom on 01 June 2014 - 01:46 PM

How long is a piece of string?

If something needs to be done then you have to give it time to be done; at worst you'll feature cut if you run out of time.

I've worked on features which had days to be completed, and others where I've managed to get a couple of months by convincing people the extra time would be worth the cost, even though it overran the initial estimates.

#5157222 Overhead of subroutines in arrays when multidrawing?

Posted by phantom on 31 May 2014 - 03:56 PM

The problem with saying '5% less performance' is we don't know if you were CPU or GPU bound in your tests.

If you were CPU bound then it's possible the extra calls made the difference.
If you were GPU bound then it might be a GPU resource allocation problem.
Or it could be anything in between :)

If you could run the tests via AMD's profiling tools and get some solid numbers, that would be useful.
Even their offline shader analyser might give some decent guideline numbers, as it'll tell you VGPR/SGPR usage, and you can dig in from there to see what is going on.
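As a rough guide, the VGPR-to-occupancy relation that kind of tool reports can be sketched in a few lines of Python. These are GCN-era numbers (256 VGPRs per lane, a cap of 10 waves per SIMD); real hardware also rounds allocations and factors in SGPRs and LDS, so treat this as a ballpark:

```python
def gcn_wave_occupancy(vgprs_per_thread, vgpr_file=256, max_waves=10):
    """Rough GCN occupancy estimate: waves per SIMD limited by VGPR use.

    Simplified model: ignores allocation granularity, SGPR and LDS limits.
    """
    if vgprs_per_thread <= 0:
        return max_waves
    return min(max_waves, vgpr_file // vgprs_per_thread)

# Fewer registers per thread -> more waves in flight to hide latency.
print(gcn_wave_occupancy(24))   # 10 waves
print(gcn_wave_occupancy(84))   # 3 waves
print(gcn_wave_occupancy(128))  # 2 waves
```

Dropping VGPR usage below the next threshold is often the difference between hiding memory latency and stalling.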

#5156923 Overhead of subroutines in arrays when multidrawing?

Posted by phantom on 30 May 2014 - 03:53 AM

That depends on the source of the data.

If your subroutine switch was provided by a glUniform call pre-draw then yes, the driver can see the value and will likely recompile a shadow copy of your GLSL program which removes the jump and inlines the code. You've now got the best of both worlds (post-recompile): your register allocation is now constant, the jump is gone, and as the user you've not had to write N versions of the code and don't see or care about the driver magic going on behind the scenes.
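To picture that driver magic, here's a toy Python sketch of the idea (not real driver code; all names are made up): specialise once per uniform value, cache the variant, and reuse it on later draws.

```python
# Toy model of a driver-side "shadow recompile": the first draw with a
# given subroutine uniform value triggers a specialised (inlined) variant;
# later draws with the same value hit the cache.
class ShadowCompiler:
    def __init__(self, subroutines):
        self.subroutines = subroutines   # uniform value -> inlined body
        self.variants = {}               # uniform value -> compiled variant
        self.compiles = 0

    def get_variant(self, uniform_value):
        if uniform_value not in self.variants:
            self.compiles += 1           # "recompile" with the jump removed
            body = self.subroutines[uniform_value]
            self.variants[uniform_value] = f"main() {{ {body} }}"
        return self.variants[uniform_value]

sc = ShadowCompiler({0: "shade_matte();", 1: "shade_glossy();"})
sc.get_variant(0); sc.get_variant(1); sc.get_variant(0)
print(sc.compiles)  # 2 -- the repeated value reused the cached variant
```

The key point is that this only works when the selector is visible to the driver at command-buffer build time, which is exactly what breaks down for gl_DrawID and gl_InstanceID below.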

The problem with gl_DrawID and gl_InstanceID is right there in what you wrote, however: "dynamically uniform expressions".
Neither of these values is visible to the driver before the command buffer is built, so it has no way of knowing what they are in order to do the shadow recompiles.

This is certainly true where indexing on gl_InstanceID is used, because the value doesn't vary per command kick/GPU state but during the execution of a batch of work. The driver would have to evaluate where gl_InstanceID is used, look at the draw call to figure out the extent, then shadow compile N copies of the shader and generate a command buffer doing N kicks for the draw, with a different program for each instance (or group of instances, depending on usage).

Now, gl_DrawID might be a bit more feasible.
If you are using the non-indirect-count version, so the count comes from a CPU-side parameter, then the driver could potentially shadow compile up to 'count' versions of the shader (depending on usage; the driver would still need to look into the shader to see how it is used) and then issue multiple kick commands with a different shader for each draw.

Once you get into the indirect versions, however, life gets much harder; while the CPU can see a 'max count' it can't know how many draws the command processor on the GPU will end up setting up. So unless the command processor has a set of instructions which allow it to index into an array of program binaries (which would have to be 'max count' in size and consist of pointers to the programs) it has no way to route this information, so any choices would be 'late'.

So, in some cases it might be possible to do this. HOWEVER, it would come at a much greater CPU cost, as you have to perform much more complex work up front in the driver for the general case of generating the command buffer for the GPU to consume. In the case of instancing it would basically undermine instancing; in the non-indirect multi-draw case it might help, as I believe these are generally handled as N kick commands under the hood anyway, but for anything sourcing data off a GPU buffer it could be impossible.

But it comes at the cost of increased memory usage and more driver complexity, as the driver has to evaluate the shader and make choices, which increases CPU usage before we even get to the more complicated dispatch logic.

Depending on how it is implemented it could also cause GPU performance issues: instead of a large block of many wavefronts/warps moving through the system, you could now have smaller ones doing less work, with worse occupancy and/or less chance of latency hiding, depending on how the work is spread across the GPU's compute units.

Now, that's not to say drivers don't already do some of this; the trivial glUniform case is probably handled, and you might even get the multi-draw case too, although it seems less likely. However, I wouldn't count on it.

#5156695 Graphics Layers for different for Directx11 and Directx12

Posted by phantom on 29 May 2014 - 06:53 AM

That would be a great way to make a mess of your code, and since the application would link to both Direct3D 11 and Direct3D 12, only people running Direct3D 12 could run it, making Direct3D 11 a useless parasitic twin.

You could, of course, push the API-specific code into a DLL, then detect and load the correct DLL at runtime, which would remove that problem.

Granted, it brings with it other things to consider but it's also not an uncommon way to do things.

#5156644 Overhead of subroutines in arrays when multidrawing?

Posted by phantom on 29 May 2014 - 02:09 AM

Very informative, thanks.
I didn't know that registers are not allocated dynamically per subroutine. Is this likely to stay this way for a long time?
What do you think about grouping subroutines in several shaders based on complexity, then, couldn't that be a good compromise?

I don't see it changing any time soon; the registers are divided up between instances when a program starts, so the hardware knows how many instances it can keep in flight at once. If you make that dynamic in some way you increase the cost of execution: you can't grab all resources up front, so during program flow you have to try to grab more resources, and then potentially fail if the resources aren't there.

Let's say the shader byte stream, and thus the hardware, knew about subroutine entry points; it might work like this.
If you had a shader made up of a main and two subroutines:
main - 5 GPR
func1 - 10 GPR
func2 - 20 GPR

So when execution starts the hardware gets told we need 5 GPRs, it figures out how many instances it can launch, and off we go.
Then you pick a subroutine, but hold on; where do we get our GPRs from? We allocated them all up front to spawn our initial workload... bugger. At this point we deadlock and the GPU locks up.
So you've got two choices:

1) Attempt to stream the state out and reallocate - this would be slow, as you'd need to stream to main memory, it would halt any processing while it's going on, and you'd then pay state setup costs again while reconfiguring to run half as many tasks at once (for func1), or even a quarter as many (func2). Then, when the subroutines return, you have to restore main's state, set up again, pull all the tasks back together and relaunch in the old configuration.
(I've probably missed some problems: wavefronts/warps could be on the same SIMD unit, thus sharing the register file, but executing different paths, so you run a performance risk again when something with a higher GPR count needs space but not enough register space is free; whole wavefronts/warps end up sleeping at that point, which could hurt performance.)

2) The current system of pre-allocating registers in advance: you run a lower number of instances at once, but you don't need any complicated hardware logic for rescheduling workloads as the shaders progress.

In theory 1 would be the 'ideal' situation, as at any given time you are running the maximum number of instances, but the dynamic nature of it is likely to be a performance issue going forward, with all the extra work needed to rework the threads in flight.

If you've got the CPU time to spare then grouping is potentially a win; as long as you don't end up with subroutines with wildly different GPR counts in the same group, it could help matters.
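As a back-of-the-envelope illustration of why grouping helps, here is the occupancy model above in Python, assuming an illustrative register file of 256 registers per lane and a 10-wave cap (toy numbers, and hypothetical subroutine names):

```python
# One uber-shader must allocate registers for its worst-case subroutine,
# so every wave pays the highest GPR count; grouping subroutines by
# register pressure lets the cheap paths keep their high occupancy.
def waves(gpr, regfile=256, cap=10):
    return min(cap, regfile // gpr)

subroutines = {"func_cheap": 10, "func_mid": 40, "func_heavy": 128}

one_shader = waves(max(subroutines.values()))          # pay for the worst
grouped = {name: waves(gpr) for name, gpr in subroutines.items()}

print(one_shader)  # 2 waves for everything
print(grouped)     # {'func_cheap': 10, 'func_mid': 6, 'func_heavy': 2}
```

In this sketch the combined shader runs everything at 2 waves, while grouping keeps the cheap path at the full 10; mixing wildly different GPR counts in one group throws that win away.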

#5155368 Using the ARB_multi_draw_indirect command

Posted by phantom on 23 May 2014 - 02:35 AM

When it comes to performance and usage, this is worth a watch: Approaching Zero Driver Overhead.

#5155180 Can a bindless_sampler be mapped as normal UBO instead of using glProgramUnif...

Posted by phantom on 22 May 2014 - 01:54 AM

Yes, that function is just an overload to set uniforms; you can just treat it like a blob of data and push it into a buffer as you suggest.

#5155013 What OpenGL book do you recommend for experts?

Posted by phantom on 21 May 2014 - 02:57 AM

I've found the OpenGL SuperBible to be useful in bringing myself back up to speed with OGL too.

(Insights is also on my 'to read' list, having already spun through the Red Book in order to get a view of the 4.3 API world.)

#5154807 GPU bottlenecks and Sync Points

Posted by phantom on 20 May 2014 - 07:12 AM

AFAIK, this is a very common strategy for management of cbuffers in D3D11 -- the drivers are smart enough to allocate all that memory and clean up your garbage.

Doesn't stop the IHV hating you for doing it too much/often ;)

NV - a limited number of rename slots they can allocate from, but buffer size isn't a problem.
AMD - a buffer has an 8 MB rename buffer attached to it, so as soon as your updates to a buffer go over that amount, welcome to Slow City.

Both companies will, of course, write code paths for specific games to get around these problems if the games are big enough (AMD had to double the buffer for a game at the last place I worked, as it was doing too many discards on a buffer during a frame).

Constantly cycling on one buffer is basically bad voodoo; it'll work but you run the risk of the IHV Ninjas murdering you in your sleep ;)
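To see why cycling on one buffer goes bad, here's a toy model (the 8 MB figure is the AMD behaviour described above; everything else is simplified and illustrative):

```python
# Toy model of buffer renaming: each Map(DISCARD) hands back a fresh
# region of the buffer's rename pool, and once the pool is exhausted
# within a frame the driver has to stall waiting on the GPU.
RENAME_POOL = 8 * 1024 * 1024  # ~8 MB rename pool per buffer (AMD case)

def discards_before_stall(buffer_size, pool=RENAME_POOL):
    """How many Map(DISCARD)s of this buffer fit in one frame's pool."""
    return pool // buffer_size

print(discards_before_stall(64 * 1024))    # 128 discards of a 64 KB buffer
print(discards_before_stall(1024 * 1024))  # only 8 for a 1 MB buffer
```

Which is why many small discards on a small buffer are survivable, but a few discards on a large buffer can blow the pool in a single frame.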

#5154125 Cases for multithreading OpenGL code?

Posted by phantom on 16 May 2014 - 03:42 PM

As we've pretty much got the answer to the original question, I'm going to take a moment to quickly (and basically) cover a thing :)

Good clarification. Makes sense that even if the GPU has lots of pixel/vertex/computing units, the system controlling them isn't necessarily as parallel-friendly. For a non-hw person the number three sounds like a curious choice, but in any case it seems to make some intuitive sense to have the number close to a common number of CPU cores. That's excluding hyper-threading but that's an Intel thing so doesn't matter to folks at AMD. (Though there's the consoles with more cores...)

So, the number '3' has nothing to do with CPU core counts; when it comes to GPUs and CPUs, very little of one directly impacts the other.

A GPU works by consuming 'command packets'; the OpenGL calls you make get translated by the driver into bytes the GPU can natively read and understand, in the same way a compiler transforms your code to binary for the CPU.

The OpenGL and D3D11 model of a GPU presents a case where the command stream is handled by a single 'command processor': the hardware which decodes the command packets to make the GPU do its work. For a long time this was probably the case too, so the conceptual model 'works'.

However, a recent GPU, such as AMD's Graphics Core Next series, is a bit more complicated than that: the interface which deals with the commands isn't a single block but in fact three, each of which can consume a stream of commands.

First is the 'graphics command processor'; this can dispatch graphics and compute workloads to the GPU hardware - the glDraw/glDispatch family of functions - and is where your commands end up.

Second are the 'compute command processors'; these can handle compute-only workloads. They aren't exposed via GL; I think OpenCL can kind of expose them, but with Mantle each is a separate command queue. (The driver might make use of them behind the scenes as well.)

Finally, 'DMA commands' go to a separate command queue which moves data to/from the GPU; in OpenGL this is handled behind the scenes by the driver (but in Mantle it would allow you to kick off your own uploads/downloads as required).

So the command queues as exposed by Mantle more closely mirror the operation of the hardware (it still hides some details), which explains why you have three: one for each type of command work the GPU can do.
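A minimal sketch of that three-queue split (the routing rules are the simplified ones from this post, not a full description of the hardware):

```python
from enum import Enum

# The three front-end queue types described above, roughly as Mantle
# exposes them; real drivers may also route work between them internally.
class Queue(Enum):
    GRAPHICS = "graphics command processor"  # draws and dispatches
    COMPUTE = "compute command processor"    # compute dispatches only
    DMA = "dma engine"                       # data transfers only

def queues_that_accept(command):
    """Simplified routing: which queue types can consume this command."""
    return {
        "draw": [Queue.GRAPHICS],
        "dispatch": [Queue.GRAPHICS, Queue.COMPUTE],
        "copy": [Queue.DMA],
    }[command]

print(queues_that_accept("dispatch"))  # graphics or compute can take it
```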

If you are interested AMD have made a lot of this detail available which is pretty cool.
(Annoyingly NV are very conservative about their hardware details, which makes me sad.)

To be clear, you don't need to know this stuff, although I personally find it interesting - this is also a pretty high-level overview of the situation, so don't take it as a "this is how GPUs work!" kind of thing :)

#5153975 Cases for multithreading OpenGL code?

Posted by phantom on 16 May 2014 - 05:29 AM

Well, it's only a partly parallelisable problem, as the GPU is reading from a single command buffer (well, in the GL/D3D model; the hardware doesn't work quite the same, as Mantle shows by giving you three command queues per device, but still...), so at some point your commands have to get into that stream (be it by physically appending to a chunk of memory or by inserting a jump instruction to a block to execute), so you are always going to have a single thread/sync point.

However, command sub-buffer construction is a highly parallelisable thing; consoles have been doing it for ages. The problem is that the OpenGL mindset seems to be 'this isn't a problem - just multi-draw all the things!', and the D3D11 "solution" was a flawed one because of how the driver works internally.
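The pattern itself is simple; a Python sketch with made-up command names: record sub-buffers on worker threads, then stitch them together at the single sync point.

```python
from concurrent.futures import ThreadPoolExecutor

def record_subbuffer(object_id):
    # Each worker builds its own command list -- no shared state while
    # recording, which is what makes this part parallel-friendly.
    return [f"bind({object_id})", f"draw({object_id})"]

# Record eight sub-buffers across up to four threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    subbuffers = list(pool.map(record_subbuffer, range(8)))

# The single sync point: link the sub-buffers, in order, into the one
# stream the GPU will consume.
command_stream = [cmd for sub in subbuffers for cmd in sub]
print(len(command_stream))  # 16 commands, recorded in parallel
```

Everything up to the final stitch scales with cores; only the submission is serial, which is exactly the model D3D12 and Mantle expose directly.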

D3D12 and Mantle should shake this up and hopefully show that parallel command buffer construction is a good thing and that OpenGL needs to get with the program (or, as someone at Valve said, it'll get chewed up by the newer APIs).

#5153535 Graphics baseline for a good-looking PC game?

Posted by phantom on 14 May 2014 - 05:26 AM

Is there a reference (or published experiment) where I can confirm either way? Do all 3 vendors perform the same way?

I think they do it differently; NV do perform better at low tessellation factors than AMD (which is why they shout about it the most), and Intel I've no idea about.

As normal, NV are pretty silent about their internal workings.
AMD do have this document, however: http://t.co/zOyz5DFa6D (APU13 talk "The AMD GCN Architecture: A Crash Course") - slide 61 onwards pertains to this topic, but the whole thing is a nice chunk of information to have.

#5153524 Graphics baseline for a good-looking PC game?

Posted by phantom on 14 May 2014 - 04:26 AM

but once you can tessellate to the level where your triangles are at size of pixel (which isn't a problem on modern GPUs)

Except where it is a MASSIVE performance problem, because you are now issuing one wavefront (32 or 64 threads) to do one pixel's worth of work; pixel workloads are dispatched in groups, causing massive under-utilisation of the hardware and wasted resources all over the place.

Pixel sized triangles are the devil.
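The arithmetic behind that: shading happens in 2x2 quads packed into wavefronts, so in the worst case described above a one-pixel triangle occupies a whole wave. A quick sketch of that worst case (simplified model, ignoring quad packing across triangles):

```python
def wave_utilisation(covered_pixels, wave_size=64):
    # Worst case from the post: a tiny triangle gets its own wavefront,
    # so only covered_pixels of the wave_size threads do useful work.
    return min(covered_pixels, wave_size) / wave_size

print(wave_utilisation(1))   # 0.015625 -- 1 useful pixel out of 64 threads
print(wave_utilisation(64))  # 1.0 -- a fully covered wave
```

At pixel-sized triangles you are paying for 64 threads of register file, scheduling and rasteriser work per pixel, which is why the utilisation collapses.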


#5153501 What DirectX version?

Posted by phantom on 14 May 2014 - 01:34 AM

GL4 is also a possibility if you want modern features!

Except then you'll lose all the Intel hardware, as they don't ship drivers beyond OpenGL 3.2 (I think it was .2, anyway) and even that is patchy.
And when you get to 4.x you end up in the world of 'NV works but doesn't follow the spec' and 'AMD claims spec support but has bugs'.

Basically all gfx APIs suck.

(And don't even get me started on the clusterfuck which is Android and OpenGL|ES...)