
#5143291 [GLSL] NVIDIA vs ATI shader problem

Posted by phantom on 30 March 2014 - 02:11 PM

NV automatically assigns attribute locations according to the order in which the attributes were declared, while ATI/AMD needs an explicit layout(location = index) qualifier in front of each attribute.

If you don't specify the locations (no layout qualifier, or pre-330 GLSL) then you should query them from the compiled/linked program, as the driver is free to assign them however it likes.
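A minimal sketch of the explicit form (the attribute names here are just illustrative):

```glsl
#version 330 core

// Explicit locations behave identically on NV and AMD drivers:
layout(location = 0) in vec3 inPosition;
layout(location = 1) in vec3 inNormal;
layout(location = 2) in vec2 inTexCoord;
```

Without the qualifiers, the portable route is to query after linking, e.g. glGetAttribLocation(program, "inPosition"), and feed the result to your glVertexAttribPointer setup rather than assuming index 0.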

#5142962 [GLSL] NVIDIA vs ATI shader problem

Posted by phantom on 28 March 2014 - 05:51 PM

Fix the errors the AMD compiler reports; their GLSL implementation tends to be more correct/strict than NV's, which lets more Cg-like syntax through.

#5142112 AMD's Mantle API

Posted by phantom on 25 March 2014 - 04:15 PM

I wouldn't mind that so much if a guy working there wasn't pushing methods on Twitter which don't work on their drivers, AND if non-4.4 extensions didn't keep turning up... not that those are documented anywhere either.

The amusing thing is that AMD taking part in this AZDO presentation has basically pushed me towards getting an NV GPU, because the functionality is cool and I want to use it, but AMD aren't delivering...

And they wonder why people don't test their stuff on their hardware... *sigh*

#5142011 AMD's Mantle API

Posted by phantom on 25 March 2014 - 09:47 AM

MDI and bindless are only part of the equation, and they look fine if your goal is 'I want to render 1,000,000 objects!' - but a game nowadays will do multiple passes and draws between presented frames, so you are immediately required to switch shaders, targets and other data between draws, which starts to cut things down a bit.

If you take a simple example of a scene with 2 dynamic lights being used for shadow mapping and one forward pass, then that is going to require 3 'passes', with different data sets and render targets.

For the sake of easy maths let's say the whole thing takes 90ms to submit; 30ms per pass (clearly wildly too big as numbers go, but they serve the point).

With the proposed OpenGL solution you would burn 90ms on one thread doing all the work. That's one core gone, and while you might be able to do other work on the remaining cores, that also introduces cache issues, as caches are shared between cores.

With Mantle and DX12 the solution would be to offload the creation of the command buffers for each pass to different threads and then submit them all at the end.
So now you split the work across 3 threads and then spend a bit of time on thread 0 executing them; let's say that takes 3ms.

You've now gone from chewing up one thread for 90ms to doing the work across 3 threads for a total of about 33ms. Setup and validation work is completed in parallel, submission takes very little time as you are just blasting through an array saying 'do this, then this, then this', and everyone is happy because the workload is spread, shiny resources are used and a bottleneck is freed up.
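The record-in-parallel, submit-on-one-thread shape can be sketched in C++ terms like this (the command list type and pass contents are stand-ins, not any real API):

```cpp
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an API command list: each pass records its
// commands on its own thread, then thread 0 submits them all in order.
using CommandList = std::vector<std::string>;

CommandList recordPass(int passIndex)
{
    // The expensive setup/validation work happens here, in parallel.
    CommandList cmds;
    cmds.push_back("bind targets for pass " + std::to_string(passIndex));
    cmds.push_back("draw scene for pass " + std::to_string(passIndex));
    return cmds;
}

std::vector<std::string> buildAndSubmitFrame(int passCount)
{
    std::vector<CommandList> lists(passCount);
    std::vector<std::thread> workers;
    for (int i = 0; i < passCount; ++i)
        workers.emplace_back([&lists, i] { lists[i] = recordPass(i); });
    for (auto& w : workers)
        w.join();

    // Submission on thread 0 is cheap: just replay the pre-built lists.
    std::vector<std::string> submitted;
    for (const auto& list : lists)
        submitted.insert(submitted.end(), list.begin(), list.end());
    return submitted;
}
```

The point of the shape is that the per-pass cost is paid concurrently; only the trivial concatenation/kick happens serially.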

This was a simple example; start increasing the lights, add in post-processing, throw in more passes for compute and other data and this can balloon up pretty quickly.

I worked on a forward-rendered game which had 9 passes, and it could run in 4-player split-screen for a total of 36 passes and separate chunks of draws (2 UI passes, 1 main, 1 depth, 1 overlay, 3D UI, post-process, screen-space particles and OSD overlay - per player). The test bed for the next game engine was already running at least that number of passes as a hybrid renderer (some deferred, some forward).

Also, finally, it isn't 'all about draw calls'; it is about control and hints. It's about telling the driver "hey, upload this for me via DMA, I want it in a bit" when you know you can; it's about laying down a massive hint that this compute shader can run on a compute-only ring because it doesn't need any graphics interactions; it's about letting memory be used as you like without the constraint of 'this must be type X because it was created as type X'; it's about control over things such as surface conversion and unpacking, and even SLI/CFX setups where the work can be directed and resources can be dealt with without the driver trying to guess what the hell is going on.

In theory something like Mantle means that a CFX rig will work correctly from day 1 without having to wait for a magic driver patch. Now, I recognise that right now this isn't quite the truth, but still. Heck, as an aside, with various iGPU/dGPU combinations in the wild, being able to shift work about is a must.

Frankly, treating the GPU like a one-command-buffer, one-device setup is hurting it, forcing people to work around it, and it's really annoying.

Also, let's not forget that it is some months since OpenGL 4.4 was completed and AMD STILL doesn't have a 4.4 driver, and is missing key extensions used in the recent GDC zero-driver-overhead presentation despite being part of that presentation themselves, which REALLY bothers me.

#5141463 AMD's Mantle API

Posted by phantom on 23 March 2014 - 09:59 AM

And, amusingly, AMD's drivers STILL don't have support for the key extensions needed to implement the functionality in that talk... the one which AMD took part in.

Basically 'OpenGL has the features' if you accept that 'has features' = 'NV only support'.

For what it's worth bindless is an ARB extension so in theory will appear on all 4.4 hardware at some point... in theory.

Not that any of this allows threaded command list construction (because single-threaded has hit a wall), splitting off data-upload commands via DMA transfers, setting up compute-only workloads separate from the gfx ring, or treating memory as just a chunk of memory.

But.. hey... instancing, right? It'll solve all the problems!

#5141431 Unreal Engine 4

Posted by phantom on 23 March 2014 - 07:40 AM

Also a quick question to those who have code access: do they also provide the source code for their tools, or is it just for the runtime?

*puts on works-with-engine hat*

All the tools use the engine in some way, as everything uses the UObject system for serialisation, so by getting one you get the code for the others.

Basically everything I get at work, sans console support, you get in this release.

#5140617 Unreal Engine 4

Posted by phantom on 20 March 2014 - 05:05 AM

Aren't there strings attached to this? I mean.. In that case I could pay for a month, get the engine, cancel subscription and work with what I have. At the start, updates might be vital for some stuff, but there comes a point where updates aren't of such a big relevance that I must need them in order to make a game.

The 5% aside, the 'catch' would be that if you come across something which is broken then you are either a) stuck with it, b) fixing it yourself, or c) paying another $19 to catch up and merge everything you've missed out on.

You'd also not get new features, fixes, platforms etc as they hit mainline.

From Epic's point of view, if you pay once and never release anything, they've made $19 they wouldn't have otherwise.
If you pay once and release something, then you owe them 5% they wouldn't have had otherwise.
If you keep paying and release something, then they have N*$19 + 5%.
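The arithmetic above as a throwaway helper (numbers from the post; purely illustrative, not Epic's actual billing logic):

```cpp
#include <cmath>

// Total cost under the UE4 subscription model as described here:
// $19 per month while subscribed, plus 5% of gross revenue on release.
double totalCost(int monthsSubscribed, double grossRevenue)
{
    return 19.0 * monthsSubscribed + 0.05 * grossRevenue;
}
```

So one month and no release is $19 flat, while a year of updates plus a $100,000-grossing release works out to 12*$19 + $5,000 = $5,228.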

It's a case of what is it worth to YOU and how does it contrast with other engines in the market for your usage.

Personally, I'd pay the $19 once just to get a look at the code and decide from there; it's not a huge amount of cash after all, and if worst comes to worst you'll learn something to boot.

#5140293 Array of samplers vs. texture array

Posted by phantom on 19 March 2014 - 05:42 AM

The main issue with 'bindless' is the extra fetch/indirection to get the data needed to look up the texture; when something is bound, that cost is paid up front and shoved somewhere shared.

As to the OP's question;
If you take something like AMD's GCN, then a 'texture' is nothing more than a resource descriptor of 128 or 256 bits which contains a pointer to the real texture data. When you sample, instructions are issued to read that descriptor and then go off and fetch the data.

When it comes to a texture array, the layer index is part of the generated address and you only go via one resource handle; textureHandle->Fetch(i, x, y); in C++-like code.

When it comes to an array of samplers/textures, you are effectively indexing into an array of multiple resource handles; textureHandles[i]->Fetch(x, y); at which point, as Hodgman mentions, you enter a world of divergent flow, with multiple if-else chains and the usual issues that brings.
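A toy CPU-side model of the difference (not real GPU code; the types and sizes are stand-ins for the descriptor-indirection idea):

```cpp
#include <cstdint>
#include <vector>

// Stand-in for a texture resource: one 'descriptor' owning texel storage.
struct Texture
{
    int width = 4, height = 4, layers = 1;
    std::vector<uint32_t> texels; // layers * width * height texels

    uint32_t fetch(int layer, int x, int y) const
    {
        return texels[(layer * height + y) * width + x];
    }
};

// Texture array: ONE handle; the layer index is folded into the address,
// so every lane goes through the same descriptor (uniform path).
uint32_t sampleTextureArray(const Texture& tex, int layer, int x, int y)
{
    return tex.fetch(layer, x, y);
}

// Array of textures: the index first selects one of MANY handles; with a
// divergent index this becomes per-lane descriptor selection (the if-else
// chain the post describes), i.e. an extra level of indirection.
uint32_t sampleTextureFromArray(const std::vector<Texture>& texs,
                                int i, int x, int y)
{
    return texs[i].fetch(0, x, y);
}
```

The two calls return the same kind of result; the difference that matters on hardware is whether the descriptor read is shared across the wavefront or chosen per lane.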

While the above is based on the public AMD GCN Southern Islands data I would be surprised if NV didn't work in a similar manner.

Bindless, as mentioned, slots into this with an extra indirection on the texture descriptor, as it has to come from another source instead of being pre-loaded into a local cache.

Where it really gets fun is when you combine bindless, sparse and array textures to reserve a large uncommitted address space and just page data in and out on demand (16k * 16k * 2048 is a HUGE address range!) - with the recently released (but unspec'd!) AMD_sparse_texture_pool extension you can even control where the backing store is pulled from and limit its size.

#5139780 Million Particle System

Posted by phantom on 17 March 2014 - 01:39 PM

Geometry shaders are basically the shader stage best avoided if you can; by their nature they tend to serialise the hardware a bit.

For a particle system you'd be better off using compute shaders, as you can pre-size your buffers and have them read from one buffer and write to the other. You also only need one shader stage to be invoked; using the geometry shader implies a VS must run first, even as a pass-through, due to how the logical software pipeline is arranged - a compute shader wouldn't have this.

That's not to say compute doesn't come with its own set of potential pitfalls, but it is better suited to this task.
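The read-one/write-the-other scheme, sketched on the CPU for clarity (the 'simulation step' is obviously a stand-in for what the compute shader would do):

```cpp
#include <cstddef>
#include <vector>

// Two pre-sized buffers that swap roles each update: the shader reads
// from one and writes to the other, so no buffer ever resizes mid-frame.
struct ParticleBuffers
{
    std::vector<float> a, b; // e.g. particle positions
    bool readFromA = true;

    explicit ParticleBuffers(std::size_t count) : a(count), b(count) {}

    void update(float dt)
    {
        auto& src = readFromA ? a : b;
        auto& dst = readFromA ? b : a;
        for (std::size_t i = 0; i < src.size(); ++i)
            dst[i] = src[i] + dt;   // stand-in for the real simulation
        readFromA = !readFromA;     // ping-pong for the next frame
    }

    const std::vector<float>& current() const { return readFromA ? a : b; }
};
```

On the GPU the swap is just rebinding which buffer is the SSBO read view and which is the write view; nothing is copied.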

#5139744 Million Particle System

Posted by phantom on 17 March 2014 - 11:27 AM

And just to bring up an important point: I don't think you REALLY want to spawn 1,000,000 particles from one emitter in one frame... if nothing else it'll be bloody slow, and you can do a lot with surprisingly few particles per emitter.

However if you DO want to do this, because you want an effect which has a lot of particles, then I suggest creating them at load time either on the CPU or via a one off GPU task instead of trying to spawn them like normal particles.

(As to why you don't want 1,000,000 particles, consider this: a 1920*1200 screen only has 2,304,000 pixels on it, so with 1 million particles each one would cover a little over 2 pixels on average.)

#5139443 DrawLine args name confusion

Posted by phantom on 16 March 2014 - 08:37 AM

Clearly you aren't interested in reason and just want to shove your own opinion around under the guise of 'discussion' so I'm going to kill this here and now before any more people get dragged into the entropic black hole this topic threatens to become.

#5137451 Sorting a std::vector efficiently

Posted by phantom on 08 March 2014 - 06:54 PM

Just sort the renderables in place?
It'll probably be faster than 4 memory allocations, initialisation of 3 arrays, a sort, and the 'newindex' setup (which will be horrible, because you could end up jumping all over memory to find your indices, as it's a double indirection).

Or, when you add them, use an insertion sort to place each one at the correct position in the vector (a vector which should be pre-sized to 'max number of entities' before you even try this loop).
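A minimal sketch of the insert-in-order approach ('sortKey' stands in for whatever the real sort criterion is):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical renderable: only the sort key matters for this sketch.
struct Renderable { int sortKey = 0; };

// Insert each renderable at its sorted position as it is added, into a
// vector that was reserved up front, so the container stays sorted and
// no separate index arrays or post-sort are needed.
void addSorted(std::vector<Renderable>& v, const Renderable& r)
{
    auto pos = std::upper_bound(
        v.begin(), v.end(), r,
        [](const Renderable& lhs, const Renderable& rhs)
        { return lhs.sortKey < rhs.sortKey; });
    v.insert(pos, r); // shifts the tail, but reserve() avoids reallocation
}
```

For small-to-medium counts the tail shift is cheap (contiguous memmove), which is exactly why this can beat the allocate-three-arrays-then-sort route.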

#5137110 Geometry Shaders

Posted by phantom on 07 March 2014 - 05:57 AM

It might be sane, but the problem is, as stated, there is no stage before the VS in which you could do this; at best you could do a pass-through VS and do the work in the GS, but that'll hurt, as you'll end up spilling out to graphics memory and introducing ordering constraints.


Amusingly, the whole VS->[tess]->GS thing is just an artifact of the way the pipeline was set up; current hardware could quite happily put a 'GS'-like stage first to operate on a primitive (IA is done in ALU cores these days anyway) before the VS/tess stages... in fact a VS wouldn't really be needed at all in that situation, as you could test and transform 3 points just fine.


In short: the idea might have merit, but the way the software pipeline is arranged means that while the hardware could operate in that manner, you aren't going to be able to do it.

#5134905 AMD's Mantle API

Posted by phantom on 26 February 2014 - 06:04 PM

I definitely didn't think that Microsoft would respond so quickly! Let's hope that it's something substantial (bindless resources plz!), and not just a band-aid.

The brief I saw said 'future' in it; this could mean anything from 6 months away to Windows 9, given the recent tendency to put D3D updates behind Windows-version walls.

The problem is the only way you are going to get a proper fix is an API redesign, a backend redesign, and thus a whole bunch of new driver code... and if it looks nothing like D3D11 it's just going to make everyone throw their hands in the air, because now you have to support 4 APIs (D3D, DXNext, D3D-Xbox One and GNM), all of which are subtly different to the point of driving people mad, so you'll end up with a common-subset work-alike, aka a D3D11 target, anyway.

And this is before you take into account that the monolithic 'one interface to submit calls' doesn't really make sense any more; take something like AMD's R9 290X card: the GPU on that has a gfx/compute command processor AND 8 compute-only command processors AND a DMA engine to handle uploads/downloads of data.

Even if you can hide the latter behind an API (and pray it does things correctly; in GL land I've heard you need a second context to get NV's drivers to do DMA uploads of data...), there needs to be some method of logically queuing up different workloads, establishing dependency graphs between them, kicking off long-running/low-priority tasks, controlling resource allocation for tasks (shadow generation tends to be ROP and ALU heavy but light on read bandwidth, so being able to kick off, say, a read-bandwidth-heavy, ALU-light task at the same time would be useful), and even controlling how memory is laid out (give me a 20MB chunk, I'll layer textures or whatever into it) and reused.

The fact is the GPUs can do a hell of a lot more than is exposed in any API right now and trying to slap an abstraction over it which hides that completely is just annoying.

Yes, not everything is needed by everyone, but then those people aren't crying out for more resources, so meh, they could stick with D3D11 - unless the API allows the same kind of control as Mantle was looking to give for those who want it, it's going to be horrible once again.

Right now I'd summarise things as follows;
- D3D; we promise things for the future, but it'll probably be for Win8 only if you are lucky, more than likely Win9.. and will probably still have more overhead than you want but abstraction!

- OpenGL; instancing is the answer to everything! Here is how to do it! It only requires these 8 extensions! Oh, and only one vendor has implemented them all, which may or may not be completely to the letter of the spec; another vendor is short some key extensions but 'soon'; and the third is still a few GL versions behind, but don't worry, we'll get there! You can trust us! Just forget it took us forever to get VBO out, forget the GL2.0 events, and ignore GL3.0/Longs Peak... really, we can change, ignore 14 years of history!

- Mantle; One vendor. One GPU type. Full control. Maybe it'll work on more platforms... one day... and you can't see it unless you are an AAA dev!

It's all a wonderful mess... I'm just glad that as of today I'm no longer doing rendering for a day job as it's all just... ugh.

#5132917 Any options for affordable ray tracing?

Posted by phantom on 20 February 2014 - 03:51 AM

The catch is that it'll only be popular on the consoles, as on PC it currently requires Win8 + D3D11 GPUs, or Mantle and D3D11-class AMD GPUs...

GL is also part way to this with the GL_ARB_sparse_texture extension, which, if combined with ARB_bindless_texture, might yield some interesting possibilities. (Although unlike D3D and (probably) Mantle, it currently leaves memory allocation under driver control... which is mildly annoying... apparently another extension is in the works to fix that.)

Same hardware constraints as the DX/Mantle stuff, of course.