
Matias Goldberg

Member Since 02 Jul 2006

#5244393 GPU support for D3D12_CROSS_NODE_SHARING_TIER

Posted by Matias Goldberg on 03 August 2015 - 04:19 PM

Shouldn't every GPU that fully supports DirectX 12 support all of its features?

I want a pony as well. GPUs aren't like CPUs which are all the same.

GPUs are extremely different, and some have superior architectures to others; some are better at certain tasks, others are better at other tasks.

Especially when you want existing GPU hardware to be able to run DX12 right now.

If you don't like that, then you can get out of graphics development in games because this heterogeneity has been driving innovation for the last 2 decades.


To the OP:

There is a chart with tiers per GPU.

Don't be fooled by it though. A Tier 1 GPU may be Tier 1 because it doesn't support the X & Y features, but it turns out that if it weren't for them, it would be considered Tier 3 (e.g. it may have features or precision that not even Tier 2 GPUs have).

Tiers only guarantee a minimum, not a maximum. You should watch out for the capabilities you can query via the D3D12 API.

#5244200 D3D12 Root Signatures for different shaders

Posted by Matias Goldberg on 02 August 2015 - 03:25 PM

Aren't tables part of the root signature?

Perhaps we should clarify a few confusions:
"Descriptors" are stored in the Heap.
A "Descriptor Table" is nothing more than a range of the heap.
This range of the heap must be set to the root signature.
The Root Signature has its own small heap for 'dirty' things.
In C jargon, we could say it's similar to the following:

struct Descriptor;

Descriptor myTable[256];
myTable[0] = Descriptor( ... );
myTable[1] = Descriptor( ... );
SetNumGraphicsRootDescriptorTables( 1 ); //At initialization, tell the Root signature we can have up to 1 table. AFAIK we cannot change it later.
SetGraphicsRootDescriptorTable( 0, &myTable[2], 5 ); //Bind myTable[2] through myTable[6]

This is of course simplified.
Whether you use just one table or multiple ones is up to you (you will probably need more than one because of certain restrictions, e.g. samplers must live in their own table). You probably want around 4 or 5 so you can switch tables based on update frequency (e.g. do not put a texture that will be used by all objects, like a lightmap or a shadow map, in the same table as the diffuse textures).
The max amount of tables you can have depends on how you use the limited Root space.
The key to performance is in baking the heaps as much as possible so that you just set the ranges of the table and fire rendering, rather than filling the heap data on the fly.
Examples in pseudo code (I'm assuming 2 textures per shader for simplicity of explanation):
You can bake data like this:

Descriptor myTable[256];

//Once, during initialization:

//Material A, Shader A
myTable[0] = setupDescriptor( diffuseTextureF );
myTable[1] = setupDescriptor( specularTextureG );

//Material B, Shader A
myTable[2] = setupDescriptor( diffuseTextureH );
myTable[3] = setupDescriptor( specularTextureI );

//Material C, Shader B
myTable[4] = setupDescriptor( diffuseTextureJ );
myTable[5] = setupDescriptor( roughnessmapK );

//Every frame:
size_t lastId = -1;
for( i < numObjects )
{
    if( renderable[i]->GetMaterialId() != lastId )
    {
        lastId = renderable[i]->GetMaterialId();
        SetGraphicsRootDescriptorTable( 4, &myTable[lastId * 2], 2 ); //2 descriptors per material
    }
    drawPrimitiveCmd( renderable[i] );
}

Or you can set every descriptor on the fly, D3D11/GL style:

//Every frame:
size_t currentTableIdx = 0;
size_t lastId = -1;
for( i < numObjects )
{
    if( renderable[i]->GetMaterialId() != lastId )
    {
        lastId = renderable[i]->GetMaterialId();
        myTable[currentTableIdx+0] = setupDescriptor( renderable[i]->GetTexture0() );
        myTable[currentTableIdx+1] = setupDescriptor( renderable[i]->GetTexture1() );
        SetGraphicsRootDescriptorTable( 4, &myTable[currentTableIdx], 2 );
        currentTableIdx += 2;
    }
    drawPrimitiveCmd( renderable[i] );
}

The first method lets you reuse descriptors whenever a material is reused, even when the draws aren't contiguous. The second method only reuses descriptors if renderable[i] and renderable[i-1] use the same resources; plus it wastes performance setting them up every time they change.
And of course you can reduce the number of calls to SetGraphicsRootDescriptorTables by dynamically indexing the textures in the shader as shown in the samples, which is basically a much better version of D3D11's texture arrays (beware of hardware limits).

#5244188 D3D12 Root Signatures for different shaders

Posted by Matias Goldberg on 02 August 2015 - 02:44 PM

You should watch the videos from DX12 presentations from way before the SDK was released.


Basically the root signature is a "wildcard" to do dirty stuff quickly or to do very simple global state changes where it makes sense to put them on the root signature.


But you're strongly encouraged to use and reuse descriptors stored in heaps, bound via tables. You may want one descriptor per shader, multiple descriptors per shader, or one descriptor per group of shaders. Same with the tables.


Edit: So if shader "A" has permanent data that is always bound (e.g. shadow maps, per-pass data like the view matrix or ambient colour), these buffer views (aka descriptors) would go into one table; more frequently changed data (e.g. diffuse textures) would go into another table; less frequently changed data (e.g. environment maps, a texture buffer containing an array of world matrices) would go into its own table; and data that changes per draw (avoid this) should be set on the Root Signature directly.

If you can reuse those tables for shader B, that's great. Otherwise add more descriptors.

You can bake the data beforehand (e.g. prepare the descriptor heaps holding the textures used by each material) or do it on the fly (burn heap space for every texture that needs to be rebound, changing the tables during the process)

#5243459 PSO cache

Posted by Matias Goldberg on 29 July 2015 - 06:26 PM

Now that the flood gates are being opened, I can answer.

The source code provided by the post above pretty much covers it. A couple of notes:


  1. You still need to fill in all the values of the D3D12_GRAPHICS_PIPELINE_STATE_DESC, not just the blob pointer. The DX API is likely unable to read the blob (only the driver can), so it still needs you to fill in this data.
  2. The cache is unique to a driver & GPU model.
  3. Because of the above, you have to check whether PSO creation from the cache succeeded. It can fail if the user updated their drivers or changed their GPU. You can see that the source code provided by MS does exactly this.
  4. Most of the time is spent compiling HLSL shaders to D3D bytecode ASM, which is an intermediary representation. You should cache this (which is GPU/driver agnostic) and that has been possible since DX11. PSOs go one step further by allowing you to cache the D3D bytecode -> ISA translation, which is usually fast, but still a nice bonus.
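As a hedged sketch of points 2 & 3 (the function names below are stand-ins I made up; the real calls would be ID3D12Device::CreateGraphicsPipelineState and ID3D12PipelineState::GetCachedBlob), the cache-validation flow boils down to:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Try to create the PSO from the cached blob first; if the driver rejects
// it (e.g. the user updated drivers or swapped GPUs), fall back to a full
// compile and refresh the cache for the next run.
using Blob = std::vector<uint8_t>;

bool createPsoWithCache( Blob &cachedBlob,
                         const std::function<bool( const Blob& )> &createFromBlob,
                         const std::function<Blob()> &createFromScratch )
{
    if( !cachedBlob.empty() && createFromBlob( cachedBlob ) )
        return true;                  // Cache hit: the driver accepted the blob.
    cachedBlob = createFromScratch(); // Stale or empty cache: full compile.
    return false;
}
```

If the blob is rejected you pay the full compile exactly once, and the refreshed blob serves the next run.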

#5242802 Who wants to see a C replacement?

Posted by Matias Goldberg on 26 July 2015 - 01:19 PM

Are we going to ignore that C is just wrapping around assembly into a human readable language that can compile into any machine assembly there is a compiler for?


Because that's what makes C so damn good at it. It's not meant to be pretty. It's meant to abstract how computers work and call it a day.

You can take a random float, cast its hexadecimal representation to a pointer, and read or write memory from/to that location. You can also write to the heap, and self-modify the same code you're executing!

A Java developer would say "What the f***?" How can a low level language like C implement any useful form of garbage collection when you can make a random number in a random location in memory become a pointer that can be garbage collected (or turn that pointer back into a random number)? Yes, you can work around that problem, like C# does with its unsafe keyword. But that brings a whole new world of problems that don't exist in C.


There are very few things you can do in asm that C can't (often because they're too specific to a platform), and that is usually fixed with intrinsics or compiler extensions.


Headers are not a problem in C. They're extremely fast, and they work. When they break, it's because you're working with the wrong library version, or because someone wrote a header of extremely poor quality (so why are you using that header?).

Certain languages (e.g. C#) use systems that force people into certain programming paradigms to make including external code less painful. Python goes as far as telling people how to indent their code. But see... that's a problem. People working with C like C because it lets them approach a problem however they want; it doesn't force an approach onto coders. Sure, these same people also hate that C lets idiotic programmers use bad patterns.

But if they're using C, it's because they need expert, disciplined programmers anyway (you're doing low level stuff, right?). If they need to solve a problem that less skilled programmers can handle, they use C++, C#, or Java. Or Python. Or <insert high level language here>.

Compile time is also not a problem.


All of this creates the ideal conditions that make C unmovable: The ecosystem behind it.

The OS API provides cdecl calls, because it was written in C. There's a lot of thin libraries that work with it. Hell... even Python, PHP, Lua... they're written in C!

Debuggers are very mature and have lots of features. I can hook GDB to a server workstation 14,000 km from my location, place breakpoints, inspect the assembly, inspect the source code, and watch all the variables. Inspect the raw memory. Modify it. Alter the flow of execution at instruction level in real time. I can even place data breakpoints that fire when a memory location I selected changes, at the source code line where it changed.

Even if you make the perfect programming language that surpasses C, you have to compete with THAT. And people will prefer C because of that toolset. You're therefore in a chicken and egg problem: Your language needs adoption to build an ecosystem, your language needs an ecosystem to get adoption.

The alternative is to have a massive amount of resources to build that ecosystem without the need for adoption.


Headers are a problem in C++, where including them can increase your compilation time many times over due to templates, virtual tables, and other fancy language features. The include-header model doesn't scale well with features/complexity. Also, the more features, the bigger the chance a header will break compilation.

I prefer C++ because it allows me to generically write a container (e.g. std::vector, std::list) without too much work. I also get more type safety. Virtual calls can be nice if used appropriately. It sits at a higher level than C because it does more than just wrapping around assembly into text.

Unfortunately, C++ is now heading towards its "pythonification" (let's make a standard library for everything! std::libcairo... seriously?), bloating the language with features that are barely useful but cause severe degradation of runtime and compilation performance (yes, let's use them because they're "modern"), instead of fixing longstanding issues like "I can't stringify an enum without writing it by hand and keeping it in sync, or without using nasty macros", or improving compile times (i.e. C++'s proposed "modules"... I'm still waiting for them).
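For what it's worth, the "nasty macro" workaround mentioned above usually looks like the X-macro pattern below (a sketch with made-up names, not any standard facility):

```cpp
#include <cstring>

// X-macro: the enum values and their string table are generated from a
// single list, so they cannot drift out of sync.
#define COLOUR_LIST \
    X( Red )        \
    X( Green )      \
    X( Blue )

enum Colour
{
#define X( name ) name,
    COLOUR_LIST
#undef X
};

const char* colourToString( Colour c )
{
    switch( c )
    {
#define X( name ) case name: return #name;
        COLOUR_LIST
#undef X
    }
    return "?";
}
```

It works, but adding an enum means touching a macro list, and the preprocessor gives terrible error messages when you get it wrong; hence the complaint.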

Sadly, with so many features it still doesn't prevent less skilled developers from screwing up, just like C. As Bjarne Stroustrup said: "C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do it blows your whole leg off."

#5242062 Starting an DX7 Retained mode program in windows 7 x64 sp1

Posted by Matias Goldberg on 22 July 2015 - 06:02 PM

Two seconds of Google were enough: Direct3D Retained Mode was removed from Windows Vista onwards.

#5241741 Updating large buffers

Posted by Matias Goldberg on 21 July 2015 - 11:51 AM

I update an existing vertex buffer (or sometimes generate a new one if none with enough space exists)

This is a problem. Resource generation is not fast. It can cause the driver to do all sorts of maintenance tasks.
Make sure you've preallocated big enough vertex buffers so you don't have to create more.


This reduces stuttering, but it still occurs. I use buffers with D3D11_USAGE_DEFAULT and update them with UpdateSubresource().

That's another problem. The DX runtime will try to copy your data to a temporary location and defer the actual upload to the GPU, in order to keep UpdateSubresource from being a blocking call. But if the buffer is too large or the runtime runs out of temporary storage, it will block and upload on the spot.
You basically lose control over when your data is truly uploaded.

Use a dynamic buffer mapped with D3D11_MAP_WRITE_NO_OVERWRITE (and when you've fully written the buffer and therefore need to start from 0, either issue a DISCARD or use event queries to synchronize with the GPU), or use a staging buffer, map it with D3D11_MAP_FLAG_DO_NOT_WAIT, and then issue a CopySubresourceRegion from the staging buffer to the vertex buffer.
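Stripped of the API calls, the NO_OVERWRITE/DISCARD pattern is just ring-buffer arithmetic. A minimal sketch (the struct and names are mine, not D3D11's):

```cpp
#include <cstddef>

// Tracks where the next write into a dynamic buffer goes. While there is
// room we keep appending (map with D3D11_MAP_WRITE_NO_OVERWRITE); only when
// we'd run past the end do we wrap to 0, which is the point where a DISCARD
// (or an event-query wait) is required.
struct RingAllocator
{
    size_t capacity;
    size_t offset;

    // Returns the byte offset to write at; sets mustDiscard when the caller
    // has to use D3D11_MAP_WRITE_DISCARD instead of NO_OVERWRITE.
    size_t allocate( size_t bytes, bool &mustDiscard )
    {
        mustDiscard = false;
        if( offset + bytes > capacity )
        {
            mustDiscard = true; // Out of space: wrap around and start over.
            offset = 0;
        }
        const size_t at = offset;
        offset += bytes;
        return at;
    }
};
```

The DISCARD hands you a fresh buffer the GPU isn't reading, so wrapping to offset 0 is safe without stalling.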

#5241329 Question about 3/4 perspective

Posted by Matias Goldberg on 19 July 2015 - 01:18 AM

I keep those entities in a list. Every time an entity moves in y, this list is sorted in ascending order of y. This works fine, except that while they are moving they keep appearing and disappearing from the screen, which I find very weird.

You need to sort by Y, and then sort by X for each Y row.

Otherwise, if A & B are next to each other at the same Y coordinate, they may overlap, and A may be rendered on top of B or vice versa at random.

Btw, if objects are disappearing, it's likely some silly bug in your code, or your sorting code violates the strict weak ordering rule and thus misbehaves when the elements change.

#5241292 Pong. Reflection of ball intersecting paddle

Posted by Matias Goldberg on 18 July 2015 - 05:41 PM

The thing is. The A and B.. is that the edges?

Yes, those are the black arrows from my picture. Their values are up to you (tweaked on game design).
The values I used in the graph were Vector2( -0.707106781, -0.707106781 ) and Vector2( -0.707106781, 0.707106781 ), which are basically normalized normals inclined at -45° and 45° respectively.

And how should i insert them into this method? Since it parameters is float?

There is a version that takes a Vector2. But even if there weren't, you just need to lerp each component of the vector (the X and Y coordinates) individually; in other words, two lerp calls.

And what is this Amount in the function?

A value between 0 and 1. Think of it like 0% and 100%.
You can use the distance in Y from the ball's position to the bottom edge of the paddle, divided by the height of the paddle, to get a value in the [0; 1] range.

Now I've done most of the job for you. You have the tools. Time for you to think it through, experiment, and figure out the few remaining details yourself. We learn through failure, not success. I only helped you because you seemed really stuck.

#5241278 Pong. Reflection of ball intersecting paddle

Posted by Matias Goldberg on 18 July 2015 - 04:10 PM

I want to reflect my ball when it hits the paddle. If it hits the middle I want it to reflect horizontally. The closer it gets to the edges I want it to reflect sharper and sharper.
How can this be done EASILY? I've tried using angles/radians etc, but I just can't get the hang of it. Can I use any of the C#/XNA/Monogame inbuilt library to do this? Or is there some other way?

First, to reflect the ball you need the reflection vector formula.


Apply this formula with the direction of travel from the ball against the paddle's normal to get R. Examples:

Ball's direction of travel =  Vector2( 0.707106781, 0.707106781 )

Paddle's normal = Vector2( -1, 0 )

R = D - 2 * dot( D, N ) * N = Vector2( -0.707106781, 0.707106781 )


Ball's direction of travel =  Vector2( 1, 0 )

Paddle's normal = Vector2( -1, 0 )

R = D - 2 * dot( D, N ) * N = Vector2( -1, 0 )


Now, this may be a little uninteresting because the paddle's normal is always (-1, 0) across the entire border (unless you hit the top or bottom edge). You may want to make it more interesting by interpolating between two normals at the edges, like in this picture:


In black, the normals at the edges.

In orange, the interpolated normal finalNormal = lerp( A, B, W ) where A & B are the two normals, and W is a value between 0 and 1 where 0 means the ball is next to A, and 1 means the ball is next to B. Google how to write a lerp function (lerp stands for 'linear interpolation').


Make sure all your vectors are always normalized; lerp doesn't always return a normalized vector even if A & B are normalized. The reflection vector will be normalized if D and N already were, but it may not hurt to renormalize it.


You can also make it more interesting by skewing the normal slightly upwards or downwards based on the velocity at which the paddle is moving, so that when skilled users move the paddle fast, the ball will not reflect 100% horizontally even when it hits the exact middle (it's a way to fake friction).
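Putting the two posts together, here's a minimal sketch of the reflection formula and the normal interpolation (with a toy Vector2 standing in for XNA's own type):

```cpp
#include <cmath>

struct Vector2
{
    float x, y;
};

float dot( Vector2 a, Vector2 b ) { return a.x * b.x + a.y * b.y; }

Vector2 normalized( Vector2 v )
{
    const float len = std::sqrt( dot( v, v ) );
    return { v.x / len, v.y / len };
}

// R = D - 2 * dot( D, N ) * N
Vector2 reflect( Vector2 d, Vector2 n )
{
    const float k = 2.0f * dot( d, n );
    return { d.x - k * n.x, d.y - k * n.y };
}

// finalNormal = lerp( A, B, w ), renormalized because the lerp of two unit
// vectors is generally not unit length.
Vector2 lerpNormal( Vector2 a, Vector2 b, float w )
{
    return normalized( { a.x + ( b.x - a.x ) * w, a.y + ( b.y - a.y ) * w } );
}
```

With the numbers from the first example, reflect( {0.707107f, 0.707107f}, {-1, 0} ) yields (-0.707107, 0.707107), and lerping the two edge normals with w = 0.5 gives (-1, 0), matching the horizontal reflection at the paddle's middle.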

#5240983 Unnecessary Vertex Attributes

Posted by Matias Goldberg on 16 July 2015 - 09:27 PM

I've heard mostly the same thing over the years. Using interleaved vertex data vs structure of arrays is platform dependent. To me, "platform" is a generic term that I've personally defined as a combination of OS, driver, and hardware, I guess. The way I've been doing things lately is to have up to two VBOs for drawing things to the screen, particularly models. The model data loaded from the stream holding the data (positions, normals, tangents, UVs, etc.) would all be loaded into a single, massive VBO shared across all models. This data would be set to GL_STATIC_DRAW, and I'd have a second VBO for models that contains instance data. This would be set to either GL_STREAM_DRAW or possibly GL_DYNAMIC_DRAW, as the buffers may need to be updated on a regular, frame-by-frame basis.

If you're writing modern OpenGL (i.e. no GLES, no OS X), consider thinking of the flags in terms of immutable buffer storage (glBufferStorage) instead of GL_DYNAMIC_DRAW and Co.
Loading all meshes into one VBO is a good idea.


This has caused issues when it comes to designing an intuitive workflow implementation since I don't want a one-size-fits-all vertex format that all models I'll be drawing to the screen have to conform to. There's all kinds of design issues.

You should use as few vertex formats as possible, not explode them. This means you must enforce certain rules and guidelines for your art team to follow.
Random free meshes from the internet are often of very poor quality (this doesn't mean they don't look good; they just were not modeled for real time rendering but rather "for the looks"); and are often very inconsistent.

Btw, I did some research on QTangents, and it sounds like it's a quaternion to generate TBN on-demand.

Yes, "QTangent" is a glorified term for a quaternion with a few tweaks: it exploits mathematical properties to encode the reflection information of the TBN matrix (Q = -Q; thus you gain 1 bit of information), and also accounts for the lack of "-0" (negative zero) on certain GPU hardware. The main goal of a QTangent is to reduce the size of a TBN matrix: four 16-bit floats beat the crap out of the seven 32-bit floats of a regular TBN (3 for the normal, 3 for the tangent, 1 for the sign of the bitangent; derive the bitangent using cross( normal, tangent ) * reflectionSign).


Also: you mentioned using half-floats for vertex data. I know I can generate half-floats as shader output, usually to a render target, but is it possible to push them from the CPU? My assumption is that I'd have to take a native float and use some sort of library that re-encodes it into a 16-bit unsigned short holding the bit pattern the target GPU uses for 16-bit floats.

API side, you need to declare your vertex format as GL_HALF_FLOAT.
When filling your data pointers from CPU to GPU, you are indeed going to need some library or routine that converts your 32-bit floats to 16-bit ones, since neither C++ nor the x86 instruction set deals with 16-bit half floats directly. Google is your friend.
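Such a routine is small enough to write yourself. A sketch that truncates the mantissa and flushes denormals to zero (production code would also round to nearest even):

```cpp
#include <cstdint>
#include <cstring>

// Re-encode a 32-bit IEEE float into the 16-bit half-float bit pattern
// expected by GL_HALF_FLOAT. Truncates the mantissa; flushes values too
// small for a normalized half to signed zero; clamps overflow to infinity.
uint16_t floatToHalf( float f )
{
    uint32_t bits;
    std::memcpy( &bits, &f, sizeof( bits ) );

    const uint32_t sign = ( bits >> 16 ) & 0x8000u;        // Sign bit.
    const int32_t  exp  = (int32_t)( ( bits >> 23 ) & 0xFFu ) - 127 + 15; // Rebias 8-bit -> 5-bit exponent.
    const uint32_t mant = bits & 0x007FFFFFu;              // 23-bit mantissa.

    if( exp <= 0 )  return (uint16_t)sign;                 // Underflow -> ±0.
    if( exp >= 31 ) return (uint16_t)( sign | 0x7C00u );   // Overflow -> ±inf.
    return (uint16_t)( sign | ( (uint32_t)exp << 10 ) | ( mant >> 13 ) );
}
```

For example, 1.0f encodes to 0x3C00 and -2.0f to 0xC000, matching the binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits).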

Still trying to wrap my head around this. Does this mean that the driver has kind of like an uber shader implementation with every permutation of all vertex attributes?

On modern hardware, yes.
When you change the vertex format, the driver needs to patch the shader so that it can convert the data from the specified vertex format to a float (or whatever the shader variable was declared as).
Similarly, the pixel shader normally outputs 4 floating point values; but the driver needs to patch the shader to convert it to the appropriate format based on the currently bound FBO's format.
Older hardware had fixed function hardware to perform this task and thus no shader patching was needed; it was a simple state machine. But modern HW performs the conversion via regular shader instructions. So the driver keeps an internal cache of bytecode copies, one per state combination, even though you think it's all the same GLSL.
Vulkan and D3D12 fix this though.


Sounds like there aren't any definitive, or "most-right", ways in this case. This seems to be the case for other aspects of OpenGL too, as there are many things left out of the spec for hardware and OS developers to choose how to implement.

While it is true that the GL spec leaves certain things up to the implementer, this is exactly the opposite. The spec here mandates certain behavior and the driver must follow it strictly, even if the GPU is not exactly suited to doing it the way the API wants, and thus has to do major background work or resort to workarounds that are completely invisible to you.
This is more a case of "this paragraph in the spec was written 20 years ago".


I haven't done too much with VAOs outside of creating and binding to one whenever I create a vertex data configuration. As I understand it, VAOs store which arrays you've disabled when it's bound, but I've always wondered if they're smart enough to disable unused attribute arrays.

Yes, you're correct: VAOs will also save the disabled states.
I personally prefer VAOs because they're clean. But if you google "multiple vao vs one global vao" you will find contradictory data.

I don't really care about the VAO war because my renderer follows AZDO practices, and thus I have very few VAOs (since I only have one huge global VBO, roughly one VAO per vertex format; a few more actually, because I may have a few extra VBOs for dynamic vertex data).

#5240580 Unnecessary Vertex Attributes

Posted by Matias Goldberg on 15 July 2015 - 02:15 PM

There's a lot of things going on condensed in just one post.
Let's go by parts:
1. VBO attribute interleaving vs planar layouts. A few people fanatically defend keeping positions in their own contiguous buffer and interleaving the other attributes in a second buffer, so that the position-only buffer can be used for depth-only rendering.
The general advice is that interleaving is much more cache friendly than contiguous streams. But this highly depends on the GPU architecture you're targeting.
2. Much more important than everything else is getting the per-vertex size down. QTangents can be used to compress the normal + tangent + bitangent reflection information needed for normal mapping from 28 bytes to a mere 8 bytes. Position can often be stored as 16-bit floats, halving its size. UVs can often also be stored as 16-bit floats or shorts (if kept in the [-1; 1] or [0; 1] range). You can use ushorts or even ubytes for storing skinning weights, instead of 32-bit floats.
3. If you're a performance fanatic, you can duplicate the VBO optimized just for shadow mapping. It's not just about having positions stored in a contiguous stream, but also about removing duplicates. Normals and UV seams can cause vertex duplication. With an extra VBO just for shadow mapping, you can remove those duplicates and reduce the per-vertex size.
Beware: in the best case scenarios (4x-5x reduction in vertex count; 5 shadow mapping passes) I only get around a 20% boost on my AMD Radeon HD 7770 in final rendering time; but I suspect older GPUs benefit the most from this approach.
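To put concrete numbers on point 2, here's a sketch of a fat vs. compressed vertex layout (plain uint16_t fields stand in for the GPU-side half/short formats; the exact field choices are illustrative, not prescriptive):

```cpp
#include <cstdint>

struct FatVertex               // Naive all-32-bit-float layout.
{
    float position[3];         // 12 bytes
    float normal[3];           // 12
    float tangent[3];          // 12
    float bitangentSign;       //  4
    float uv[2];               //  8  -> 48 bytes per vertex
};

struct CompressedVertex        // Compressed as suggested above.
{
    uint16_t position[4];      // 16-bit halves (4th component pads/aligns)
    uint16_t qtangent[4];      // QTangent: 4 halves replace the whole TBN
    uint16_t uv[2];            // 16-bit UVs, kept in [0; 1] or [-1; 1]
};                             // -> 20 bytes per vertex

static_assert( sizeof( FatVertex ) == 48, "unexpected padding" );
static_assert( sizeof( CompressedVertex ) == 20, "unexpected padding" );
```

That's less than half the bandwidth and cache footprint per vertex before even touching index or duplicate optimizations.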

This is probably also not a good practice, but what about leaving vertex attrib arrays active? If it's really common that I'd use the arrays 0-3, then why not leave them all enabled? I wouldn't have any pointers to set to them aside from whatever was set to them previously, but my shaders wouldn't reference them.

There is no good advice here. Any of the following can happen within a run of your program (not mutually exclusive):

1. The OpenGL driver will check what the shader uses, see the attribute is not referenced, and ignore it. Thus if you try to disable it, you wake up the GL API to needlessly tell it something it would find out anyway, plus the driver will now flag its internal state as dirty, causing unnecessary recalculations (see next point). The recommendation here is not to disable the attribute.


2. The driver internally keeps patched versions (bytecode) of your compiled shader for each set of vertex attributes. You see one GLSL shader, but the driver sees many bytecodes, one per vertex format. If you use the shader with combination A, and the driver then sees the shader used with combination B (the same as A, but with an unused attribute active), it needs to analyze the shader again to see whether the bytecode needs patching, only to conclude it doesn't. So you would do better by telling the GL API to disable that attribute; which is the opposite of the previous advice.


Neither alternative (disabling, leaving enabled) is good in terms of performance. That's why Vulkan and D3D12 use PSOs (PipelineStateObjects), which condense almost every possible setting (shaders, vertex attributes, render targets, etc.) into one giant blob so that all this information is available and evaluated only once (when creating the PSO).


Since neither alternative is better in terms of performance, I'd suggest disabling the ones you don't use for the sake of correctness & clarity (easier debugging; avoiding mistakes when writing new shaders where you think some attribute is not being used when it actually is, etc.)

#5240503 Help me to clear up my confusion about OpenGL

Posted by Matias Goldberg on 15 July 2015 - 09:33 AM

Christophe Riccio's samples are a good source of good, modern OpenGL practices (Riccio's samples are often used by OpenGL driver teams to check their implementations; if your code resembles his, there's a high chance you will be free from bugs. Plus, if there's a driver bug and it can be easily repro'd in his samples, report it and the driver teams will often fix it in no time).

The samples from OpenGL Superbible are also good.

apitest is an absolute must-see for bleeding edge, high performance OpenGL features & practices. The AZDO mindset of rendering with explicit synchronization will prepare the thinking process you will apply in Vulkan.

#5240496 Best engine for cinematics?

Posted by Matias Goldberg on 15 July 2015 - 08:48 AM

antialiasing - generally terrible, and I'm now rendering at enormous resolutions and downsampling.

Many of the features used by realtime engines aren't antialiasing friendly.
What you're doing right now is called "Supersampling Antialiasing" (SSAA) and is considered the ultimate form of antialiasing (it's pure brute force).
Offline renderers often use SSAA too, so you should probably continue doing it that way.

You only need AA in the final render, so you can disable SSAA during production.

If you hit hardware limits when doing SSAA (e.g. huge resolutions result in out-of-GPU-memory errors), render the movie 4 times, rendering 1/4 of the viewport on each pass, and then composite the 4 sections into one. (You'll have to do this manually or externally.)

transparent/translucent materials - having many problems with things like glass and water, due to the need for transparency and specular reflections.

I'm afraid this is a shortcoming of most deferred renderers. I do not know UE4; check whether it has a Forward+ path.
Also, hiring a dev experienced in UE4 to fix it / adapt it to your particular needs can help.
Translucent materials like glass are often rendered through hacks. Lots of hacks.

related to both of the above, rendering of hair/fur is terrible.

This is again something realtime engines suck at. A dev for hire may be able to integrate TressFX. But if that's not enough, there are a few brute force algorithms for hair and fur. They're not fast, though, and hence not part of the usual game engine toolset.

motion blur and depth of field pretty low quality, although I think these have recently improved.

I don't know what your bar for "high quality" is, but this is usually easy to fix. Very high quality DoF postprocess effects aren't fast though.

On the render output side (i.e. Matinee):
no 16-bit or higher outputs. 
no ability to render separate layers (e.g. foreground only, background only, effects only etc).

I'm not into UE4, but since it is sold with source code access, I want to believe it is reasonably well coded, and a dev-for-hire should be able to add these features for you.

My question is: is it worth moving to another engine? Or would it be easier to extend Unreal?

You'll likely hit these problems in any game engine, except perhaps Cinebox, which was specifically tailored for movie production.

#5240400 [D3D11-OGL4] Access to all 3 vertices in the vertex shader

Posted by Matias Goldberg on 14 July 2015 - 09:33 PM

Thanks everyone for the responses.

After carefully evaluating each alternative, the responses, colleagues recommendations, vendor recommendations; I've found the following:

  • Colleagues & vendors indicate / seem to agree the most stable solution in terms of performance across GPUs would be option 3
  • On my AMD Radeon HD 7770, a pass-through geometry shader isn't free; it has a cost. However it's not super expensive (probably not much more expensive than option 2, which I couldn't try). Its performance scaling is directly tied to the number of varyings I output, currently sitting at 3 float3s, which I can bring down via lower precision. Also, baking a vertex buffer that only has position, with duplicates removed, improves framerate a little, so that helps.
  • Given my raytracing algorithm, the bottleneck is not in the GS but rather in the PS with its ROVs, some atomic operations and long list of triangles in the deep maps.


Therefore, I will follow through option 1 because it's the easiest right now, its performance hit isn't as huge in my particular case as I thought it would be and the bottleneck is somewhere else.

Once I get rid of the elephant in the room, I will try to implement option 3.



This is what compute shaders are for.

Both D3D11 and GL4.3 have them.

I've thought this through a lot. But what I'm doing heavily benefits from having a rasterizer, which is Fixed Function hardware in current GPUs and cannot be accessed from Compute Shaders.

It can certainly be done with only CS, but I would have to emulate what the rasterizer already does for me.

None of the alternatives currently exposed by the APIs allow me to harness the full power of the GPU. This is irritating.



This probably doesn't help, but IIRC, GCN doesn't actually have interpolation hardware. The pixel shader always gets all 3 verts plus the barycentric coordinates, and calculates the interpolated attributes itself. Not sure how you'd access that power through D3D, but there's likely an AMD GLSL extension to make your task super simple on GCN/GL.

I've been begging the GL & D3D12 driver teams for this, but my voice doesn't carry in this matter. Too many differences with other GPU vendors, apparently. I guess I'll have to prove to them that this feature is important enough to enable faster next-gen rendering algorithms.

It would certainly be the best case scenario: get access to the information that the GPU already has, plus all the efficiency from regular rendering.