
Matias Goldberg

Member Since 02 Jul 2006

#5312753 Which hairstyle looks best?

Posted on 26 September 2016 - 07:02 PM

TBH each one of these expresses a very different and unique personality. Tell us more about the character: their background, story and setting; then I can better decide which one fits best.

#5312372 Is OpenCL slowly dying?

Posted on 24 September 2016 - 11:42 PM

The thing with OpenCL vs Vulkan is that the former prioritizes accuracy while the latter prioritizes performance. Although some Vulkan implementations could provide strong IEEE and double-support extensions, it doesn't change the fact that there it will be a fancy add-on, whereas in OpenCL it is a must-have and the core focus.



Take into account also WebGL. WebGL is quite popular now, and I suppose that due to the difficulties of implementing Vulkan in browsers it will remain so for a long time.

There won't be a WebVulkan, as Vulkan provides a degree of low level access that is a massive security nightmare that browsers cannot afford to allow.

#5311140 Is fread considered a seek memory on disk operation?

Posted on 16 September 2016 - 05:36 PM

Check out "Reading and writing are less symmetric than you (probably) think"

#5310682 Alpha Blend for monocolor (or min/max depth buffer)

Posted on 13 September 2016 - 11:30 PM

Thanks Matias. One more question: I feel like the request for knowing min/max depth for each pixel is very common (...)

Actually it's not common. It is common, though, to compute the min/max of a block of pixels (e.g. an 8x8 block), which can be done as a postprocess once you're done rendering opaque objects.
On consoles you could get that information from the internal workings of the Z buffer (either via Hi-Z or Z compression), but it's not exposed by standard desktop APIs. AFAIK not even in D3D12 & Vulkan.
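
For illustration, here's a minimal CPU-side sketch of that block min/max postprocess. On the GPU it would be a compute or pixel shader pass; the names, the depth-in-[0,1] assumption and the 8x8 block size here are purely illustrative:

#include <algorithm>
#include <vector>

struct DepthMinMax { float minZ, maxZ; };

// Reduce a width x height depth buffer (values in [0,1]) into
// per-8x8-block min/max pairs. Assumes width & height are multiples of 8.
std::vector<DepthMinMax> buildMinMaxBlocks( const float *depth, int width, int height )
{
    const int blockSize = 8;
    const int blocksX = width / blockSize;
    const int blocksY = height / blockSize;
    std::vector<DepthMinMax> out( blocksX * blocksY, DepthMinMax{ 1.0f, 0.0f } );

    for( int by = 0; by < blocksY; ++by )
        for( int bx = 0; bx < blocksX; ++bx )
        {
            DepthMinMax &block = out[by * blocksX + bx];
            for( int y = 0; y < blockSize; ++y )
                for( int x = 0; x < blockSize; ++x )
                {
                    const float z = depth[(by * blockSize + y) * width + bx * blockSize + x];
                    block.minZ = std::min( block.minZ, z );
                    block.maxZ = std::max( block.maxZ, z );
                }
        }

    return out;
}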

...especially when we need to do efficient raycasting for volumes (so we don't have to march all the way through the bounding box for sparse volumes). What's the standard way industry used for sparse volume rendering?

There isn't one. This is a heavily researched topic now that we have the horsepower to do it at acceptable framerates; so far the techniques vary depending on what you want to render and how the researcher approached it (e.g. clouds, volumetric particle FX, godrays, Global Illumination, AO).

#5310613 Data throughput

Posted on 13 September 2016 - 11:41 AM

You're not even saying your framerate.

At 60 fps you're pushing 286MB/s. At 120 fps you're pushing 572MB/s.

"and that seems to be the limit"... what is your limit? 60fps? 120fps? 240fps? 30fps?


There's a lot of things that could go wrong. First, you would have to compare the framerate against having the buffers stored in host memory, to rule out other GPU bottlenecks.

Second, if you happen to read from that buffer by accident (i.e. the generated assembly reads the data back even if your C code doesn't), you'll hit severe perf. penalties due to write-combining memory.
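
A hypothetical illustration of the difference (mapped points into a write-combined GPU-visible mapping; src is ordinary memory):

#include <cstddef>

void fillVertices( float *mapped, const float *src, size_t count )
{
    for( size_t i = 0; i < count; ++i )
    {
        //mapped[i] += src[i]; // BAD: '+=' reads back from write-combined memory
        mapped[i] = src[i];    // GOOD: pure write, no readback
    }
}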

Third, you could be hitting CPU limits (i.e. your CPU can't pull the data fast enough), and thus doing it in another thread could increase the framerate.

Fourth, you could be reaching the DISCARD limit (e.g. on AMD it's 4MB per frame), in which case you should use D3D11_MAP_WRITE_NO_OVERWRITE instead.

Fifth, you're not even describing your specs or how you're implementing the upload.

Sixth, the rest of your code also consumes RAM bandwidth. It's common to see 24-32GB/s RAM nowadays. If you're doing something else or reading/writing your data more than once, you could be hitting that limit.

Seventh... PCI-E 3.0 16x theoretical bandwidth is 15.75GB/s. It's safe to assume that in practice you should be able to reach 7GB/s if you do things right. That means pushing ~6 million vertices per frame at 60fps w/ 20 bytes per vertex (assuming your GPU & CPU can handle the rest; you may hit another bottleneck before). So: no, your numbers don't look right (assuming your "limit" was 60fps).
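
For reference, the arithmetic behind that vertex count, as a back-of-the-envelope sketch using the assumed 7GB/s practical figure:

// All inputs are the assumed estimates from above.
constexpr double practicalBandwidth = 7.0e9;  // bytes/s over PCI-E 3.0 16x in practice
constexpr double framesPerSecond    = 60.0;
constexpr double bytesPerVertex     = 20.0;

constexpr double bytesPerFrame    = practicalBandwidth / framesPerSecond; // ~116.7MB
constexpr double verticesPerFrame = bytesPerFrame / bytesPerVertex;       // ~5.83 million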

#5310612 Alpha Blend for monocolor (or min/max depth buffer)

Posted on 13 September 2016 - 11:29 AM

Thanks N.I.B, that's a good idea using MRTs, but I feel like the overhead of rendering to separate RTs (writing to one four-channel pixel may be faster than writing to two separate one-channel pixels? anybody) and alpha blending a separate framebuffer may run slower than my original method (though I have to benchmark it...)

Vastly depends on the HW.

On some HW the total cost is just the sum:
cost( MRT0 ) + cost( MRT1 ) + ... + cost( MRTN ) = total_cost

On other HW the total cost is the costliest target multiplied by the number of targets:
max_cost = max( cost(MRT0), cost(MRT1), ..., cost(MRTN) )
total_cost = max_cost x N

Source: Deferred Shading Optimizations 2011


On GCN export cost is a bit different. See GCN Perf Tweet #6

#5310348 How to use a big constant buffer in DirectX 11.1?

Posted on 11 September 2016 - 10:26 AM

The main difference is in the number of Map calls you have to do.


In D3D11, if you have to draw 100 objects each with individual settings, you have a few options (let's suppose you only want to send a float4):

  1. Create one Const Buffer of 16 bytes. Map it 100 times with MAP_DISCARD.
  2. Create 100 Const Buffers of 16 bytes each. Map each one once (still 100 maps in total), probably with MAP_DISCARD.
  3. Create one large const buffer (e.g. 64kb); map it once and use baseInstance to index into an array. In the shader you would do value[baseInstance]. This adds some GPU overhead (the shader now needs to perform an indirection, and baseInstance requires a vertex buffer dedicated to instancing because SV_InstanceID always starts from 0); but now you only map once.

Regarding options 1 vs 2: Fabian Giesen shared his experience where option 1 beats option 2.


D3D11.1 added a 4th option:

  • Create a very large buffer (virtually no upper limit); map it entirely once. When you're done with it, you can bind the particular region you've written to.

This combines the best of both worlds: no GPU overhead from option 3 and no CPU overhead from mapping 100 times. You only need to map once. Granted, you now need to bind the buffer 100 times; but in terms of driver overhead, mapping is very expensive while binding a sub-region is cheap.

(note that in some cases option 3 may still beat option 4; GPUs are complex parallel machines. However you can now perform option 3 without being restricted to the 64kb limit, and you can combine it with SV_InstanceID now that you can specify where the const buffer starts while binding)
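
A minimal sketch of option 4; drawHundredObjects, bigCb, cpuData and indexCount are hypothetical names, and error handling is omitted. Note that offsets are specified in 16-byte constants and must be multiples of 16 (i.e. 256-byte aligned), and each bound range may span at most 4096 constants:

#include <d3d11_1.h>
#include <cstring>

// bigCb: a large dynamic constant buffer; cpuData: 100 per-object 256-byte slots.
void drawHundredObjects( ID3D11DeviceContext1 *ctx1, ID3D11Buffer *bigCb,
                         const void *cpuData, UINT indexCount )
{
    D3D11_MAPPED_SUBRESOURCE mapped;
    ctx1->Map( bigCb, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped );
    memcpy( mapped.pData, cpuData, 100 * 256 ); // one 256-byte slot per object
    ctx1->Unmap( bigCb, 0 );

    for( UINT i = 0; i < 100u; ++i )
    {
        const UINT firstConstant = i * 16u; // in 16-byte constants, 256-byte aligned
        const UINT numConstants  = 16u;     // minimum bindable range
        ctx1->VSSetConstantBuffers1( 0, 1, &bigCb, &firstConstant, &numConstants );
        ctx1->DrawIndexed( indexCount, 0, 0 );
    }
}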

#5310050 Boss Dance Battle

Posted on 08 September 2016 - 06:08 PM

Please tell me this will be in the final game.

#5309267 Screenspace Normals - Creation, Normal Maps, and Unpacking

Posted on 02 September 2016 - 10:41 PM

You might think that this is fine because view-space normals will always point towards the camera, but *due to perspective*, this is false. It's very easy for a surface to be visible yet facing away from the camera (just look down a corridor where the floor slopes downwards)!

I believe the problem is not perspective itself, but rather that we cheat: "smooth" per-vertex normals instead of sharp per-face normals. Just look at this picture: it's an ortho projection, yet one of the normals is facing away from the camera. Were we to use "real" per-face normals this wouldn't happen. But then it would be extremely hard to make a low-tessellation sphere look round.


#5309088 What is the best PBR Real Time Fresnel function

Posted on 01 September 2016 - 07:21 PM

F0 = ( (1 - IOR) / (1 + IOR) )²


You can also convert from F0 back to IOR by rearranging the terms of the equation.
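
A minimal sketch of both conversions, assuming an air-to-material interface (IOR >= 1):

#include <cmath>

float f0FromIor( float ior )
{
    const float t = ( ior - 1.0f ) / ( ior + 1.0f ); // sign cancels when squared
    return t * t;
}

float iorFromF0( float f0 )
{
    // Inverse of the formula above, valid for f0 in [0, 1).
    const float s = std::sqrt( f0 );
    return ( 1.0f + s ) / ( 1.0f - s );
}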




I also noticed no one really uses IOR maps.

Because IORs have no upper bound (they can go up to infinity), which makes them really hard to represent reliably in just 8 bits. Also, converting from IOR to F0 in the pixel shader is unnecessarily expensive (it even has a division!). Note that IOR = 9999 maps to F0 ≈ 0.9996 while IOR = 99 maps to F0 = 0.9604.

F0 goes only from 0 to 1; which fits nicely with GPU textures.




And I know this is sort of unrelated, so I'm not going to officially ask for it, but are there any real reasons NOT to combine specular color and base color, as Unreal has done? I can't think of a single real reason, even for more stylized stuff.

Because it's not the same: it looks different, and the plethora of BRDFs you can represent diminishes. However, artists find it very intuitive and it makes them more productive.

#5308536 Why so many fence/sync types?

Posted on 29 August 2016 - 01:36 PM

OpenGL has core and extensions.


Extensions start by being proposed by a vendor. GL_NV_sync is one such example (it was proposed and implemented by NVIDIA, although other vendors can also implement it if they want).

When an extension becomes really useful/widespread but needs some tweaking (e.g. a different behaviour in edge cases, or a different interface to accommodate certain hardware) it may be promoted to ARB (IIRC ARB stands for Architecture Review Board, but don't quote me on that). Which is what happened when GL_NV_sync became GL_ARB_sync.

In some cases, an extension is vendor agnostic but it's not popular/stable enough to be ARB, so its name will say EXT.

Once it becomes really useful it can get into core. GL_ARB_sync made it into core in OpenGL 3.2, which means it's guaranteed/mandatory to be present starting from OpenGL 3.2.


GL_APPLE_sync (GL ES) & GL_APPLE_fence (GL) are Apple's way of doing it. However, since Apple already supports GL 3.2 on desktop, you can just use GL_ARB_sync there.


Long story short you only need to aim for three:

  1. GL_ARB_sync (Windows, Linux, OSX, all vendors; as long as you target GL core >= 3.2)
  2. GL_APPLE_sync (iOS)
  3. EGL_ANDROID_native_fence_sync (Android)

The rest are just historic relics that vendors must support so old games can still run.
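
A minimal GL_ARB_sync usage sketch (assumes a GL >= 3.2 context is current and function pointers are loaded):

// Insert a fence after submitting work.
GLsync fence = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );

// ...later, wait up to 1ms for the GPU to reach it.
const GLenum result = glClientWaitSync( fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                                        1000000 /*nanoseconds*/ );
if( result == GL_ALREADY_SIGNALED || result == GL_CONDITION_SATISFIED )
{
    // Safe to touch the resources guarded by the fence.
}
glDeleteSync( fence );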



#5308405 Why can't I print out my std::string vector?

Posted on 28 August 2016 - 07:37 PM

Are you by any chance using Xcode?


Btw, one more thing: it's a security risk to do printf( myCStringVariable ); do printf( "%s", myCStringVariable ); instead.



#5306604 Shader array limit?

Posted on 18 August 2016 - 01:39 PM

Using SSBO is overkill. The problem is you're requesting 90 UBOs instead of a single UBO with 90 elements in it.
Change your code to:

struct CSpotLight
{
    vec3 vColor;
    vec3 vPosition;
    vec3 vDirection;
    float fConeAngle;
    float fConeCosine;
    float fLinearAtt;
    bool Enabled;
};

uniform SpotLightBuffer
{
    CSpotLight SpotLights[90];
} spotLightBuffer;

//Then access it via e.g.:
//spotLightBuffer.SpotLights[i].vColor

#5305582 How Long Will Directx Last For?

Posted on 12 August 2016 - 10:32 PM

The same applies to DirectDraw. In the beginning they were separate APIs: one for 2D, the other for 3D.

Beginning with DirectX 8, DDraw went through a slow deprecation (moving towards doing everything in D3D) until it was completely phased out in DirectX 10.


DDraw applications could have bugs just like D3D ones, though they're likely to have fewer. A major advantage is that DDraw acceleration can be turned off: on Windows XP because CPUs should be fast enough for the kind of work we threw at DDraw in the 90's, and on Windows Vista+ via the DirectX Control Panel switch or registry keys (though that's not beginner friendly).

#5305224 Instancebasevertex Perf Hit

Posted on 10 August 2016 - 11:03 PM

  1. Avoid the coherent bit. When you modify a buffer, use glFlushMappedBufferRange to notify the driver which regions are dirty. Make sure you merge your flushes (i.e. don't call glFlushMappedBufferRange 7 times for 7 contiguous chunks; just call it once at the end, covering one huge chunk, before submitting your drawcalls).
  2. The persistent bit will cause the driver to keep the data in host-visible memory (either system RAM or slower VRAM). This is bad.
  3. Don't use the write bit either; it will prevent the driver from keeping the buffer in device-only memory.
  4. The correct way is to create two buffers: one in device-only memory, another with the persistent+write bits. You write to the latter from the CPU, then copy the data to the former using glCopyBufferSubData (it's like a GPU->GPU memcpy); see the sketch after this list. The second buffer is commonly referred to as "the staging buffer" because it acts like an intermediary stash between CPU and GPU. Once you're done you can destroy the staging buffer, or keep it around to reuse for another transfer.
  5. Ignore points 2, 3 & 4 for dynamic buffers (i.e. data that is re-generated every frame on the CPU and sent to the GPU). In that case just write to a persistently mapped buffer directly.
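
A rough sketch of point 4 using GL 4.4 buffer storage; bufferSize, cpuData and dirtySize are assumed to exist, and a real implementation would also fence before overwriting a staging region the GPU may still be reading from:

GLuint deviceBuffer, stagingBuffer;
glGenBuffers( 1, &deviceBuffer );
glGenBuffers( 1, &stagingBuffer );

// Device-only buffer: no CPU access flags at all.
glBindBuffer( GL_COPY_WRITE_BUFFER, deviceBuffer );
glBufferStorage( GL_COPY_WRITE_BUFFER, bufferSize, nullptr, 0 );

// Staging buffer: persistently mapped, write-only, explicit flushes.
glBindBuffer( GL_COPY_READ_BUFFER, stagingBuffer );
glBufferStorage( GL_COPY_READ_BUFFER, bufferSize, nullptr,
                 GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT );
void *mapped = glMapBufferRange( GL_COPY_READ_BUFFER, 0, bufferSize,
                                 GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                                 GL_MAP_FLUSH_EXPLICIT_BIT );

// Each frame: write, flush once, then do the GPU->GPU copy.
memcpy( mapped, cpuData, dirtySize );
glFlushMappedBufferRange( GL_COPY_READ_BUFFER, 0, dirtySize );
glCopyBufferSubData( GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, dirtySize );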