Jump to content

  • Log In with Google      Sign In   
  • Create Account

Matias Goldberg

Member Since 02 Jul 2006
Offline Last Active Today, 08:07 PM

#5302971 How Do You Deal With Errors On Gpus? Do You At All?

Posted by Matias Goldberg on Today, 12:53 PM

The code you posted starts with _texCount uninitialized, which won't work as intended. It doesn't start with 0s unless you fill it. If you do fill it in your actual code, you need to sync that as well.

#5302955 How Do You Deal With Errors On Gpus? Do You At All?

Posted by Matias Goldberg on Today, 11:01 AM

Later when processing for (uint i = array[i]; i<array[i+1]; i++), because of unsigned numbers the difference overflows and gives a huge number close to 0xFFFFFFFFu.
Now i don't know if long runtime or out of buffer writes cause the blue screen, but i know why it happens.

I'm 99% sure what's happening is that because you get a bad loop count, your GPU will take too long to respond and thus you run into TDR.

The hardware bug (if so) seems to ignore my barriers, so it can happen that array[i+1] is smaller than array[i] - usually array[i+1] MUST be >= array[i].
(This also happens with work group size of 64, but less often)

GPU threading is hard.

How are you issuing your barrier? Beware in GLSL memoryBarrier does only half of the job. You also need a barrier:

//memoryBarrierShared ensures our write is visible to everyone else (must be done BEFORE the barrier)
//barrier ensures every thread's execution reached here.

#5302931 Anyone Tried Uploading A 5-Byte File To Php Before?

Posted by Matias Goldberg on Today, 08:50 AM

If the host is a cheap hosting service not controlled by you, it's very possible the webhost has lame "anti-hacking" measures enabled via custom apatche/php hook and htaccess rules that reject certain input when given in specific order because it matched a malware that used similar input to compromise an unpatched server a long time ago. I've seen that kind of crap before.

#5301847 What Makes A Game Look Realistic?

Posted by Matias Goldberg on 21 July 2016 - 06:09 PM

Realistic is a very misleading word the way we know it. When people say I want a game to look "realistic", what they really want is to have it look like it was filmed by a Holywood studio with special lighting setups (which may vary for different shots), makeup, particular hair styles that always highlight nice places and cover the not so pretty places, special lenses, and particular camera angles with a particular camera movement.


You know the phrase "even the girl from the fashion magazine doesn't look like the fashion model in the magazine" ? Same applies to "realistic" graphics, because people expect game's graphics to look like people in a magazine, TV shows, and movies. Which isn't realistic at all.


Therefore to look realistic we have to mimic what they do: Once we figure out the math stuff (proper BRDF, use HDR, Depth of field, wind effects, noise for a shaky camera effect), we need to setup lighting as in movie production (i.e. 3-point lighting is very popular), have the characters perform fashion-model-like walks for females, movie-like poses (3 point landing, anyone?), account for the 12 principles of animation, place the camera in strategic places, have camera shots change at the right times, have the depth of field focus what's important and leave what's unimportant out of focus.


Of course, high resolution textures, high polycount, motion capture and global illumination helps a lot; but it will only get you so far.

And of course, all of these "rules" can be broken. If you know what you're doing and know when to break them, it still looks good. When you don't, it looks crap. Just like crappy movies or your grandma's pictures (no offense to your grandma!).

#5301048 Dx11 Renderstate

Posted by Matias Goldberg on 16 July 2016 - 11:50 PM

I learn dx11 recently, and find the dx11 render state  is strange, why the render state is designed to an object?

Because of performance. Setting one at a time takes a lot more cycles as opposed to setting all of them at once.
Furthermore many parameters are actually inter-linked with each other, even if they appear completely unrelated:

  • For example modern GPUs require shaders to know the vertex layout (Input Assembly in D3D11 terms).
  • Depth write settings interact with alpha testing (to know whether to disable early Z optimization).
  • Cull mode interacts with shaders if the shader uses SV_IsFrontFace.
  • On certain mobile GPUs, blending modes are patched into the shaders.
  • Cull mode interacts with stencil (i.e. two sided stencil)

Because of these dependencies at the implementation level with seemingly unrelated features, changing the setting one by one would have to trigger a lot of flushing, whereas changing everything in one go allows for resolving all the dependencies because all of the data is supplied together (and validated beforehand by making the object immutable!).

TL;DR: Performance.

    if(material.context.depthenable == true);   context->setRS(depthenable, true);
   else context->setRS(depthenable, false);
 I don't know how to easy to handle the dx11 render state object.  Who can show me how you use these ridiculous things?

You're approaching it the wrong way, the D3D9 way. In order to approach the D3D11 way, you need to create your object beforehand:

struct Material
    bool mDepthEnabled;
    bool mDepthWriteEnabled;
    // ... many other settings

    ID3D11RasterizerState *mRasterizerState;
    ID3D11BlendState *mBlendState;
    ID3D11DepthStencilState *mDepthStencilState;

    //Always call after you're done modifying the material.
    void flush()
        SAFE_RELEASE( mRasterizerState );
        SAFE_RELEASE( mBlendState );
        SAFE_RELEASE( mDepthStencilState );
        D3D11_RASTERIZER_DESC rasterizerDesc;
        D3D11_RENDER_TARGET_BLEND_DESC blendDesc;
        D3D11_DEPTH_STENCIL_DESC depthStencilDesc;
        depthStencilDesc.DepthEnable = mDepthEnabled;
        depthStencilDesc.DepthWriteMask = mDepthWriteEnabled ? D3D11_DEPTH_WRITE_MASK_ZERO : D3D11_DEPTH_WRITE_MASK_ALL;

        //...Fill ALL the other settings...

        device->CreateRasterizerState( rasterizerDesc, &mRasterizerState );
        device->CreateBlendState( blendDesc, &mBlendState );
        device->CreateDepthStencilState( depthStencilDesc, &mDepthStencilState );

void setMaterial( material )
     //ASSUME material already has been flushed.
     assert( material->mRasterizerState && "You forgot to call flush!" );
     device->RSSetState( material->mRasterizerState ); //You could check if this is redundant. If it is, avoid calling again
     device->OMSetBlendState( material->mBlendState );//You could check if this is redundant. If it is, avoid calling again
     device->OMSetDepthStencilState( material->mDepthStencilState );//You could check if this is redundant. If it is, avoid calling again

The idea is simple: Create the material. Once it's set, initialize the D3D11 structures. This is all done at loading time, not every frame. Once that's done, don't modify the material frequently (at least the parts that need flushing). If you need to change parameters very often, it's better to use multiple materials instead.

#5301046 Maximum Size Of Vertex Buffer

Posted by Matias Goldberg on 16 July 2016 - 11:22 PM

You're passing vertex_data which is only 120 bytes of size and telling D3D11 to read 900.000 bytes of it. Naturally, it will crash.

#5300046 Multiple animations in one .dae file(blender doesn't support this)

Posted by Matias Goldberg on 10 July 2016 - 04:56 PM

Have you tried the OpenGEX exporter? The exporter may or may not support what you need and OGEX is infinitely superior to Collada.

#5299608 Frame buffer speed, when does it matter?

Posted by Matias Goldberg on 07 July 2016 - 09:29 AM

I presumed that interface width of the memory enables us to transfer more data in a shorter time

It enables to transfer more data in the same time, not in shorter time. It's a very important distinction.
Think of the problem as a truck travelling 500km and it takes them 5 hours to complete. The truck can only hold 1tn of cargo. If you use two trucks, you can send twice the amount of cargo. But it still will take 5 hours to complete.

Why am I asking this is (higher level view) because I'm interested in why HBM is beneficial and when does it stop being such.

It depends on something we call "bottleneck". A game that performs a lot of reads and writes may be bandwidth limited, thus memory that has higher bandwidth will run faster.
But if another game executes a lot of math (which uses the ALU units Hodgman describes) and that's most of what it does, then higher bandwidth won't do jack squad because that's not the bottleneck.
Going back to the truck example:

You have to transfer 2tn of cargo. You have one truck. This is your bottleneck. You need 5hs to travel 500km and send 1tn, then another 5hs to get back and load the rest. Then 5hs more to travel 500km again. In total all the travelling took 15hs by using one truck.
If you use two trucks, you'll be done in 5hs. Memory bandwidth and bus bandwidth behave more or less the same. Because you can send more data in the same amount of time, but you needed a lot of data to send; doubling the amount of data you can transfer allows you to finish sooner only if it's the bottleneck. But you can never go less than 5hs in one trip. (Why? you ask? because GPUs can't send data faster than the speed of light)
Now let's add the "ALU" to the example: Let's suppose all you have to send in the truck a machine that weights only 70kg (that's 0.07tn). However disassembling the machine for transportation and load it into the truck takes you 8 hours. The truck then begins its journey and takes 5hs. Total time = 13hs.
You could use two trucks... but it will still take you 13hs because having an extra truck doesn't help you at all in disassembling the machine. What you need is an extra hand, not another truck. The bottleneck here is in disassembling the machine, not in transportation.
In this example people = ALU; trucks = bandwidth.
More people = you can disassemble and load the machine into the truck faster.
More trucks = you can send more cargo per trip.
More ALU = you can do more math operation in the same amount of time.
More bandwidth = you can do more loads and store from/to memory in the same amount of time.
So, to answer your question: does an increase of bandwidth make a game run faster? It depends.

#5298933 Porting OpenGL to Direct3D 11 : How to handle Input Layouts?

Posted by Matias Goldberg on 03 July 2016 - 04:52 PM

I suggest you follow a Vao + PSO (PipelineStateObject) approach (PSOs are a D3D12, Vulkan and Metal concept).


Eventually you find yourself across all APIs that you need input layout, shader bytecode data, rasterizer state, depth states, etc. A PSO is a single condensed block with all this information combined. The only catch is that PSOs don't normally require vertex & index buffers, whereas Vaos do.


Therefore Vao + PSO approach: In both your GL and D3D11 pipelines create an emulated PSO (should contains input layout, blend state, rasterizer state, msaa count, shaders to use, depth state, etc) and Vao together and assign them to your renderables.

For D3D11 assign a dummy Vao that only contains vertex & index buffers and a valid PSO, while for GL assign a valid Vao and a valid PSO. Then make your abstracted code set the PSO and then the Vao while iterating through them to render.

In GL, both setVao() and setPso() functions will perform relevant stuff, in D3D11 the setVao() will only set the vertex & index buffers, and setPso() will do all the work.


So in D3D11:

  • Vao: Contains Index & Vertex Buffers
  • PSO: Contains everything else

In GL:

  • Vao: Contains Index & Vertex Buffers + Vertex Layout definition
  • PSO: Contains everything else


This is very easy to write, simplifies everything (you have all the information you need!), and just works™. That's what we do in Ogre 2.1.

Plus, you make your engine friendly with D3D12, Vulkan & Metal.

#5298807 Tangent Space computation for dummies?...

Posted by Matias Goldberg on 02 July 2016 - 08:43 AM

If you're looking to find a working implementation, you can have a look at mine's. It's very basic, nothing fancy. It's based on Langyel's method.
Several more modern, superior one's have appeared since then.
Should be enough to get you started.



I do not get how a vertex can have a normal. Triangles have a normal. At least in Blender. Then Blender also assigns a mean normal to subdivision patches ..

See Polycount vs Vertex count

#5298649 Water and Fresnel

Posted by Matias Goldberg on 30 June 2016 - 11:04 AM

Everything has fresnel.



Given enough grazing angle, every surface will look like a mirror. Problem is some surfaces are really non-smooth or the grazing angle must be so steep we can barely notice a discernible reflection because it becomes very thin.

#5298418 [MSVC] Why does SDL initialize member variables?

Posted by Matias Goldberg on 28 June 2016 - 11:39 AM

For hunting bugs in production, yeah that sucks. You often want a non-null invalid address.

But for deployment you want to avoid crashes and potential memory corruption if the random address happens to be valid.

Note that while 0x00000000 is always considered a bad address in the x86 ABIs, 0xcdcdcdcd could be a valid address if e.g. running with Large Address Aware.

#5297994 [D3D12] Multiple command queues

Posted by Matias Goldberg on 25 June 2016 - 09:12 AM


This, but remember that copy-queues should have lower bandwidth compared to graphics queue (at least on actual hardware). They are great for concurrency and background works, but for the shortest job to be down it is better to use the graphics queue. I am not sure how they compare against compute queues, but I cannot imagine a scenario where is better to use compute queues instead of graphics queues for immediate copy operations only.

Do you have a reference for that? Maybe for CPU-side to CPU-side, or GPU-side to GPU-side transfers that's true... but I wouldn't think so for transfers between CPU-side and a dedicated GPU (across PCI-e) it would be.
The whole point of the copy queue is that it's designed to fully saturate the PCI-e bus while consuming zero shading/grahpics/compute resources (it's just a DMA controller being fed an "async memcpy" job). Intel say that their DMA controller has fairly low throughput, but, their "GPU-side RAM" is actually also "CPU-side RAM" so in some cases you'd just be able to use a regular background thread and have it perform the memcpy :lol:


For references:

  • DX12PerfTweet 25: Copy queue consumes no shader resources but has less bandwidth than graphics and compute queues.
  • DX12PerfTweet 34: Use the copy queue for background tasks. Spinning for copy to finish is likely inefficient.
  • DX12PerfTweet 56: Use the COPY queue to move memory over PCI-Express: this is more efficient than using COMPUTE or DIRECT queue.
  • GPUOpen blog - Performance Tweets Series: Streaming & Memory Management: (...) The copy queue exposes the copy engine, which is a dedicated DMA engine designed around efficient transfers across the PCIe bus. (...) Before you run off and move all copies to the copy queue, keep in mind the copy queue is not designed for all copies. In fact, the copy engine is only optimized for transferring data over PCIe. It’s the only way to saturate PCIe bandwidth (...).

- if you're copying CPU->CPU, don't use the GPU, call memcpy :lol:
- if you're copying CPU->GPU or GPU->CPU, use the copy queue, except maybe if you're optimizing for Intel or a mobile platform.
- If you're copying GPU->GPU, probably use a compute queue, except maybe for SLI/crossfire (multi-adaptor) cases.

That is pretty much it. Integrated GPUs will perform better if you write directly to the GPU memory from the CPU. It's a mystery to me whether this applies to AMD APUs as well.

#5297646 glsl represent 1 big texture as 4 smaller ones (tearing)

Posted by Matias Goldberg on 22 June 2016 - 05:14 PM

You're gonna have trouble with bilinear (gets worse with trilinear) filtering at the edges because the GPU should be interpolating between the two textures, but obviously this won't happen, so you need to do it yourself.


Potentially you may have to sample all four textures and interpolate it yourself:

// Assuming layout of textures:
// |0|1|
// |2|3|
result = mix(
mix( c0, c1, fract( uv.x * 1024.0 - 0.5/1024.0 ),
mix( c2, c3, fract( uv.x * 1024.0 - 0.5/1024.0 ),
fract( uv.y * 1024.0 - 0.5/1024.0 ) );

If you're at the left/right edge, you only need c0 & c1 or c2 & c3; if you're at the top/bottom edge you only need c0 & c2 or c1 & c3. But if you're close to the cross intersection, you're going to need to sample and mix all 4 textures.


Also the mipmaps need to be generated offline based on the original 1024x1024 rather than generating them on the GPU since it will generate them based on the 512x512 blocks individually.


I can't think quickly of a way to fix the trilinear filtering problem though.

#5297226 How to get patch id in domain shader.

Posted by Matias Goldberg on 19 June 2016 - 11:43 AM


Also, drawing each path in its own DrawCall sounds incredibly inefficient. You need to provide at least 256 vertices per draw call to fully utilize the vertex shader.

I thought it was 64 vertices to fully utilize the vertex shader and 256 to not become command processor limited.
edit - for amd.


AMD's wavefront size is of 64, that's true, but there are some inefficiencies and overhead details, such as needing 3 vertices to make a triangle (e.g. 64 triangles x 3 = 192 vertices assuming no tri shares any vertex). Real world testing shows on average you get near optimum throughput at >= 256 vertices per draw.
Edit. See http://www.g-truc.net/post-0666.html

@Matias is it still true if I have a pass-through vertex shader?