pcmaster

Member Since 13 Nov 2007
Offline Last Active Sep 19 2012 06:07 AM
-----

#4981650 Why are most games not using hardware tessellation?

Posted by pcmaster on 19 September 2012 - 06:05 AM

It isn't completely straightforward to implement if you have to account for "non-standard" meshes, e.g. non-quads, too many adjacent faces/edges, etc. (then you need a lot of pre-computation). Other than that, I don't understand its absence either, but Ashaman has a point. Unfortunately :-(


#4971379 f32tof16 confusion

Posted by pcmaster on 20 August 2012 - 02:21 AM

Or you can use f32tof16 to pack two halfs into a uint. Like this:
float2 toBeQuantised = float2(333.333, 666.666);
uint half1 = f32tof16(toBeQuantised.x); // half bits end up in the low 16 bits
uint half2 = f32tof16(toBeQuantised.y);
uint twoHalfs = half1 | (half2 << 16);  // x in bits 0..15, y in bits 16..31

But this doesn't make that much sense and isn't of much use, in addition to what Kauna said :-)
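For reference, the same bit layout can be mirrored on the CPU side. This is only a minimal sketch: the helper names (pack_halves, unpack_lo, unpack_hi) are made up for illustration, and it assumes you already have the two 16-bit half patterns (it does not do the float-to-half conversion itself).

```cpp
#include <cstdint>

// Pack two 16-bit half-float bit patterns into one 32-bit word:
// `lo` lands in bits 0..15, `hi` in bits 16..31 (same layout as the shader snippet).
inline std::uint32_t pack_halves(std::uint16_t lo, std::uint16_t hi) {
    return static_cast<std::uint32_t>(lo) | (static_cast<std::uint32_t>(hi) << 16);
}

// Extract the half stored in the low 16 bits.
inline std::uint16_t unpack_lo(std::uint32_t packed) {
    return static_cast<std::uint16_t>(packed & 0xFFFFu);
}

// Extract the half stored in the high 16 bits.
inline std::uint16_t unpack_hi(std::uint32_t packed) {
    return static_cast<std::uint16_t>(packed >> 16);
}
```

In a shader, the matching unpack would presumably go through f16tof32 on the low word and on the word shifted right by 16.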


#4946972 Structured buffer float compression

Posted by pcmaster on 07 June 2012 - 01:13 AM

Hyunkel, yes, I'm familiar with DX11 compute shaders. You don't necessarily need to use a StructuredBuffer UAV. You can use several UAVs as outputs of your compute shader. So instead of one stream (array) of packed interleaved struct data, you might have several streams (arrays) of individual struct members. Instead of 1 RWStructuredBuffer, you'd have 4 RWBuffers as targets of your compute shader. The main disadvantage I see is that you use 4 target slots instead of 1 (there should always be at least 8 supported, if I recall correctly). I believe you can have texture/buffer UAVs as well in cs_5_0 (unlike cs_4_1), but I've actually used RWStructuredBuffer just like you.


#4946733 Structured buffer float compression

Posted by pcmaster on 06 June 2012 - 06:16 AM

Why not use DXGI_FORMAT_R11G11B10_FLOAT? (It has no sign bits, so remap the normal from [-1,1] to [0,1] before writing it.) There are plenty of nice formats. Or use a single R32_UINT and pack your normal manually, no big deal. No need to ever use the R32G32B32_FLOAT format for transferring normals! If you can have normals in screen space, definitely go pack them according to one of the methods linked. I use #4 to my greatest pleasure in screen space (1 float3 to float2, stored as R16G16_FLOAT).
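To give an idea of what "pack your normal manually" into a single R32_UINT could look like, here is a CPU-side sketch. The 10:10:10 bit layout and all names (Normal, quantise10, pack_normal, ...) are my own assumptions for illustration, not one of the linked methods; the same arithmetic should translate to a shader more or less directly.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Normal { float x, y, z; };  // hypothetical helper type

// Map one component from [-1, 1] to a 10-bit unsigned integer (0..1023).
inline std::uint32_t quantise10(float c) {
    assert(c >= -1.0f && c <= 1.0f);
    return static_cast<std::uint32_t>(std::lround((c * 0.5f + 0.5f) * 1023.0f));
}

// Inverse of quantise10: take the low 10 bits and map them back to [-1, 1].
inline float dequantise10(std::uint32_t q) {
    return (static_cast<float>(q & 0x3FFu) / 1023.0f) * 2.0f - 1.0f;
}

// x in bits 0..9, y in bits 10..19, z in bits 20..29; the top 2 bits stay free.
inline std::uint32_t pack_normal(Normal n) {
    return quantise10(n.x) | (quantise10(n.y) << 10) | (quantise10(n.z) << 20);
}

inline Normal unpack_normal(std::uint32_t p) {
    return { dequantise10(p), dequantise10(p >> 10), dequantise10(p >> 20) };
}
```

With 10 bits per component the round-trip error stays around 1/1023 per component, so you may want to renormalise after unpacking.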

Regarding HLSL, you just write your float, float2, float3, whatever happily and depending on the bound target view (RTV, DSV, UAV?) format, the conversion happens automatically. There is no HLSL construct for "half", there is no need.

There is no point in packing data in between the shader stages (such as from vertex shader to hull shader or such). You can happily send e.g. R8_UNORM buffers to the input assembler (or bind them as SRVs) and your shaders see whatever type (such as float) automatically. The same goes for output.

I'd split your struct into separate streams of positions, normals, temperatures etc, with formats R32G32B32_FLOAT, R11G11B10_FLOAT, R8_UNORM, etc., for example.
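On the CPU side, the split into separate streams might look roughly like this. It is just a sketch under assumptions: the Element struct, its members and split_streams are hypothetical, and the normal is assumed to be already packed into a 32-bit word.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical interleaved element, as it would sit in one structured buffer.
struct Element {
    float         position[3];   // would go out as R32G32B32_FLOAT
    std::uint32_t packedNormal;  // e.g. an R11G11B10_FLOAT or R32_UINT word
    std::uint8_t  temperature;   // would go out as R8_UNORM
};

// One tightly packed array per member, each meant for its own buffer.
struct Streams {
    std::vector<float>         positions;     // 3 floats per element
    std::vector<std::uint32_t> normals;       // 1 word per element
    std::vector<std::uint8_t>  temperatures;  // 1 byte per element
};

Streams split_streams(const std::vector<Element>& elements) {
    Streams s;
    s.positions.reserve(elements.size() * 3);
    s.normals.reserve(elements.size());
    s.temperatures.reserve(elements.size());
    for (const Element& e : elements) {
        s.positions.insert(s.positions.end(), e.position, e.position + 3);
        s.normals.push_back(e.packedNormal);
        s.temperatures.push_back(e.temperature);
    }
    return s;
}
```

Each vector can then be uploaded to its own buffer with the matching format, so every member only costs as many bytes as its format needs.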


#4940427 [SOLVED] Disabling interpolation of vertex attributes causes error

Posted by pcmaster on 15 May 2012 - 09:06 AM

You posted the reason yourself, the routine tells you :-)

Do
"float  lightRange  : RANGE"
and
"nointerpolation float  lightRange  : RANGE"
look the same to you? They don't. For the linker they don't either :-)

I recommend using the very same struct on both GS output and PS input, you'll save yourself trouble. Otherwise you'll have to add the "nointerpolation" keyword to GS_OUT matching members, too.


#4936754 Speed up shader compilation (HLSL)

Posted by pcmaster on 02 May 2012 - 07:53 AM

You won't speed anything up by removing comments, dead code, useless vars and such, since the lexical/syntactic analysis isn't slow. What's slow is register allocation and I'm afraid there's usually not much you can do. We're getting into 15-30 minute compile times with our complex DX11 shaders (and we have hundreds) and what sometimes helps (with compile time):
- manually unroll loops (works better (in terms of compilation time) than using [unroll], [fastopt] or whatever compiler hints)
- especially true for nested loops!
- the deeper the called function, the worse
- look for redundant texture sampling which could be hoisted out of loops or functions - at run time the repeated samples would hit the cache anyway, but they make the shader compile longer

What doesn't help (neither compilation speed nor performance):
- trying to manually optimise ALU operations

I guess most of this will be true for DX9, too.


#4928755 FBO and RBO and how they are used

Posted by pcmaster on 06 April 2012 - 06:25 AM

A Frame Buffer Object really represents just a view of a texture (or a part of a texture). A texture is an array of texels. It is called a "Render Target", and not only in DirectX; it really is the same concept. All rendering always DOES go into "textures". Forget about the concept of a "screen", think the way the card thinks: "Give me some rectangular target array and I'll happily apply a pixel shader to each of its fragments". Your application can then present this texture to the user ("send it to screen", which usually DOES involve copying it via "CPU" into a widget or window canvas or whatever, automatically (swap-chain) or manually), or do some more processing, or store it to disk, or whatever.

The OpenGL terminology actually is way more complicated than what I've just presented, study it thoroughly here:

http://www.opengl.or...mebuffer_Object
http://www.songho.ca...ngl/gl_fbo.html

A short answer to the difference between GL FBO and RBO:

There is one active FBO that is the target of all rendering output and it might "contain" several target textures - a colour, another colour texture, maybe yet another texture to store anything auxiliary, a depth (all these are called FB attachments)... You can attach basically "any" number of any textures or RBOs to a FBO at once.

A RBO is a single texture and is one of attachments to a FBO. A RBO content can be modified exclusively by rendering to it while attached to a FBO (possibly with other RBOs or textures or not). RBO content can then be copied to another texture (so called "unpacking"). RBO doesn't have mip-maps. RBO cannot be pre-initialised with any pixel data. I'd use a RBO as a depth buffer (Z-buffer).

An ordinary OpenGL texture can have mip-maps and any of its mip-slices can indeed serve the very same purpose as a RBO, that is serve as a render target.

Also, ordinary textures can serve as "sources" of data in your shaders (actual surface-modifying colour data, normals or anything at all). RBOs are "destination-only". And FBOs, again, encapsulate various textures and/or RBOs and as such don't possess any data of their own.

Complicated, huh?


#4925315 Path Tracing BSDF

Posted by pcmaster on 26 March 2012 - 05:59 AM

The renders look very nice. What you wrote makes sense to me. You don't have any acceleration structure in place, yet? How long do you trace such scenes then?
Regarding SSS you'll have to read some papers on that, I'm afraid. I could help just with realtime rasterised SSS (mostly for skin, which is quite fake but nice and fast). However, it does sound reasonable that the reflected (scattered) ray should exit randomly somewhere near the entrance with a changed direction and radiance (colour), depending on surface properties (where to get them from?). That is how it obviously works.


#4882151 GLSL get vec4 component

Posted by pcmaster on 09 November 2011 - 09:48 AM

It's worth mentioning that this (syntactic and hardware) feature of shading languages is called SWIZZLING - http://en.wikipedia.org/wiki/Swizzling.
Just in case, since it isn't obvious what you're asking: besides swizzles, GLSL also lets you index the components of the built-in vector types with array syntax (such as vec3 v; float x = v[2];), and indexing a matrix (m[1]) gives you one of its columns as a vector.


#4877209 [Dx11] InterlockedAdd on floats in Pixel Shader - Workaround?

Posted by pcmaster on 26 October 2011 - 08:57 AM

Maybe you could do with some kind of manual locking and a kind of busy waiting (boo boo boo :D). Kinda manual mutex. So, you will have a texture representing mutex, one for each fragment, initialised to 0. Now a thread (fragment)  wants to operate on some memory location [x,y].

[loop] do // critical section enter (alias mutex::lock())
{
  uint orig;
  // assuming mutex is a RWTexture2D<uint>; a 2D UAV is indexed with a uint2
  InterlockedCompareExchange(mutex[uint2(x, y)], 0, 1, orig);
  if (orig == 0) // this means the exchange succeeded! you own the "mutex"
    break; // mutex[x,y] now equals 1
} while (1);
Then tamper with the float4 texture at [x,y]. Read it. Modify the value. Write it back. Nobody else will touch it in the meantime. After you're done, call
uint dummy;
InterlockedCompareExchange(mutex[uint2(x, y)], 1, 0, dummy); // critical section leave (alias mutex::unlock())
Since we made sure that mutex[x,y]==1, this will exchange its value to 0. This is a signal for the other threads waiting in the loop for this location, that the mutex is "free" and one of them can enter the critical section. I claim this is actually the same serialisation that the GPU thread scheduler or whatever name would do anyway -- if many want to access the same critical location, they have to queue up.

I have not done this before, I mean not with DX11 (I did something similar with OpenCL). I have mixed experience with such "complex" shaders and DX11 (fxc.exe), so I have no idea whether this will actually work but to me it now seems legit :-) I'm NOOOOOOT sure whether this will work with Pixel Shader but in a Compute Shader (or OpenCL or CUDA), this really should work. The main problem might be in the eternal loop, which is something the optimiser doesn't seem to like at all :D
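For comparison, here is the same compare-exchange locking idea as a CPU-side sketch with std::atomic and std::thread. It only illustrates the protocol (0 = free, 1 = taken); it says nothing about how a GPU would schedule the spinning threads, and the names lock, unlock and locked_float_add are made up for this example.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Spin until we manage to swap 0 -> 1, i.e. until we own the "mutex".
inline void lock(std::atomic<std::uint32_t>& m) {
    std::uint32_t expected = 0;
    // On failure compare_exchange_weak overwrites `expected` with the current
    // value, so it has to be reset to 0 before the next attempt.
    while (!m.compare_exchange_weak(expected, 1,
                                    std::memory_order_acquire,
                                    std::memory_order_relaxed)) {
        expected = 0;
    }
}

// We know we hold the lock, so a plain release store of 0 frees it.
inline void unlock(std::atomic<std::uint32_t>& m) {
    m.store(0, std::memory_order_release);
}

// Several threads add 1.0f to one plain float, serialised by the spin lock.
float locked_float_add(unsigned threadCount, unsigned addsPerThread) {
    std::atomic<std::uint32_t> mutex{0};
    float value = 0.0f;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&] {
            for (unsigned i = 0; i < addsPerThread; ++i) {
                lock(mutex);
                value += 1.0f;  // critical section: read, modify, write back
                unlock(mutex);
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return value;
}
```

On the CPU every waiting thread eventually gets its turn. On a GPU, threads of one wavefront run in lockstep, so whether the holder of the lock makes progress while its neighbours spin is exactly the part I can't promise.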


#4874988 Gamma correction in OpenGL

Posted by pcmaster on 21 October 2011 - 03:27 AM

Do it in your last screen-space shader or add one for that purpose. Just before presenting the image to the user.
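What that last shader has to compute per colour channel can be sketched on the CPU like this. I'm assuming the target is a standard sRGB display and using the piecewise sRGB transfer function; the function name linear_to_srgb is just for illustration, and a plain pow(c, 1.0 / 2.2) is a common approximation.

```cpp
#include <algorithm>
#include <cmath>

// Encode one linear colour channel in [0, 1] with the sRGB transfer function.
inline float linear_to_srgb(float c) {
    c = std::min(std::max(c, 0.0f), 1.0f);  // clamp to the displayable range
    return c <= 0.0031308f
        ? 12.92f * c                                   // linear toe near black
        : 1.055f * std::pow(c, 1.0f / 2.4f) - 0.055f;  // power-curve segment
}
```

In OpenGL you can also let the hardware do this conversion by rendering into an sRGB framebuffer with GL_FRAMEBUFFER_SRGB enabled, in which case the shader should output linear values.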


#4861520 geometry shader discard count stream out

Posted by pcmaster on 14 September 2011 - 06:33 AM

With DX10/11 it's definitely possible to find out how much geometry was spawned by a GS. However, you still need to allocate a buffer big enough to store the highest possible number of vertices to be generated, beforehand.

One way is to create a query with ID3D11Device::CreateQuery() of type D3D11_QUERY_SO_STATISTICS_STREAM0, wrap the stream-out draw call in ID3D11DeviceContext::Begin()/End() on that query, and finally call ID3D11DeviceContext::GetData() and look into D3D11_QUERY_DATA_SO_STATISTICS::NumPrimitivesWritten. The other way is to use ID3D11DeviceContext::DrawAuto(), which will automatically determine the amount of data in a buffer that was previously used for stream-out (you'll connect this buffer to the input assembler stage).

A query might inflict a performance penalty, as the driver will have to finish some things and might stall the CPU. On the other hand, DrawAuto will not tell the CPU how much was rendered/generated. They are two completely different things but I thought they might have something to do with your question.


#4856198 OpenGL 1,2,3,4 general question

Posted by pcmaster on 01 September 2011 - 02:47 AM

You're really risking a huge flame-war here by calling either "cleaner", "better" or anything similar :-)

Truth is that new hardware features get into OpenGL quicker, through extensions (with DirectX you have to wait until Microsoft make up their minds to expose anything the new cards support!!!).

Khronos don't change the whole API as much as Microsoft every time, fortunately. New features (functions) become available, some are deprecated, some finally removed. The whole concept persists. Same goes for OpenCL.

Start learning OpenGL 4 directly. No matter what, do NOT look at OpenGL 1.x, ever :-) That, unfortunately, disqualifies most of the famous NeHe tutorials, for example, hehe. Start with desktops, learn the basics and don't touch OpenGL ES (mobile) much before that, if you plan to use it at all.


#4823592 Rectangle spreading blur (not only for DoF)

Posted by pcmaster on 15 June 2011 - 07:30 AM

Hi community,

I wonder if any of you have read the 2009-2010 papers from Kosloff and Barsky on rectangle spreading. I'm having problems with some small details in the "Depth of Field Postprocessing For Layered Scenes Using Constant-Time Rectangle Spreading" paper (http://www.cs.berkel...lur/kosloff.pdf). Concretely, with Fig 3 bottom, which represents the normalisation table, and then (therefore) with variable per-pixel blur radii (e.g. coming from CoC), and then in general with arbitrary PSFs (but that's another story).

I need to understand why the normalisation image is a pixel wider (in each direction) than the original input image, how this will change if a smaller or larger kernel is used, and ultimately what will happen with these extra pixels, which are in fact outside the input image, when variable blur is used (Fig 3 has a constant PSF 3x3 "kernel").

I'm unable to find any implementation of any spreading (scattering) blur algorithm, including their DX10 implementation, which they mention (DX, GL, C++, Matlab, ... anything would be helpful).

Anyone feeling like reading the paper and helping me out by discussing it here?

