Member Since 30 Aug 2006
Online Last Active Today, 02:41 PM

Posts I've Made

In Topic: Questions about Physical based shading(PBS)?

Today, 01:16 PM

Direct and indirect light are substantially different in the real world. The difference is in the colors it contains and the intensity. When light bounces off something, some of the light is absorbed and does not bounce off leaving the rest of the light to bounce. It can lose intensity, or luminance, across all colors - making it darker,


No. That's like sitting on top of a stack of boxes, following the forces your weight causes and concluding the 'direct' force you apply to the top box is somehow different than the 'indirect' force on the bottom box. This makes no sense - thinking like this you would never be able to develop a working rigid body simulator.


With graphics we get close to a similar situation. Our box of tricks becomes empty, "Reflections don't need to be correct - they just should look good" does not impress anymore, and hardware is powerful enough to do better.


By drawing a difference between direct and indirect light you get stuck in something that works now but may prevent you from thinking forward.

Probably, like most graphics devs, you got so used to this separation that you forgot (or are no longer willing) to question it.


Light is just photons, no matter where it comes from, what it has gone through or what it has been bent by.

And no matter if it has been emitted, reflected or refracted - because in practice it's always a combination of all 3 of them.



But there's a big difference between direct sunlight on an object and direct sunlight that has bounced off a red wall and is then bouncing off the object.

Notice that the mirrorball model explains all this without the need to separate. It also explains all the math, which is surprisingly simple (simpler, at least, than reading a PBR book):


The photograph of the mirrorball shows a small bright spot - the sun. It's far away, so just a small area but still very bright (distance falloff)

If the sun moves below the horizon, onto the back side of the normal plane, its area on the photo decreases smoothly to zero although the distance is constant (cosine falloff)

The red wall has a large area on our photo, but its intensity is much smaller because it also 'sees' the bright sun only as a small spot and it absorbs any light that isn't red (simple wall BRDF).


Again - no matter if a photo pixel shows sun or wall, we sum up all pixel colors and take the average to get the correct amount of incoming light.

In Topic: Questions about Physical based shading(PBS)?

Yesterday, 03:54 PM

2. In real world, lighting contains direct and indirect lighting


I don't agree with this. Separating direct and indirect lighting mostly makes sense in computer graphics, but in the real world there is no difference because they are both the same thing - just light.


If I had to calculate the lighting for a Lambert material in the real world, I could do this:

Place a mirror ball at the sampling point and take an orthographic photograph of the ball, aligned with the surface normal.

Then calculate the average color of the ball, multiply it by the surface color, and that's it.


It does not matter if the reflection on the ball contains light bulbs, sun or just a bright wall - the method works without any knowledge of what 'direct' or 'indirect' could mean.
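
A minimal sketch of that averaging step, in Python. It assumes the photo maps each incoming direction (x, y, z) in the normal-aligned frame (z = surface normal) to the disk pixel (x, y), so a patch of sky covers a pixel area proportional to the cosine of its angle to the normal. A real chrome ball maps directions to pixels differently, so treat this projection as an assumption. The names `lambert_from_disk_photo` and `radiance` are made up for illustration.

```python
import math

def lambert_from_disk_photo(radiance, albedo, n=256):
    """Average a unit-disk 'photo' of the environment and tint it by the surface color.

    radiance: hypothetical callback, direction (x, y, z) -> (r, g, b) incoming light
    albedo:   (r, g, b) surface color in [0, 1]
    n:        photo resolution (n x n pixels over the square around the disk)
    """
    total = [0.0, 0.0, 0.0]
    count = 0
    for j in range(n):
        y = (j + 0.5) / n * 2.0 - 1.0
        for i in range(n):
            x = (i + 0.5) / n * 2.0 - 1.0
            r2 = x * x + y * y
            if r2 >= 1.0:
                continue  # pixel lies outside the ball's silhouette
            z = math.sqrt(1.0 - r2)  # direction on the hemisphere above the surface
            rgb = radiance((x, y, z))
            for c in range(3):
                total[c] += rgb[c]
            count += 1
    # Sun, bulb or wall: every pixel is treated the same way.
    return tuple(albedo[c] * total[c] / count for c in range(3))

# Uniform white surroundings: the surface simply shows its own color.
print(lambert_from_disk_photo(lambda d: (1.0, 1.0, 1.0), (0.5, 0.25, 1.0)))
```

The callback never says whether a direction hits a light source or a lit wall, which is the point of the post: the sum does not need that distinction.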

In Topic: Nintendo NX/Switch

20 October 2016 - 03:17 PM

If I had 2 children, I'd prefer this over any other gaming platform.

I doubt the older audience they show in the video will want to carry it alongside their mobiles.

Maybe targeting 'family' instead of 'friends' would do better. It did well for the Wii.

In Topic: How does compute shader code size affect performance

19 October 2016 - 01:16 PM

The involved memory is LDS only, no cache / alignment issues.


The for loop executes 4 times at maximum; I use the (;;) to prevent unrolling here. There is no working #pragma yet for Vulkan.

(I keep such things configurable with #ifdefs - e.g. in my old shader unrolling was a win for 256 thread workgroups, but a loss for 128)


Try to separate the code and see what gives you the most slowdown.


I did a comparison with the older version of my shader where the code section was a win. Surprise: It takes the same time there.

Assumption: Old shader has less idle threads than new shader, so work distribution should be an even bigger win for the new shader. Yummy...

Reality: Because work processing has been optimized well, making idle threads busy is not worth the effort anymore? Really?


No - it can't be that simple, because this does not explain the slowdown that remains even if I put the code behind a condition that makes sure it is never executed.


Arrrgh - please AMD, give us a tool to inspect Vulkan register usage and occupancy... this guessing drives me crazy.

So i'll continue in OpenCL, reduce register usage there and hope Vulkan will benefit from those changes. Perfect workflow...  :|

In Topic: How does compute shader code size affect performance

19 October 2016 - 09:20 AM

OK, so this is the code I've added that causes my issue.

Similar code gave me a speedup of 20% in an older version of my shader.

This code is at a point where complex math is done and VGPR usage should be low (guess 10), and I don't think it increases register usage at all if the compiler is clever enough.

Simply removing it would be good and fast enough, but... I hate idle threads :)

I will try to replace a complex math code block with a lookup. I bet after that the work distribution increases performance...


EDIT: the entire shader has about 700 lines

// goal is to distribute work to idle threads, also split large workloads first to reduce work divergence
        // _counter is local ("shared") LDS uint initialized to zero
        // packed is a VGPR containing work description and other irrelevant bits
        // hasWork is boolean in VGPR
        // lID is current thread ID

        uint worklessThreadSlot = 0x10000; // large number to save a branch later
        if (!hasWork) worklessThreadSlot = atomic_add(ADRS _counter, 1); // thread with no work gets an index
        else packed |= lID<<8; // this link to the original work-giving thread will be copied along for later data transfer

        uint availableCount = _counter; // idle thread count

if (availableCount > (WG_WIDTH * 13/14)) // this is the condition i've added to test how it affects performance if it's executed only rarely
        _counter = 0;

        uint maxWork = large constant
        for (;;) // top down method - first split threads having large amounts of work, then shrink maxWork threshold to distribute smaller workloads
        {
            uint firstWorkReceiver = _counter;

            uint work = packed;

            bool split = (work & 0xFFFF0000) > maxWork;
            if (split)
            {
                uint newWorker = atomic_add(_counter, 1);
                if (newWorker < availableCount)
                {
                    packed = modify to do only half of the work
                    _exchangeLDS[newWorker] = move other half of work to LDS so another thread can grab it (all this is 10 lines of simple bit manipulation code)
                }
            }

            // update register of work receiving thread

            bool isNewWorker = (worklessThreadSlot >= firstWorkReceiver
                             && worklessThreadSlot < min(_counter, availableCount));
            if (isNewWorker)
            {
                packed = _exchangeLDS[worklessThreadSlot];
                hasWork = true; // now this thread knows about its received work and is ready to subdivide again
            }

            maxWork >>= 1; // shrink threshold

            if ((maxWork <= small constant) || // subdivision fine enough
                (_counter >= availableCount) || // out of idle threads
                (_counter == firstWorkReceiver)) // nothing found to subdivide
                break;
        }

#if 1 // distribution completed, need to copy some other registers data through LDS. cost: 0.05 ms    cost of entire code is 0.1 ms

        bool isWorkReceiver = (worklessThreadSlot < min(_counter, availableCount));
        uint srcIndex = (packed>>8) & 0xFF;

        // repeat copy operation like this for 3 VGPRs (2 x vec4 and 1 x uint)
            _exchangeLDS[lID] = original thread VGPR data (in total i copy 2 x float4 + 1 uint data this way...)
            if (isWorkReceiver) receiving VGPR data = _exchangeLDS[srcIndex];

        // continue doing the work...
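
For anyone who wants to follow the splitting logic without a GPU, here is a rough sequential sketch of the same top-down idea in Python. It is an assumption-laden model, not the shader: workloads are plain integers instead of packed bits, threads run one after another instead of in lockstep, and `distribute_work`, `max_work` and `min_work` are names made up for this example.

```python
def distribute_work(work, max_work, min_work=1):
    """Simulate handing halves of large workloads to idle threads.

    work:     per-thread work amounts, 0 means the thread is idle
    max_work: initial split threshold (the shader's 'large constant')
    min_work: stop once the threshold shrinks to this (the 'small constant')
    """
    work = list(work)
    idle_threads = [t for t, w in enumerate(work) if w == 0]  # slot -> thread id
    available = len(idle_threads)   # idle thread count
    exchange = [0] * available      # stands in for _exchangeLDS
    counter = 0                     # stands in for _counter after its reset

    while True:
        first_receiver = counter

        # Split phase: every thread above the threshold tries to give half away.
        for t in range(len(work)):
            if work[t] > max_work:
                new_worker = counter  # models atomic_add returning the old value
                counter += 1
                if new_worker < available:
                    half = work[t] // 2
                    work[t] -= half
                    exchange[new_worker] = half

        # Receive phase: idle threads whose slot was filled pick up their share.
        for slot in range(first_receiver, min(counter, available)):
            work[idle_threads[slot]] = exchange[slot]

        max_work >>= 1  # shrink threshold

        if (max_work <= min_work            # subdivision fine enough
                or counter >= available     # out of idle threads
                or counter == first_receiver):  # nothing found to subdivide
            break
    return work

print(distribute_work([16, 0, 0, 0], 8))  # [4, 4, 4, 4]
```

The total amount of work never changes, it only moves. When more threads want to split than there are idle slots, the late ones keep their full load, which matches the `newWorker < availableCount` check in the shader code.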

Independent of that code block, I've often had the feeling that adding code caused slowdowns.

With OpenCL and CodeXL I saw nothing bad like a register / LDS / bandwidth increase or an occupancy decrease - it's just like "Add one more line of code and you tier down".

But I'm just guessing and would like to know for sure.