
Hodgman

Member Since 14 Feb 2007

#5213663 D3D11 CreateDevice failed

Posted by Hodgman on 01 March 2015 - 09:04 AM

> and now it's not working

Can you elaborate?

#5213570 "Reverse" frustum culling?

Posted by Hodgman on 28 February 2015 - 07:15 PM

> Hmm... I don't really have any obstacles for "flood filling", so, if I understand flood filling right, that would just give me a giant sphere?

From a world-view-projection matrix, you can extract 6 planes that define the frustum's volume in world space - they're your obstacles.

> And that I don't understand at all. For frustum culling I first need a list of all objects, right? I can't make a list of infinite objects... I could initialize a big sphere of quadrants and then the objects in there, then I could do culling on that finite set. But that way I'd end up working with a lot of useless processing compared to only initializing the objects that are in the frustum in the first place...

You could first make a list of the quadrants within the camera's AABB, and then frustum-cull that list.
It doesn't have to literally be a list of quadrant AABBs, BTW. You can represent such a list simply with a start xyz and an end xyz (the min and max xyz of the camera's 8 frustum corners).
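As a minimal sketch of that min/max step (the names and the fixed-size axis-aligned cell assumption are mine, not from the thread):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Float3 { float x, y, z; };
struct Int3   { int x, y, z; };

// From the 8 world-space frustum corners, compute the inclusive range of
// quadrant (cell) indices covered by the camera's AABB, assuming
// axis-aligned cells of size cellSize.
void QuadrantRange(const Float3 corners[8], float cellSize, Int3& outMin, Int3& outMax)
{
    Float3 mn = corners[0], mx = corners[0];
    for (int i = 1; i < 8; ++i)
    {
        mn.x = std::min(mn.x, corners[i].x);  mx.x = std::max(mx.x, corners[i].x);
        mn.y = std::min(mn.y, corners[i].y);  mx.y = std::max(mx.y, corners[i].y);
        mn.z = std::min(mn.z, corners[i].z);  mx.z = std::max(mx.z, corners[i].z);
    }
    outMin = { (int)std::floor(mn.x / cellSize), (int)std::floor(mn.y / cellSize), (int)std::floor(mn.z / cellSize) };
    outMax = { (int)std::floor(mx.x / cellSize), (int)std::floor(mx.y / cellSize), (int)std::floor(mx.z / cellSize) };
}
```

Every quadrant with indices inside [outMin, outMax] is then a candidate for the per-quadrant frustum test.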

You don't often see this problem come up in 3d games, it's more common in 2d stuff...
One area where it pops up is in clustered shading - each cluster is a "quadrant", there's a huge number of clusters (more than you want to push through a brute-force algorithm), and you need to quickly perform frustum/sphere/cone-vs-quadrant tests to get a list of clusters that touch each shape.

#5213450 Best way of ensuring semantic type consistency?

Posted by Hodgman on 27 February 2015 - 09:34 PM

I use this strong typedef implementation: http://www.gamedev.net/topic/663806-declare-custom-types-of-ints/#entry5198653

^That thread is about integer types, but it should work for floating point types too.
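For illustration only, the gist of a strong typedef looks something like this (a minimal sketch of the idea, not the implementation from that link):

```cpp
#include <cassert>

// Wraps T in a distinct type per Tag, so semantically different values
// can't be mixed even though they share the same representation.
template<class T, class Tag>
struct StrongTypedef
{
    explicit StrongTypedef(T v) : value(v) {}
    T value;
};

using Meters  = StrongTypedef<float, struct MetersTag>;
using Seconds = StrongTypedef<float, struct SecondsTag>;
```

`Meters m(5.0f);` compiles fine, but `Seconds s = m;` is a compile error, which is the whole point.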

#5213347 DirectCompute. How many threads can run at the same time?

Posted by Hodgman on 27 February 2015 - 11:20 AM

No, 1 group of 1024 threads, or 16 groups of 64 threads.

If you dispatch more, then the GPU will get to them when it gets to them.

It's a good idea to give the GPU more than enough work to do, because it's "hyperthreaded". It doesn't run a workgroup from beginning to end without interruption before moving onto the next. It will continually do a bit of work on one group, then a little bit on another, constantly jumping around between half-finished groups.
It does this because it's a great way to hide memory latency.
e.g. with a pixel shader - say you have:
return tex2D(myTexture, uv * 2) * 4;
For each pixel, it has to do some math (uv*2), then fetch some memory (tex2D), then do some more math (*4).
A regular processor would reach the memory fetch and stall, waiting potentially hundreds of cycles for that value to arrive from memory before continuing... So it takes 2 cycles to do the math, plus 400 cycles wasted waiting for memory - resulting in 402 cycles per pixel!

To avoid this, when a GPU runs into that kind of waiting situation, it just switches to another thread. So it will do the "uv*2" math for 400 different pixels, by which time the memory fetches will start arriving, so it can do the final "result*4" math for 400 pixels, with the end result that it spends zero time waiting for memory! Resulting in 3 cycles per pixel (assuming the GPU can handle 400 simultaneous work groups....)
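As a toy model of that trade-off (the 2/400-cycle figures are the illustrative numbers from the example above, not measurements): per-pixel cost is the ALU work plus whatever share of the fetch latency the other in-flight groups can't cover.

```cpp
#include <cassert>

// Toy latency-hiding model: ALU cycles, plus the memory-fetch latency
// amortized over however many thread groups are in flight.
double CyclesPerPixel(double aluCycles, double fetchLatency, double groupsInFlight)
{
    return aluCycles + fetchLatency / groupsInFlight;
}
```

With one group in flight this reproduces the 402-cycle stall case; with 400 groups in flight it gives the ~3 cycles per pixel mentioned above.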

For the GPU to be able to hide memory latency like this, you want your thread-group size to be a multiple of 64, your dispatch size to be greater than 10 groups, and your shaders to use as few temporary variables as possible (these are the 'state' of a thread, which must be stored somewhere when switching threads).

#5213338 How do I know if I'm an intermediate programming level?

Posted by Hodgman on 27 February 2015 - 10:32 AM

It's a thread about C++ programming, so OP is referring to std::vector specifically.

#5213337 DirectCompute. How many threads can run at the same time?

Posted by Hodgman on 27 February 2015 - 10:31 AM

> But what is the maximum number of groups that I can run simultaneously?
The number actually in-flight at once depends highly on the GPU, but also on the complexity of your shader...

On a high-end GPU, using a very complex shader, probably around 1024... or when using a very simple shader, probably 10x more -- around 10240.

All that really matters is that the thread group size is a multiple of 64 (AMD's SIMD size), and then you dispatch the appropriate number of threadgroups, i.e. (AmountOfWork+63)/64 to cover all your work items (AmountOfWork).
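That round-up division can be sketched as a trivial helper (the name is mine; it's a common source of off-by-one bugs, so worth spelling out):

```cpp
#include <cassert>

// Number of thread groups needed to cover all work items:
// integer round-up of amountOfWork / groupSize.
unsigned GroupCount(unsigned amountOfWork, unsigned groupSize = 64)
{
    return (amountOfWork + groupSize - 1) / groupSize;
}
```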

#5213315 Open source rendering engine

Posted by Hodgman on 27 February 2015 - 08:22 AM

Horde3D is another one.

#5213227 Non-Member Functions Improve Encapsulation?

Posted by Hodgman on 26 February 2015 - 06:33 PM

An example:

```class Widget
{
public:
    Widget();

    void Size( Vec2 );
    Vec2 Size() const;

    float Area() const;
private:
    Vec2 size;
};

void  Widget::Size( Vec2 s ) { size = s; }
Vec2  Widget::Size() const   { return size; }
float Widget::Area() const   { return size.x * size.y; }```

In this case, the Area function does not have to be a member, as it does not need to know private details.
It could also be implemented as a free function:

`float Area( const Widget& w ) { Vec2 size = w.Size(); return size.x * size.y; }`

If you find a bug in Widget, you know it's caused somewhere inside the class - somewhere that has the power to violate the internal invariants enforced within the private implementation.

In the second version, there's less code in the code-base that is able to access the internal details of Widget - so the private implementation of widget is smaller / more private.

It's important to understand the logic and merit behind this argument, however whether or not you adopt it is frankly more of a stylistic choice.

Many people choose to ignore these guidelines, as they argue that the code is more readable and self-documenting when Area is a member of Widget, and that this readability is more valuable in practice than the theoretical increase in maintainability from the first approach.

#5213225 GLSL; return statement ...

Posted by Hodgman on 26 February 2015 - 06:06 PM

Just be aware that the cost of wasteful shaders is multiplied by the number of pixels drawn, which tends to be a very large number.

e.g. Let's say your skydome covers every pixel on a 1080p render target, that there's an unnecessary if statement in there, that this if statement adds a dozen clock cycles to the shader, that your GPU shades 1000 pixels simultaneously, and it runs at 800MHz.

12cycles * 1920*1080 pixels / 800MHz / 1000 cores = 0.03ms cost (0.2% of a 60Hz frametime budget).

But on an older GPU that shades 24 pixels at 700MHz -
12cycles * 1920*1080 pixels / 700MHz / 24 cores = 1.48ms cost (8.9% of a 60Hz frametime budget).
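The arithmetic above can be packaged as a helper (illustrative only - real GPUs aren't this simple, but it's a useful back-of-envelope model):

```cpp
#include <cassert>
#include <cmath>

// Cost in milliseconds of extra per-pixel cycles across a full-screen
// pass: total extra cycles, divided across the parallel pixel lanes,
// at the given clock rate.
double CostMs(double extraCyclesPerPixel, double pixels, double clockHz, double lanes)
{
    return extraCyclesPerPixel * pixels / lanes / clockHz * 1000.0;
}
```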

If you keep adding those kinds of tiny inefficiencies to your shaders, it can add up to a decent percentage of your total frame time. You can always optimize later though...

#5213074 Low-level platform-agnostic video subsystem design: Resource management

Posted by Hodgman on 26 February 2015 - 06:45 AM

On platforms with "GPU malloc" (where memory allocation and resource creation are not linked), I have the streaming system do a regular (CPU) malloc for the header, and a GPU-malloc (write-combine, uncached, non-coherent) for the contents, and then pass the two pointers into a "device.CreateTexture" call, which has very little work to do, seeing that all the data has already been streamed into the right place.
I treat other platforms the same way, except the "GPU malloc" is just a regular malloc, which temporarily holds the pixel data until D3D/GL copies it into an immutable resource, at which point it's freed.

> Interesting, I'm unfamiliar with the concept. Am I correct to assume it's strictly on consoles with unified memory (Xbox One, PS4)?
Xb360, XbOne, PS4 are known to have unified memory, but that's not important here -- unified memory means that it's all physically the same, as opposed to a typical PC where some is physically on the motherboard and some is physically on the GPU PCIe card. This doesn't matter.

What matters is whether you can map "GPU memory" (whether that's physically on the GPU, or unified memory) into the CPU's address space.

Pointers in C and C++ (or even in assembly!) aren't physical addresses - they're virtual addresses, which the hardware translates into the address of some physical resource, which doesn't even have to be RAM! A pointer might refer to RAM on the motherboard, RAM on a PCIe card, a file on a HDD, a register in an IO device, etc...
The act of making a pointer (aka virtual address) correspond to a physical resource is called mapping.

When you call malloc, you're allocating some virtual address space (a contiguous range of pointer values), allocating a range of physical RAM, and then mapping that RAM to those pointers.

In the same way, some OSs let you allocate RAM that's physically on your PCIe GPU, but map it to a range of pointers so the CPU can use it.
This just magically works, except it will of course be slower than usual when you use it, because there's a bigger physical distance between the CPU and GPU-RAM than between the CPU and motherboard-RAM (plus the buses on this physical path are slower).

So, if you can obtain a CPU-usable pointer (aka virtual address) into GPU-usable physical RAM, then your streaming system can stream resources directly into place, with zero graphics API involvement!

Yes, this is mostly reserved for game console devs...
But maybe Mantle/GLNext/D3D12 will bring it into PC land.

GL4 has already kinda added support for it though! You can't completely do your own resource management (there's no "GPU malloc"), but you can actually map GPU-RAM into the CPU's address space.
The GL_ARB_buffer_storage extension lets you create a texture with the appropriate size/format, but no initial data, and then map it using the "PERSISTENT" flag.
This maps the texture's GPU-side allocation into CPU address space so you can write/stream into it.
You should avoid the "COHERENT" flag, as it will reduce performance dramatically by forcing the GPU to snoop the CPU's caches when reading from the texture.
If you don't specify COHERENT, the CPU-mapped virtual addresses will be marked as uncached, write-combined pages. This means the CPU will automatically bypass its own caches (as you'll only be writing to these addresses, never reading from them), and it will queue up your writes in a "write combining buffer" and do more efficient bulk transfers through to the GPU (even if your code only writes one byte at a time). The only catch is that you have to call a GL fence function when you've finished writing/streaming, which ensures the write-combine buffer is flushed out completely, and any GPU-side caches are invalidated if required.
Pretty awesome!
So one of the fastest loading algorithms on GL4 may be to create the resource first, which just does the resource allocation. Then use map-persistent to get a pointer to that allocation. Then stream data into that pointer, and unmap it and fence.
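To make that concrete, here is a rough sketch of the call sequence (hedged: GL_ARB_buffer_storage strictly applies to buffer objects, so this version stages through a persistently-mapped pixel-unpack buffer and then copies into the texture - flag choices and the actual fast path vary by driver):

```
// Create persistently-mappable storage (note: no COHERENT flag).
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferStorage(GL_PIXEL_UNPACK_BUFFER, sizeBytes, NULL,
                GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT);
void* ptr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, sizeBytes,
                GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                GL_MAP_FLUSH_EXPLICIT_BIT);

// ... stream pixel data into ptr (write-combined, so write-only!) ...

// Flush the written range, copy from the unpack buffer into the
// texture, then fence so we know when the region can be reused.
glFlushMappedBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, sizeBytes);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, (const void*)0);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
```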

AFAIK, the background context DMA transfer method is probably the fastest though.

As above, I copy D3D11. I have a boolean property in a GpuCapabilities struct that tells high level code if multithreaded resource creation is going to be fast or will incur a performance penalty.

> How do you check for this? Is it a simple divide like PC vs. consoles, or can it also relate to graphics card/driver version?

If I'm emulating the feature myself (e.g. on D3D9), I know to simply set that boolean to false.
On D3D11, you can ask the device if multithreaded resource creation is performed natively by the driver (fast) or emulated by the MS D3D runtime (slower).
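For reference, that D3D11 query looks something like this (a sketch - the `caps` struct is my illustrative wrapper, but the D3D calls and the DriverConcurrentCreates flag are real):

```
D3D11_FEATURE_DATA_THREADS threading = {};
device->CheckFeatureSupport( D3D11_FEATURE_THREADING,
                             &threading, sizeof(threading) );
// TRUE  -> the driver performs concurrent creates natively (fast).
// FALSE -> the D3D runtime emulates it behind a lock (slower).
caps.fastMultithreadedResourceCreation =
    (threading.DriverConcurrentCreates == TRUE);
```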

On GL you have to use some vendor-specific knowledge to decide whether you'll even attempt background-context resource creation (or whether you'll emulate multithreaded resource creation yourself), and make an educated guess as to whether the vendor is actually going to optimize that code path by using DMA transfers, or whether you'll have fallen onto a slow path... Fun times...

As you mention, modern drivers are pretty ok with using multiple GL contexts and doing shared resource creation. On some modern GPUs, this is even recommended, as it triggers the driver's magic fast-path of transferring the data to the GPU "for free" via the GPU's underutilised and API-less asynchronous DMA controller!

> Neat, and all you have to do is kick off a DMA on a concurrent context?

As with everything in GL, there are multiple ways to do it, it might silently be emulated really slowly, it might be buggy, you'll have to test on every vendor's GPUs, and each vendor probably has conflicting guidelines on how best to implement it.
Besides that, it's pretty simple.

#5213037 Simple Alternative to Clustered Shading for Thousands of Lights

Posted by Hodgman on 26 February 2015 - 01:03 AM

> When taking a look at your technique, you would need to sample the memory quite often, compared to a clustered/cell deferred rendering pipeline, where the lights are stored as uniforms/parameters.

In my experience with clustered, you store all the lights in the buffer in a very similar manner, except each cluster has a linear section of the buffer.

```struct ClusterLightRange { int start, size; };
struct LightInfo { float3 position; /* etc. */ };

Buffer<ClusterLightRange> clusters;
Buffer<LightInfo> lights;

ClusterLightRange c = clusters[ClusterIdxFromPixelPosition(pixelPosition)];
for( int i = c.start, end = c.start + c.size; i != end; ++i )
{
    LightInfo l = lights[i];
    DoLight(l);
}```

So the main additional cost is doing a bounding-sphere test per light that you visit, and iterating through the light array as a linked list rather than a simpler linear array traversal. On modern hardware it should do pretty well - worse than clustered, but probably not by much, especially if the "DoLight" function above is expensive.

It would be interesting to do some comparisons between the two using (A) a simple Lambert lighting model, and (B) a very complex Cook-Torrance/GGX/Smith/etc fancy new lighting model.

Also, you could merge this technique with clustered shading:

A) CPU creates the BVH as described, and uploads into a GPU buffer.

B) A GPU compute shader traverses the BVH for each cluster, and generates the 'Lights' buffer in my pseudo example above.

C) Lighting is performed as in clustered shading (as in my pseudo example above).

#5212993 How do I know if I'm an intermediate programming level?

Posted by Hodgman on 25 February 2015 - 07:02 PM

> I believe in you. You can do the thing.

I'm going to frame this and hang it over the office door.

#5212990 who really a creative director is?

Posted by Hodgman on 25 February 2015 - 06:57 PM

> A "Creative Director" is someone who directs "creatives" - that is, people whose work falls under the (somewhat arbitrary) umbrella of "being creative": art, writing, music or design. All those things people for a brief time thought had something to do with the right hemisphere of the brain (a misconception which has been pretty thoroughly disproven by now).
>
> And a "Director" is a top-level manager.
>
> That's pretty much all there is to the title.
>
> To me, it seems the importance of "Directors" in general is a bit overvalued, especially in American culture.
>
> Most of the actual work (and ideas) comes from the team as a whole, and all good managers know that...

^^^ that.

In my personal experience, most companies don't have a creative director, and instead make do just fine with 'only' an art director, lead designer and a producer doing that work.

At the companies I've worked at that did have a creative director, it was an executive-level role, where that person was a major shareholder and thus had the power to bestow the title upon themselves...
I.E. The dream job of all the newbies here who want to be an "ideas guy" and critic, with all the power but no responsibility.

#5212988 Using a 2D Texture for Point-Light Shadows

Posted by Hodgman on 25 February 2015 - 06:47 PM

> The benefits are that you can apply all the same soft-shadow techniques, such as PCF, to your point lights rather than just to your spot lights and directional lights. - L. Spiro

You can't use PCF/comparison-sampling on cube maps? Is that a restriction on certain platforms?

> What you (Johan) have suggested can't be done, if you are implying that you render the scene once and just have it all map to several parts of a texture. The problem you have is if you have 6 squares randomly packed into a single 2D texture, when you generate the vertex coordinates, any given triangle, if only rendered once, can obviously span multiple "squares". There is no way to get one part of the triangle to go to square 1 and another part to go to square 2.

You can do this on modern hardware (with either a manually packed 2D target, or a cube target) by binding 6 viewports and using a geometry shader to duplicate triangles and output them to specific viewports.

#5212978 GLSL; return statement ...

Posted by Hodgman on 25 February 2015 - 04:40 PM

On older GPUs, I used the rule of thumb that a branch costs a dozen math instructions, so in the average case you need to be skipping more than a dozen instructions to get any benefit.

On modern GPUs, branching is almost free.

However, on every GPU, branching is done at SIMD granularity.
AMD GPUs process 64 pixels at a time, and NVidia GPUs process 32 at a time.
If one of those pixels enters an if statement, then the whole SIMD unit must enter the branch, meaning that up to 63 pixels will be wasting their time.
So: branches should be coherent in screen space - pixels that are close to each other should be likely to take the same branches.
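As a toy model of that granularity (illustrative; simdWidth would be 64 on AMD, 32 on NVidia):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Counts how many SIMD units ("waves") must execute a branch: if any
// lane in a wave takes it, the whole wave executes it.
int WavesExecutingBranch(const std::vector<bool>& laneTakesBranch, int simdWidth)
{
    int waves = 0;
    for (size_t i = 0; i < laneTakesBranch.size(); i += (size_t)simdWidth)
    {
        bool any = false;
        size_t end = std::min(i + (size_t)simdWidth, laneTakesBranch.size());
        for (size_t j = i; j < end; ++j)
            any = any || laneTakesBranch[j];
        if (any)
            ++waves;
    }
    return waves;
}
```

One divergent pixel per 64-wide wave is enough to make every wave pay for the branch, which is why screen-space coherence matters.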
