

Hodgman

Member Since 14 Feb 2007
Online Last Active Today, 09:07 AM

#5213450 Best way of ensuring semantic type consistency?

Posted by Hodgman on 27 February 2015 - 09:34 PM

I use this strong typedef implementation: http://www.gamedev.net/topic/663806-declare-custom-types-of-ints/#entry5198653

^That thread is about integer types, but it should work for floating point types too.
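For reference, here's a minimal sketch of the general idea (not the exact code from the linked thread): a value wrapper tagged with a phantom "tag" type, so two quantities that both happen to be floats can no longer be mixed up. The Metres/Seconds tags are just made-up examples.

template<class T, class Tag>
struct StrongTypedef
{
  explicit StrongTypedef( T v = T() ) : value(v) {}
  T value;
  // ...plus whichever operators (+, -, comparisons, etc) you want to allow...
};

struct MetresTag {};  struct SecondsTag {};
typedef StrongTypedef<float, MetresTag>  Metres;
typedef StrongTypedef<float, SecondsTag> Seconds;

// Metres m(1.0f); Seconds s(2.0f);
// m = s;  // <-- now a compile error instead of a silent logic bug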




#5213347 DirectCompute. How many threads can run at the same time?

Posted by Hodgman on 27 February 2015 - 11:20 AM

No, 1 group of 1024 threads, or 16 groups of 64 threads.

If you dispatch more, then the GPU will get to them when it gets to them.

It's a good idea to give the GPU more than enough work to do, because it's "hyperthreaded". It doesn't run a workgroup from beginning to end without interruption before moving on to the next; it will continually do a bit of work on one group, then a little bit on another, constantly jumping around between half-finished groups.
It does this because it's a great way to hide memory latency.
e.g. with a pixel shader - say you have:
return tex2D(texture, uv * 2) * 4;
For each pixel, it has to do some math (uv*2) then fetch some memory (tex2d), then do some more math (*4).
A regular processor will reach the memory-fetch, and stall, waiting potentially hundreds of cycles until that value arrives from memory before continuing... So it takes 2 cycles to do the math, plus 400 cycles wasted on waiting for memory! Resulting in 402 cycles per pixel.

To avoid this, when a GPU runs into that kind of waiting situation, it just switches to another thread. So it will do the "uv*2" math for 400 different pixels, by which time the memory fetches will start arriving, so it can do the final "result*4" math for 400 pixels, with the end result that it spends zero time waiting for memory! Resulting in 3 cycles per pixel (assuming the GPU can handle 400 simultaneous work groups....)

For the GPU to be able to hide memory latency like this, you want your thread group size to be a multiple of 64, your dispatch size to be greater than 10, and your shaders to use as few temporary variables as possible (as these are the 'state' of a thread, which must be stored somewhere when switching threads).


#5213338 How do I know if I'm an intermediate programming level?

Posted by Hodgman on 27 February 2015 - 10:32 AM

It's a thread about C++ programming, so OP is referring to std::vector specifically.




#5213337 DirectCompute. How many threads can run at the same time?

Posted by Hodgman on 27 February 2015 - 10:31 AM


But what is the maximum number of groups that I can run simultaneously?
The number actually in-flight at once depends highly on the GPU, but also on the complexity of your shader...

On a high-end GPU, using a very complex shader, probably around 1024... or when using a very simple shader, probably 10x more -- around 10240.

 

All that really matters is that the thread group size is a multiple of 64 (AMD's SIMD size), and then you dispatch the appropriate number of threadgroups, i.e. (AmountOfWork+63)/64 to cover all your work items (AmountOfWork).
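In C++ terms, that calculation looks something like this (a minimal sketch assuming a D3D11 device context and a compute shader declared with [numthreads(64,1,1)]):

UINT groupCount = (AmountOfWork + 63) / 64;  // round up so the final partial group is still dispatched
context->Dispatch( groupCount, 1, 1 );
// In the shader, threads with an id >= AmountOfWork should simply early-out.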




#5213315 Open source rendering engine

Posted by Hodgman on 27 February 2015 - 08:22 AM

Horde3D is another one.


#5213227 Non-Member Functions Improve Encapsulation?

Posted by Hodgman on 26 February 2015 - 06:33 PM

An example:

class Widget
{
public:
  Widget();
 
  void Size( Vec2 );
  Vec2 Size() const;
 
  float Area() const;
private:
  Vec2 size;
};
 
void  Widget::Size( Vec2 s ) { size = s; }
Vec2  Widget::Size() const   { return size; }
float Widget::Area() const   { return size.x * size.y; }

In this case, the Area function does not have to be a member, as it does not need to know private details.
It could also be implemented as a free function:

float Area( const Widget& w ) { Vec2 size = w.Size(); return size.x * size.y; }

If you find a bug in Widget, you know it's caused somewhere inside the class - somewhere that has the power to violate the internal invariants enforced within the private implementation.

In the second version, there's less code in the code-base that is able to access the internal details of Widget - so the private implementation of Widget is smaller / more private.

 

It's important to understand the logic and merit behind this argument; however, whether or not you adopt it is largely a stylistic choice.

Many people choose to ignore these guidelines, as they argue that the code is more readable and self-documenting when Area is a member of Widget, and that this readability is more valuable in practice than the theoretical increase in maintainability from the first approach.




#5213225 GLSL; return statement ...

Posted by Hodgman on 26 February 2015 - 06:06 PM

Just be aware that the cost of wasteful shaders is multiplied by the number of pixels drawn, which tends to be a very large number.

e.g. Let's say your skydome covers every pixel on a 1080p render target, that there's an unnecessary if statement in there, that this if statement adds a dozen clock cycles to the shader, that your GPU shades 1000 pixels simultaneously, and it runs at 800MHz.

12cycles * 1920*1080 pixels / 800MHz / 1000 cores = 0.03ms cost (0.2% of a 60Hz frametime budget).

But on an older GPU that shades 24 pixels at 700MHz -
12cycles * 1920*1080 pixels / 700MHz / 24 cores = 1.48ms cost (8.9% of a 60Hz frametime budget).
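The same back-of-the-envelope arithmetic as a tiny helper, in case you want to plug in your own numbers:

double ShaderCostMs( double extraCycles, double pixels, double clockHz, double pixelsInFlight )
{
  return extraCycles * pixels / clockHz / pixelsInFlight * 1000.0;  // milliseconds
}
// ShaderCostMs( 12, 1920*1080, 800e6, 1000 ) ~= 0.03ms
// ShaderCostMs( 12, 1920*1080, 700e6,   24 ) ~= 1.48ms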

If you keep adding those kinds of tiny inefficiencies to your shaders, it can add up to a decent percentage of your total frame time. You can always optimize later though...


#5213074 Low-level platform-agnostic video subsystem design: Resource management

Posted by Hodgman on 26 February 2015 - 06:45 AM

On platforms with "GPU malloc" (where memory allocation and resource creation are not linked), I have the streaming system do a regular (CPU) malloc for the header, and a GPU-malloc (write-combine, uncached, non-coherent) for the contents, and then pass the two pointers into a "device.CreateTexture" call, which has very little work to do, seeing that all the data has already been streamed into the right place.
I treat other platforms the same way, except the "GPU malloc" is just a regular malloc, which temporarily holds the pixel data until D3D/GL copies it into an immutable resource, at which point it's freed.
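As a rough illustration of that flow (purely a sketch -- GpuMalloc, streamer.Read and device.CreateTexture here are hypothetical stand-ins for a console-style allocator and my engine's own API, not anything public):

struct TextureHeader { uint32_t width, height, format, mipCount; };

// Two allocations: a normal cached CPU allocation for the header,
// and write-combined, uncached, GPU-visible memory for the pixel data.
TextureHeader* header = (TextureHeader*)malloc( sizeof(TextureHeader) );
void*          pixels = GpuMalloc( pixelDataSize, GPU_MEM_WRITE_COMBINE | GPU_MEM_UNCACHED );

// The streaming system reads the file contents straight into those allocations...
streamer.Read( file, header, sizeof(TextureHeader) );
streamer.Read( file, pixels, pixelDataSize );

// ...so "creating" the texture is just wrapping up pointers that are already in the right place.
TextureHandle texture = device.CreateTexture( header, pixels );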

Interesting, I'm unfamiliar with the concept. Am I correct to assume it's strictly on consoles with unified memory (Xbox One, PS4)?
Xb360, XbOne, PS4 are known to have unified memory, but that's not important here -- unified memory means that it's all physically the same, as opposed to a typical PC where some is physically on the motherboard and some is physically on the GPU PCIe card. This doesn't matter.

What matters is whether you can map "GPU memory" (whether that's physically on the GPU, or unified memory) into the CPU's address space.

Pointers in C, C++ (or even in assembly!) aren't physical addresses - they're virtual addresses, which the hardware translates into addresses of some physical resources, which doesn't even have to be RAM! A pointer might refer to RAM on the motherboard, RAM on a PCIe card, a file on a HDD, a register in an IO device, etc...
The act of making a pointer (aka virtual address) correspond to a physical resource is called mapping.

When you call malloc, you're allocating some virtual address space (a contiguous range of pointer values), allocating a range of physical RAM, and then mapping that RAM to those pointers.

In the same way, some OS's let you allocate RAM that's physically on your PCIe GPU, but map it to a range of pointers so the CPU can use it.
This just magically works, except it will of course be slower than usual when you use it, because there's a bigger physical distance between the CPU and GPU-RAM than there is between the CPU and motherboard-RAM (plus the buses on this physical path are slower).

So, if you can obtain a CPU-usable pointer (aka virtual address) into GPU-usable physical RAM, then your streaming system can stream resources directly into place, with zero graphics API involvement!

Yes, this is mostly reserved for game console devs... :(
But maybe Mantle/GLNext/D3D12 will bring it to PC land.

GL4 has already kinda added support for it though! You can't completely do your own resource management (no "GPU malloc"), but you can actually map GPU-RAM into the CPU's address space.
The GL_ARB_buffer_storage extension lets you create a texture with the appropriate size/format, but no initial data, and then map it using the "PERSISTENT" flag.
This maps the texture's GPU-side allocation into CPU address space so you can write/stream into it.
You should avoid the "COHERENT" flag, as this will reduce performance dramatically by forcing the GPU to snoop the CPU's caches when reading from the texture :(
If you don't specify COHERENT, the CPU-mapped virtual addresses will be marked as uncached, write-combined pages. This means the CPU will automatically bypass its own caches, as you'll only be writing to these addresses (never reading from them), and it will queue up your writes into a "write combining buffer" and do more efficient bulk transfers through to the GPU (even if your code is only writing one byte at a time). The only catch is that you have to call a GL fence function when you've finished writing/streaming, which ensures the write-combine buffer is flushed out completely, and any GPU-side caches are invalidated if required.
Pretty awesome!
So one of the fastest loading algorithms on GL4 may be to create the resource first, which just does the resource allocation. Then use map-persistent to get a pointer to that allocation. Then stream data into that pointer, and unmap it and fence.
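As a rough sketch of that flow (assuming GL 4.4 / GL_ARB_buffer_storage; note the extension maps buffer storage, so here the pixels are streamed through a persistently-mapped pixel-unpack buffer and then copied into a texture that was created up-front with glTexStorage2D and is currently bound -- dataSize/width/height are assumed):

GLuint pbo;
glGenBuffers( 1, &pbo );
glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbo );
glBufferStorage( GL_PIXEL_UNPACK_BUFFER, dataSize, NULL,
                 GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT );        // note: no COHERENT bit

void* dst = glMapBufferRange( GL_PIXEL_UNPACK_BUFFER, 0, dataSize,
                              GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                              GL_MAP_FLUSH_EXPLICIT_BIT );

// ...stream/decompress the pixel data into 'dst' (write-only -- never read from it)...

glFlushMappedBufferRange( GL_PIXEL_UNPACK_BUFFER, 0, dataSize );    // flush the write-combined writes
glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, width, height,
                 GL_RGBA, GL_UNSIGNED_BYTE, (const void*)0 );       // source = the bound PBO
GLsync done = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );      // wait on this before recycling the PBO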

AFAIK, the background context DMA transfer method is probably the fastest though.

As above, I copy D3D11. I have a boolean property in a GpuCapabilities struct that tells high level code if multithreaded resource creation is going to be fast or will incur a performance penalty.

How do you check for this? Is it a simple divide like PC vs. consoles or can it also relate to graphics card/driver version?
If I'm emulating the feature myself (e.g. On D3D9), I know to simply set that boolean to false :D
On D3D11, you can ask the device if multithreaded resource creation is performed natively by the driver (fast) or emulated by the MS D3D runtime (slower).
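For reference, a minimal sketch of that query (gpuCaps here stands in for the GpuCapabilities struct mentioned above):

D3D11_FEATURE_DATA_THREADING threading = {};
if( SUCCEEDED( device->CheckFeatureSupport( D3D11_FEATURE_THREADING, &threading, sizeof(threading) ) ) )
{
  // TRUE  -> the driver itself creates resources concurrently (fast).
  // FALSE -> the MS D3D runtime emulates it behind a lock (slower).
  gpuCaps.fastMultithreadedResourceCreation = (threading.DriverConcurrentCreates == TRUE);
}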

On GL you have to use some vendor-specific knowledge to decide whether you'll even attempt background-context resource creation (or whether you'll emulate multithreaded resource creation yourself), and make an educated guess as to whether the vendor is actually going to optimize that code path by using DMA transfers, or whether you'll have fallen onto a slow path... Fun times...

As you mention, modern drivers are pretty ok with using multiple GL contexts and doing shared resource creation. On some modern GPUs, this is even recommended, as it triggers the driver's magic fast-path of transferring the data to the GPU "for free" via the GPU's underutilised and API-less asynchronous DMA controller!

Neat, and all you have to do is kick off a DMA on a concurrent context?
As with everything in GL, there are multiple ways to do it, it might silently be emulated really slowly, it might be buggy, you'll have to test on every vendor's GPUs, and each vendor probably has conflicting guidelines on how best to implement it.
Besides that, it's pretty simple :lol:


#5213037 Simple Alternative to Clustered Shading for Thousands of Lights

Posted by Hodgman on 26 February 2015 - 01:03 AM

When taking a look at your technique, you would need to sample the memory quite often, compared to a clustered/cell deferred rendering pipeline, where the lights are stored as uniforms/parameters.

In my experience with clustered, you store all the lights in the buffer in a very similar manner, except each cluster has a linear section of the buffer.
e.g. pseudo-shader:

struct ClusterLightRange { int start, size; };   // contiguous slice of 'lights' belonging to one cluster
struct LightInfo { float3 position; /* etc */ };

Buffer<ClusterLightRange> clusters;
Buffer<LightInfo> lights;

ClusterLightRange c = clusters[ClusterIdxFromPixelPosition(pixelPosition)];
for( int i=c.start, end=c.start+c.size; i!=end; ++i )
{
  LightInfo l = lights[i];
  DoLight(l);
}

So the main additional cost is doing a bounding sphere test per light that you visit, and iterating through the light array as a linked list rather than a simpler linear array traversal. On modern hardware, it should do pretty well. Worse than clustered, but probably not by much -- especially if the "DoLight" function above is expensive.

It would be interesting to do some comparisons between the two using (A) a simple Lambert lighting model, and (B) a very complex Cook-Torrance/GGX/Smith/etc fancy new lighting model.

 

Also, you could merge this technique with clustered shading:

A) CPU creates the BVH as described, and uploads into a GPU buffer.

B) A GPU compute shader traverses the BVH for each cluster, and generates the 'Lights' buffer in my pseudo example above.

C) Lighting is performed as in clustered shading (as in my pseudo example above).




#5212993 How do I know if I'm an intermediate programming level?

Posted by Hodgman on 25 February 2015 - 07:02 PM

I believe in you. You can do the thing.

I'm going to frame this and hang it over the office door.


#5212990 who really a creative director is?

Posted by Hodgman on 25 February 2015 - 06:57 PM

A "Creative Director" is someone who directs "Creatives", that is, people whose work falls under the (somewhat arbitrary) umbrella of "being creative": art, writing, music or design. All those things people for a brief time thought had something to do with the right hemisphere of the brain (a misconception which has been pretty thoroughly disproven by now).
 
And a "Director" is a top level manager.
 
That's pretty much all there is to the title.
 
To me, it seems the importance of "Directors" in general is a bit overvalued, especially in American culture.
 
Most of the actual work (and ideas) come from the team as a whole, and all good managers know that...

^^^ that.

In my personal experience, most companies don't have a creative director, and instead make do just fine with 'only' an art director, lead designer and a producer doing that work.

At the companies I've worked at that did have a creative director, it was an executive-level role, where that person was a major shareholder and thus had the power to bestow the title upon themselves...
i.e. the dream job of all the newbies here who want to be an "ideas guy" and critic, with all the power but no responsibility.


#5212978 GLSL; return statement ...

Posted by Hodgman on 25 February 2015 - 04:40 PM

On older GPUs, I used the rule of thumb that a branch costs a dozen math instructions, so in the average case you need to be skipping more than a dozen instructions to get any benefit.

On modern GPUs, branching is almost free.

However, on every GPU, branching is done at SIMD granularity.
AMD GPUs process 64 pixels at a time, and NVidia process 32 at a time.
If one of those pixels enters an if statement, then the whole SIMD unit must enter the branch, meaning that up to 63 pixels will be wasting their time.
So: branches should be coherent in screen space - pixels that are close to each other should be likely to take the same branches.


#5212907 How do game engineers pack their data?

Posted by Hodgman on 25 February 2015 - 06:48 AM

On a lot of the games I've worked on, each game file was individually compressed using LZMA. An archive is then built by appending all the compressed files end-to-end into a giant mega-file, and also building a look-up table/dictionary from filenames to offsets/sizes within the archive.
To load a file you look it up in the dictionary, then stream 'size' bytes (starting from 'offset') from the archive file into an LZMA decompressor.
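A rough sketch of that layout and lookup (names and the LZMA wrapper are illustrative only):

#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct ArchiveEntry { uint64_t offset, compressedSize; };

// The dictionary built alongside the mega-file: file name -> location of its LZMA-compressed blob.
std::unordered_map<std::string, ArchiveEntry> dictionary;

bool LoadFile( FILE* archive, const std::string& name, std::vector<uint8_t>& out )
{
  auto it = dictionary.find( name );
  if( it == dictionary.end() )
    return false;
  std::vector<uint8_t> compressed( (size_t)it->second.compressedSize );
  fseek( archive, (long)it->second.offset, SEEK_SET );
  fread( compressed.data(), 1, compressed.size(), archive );
  DecompressLzma( compressed, out );  // hypothetical wrapper around your LZMA library of choice
  return true;
}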

Compression algorithms have a lot of settings, letting you balance time taken vs compression ratio. Assuming you're loading stuff on a loading screen, you want to balance those settings so that the decompression CPU work takes about the same amount of time as the file-reading IO time - this way you can queue up a lot of files and keep the IO busy at 100% while getting the best compression ratio possible.
On DVD games, this means a high compression setting. On Blu-ray, even higher. On HDD, low to none, as these are way faster - it can be faster to load uncompressed data! If targeting SSDs, compression will most likely just waste time :lol:
Many console games will keep assets compressed on disk, but will decompress and cache them on the HDD.


#5212702 Max performance and support

Posted by Hodgman on 24 February 2015 - 08:00 AM

iOS partially does in certain circumstances, such as when attributes in a vertex buffer are misaligned.

ie, not aligned to 16 bytes? ES 2? I've seen thrown around that explicit 16 byte alignment is good for some desktop hardware too, AMD cards apparently. I'm assuming if you're using another API (Mantle? ES 3?) you'd have to do proper alignment in any case.
Every card that I've been able to find specs for in the past decade has required attributes to be 4-byte aligned.
AFAIK, D3D forces this on you -- e.g. it won't let you define an attribute of data type short3 in an input layout / vertex descriptor, because it makes you pick from a huge enum of all valid formats (each of which combines a type, such as unsigned short integer, with a component count, such as 4/RGBA).
GL on the other hand lets you declare the type (e.g. short integer) and the component count (4/RGBA) separately, which allows you to specify combinations that no hardware supports -- such as a short3 with 6-byte alignment.
As mentioned by LS, in that particular case the GL driver will have to reallocate your buffer and insert the padding bytes after each element itself, wasting a massive amount of CPU time... It would be much better to just fail hard, early, rather than limping on :(
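To illustrate, using the real D3D11/GL entry points ('stride' is assumed):

// D3D11: formats come from a fixed enum -- there is no 3-component 16-bit integer
// format, so a "short3" attribute simply can't be declared:
D3D11_INPUT_ELEMENT_DESC elem =
  { "POSITION", 0, DXGI_FORMAT_R16G16B16A16_SINT,   // short4 -- 8-byte aligned
    0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 };

// GL: type and component count are independent, so it will accept a short3
// (6-byte alignment) even though no hardware can fetch that directly:
glVertexAttribIPointer( 0, 3, GL_SHORT, stride, (const void*)0 );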

As for 16/32-byte alignment - these values are generally for the stride of your streams, not the size of each element.
E.g. Two interleaved float4 attribs (= 32-byte stride) is better than an interleaved float4 attrib & float3 attrib (= 28-byte stride).
This rule is less universal, and varies greatly by GPU. The reason it's important on some GPUs is that they first have instructions to fetch a whole cache line, and then instructions to fetch individual attributes from that line.
If the vertex fetch cache uses 32-byte cache-lines and you also have a 32-byte stride, then it's smooth sailing - just fetch one cache-line and then fetch the attribs from it!
But if you have a 28-byte stride, then in the general case you have to fetch two cache lines and deal with reassembling attributes that are potentially straddling the boundary between those two lines, resulting in a lot more move instructions being generated at the head of the vertex program :(
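For example (struct layouts only -- the actual performance impact varies by GPU):

struct Vertex32                   // 32-byte stride: two float4 attributes
{
  float positionAndSize[4];
  float colour[4];
};
struct Vertex28                   // 28-byte stride: a float4 plus a float3
{
  float positionAndSize[4];
  float normal[3];
};
static_assert( sizeof(Vertex32) == 32, "matches a 32-byte cache line" );
static_assert( sizeof(Vertex28) == 28, "straddles 32-byte cache lines" );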


#5212691 Moiré pattern brick texture tilling

Posted by Hodgman on 24 February 2015 - 07:05 AM

Also, does the texture resource / shader resource view actually contain mips?



