


#5180346 LOD in modern games

Posted by Hodgman on 14 September 2014 - 06:17 PM

I thought the gpu merges pixels from different triangles until a warp/wavefront can be scheduled?
I thought they would do this as well, but apparently there must be some obstacles with the implementation of that approach...


This guy's done the experiments and found quite a big performance cliff when rasterizing less than 8x8 pixels. It's also interesting that the shape of small triangles matters (wide vs tall)!

#5180340 My teams all hate eachother.

Posted by Hodgman on 14 September 2014 - 05:32 PM

Maybe the staff's perceptions of each other are all correct, and the problem is that they lack the capacity for self-reflection required to temper their own egos, and the tact to politely test others'.
i.e. They're young :P

I was in a team like this. My advice would be:
Give the programmers the freedom to design, as they're the ones implementing, so they're the ones who can most easily riff on design choices. Do the same with concept artists and visual design, and with 3D artists and props/items. Let the design be flexible enough to accommodate everyone - the game will likely be better for it, and the team will be more engaged.

Have everyone release their work into the project accompanied by a standard text file saying they grant the project unlimited license to use the work (removing the capacity for manipulative copyright shenanigans) - or better, use the MIT/BSD/WTFPL instead of your own custom made text file. Don't accept any Zips/etc where the text file is missing.

Stop offering payment if you don't have the cash up-front. Unless you've already formed a real company and have had your lawyer draft up a shareholder constitution, a schedule for issuing shares, and contributor agreements, then a promised profit-sharing scheme is NOT going to happen. If you are lucky enough to finish the game, you're going to have to do all of the above at release time, plus setting up bank accounts, shitting your pants over IRS forms, etc... And it's extremely likely that you will all get legally fucked over in the process.
It also makes anyone with any experience instantly see your project as a scam, dooming you to inexperienced contributors.
It's much healthier to admit that this is a fan/hobby/portfolio project only, with no money involved. If you want to show your appreciation to your team-mates, send them an unexpected gift instead. If you want to dangle a carrot, say that if the game is popular, you will form a studio to professionally create a sequel, with money that time.

#5180184 OmniDirectional Shadow Mapping

Posted by Hodgman on 13 September 2014 - 11:40 PM

6 matrices per frame is not much -- 384 bytes per light per frame. 100 lights at 60Hz might add up to ~2MB/s, over a bus rated in GB/s :)

For comparison, an animated character might have 100 bone matrices supplied per frame.

But... Why do you need 6 matrices? The cube itself has one rotation and one translation value, which should be identical for every face, right?

Also, seeing as it's omnidirectional, do you even need to support rotation at all? A simple translation / light position value should be enough.

[edit] i.e. Subtracting the light position from the surface position gives you the direction from the light to the surface, which is the texture coordinate to use when sampling from the cube-map.
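The arithmetic and the subtraction above can be sketched like this (a hedged illustration only -- the vector math and byte counts are standard, the function names are mine):

```python
# The cube-map lookup direction for an omnidirectional shadow map is just
# (surface - light); no per-face rotation matrix is needed at sampling time.
def cube_shadow_lookup_dir(light_pos, surface_pos):
    """Direction from the light to the surface = cube-map texture coordinate."""
    return tuple(s - l for s, l in zip(surface_pos, light_pos))

light = (10.0, 5.0, 0.0)
surface = (12.0, 5.0, -3.0)
print(cube_shadow_lookup_dir(light, surface))  # (2.0, 0.0, -3.0)

# Checking the bandwidth figures from the post:
bytes_per_matrix = 4 * 16          # 4x4 matrix of 32-bit floats = 64 bytes
per_light = 6 * bytes_per_matrix   # 384 bytes per light per frame
per_second = per_light * 100 * 60  # 100 lights at 60Hz
print(per_second)                  # 2304000 bytes/s, i.e. roughly 2MB/s
```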

#5180144 Dynamic cubemap relighting (random thought)

Posted by Hodgman on 13 September 2014 - 07:09 PM

Cheap idea for relighting the increasingly popular cubemap/image-based lighting solution: just store the depth/normal/albedo of each cubemap face, and relight N cubemap faces per frame with the primary light / updated skybox. As long as you bake the roughness lookups into the mipmaps and apply them to the final output cubemaps you use for lighting, you get a dynamically re-lightable image-based lighting system.

Here's an example implementation, coined
"Deferred Irradiance Volumes".

Turns out this idea works pretty well :D
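The round-robin update scheme quoted above can be sketched as follows (an illustrative model; the real thing would relight G-buffer faces on the GPU, and all names here are mine):

```python
# Each cube-map probe stores depth/normal/albedo per face; every frame we
# relight only N faces with the current primary light, cycling through them.
def faces_to_relight(frame_index, total_faces, faces_per_frame):
    """Return which face indices get relit on this frame (round-robin)."""
    start = (frame_index * faces_per_frame) % total_faces
    return [(start + i) % total_faces for i in range(faces_per_frame)]

# 2 probes x 6 faces = 12 faces, relighting 3 per frame:
print(faces_to_relight(0, 12, 3))  # [0, 1, 2]
print(faces_to_relight(1, 12, 3))  # [3, 4, 5]
```

With this schedule every face is refreshed once every `total_faces / faces_per_frame` frames, which bounds the per-frame relighting cost regardless of probe count.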

#5180064 DX12 - Documentation / Tutorials?

Posted by Hodgman on 13 September 2014 - 07:38 AM

So in essence, a game rated to have minimum 512MB VRAM (does DX have memory requirements?) never uses more than that for any single frame/scene?
You would think that AAA games that require tens of gigabytes of disk space would at some point use more memory in a scene than what is available on the GPU. Is this just artist trickery to keep any scene below the rated GPU memory?

Spare a thought for the PS3 devs with tens of GBs of disc space, and then just 256MB of GPU RAM :D I'm always completely blown away when I see stuff like GTA5 running on that era of consoles!
Ideally on PC, if you have 512MB of VRAM, yes, you should never use more than that in a frame. Ideally you should never use more than that, ever!
If you've got 1024MB in resources, but on frame #1 you use the first 50% of it, and on frame #2 you use the second 50%, it's still going to really hurt performance -- in between those two frames, you're asking D3D to memcpy half a gig out of VRAM, and then another half a gig into VRAM. That's a lot of memcpy'ing!
Game consoles don't do this kind of stuff for you automatically (they're a lot more like D3D12 already!), so big AAA games made for consoles are going to be designed to deal with harsh memory limits themselves. e.g. on a console that has 256MB of VRAM, the game will crash as soon as you try to allocate the 257th MB of it. There's no friendly PC runtimes/drivers that are going to pretend that everything's ok and start doing fancy stuff behind the scenes for you :D
The tricky part in doing a PC version is that you've got a wide range of resource budgets. On the PS3 version, you can just say "OK, we crash after 256MB of allocations, deal with it", and do the best you can while fitting into that budget. On PC, you need to do the same, but also make it able to utilize 512MB, or 700MB or 1GB, etc... The other hard part is that on PC, it's almost impossible to know how much memory any resource actually takes up, or how much VRAM is actually available to you... Most people probably just make guesses based on the knowledge they have from their console versions.

that method would have to also be used -for example- when all of that larger-than-video-memory resource is being accessed by the GPU in the same, single shader invocation? Or that shader invocation would (somehow) have to be broken up into the subsequently generated command lists? Does that mean that the DirectX pipeline is also virtualized on the CPU?

I don't know if it's possible to support that particular situation? Can you bind 10GB of resources to a single draw/dispatch command at the moment?


I don't think the application will be allowed to use that DMA-based synchronisation method (or trick?) that you explained.

D3D12 damn well better expose the DMA command queues :D nVidia are starting to expose them in GL, and a big feature of modern hardware is that they can consume many command queues at once, rather than a single one as with old hardware.


Wait. That's how tiled resources already work

Tiled resources tie in with the virtual address space stuff. Say you've got a texture that exists in an allocation from pointer 0x10000 to 0x90000 (a 512KB range) -- you can think of this allocation being made up of 8 individual 64KB pages.
Tiled resources are a fancy way of saying that the entire range of this allocation doesn't necessarily need to be 'mapped' / has to actually translate to a physical allocation.
It's possible that 0x10000 - 0x20000 is actually backed by physical memory, but 0x20000 - 0x90000 aren't actually valid pointers (much like a null pointer), and they don't correspond to any physical location.
This isn't actually new stuff -- at the OS level, allocating a range of the virtual address space (allocating yourself a new pointer value) is actually a separate operation to allocating some physical memory, and then creating a link between the two. The new part that makes this extremely useful is a new bit of shader hardware -- when a shader tries to sample a texel from this texture, it now gets an additional return value indicating whether the texture-fetch actually succeeded or not (i.e. whether the resource pointer was actually valid or not). With older hardware, fetching from an invalid resource pointer would just crash (like they do on the CPU), but now we get error flags.
This means you can create absolutely huge resources, but then on the granularity of 64KB pages, you can determine whether those pages are physically actually allocated or not. You can use this so that the application can appear simple, and just use huge textures, but then the engine/driver/whatever can intelligently allocate/deallocate parts of those textures as required.
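The page-mapping idea above can be modelled in a few lines (a toy sketch only -- real residency is managed by the OS/driver, and the class and addresses here are illustrative, matching the 0x10000-0x90000 example from the post):

```python
# A tiled resource spans a range of 64KB pages of virtual address space,
# but only some pages are backed by physical memory. A fetch from an
# unbacked page reports failure instead of crashing, like the new
# shader-hardware error flags described above.
PAGE = 64 * 1024

class TiledResource:
    def __init__(self, base, size):
        self.base = base
        self.num_pages = size // PAGE
        self.resident = set()            # indices of physically-backed pages

    def map_page(self, index):
        self.resident.add(index)         # stand-in for committing physical memory

    def fetch(self, address):
        """Returns (succeeded, page_index), mimicking the shader's error flag."""
        page = (address - self.base) // PAGE
        return (page in self.resident, page)

# 512KB allocation at 0x10000 = 8 pages; back only the first one:
tex = TiledResource(0x10000, 0x80000)
tex.map_page(0)
print(tex.fetch(0x10004))  # (True, 0)  -- 0x10000-0x20000 is mapped
print(tex.fetch(0x30000))  # (False, 2) -- unmapped: flagged, not a crash
```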

#5180026 LOD in modern games

Posted by Hodgman on 13 September 2014 - 01:55 AM

I thought gpu's rasterized in quads (2x2 pixels) and that was the efficiency threshold? Or has that changed in recent years.
Yeah they do, but nVidia's chips also run computations on 32 work items at once, and ATI/AMD chips run computations on 64 at once.


8x8 pixels = 16 2x2 quads = 64 work items = full utilization of AMD hardware.

#5179881 Shader variables (uniform)

Posted by Hodgman on 12 September 2014 - 08:33 AM

If your drivers are really nice, they might recompile your shader code every time you change the value of the uniform, making it free...

But you should assume that the driver isn't going to be that magically nice, and that you're going to pay the cost of that 'if' for every pixel that you draw.


Often, games design these kinds of shaders using #if / #endif instead, and then compile the shader code multiple times.
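A hedged sketch of that #if/#endif approach: compile one shader per permutation of defines up front, and pick the precompiled variant at draw time instead of branching on a uniform. The snippet below only models the variant cache; `("compiled", key)` stands in for whatever your API's compile call returns (e.g. D3DCompile with macro definitions), and all names are mine.

```python
# Shader source with a compile-time switch instead of a runtime uniform:
SOURCE = """
#if USE_FOG
    color = apply_fog(color);
#endif
"""

def variant_key(**flags):
    """Stable key for a permutation, e.g. USE_FOG=True -> 'USE_FOG=1'."""
    return ";".join(f"{k}={int(v)}" for k, v in sorted(flags.items()))

cache = {}
def get_shader(**flags):
    key = variant_key(**flags)
    if key not in cache:
        # placeholder for compile_shader(SOURCE, flags) -- each permutation
        # is compiled once, ahead of time, then reused every frame
        cache[key] = ("compiled", key)
    return cache[key]

print(get_shader(USE_FOG=True))   # ('compiled', 'USE_FOG=1')
print(get_shader(USE_FOG=False))  # ('compiled', 'USE_FOG=0')
```

The trade-off: more shader permutations to build and ship, but zero per-pixel branching cost.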

#5179879 DX12 - Documentation / Tutorials?

Posted by Hodgman on 12 September 2014 - 08:29 AM

Say your typical low end graphics card has 512MB - 1GB of memory. Is it realistic to say that the total data required to draw a complete frame is 2GB, would that mean that the GPU memory would have to be refreshed 2-5+ times every frame?
Do I need to start batching based on buffer sizes?

It's always been the case that you shouldn't use more memory than the GPU actually has, because it results in terrible performance. So, assuming that you've always followed this advice, you don't have to do much work in the future ;)

The article is not very clear on this. It says that the driver will tell the operating system to copy resources into GPU memory (from system memory) as required, but only the application can free those resources once all of the queued commands using those resources have been processed by the GPU. It's not clear if the resources can also be released (from GPU memory, by the OS) during the processing of already queued commands, to make room for the next 512MB (or 1GB, or whatever size) of your 2GB data. But my guess is that this is not possible. This would imply that the application's "swap resource" request could somehow be plugged into the driver/GPU's queue of commands, to release unused resource memory in the interim, which is probably not possible, since (also according to the article), the application has to wait for all of the queued commands in a frame to be executed, before it knows which resources are no longer needed. Also, "the game already knows that a sequence of rendering commands refers to a set of resources" - this also implies that the application (not even the OS) can only change resource residency in-between frames (sequence of rendering commands), not during a single frame.

If D3D12 is going down the same path as the other low-level API's, the resources as we know them in D3D don't really exist any more.
A resource such as a texture ceases to exist. Instead, you just get a form of malloc/free to use as you will. You can malloc/free memory whenever you want, but freeing memory too soon (while command lists referencing that memory are still in flight) will be undefined behavior (logged by the debug runtime, likely to cause corruption in the regular runtime).
Resource-views stay pretty much as-is, but instead of pointing to a resource, they just have a raw pointer inside them, which points somewhere into one of the gpu-malloc allocations that you've previously made. These resource-view objects will hopefully be POD blobs instead of COM-objects, which can easily be copied around into your descriptor tables. These 'view' structures are in native-to-the-GPU formats, and will be read directly as-is by your shader programs executing on the GPU.
This is basically what's going on already inside D3D, but it's hidden behind a thick layer of COM abstraction.
At the moment, the driver/runtime has to track which "resources" are used by a command list, and from there figure out which range of memory addresses are used by a command list.
The command-list and this list of memory-ranges is passed down to the Windows display manager, which is responsible for virtualizing the GPU and sharing it between processes. It stores this info in a system-wide queue, and eventually gets around to ensuring that your range of (virtual) memory addresses are actually resident in (physical) GPU-RAM and are correctly mapped, and then it submits the command list.
At the moment, it's up to D3D to internally keep track of how many megabytes of memory is required by a command-list (how many megabytes of resources are referenced by that command list). Currently, D3D is likely ending your internal command-list early when it detects you're using too much memory, submitting this partial command-list, and then starting a new command list for the rest of the frame.
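The early-ending behaviour described above can be modelled simply (an illustrative guess at the policy, not D3D's actual implementation; names and byte sizes are mine):

```python
# Track the total memory referenced by commands recorded so far, and end
# the command list early when the next draw would exceed the VRAM budget,
# submitting the partial list and starting a fresh one.
def split_by_budget(draws, budget):
    """draws: list of (name, referenced_bytes).
    Returns lists of draw names, one per submitted command list."""
    lists, current, used = [], [], 0
    for name, size in draws:
        if current and used + size > budget:
            lists.append(current)        # submit the partial command list
            current, used = [], 0        # residency can change between lists
        current.append(name)
        used += size
    if current:
        lists.append(current)
    return lists

draws = [("a", 300), ("b", 300), ("c", 200), ("d", 400)]
print(split_by_budget(draws, 512))  # [['a'], ['b', 'c'], ['d']]
```

Between each submitted list, the OS gets a chance to page resources in and out of VRAM, which is exactly why using more memory than the GPU has hurts so much.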

Also, DX12 is only a driver/application-side improvement over DX11. Adding memory management capabilities to the GPU itself would also require a hardware-side redesign.

This kind of memory management is already required in order to implement the existing D3D runtime - pretending that the managed pool can be of unlimited size requires that the runtime can submit partial command buffers and page resources in and out of GPU-RAM during a frame.
There's already lots of HW features available to allow this :D
Both the CPU and the GPU use virtual-addressing, where the value of a pointer doesn't necessarily correspond to a physical address in RAM.
Generally, most pointers (i.e. virtual addresses) we use on the CPU are mapped to physical "main RAM", but pointers can also be mapped to IO devices, or other bits of RAM, such as RAM that's physically on the GPU.
The most basic system is then for us to use an event, such that when the GPU executes that event command, it writes a '1' into an area of memory that we've previously initialized with a zero. The CPU can submit the command buffer containing this event command, and then poll that memory location until it contains a '1', indicating the GPU has completed the commands preceding the event. The CPU can then map physical GPU memory into the CPU's address space, and memcpy new data into it.
This is a slow approach though - it requires the CPU to waste time doing memcpys... but worse, because memcpy'ing from the CPU into GPU-RAM is much slower than to regular RAM!
Another approach is to get the GPU to do the memcpy. Instead, you map some CPU-side physical memory into the GPU's virtual address space, and at the end of the command buffer, insert a dispatch command that launches a compute shader that just reads from the CPU-side pointer and writes to a GPU-side pointer.
This frees up the CPU, but wastes precious GPU-compute time on something as basic as a memcpy. On that note - yep, the GPU can read from CPU RAM at any time really - you could just leave your textures and vertex buffers in CPU-RAM if you liked... but performance would be much worse... Also, on Windows, you can't have too much CPU-RAM mapped into GPU-address-space at any one time or you degrade system wide performance (as it requires pinning the CPU-side pages / marking them as unable to be paged out).
Even older GPUs that don't have compute capabilities will have a mechanism to implement this technique -- it's just easier to explain if we talk about a memcpy compute shader ;)
Lastly, modern GPUs have dedicated "DMA units", which are basically just asynchronous memcpy queues. As well as your regular command buffer full of draw/dispatch commands, you can have a companion buffer that contains DMA commands. You can insert a DMA command that says to memcpy some data from CPU-RAM to GPU-RAM, but before it, insert a command that says "wait until the word at address blah changes from a 0 to a 1". We can also put an event at the end of our drawing command buffer like in the first example, which lets the DMA queue know that it's time to proceed. This can be an amazing solution as it has zero impact on the CPU or GPU!
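The fence handshake between the two queues above can be sketched as a toy simulation (everything here runs on the CPU purely for illustration; on real hardware both queues execute on the GPU, and all names are mine):

```python
# The graphics queue writes a '1' into a fence slot when drawing is done;
# the DMA queue's first command waits on that slot before its memcpy runs.
fence = [0]
gpu_ram = {}

graphics_queue = [("draw", "scene"), ("signal", fence)]
dma_queue = [("wait", fence), ("copy", ("cpu_staging", "texture0"))]

def run(queue):
    for op, arg in queue:
        if op == "signal":
            arg[0] = 1                    # GPU writes the fence value
        elif op == "wait":
            assert arg[0] == 1, "DMA ran before the fence was signalled"
        elif op == "copy":
            src, dst = arg
            gpu_ram[dst] = src            # stand-in for the DMA memcpy

run(graphics_queue)   # draws, then signals the fence
run(dma_queue)        # sees fence == 1, performs the copy
print(gpu_ram)        # {'texture0': 'cpu_staging'}
```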
Instead of D3D just doing all this magically internally, if you want to use an excessive amount of memory in your app (more than the GPU can handle), then in D3D12 it's going to be up to you to implement this crap...

If you don't want to use an excessive amount of RAM, the only thing you'll have to do is, while you're submitting draw-calls into your command buffer, you'll also have to generate a list of resources that are going to be used by that command buffer, so you can inform Windows which bits of GPU-RAM are going to be potentially read/written by your command buffer.

#5179456 Code organization without the limitations of files

Posted by Hodgman on 10 September 2014 - 06:18 PM

Every function goes into its own file. BAM, super-IDE has a Minority Report view of those billion fragment-files, and diff still works.

Just because you don't want to view things as files doesn't mean you actually have to change the storage mechanism though. Quite often when exploring existing code, I'll use the "find all references to" feature of the IDE, which produces a list of places where this class/function/variable is touched. When I click on an item in that list, the IDE opens the file and scrolls to the line... Your super-IDE could instead open all the files, but only display the range of lines that we're interested in. These file-views don't need to be full screen, so you can fit a lot of them in a nice draggable UI to get all the info in front of the user at once.

#5179200 Indie game developers getting screwed by PAX?

Posted by Hodgman on 09 September 2014 - 06:37 PM

A friend here was on the waiting list for a space in the PAX "Indie MEGABOOTH", seeing as they sell out immediately. Someone else dropped out, so he got one at the last minute (about 2 weeks before the event).
It's notoriously hard and/or ridiculously expensive to get booths at big events like that, which is why the indie megabooth exists. Did you try to scope space with them?

Also both of the op's only posts are related to that story and pretend to be an external eye.

o_O You're suggesting that the OP may have been involved in the incident discussed in that story?

I'll straight up say that the OP is involved in that incident and is just link-spamming his own blog over the internet.
Not the best way to start a discussion, OP... Does not make your company look professional.

#5179058 Blur shader

Posted by Hodgman on 09 September 2014 - 05:28 AM

Box blur. It's the same as Gaussian, but instead of each sample using a unique weight value, you sum all the samples together unweighted and then multiply the result by 1/numSamples.
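A minimal 1D sketch of that box blur (the 2D version just applies it along each axis; the clamping policy at the edges is my choice for illustration):

```python
# Box blur: every tap gets the same weight 1/numSamples, versus a
# Gaussian's unique per-tap weights. Both weight sets sum to 1, so a
# constant signal passes through unchanged.
def box_blur_1d(samples, radius):
    out = []
    n = len(samples)
    for i in range(n):
        taps = [samples[min(max(j, 0), n - 1)]       # clamp at the edges
                for j in range(i - radius, i + radius + 1)]
        out.append(sum(taps) / len(taps))            # unweighted average
    return out

print(box_blur_1d([0, 0, 9, 0, 0], 1))  # [0.0, 3.0, 3.0, 3.0, 0.0]
```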

Gaussian isn't that expensive BTW, as long as you precompute the weights and hard-code them into the shader.

#5179003 Critique my approach to writing RAII wrappers

Posted by Hodgman on 08 September 2014 - 09:35 PM

You'd be hard pressed to argue in most C++ workplaces that shared pointer isn't a "RAII class".
I can just imagine that workplace conversation when someone name-drops the RAII idiom and some smug bastard tries to get picky...
A: So, this is here just a RAII class, like a shared pointer,
B: Shared pointer isn't RAII! :)
A: ... :/ ... uhhh.... whatever. So anyway, it's like a shared pointer, but...


To me, if what Bregma says isn't true, then RAII doesn't mean anything new

It was new 30 years ago when they were originally writing C++, coming from C programming. The whole point was to avoid the horrible mess that is resource cleanup in plain C.
RAII isn't a new term, so it shouldn't be news to anyone who's written C++ code ever.

don't really see the point of trying to redefine what RAII is



it's "resource acquisition is initialization".

  • What is the class invariant of an object that can be in a valid state but not hold a resource?
  • What kind of aid to logic is "may or may not hold a resource"?
  • What kind of aid to reasoning is "the resource may or may not be held by this object"?
  • How should I distinguish between this version of RAII and any other object that gets initialized in its constructor and destroyed in its destructor?
  • How is late resource acquisition (or resource transfer) different from open/close semantics?
  • Is there another name for the idiom in which the invariant of the object is "owns the resource"?

Seriously, or just begging the question? Aren't the answers obvious if you don't start by assuming that your assertion is correct?

  • When the object goes out of scope, if it has a resource, the resource will be released.
  • When you want to use RAII but it's expected that acquisition may fail or be optional.
  • ^
  • Why do you need to?
  • You can't opt out of 'closing' a RAII object; it will always release any resources it has when it goes out of scope. BTW, unique_ptr, which you say is TrueRAII™ can act in this same "contravening" way -- you can manually 'close' it / release ownership and retrieve the raw pointer. You can also initialize or swap them with a nullptr at any time... so even this class that you say is TrueRAII™ fails the same tests that you used to say that shared_ptr isn't TrueRAII™... And yes, decrementing a reference counter is releasing a resource.
  • Bjarne also called it "General Resource Management".

#5178945 HLSL compiler weird performance behavior

Posted by Hodgman on 08 September 2014 - 03:22 PM

If you don't specify loop/flatten attributes, the compiler will try both options before picking one. Even if you do specify [loop], it still seems to partially unroll loops to see if maybe it's a better choice.

That's one reason you should always compile your shaders ahead of time, instead of on your loading screen!

#5178770 Best reflection method

Posted by Hodgman on 07 September 2014 - 06:16 PM

Yeah I think one of the most common solutions on this generation of games will be to use a lot of cube-map "probes" around the level, but combine those results with screen-space reflections (ray-tracing through the depth buffer) to add in self-reflection and other local details.

In the future, we'll probably see ray-tracing through voxel representations of the scene become popular (partially resident sparse voxel octrees, etc).

If your reflective object happens to be a flat plane, then old-school planar reflections are tried and tested there :)

#5178668 How difficult(or not) do you find these C++ tests?

Posted by Hodgman on 07 September 2014 - 06:13 AM

Well, I consider myself a C++ expert, but I just took their C++ test #1, spent ~1:30 on each question, thinking "this is a stupid test, badly written and checking for esoteric quirks that only experts will know"... and apparently I took it too quickly because I missed subtle details on each one and got all 5 questions wrong. Taking it again and reading more carefully I found all my mistakes :|

So... I don't think I'd use that test to screen candidates :D


As DaBono said, they'd probably be more interesting as interview discussion topics, so you can see how well candidates can explain why the different answers would be generated.