
#5002640 Managing constant buffers without FX interface in Direct3D10

Posted by Hodgman on 20 November 2012 - 05:43 AM

The 16-byte aligned version of the matrix probably utilizes the SSE2 instruction set (which is what they mean by P4 optimised).
The matrices are converted into this SSE-friendly format before the two matrix multiplications take place -- that's the significance, not the copying into the buffer afterwards.

Many engines that I've worked with do make use of 16-byte aligned vectors, matrices, and even floats, throughout all math-heavy parts of the code, so that SSE (or equivalent) instruction sets can be used in those parts of the code-base.
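To illustrate the point above, here's a minimal sketch of what such an SSE-friendly vector/matrix type might look like; the `Vec4`/`Mat4` names are hypothetical, not from any particular engine:

```cpp
#include <cstddef>

// Hypothetical 16-byte-aligned 4-float vector. alignas(16) guarantees the
// compiler can use aligned SSE loads/stores (e.g. movaps) when this type is
// fed to SIMD intrinsics.
struct alignas(16) Vec4
{
    float x, y, z, w;
};

// A matrix is then just four aligned rows: 64 bytes, also 16-byte aligned.
struct alignas(16) Mat4
{
    Vec4 rows[4];
};

static_assert(alignof(Vec4) == 16, "Vec4 must be 16-byte aligned");
static_assert(sizeof(Mat4) == 64, "Mat4 must be 4x16 bytes");
```

Converting a matrix into this layout once, before doing the two multiplications, is what lets the SIMD path pay off.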

The ability to set multiple buffers in a single call to *SSetConstantBuffers is just an optimisation for the times when you would call it 3 times in a row. i.e. You can call it many times to bind resources to many different slots, but if you find that you're making multiple calls in a row, then you can instead pass an array to a single call to reduce the number of API calls you have to make.

2) Yes, updating the contents of a buffer is completely separate from binding that buffer to a slot. If you bind it to a slot, then update it, it's still bound to the slot.
Basically all your call does is store a pointer to your buffer resource into an array, e.g. conceptually--
device.VertexShaderConstantBuffers[slot] = input;//just storing a pointer
// or
for( int i=0; i != NumBuffers; ++i )
  device.VertexShaderConstantBuffers[i+StartSlot] = input[i];
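The conceptual snippet above can be made runnable as a toy model; the `Device`/`VSSetConstantBuffers` names here are stand-ins for the real D3D10 API, not the actual interface:

```cpp
#include <cstddef>

// Toy model of what *SSetConstantBuffers conceptually does: the device just
// stores pointers into a per-stage slot array. Names are stand-ins, not the
// real D3D10 API.
struct Buffer { int id; };

struct Device
{
    static const size_t kNumSlots = 14; // D3D10 exposes 14 cbuffer slots per stage
    Buffer* vsConstantBuffers[kNumSlots] = {};

    void VSSetConstantBuffers(size_t startSlot, size_t numBuffers, Buffer** buffers)
    {
        for (size_t i = 0; i != numBuffers; ++i)
            vsConstantBuffers[startSlot + i] = buffers[i]; // just storing pointers
    }
};

// Demo: bind two buffers starting at slot 3 and check where they ended up.
inline bool SlotBindingDemo()
{
    Device d;
    Buffer a{1}, b{2};
    Buffer* input[] = { &a, &b };
    d.VSSetConstantBuffers(3, 2, input);
    return d.vsConstantBuffers[3] == &a
        && d.vsConstantBuffers[4] == &b
        && d.vsConstantBuffers[0] == nullptr;
}
```

Because only pointers are stored, updating the buffer's contents afterwards doesn't change which slot it's bound to.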

To match up the slots correctly when binding buffers, you've got 2 choices:
a) After you compile your shader, reflect on the binary to find which slots your named cbuffers have ended up in.
b) When writing your shaders, manually specify which slot a cbuffer is bound to, using the register keyword:
cbuffer MyData : register(b7) // "MyData" buffer should be bound to slot #7

#5002617 Game Engine max map sizes - why?

Posted by Hodgman on 20 November 2012 - 03:01 AM

This series of altdevblog posts also gives some good insights into float behaviour, and is a bit more easily digested than the 'what every programmer should know' article:

#5002594 object-oriented vs. data-oriented design?

Posted by Hodgman on 20 November 2012 - 12:05 AM

Why not use both? A lot of the time, they're not mutually exclusive choices.

OOD is basically a big collection of wisdom in writing robust software.
DOD is basically the statement "please stop and think about what exactly it is that you're telling the CPU, Cache and RAM to do, and then keep it simple".

The way I see it, DOD is basically there as a "check and balance" to make sure you don't end up getting too carried away with OOD abstractions, to the point where you forget that at the end of the day, all you're doing is transforming blobs of data into other blobs of data.

#5002328 Game Engine max map sizes - why?

Posted by Hodgman on 19 November 2012 - 06:54 AM

The short answer is that the games those engines were built for didn't require larger maps. Engines are only built to the requirements of their attached games, after all.

The technical answer regarding the precision of large numbers is in What Every Computer Scientist Should Know About Floating-Point Arithmetic.
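The core of the large-map problem can be shown in a few lines: a float has a 24-bit significand, so the gap between adjacent representable values (the ulp) grows with magnitude, and small movements far from the origin simply round away:

```cpp
#include <cmath>

// At a coordinate of 100,000,000 units the gap between adjacent floats is 8
// units, so nudging an object by 1 unit does nothing at all.
inline bool SmallStepIsLost()
{
    float bigCoord = 100000000.0f;        // e.g. 100km from the origin, in mm
    return (bigCoord + 1.0f) == bigCoord; // the +1 is rounded away entirely
}

// The gap (ulp) between x and the next representable float above it.
inline float UlpAt(float x)
{
    return std::nextafterf(x, 1e30f) - x;
}
```

This is why engines with big worlds resort to tricks like origin re-centring or segmenting the world into local coordinate spaces.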

For some insights into how people have made engines that support large worlds, the "Continuous World of.." presentation is a good read:

#5002227 Something that bothers me...

Posted by Hodgman on 18 November 2012 - 09:40 PM

I just don't get why mix the two things, or worse, pass them both combined as realistic effects. Realistic for what exactly? Why have eye adaptation while having lens flares? Are we the character or are we looking at him through a camera? It's contradictory.

Well, it's realistic as in "Hollywood realism". Lots of games go for a kind of "film" look rather than a human perspective. "Realistic" might also mean "photorealistic" or "hyperrealistic", which are artistic genres.

BF3 for example, is obviously trying to look like it's filmed via a camera. It's a deliberate style choice. It's some kind of "war footage" schtick.

As for eye adaptation -- this isn't contradictory for a film-style presentation, because cameras also have adaptive (or manual) exposure controls. In fact, most tone-mapping systems refer to this parameter as "exposure", because they were initially developed for photography. Automatic adaptation might just be the "AI cameraman" turning the exposure knob on your virtual camera.

As for bloom, it's extremely important in human-style graphics as well as camera-style graphics. Try looking at a bright light source at night (e.g. a street-lamp), then hold out your thumb to cover it up - you'll notice a large 'glow' around the light disappears when you do so.
n.b. streetlamps on foggy nights also have a 2nd 'glow' that doesn't disappear when you cover them with your hand, which is atmospheric scattering.

As for head-bob, it should be very minimal in a human-style rendering. When running, your head does move quite a lot, but your brain uses your vestibular organ to "stabilize" its vision, so that you don't notice just how much your eyes are actually moving.
If you use a head-camera to film yourself running, and then watch the footage later (without the simultaneous vestibular hints), it's very hard to watch and very disorienting. A camcorder-style presentation should make greater use of head-bob... However, a Hollywood-style presentation should minimize it, because that's what (non-camcorder) films do.

That said... yes, it's a weird situation.
I remember being really impressed with one crappy FPS in the '90s, because they only showed their lens flare effect when you were using your scope (and thus looking at the scene through a lens).

As for more human-style presentation, one thing I want to see is internal 4D HDR rendering instead of RGB rendering. The eye has 4 kinds of light receptors, which are roughly tuned to red, green, blue and teal, so the eye actually sees in 4D colour. However, the optic nerve cuts this information down to just 3 dimensions, before it's processed by your brain, which is why we're ok with just rendering with RGB.
However, the process by which the 4D signal is cut down to 3D differs depending on how much light is around. If there's 1% of a candle's worth of light, then RGB are thrown out, and only the Teal sensor data is sent to the brain. If there's more than 3 candles' worth of light, then Teal is thrown out, and only RGB are sent to the brain. However, in the range between those two extremes (around the 1-candle level), a weird "low-light" vision mode kicks in where the RGB and Teal data are combined in a special way, which always gives low-light scenes a very different appearance when you're actually there compared to when you see a photograph of them.
If we actually rendered in 4D and then used a tone-mapper that simulated this 4D->3D process that our eyes perform, then low-light renderings would appear much more realistic than what we currently achieve.
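Purely as an illustration of that 4D->3D idea, a tone-mapper along these lines could blend the cone (RGB) and rod (Teal) signals by scene luminance. The thresholds and the linear blend here are invented for the sketch; a real mesopic model would use measured photometric curves:

```cpp
#include <algorithm>

// Illustrative only: thresholds and weights are made up, not a real
// photometric model. Requires C++17 for std::clamp.
struct Rgb { float r, g, b; };

inline Rgb ToneMap4D(Rgb cones, float rodTeal, float sceneLuminance)
{
    const float kScotopicMax = 0.01f; // below this: rods only ("1% of a candle")
    const float kPhotopicMin = 3.0f;  // above this: cones only ("3 candles")
    // t = 0 -> pure rod vision, t = 1 -> pure cone vision
    float t = std::clamp((sceneLuminance - kScotopicMax)
                         / (kPhotopicMin - kScotopicMax), 0.0f, 1.0f);
    // The rod signal is colourless: it contributes a teal-tinted grey.
    Rgb rods = { rodTeal * 0.5f, rodTeal * 0.9f, rodTeal * 1.0f };
    return { cones.r * t + rods.r * (1.0f - t),
             cones.g * t + rods.g * (1.0f - t),
             cones.b * t + rods.b * (1.0f - t) };
}
```

In bright scenes the output is just the RGB signal; in very dark scenes the colour information collapses into the teal-grey rod response, which is the "low-light look" described above.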

#5002203 Catering for multiple vertex definitions

Posted by Hodgman on 18 November 2012 - 08:10 PM

To add a counter-viewpoint --- using a single fat vertex format for all meshes, to avoid some possible future performance problem, is definitely a premature optimisation in my eyes.
Why not just use the API properly, and then take measures to merge objects into single batches later when profiling says you have to?

All of these actions will interrupt the pipeline and - while they won't directly cause pipeline stalls - they will cause a break in the flow of commands and data from the CPU to the GPU

That's a bit of an exaggeration. All any GL function that talks to the GPU does is write command packets into a queue (which is read by the GPU many dozen milliseconds later). Writing new command packets can't cause a break in the flow of already-written packets, nor will it somehow stall later packets.
On the GPU end, there is hardware reading/decoding this queue in parallel with actually doing the work. As long as you're submitting large enough groups of work (and GPU groups/batches don't correspond to 'batches' on the CPU, which are usually regarded as individual glDraw* calls -- the GPU can merge multiple CPU draws into a single group depending on conditions), then executing the groups will take longer than decoding them, so decoding (which includes applying/preparing state changes) is pretty much free.
i.e. when you're giving the GPU enough work per batch, the pipeline looks like:
Decode #1|Decode #2|   |Decode #3|
         | Run #1      | Run #2      | Run #3      |
and if you're not giving it enough work, it might look like:
Decode #1| Decode #2  | Decode #3  |
         |Run #1|stall|Run #2|stall|Run #3|
And yes, if you run into the second case, then to fix it you may want to increase the number of pixels/vertices processed per draw call, and one way to do that may be to merge shaders, which may in turn require the merging of vertex formats... But all that is an optimisation topic, which means it should be done under the supervision of a profiling tool.

N.B. the first pipeline diagram above actually has a 'break' between Decode #2 and Decode #3 (i.e. in the flow of commands from the CPU to the GPU), but this isn't a bad thing ;)

As for "saving precious space", this isn't about saving RAM. Yep, RAM is cheap and ever growing. The reason you want to save space is bandwidth.
Below are the specs on a high-end and low-end model GPU from 3 different generations of nVidia cards:
Model                Bandwidth@60Hz   Memory
------------------   --------------   ------
GeForce 8400 GS      109  MiB/frame   512MiB
GeForce 8800 Ultra   1.73 GiB/frame   768MiB

GeForce     205      137  MiB/frame   512MiB
GeForce GTX 285      2.65 GiB/frame   2GiB

GeForce GT  620      246  MiB/frame   1GiB
GeForce GTX 680      3.2  GiB/frame   4GiB
As you can see, the high-end cards can pretty much read or write every byte of their memory around once per frame, but the low-end cards can only touch a quarter of their RAM in any given frame.
Moreover, large parts of your RAM have to be read/written more than once in a frame -- render targets with blending will require multiple reads/writes per frame, texels will likely be read many times, VBOs are shared between different models and thus reused, and even within the drawing of a single mesh verts are shared between triangles (and will be redundantly reshaded upon cache miss, about half the time).
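The per-frame figures in that table are just quoted memory bandwidth divided by frame rate. As a sketch (the ~192 GB/s figure for a GTX 680-class card is an assumption, taken from typical published specs):

```cpp
// Per-frame bandwidth budget: bytes the GPU can move per frame at a given
// frame rate, given its total memory bandwidth.
inline double BytesPerFrame(double bytesPerSecond, double framesPerSecond)
{
    return bytesPerSecond / framesPerSecond;
}
// e.g. ~192 GB/s at 60Hz: 192e9 / 60 = 3.2e9 bytes per frame -- so even the
// high-end card can't touch its whole 4GiB much more than once per frame.
```

Every render-target blend, repeated texel fetch, and re-shaded shared vertex eats into that same budget, which is why shrinking vertex formats can matter.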

When you get to profiling, it's just as likely that some of the fixes you'll have to apply will be bandwidth-saving measures, which could be the opposite of the above -- e.g. splitting a single shader into multiple ones that take different vertex inputs, and sample different amounts of textures.

#5002047 Catering for multiple vertex definitions

Posted by Hodgman on 18 November 2012 - 08:51 AM

Well, you only need struct Vertex for programmatically created vertex data. Vertex data that is loaded from a file can be referred to via void*, and can use any layout.

For vertex formats, you can either have a hard-coded enum/list like in your example, and have the file specify a value from that list (usually you don't have too many unique formats, so this will be fairly maintainable), or, the file can actually encode the vertex format itself, with e.g. struct { int offset, type, size, stride, etc; } elements[numElements]; (which would allow people to use new formats without editing your code -- useful on bigger projects with more artists / tech-artists).
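A sketch of that second option -- the file carries an array of element descriptors, so new formats need no code changes. The struct and field names here are illustrative, not a fixed file format:

```cpp
#include <cstddef>
#include <vector>

// Self-describing vertex format: the file stores one of these per attribute
// instead of referencing a hard-coded enum. Names are illustrative.
struct VertexElement
{
    int offset;         // byte offset of this attribute within one vertex
    int componentType;  // e.g. a code meaning "float"
    int componentCount; // e.g. 3 for a position
};

struct VertexFormat
{
    int stride; // size of one whole vertex, in bytes
    std::vector<VertexElement> elements;
};

// e.g. the description a tool might emit for a position+normal+uv vertex:
inline VertexFormat MakePosNormalUv()
{
    const int kFloat = 0; // illustrative type code
    return VertexFormat{
        static_cast<int>(sizeof(float)) * 8,
        {
            { 0,                                    kFloat, 3 }, // position
            { static_cast<int>(sizeof(float)) * 3,  kFloat, 3 }, // normal
            { static_cast<int>(sizeof(float)) * 6,  kFloat, 2 }, // uv
        }
    };
}
```

At load time you'd walk the elements array and translate each entry into the equivalent glVertexAttribPointer / input-layout call.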

Yep, malicious/corrupt data will do bad things, but you can put the error checking code into the tool that generates your files.

#5002011 glVertexAttribPointer - understanding the size attribute

Posted by Hodgman on 18 November 2012 - 05:54 AM

It's the number of components that make up the attribute (e.g. x, y and z is 3 components).

e.g. if size is 3, and type is GL_FLOAT, then the corresponding C/C++ type is float[3] and the GLSL type is vec3.
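Here's how that maps onto an interleaved C++ vertex; the GL calls are shown only in comments, since this sketch doesn't create a GL context:

```cpp
#include <cstddef>

// How glVertexAttribPointer's 'size' parameter maps onto a C++ struct.
struct Vertex
{
    float position[3]; // size=3, type=GL_FLOAT -> vec3 in GLSL
    float texcoord[2]; // size=2, type=GL_FLOAT -> vec2 in GLSL
};

// With a real GL context, the matching attribute setup would look like:
//   glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
//                         (void*)offsetof(Vertex, position));
//   glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex),
//                         (void*)offsetof(Vertex, texcoord));
```

Note that 'size' counts components of one attribute, not bytes, and not the number of vertices.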

#5002009 How to represent a point using spherical harmonics?

Posted by Hodgman on 18 November 2012 - 05:51 AM

Well, as I mentioned, let's take the direction vector [0 0 1], with an infinite point light source.

Ah ok. The confusion was because an infinite/directional and a point/omni light are different things.
The details for directional lights are contained in the Stupid SH Tricks paper in the "Analytic Models" section.

#5001958 Traversal & Deletion in an STL List

Posted by Hodgman on 17 November 2012 - 11:06 PM

Erase returns a new iterator value; you should be using it instead of the dodgy "--" trick.
for (hordeIt zIt = horde.begin(); zIt != horde.end(); /*N.B. nothing here*/)
                        if (...)
                                zIt = horde.erase(zIt);//erase returns the next valid iterator
                        else
                                ++zIt;
2) Replacing STL data-structures with custom implementations that do the same thing isn't that helpful (especially when you replace them with buggy versions). Avoiding STL structures is more useful if you require another kind of structure altogether -- something other than a doubly-linked list, such as an enumerable pool of zombies, etc.
3) You were lucky in the game of russian-roulette that is undefined-behaviour. In your code, --zIt is making use of an invalidated iterator. Depending on the STL implementation, this could do anything (including 'work as intended').

#5001949 Problems with timeGetTime()

Posted by Hodgman on 17 November 2012 - 10:49 PM

You shouldn't be calling timeGetTime twice per frame like that -- the result of that code is that you're not timing your Draw/Update code at all, you're only timing the code that comes between the bottom line and the top line.
You can change it to only call timeGetTime once per frame like this:
static DWORD lastFrameTime = 0;
DWORD frameDelta;
DWORD thisFrameTime = timeGetTime();
if( lastFrameTime == 0 )//first frame!
  frameDelta = 0;
else
  frameDelta = thisFrameTime - lastFrameTime;
lastFrameTime = thisFrameTime;

P.S. timeGetTime isn't very accurate, so most people use QueryPerformanceCounter/QueryPerformanceFrequency instead, or you can use someone else's ready-made high accuracy timer, such as timer_lib.
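For what it's worth, the same once-per-frame pattern can be written portably with std::chrono's monotonic clock (this is an alternative the original Windows-specific code doesn't use):

```cpp
#include <chrono>

// Same structure as the timeGetTime version above, but using std::chrono's
// steady (monotonic) clock. Returns the elapsed time since the previous
// call, in seconds; 0.0 on the first call.
inline double FrameDeltaSeconds()
{
    using Clock = std::chrono::steady_clock;
    static Clock::time_point lastFrameTime;
    static bool firstFrame = true;

    Clock::time_point thisFrameTime = Clock::now();
    double frameDelta = 0.0;
    if (firstFrame) // first frame!
        firstFrame = false;
    else
        frameDelta = std::chrono::duration<double>(thisFrameTime - lastFrameTime).count();
    lastFrameTime = thisFrameTime;
    return frameDelta;
}
```

steady_clock can't jump backwards (unlike wall-clock time), which is exactly what you want for frame deltas.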

#5001918 Simulating CRT persistence?

Posted by Hodgman on 17 November 2012 - 08:24 PM

You can make an "accumulation buffer" just by creating a new render target (texture) and accumulating values into it.

e.g. to keep 10% of the previous frame around (and 1% of the frame before that, and 0.1% of the frame before that...)
Render scene to target #1.
Blend target #1 into target #2 with 90% alpha.
Display target #2 to screen.
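The three steps above reduce to simple arithmetic per pixel; here it is as a sketch using the 90%/10% weights from the example:

```cpp
// One accumulation step per frame: the target keeps (1 - newAlpha) of its
// previous contents, so old frames decay geometrically (0.1, 0.01, 0.001...).
inline float Accumulate(float accumulated, float newFrame, float newAlpha = 0.9f)
{
    return newFrame * newAlpha + accumulated * (1.0f - newAlpha);
}
```

e.g. after a white flash (1.0) followed by black frames (0.0), the ghost of the flash decays towards zero by a factor of 10 each frame -- which is the persistence effect.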

#5001902 Intel HD 2000 Performance estimation on this scenario?

Posted by Hodgman on 17 November 2012 - 07:22 PM

Intel HD 2000 is a GPU and most of those items are only relevant to the CPU.

Can a mobile quad core CPU for Intel HD 2000 keep up with a minimal frame rate of 25 in this case?

Yes... if all those things that you've listed take less than 40ms to compute. You'll have to add timers to your code to keep track of how long each part takes per frame, and then optimize them until they do take under 40ms.

As for the quad-core part -- do you use 4 threads?

#5001470 what would be the proper oop way to do this?

Posted by Hodgman on 16 November 2012 - 01:43 AM

I'll agree. OOP doesn't make a whole lot of sense here.

Who are you agreeing with?
"Not everything has to be it's own class in OO" != "OO is nonsensical here".
If everything in OOP had to be a class, we'd be writing code like below, which we don't --
Assertion( Comparison( Adder( Integer(1), Integer(1) ).Result(), Integer(2) ).Result() ).Check();

Yeah, that'll start a holy war. I should probably bow out at this point...

Yes, in a thread asking how to use OO design in a specific situation, you're just trying to OO-bash, poorly.

Please no one take the bait.

#5001429 What side skills are essential / noteable plusses for becoming a game physics...

Posted by Hodgman on 15 November 2012 - 09:10 PM

What happens when you need to search for a specific pattern in 500 MB of build log files stored on a remote build server? For someone familiar with the command line, that's a one-line call to ssh+grep. If you are stuck in GUI-land, it's likely to present a stumbling block...

In GUI land, you open a remote desktop connection to the server and then open the log in your favourite text editor.

YMMV, but I learned bash/grep/sed/etc in University and then never touched them again. Mostly because they don't exist on Windows, and there's not much point installing some Unix-style shell like Cygwin when a Python interactive shell is actually more powerful.
If I need to automate something, I'll use Visual Studio macros, or a Python script (or PHP, or JavaScript, or even C), or use Excel, or a DSL like CMake.

On the topic of every-programmer side skills - make sure you're comfortable with version control systems. Ideally, both a centralized one like Subversion/Perforce, and a distributed one like Git/Mercurial.