Member Since 14 Feb 2007

#5293573 Are Third Party Game Engines the Future

Posted by Hodgman on 26 May 2016 - 07:15 AM

Yeah it used to be that IDTech and Unreal were extremely popular in the AAA development space, now (to caricature the situation) CryEngine gets laughed at, Unreal gets derided for being Java-esque bloatware, and Unity still gets ignored by AAA devs.

Among the indie developers I know, CryEngine gets laughed at, Unreal has a small guild of zealous fans, and Unity is overwhelmingly popular (yet they all bitch about its flaws).


So if anything, proprietary engines in the AAA scene have kinda made a comeback in recent years!

Except behind the EA iron curtain, where Frostbite is basically their own internal "off the shelf engine" a la Unreal, which they've forced onto all of their studios, wiping out all their other proprietary engines in the process...

#5293570 NV Optimus notebook spend too much time in copy hardware queue?

Posted by Hodgman on 26 May 2016 - 07:09 AM

You could be using more GPU RAM than the GPU actually has available, which will cause D3D to constantly move textures/etc in and out of GPU RAM for you, possibly multiple times a frame. Try reducing your texture resolution drastically and see if the problem goes away.

#5293562 Do you usually prefix your classes with the letter 'C' or something e...

Posted by Hodgman on 26 May 2016 - 06:28 AM

For globals, something like g_ is almost mandatory IMHO, as they pollute their way into every scope.

For POD structs with no functions, I drop the m_ as you're always going to be accessing them as instance.member anyway.

For larger classes with decent sized functions, something like m_ for members is a great aid in readability and avoidance of common pitfalls -- note that the compiler isn't able to optimize read/writes of members in the same way that it does for local variables, so there is an important performance difference here.

e.g. Both of these are functionally equivalent, but version #2 will compile into much better code:

1) for( int i=0; i!=10; ++i ) { m_count++; printf("%d", m_count); }

2) int count = m_count; for( int i=0; i!=10; ++i ) { count++; printf("%d", count); } m_count = count;

Microsoft's "systems hungarian" prefixes such as m_lpstrName (32bit pointer to a null-terminated character array) and m_uiCount (unsigned 32bit integer) are just noisy garbage that make maintenance harder and readability worse IMHO :lol: but YMMV, especially if writing an operating system...

Note that the original "hungarian notation" is actually recommending that you write variable names like m_xPosition vs m_yPosition or m_byteSize vs m_rowSize, which is pretty sane advice.

#5293494 Do you usually prefix your classes with the letter 'C' or something e...

Posted by Hodgman on 25 May 2016 - 11:10 PM

I learned C++ on MSVC6, using Microsoft's conventions from the time, so I used to use CClass, IInterface, Function, m_datatypeMember, etc...


Nowadays I feel that the C prefix makes the code less readable, not more readable. You can very easily tell just by context that TextureManager is a class and not a function, without any need to clutter the code with these prefixes.

#5293465 responsiveness of main game loop designs

Posted by Hodgman on 25 May 2016 - 07:33 PM

My monitor can only display images at 60Hz, but I can run my sim loop at 600Hz. That means I can render once for every 10 updates, and the user is still getting an uber-responsive experience.


If I poll input once per 10 updates, the user probably won't know - it will still feel good... but if I poll input for every update, it will reduce the input latency by approx 0ms to 16ms, making it feel very slightly more responsive.
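A rough sketch of that kind of loop, assuming a fixed simulation step and one render per call. `LoopHooks`, `RunFrame` and the microsecond units are made up for illustration, not taken from any particular engine:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical hooks; a real game would call into its own input/sim/render code.
struct LoopHooks {
    int inputPolls = 0, updates = 0, renders = 0;
    void PollInput() { ++inputPolls; }
    void Update(std::int64_t /*stepMicros*/) { ++updates; }
    void Render() { ++renders; }
};

// Advances the simulation in fixed steps, then renders once.
// `accumulator` carries leftover time between calls. Returns the number of updates run.
int RunFrame(LoopHooks& hooks, std::int64_t frameMicros, std::int64_t stepMicros,
             std::int64_t& accumulator) {
    assert(stepMicros > 0);
    accumulator += frameMicros;
    int steps = 0;
    while (accumulator >= stepMicros) {
        hooks.PollInput();        // polled per update rather than per render
        hooks.Update(stepMicros);
        accumulator -= stepMicros;
        ++steps;
    }
    hooks.Render();
    return steps;
}
```

With a ~600Hz step and a ~60Hz frame, each call runs ten poll+update pairs and one render. Moving `PollInput` outside the `while` gives the cheaper once-per-render variant.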

#5293462 what good are cores?

Posted by Hodgman on 25 May 2016 - 07:27 PM

In any case, on the CPU side my personal feeling is that memory bandwidth isn't nearly as big a problem as latency when it comes to games. It's chaotic accesses and cache misses that kill us.
The GPU, on the other hand, can never have too much bandwidth. We're seeing some great new tech on that front with HBM(2) and GDDR5X.

Yeah, this^  :)
GPU's tend to have quite massive latency, but they can hide it because they're massively "hyperthreaded" -- after a thread requests some memory, it can switch to a different thread and do some ALU work there. With a long enough list of threads to switch to, latency stops being a problem.
Over on the CPU though, we've only got 1 or 2 hardware threads per-core, which doesn't make much room for latency hiding... x86/x64 also tries to hide latency by rearranging your instructions on the fly, to move memory requests sooner and instructions that depend on them later, but that only goes so far...

On the Cell SPU's, the programmer was given some great tools to solve this in software.
A typical CPU has registers, a multi-layered cache, and then RAM sitting on top. Your code makes requests to move data between RAM and registers, and the cache is this magic thing in the middle that is (mostly) invisible to the code.
On the SPU's, you had registers, a local store (similar to L1 cache), and RAM on top... but none of it was automatic. You couldn't actually access RAM directly from your code - and the local store (~L1) was not automatic, it was completely manually controlled! In order to do a read-modify-write operation, your code would have to:
* Make a request to copy some RAM into the LS. This is an async operation, so your code actually continues running while that request is fulfilled.
* Block on the request to ensure it completes - this must be done manually by you or you're in a race condition with the memory controller.
* MOV the data from LS into a register.
* Do some work.
* MOV the data from a register back into LS.
* Make a request to copy the value from the LS back into RAM.
* Block to ensure this request completes.

Compared to x86's traditional automatic caches, this model gives so much more power to the programmer. Sure, in the above example, memory latency will kill performance as we're making memory requests and then immediately waiting for them to be completed, but, a more typical usage pattern looked like:

* Request RAM->LS transfer for Job[N] inputs
* Request RAM->LS transfer for Job[N+1] inputs
* block on Job[N] inputs transfer
* Run Job[N]
* block on Job[N-1] outputs transfer
* Request LS->RAM transfer for Job[N] outputs
* N++

i.e. you're pre-fetching ALL the input data for a job into LS, and you're doing it one job in advance. As long as each job itself is longer than the memory latency, then you never actually wait on memory, as it's being transferred concurrently with processing.
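The pipelined pattern above can be imitated on an ordinary PC. This is only a sketch: `std::async` stands in for the asynchronous DMA requests, a `std::vector<int>` stands in for a job's slice of local store, and `RequestInputs`, `RunJob` and `RunPipelined` are invented names, not SPU APIs:

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

using Block = std::vector<int>;

// Hypothetical stand-in for an async RAM->LS request: copies one job's inputs on another thread.
std::future<Block> RequestInputs(const std::vector<Block>& ram, std::size_t job) {
    return std::async(std::launch::async, [&ram, job] { return ram[job]; });
}

// The "work" done on data that is already local: here it just doubles every element.
void RunJob(Block& localStore) {
    for (int& v : localStore) v *= 2;
}

// Runs every job while keeping the next job's inputs in flight.
std::vector<Block> RunPipelined(const std::vector<Block>& ram) {
    std::vector<Block> results(ram.size());
    if (ram.empty()) return results;
    std::future<Block> current = RequestInputs(ram, 0);
    std::future<void> pendingOutput;                     // LS->RAM transfer of the previous job
    for (std::size_t n = 0; n < ram.size(); ++n) {
        std::future<Block> next;
        if (n + 1 < ram.size()) next = RequestInputs(ram, n + 1); // request Job[N+1] inputs
        Block ls = current.get();                        // block on Job[N] inputs
        RunJob(ls);                                      // run Job[N]
        if (pendingOutput.valid()) pendingOutput.get();  // block on Job[N-1] outputs
        pendingOutput = std::async(std::launch::async,   // request LS->RAM for Job[N] outputs
            [&results, n, out = std::move(ls)]() mutable { results[n] = std::move(out); });
        current = std::move(next);
    }
    if (pendingOutput.valid()) pendingOutput.get();      // drain the last output transfer
    return results;
}
```

On real hardware the win comes from the transfer being free while the job runs. Here the thread hand-offs cost far more than the copies, so treat it purely as an illustration of the ordering.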

IMHO, it would be a massive boon to us if Intel/AMD added instructions that allowed us to reserve a portion of the L1 cache for ourselves, for use as a software-controlled rather than automatic cache region.

The hard part of this model is (A) making sure that the size of your Job's inputs and outputs is small enough to fit into a fraction of the Local Store size (so multiple jobs can be resident in there concurrently)... and (B) forcing your programmers to declare the address-ranges of all input and output memory regions of a job.
That latter one is a very foreign programming model to anyone who's been raised on OOP -- in OOP, data access patterns (and flow control) are often spaghetti and impossible to untangle...

So basically, you're forced to reorganize your code into a graph of Input->Process->Output nodes -- which is one reason why people say that PS3 development was hard :)
Once you do this though, it actually benefits the other SKU's of your game too. These Input->Process->Output graphs actually run really well on other CPU architectures, and code that's structured in this way is really easy to execute in parallel on any kind of multi-core CPU too.

#5293350 what good are cores?

Posted by Hodgman on 25 May 2016 - 07:45 AM

and this seems to have introduced responsiveness issues.  seems to me there's something fundamentally wrong about doing an update when you have yet to render the results of the previous update. what's the player supposed to do? guess one update ahead of time what input they should enter? things like this may explain why you can quickly wiggle the mouse in skyrim and stop, and it takes the screen half a second to respond, both to you moving the mouse, and your stopping moving the mouse. with lag like that, no wonder combat is so sh*tty.  that may be why combat in Caveman 3.0 is so fast and furious compared to most games. its pretty much guaranteed to render, get input, and update in 66ms max.

Isn't 33ms still more responsive than 66ms? :wink:

You also need to be aware that D3D/GL like to buffer an entire frame's worth of rendering commands, and only actually send them to the GPU at the end of the frame, which means the GPU is always 1 or more frames behind the CPU's timeline.

Say a game polls input at 60Hz, renders/updates in sequence at 60Hz, the GPU renders at 60Hz, and your LCD buffers for one frame at 60Hz.
That's 33-67ms on a CRT, and 50-83ms on an LCD, depending on how late in the frame the button is pressed.

0         16.6      33.3      50        66.6           83.3
| Frame 1 | Frame 2 | Frame 3 | Frame 4   | Frame 5    | 
|Press    |Poll     | GPU Draw|Scan out   | LCD Visible|
|         | Update  |         |CRT Visible|            |
|         |  Draw   |         |           |            |

Say a game polls input at 60Hz, renders/updates in parallel at 60Hz (w/ buffering), the GPU renders at 60Hz, and your LCD buffers for one frame at 60Hz.
That's 50-83ms on a CRT, and 67-100ms on an LCD, depending on how late in the frame the button is pressed.

0         16.6      33.3      50        66.6        83.3         100
| Frame 1 | Frame 2 | Frame 3 | Frame 4 | Frame 5   | Frame 6    | 
|Press    |Poll     | Draw    | GPU Draw|Scan out   | LCD Visible|
|         | Update  |         |         |CRT Visible|            |

Say a game polls input/renders/updates in sequence at 15Hz, the GPU renders at 15Hz, and your LCD buffers for one frame at 60Hz.
That's 133-217ms on a CRT, and 150-233ms on an LCD, depending on how late in the frame the button is pressed.

0         66.6      133.3     200         216.6        233.3
| Frame 1 | Frame 2 | Frame 3 | Frame 4   | Frame 5    | 
|Press    |Poll     | GPU Draw|Scan out   | LCD Visible|
|         | Update  |         |CRT Visible|            |
|         |  Draw   |         |           |            |

So Destiny is still twice as responsive as Caveman, despite the buffering. Framerate actually is key here - as three frames of latency at 60Hz is still more responsive than a single 15Hz delay.
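The ranges quoted above fall out of some simple frame counting. A small sketch of that arithmetic, where `InputLatency` and its parameters are my own naming and the model is only the simplified one drawn in the timelines:

```cpp
struct LatencyMs { double min, max; };

// pipelineStages: game-rate frames between the frame the button is pressed in and scan-out
//   (2 for poll/update/draw + GPU draw, 3 when drawing runs a frame behind the update).
// displayBufferFrames: extra display-rate frames of buffering (0 for the CRT, 1 for the LCD).
LatencyMs InputLatency(double gameHz, double displayHz, int pipelineStages,
                       int displayBufferFrames) {
    const double gameFrame = 1000.0 / gameHz;
    const double displayFrame = 1000.0 / displayHz;
    const double buffered = displayBufferFrames * displayFrame;
    // Best case: pressed right at the end of its frame, seen at the start of scan-out.
    const double best = pipelineStages * gameFrame + buffered;
    // Worst case: pressed right at the start of its frame, seen at the end of scan-out.
    const double worst = (pipelineStages + 1) * gameFrame + displayFrame + buffered;
    return { best, worst };
}
```

For example, `InputLatency(60, 60, 3, 1)` covers the buffered 60Hz LCD case and `InputLatency(15, 60, 2, 1)` the 15Hz LCD case.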

As for half-second mouse input latency... yeah you do see that in some games, and it is a stupid thing to let happen! And yes this may be the graphics mafia pushing framerate too much :lol:
This generally happens because the CPU is so much faster than the GPU, that the CPU spends all its time sitting around idle. Your GPU driver decides this is bad, so it allows the CPU to start rendering the next frame, even though the GPU is still busy. This continues until you've got 5 frames worth of D3D commands sitting around in some overflowing queue, and your game has no idea that this has occurred. As far as you know, you've told D3D/GL to draw some objects and flip some back-buffers, but the driver has lied to you and just put all these commands into a 5-frame long queue, which might add up to 200ms+ worth of extra latency for no reason.
The workaround to defeat nasty graphics drivers here, is to issue a new GPU query/event every frame (or draw to a 1px render target), and at the start of each frame, ask D3D/GL to retrieve the value of the query/event (or the contents of that render target) from one, two or three frames ago (depending on how much you want to prioritize latency over efficiency). This enforces a maximum latency on the command queue that the driver can't possibly circumvent.
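The bookkeeping for that workaround is small. Here's an API-agnostic sketch: `FrameLatencyLimiter` is a made-up helper, and `Fence` is a placeholder for whatever you wrap around a D3D event query, a GL sync object, or a 1px render-target readback. All it needs is a blocking `Wait()`:

```cpp
#include <cstddef>
#include <deque>
#include <utility>

// Fence is any type with a Wait() method that blocks the CPU until the GPU
// has passed the point where the fence was issued (hypothetical interface).
template <class Fence>
class FrameLatencyLimiter {
public:
    explicit FrameLatencyLimiter(std::size_t maxQueuedFrames)
        : m_maxQueuedFrames(maxQueuedFrames) {}

    // Call at the start of a frame: waits on the oldest fences until fewer than
    // maxQueuedFrames earlier frames are still outstanding.
    void BeginFrame() {
        while (!m_fences.empty() && m_fences.size() >= m_maxQueuedFrames) {
            m_fences.front().Wait();
            m_fences.pop_front();
        }
    }

    // Call at the end of a frame with a fence issued after the frame's last command.
    void EndFrame(Fence fence) { m_fences.push_back(std::move(fence)); }

    std::size_t QueuedFrames() const { return m_fences.size(); }

private:
    std::size_t m_maxQueuedFrames;
    std::deque<Fence> m_fences;
};
```

With `maxQueuedFrames` set to 2, the start of frame N blocks on the fence from frame N-2, so the driver can never have more than two frames queued up behind your back.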

#5293339 what good are cores?

Posted by Hodgman on 25 May 2016 - 06:47 AM

I believe it's actually pretty hard to usefully use more than 1-2 cores full time.

The links I posted above show AAA games successfully doing this :)

Even if you can't keep N cores busy full time, it's pretty easy to get them all 100% busy for short bursts.
It's very common to see profiler output from a typical game these days that looks like:


i.e. The GPU is the bottleneck, working flat out. There's one thread that's overburdened by single-threaded code, which occasionally goes idle to let the GPU catch up, and there's a bunch of workers that sit around idle often, but occasionally go full tilt as the main thread runs into parallel workloads.

The goal when optimizing this kind of game is to move as much stuff off the "main thread" as you can in order to take advantage of all that "free" CPU time on the other cores. Even if you don't get good efficiency out of it, you may as well use that time up and reduce the length of the critical path somewhat.

On a well structured game though, the imbalance between that "main thread" and those mostly-idle "worker threads" is not identifiable. Lots of games have profiler output that looks more like:


Or, I hate to post this because job-based/thread-pool frameworks are IMHO better than thread-per-system frameworks  :D but here's my game running with the job system disabled, so each system is tied to one thread in the pool instead of broadcasting their workloads to all threads in the pool:
Split over 3 threads for a non-hyper-threaded quad-core CPU (4th core left alone by me, so that middleware and drivers can camp out there).

The GPU is the bottleneck at 4ms a frame. The renderer takes about 2ms (~6000 draw calls), so then idles in Present for about 2ms so as to not get too far ahead of the GPU. Gameplay takes about 2ms and then idles for about 2ms so as to not get too far ahead of the renderer. Special effects and HUD (HTML :() deferred drawing commands ironically take around 4ms, so are also almost a bottleneck.

With the job system enabled, work from all three of those threads is able to be scheduled across any of them, which fills in that big "Yield" in the middle nicely, and reduces the time that the top thread wastes inside "Present" waiting on the GPU  :)

Also, if I turned on triple buffering of my frame data, then you'd see two of the middle-row's simulation loops back to back, before it Yielded due to being two frames ahead of the rendering system :) It's already dropping some frames in the above picture though - centre left shows a "TimeStep" task, which contains three red "Ste[p]" blocks below it. Each one of those is a simulation frame, so in this case, the renderer is only running once every three sim updates, using interpolated data. I could also increase the simulation frequency and watch all that CPU idle time disappear :lol:

I have observed this situation quite often in a preprocessing tool using OpenMP:
Do a very simple thing like building a mip-map level in parallel: Speed up is 1.5... very disappointing
Do a complex thing like ray tracing: Speed up is 4... yep - that's the number of cores
My conclusion is that memory bandwidth limits hurt the mip-map generation.
I assume it would be faster to do mips and tracing at the same time, so memory limit is hidden behind tracing calculations.
Are there any known approaches where a job system tries to do this automatically with some info like job.m_bandWidthCost?
I've never heard of something like that.

You find the same situation when optimizing GPU code, where you've got thousands of threads to try and keep busy :)
Memory bandwidth is the bottleneck these days. Processors approximately double in speed every two years (yes, that's a gross simplification of Moore's law), while RAM doubles in speed every 10 years... So in the same time span that we get a 32x CPU improvement, we only get a 2x RAM improvement... Or in other words, compared to CPU speeds, every 10 years RAM actually gets 16x slower!!
Tasks that only involve moving memory around are going to be slow no matter how many cores you throw at them -- what you need is more memory buses :)


Yep, doing a memory intensive task at the same time as an ALU intensive task is a good way to utilize the available resources -- assuming your CPU has something like hyperthreading where a core can be doing multiple things at once. GPU's are amazing at this, where one GPU compute unit (basically a core) can have, say, 10 thread-groups in flight at once. As a thread-group is forced to idle due to memory latency, it can switch over to making some progress on a different thread-group. This can reduce an actual memory latency of 1000 cycles to a perceived memory latency of much, much less. CPU's aren't quite that capable of thread-juggling yet though :wink:

In my experience, the perf-balancing of jobs tends to be done manually. Programmers choose magic numbers to use to say how much data is to be partitioned into each job within a single job-list, and how large jobs should be in general (e.g. 1k objects per frustum cull job). I imagine that if you built a good enough profiling tool into your engine, and had a good enough dynamic code-gen / data-driven job-graph system, you could try to automate the optimization of your jobs...
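The manual version of that partitioning is about as simple as it sounds. A sketch, with `JobRange` and `PartitionIntoJobs` as illustrative names and the batch size being exactly the kind of hand-picked magic number described above:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct JobRange { std::size_t begin, end; };   // half-open range [begin, end)

// Splits itemCount items into jobs of at most itemsPerJob items each.
std::vector<JobRange> PartitionIntoJobs(std::size_t itemCount, std::size_t itemsPerJob) {
    std::vector<JobRange> jobs;
    if (itemsPerJob == 0) return jobs;         // nothing sensible to do with a zero batch size
    for (std::size_t begin = 0; begin < itemCount; begin += itemsPerJob)
        jobs.push_back({ begin, std::min(begin + itemsPerJob, itemCount) });
    return jobs;
}
```

So 2500 objects at 1000 per frustum-cull job gives three jobs, the last one only half full. An automated tuner would be adjusting that second argument from profiling data.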

#5293212 what good are cores?

Posted by Hodgman on 24 May 2016 - 08:38 AM

Hodgman: Moreover, in D3D11, you can perform resource management tasks on any thread, and create multiple command buffers, to record draw/state commands on many threads (before a single "main" thread submits those command buffers to the GPU). D3D12/Vulkan head further in this direction.

Can you give a link so I can read more? I would love to be able to manage texture resources in a separate thread.....

The ID3D11Device interface is thread-safe -- it's got all the routines for creating/destroying resources. If you need Map or UpdateSubresource, you can use them on a secondary ID3D11DeviceContext created from ID3D11Device::CreateDeferredContext and then have the main thread execute the command buffer generated from that context. Alternatively, you can call Map on the main context, pass the pointer to a secondary thread, have it write data into the resource, then have the main thread call Unmap.

#5293204 what good are cores?

Posted by Hodgman on 24 May 2016 - 08:05 AM

Every game I've worked on for current-gen and previous-gen consoles (and PC in the past 5-10 years) has used a job system at its core. All of these games wouldn't have been able to ship without it -- they would've been over the CPU budget and not been able to hit 30Hz / 60Hz.
This model doesn't associate particular systems with particular threads. Instead you break the processing of every system down into small tasks (jobs), and let every thread process every job. Physics can then use N threads, Rendering can use N threads, AI can use N threads -- so you make full use of a Cell CPU with a single dual-threaded PowerPC plus 6 SPU's (8 threads, two instruction sets!), a tri-core dual-threaded PowerPC (6 threads), a dual quad-core AMD CPU-pair (8 threads), or a hyper-threaded hexa-core PC (12 threads)...

Yep, this has been much more important for console devs, as consoles have shitty CPU's! But, there's also a shitload more performance in a modern PC just going to waste if you're still writing single-threaded games.
Some recent presentations:
The Last of Us: Remastered:


it seems that very little of the basic tasks in a game are parallel in nature.

That's because you haven't practiced it yet. Pretty much everything in your game is capable of using multiple cores!
If you want an example, try a functional language like Erlang for a while, and see how data dependencies can be broken down at quite a fine-grained level to expose parallelism without even trying.

ideally, each basic task would have one or more processors dedicated to it (how about one processor per entity? <g>) , update would update some drawing data for render from time to time and set a flag that it had posted new data for exchange (to be passed to render). same idea with input passing info to update, or perhaps handling the input directly. render and audio would just hum along, checking for new data to display or play.

This is one of the early models that people used when the Xbox360/PS3 were thrown at us, with their shittily-performing-yet-numerous-cores.
In my experience, it's largely died off in favor of job systems. The dedicated thread-per-system with message passing model does still see some use in specialized areas, such as high-frequency input handling (e.g. a 500Hz thread that polls input devices), audio output, filesystem interactions, or network interactions... but for gameplay / simulation / rendering it's not popular any more.

in render, only one CPU can talk to the GPU at a time

Actually calling D3D/GL functions is only one small part of the renderer -- scene traversal, sorting, etc can all be done across multiple cores. Moreover, in D3D11, you can perform resource management tasks on any thread, and create multiple command buffers, to record draw/state commands on many threads (before a single "main" thread submits those command buffers to the GPU). D3D12/Vulkan head further in this direction.

but it seems that in both render and movement you must at some point merge back to a single chokepoint thread, to marshal all your parallel results together and apply them.

With a job system, you're "going wide" and then having dependent tasks wait on those results constantly.
Note that to do this you don't use mutexes/locks/semaphores/etc very much (or god forbid: volatile) -- the idea that you should be using to schedule access to data is a directed-acyclic-graph of jobs and something akin to dataflow programming / stream processing / flow based programming.

Some middleware has started supporting this model too. PhysX allows you to plug your engine's job system into its work dispatcher, so that it will automatically spread all of its physics calculations across all of your threads.
In the ISPC language, they support a basic job model with the launch and sync language keywords, which are likewise capable of being hooked into your engine's job system:
e.g. instead of a single-threaded loop:

void DoStuff( uniform uint i );
for( uniform uint i=0; i!=numTasks; ++i )
    DoStuff(i);//run each task in turn on the calling thread

You can easily write a many-core parallel version as:

void DoStuff( uniform uint i );
inline task void DoStuff_Task() { DoStuff(taskIndex); }//wrapper, pass magic taskIndex as i
launch[numTasks]  DoStuff_Task();//go wide, running across up to numTasks threads
sync;//wait for the tasks to finish

Note that the above doesn't actually create any new threads. In my engine, one thread is created per core at startup, and those threads are always running, waiting for jobs to enter their queues. The above launch statement will push jobs into those queues and wake up any idle cores to get to work. That sync statement will block the ispc code until the launched jobs have all completed -- the thread that was running the ispc code will also be hijacked by the job system, and will execute job code until that point in time!
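For a feel of what sits behind launch/sync, here's a deliberately tiny C++ sketch. It is not my engine's job system: `JobSystem`, `Launch` and `Sync` are illustrative names, and it assumes a single shared queue behind one mutex for brevity, where a real system would use per-thread queues or lock-free structures.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class JobSystem {
public:
    explicit JobSystem(unsigned workerCount) {
        for (unsigned i = 0; i < workerCount; ++i)   // threads are created once, up front
            m_workers.emplace_back([this] { WorkerLoop(); });
    }
    ~JobSystem() {
        { std::lock_guard<std::mutex> lock(m_mutex); m_quit = true; }
        m_wake.notify_all();
        for (std::thread& t : m_workers) t.join();
    }
    // "launch": queue numTasks jobs, each receiving its own task index.
    void Launch(std::size_t numTasks, std::function<void(std::size_t)> fn) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            for (std::size_t i = 0; i < numTasks; ++i)
                m_queue.push_back([fn, i] { fn(i); });
            m_pending += numTasks;
        }
        m_wake.notify_all();                         // wake any idle workers
    }
    // "sync": the calling thread runs queued jobs itself, then waits for stragglers.
    void Sync() {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_pending != 0)
            if (!RunOne(lock)) m_done.wait(lock);
    }
private:
    // Pops and runs one job. Expects the lock held on entry and returns with it held.
    bool RunOne(std::unique_lock<std::mutex>& lock) {
        if (m_queue.empty()) return false;
        std::function<void()> job = std::move(m_queue.front());
        m_queue.pop_front();
        lock.unlock();
        job();
        lock.lock();
        if (--m_pending == 0) m_done.notify_all();
        return true;
    }
    void WorkerLoop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        for (;;) {
            if (RunOne(lock)) continue;
            if (m_quit) return;
            m_wake.wait(lock);
        }
    }
    std::mutex m_mutex;
    std::condition_variable m_wake, m_done;
    std::deque<std::function<void()>> m_queue;
    std::size_t m_pending = 0;
    bool m_quit = false;
    std::vector<std::thread> m_workers;
};
```

Usage mirrors the ispc snippet: `jobs.Launch(numTasks, [](std::size_t i) { DoStuff(i); }); jobs.Sync();`, where `DoStuff` is whatever per-task function you have. Each task should only write to its own slice of the output, which is what lets the user code stay lock-free.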


There's also libraries such as TBB or compiler-extensions such as OpenMP that add these kinds of constructs to C++, or you can write them yourself easily enough :wink:

#5293201 Mapping a loaded resource fails

Posted by Hodgman on 24 May 2016 - 07:47 AM

You pass a NULL D3DX11_IMAGE_LOAD_INFO struct when loading the image. Instead, make one of these so you can add some extra options to how the texture is created. Specifically, you want to set the Usage and CpuAccessFlags fields.


However... if you want to load this image data to the CPU, then D3DX11CreateShaderResourceViewFromFileW is a bad way to go about it. This function loads the image data to the CPU, copies it to the GPU, and throws away the CPU copy... then you go and ask D3D11 to copy the data from the GPU back to the CPU for you to read... which is possible, but requires these extra creation options I mentioned above, which will prevent the driver from optimizing the GPU memory location/format of this texture...


It would be better to use a different image loading library.

#5293198 Local hash

Posted by Hodgman on 24 May 2016 - 07:13 AM


std::unordered_map<> is a hash table. If you don't need to keep the items in your map sorted by key, it's very likely that you can use unordered_map instead of map, and in many cases the resulting code will be faster (e.g., if you have lots of entries (because hash tables have better asymptotic performance), or if key comparisons are expensive).

With "lots" being "several thousand." A std::map<> with a string key uses simple string comparison, so O(1) on the average length of the string keys (almost always very short) and O(log n) searches on the number of keys.  Unless you choose a custom string hash and bucket sizes very carefully, or have a huge data set you're mapping, std::map<> with std::string keys will beat std::unordered_map<> every time, in terms of speed.  It will also beat it when trying to debug, because the data is ordered making it easier to find missing or incorrect values (I calculate developer time as the most expensive resource I manage).
If your key is an integral value, std::unordered_map<> is sometimes a better bet because of memory allocation characteristics, but metrics are always your friend there.  Fortunately, the standard map classes are almost entirely interchangeable at the code level so switching is easy.
For smaller data sets (on the order of hundreds) a std::vector<> almost always beats any other collection in terms of memory and speed.  If it's created once and treated read-only it will always beat a std::map<> for lookup speed, but std::map<> usually still has simpler code for the same functionality.


I found this a bit hard to believe, so I just tested it with "lots" being 10/100/1000/10000/100000.
MSVC 2012 (v11) x64. Full optimizations, and all std library safety features disabled.
Test code: http://pastebin.com/Y3rDK6Un
Creates string->int associations in a map, then does string->int lookups.
At every size, unordered_map beat map, but at size 10 it was close.
At size 10, a dumb, unsorted (linear search) vector beats map/unordered_map.
At size 100, unordered_map overtakes the dumb vector.
At size 1k, the dumb vector starts getting pretty bad.

I didn't try a sorted vector.
rde::hash_map is slightly better than std::unordered_map.
A custom solution out of a game engine trounces the generic implementations... even the thread safe (multi-producer/multi-consumer) MpmcHashTable beat them at all sizes :o
n.b. CheatTable is using void*'s copied out of the words array instead of strings, and is using the pointer value itself as an integral hash value, which is cheating... It actually happens to produce the same result as the other tables here -- I guess my compiler happened to merge duplicate string literals in my words array, making a pointer comparison just happen to work as a string comparison. It's in there as a theoretical limit.
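For reference, the sorted-vector variant that I didn't test could look something like the sketch below. `SortedVectorMap` is a made-up name, it isn't part of the pastebin test code, and there are no timings for it here:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Key/value pairs kept sorted by key in one contiguous array.
class SortedVectorMap {
public:
    using Entry = std::pair<std::string, int>;

    // Binary search for the slot, then shift later elements to make room (O(n) worst case).
    void Insert(const std::string& key, int value) {
        auto it = std::lower_bound(m_entries.begin(), m_entries.end(), key, KeyLess);
        if (it != m_entries.end() && it->first == key) it->second = value;  // overwrite
        else m_entries.insert(it, Entry(key, value));
    }

    // O(log n) lookup via binary search. Returns nullptr when the key is absent.
    const int* Find(const std::string& key) const {
        auto it = std::lower_bound(m_entries.begin(), m_entries.end(), key, KeyLess);
        return (it != m_entries.end() && it->first == key) ? &it->second : nullptr;
    }

private:
    static bool KeyLess(const Entry& entry, const std::string& key) { return entry.first < key; }
    std::vector<Entry> m_entries;
};
```

Lookups get the same O(log n) string comparisons as std::map but over contiguous memory, while inserts pay for shifting elements. Whether that beats the hash tables at any of these sizes would need measuring.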

std::map             size 10, Write: 0.065574 ms, Read: 0.006733 ms. Total: 0.072307. Result = 450
std::unordered_map   size 10, Write: 0.045082 ms, Read: 0.005269 ms. Total: 0.050351. Result = 450
VectorMap            size 10, Write: 0.007026 ms, Read: 0.007319 ms. Total: 0.014344. Result = 450
rde::hash_map        size 10, Write: 0.005855 ms, Read: 0.005269 ms. Total: 0.011124. Result = 450
eight::CheatTable    size 10, Write: 0.002635 ms, Read: 0.001756 ms. Total: 0.004391. Result = 450
eight::HashTable1    size 10, Write: 0.002927 ms, Read: 0.002342 ms. Total: 0.005269. Result = 450
eight::HashTable2    size 10, Write: 0.002927 ms, Read: 0.003513 ms. Total: 0.006440. Result = 450
eight::MpmcHashTable size 10, Write: 0.004684 ms, Read: 0.003806 ms. Total: 0.008489. Result = 450

std::map             size 100, Write: 0.324942 ms, Read: 0.076405 ms. Total: 0.401348. Result = 82620
std::unordered_map   size 100, Write: 0.218970 ms, Read: 0.045960 ms. Total: 0.264931. Result = 82620
VectorMap            size 100, Write: 0.149883 ms, Read: 0.143736 ms. Total: 0.293619. Result = 82620
rde::hash_map        size 100, Write: 0.048302 ms, Read: 0.038056 ms. Total: 0.086359. Result = 82620
eight::CheatTable    size 100, Write: 0.017564 ms, Read: 0.014930 ms. Total: 0.032494. Result = 82620
eight::HashTable1    size 100, Write: 0.021370 ms, Read: 0.017564 ms. Total: 0.038935. Result = 82620
eight::HashTable2    size 100, Write: 0.030738 ms, Read: 0.030152 ms. Total: 0.060890. Result = 82620
eight::MpmcHashTable size 100, Write: 0.038056 ms, Read: 0.034251 ms. Total: 0.072307. Result = 82620

std::map             size 1000, Write: 1.337533 ms, Read: 0.902227 ms. Total: 2.239761. Result = 9816120
std::unordered_map   size 1000, Write: 0.694089 ms, Read: 0.432671 ms. Total: 1.126760. Result = 9816120
VectorMap            size 1000, Write: 1.533962 ms, Read: 1.538061 ms. Total: 3.072023. Result = 9816120
rde::hash_map        size 1000, Write: 0.392858 ms, Read: 0.375587 ms. Total: 0.768445. Result = 9816120
eight::CheatTable    size 1000, Write: 0.152225 ms, Read: 0.147249 ms. Total: 0.299474. Result = 9816120
eight::HashTable1    size 1000, Write: 0.162764 ms, Read: 0.152811 ms. Total: 0.315575. Result = 9816120
eight::HashTable2    size 1000, Write: 0.368561 ms, Read: 0.280446 ms. Total: 0.649007. Result = 9816120
eight::MpmcHashTable size 1000, Write: 0.329626 ms, Read: 0.327870 ms. Total: 0.657496. Result = 9816120

std::map             size 10000, Write: 8.002951 ms, Read: 7.517586 ms. Total: 15.520537. Result = 998151120
std::unordered_map   size 10000, Write: 4.751770 ms, Read: 4.338713 ms. Total: 9.090483. Result = 998151120
VectorMap            size 10000, Write: 14.235112 ms, Read: 14.102500 ms. Total: 28.337612. Result = 998151120
rde::hash_map        size 10000, Write: 3.685315 ms, Read: 3.499425 ms. Total: 7.184740. Result = 998151120
eight::CheatTable    size 10000, Write: 0.809429 ms, Read: 0.824066 ms. Total: 1.633494. Result = 998151120
eight::HashTable1    size 10000, Write: 1.470145 ms, Read: 1.417744 ms. Total: 2.887889. Result = 998151120
eight::HashTable2    size 10000, Write: 2.505569 ms, Read: 2.508790 ms. Total: 5.014359. Result = 998151120
eight::MpmcHashTable size 10000, Write: 2.883498 ms, Read: 2.860371 ms. Total: 5.743869. Result = 998151120

std::map             size 100000, Write: 71.047925 ms, Read: 68.560212 ms. Total: 139.608137. Result = 1197253312
std::unordered_map   size 100000, Write: 42.130100 ms, Read: 41.244851 ms. Total: 83.374951. Result = 1197253312
VectorMap            size 100000, Write: 142.025591 ms, Read: 142.070088 ms. Total: 284.095679. Result = 1197253312
rde::hash_map        size 100000, Write: 36.894136 ms, Read: 36.317437 ms. Total: 73.211573. Result = 1197253312
eight::CheatTable    size 100000, Write: 8.363608 ms, Read: 8.631758 ms. Total: 16.995366. Result = 1197253312
eight::HashTable1    size 100000, Write: 13.579372 ms, Read: 13.788097 ms. Total: 27.367469. Result = 1197253312
eight::HashTable2    size 100000, Write: 25.309208 ms, Read: 25.565356 ms. Total: 50.874563. Result = 1197253312
eight::MpmcHashTable size 100000, Write: 29.521164 ms, Read: 29.549560 ms. Total: 59.070723. Result = 1197253312

Result is in there to see if the different structures happen to agree on the result of the test, and to thwart the optimizer.

#5293188 Manual generation of mip maps?

Posted by Hodgman on 24 May 2016 - 05:30 AM

I've got no monogame experience... Which version of D3D is it using?

In D3D11, a texture resource is separate from "views" of that resource. A render-target view is basically a link to a particular mip-level in the texture resource. So, you can make an array of RTV's for your texture's mip-chain, and then render to each of them one by one.

In D3D9, a "surface" acts similarly to an RTV. If you've got a renderable texture with a mip-chain, you can use IDirect3DTexture9::GetSurfaceLevel to obtain one writable surface for each mipmap, which can be bound as a render-target.

#5292963 Phong model BRDF

Posted by Hodgman on 22 May 2016 - 07:25 PM

That makes a hell of a lot more sense...  :wub:

#5292853 Phong model BRDF

Posted by Hodgman on 22 May 2016 - 06:16 AM

I don't get it. When does theta ever become negative? When it's 0, n and l point to the same direction, so how can the surface be backfacing?


One more question: in the shader code itself, gl_FragColor will eventually be Lo (outgoing radiance), or the BRDF itself?

Oh, hahah I typed all that and didn't notice the problem...

It should actually be testing whether cos(θ) is above or below zero -- i.e. whether θ is below or above 90º!


Yeah in a traditional Phong shader, Lo is the result.
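A CPU-side sketch of that test in a traditional Phong shader, written in C++ rather than GLSL. `Vec3`, `PhongLo` and the scalar `kd`/`ks`/`shininess`/`lightRadiance` parameters are illustrative assumptions, not the code from this thread:

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// n, l, v: unit-length surface normal, direction to the light, direction to the viewer.
// Returns Lo, the outgoing radiance, using the classic (non-normalized) Phong terms.
float PhongLo(Vec3 n, Vec3 l, Vec3 v, float kd, float ks, float shininess, float lightRadiance) {
    const float cosTheta = Dot(n, l);
    if (cosTheta <= 0.0f) return 0.0f;   // theta >= 90 degrees: light is behind the surface
    // r is l mirrored about n.
    const Vec3 r = { 2.0f * cosTheta * n.x - l.x,
                     2.0f * cosTheta * n.y - l.y,
                     2.0f * cosTheta * n.z - l.z };
    const float specular = std::pow(std::max(Dot(r, v), 0.0f), shininess);
    return lightRadiance * (kd * cosTheta + ks * specular);
}
```

The early-out checks the sign of cos(θ), not θ itself, and the value returned is what a shader would write to gl_FragColor.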