HodgmanMember Since 14 Feb 2007
- Member Title Moderator - APIs & Tools
- Birthday December 18, 1984
Expert Community Member
- Website URL http://www.22series.com
Posted by Hodgman on 01 November 2013 - 02:34 AM
*In theory* that can give a 4x improvement over the old ones. In practice, you'll probably be bottlenecked by RAM, or non-SIMD-friendly code.
As above, Moore's law doesn't predict 2x performance every two years, just 2x complexity. These CPUs, 6 years apart, are 8x more complex as predicted, and could potentially offer 8x performance increases in some cases -- but yes, in other cases it might be even less than a 2x increase.
Be careful when benchmarking a particular app, like the one in your link, because you're not just benchmarking the CPU, but the whole system.
It could be that the CPU is capable of giving you a 10x boost, but it's spending half its time stalled waiting for RAM, so it only ends up giving you a 5x boost, etc...
Watching the multi-core performance increases in the GPU market is more interesting IMHO - as they're working with highly parallel applications, and therefore have a lot more freedom to simply increase core counts and SIMD-widths rather than all the fancy stuff that Intel does to boost single-core performance.
Posted by Hodgman on 31 October 2013 - 06:58 PM
You can "compress" a probe down to SH data, which makes it smaller, but blurrier.
The innovation in FC3 is solving the problem of figuring out which probes to use when shading each pixel. The probes are arbitrarily scattered around the levels, so this is a search problem.
FC3 chose to compress the probes, and copy them into the cells of a regular 3D grid covering the view. Instead of searching for the nearest probe, they can always find it instantly, at a constant cost, by reading the grid cell that the pixel is in -- this allows them to use any number of probes that they like; millions across the island if need be.
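As a minimal sketch of that constant-time lookup (the struct name, grid layout and flattening order here are my own assumptions, not FC3's actual data structure):

```cpp
#include <cstddef>
#include <cmath>

// Hypothetical sketch: map a world-space position straight to a flat index
// into a regular 3D grid of light probes -- a constant-time lookup instead
// of a search over arbitrarily scattered probes.
struct ProbeGrid {
    float originX, originY, originZ;  // world-space corner of the grid
    float cellSize;                   // width of one cubic cell
    int   nx, ny, nz;                 // cell counts per axis

    std::size_t cellIndex(float x, float y, float z) const {
        auto clampAxis = [](int i, int n) { return i < 0 ? 0 : (i >= n ? n - 1 : i); };
        int ix = clampAxis((int)std::floor((x - originX) / cellSize), nx);
        int iy = clampAxis((int)std::floor((y - originY) / cellSize), ny);
        int iz = clampAxis((int)std::floor((z - originZ) / cellSize), nz);
        return (std::size_t)(iz * ny + iy) * nx + ix;  // x-major flattening
    }
};
```

Whatever probe data is stored per cell (SH coefficients etc.), shading a pixel is then one index computation plus one fetch.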
This lets them add diffuse IBL extremely cheaply (the probes are pre-generated and stored on disc) to every pixel.
The downsides are that they don't support dynamic geometry (due to the pre-generation), that they're low-res/blurry due to using SH (this means you see diffuse bounced light, but no sharp/glossy reflections), and that on 360/PS3 there wasn't enough memory to store much probe data, so dynamic lights are ignored by the GI system.
I.e. on current consoles, their system allows the sun and sky to reflect off objects and cause "bounce lighting", but the PC also supports "bounce lighting" caused by other light sources as well.
Posted by Hodgman on 31 October 2013 - 06:36 PM
As for the above optimization posts -- memory optimization is like multi-threading: it's hard to shoe-horn in a solution late in the project, so it does make sense to do some planning and come up with some strategies at the start.
Also, if you've written everything using a general-purpose allocator and later find that this allocator is slow, all you can do is replace it with a different general purpose allocator!
Different allocation strategies have different usage semantics; if you've written code for one class of allocators, you often can't switch to a different class of allocator without rewriting that code.
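To illustrate how usage semantics differ between allocator classes, here's a sketch of a linear ("arena") allocator (my own minimal example, not any particular engine's):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a linear ("arena") allocator. Its usage semantics differ from
// malloc/free: individual frees are impossible; you reset the whole arena
// at once (e.g. at the end of a frame). Code written against malloc-style
// semantics can't switch to this class of allocator without restructuring.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buffer_(bytes), offset_(0) {}

    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (offset_ + align - 1) & ~(align - 1);  // round up to alignment
        if (p + size > buffer_.size()) return nullptr;         // out of space
        offset_ = p + size;
        return buffer_.data() + p;
    }

    void reset() { offset_ = 0; }  // frees *everything* in O(1)

private:
    std::vector<std::uint8_t> buffer_;
    std::size_t offset_;
};
```

Allocation is a pointer bump (great locality for sequential allocations), but the trade-off is baked into every call site: nothing can be freed individually.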
Lastly, memory optimization isn't always about having the fastest allocator, or the one with the least fragmentation. Often it's about making sure that the relative locations of your allocations are cache-friendly for later usage. If your allocator is causing bad locality of reference, it's hard to spot later in a profiler -- it certainly won't show up as an issue in any memory management code; instead you'll just have every other bit of code running 50% slower than it could be, without you ever knowing...
So yeah, big engines do plan out a lot of different allocation strategies at the start of the project ;)
Posted by Hodgman on 31 October 2013 - 06:13 PM
SVO is very performance intensive -- the IBL GI shown in the infiltrator demo isn't as accurate, but is cheaper, which means you have more GPU time to draw other stuff.
Btw, it's SVO for sparse voxel octree, or voxel cone tracing, not SVOxel ;)
Posted by Hodgman on 31 October 2013 - 07:40 AM
"Well, they do all that, but when you measure such a difference in the processing abilities of one core of a quad core from 2007 and 2013, you can maybe observe a 2x speedup"
That depends on the code. If you port your SSE code over to the new AVX instructions, then that's a potential/theoretical 4x speedup just there; plus, if we say that the CPU is 2x faster overall, then that's an 8x speedup in total.
That's theoretical best case though, if you re-write your code to completely exploit the new CPU.
"...reaches the memory through the one common 'bus' and they collide on that way?"
In very simple terms, the CPU plugs into the motherboard, and the RAM plugs into the motherboard. The motherboard has a link (bus) between the two components. That bus itself has a maximum speed (and these have been getting faster over time).
The problem is that CPUs have been improving at x% per year, while RAM and the bus have been improving at y% per year, and y is smaller than x.
For example -- I'm just making up numbers, but imagine that every two years CPUs are 2x as fast as before, but new RAM/motherboards are only 1.5x as fast as before.
RAM has still gotten faster... but the CPU has gotten faster, faster.
Say you then measure the CPU in instructions per second, and you measure RAM/bus in bytes per second...
Let's say that Computer A does 100 instructions/second, and 100 Bytes/second. In relative terms, that's 1 byte per instruction.
Let's say that two years later, Computer B does 200 instructions/second, and 150 Bytes/second. In relative terms, that's 0.75 bytes per instruction!
Then this means that the number of Bytes per Instruction is actually decreasing over time. CPUs are getting so fast, that they can crunch data faster than the RAM can deliver that data...
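The made-up numbers above boil down to one division; as a trivial sketch:

```cpp
// Bytes available per instruction: a rough measure of whether the RAM/bus
// can keep the CPU fed. Numbers here are the made-up ones from the post.
double bytesPerInstruction(double instrPerSec, double bytesPerSec) {
    return bytesPerSec / instrPerSec;
}
// Computer A: bytesPerInstruction(100, 100) -> 1.0 byte/instruction
// Computer B: bytesPerInstruction(200, 150) -> 0.75 bytes/instruction
```

The ratio shrinking over time is the whole point: the CPU outruns its data supply.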
Posted by Hodgman on 31 October 2013 - 06:43 AM
Yeah I've had ISM's on my mind for a while too
The problem is that they're designed for RSM/etc, where you have a lot of little lights with their own little ISM, which all add up together to approximate a large-area bounce light. Because each of these "bounce lights" is made up of, say, 1000 VPLs, each with their own ISM, the amount of error present in each individual ISM isn't noticeable.
I haven't tried it, but I have a feeling that if I've got a street with 100 individual streetlamps on it, and I use ISMs to efficiently generate shadows for them, then the ISM artefacts will be hugely noticeable... Maybe I could use ISMs for my "backwards" shadowing idea though, where my "visibility probes" are rendered using the ISM technique? That would allow for huge numbers of dynamic probes...
In Killzone, they use SSR to supplement reflection probes -- SSR, local probes and a global sky probe are all merged.
I'm thinking of a similar thing for the "per-pixel depth hemisphere" system -- you'd have a huge number of approximate visibility probes about the place, which give very coarse mid-range blockers -- e.g. if you're inside a room with one window, don't accept light contributions from a spotlight from behind the wall that's opposite that one window. Obviously these probes would have to be quite conservative, with bleeding/tolerance around the same scale as their radius of influence.
You could then supplement this with screen-space data, in the cases where it's possible to get any results.
On SSAO, it really gives me a headache when games do stuff like this (Far Cry 3, why u do this?)... They've got this great lighting and GI system, and they go and shove a bloody outline filter in the middle of it!?
The last SSAO filter I wrote looks like this (just the contact-darkening on the players; the grass is old-school planar projection stuff, and yeah, there's no shadow-mapping/etc), which I personally don't think is as objectionable as the above (I'd call the above a contrast filter). Maybe I'm biased though...
With the screen-space specular masking, at the moment, I'm pretty hopeful I'll be able to use it. It does break down at the edges of the screen, with thin geo, and with complex depth overlaps... but I also do a lot of image based lighting, and there's not a lot of good ways to shadow an IBL probe, and the scene looks a lot better with these shadows fading in, in the areas where the technique works...
"How about two-level DSSDO? One for small-scale detail and one for larger, then picking the better value per pixel?"
Yeah, that could work -- you could kinda have a hierarchy of occlusion values, or shells of occlusion hemispheres. If you know the depth range of each "shell" where the assumptions are valid (occluders are closer than the light source), then you can use just the valid ones... You probably wouldn't need too many of these layers to get half-decent results...
Posted by Hodgman on 30 October 2013 - 11:02 PM
I'm just kind of day-dreaming for ideas here
We've gotten to the point now where it's possible to make a real-time renderer with 1000 dynamic lights, but the problem is that we can't really generate 1000 real-time shadow maps yet.
Most games usually only have a handful of dynamic shadow-casting lights, and a large number of small point lights w/o shadows, or a large number of static lights with baked shadows.
What if, for all these lights where we can't afford to generate shadows, we spun the problem around backwards -- instead of calculating the visibility from the perspective of each light, what if we calculate the approximate visibility from each surface?
That's crazy talk, Hodgman! There's millions of surfaces (pixels) that need to be shaded, and only thousands of lights, so it should be more expensive... at least until we've also got millions of dynamic lights...
However, the results don't have to be perfect -- approximate blurry shadows are better than no shadows for all these extra lights.
And if it's possible, it's a fixed cost; you calculate this per-pixel visibility once, and then use it to get approximate shadows for any number of lights.
There's only a few techniques that come to mind when thinking along these lines:
- DSSDO -- an SSAO type effect, but you store occlusion per direction in an SH basis per pixel. When shading, you can retrieve an approximate occlusion value in the direction of each light, instead of an average occlusion value as with SSAO.
- Screen-space shadow tracing -- not sure what people call this one. Similar to SSAO, but you check occlusion in a straight line (in the direction of your light source) instead of checking occlusion in a hemisphere. I've used it on PS3/360, and IIRC it was used in Crysis 1 too.
The problem with #2 is that it's still per-light -- for each light, you'd have to trace an individual ray, and save out thousands of these occlusion results...
The problem with #1 is that it's just an occlusion value, disregarding distance -- you might find an occluder that's 2m away, but have a light that's only 1m away, which will still cause the light to be occluded. This means it can only be used for very fine details (smaller than the distance from the light to the object).
To make technique #1 more versatile with ranges, what if instead of storing occlusion percentage values, we stored depth values, like a typical depth map / shadow map? You could still store it in SH, as long as you use a shadowing algorithm like VSM that tolerates blurred depth maps (in this case you would have one set of SH values to store z, and another set to store z² for the VSM algorithm).
You could then generate this data per-pixel using a combination of techniques -- You could bake these "depth hemispheres" per texel for static objects, bake out "depths probes" for mid-range and do screen-space ray-tracing for very fine details, and then merge the results from each together.
Then when lighting a pixel, you could read its z and z² values for the specific lighting direction and apply the VSM algorithm to approximately shadow the light.
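The VSM test that the idea leans on is the standard one (Chebyshev's inequality) -- a sketch, with my own function name:

```cpp
#include <algorithm>

// Standard VSM shadow test (Chebyshev's inequality). This is what makes the
// "depth hemisphere" idea tolerant of heavily blurred / SH-projected depths.
//   z  = mean occluder depth toward the light
//   z2 = mean squared occluder depth
//   t  = receiver's distance to the light
// Returns an upper bound on the unoccluded fraction of light (0..1).
float vsmShadow(float z, float z2, float t) {
    if (t <= z) return 1.0f;                       // receiver is in front of occluders
    float variance = std::max(z2 - z * z, 1e-5f);  // clamp to avoid divide-by-zero
    float d = t - z;
    return variance / (variance + d * d);          // Chebyshev upper bound
}
```

The nice property here is that z and z² stay meaningful under linear filtering/blurring, which is exactly what an SH projection does to them.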
I haven't tried implementing this yet, it's just day-dreaming, but can anyone point out any obvious flaws in the idea?
To make technique #2 work for more than one light, what if we only use it to shadow specular reflections, not diffuse light? We can make the assumption that any light source that contributes a visible specular reflection must be located somewhere in a cone that's centred around the reflection vector, and whose angle is defined by the surface roughness.
Yeah, this assumption isn't actually true for any microfacet specular (it is true for Phong), but it's close to being true a lot of the time
So, if we trace a ray down the R vector, and also trace some more rays that are randomly placed in this cone, find the distance to the occluder on each ray (or use 0 if no occluder is found), and then average all these distances, we've got a filtered z map. If we square the distances and average them we've got a filtered z² map, and we can do VSM.
When shading any light, we can use these per-pixel values to shadow just the specular component of the light.
Not sure what you'd call this... Screen Space Specular Shadow Variance Ray Tracing is a mouthful
I have tried implementing this one over the past two days. I hadn't implemented decent SSR/SSRT/RTLR before, so I did that first, with 8 coarse steps at 16 pixels, then 16 fine steps one pixel at a time to find the intersection point. When using this technique to generate depth maps instead of reflection colours, I found that I could completely remove the "fine" tracing part (i.e. use a large step distance) with minimal artefacts -- this is because the artefact is basically just a depth bias, where occluders are pushed slightly towards the light.
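A heavily simplified sketch of the coarse marching step (a real implementation marches in 2D screen space against the depth buffer; here the "depth buffer" is just a 1D strip of samples along the ray, and all names are my own):

```cpp
#include <vector>

// March along a ray in fixed coarse steps and return how many steps it took
// to fall behind the stored scene depth (an occluder), or -1 if nothing is
// hit. Skipping the "fine" refinement pass, as described above, just biases
// the hit point slightly -- occluders get pushed a little toward the light.
float coarseTrace(const std::vector<float>& sceneDepth,  // depth per step along the ray
                  float rayStartDepth, float rayStepDepth,
                  int numSteps) {
    float rayDepth = rayStartDepth;
    for (int i = 0; i < numSteps && i < (int)sceneDepth.size(); ++i) {
        rayDepth += rayStepDepth;
        if (rayDepth >= sceneDepth[i])   // ray went behind the scene surface
            return (float)(i + 1);       // occluder hit after i+1 coarse steps
    }
    return -1.0f;  // no occluder within range
}
```

When the traced distances feed a VSM-style filter rather than a reflection colour lookup, this bias is largely hidden by the blur, which matches the observation above.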
At the moment, tracing 5 coarse rays in this cone costs 3.5ms at 2048x1152 on my GeForce GTX 460.
In this Gif, there's a green line-light in the background, reflecting off the road. Starting with a tiny cone centred around the reflection vector, the cone grows to an angle of 90°:
Instead of animating the cone width for testing purposes, my next step is to read the roughness value out of the G-buffer and use that to determine the cone width.
This effect will work best for extremely smooth surfaces, where the cone is small, so that the results are the most accurate. For rough surfaces, you're using the average depth found in a very wide cone, which is a big approximation, but the shadows fade out in this case and it still seems to give a nice "contact" hint.
Posted by Hodgman on 30 October 2013 - 06:37 AM
There's two ways I'd approach it,
1) If the data is going to be completely dynamic every frame and up to maybe a thousand of these events.
2) If the data is static, or new events are added over time but old ones aren't removed.
For #1, I'd do what you've started to describe. Deferred rendering would probably be better, but forward rendering (e.g. 8 at a time) would also work.
Create a floating point frame buffer, and for each "light" calculate the attenuation/falloff, and add all the light values together (e.g. additively blend them into the frame buffer).
Then in a post-processing step, you can make a shader that reads each pixel of that float texture, remaps it to a colour, and writes it out to a regular RGBA/8888 texture for display.
If the float->colour gradient step needs to know the min/max/average value, then you can create a set of floating point frame-buffers, each half the resolution of the previous until you get to 1x1 pixel (or create a single one with a full mipmap chain -- same thing). After rendering to the full-res float buffer, you can make a post-process shader that reads 4 pixels from it and writes one pixel to the next smallest float buffer. It can take the min/max/average/etc of those 4 values and output it to the next buffer. Repeat and you end up with a 1x1 texture containing the value you're looking for.
Your float -> colour post-processing shader can then read from this 1x1 texture in order to know what the avg/min/etc value is.
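The reduction chain can be sketched on the CPU (the GPU version does the same thing with one post-process pass per mip level; this is my own illustrative code):

```cpp
#include <vector>

// Repeatedly read 2x2 blocks of a square float image (side must be a power
// of two) and write their average to a half-resolution image, until a single
// value remains. Swap the average for min or max to compute those instead.
float reduceAverage(std::vector<float> img, int side) {
    while (side > 1) {
        int half = side / 2;
        std::vector<float> next(half * half);
        for (int y = 0; y < half; ++y)
            for (int x = 0; x < half; ++x) {
                float a = img[(2 * y)     * side + 2 * x];
                float b = img[(2 * y)     * side + 2 * x + 1];
                float c = img[(2 * y + 1) * side + 2 * x];
                float d = img[(2 * y + 1) * side + 2 * x + 1];
                next[y * half + x] = (a + b + c + d) * 0.25f;  // 4 pixels -> 1
            }
        img = std::move(next);
        side = half;
    }
    return img[0];  // the 1x1 result
}
```

For an NxN buffer this is log2(N) passes, each a quarter the size of the last -- cheap compared to the full-res lighting pass.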
For #2, I'd work in texture-space instead of screen space. This requires that your model has a unique UV layout -- if the model has "lightmap texture coords" as well as regular ones, then for this, you'd want to be using the "lightmap" ones. These are just an arbitrary mapping to 2D texture space, where each triangle gets a non-overlapping region of a 2D texture to store its "lightmap"/etc data.
First, make a float texture at a high enough resolution to give good detail on the surface of the model. Then render the model into this float texture, except instead of outputting the vertex position transformed by the WVP matrix, you instead output the "lightmap" texture coordinates (but scaled to the -1 to 1 range, instead of 0 to 1) as if they were the positions. This will allow you to "bake" data into the "lightmap" texture.
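That scale-and-bias of the UVs is tiny but easy to get backwards; as a sketch (note that depending on the API you may also need to flip the v axis -- that detail is left out here):

```cpp
// Remap a "lightmap" texture coordinate in [0,1] into the [-1,1] clip-space
// range, so the vertex shader can output it as if it were a position,
// rasterizing each triangle into its lightmap region instead of the screen.
float uvToNdc(float uv) { return uv * 2.0f - 1.0f; }
```

Apply it to both u and v, output z = 0 and w = 1, and the rasterizer does the rest.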
Then in the pixel shader, you compute the attenuation/etc as usual, and additively output all the light values.
Then once you've generated your "lightmap", you can perform the same steps as above to reduce it to a 1x1 texture if you want to find an average value, etc.
You can then render the model as usual in 3D (outputting positions to the screen), and you can texture it using the pre-generated lightmap.
If new events arrive, you can additively blend them into the existing lightmap texture.
Posted by Hodgman on 29 October 2013 - 07:57 PM
I'm not a lawyer, but from my understanding of IP law, you've been trading a product using that logo, therefore it is a trademark, and it's your property.
For as long as you keep trading using that logo, a new competitor can't muscle in and take it. You, however, do have the option of attacking any competitor who decides to trade a similar product with a similar logo. I.e. if push came to shove, Star Citizen would be on the back foot, so it's worth their while for the two of you to just agree that the two logos are different enough to not be infringing.
Posted by Hodgman on 29 October 2013 - 06:31 PM
These new quad-cores have used their smaller transistors to achieve better performance, with much higher clock speeds, more cache, better pathways to RAM, new instruction sets, 16-wide SIMD operations, added an integrated GPU, more complex pipelines, etc...
That's their main consumer focus still, because most consumer software isn't written well for multi-core yet -- so Johnny Average will get a better experience from these upgrades than if they'd left everything the same and just increased the core count to 16.
Both methods of improvement are still occurring - they're adding more cores, while also increasing the complexity of each core. 8-core is now a standard consumer CPU option.
At the same time, Intel has been working on other CPUs and "expansion slot co-processors" that have prioritized core count much higher, so you can buy a 16 core CPU if you want to ;)
Yeah, you don't have to spend $10k to get a 24 core machine. The rendering machine at my studio has two 16 core CPUs in it, and it cost a similar amount to a high-end gaming PC. So, we can practice writing 32 core software right now ;)
Posted by Hodgman on 29 October 2013 - 07:57 AM
If it was true then, it's still true now. Patents don't expire quickly. I've never heard of Microsoft ever enforcing this particular patent, though...
The core patent claim is:
A computer implemented method of providing varying levels of detail in terrain rendering, the method comprising: providing representations of plural levels of detail of a terrain, where each level represents a regular grid; determining a viewpoint of a viewer; caching a terrain view in a set of nested regular grids obtained from the plural levels as a function of distance from the viewpoint; and incrementally refilling the set of nested regular grids as the viewpoint moves relative to the terrain.
Or in English:
It's a terrain LOD system. Each LOD is a regular grid (evenly spaced vertices). The LOD selection is based on the camera position -- depending on the distance from the camera, a different representation of the terrain is generated. These generated LODs are cached in this nested regular-grid structure.
By the sounds of that, if you implement it using texture-fetching in the vertex shader, you wouldn't be caching the grid structure, so you wouldn't be infringing the claim.
Or, if you use nested circles instead of nested grids... or if your grids weren't regular (e.g. if the vertex spacing gradually got farther apart as you went outwards on the rings).
So it's probably pretty easy to implement clipmaps without actually infringing that patent... and even if they did sue you over it (which they've not done to anyone yet, AFAIK), if you had the money to hire a legal team, you could probably argue that a "nested grid comprised of texture images" is an obvious analogy to the mipmap, so that their claim isn't even valid...
As mentioned in that other thread, it's probably a "defensive patent", where they're not looking to enforce it, unless someone tries to sue them for patent infringement first -- in which case they'll counter attack with 1000 stupidly obvious patents of their own. I bet you use a linked list somewhere, or have a chat system that allows people to send strings of characters that look like faces (":-)") to each other, for example...
The usual disclaimer: I am not a lawyer.
Posted by Hodgman on 28 October 2013 - 11:35 PM
You don't use a Graphics Library instead of an Engine, you use a Graphics Library to build an Engine
Posted by Hodgman on 28 October 2013 - 07:57 PM
Even if all the nodes are visible, you can decide to LOD some of them and use the parent instead of the leaf nodes.
If you're not going to do any stuff that exploits the hierarchical nature of the tree (including what mhagain said above), then you've basically just got a 3D grid, not a 3D tree
The simplest pre-generated PVS system can be extremely simple though. In one game demo that I made, I simply manually created bounding volumes for different sectors, and then manually wrote into a text file lists of which sectors could be seen from which other sectors
As well as pre-generated PVS, there's lots of runtime approaches that use the CPU rather than the GPU --
- Lots of games have used sector/portal systems.
- I know of one proprietary AAA engine that just had the level designers place 3D rectangles throughout the world as occluders (e.g. in the walls of buildings). The engine would then find the ~6 largest of these rectangles that passed the frustum test, and then use them to create frustums, and then brute-force cull any object that was inside of those occluder-frustums.
- Fallout 3 used a combination of both of the above.
- Here's another example of occluder frustums: http://www.gamasutra.com/view/feature/131388/rendering_the_great_outdoors_fast_.php?print=1
- Lots of modern games have implemented software rasterizers, where they render occlusion geometry on the CPU to create a float Z-buffer in main ram, which you can then test object bounding boxes against.
- Personally, I'm using this solution, where you implement a software rasterizer, but only allocate one bit per pixel (1=written to, 0=not yet written to). A clever implementation can then write to (up to) 128 pixels per instruction using SIMD. Occludee and occluder triangles are then sorted from front to back and rasterized. Occluder triangles write 1's, occludee triangles test for 0's (and early exit if a zero is found).
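The 1-bit coverage idea scales down to plain 64-bit integers for illustration (a SIMD version does the same thing with 128-bit registers; this is my own sketch, not the engine's actual code):

```cpp
#include <cstdint>

// One 64-bit word covers 64 pixels of a scanline span: 1 = written, 0 = empty.
// Occluder triangles OR their coverage in; occludee triangles test whether
// any of their pixels are still zero (i.e. potentially visible).
struct CoverageRow64 {
    std::uint64_t bits = 0;

    void writeOccluderSpan(int x0, int x1) {        // half-open span [x0, x1), 0..64
        bits |= spanMask(x0, x1);                   // set up to 64 pixels in one op
    }
    bool occludeeVisible(int x0, int x1) const {    // any uncovered pixel in span?
        return (~bits & spanMask(x0, x1)) != 0;
    }
    static std::uint64_t spanMask(int x0, int x1) {
        std::uint64_t hi = (x1 >= 64) ? ~0ull : ((1ull << x1) - 1);
        std::uint64_t lo = (1ull << x0) - 1;
        return hi & ~lo;
    }
};
```

Because front-to-back sorting guarantees occluders are rasterized before the occludees they hide, the occludee test can early-out the moment a zero bit is found.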
On the GPU side, there are also other options than occlusion queries. E.g. Splinter Cell: Conviction re-implemented the HZB idea, doing culling in a pixel shader, allowing them to do tens of thousands of queries in a fraction of a millisecond on a GPU from 2006 (much, much faster than HW occlusion queries, but slightly less accurate).
"Stack trace comes from GL driver which is basically useless"
So as long as your usage of the functions is perfectly compliant with the GL spec, then that means you can legitimately blame the driver and/or report a driver bug
Posted by Hodgman on 28 October 2013 - 06:34 PM
Does every node in your tree actually contain an occludee? You can probably skip most of them. Also, you should perform frustum culling before occlusion culling, which will also reduce this number.
Posted by Hodgman on 28 October 2013 - 02:29 AM
Moore's law is about transistor density, not computing power -- the number of transistors you can cram into a particular area seems to double every 2 years.
It's slowing down, and for the past 5 years or so, there's been more of a focus on using those extra transistors to create more cores, rather than using them to speed up a single core.
If you're writing single-threaded code now, it will only be a few % faster in 5 years. But if you're writing code right now for a 4-core CPU, written in a way where it can scale up to a 32-core CPU, then maybe in 5 years you'll be able to run that code 700% faster...
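The scaling claim can be made concrete with Amdahl's law (a standard formula, not from the post): on n cores, code whose parallelizable fraction is p speeds up by 1 / ((1 - p) + p/n). Roughly 90%-parallel code on 32 cores gives about an 8x speedup -- the "700% faster" figure -- while purely serial code gains nothing.

```cpp
// Amdahl's law: ideal speedup on n cores for code whose parallel fraction
// is p (0 = fully serial, 1 = fully parallel).
double amdahlSpeedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
// amdahlSpeedup(0.0, 32)  -> 1.0 (serial code: no gain from extra cores)
// amdahlSpeedup(0.90, 32) -> ~7.7x (close to "700% faster")
```

It's also a reminder of why the scalable-by-design structure matters: the serial fraction, not the core count, caps the win.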