
Matias Goldberg


#5284145 N64, 3DO, Atari Jaguar, and PS1 Game Engines

Posted by on 29 March 2016 - 06:15 PM

The first "commercial game engine" that comes to my mind that supported multiple platforms, was used by several AAA titles, and remotely gets close to the modern concept of a game engine was RenderWare. It wasn't even a game engine, it was a rendering engine. And it wasn't for the PS1 / N64 generation.


Licensing it cost several tens of thousands of dollars AFAIK. Wikipedia has a list of its competitors (Unreal Engine & Frostbite). It doesn't matter who came first; none of them were from the era you're looking for because, as others already explained, back then it was all handcrafted and kept in house, with studios occasionally licensing their tech to others.

#5283756 Per Triangle Culling (GDC Frostbite)

Posted by on 27 March 2016 - 03:02 PM


I don't know why you insist that much on bandwidth.

Alright I ran your numbers, you've convinced me it isn't as big an issue as I thought it to be... but I'm hazy on one figure of yours.  
edit - also didn't you forget to take into account the Hi-Z buffer bandwidth for per triangle depth culling?


Yes I did. I don't know the exact memory footprint, but a 33.33% overhead (like in mipmapping, where the full mip chain adds 1/4 + 1/16 + 1/64 + ... ≈ 1/3 on top of the base level) sounds like a reasonable estimate.

How did you get the 309MB per frame figure?  When I did it I'm getting completely different numbers.
edit - specifically the 305MB number.
Thanks for pointing it out.

1.000.000 * 32 bytes = 30.51MB... dammit I added a 0 and considered 10 million vertices.
The 305MB came from 10 million vertices, not 1 million.
Well... crap.

For 10 million vertices it's 35MB of index data, not 3.5MB. But for 1 million vertices, it's 30.51 MB, not 305.5MB

It only makes it easier to prove. Like I said, at 1920x1080 there shouldn't be much more than 2 million vertices on screen (since that would already be one vertex per pixel). Maybe 3 million? Profiling would be needed.
So if you provide a massive amount of input vertices (such as 10 million vertices), the culler will end up discarding a lot of them.
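
Since the arithmetic tripped me up once already, here's a minimal sketch to re-check the corrected figures (assuming 32 bytes per vertex, 16-bit indices and 0.6 triangles per vertex, as in the numbers above):

#include <cstdio>

// Re-check of the figures above: 32 bytes per vertex, 16-bit indices,
// 0.6 triangles per vertex (600.000 triangles per 1.000.000 vertices).
int main()
{
    const double kMiB = 1024.0 * 1024.0;
    const double vertexCounts[] = { 1000000.0, 10000000.0 };
    for( double numVertices : vertexCounts )
    {
        const double vertexBytes = numVertices * 32.0;
        const double indexBytes  = numVertices * 0.6 * 3.0 * 2.0; // 3 indices/tri, 2 bytes each
        printf( "%.0f vertices: %.2f MB vertex data, %.2f MB index data\n",
                numVertices, vertexBytes / kMiB, indexBytes / kMiB );
    }
    return 0; // prints ~30.52 & 3.43 MB for 1 million; ~305.18 & 34.33 MB for 10 million
}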

#5283754 OpenGL Check if VBO Upload is Complete

Posted by on 27 March 2016 - 02:39 PM

So can you use a fence to test for the actual status of a BufferSubData call uploading to server? And that works (...) without issuing a draw call against that buffer?

Yes, but no:
1. On the one hand, you can reliably test that the copy has ended by calling glClientWaitSync with GL_SYNC_FLUSH_COMMANDS_BIT set. No need to issue a draw call. An implementation that doesn't behave this way could cause hazards or deadlocks and would therefore be considered broken. However...

2. On the other hand, flushing is something you should avoid unless you're prepared to stall. So we normally query these kinds of things without the flush bit set. Some platforms may have already begun and eventually finished the copy by the time we query for the 2nd or 3rd time (since the 1st time we decided to do something else, like audio processing). Other platforms/drivers may wait forever to even start the upload because they're waiting for you to issue that glDraw call before deciding the upload is worth it. Thus the query will always return 'not done yet' until something relevant happens.

So the answer is yes, you can make it work without having to call draw. But no, you should avoid this and hope drivers don't try to get too smart (or profile overly smart drivers).
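
A minimal sketch of the "yes" part, assuming you created a fence with glFenceSync right after the BufferSubData call, and assuming a GL 3.2+ / ARB_sync capable context with your usual function loader (glad, GLEW, etc.); the function name is just illustrative:

// Reliable completion test: the flush bit guarantees the pending commands get
// submitted, so this cannot deadlock... but with a long timeout it may stall.
static bool uploadFinished( GLsync fence, GLuint64 timeoutNs )
{
    const GLenum result = glClientWaitSync( fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutNs );
    return result == GL_ALREADY_SIGNALED || result == GL_CONDITION_SATISFIED;
}

// Usage: GLsync fence = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 ); // right after glBufferSubData
//        ... later: if( uploadFinished( fence, 1000000 /*1ms*/ ) ) glDeleteSync( fence );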

...and that works consistently across platforms...

If you're using fences and unsynchronized access you're targeting pretty much modern desktop drivers (likely GL 4.x, though it works on GL 3.3 drivers too), whether Linux or Windows. It works fine there (unless you're using 3-year-old drivers, which had a couple of fence bugs).
Few Android devices support GL_ARB_sync. It's not available on iOS AFAIK either. It's available on OSX, but OSX lives in a different world of instability.

Does it work reliably across platforms? Yes (except on OSX, where I don't know). Is it widely available across platforms? No.

If you're using fences and thus targeting modern GL, this brings me to my next point: just don't use BufferSubData. BufferSubData can stall if the driver runs out of the internal memory it uses to perform the copy.

Instead, map an unsynchronized/persistently mapped region to use as a stash between CPU<->GPU (i.e. what D3D11 knows as Staging Buffers); and then perform a glCopyBufferSubData to copy from the GPU stash to the final GPU buffer. Just as fast, fewer stall surprises (you **know** when you've run out of stash space, and fences tell you when older stash regions can be reused again), and it gives you tighter control. You can even perform the CPU -> GPU stash copy in a worker thread, and perform the glCopyBufferSubData call in the main thread to do the GPU stash -> GPU copy.
This is essentially what you would do in D3D11 and D3D12 (except the GPU->GPU copy doesn't have to be routed to the main thread).
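
A rough sketch of that scheme, under GL 4.4 (ARB_buffer_storage) assumptions; the buffer names, sizes and offsets here are made up for illustration:

// One-time setup: a persistently mapped, write-only "stash" (staging) buffer.
GLuint stagingBuffer;
glGenBuffers( 1, &stagingBuffer );
glBindBuffer( GL_COPY_READ_BUFFER, stagingBuffer );
const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage( GL_COPY_READ_BUFFER, stagingSize, NULL, flags );
void *stagingPtr = glMapBufferRange( GL_COPY_READ_BUFFER, 0, stagingSize, flags );

// Per upload: fill the stash (this memcpy can live in a worker thread)...
memcpy( (char*)stagingPtr + stagingOffset, srcData, numBytes );

// ...then, on the main thread, schedule the GPU stash -> GPU buffer copy and fence it.
glBindBuffer( GL_COPY_READ_BUFFER, stagingBuffer );
glBindBuffer( GL_COPY_WRITE_BUFFER, finalVbo );
glCopyBufferSubData( GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER,
                     stagingOffset, dstOffset, numBytes );
GLsync fence = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 ); // tells you when stagingOffset can be reused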

#5283672 OpenGL Check if VBO Upload is Complete

Posted by on 27 March 2016 - 12:15 AM

Hold on guys. There is a way to check if the copy has been performed the way OP asked.


apitest shows how to issue a fence and wait for it. The first time it checks if the fence has been signaled. The second time it tries again but flushes the queue, since the driver may not have processed the copy yet (thus the GPU hasn't even started the copy, or whatever it is you're waiting for; if we don't flush, we'll be waiting forever, a.k.a. deadlock).


Of course if you want to just check whether the copy has finished, and if it hasn't then do something else: you just need to do the 'wait' like the first time (i.e. without flushing), but using a waiting period of zero (so that you don't wait, and get a boolean-like response like the OP wants). We do this in Ogre to check an async transfer's status.
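
In code, that boolean-like check looks roughly like this (a sketch, not Ogre's actual code; assumes 'fence' was issued right after the batched copies):

// Non-blocking poll: no flush bit, zero timeout.
const GLenum status = glClientWaitSync( fence, 0, 0 );
if( status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED )
{
    glDeleteSync( fence );      // transfer done; safe to use the data
}
else
{
    // Not done yet: go do something else. If you eventually *need* the result,
    // do one wait with GL_SYNC_FLUSH_COMMANDS_BIT set, like apitest does.
}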


As with all APIs that offer fences (D3D12, Vulkan, OpenGL), the more fences you add, the worse it becomes for performance (due to the added hardware and driver overhead of communicating results, synchronizing, and keeping atomicity). Use them wisely. Don't add fences "just in case" you think you'll want to query for the transfer status. If you have multiple copies to perform, batch them together and then issue the fence.



I'd like to do this in order not to try to draw it before the upload is complete, as this halts the program (and a lag spike is felt).

Calling the glDraw* family of functions won't stall because they're also asynchronous. I can't think of a scenario where the API will stall because an upload isn't complete yet. What you usually need to check is whether a download (i.e. GPU->CPU) has completed before you map the buffer, to avoid stalling (unless you use unsynchronized or persistent mapping flags; in that case it won't stall, but you still need to check whether the copy is complete to avoid a hazard).

#5283614 Per Triangle Culling (GDC Frostbite)

Posted by on 26 March 2016 - 03:34 PM

That's true, but again I wonder at what cost, mainly in terms of bandwidth... although the combination of cluster culling and per-triangle culling might reduce the cost of the per-triangle culling to my liking.

I don't know why you insist that much on bandwidth.
Assuming 1.000.000 vertices per frame with 32 bytes per vertex & 600.000 triangles, that would require 309MB per frame of Read bandwidth (305MB in vertices, 3.5MB in index data). Actual cost can be reduced because only position is needed for the first pass (in which case only 8-16 bytes per vertex are needed). But let's ignore that.
309MB to read.
Now to write: worst case scenario, no triangle gets culled and we would need to write 1.800.000 indices (3 indices per triangle). That's 3.5MB of write bandwidth.

Now to read again: in the second pass we'd need between 309MB and 553MB depending on caches (i.e. accessing an array of 1.000.000 vertices 1.800.000 times).
So let's assume the utter worst case scenario (nothing gets culled, cache miss ratio is high):

  • 309MB Read (1st pass)
  • 3.5MB Write (1st pass)
  • 553MB Read (2nd pass)

Total = 865.5MB per frame.
A typical AMD Radeon HD 7770 has 72GB/s of peak memory bw. At 60fps, that's 1228.8MB per frame available. At 30fps that's 2457.6MB per frame.
Let's keep it at 60 fps. You still have 363.3MB of BW left per frame for textures and render targets. A render target at 1080p needs 7.9MB for the colour buffer (RGBA8888) and another 7.9MB for the depth buffer (assuming you hit the worst case where Hi-Z and Z-compression get disabled or end up useless; which btw you shouldn't run into, because this culling step removes the troublesome triangles that are hostile to Hi-Z and Z-compression. But let's assume you enabled alpha testing on everything).
You still have 347.47MB per frame left for texture sampling and compositing. Textures are the elephant in the room, but note that since non-visible triangles were culled away, texture sampling shouldn't be that inefficient, since each pixel should only end up being shaded once (or twice, tops).
And this is assuming:

  • Nothing could be culled (essentially rendering this technique useless in this case).
  • You draw 1.000.000 vertices per frame. At 1920x1080 that's one vertex every two pixels, which looks extremely smooth (if you managed to concentrate many vertices into the same pixel, leaving other pixels with less tessellation, then that contradicts the assumption that nothing could've been culled).
  • Those 1.000.000 vertices are unique and thus can't be cached during reads (a scene that renders 1.000.000 vertices per frame is typically the same 60.000 vertices or so repeated over and over again, i.e. instancing. We're assuming here that no caching could've been done).
  • You don't provide a position-only buffer for the first pass (which would greatly reduce BW costs at the expense of more VRAM)
  • You hit horrible cache miss ratios in the 2nd pass
  • Early Z and Hi-Z get disabled
  • You're on a mid-range HD 7770, which has only 72GB/s of BW (vs the PS4's 176GB/s)

Vertex bandwidth could be an issue for a game that heavily relies on vertices, but not for your average game.
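
For reference, the per-frame budget figures above are just this back-of-the-envelope math (a sketch using the HD 7770's 72GB/s and a 1080p RGBA8888 target):

#include <cstdio>

int main()
{
    const double peakBwMBps     = 72.0 * 1024.0;              // 72 GB/s expressed in MB/s
    const double mbPerFrame60   = peakBwMBps / 60.0;          // 1228.8 MB per frame @ 60fps
    const double mbPerFrame30   = peakBwMBps / 30.0;          // 2457.6 MB per frame @ 30fps
    const double colourBufferMB = 1920.0 * 1080.0 * 4.0 / ( 1024.0 * 1024.0 ); // ~7.9 MB RGBA8888
    printf( "60fps: %.1f MB/frame  30fps: %.1f MB/frame  1080p RGBA8888: %.2f MB\n",
            mbPerFrame60, mbPerFrame30, colourBufferMB );
    return 0;
}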

#5283200 Is there an elegant way of writing straight hlsl code without preprocessor ma...

Posted by on 24 March 2016 - 12:32 PM

L Spiro mentioned this, but I want to emphasize: the macros that are discouraged are the ones used as an inline function. That's because they don't behave the way you would expect.


For example:

#define MIN( x, y ) ( x < y ) ? x : y
float a = 0;
float b = 1;

float r = MIN( a++, b );
//now r = 1; a = 2; (you probably expected r = 0; a = 1)

That's because it will translate to:

float a = 0;
float b = 1;

float r = (a++ < b) ? a++ : b;

which is probably not what you intended.


However, using #ifdef #else #endif is perfectly fine, other than being less pretty to the eye, or getting out of hand if you're not disciplined; which is the reason some people (e.g. me) prefer to run the code through their own preprocessor to ease some stuff.

#5282996 Per Triangle Culling (GDC Frostbite)

Posted by on 23 March 2016 - 04:17 PM


Note that on AMD's GCN, the compute shader could be run async while rendering the shadow maps (which barely occupy the compute units), thus making this pass essentially "free".


Given that nvidia doesn't typically allow async compute, does that mean it wouldn't be useful on nvidia? 


It's easy to understand why rendering small triangles is expensive, but this culling process won't be free if it can't overlap other parts of the pipeline, right? I suppose I could see an overall positive benefit if the compute shader needs only position information and can ignore other attributes which won't contribute to culling?


Whether it's a net gain or a net loss depends on the scene. Async compute just increases the likelihood of being a net gain.

#5282647 Naming conventions, software documentation and version control

Posted by on 22 March 2016 - 09:39 AM

What naming conventions to follow? Is this all opinion based? Is there a/an common/industry standard style?

Yes, it's all opinion based.  Follow the one(s) you are comfortable with.  There is no industry style.  It might be informative to look at the naming conventions used in the standard library (defined through an ISO standards document) or through your favourite or most commonly-used toolkit.

While it's true that it's opinion based, some conventions have had more thought put into them than others, and it's important to understand the rationale.

For example, for class member variables, the most popular conventions are mMyVariable and m_myVariable.
I prefer 'm_myVariable' because of autocomplete: once you've typed 'm_', 90% of the suggestions from autocomplete will be the member variables. Whereas if you just type 'm', the suggestions are crowded with a lot of irrelevant variables and functions, mostly from the global namespace.
However my colleague prefers mMyVar because m_myVar strains his carpal tunnel syndrome (the need to hold shift to type the underscore). So we stuck with mMyVar because his health is more important.
Another option is to always use this->myVar, which makes it perfectly clear it's a member variable and works with autocomplete too. But it is noisier and harder to make everyone follow.

#5282639 GIT vs Mercurial

Posted by on 22 March 2016 - 09:19 AM

My personal experience:


Git pros:

  • GitHub. Has a lot of good services: the best ticket system, the best free & paid continuous integration solutions, ZenHub (3rd party services that integrate with GH), etc.
  • Good, fine-grained history manipulation and multi-commit merging.

Git cons:

  • Horrendously slow (for repos >= 1GB, or repos with many tens of thousands of files).
  • Extremely poor Windows support: networking bugs (failure to push, failure to pull, authentication problems due to protocol issues), crashes, command-line argument limitations causing commits to fail (when you have to commit many files). Substantially slower than its Linux/OSX counterparts.
  • Really poor handling of EOL compared to Mercurial. autocrlf brings a lot of problems.
  • The repo can be left in an inconsistent state, and when that happens it can be hard to get it back into a working state; sometimes it's quicker to just clone the repo again from a working origin.
  • No good UI other than SourceTree (which is relatively slow).




Mercurial pros:

  • TortoiseHg is really good and works identically on all platforms (there's also SourceTree, but it's just as slow as with git).
  • Windows support on par with other OSes.
  • Relatively good EOL handling.
  • Really fast.

Mercurial cons:

  • No GitHub.
  • Lacks an advanced history manipulation or branch merging system. Sometimes this is a blessing, sometimes a curse. Using extensions to do history manipulation (other than strip) is often a bad idea.


Overall I strongly prefer Mercurial. TBH, if it weren't for GitHub, I wouldn't even consider git.

#5282461 Per Triangle Culling (GDC Frostbite)

Posted by on 21 March 2016 - 04:29 PM

Pixel shaders are run in 2x2 blocks.


When a triangle covers only one pixel, the pixel shader wastes up to 75% of its power on dummy pixels. If this triangle also happens to be occluded by a bigger triangle in front of it, that work becomes entirely unnecessary.

Single-pixel triangles are also a PITA for concurrency, since there might not be enough pixels gathered to occupy a full wavefront; and when there are, divergence could be higher (since there could be little spatial coherence between those pixels).

Also, triangles that occupy no pixels at all (because they're too tiny) can, if there are too many of them, drown the rasterizer and starve the pixel shader.


Due to this, GPUs love triangles covering large areas, and hate super small triangles. Note that on AMD's GCN, the compute shader could be run async while rendering the shadow maps (which barely occupy the compute units), thus making this pass essentially "free".


It's like a Z prepass but on steroids. Because instead of only populating the Z-buffer for the 2nd pass, it also removes useless triangles (lowering vertex shader usage and alleviating the rasterizer).


So yeah... it's a performance optimization.

#5282242 Gamma correction. Is it lossless?

Posted by on 20 March 2016 - 07:03 PM

Is the conversion lossless? Given enough bits, yes:
8-bit sRGB -> 10-bit linear in fixed point, or 32-bit floating point (16-bit half floating point is not enough for 100% lossless)
10-bit linear fixed point -> 8-bit sRGB
What happens when you screw up the amount of bits? You get banding. Mass Effect 3 was particularly bad at this.

That happens when you store linear RGB straight into an RGBA8888 format. Either convert linear->sRGB when storing and sRGB->linear when reading on the fly, or use more bits.
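
For reference, a sketch of the standard sRGB transfer functions (the function names are just illustrative). Quantize the sRGB value to 8 bits and the round trip survives; quantize the linear value to 8 bits and you get the banding described above:

#include <cmath>

// Standard sRGB <-> linear transfer functions, operating on [0, 1] floats.
static float srgbToLinear( float s )
{
    return ( s <= 0.04045f ) ? s / 12.92f
                             : powf( ( s + 0.055f ) / 1.055f, 2.4f );
}

static float linearToSrgb( float l )
{
    return ( l <= 0.0031308f ) ? l * 12.92f
                               : 1.055f * powf( l, 1.0f / 2.4f ) - 0.055f;
}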

#5281393 Vulkan is Next-Gen OpenGL

Posted by on 15 March 2016 - 04:03 PM

You can build the debug layer yourself; thus fixing the DLL issue.

#5281380 How can I manage multiple VBOs and shaders?

Posted by on 15 March 2016 - 02:17 PM

On my computer, I can draw 125 000 times the same cube, while unbinding-rebinding it for EVERY draw call (It's stupid, but I did it for benchmarking purpose) with still 50 frames per second.

The driver will detect that you're rebinding the same VBO and ignore it. What you measured was the cost of the call instructions and of checking whether the VBO was already bound, not the actual cost of changing bound VBOs.
A modern computer definitely cannot handle 125.000 real VBO swaps per frame at 50Hz.
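
To illustrate, this is the kind of redundant-state filter the driver (or a GL wrapper of your own) applies, and roughly all your benchmark ended up measuring; 'lastBoundVbo' and the function name are just illustrative:

static GLuint lastBoundVbo = 0;

static void bindVertexBufferCached( GLuint vbo )
{
    if( vbo == lastBoundVbo )
        return;                          // redundant rebind: filtered out, no real work
    glBindBuffer( GL_ARRAY_BUFFER, vbo );
    lastBoundVbo = vbo;
}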

#5280899 Selling a game on Steam- Steam and VAT cuts...?

Posted by on 12 March 2016 - 09:18 AM

I guess that's why you want to use a retailer like Steam rather than running your own store. Ain't nobody got time to research the implications of selling something to 300 jurisdictions...

It would be easy if it weren't for the EU screwing it up. Traditionally, "civilized" countries applied the VAT on imports (as seen from the buyer's country) by withholding it at the payment issuer (typically the credit card company). Even if you pay with PayPal, PayPal gets the money via a credit card transaction, a bank transfer, or another PayPal transfer.


Only the last one is hard to withhold (paying w/ PayPal using funds from a previous PayPal transaction), and even then, the legislation in "decent" countries makes the buyer liable for paying the VAT. After all, the VAT is a tax imposed on the consumer, not the seller (though normally the seller is the one who deposits the VAT funds with the tax agency, since that's the most practical thing to do; but not in this case).

But no... the EU wanted to force the seller to register with their tax administration agencies even if the seller never set foot there, and even though these EU countries don't even have jurisdiction to enforce such a thing. I ranted about it in detail on my website.

You know there is a threshold to vat right? Check with an accountant, but I'm sure that you don't even have to deal with it at all until your turnover per year is greater than £20K...

Yes, but the problem with the EU's new legislation is that you have to check the threshold of every single European country, and watch out for any country without thresholds.

It's a mess.

#5280773 Selling a game on Steam- Steam and VAT cuts...?

Posted by on 11 March 2016 - 02:18 PM

It's strange that VAT/GST are supposed to be paid on the increase in value - e.g. the difference in retail and wholesale, but my quick research on EU VAT says that for digital goods this isn't taken into account.

You're right about that. However, it's the seller who must increase the price accordingly. If the seller didn't, then it is assumed the price already included the increase.