### Similar Content

• By lxjk
Hi guys,
There are many ways to do light culling in tile-based shading. I've been playing with this idea for a while, and just want to throw it out there.
Because tile frustums are generally small compared to the light radius, I tried using a cone test to reduce the false positives introduced by the commonly used sphere-frustum test.
On top of that, I use distance to the camera rather than depth for the near/far test (i.e. the volume is sliced by spheres).
This method can be naturally extended to clustered light culling as well.
The following image shows the general idea.

Performance-wise I get around a 15% improvement over the sphere-frustum test. You can also see how a single light performs, from left to right: (1) standard rendering of a point light; then the tiles that pass (2) the sphere-frustum test; (3) the cone test; (4) the spherical-sliced cone test.

I put the details in my blog post (https://lxjk.github.io/2018/03/25/Improve-Tile-based-Light-Culling-with-Spherical-sliced-Cone.html), GLSL source code included!
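
For readers who want the idea in code form, here is a minimal C++ sketch of a sphere versus spherical-sliced cone test. It is not the GLSL from the blog post; the names `Vec3`, `TileCone` and `sphereTouchesSlicedCone` are hypothetical, the cone apex is assumed to sit at the view-space origin, and the tile half-angle is assumed to be well below 90 degrees.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical tile description: a cone with its apex at the camera
// (view-space origin), bounded by two spheres instead of depth planes.
struct TileCone {
    Vec3  axis;      // unit vector through the tile center
    float cosHalf;   // cosine of the cone half-angle
    float sinHalf;   // sine of the cone half-angle
    float nearDist;  // smallest camera distance covered by the tile
    float farDist;   // largest camera distance covered by the tile
};

// Conservative test: returns false only when the light sphere
// cannot touch the sliced cone.
bool sphereTouchesSlicedCone(const TileCone& cone, const Vec3& center, float radius) {
    const float dist = std::sqrt(dot(center, center));

    // Spherical slicing: compare camera distance, not view-space depth.
    if (dist - radius > cone.farDist || dist + radius < cone.nearDist) {
        return false;
    }
    // Camera inside the light sphere: every direction hits the light.
    if (dist <= radius) {
        return true;
    }
    // Angle subtended by the sphere as seen from the camera.
    const float sinLight = radius / dist;
    const float cosLight = std::sqrt(1.0f - sinLight * sinLight);

    // cos(half + light) via the angle-sum identity, avoiding acos.
    const float cosSum = cone.cosHalf * cosLight - cone.sinHalf * sinLight;

    // The sphere touches the cone when the angle between the axis and
    // the light center is at most (half + light).
    const float cosBetween = dot(center, cone.axis) / dist;
    return cosBetween >= cosSum;
}
```

A tile would call this once per light and append the light index to its list when the function returns true.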

Eric

• Hello,
I am trying to make a GeometryUtil class that has methods to draw points, lines, polygons, etc. Right now I am trying to write a method to draw a circle.
There are many ways to draw a circle. I have found two.
The first way:

    public static void drawBresenhamCircle(PolygonSpriteBatch batch, int centerX, int centerY, int radius, ColorRGBA color) {
        int x = 0, y = radius;
        int d = 3 - 2 * radius;
        while (y >= x) {
            drawCirclePoints(batch, centerX, centerY, x, y, color);
            if (d <= 0) {
                d = d + 4 * x + 6;
            } else {
                y--;
                d = d + 4 * (x - y) + 10;
            }
            x++;
            //drawCirclePoints(batch,centerX,centerY,x,y,color);
        }
    }

    private static void drawCirclePoints(PolygonSpriteBatch batch, int centerX, int centerY, int x, int y, ColorRGBA color) {
        drawPoint(batch, centerX + x, centerY + y, color);
        drawPoint(batch, centerX - x, centerY + y, color);
        drawPoint(batch, centerX + x, centerY - y, color);
        drawPoint(batch, centerX - x, centerY - y, color);
        drawPoint(batch, centerX + y, centerY + x, color);
        drawPoint(batch, centerX - y, centerY + x, color);
        drawPoint(batch, centerX + y, centerY - x, color);
        drawPoint(batch, centerX - y, centerY - x, color);
    }

The other way:

    public static void drawCircle(PolygonSpriteBatch target, Vector2 center, float radius, int lineWidth, int segments, int tintColorR, int tintColorG, int tintColorB, int tintColorA) {
        Vector2[] vertices = new Vector2[segments];
        double increment = Math.PI * 2.0 / segments;
        double theta = 0.0;
        for (int i = 0; i < segments; i++) {
            vertices[i] = new Vector2((float) Math.cos(theta) * radius + center.x, (float) Math.sin(theta) * radius + center.y);
            theta += increment;
        }
        drawPolygon(target, vertices, lineWidth, segments, tintColorR, tintColorG, tintColorB, tintColorA);
    }

In the render loop:

    polygonSpriteBatch.begin();
    Bitmap.drawBresenhamCircle(polygonSpriteBatch,500,300,200,ColorRGBA.Blue);
    Bitmap.drawCircle(polygonSpriteBatch,new Vector2(500,300),200,5,50,255,0,0,255);
    polygonSpriteBatch.end();

I am trying to choose one of them, so I thought I should go with the one that avoids heavy calculations and is more efficient. It is said that floating point numbers, trigonometric operations, etc. slow things down a bit. Which method do you think would be best to use? When I timed each method from start to end, the second one came out faster. (I think I am doing something wrong here.)
Thank you.

• Hi Forum,
In terms of rendering a tiled game level, let's say the level is 3840x2208 pixels using 16x16 tiles. Which method is recommended?
Method 1: draw the whole level once, store it in a texture object, and render only what's in view each frame.
Method 2: on each frame, loop through all tiles, and draw a tile to the window only if it's in view.

Are both of these methods valid? Are there other ways? I know method 1 is memory intensive, but method 2 is processing heavy.
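
One variation on method 2 avoids looping over every tile: compute the range of tile indices that overlap the view rectangle and iterate over only those. The sketch below is an illustration under assumptions, not an established API. `TileRange` and `visibleTiles` are hypothetical names, the camera position is the top-left corner of the view in level pixels, and it is assumed to be non-negative.

```cpp
#include <algorithm>

// Inclusive range of tile columns and rows that overlap the view.
struct TileRange {
    int firstCol, lastCol;
    int firstRow, lastRow;
};

// camX/camY: top-left corner of the view in level pixels (assumed >= 0).
// viewW/viewH: view size in pixels. levelCols/levelRows: level size in tiles.
TileRange visibleTiles(int camX, int camY, int viewW, int viewH,
                       int tileSize, int levelCols, int levelRows) {
    TileRange r;
    r.firstCol = std::max(0, camX / tileSize);
    r.firstRow = std::max(0, camY / tileSize);
    // The last visible pixel is at cam + view - 1; clamp to the level edge.
    r.lastCol = std::min(levelCols - 1, (camX + viewW - 1) / tileSize);
    r.lastRow = std::min(levelRows - 1, (camY + viewH - 1) / tileSize);
    return r;
}
```

For the 3840x2208 level with 16x16 tiles (240 x 138 tiles) and an assumed 1280x720 view at (100, 50), this yields columns 6 to 86 and rows 3 to 48, so 3726 tiles are visited per frame instead of all 33120.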
• By wobes
Hi there. I am really sorry to post this, but I would like to clarify the delta compression method. I've read Quake 3 Networking Model: http://trac.bookofhook.com/bookofhook/trac.cgi/wiki/Quake3Networking, but I still have some questions. First of all, I am using LiteNetLib as the networking library, and it works pretty well with Google.Protobuf serialization. But then I ran into an issue when the server pushes a lot of data: with, say, 10 players, the server pushes 250kb/s of data at a 30hz tickrate. So I realized that I have to compress it, let's say with delta compression. As I understand it, the client and server both use an unreliable channel. The LiteNetLib meta file says that an unreliable packet can be dropped or duplicated, while a sequenced-channel packet can be dropped but never duplicated, so I think I have to use the sequenced channel for delta compression? And do I have to use a reliable channel for acknowledgment, or can I just go with sequenced and send the StateId with a snapshot rather than separately?
Thank you.
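
To make the delta idea concrete, here is a minimal C++ sketch of encoding a snapshot against a baseline and applying it on the receiving side. It is not Quake 3's code and does not use any LiteNetLib or Protobuf API; `Snapshot`, `Delta`, `encodeDelta`, `applyDelta` and the field count of 8 are all hypothetical. The assumption is that the sender encodes against the last snapshot the receiver acknowledged, and that the receiver rejects a delta whose baseline it does not hold.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kFieldCount = 8;  // hypothetical number of state fields

struct Snapshot {
    std::uint32_t id = 0;
    std::int32_t  fields[kFieldCount] = {};
};

struct Delta {
    std::uint32_t id;                  // id of the snapshot being sent
    std::uint32_t baselineId;          // id of the snapshot it was diffed against
    std::uint8_t  changedMask;         // bit i set => field i changed
    std::vector<std::int32_t> values;  // new values of changed fields, in order
};

// Sender side: keep only the fields that differ from the baseline.
Delta encodeDelta(const Snapshot& baseline, const Snapshot& current) {
    Delta d{current.id, baseline.id, 0, {}};
    for (std::size_t i = 0; i < kFieldCount; ++i) {
        if (current.fields[i] != baseline.fields[i]) {
            d.changedMask |= static_cast<std::uint8_t>(1u << i);
            d.values.push_back(current.fields[i]);
        }
    }
    return d;
}

// Receiver side: rebuild the snapshot. Returns false when the delta was
// made against a different baseline or is malformed, so it can be dropped.
bool applyDelta(const Snapshot& baseline, const Delta& d, Snapshot& out) {
    if (baseline.id != d.baselineId) {
        return false;
    }
    out = baseline;
    out.id = d.id;
    std::size_t next = 0;
    for (std::size_t i = 0; i < kFieldCount; ++i) {
        if (d.changedMask & (1u << i)) {
            if (next >= d.values.size()) {
                return false;
            }
            out.fields[i] = d.values[next++];
        }
    }
    return true;
}
```

Because each delta names its baseline, a dropped packet only means the next delta is larger; the sender keeps diffing against the most recent snapshot it knows the receiver has.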
• By dp304
Hello!
As far as I understand, the traditional approach to the architecture of a game with different states or "screens" (such as a menu screen, a screen where you fly your ship in space, another screen where you walk around on the surface of a planet etc.) is to make some sort of FSM with virtual update/render methods in the state classes, which in turn are called in the game loop; something similar to this:

    bool exiting = false;  /* assumed global, set elsewhere */

    struct State {
        virtual void update()=0;
        virtual void render()=0;
        virtual ~State() {}
    };

    struct MenuState:State {
        void update() override { /*...*/ }
        void render() override { /*...*/ }
    };

    struct FreeSpaceState:State {
        void update() override { /*...*/ }
        void render() override { /*...*/ }
    };

    struct PlanetSurfaceState:State {
        void update() override { /*...*/ }
        void render() override { /*...*/ }
    };

    MenuState menu;
    FreeSpaceState freespace;
    PlanetSurfaceState planet;
    State * states[] = {&menu, &freespace, &planet};
    int currentState = 0;

    void loop() {
        while (!exiting) {
            /* Handle input, time etc. here */
            states[currentState]->update();
            states[currentState]->render();
        }
    }

    int main() {
        loop();
    }

My problem here is that if the state changes only rarely, like every couple of minutes, then the very same update/render method will be called over and over for that whole period, about 100 times per second in the case of a 100FPS game. This seems to make dynamic dispatch, which has some performance penalty, a bit pointless. Of course, one may argue that a couple hundred virtual function calls per second is nothing for even a not so modern computer, and especially nothing compared to the complexity of the render/update function in a real life scenario. But I am not quite sure. Anyway, I might have become a bit too paranoid about virtual functions, so I wanted to somehow "move out" the virtual function calls from the game loop, so that the only time a virtual function is called is when the game enters a new state. This is what I had in mind:

    bool exiting = false;       /* assumed globals, set elsewhere */
    bool stateChanged = false;
    int currentState = 0;

    template<class TState>
    void loop(TState * state) {
        while (!exiting && !stateChanged) {
            /* Handle input, time etc. here */
            state->update();
            state->render();
        }
    }

    struct State {
        /* No update or render function declared here! */
        virtual void run()=0;
        virtual ~State() {}
    };

    struct MenuState:State {
        void update() { /*...*/ }
        void render() { /*...*/ }
        void run() override { loop<MenuState>(this); }
    };

    struct FreeSpaceState:State {
        void update() { /*...*/ }
        void render() { /*...*/ }
        void run() override { loop<FreeSpaceState>(this); }
    };

    struct PlanetSurfaceState:State {
        void update() { /*...*/ }
        void render() { /*...*/ }
        void run() override { loop<PlanetSurfaceState>(this); }
    };

    MenuState menu;
    FreeSpaceState freespace;
    PlanetSurfaceState planet;
    State * states[] = {&menu, &freespace, &planet};

    void run() {
        while (!exiting) {
            stateChanged = false;
            states[currentState]->run(); /* Runs until next state change */
        }
    }

    int main() {
        run();
    }

The game loop is basically the same as the one before, except that it now exits in case of a state change as well, and the containing loop() function has become a function template.
Instead of loop() being called directly by main(), it is now called by the run() method of the concrete state subclasses, each instantiating the function template with the appropriate type. The loop runs until the state changes, in which case the run() method shall be called again for the new state. This is the task of the global run() function, called by main().
There are two negative consequences. First, it has become slightly more complicated and harder to maintain than the one above; but only SLIGHTLY, as far as I can tell based on this simple example. Second, code for the game loop will be duplicated for each concrete state; but it should not be a big problem as a game loop in a real game should not be much more complicated than in this example.
My question: Is this a good idea at all? Does anybody else do anything like this, either in a scenario like this, or for completely different purposes? Any feedback is appreciated!

# Matrix Calculation Efficiency


## Recommended Posts

Hi Guys,

At present, I send the W, V, & P matrices to the shader where they are multiplied within the shader to position vertices.

Would it be more efficient to pre-multiply these on the CPU and then pass the result to the shader?

##### Share on other sites

Do not prematurely optimize things; you might end up having to switch to the other method later. Profile and test, since that is what will give you the best answer. There are very, very few hard-and-fast rules about this stuff. It is highly dependent on what you're doing code-wise, the data you're pumping through the CPU/GPU, etc.

##### Share on other sites

It's my premature optimisation that is allowing me to be able to render so much in the first place.

I was just wondering what the normal practice was.

##### Share on other sites

Simple answer: yes - doing multiplication once ahead of time, in order to avoid doing it hundreds of thousands of times (once per vertex) is obviously a good idea.

However, there may be cases where uploading a single WVP matrix introduces its own problems too!

For example, let's say we have a scene with 1000 static objects in it and a moving camera.

Each frame, we have to calculate VP = V*P, and then perform 1000 WVP = W * VP calculations, and upload the 1000 resulting WVP matrices to the GPU.

If instead we sent W and VP to the GPU separately, then we could pre-upload the 1000 W matrices once in advance, and then upload a single VP matrix per frame. That means the CPU will be doing 1000x less matrix/upload work in the second situation, but the GPU will be doing N times more matrix multiplications, where N is the number of vertices drawn.

The right choice there would depend on the exact size of the CPU/GPU costs incurred/saved, and how close to your GPU/CPU processing budgets you are.
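
As a rough illustration of the trade-off, the C++ sketch below shows that transforming a vertex by W, V and P in turn gives the same result as transforming it once by a premultiplied WVP. The helper names (`Mat4`, `Vec4`, `mul`, `translation`) are hypothetical, and the row-vector convention (v * W * V * P) is an assumption chosen to match the W * VP ordering used above.

```cpp
#include <array>

using Mat4 = std::array<std::array<float, 4>, 4>;  // m[row][col]
using Vec4 = std::array<float, 4>;

// Matrix * matrix: 64 multiply-adds. Done once per object on the CPU
// when premultiplying, or per vertex if the shader multiplies matrices.
Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Row vector * matrix: 16 multiply-adds per vertex.
Vec4 mul(const Vec4& v, const Mat4& m) {
    Vec4 r{};
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            r[j] += v[i] * m[i][j];
    return r;
}

// Translation matrix for the row-vector convention (offset in the last row).
Mat4 translation(float x, float y, float z) {
    Mat4 m{};
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0f;
    m[3][0] = x;
    m[3][1] = y;
    m[3][2] = z;
    return m;
}
```

With these helpers, `mul(mul(mul(v, W), V), P)` costs three vector-matrix products per vertex, while `mul(v, mul(W, mul(V, P)))` costs one per vertex once WVP has been computed per object.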

##### Share on other sites
Yes. Multiplying once outside the shader is the way to go. If it's something static like rendering landscape, then yes. It's a bit more tricky if it's your game entities. In that case you need to weigh up instancing for translation and orientation of objects vs updating the matrix on the fly each draw call.

For static, yes. For dynamic in low numbers, yes. It gets more murky when you start dealing with a lot of objects.

##### Share on other sites
Thanks guys!

In my case just about all of the geometry will be pre-transformed in my 3D package. So, there won't be any additional rotations, scaling, etc to do either.

##### Share on other sites

Yes.

And no, no, no, no, no: this is not premature optimization, it's engineering for efficiency, they're not the same thing and don't listen to anyone who tells you different.

##### Share on other sites

I have a similar question about fine-grained performance measurement:

Imagine I have two loops in a Geometry Shader with bounds that are known compile-time constants:

    for (int x = 0; x < 4; ++x) {
        for (int y = 0; y < 3; ++y) {
            // ...
            DoStuff();
        }
    }

This code in release mode gives me "Approximately 22 instruction slots used" (VS compiler will output this info)

If I place [unroll] before each loop, I get "Approximately 89 instruction slots used".

Right now I can measure time in NSight's "Events" window with nanosecond precision, and I can't see a performance gain between the shaders.

Is there a way to measure the difference in a finer way?

The question is similar because measuring the performance difference of such optimizations (2 matrices vs 1, unroll vs no unroll) requires some tool that can measure the difference.

Edited by Happy SDE

##### Share on other sites

If you can't see any perf difference it might just be because you're bottlenecked elsewhere; e.g. you might be CPU-bound.

##### Share on other sites
> If you can't see any perf difference it might just be because you're bottlenecked elsewhere; e.g. you might be CPU-bound.

No, I am not CPU bound at all.

This code calculates 4 Shadow Maps in one pass, which is faster than 4 separate calls (I can see the difference in NSight, because it is significant: a 50-200% win depending on quality settings).

This is a macro-optimization.

But passing unroll or 1/2 matrices is a micro optimization, which might give me something.

And with current tools I am aware of I can't detect it =(

One option - is to calculate instruction count.

But as I understand:

1. Each instruction has its own cost, so just summing them up is not a good idea.

2. NSight's measurement of the same scene, with the same shader, gives me an error of about 0.2% between passes.

So I keep searching for a tool that will let me measure the performance of micro-optimizations.

The main reason for that: find (and measure) a good practice once, and after that apply it elsewhere without unnecessary code bloating because of some unmeasured speculations.
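
One way to reason about whether a difference is measurable at all is to collect many timings per variant and compare their medians against the run-to-run noise. The C++ sketch below only illustrates that comparison; `median`, `relativeDifference` and `exceedsNoise` are hypothetical helpers, not part of NSight or any profiler, and the noise fraction (for example 0.002 for the 0.2% mentioned above) must come from your own repeated measurements. Both sample sets are assumed to be non-empty.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a non-empty set of timing samples (taken by value so the
// caller's data is not reordered).
double median(std::vector<double> samples) {
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return (n % 2 == 1) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Signed relative difference of variant B versus variant A:
// negative means B is faster.
double relativeDifference(const std::vector<double>& a, const std::vector<double>& b) {
    const double ma = median(a);
    return (median(b) - ma) / ma;
}

// True when the difference between the variants is larger than the
// measured run-to-run noise, given as a fraction (0.002 == 0.2%).
bool exceedsNoise(const std::vector<double>& a, const std::vector<double>& b,
                  double noiseFraction) {
    return std::fabs(relativeDifference(a, b)) > noiseFraction;
}
```

If `exceedsNoise` stays false no matter how many samples are gathered, the two shader variants are indistinguishable on that workload with that tool, which is itself a useful result.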

Edited by Happy SDE