# C++ Fast indexing of a 2x2 block Frame Buffer storage


Hey guys, I'm writing a software renderer for fun, and I decided to try to optimize it by storing the frame buffer and depth buffer as rows of 2x2 blocks of pixels. I figured that would be easier to SIMDify and more cache friendly. The main thing my renderer does is traverse boxes on the screen, either to check for occlusion or to color in pixels. If my pixel format is plain rows of pixels, the core routine looks like this:

void draw(u32* Pixels, u32 MinX, u32 MinY, u32 MaxX, u32 MaxY, u32 Value)
{
    // NumPixelsX is the width of the buffer in pixels.
    u32* PixelRow = Pixels + NumPixelsX*MinY + MinX;
    for (u32 Y = MinY; Y < MaxY; ++Y)
    {
        u32* CurrPixel = PixelRow;
        for (u32 X = MinX; X < MaxX; ++X)
        {
            *CurrPixel++ = Value; // some color/depth
        }
        PixelRow += NumPixelsX;
    }
}

I converted this routine to SIMD and to process 2x2 pixels, but I found that by doing so, I added a big setup cost before the actual loop:

#define BLOCK_SIZE 2
#define NUM_BLOCK_X (ScreenX / BLOCK_SIZE)
{
_mm_setr_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF),
_mm_setr_epi32(0, 0xFFFFFFFF, 0, 0xFFFFFFFF),
};

{
_mm_setr_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF),
_mm_setr_epi32(0xFFFFFFFF, 0, 0xFFFFFFFF, 0),
};

{
_mm_setr_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF),
_mm_setr_epi32(0, 0, 0xFFFFFFFF, 0xFFFFFFFF),
};

{
_mm_setr_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF),
_mm_setr_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0, 0),
};

inline void DrawNode(u32* Pixels, u32 MinX, u32 MaxX, u32 MinY, u32 MaxY)
{
    // NOTE: All of this is the setup cost right here
    u32 BlockMinX = MinX / BLOCK_SIZE;
    u32 BlockMaxX = MaxX / BLOCK_SIZE;
    u32 BlockMinY = MinY / BLOCK_SIZE;
    u32 BlockMaxY = MaxY / BLOCK_SIZE;

    u32 DiffMinX = MinX - BLOCK_SIZE*BlockMinX;
    u32 DiffMaxX = MaxX - BLOCK_SIZE*BlockMaxX;
    u32 DiffMinY = MinY - BLOCK_SIZE*BlockMinY;
    u32 DiffMaxY = MaxY - BLOCK_SIZE*BlockMaxY;

    if (DiffMaxX)
    {
        BlockMaxX += 1;
    }
    if (DiffMaxY)
    {
        BlockMaxY += 1;
    }

    f32* RowDepth = RenderState->DepthMap + Square(BLOCK_SIZE)*(BlockMinY*NUM_BLOCK_X + BlockMinX);
    __m128 NodeDepthVec = _mm_set1_ps(Z);
    for (u32 Y = BlockMinY; Y < BlockMaxY; ++Y)
    {
        f32* DepthBlock = RowDepth;
        for (u32 X = BlockMinX; X < BlockMaxX; ++X)
        {
            if (X == BlockMinX)
            {
            }
            if (X == BlockMaxX - 1)
            {
            }
            if (Y == BlockMinY)
            {
            }
            if (Y == BlockMaxY - 1)
            {
            }

            __m128 Compared = _mm_cmp_ps(CurrDepthVec, NodeDepthVec, _CMP_GT_OQ);

            DepthBlock += 4;
        }

        RowDepth += Square(BLOCK_SIZE)*NUM_BLOCK_X;
    }
}

(The min/max masks handle the case where the min/max values fall inside a 2x2 block of pixels: we want to mask out reads and writes to the pixels that aren't actually part of our range.)

Calculating the block min/max as well as the diff min/max all seems to be a little much, and I'm not sure if there is a much more efficient way to do it. I also wanted to take advantage of 8-wide SIMD using AVX, so I figured I would have rows of 4x4 pixels, where each block of 4x4 pixels itself stores 4 blocks of 2x2 pixels. I'm worried that doing that will add an even larger setup cost to the loop, which for my application would negate most of the benefits.

My bottom line is, I want to optimize as much as I can the process of filling a box of pixels on the screen with a color because my software renderer does it a lot every frame (4 million times currently), and I figured storing pixels in 2x2 blocks would make it faster, but I'm not sure if I'm missing some trick to more quickly calculate which pixels I have to iterate over.
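One possible way to shrink the setup described above, sketched under assumptions: because the block size is a power of two, the divisions and the `Diff` subtractions collapse into shifts and a rounding add, and the edge handling can be expressed as one 4-bit lane mask per block (a row part computed once per block row, ANDed with a column part). All names below are hypothetical, and the lane order is only inferred from the masks shown in the post. The sketch is scalar so it stays portable; a 16-entry table of `__m128i` masks indexed by the 4-bit value would be one way to feed it back into the SSE loop.

```cpp
#include <cstdint>

// Hypothetical names throughout; none of these come from the original post.
constexpr uint32_t kBlockShift = 1;                       // log2(BLOCK_SIZE) for 2x2 blocks
constexpr uint32_t kBlockMask  = (1u << kBlockShift) - 1; // BLOCK_SIZE - 1

struct BlockRange {
    uint32_t minX, minY; // first block, inclusive
    uint32_t maxX, maxY; // last block, exclusive
};

// Pixel box [minX, maxX) x [minY, maxY) -> block box. The shift replaces the
// division; adding kBlockMask before shifting rounds the max edge up, which
// replaces the DiffMax checks.
inline BlockRange ToBlockRange(uint32_t minX, uint32_t minY,
                               uint32_t maxX, uint32_t maxY) {
    return { minX >> kBlockShift, minY >> kBlockShift,
             (maxX + kBlockMask) >> kBlockShift,
             (maxY + kBlockMask) >> kBlockShift };
}

// Lanes of block column bx that lie inside [minX, maxX).
// Assumed lane order: 0=(x0,y0), 1=(x1,y0), 2=(x0,y1), 3=(x1,y1).
inline uint32_t ColumnLanes(uint32_t bx, uint32_t minX, uint32_t maxX) {
    const uint32_t x0 = bx << kBlockShift;
    uint32_t lanes = 0;
    if (x0 >= minX && x0 < maxX)         lanes |= 0x5; // left column: lanes 0 and 2
    if (x0 + 1 >= minX && x0 + 1 < maxX) lanes |= 0xA; // right column: lanes 1 and 3
    return lanes;
}

// Lanes of block row by that lie inside [minY, maxY).
inline uint32_t RowLanes(uint32_t by, uint32_t minY, uint32_t maxY) {
    const uint32_t y0 = by << kBlockShift;
    uint32_t lanes = 0;
    if (y0 >= minY && y0 < maxY)         lanes |= 0x3; // top row: lanes 0 and 1
    if (y0 + 1 >= minY && y0 + 1 < maxY) lanes |= 0xC; // bottom row: lanes 2 and 3
    return lanes;
}

// Walks the block box and counts the pixels selected by the lane masks.
// A real fill or depth test would use the mask instead of counting bits.
inline uint32_t CountCoveredPixels(uint32_t minX, uint32_t minY,
                                   uint32_t maxX, uint32_t maxY) {
    const BlockRange r = ToBlockRange(minX, minY, maxX, maxY);
    uint32_t count = 0;
    for (uint32_t by = r.minY; by < r.maxY; ++by) {
        const uint32_t rowLanes = RowLanes(by, minY, maxY); // once per block row
        for (uint32_t bx = r.minX; bx < r.maxX; ++bx) {
            const uint32_t m = rowLanes & ColumnLanes(bx, minX, maxX);
            count += (m & 1) + ((m >> 1) & 1) + ((m >> 2) & 1) + ((m >> 3) & 1);
        }
    }
    return count;
}
```

For a non-empty box, the number of selected lanes equals the box area, which makes the mask logic easy to check in isolation before wiring it into intrinsics.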

##### Share on other sites

This is something I've been meaning to get around to. I read an excellent tutorial on software renderers a few weeks back, and it isn't half as difficult as I expected (it might even be viable in some cases, because there are a lot of setup costs to doing anything on the GPU, versus no setup on the CPU, only the bandwidth for the transfer).

Nothing springs out at me straight away with the small blocks, as I've not done this yet, but have you also considered doing it as a tile renderer like mobile GPUs do? That way you just have a one-off cost of clipping your box (or triangle) to the larger tile, then you render as normal into a smaller tile and hopefully get some cache benefits. SIMD is great in my experience, but it works best when you have a whole lot of stuff to do all at once, BLAT, instead of doing a tiny bit, doing some branching, doing another bit...
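The one-off clip mentioned above could be sketched roughly like this. The `Rect` type, the half-open box convention, and the function names are assumptions made for illustration, and coordinates are assumed to be non-negative.

```cpp
#include <algorithm>

// Hypothetical type for illustration; boxes are half-open:
// [minX, maxX) x [minY, maxY), with non-negative coordinates.
struct Rect { int minX, minY, maxX, maxY; };

// Clips box to tile. Returns false if no part of the box lies inside the tile.
inline bool ClipToTile(const Rect& box, const Rect& tile, Rect& out) {
    out.minX = std::max(box.minX, tile.minX);
    out.minY = std::max(box.minY, tile.minY);
    out.maxX = std::min(box.maxX, tile.maxX);
    out.maxY = std::min(box.maxY, tile.maxY);
    return out.minX < out.maxX && out.minY < out.maxY;
}

// Half-open range of tile indices overlapped by box, for square tiles
// of tileSize pixels. The max edge is rounded up to the next tile.
inline Rect OverlappedTiles(const Rect& box, int tileSize) {
    return { box.minX / tileSize, box.minY / tileSize,
             (box.maxX + tileSize - 1) / tileSize,
             (box.maxY + tileSize - 1) / tileSize };
}
```

With 32-pixel tiles, a box spanning x in [10, 40) overlaps tile columns 0 and 1, and clipping it against the second tile leaves x in [32, 40).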

##### Share on other sites
5 hours ago, lawnjelly said:

> This is something I've been meaning to get around to. I read an excellent tutorial on software renderers a few weeks back, and it isn't half as difficult as I expected (it might even be viable in some cases, because there are a lot of setup costs to doing anything on the GPU, versus no setup on the CPU, only the bandwidth for the transfer).
>
> Nothing springs out at me straight away with the small blocks, as I've not done this yet, but have you also considered doing it as a tile renderer like mobile GPUs do? That way you just have a one-off cost of clipping your box (or triangle) to the larger tile, then you render as normal into a smaller tile and hopefully get some cache benefits. SIMD is great in my experience, but it works best when you have a whole lot of stuff to do all at once, BLAT, instead of doing a tiny bit, doing some branching, doing another bit...

Yeah, writing software renderers is actually kind of fun :).

So when you say tile rendering, you mean binning? Like dividing the screen into 16x16 blocks, figuring out which objects lie in which block, and then later rendering everything in a block at a time? I was going to do that (if that's what you are talking about) using multithreading, but I guess I should say that my renderer is not a triangle rasterizer.

I'm writing an octree voxel splatter (again). I represent the geometry in the scene as an octree, and all the leaf nodes are marked either full or empty, so the geometry is just a ton of cubes assembled with an octree. When I render, I traverse the octree in front-to-back order and render the nodes that aren't occluded. When I render a node, rather than rasterizing a cube, I find the cube's bounding box on screen and fill those pixels as if that entire box was the cube. The idea is that once I have around a single node per pixel, rasterizing a cube and rendering its min/max box become the same thing. I attached some photos below to show what I mean; I'm changing the max recursion depth so you can see what I'm rendering.

So in my circumstance, it's a little hard to do binning, mostly because the main speed-up for my algorithm is the occlusion culling, which I don't think can be made parallel. That's why I'm instead trying to optimize things like how fast I can render a box on the screen or check it against some depth value.
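The "find the cube's bounding box on screen" step described above might look something like the following. This is only a sketch: the post does not show its projection code, so the pinhole camera model, the `Vec3` and `ScreenBox` types, and every parameter name here are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical types and camera model, for illustration only.
struct Vec3 { float x, y, z; };
struct ScreenBox { int minX, minY, maxX, maxY; }; // half-open pixel box

// Projects the 8 corners of an axis-aligned cube (camera space, +z forward)
// with a simple pinhole model and returns the pixel box enclosing them.
// Assumes the whole cube is in front of the camera (center.z - half > 0).
inline ScreenBox ProjectCubeBounds(Vec3 center, float half,
                                   float focal, float cx, float cy) {
    const float inf = std::numeric_limits<float>::infinity();
    float minX = inf, minY = inf, maxX = -inf, maxY = -inf;
    for (int i = 0; i < 8; ++i) {
        // Bits 0..2 of i select the -/+ offset on x, y and z.
        const float x = center.x + ((i & 1) ? half : -half);
        const float y = center.y + ((i & 2) ? half : -half);
        const float z = center.z + ((i & 4) ? half : -half);
        const float sx = cx + focal * x / z; // perspective divide
        const float sy = cy + focal * y / z;
        minX = std::min(minX, sx);
        maxX = std::max(maxX, sx);
        minY = std::min(minY, sy);
        maxY = std::max(maxY, sy);
    }
    // Round outwards so the box never misses a partially covered pixel.
    return { static_cast<int>(std::floor(minX)), static_cast<int>(std::floor(minY)),
             static_cast<int>(std::ceil(maxX)),  static_cast<int>(std::ceil(maxY)) };
}
```

Rounding outwards is a conservative choice: it can over-fill by a pixel at the edges, which matters less and less as nodes approach pixel size.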

I've been thinking about the 4x4 blocks of pixels that I wanted to implement, and I figured that instead of having 4x4 blocks with 2x2 blocks inside, I can have a plain 4x4 block, and one AVX vector will be able to index two 2x2 blocks at a time anyway, because it's 8 wide. I attached a picture below showing what I mean.

I'm just thinking out loud right now, I have no idea if this will be any good. I was hoping there was some resource that explained how to better do stuff like this but from looking around online, I can't seem to find anything.

##### Share on other sites

Another obvious thing, if you are writing rects that are bigger than the blocks, is to predetermine which blocks will be completely filled, do the complex check only for the edge cases, and use a simpler routine for the fully filled blocks. Whether this is a win depends on the proportion of completely filled blocks, and as you say, you are eventually going down to the level of pixels.
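A rough sketch of that split for 2x2 blocks, with hypothetical names: on each axis, rounding the minimum up and the maximum down to a block boundary gives the range of fully covered blocks, and every other touched block takes the masked path. Boxes are assumed to be non-empty and half-open.

```cpp
#include <cstdint>

// Hypothetical helpers for 2x2 blocks; spans are half-open block index ranges.
struct BlockSpan { uint32_t begin, end; };

// Blocks touched at all by the non-empty pixel range [minPix, maxPix).
inline BlockSpan TouchedBlocks(uint32_t minPix, uint32_t maxPix) {
    return { minPix >> 1, (maxPix + 1) >> 1 };
}

// Blocks lying entirely inside [minPix, maxPix); the span may be empty.
inline BlockSpan FullBlocks(uint32_t minPix, uint32_t maxPix) {
    const uint32_t begin = (minPix + 1) >> 1; // round min up to a block boundary
    const uint32_t end   = maxPix >> 1;       // round max down to a block boundary
    return { begin, end > begin ? end : begin };
}

struct BlockCounts { uint32_t full, edge; };

// Classifies every touched block as fully covered or as an edge block.
inline BlockCounts ClassifyBlocks(uint32_t minX, uint32_t minY,
                                  uint32_t maxX, uint32_t maxY) {
    const BlockSpan tx = TouchedBlocks(minX, maxX), ty = TouchedBlocks(minY, maxY);
    const BlockSpan fx = FullBlocks(minX, maxX),    fy = FullBlocks(minY, maxY);
    BlockCounts counts = { 0, 0 };
    for (uint32_t by = ty.begin; by < ty.end; ++by) {
        for (uint32_t bx = tx.begin; bx < tx.end; ++bx) {
            const bool full = bx >= fx.begin && bx < fx.end &&
                              by >= fy.begin && by < fy.end;
            if (full) ++counts.full; // an unmasked load/compare/store would go here
            else      ++counts.edge; // the masked path would go here
        }
    }
    return counts;
}
```

A box that fits inside a single block, such as a one-pixel box, simply ends up with an empty full range, so it goes through the masked path without needing extra special cases.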

I'm presuming you've done lots of profiling of this version versus more naive versions? What kind of results did you get? A comparison of a naive, non-SIMD, line-by-line version versus the block version would be interesting. You could also try to make things branchless, like you do in shaders, and see if that helps. And I suspect you'd get more answers with a few more comments in the code; reading other people's intrinsics can be a little impenetrable lol!

##### Share on other sites
19 hours ago, lawnjelly said:

> Another obvious thing, if you are writing rects that are bigger than the blocks, is to predetermine which blocks will be completely filled, do the complex check only for the edge cases, and use a simpler routine for the fully filled blocks. Whether this is a win depends on the proportion of completely filled blocks, and as you say, you are eventually going down to the level of pixels.
>
> I'm presuming you've done lots of profiling of this version versus more naive versions? What kind of results did you get? A comparison of a naive, non-SIMD, line-by-line version versus the block version would be interesting. You could also try to make things branchless, like you do in shaders, and see if that helps. And I suspect you'd get more answers with a few more comments in the code; reading other people's intrinsics can be a little impenetrable lol!

I actually had a routine I wrote for doing just that. It would only do masked reads for the edges/corners of the box. The only issue was that it failed when rendering 1-pixel-sized boxes; it needed more if-statement checks to handle those correctly. I just left it alone since I figured I should at least get the easy case right.

So I did profile before and while I was writing the code, but I think I made the most rookie mistake. When I was adding SIMD to my code, I didn't just apply it to the pixel checks, I also applied it to other parts of the program. So when I benchmarked that, it was way faster than the scalar code. I originally used AVX to check a line of 8 pixels at a time, and later I did the 2x2 pixel approach. Using the Very Sleepy profiler, I saw that those routines took around 8-12% of my rendering time, so I figured they needed to be optimized. What I should have checked was whether the pixel checks alone are faster in SIMD than in scalar code.

So these are my profiling results:

Average Scalar: 611ms, Average 8x1 AVX: 652ms, Average 2x2 SSE: 644ms

With other parts of the rendering optimized for SIMD, the results are

Average Scalar: 103ms, Average 8x1 AVX: 112ms, Average 2x2 SSE: 129ms

So these are the average times it takes to render my frame, with 8x1 AVX checking a line of 8 pixels at a time, and 2x2 SSE checking a block of 2x2 pixels at a time. So yeah, this is really wild :P.

With Very Sleepy, the routine for pixel checking in scalar code takes like 2% of the total runtime, so it's already incredibly fast. I guess I should have checked this earlier :P. It may be that in the future, once my scenes get more complex, I'll have a lot more nodes being occluded, which requires checking a block of pixels (currently, it's somewhere between 10% and 20% of traversed nodes). So yeah, I lost a lot of time because of my ignorance XD.
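Timing the pixel check on its own, as discussed above, could be done with a small harness along these lines. This is a sketch, not the code from the post: the function names, the row-major depth layout, and the 64x64 test buffer are all assumptions, and the comparison direction (stored depth greater than node depth means visible) follows the `_CMP_GT_OQ` compare in the earlier snippet.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical scalar check on a row-major depth buffer: returns true if any
// stored depth in the box [minX, maxX) x [minY, maxY) is farther than z.
inline bool BoxVisibleScalar(const float* depth, int pitch,
                             int minX, int minY, int maxX, int maxY, float z) {
    for (int y = minY; y < maxY; ++y) {
        const float* row = depth + static_cast<std::ptrdiff_t>(y) * pitch;
        for (int x = minX; x < maxX; ++x) {
            if (row[x] > z) return true; // early out on the first visible pixel
        }
    }
    return false;
}

// Average wall-clock microseconds per call of fn over `iterations` calls.
template <typename Fn>
double AverageMicroseconds(Fn&& fn, int iterations) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) fn();
    const auto stop = Clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / iterations;
}

// Example harness: times the scalar check on a 64x64 buffer where the box is
// fully occluded, so every pixel in the box gets read (the worst case).
inline double TimeScalarCheck(int iterations) {
    std::vector<float> depth(64 * 64, 1.0f);
    volatile bool sink = false; // keeps the compiler from discarding the calls
    return AverageMicroseconds(
        [&] { sink = BoxVisibleScalar(depth.data(), 64, 8, 8, 24, 24, 2.0f); },
        iterations);
}
```

Running the same harness over the SIMD variants with identical boxes would give a like-for-like comparison of just the pixel checks, separate from the rest of the frame.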

I ended up looking online and found that Ryg's blog has a post on traversing images which store data in 2x2 blocks. He provided the code below, which shows how to address a particular pixel location if the texture is stored in tiles of pixels, with 4x4 pixels inside each tile:

// per-texture constants
uint tileW = 4;
uint tileH = 4;
uint widthInTiles = (width + tileW-1) / tileW;

uint tileX = x / tileW;
uint tileY = y / tileH;
uint inTileX = x % tileW;
uint inTileY = y % tileH;

pixel = image[(tileY * widthInTiles + tileX) * (tileW * tileH)
              + inTileY * tileW
              + inTileX];

The post can be found here: https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/. So the optimization would be to use the % operator instead of a multiplication and subtraction (with power-of-two tile sizes and unsigned values, the compiler can turn the / and % into shifts and ANDs). I'll probably look more into this, mainly because it's interesting to reorder 2D buffers into these kinds of formats, but it looks like for my app right now, it's a decent amount slower.
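To make the addressing above concrete, here is a small sketch that wraps the quoted formula in a function and adds a 4x4 specialisation written with explicit shifts and masks. The function names are hypothetical; the arithmetic follows the snippet quoted from the blog post.

```cpp
#include <cstdint>

// Generic form, following the quoted snippet: works for any tile size.
inline uint32_t TiledIndexGeneric(uint32_t x, uint32_t y, uint32_t widthInTiles,
                                  uint32_t tileW, uint32_t tileH) {
    const uint32_t tileX = x / tileW, tileY = y / tileH;
    const uint32_t inTileX = x % tileW, inTileY = y % tileH;
    return (tileY * widthInTiles + tileX) * (tileW * tileH)
         + inTileY * tileW
         + inTileX;
}

// Same mapping specialised for 4x4 tiles: /4 becomes >>2, %4 becomes &3,
// and *16 becomes <<4. The low 4 bits hold the position inside the tile.
inline uint32_t TiledIndex4x4(uint32_t x, uint32_t y, uint32_t widthInTiles) {
    const uint32_t tile = (y >> 2) * widthInTiles + (x >> 2);
    return (tile << 4) | ((y & 3) << 2) | (x & 3);
}
```

For an image 10 pixels wide (3 tiles per row), pixel (5, 6) sits at position (1, 2) inside tile (1, 1), so both functions return (1*3 + 1)*16 + 2*4 + 1 = 73.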

Lastly, yeah, I should have commented my code more. I guess I got too into what I was doing and didn't notice whether it was readable or not. Thanks a lot for still sticking by though.

##### Share on other sites

Yes, that old chestnut, we've all been there, where our optimization made things slower! It is well worth doing these experiments though; the lesson is to let the profiler be your guide. I often find that a SIMD version isn't faster than scalar code. Very often it is something else, like memory bandwidth or cache, that is limiting, which is why packing your data well can be such a win. The more you do it, the more you get a feel for where the bottlenecks are likely to be, and of course, it is fun!

##### Share on other sites

Yeah, it for sure is. I'll try to be more careful from now on XD, but I'm happy I implemented the changes anyway; now I have an idea of how to do it :P. I think in my case I'm completely compute bound: almost all the time is spent projecting octree nodes onto the screen. I wrote a SIMD version of that (it's what got my times from 600ms to 100ms), and I've got a couple more ideas for optimizations to make it faster!
