Let me preface this entry by saying that I'm a newbie when it comes to HLSL and GPU rendering. This is mostly a chronological record of my thought process while researching and developing my first GPU-based primitive system.
----------------------
I've been working on creating a "primitive system", or as some people call it, a "geometry system". The goal is to be able to render the common primitives as quickly as possible and to use the primitive system as the backbone for billboards, point sprites, particles, etc. In order to achieve the fastest speeds possible, I've decided to push as much computation out to the graphics card as possible, and to minimize the amount of data being sent to the graphics card. The more the GPU can do, the less the CPU has to do. The less data I need to send, the more data can fit on the GPU.
I've managed to reduce a quad vertex down to 4 bytes. Drawing a quad only requires sending 4 vertices, each 4 bytes, so in total we've got 16 bytes of VRAM for a quad. Every single quad vertex is pretty much the same for each quad, so we can get really clever. The main differentiation between one quad and the next comes down to position offset, rotation, and scale. Hey, these are commonly handled by a world transformation matrix, right?!
So, let's create two vertex buffers. The first vertex buffer will contain the per-vertex data for the quad. The second vertex buffer will store the world transformation matrix. Finally, we'll have an index buffer to specify the draw order for the quad vertices.
Here is what my four byte quad structure looks like:
```csharp
public struct QuadVertex : IVertexType
{
    /* So, we're gonna get funky here. The R,G,B components of the color denote
       any color TINT for the quad. Since we also have an alpha channel, we're
       going to store the CornerID of the vertex within it! */
    public Color Color;

    /// <summary>
    /// Creates a four byte vertex which contains only a tint color and a corner ID
    /// </summary>
    /// <param name="cornerID">A value indicating which corner of the quad this vertex belongs to</param>
    /// <param name="color">the RGB color value indicating any tinting to use</param>
    public QuadVertex(byte cornerID, Color color)
    {
        Color = color;
        Color.A = cornerID;
    }

    public static readonly VertexDeclaration VertexDeclaration = new VertexDeclaration(
        new VertexElement(0, VertexElementFormat.Color, VertexElementUsage.Color, 0) //4 bytes
    );

    public const int SizeInBytes = 4;

    VertexDeclaration IVertexType.VertexDeclaration
    {
        get { return VertexDeclaration; }
    }
}
```
Look good so far? Here's where I start to get stupid. When you create a quad, point sprite or billboard, you can set an associated world transformation matrix which has the quad instance data loaded into it. I mistakenly believed that I could use the same matrix for all four vertices. Makes sense, right? Why would you want more than one matrix if the data is going to be the same for all four vertices?
Note that I am using a manager to manage a collection of quads, so each quad in the collection will have a different world transformation matrix. Therefore, setting a world transformation matrix as a global variable in the shader just won't work.
Here is the code for storing matrix data on a per-vertex basis:
```csharp
public struct MatrixVertex
{
    //The four rows of the world matrix
    Vector4 m_1, m_2, m_3, m_4;

    public MatrixVertex(Matrix world)
    {
        m_1 = new Vector4(world.M11, world.M12, world.M13, world.M14);
        m_2 = new Vector4(world.M21, world.M22, world.M23, world.M24);
        m_3 = new Vector4(world.M31, world.M32, world.M33, world.M34);
        m_4 = new Vector4(world.M41, world.M42, world.M43, world.M44);
    }

    public static readonly VertexDeclaration VertexDeclaration = new VertexDeclaration(
        new VertexElement(0, VertexElementFormat.Vector4, VertexElementUsage.Position, 0),  //16 bytes
        new VertexElement(16, VertexElementFormat.Vector4, VertexElementUsage.Position, 1), //16 bytes
        new VertexElement(32, VertexElementFormat.Vector4, VertexElementUsage.Position, 2), //16 bytes
        new VertexElement(48, VertexElementFormat.Vector4, VertexElementUsage.Position, 3)  //16 bytes
    );

    public const int SizeInBytes = 64;
}
```
Notice that we're sitting at 64 bytes in size per quad! (Atrocious!)
But, if we can use 64 bytes for four vertices, we can sort of justify this by saying that we're really using 64/4 = 16 bytes per vertex, so we're around a total of 20 bytes per vertex -- IF we can reuse the matrix for all four vertices in the quad.
It took me a few hours to write the associated shader code. It didn't work and I couldn't figure out why. It's a royal bitch to debug shaders on a per-vertex basis, especially when the compiler detects unused code and optimizes it away (so, no inline debug code!). For the longest time, I was trying to figure out why my shader was creating garbage data. Through rigorous testing and examination in the PIX debugger, I have learned the following lessons:
-------------------------------------------------------------------------------
-The COLOR semantic converts each channel of an XNA Color into a float in the range 0.0->1.0. The channels are NOT byte values ranging from 0-255.
Pre-VS:
-The position.w value MUST be 1.0 before being multiplied by a matrix.
-Why? Because a float4x4 world matrix keeps its translation in the fourth row, and that row only gets added to the result when w is 1. On top of that, the multiplication function isn't smart enough to know how to handle vector3s. If you use a vector3, the output will be funky (I get black screens).
Post-VS:
-The position should have four components: x,y,z,w
-The position.z and position.w values should be non-zero after they are transformed by the view and projection matrices. For w, the reason is that the hardware divides x, y, and z by w after the vertex shader to get to screen space coordinates, so a w of zero is a divide by zero. I don't know why z matters.
-1.#Q0 is the shader's version of an invalid value (not a number). If you see this in your debugger, there might be a problem!
- The floating point value 0.000 is the same as -0.000; don't worry about negative zero.
-------------------------------------------------------------------------------
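To make the two w lessons concrete, here is a minimal CPU-side sketch. It assumes System.Numerics as a stand-in for the XNA math types (both use the same row-vector layout with translation in M41..M43); the class and method names are made up for illustration and are not part of my primitive system.

```csharp
using System.Numerics;

public static class HomogeneousDemo
{
    // Transforms a point: w = 1 lets the translation row (M41..M43) contribute.
    public static Vector4 TransformPoint(Vector3 p, Matrix4x4 m) =>
        Vector4.Transform(new Vector4(p, 1.0f), m);

    // Transforms a direction: w = 0 drops the translation row entirely.
    public static Vector4 TransformDirection(Vector3 d, Matrix4x4 m) =>
        Vector4.Transform(new Vector4(d, 0.0f), m);

    // The divide the hardware performs on the vertex shader output.
    // A clip-space w of zero turns this into a divide by zero.
    public static Vector3 PerspectiveDivide(Vector4 clip) =>
        new Vector3(clip.X, clip.Y, clip.Z) / clip.W;
}
```

With a translation matrix built by Matrix4x4.CreateTranslation(1, 2, 3), TransformPoint moves the origin to (1, 2, 3) while TransformDirection leaves a direction vector untouched, which is the difference between w = 1 and w = 0.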
Anyways, onto the HLSL! The idea here is that we can derive vertex-specific information if we can figure out which vertex in the quad we're working with. Since there are four vertices, we have to somehow store the vertex ID. I decided to store this value in the alpha channel of the color (the RGB values are just for tinting, not transparency). We can also send in the camera position and derive the face normal, since we know the quad position. Since we know the normal, we can also derive the left and up vectors, and thus derive each of the four corners of the quad. Finally, we apply the quad-specific world transformation matrix to each point to get it from object space into world space. I'm leading up to the huge mistake I made here...
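Before the shader itself, here is a rough CPU-side sketch of the same corner derivation, mostly so the vector math can be checked outside of PIX. It assumes System.Numerics in place of the XNA types, and the names BillboardSketch, DecodeCornerId and Corner are hypothetical. The cornerX / cornerY parameters are -1 or +1, matching the left/right and bottom/top sides of the quad.

```csharp
using System;
using System.Numerics;

public static class BillboardSketch
{
    // The COLOR semantic hands the alpha byte over as a float in [0, 1].
    // Rounding (rather than truncating) recovers the original byte value.
    public static int DecodeCornerId(float alpha) =>
        (int)MathF.Round(alpha * 255.0f);

    // World-space position of one corner of a unit-sized, camera-facing quad.
    public static Vector3 Corner(Vector3 center, Vector3 cameraPosition,
                                 Vector3 cameraUp, float cornerX, float cornerY)
    {
        // The quad normal points from the quad towards the camera.
        Vector3 normal = Vector3.Normalize(cameraPosition - center);
        // Left and up span the plane of the quad.
        Vector3 left = Vector3.Cross(normal, cameraUp);
        Vector3 up = Vector3.Cross(left, normal);
        // cornerX = -1 is the left side, cornerY = -1 is the bottom.
        return center - cornerX * left + cornerY * up;
    }
}
```

For a quad at the origin with the camera at (0, 0, 5) and an up vector of (0, 1, 0), Corner(..., -1, -1) lands at (-1, -1, 0), the bottom left as seen from the camera.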
HLSL Vertex Shader Code:
```hlsl
VSOUT VS_3DPointSpriteTex(VertexShaderInput input, WorldMatrixInput wi)
{
    VSOUT output = (VSOUT)0;

    float4x4 worldMatrix = {wi.M1, wi.M2, wi.M3, wi.M4};
    float3 m_center = {wi.M4.x, wi.M4.y, wi.M4.z};
    float3 m_normal = normalize(CameraPosition - m_center);
    float3 m_left = cross(m_normal, CameraUp);
    float3 m_up = cross(m_left, m_normal);
    float3 pos = (float3)0;
    float3 sw = m_left - m_up;
    float3 nw = m_left + m_up;

    //Note that color values are stored as floats which range from 0.0 -> 1.0.
    //Round before converting so float error can't turn 64 into 63.
    int cornerID = (int)round(input.Color.a * 255);

    //we now have our up and left vectors on the quad plane.
    //We can now figure out the corners for the quad based on size and rotation.
    if(cornerID == 0) //bottom left corner
    {
        pos += sw;
        output.TextureCoord.x = 0;
        output.TextureCoord.y = 1;
    }
    else if(cornerID == 64) //top left corner
    {
        pos += nw;
        output.TextureCoord.x = 0;
        output.TextureCoord.y = 0;
    }
    else if(cornerID == 128) //top right corner (opposite of bottom left)
    {
        pos -= sw;
        output.TextureCoord.x = 1;
        output.TextureCoord.y = 0;
    }
    else if(cornerID == 192) //bottom right corner (opposite of top left)
    {
        pos -= nw;
        output.TextureCoord.x = 1;
        output.TextureCoord.y = 1;
    }

    //w must be 1 so that the translation row of the world matrix is applied.
    float4 finalPos = mul(float4(pos, 1.0f), worldMatrix);
    float4 viewPosition = mul(finalPos, View);
    output.Position = mul(viewPosition, Projection);
    output.Color = input.Color;

    return output;
}
```
Everything looks right here, right?
So, we have three buffers going. One is an index buffer, the other two are vertex buffers. For the sake of simplicity, let's assume that our world matrix is an identity matrix. What do the buffers look like in VRAM?
Index Buffer:
0 - 0
1 - 1
2 - 2
3 - 0
4 - 2
5 - 3
(Nothing fancy here...)
Vertex Buffer 0: Byte color data in rgba format
0 - 0xFFFFFF00
1 - 0xFFFFFF04
2 - 0xFFFFFF08
3 - 0xFFFFFF0C
Vertex Buffer 1: 64 bytes used to store one vertex containing identity matrix info
0 - 1
1 - 0
2 - 0
3 - 0
4 - 0
5 - 1
6 - 0
7 - 0
(etc., for all 16 floats, 64 bytes in total)
So, what's the problem with the Vertex Shader? The code is all correct...
When we compute vertex 0 in the quad, everything works just as expected. We're pulling data from vertex 0 in Vertex Buffer 0, and bytes 0-63 from Vertex Buffer 1, because entry 0 in the index buffer points to vertex 0. However, when we move to entry 1 in the index buffer, we're now going to read in vertex 1 data. Vertex Buffer 0 poses no problems; we correctly read in the vertex 1 byte data. Vertex Buffer 1, however, doesn't have data for vertex 1, which would fall in the byte range 64-127. So, we end up reading in whatever garbage data is on the video card in this uninitialized memory block. Sometimes we'll get zeros and think all is good, but sometimes we'll read in -1.#Q0, which is not a number and ruins the matrix. Who knows what we have in there. Now, when we go to multiply our vertex position with a garbage matrix, we'll get zeros for all x,y,z values and scratch our heads wondering what the hell happened, and try to figure out why the matrix data keeps changing randomly.
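The out-of-bounds read is easy to reproduce on paper. The sketch below is only a simplified model of how a non-instanced vertex stream is addressed (index times stride); VertexFetchSketch and its methods are names I made up for the illustration, not anything from the DirectX or XNA API.

```csharp
using System;

public static class VertexFetchSketch
{
    // Byte range [Start, End) that a non-instanced stream reads for a vertex index.
    public static (int Start, int End) FetchRange(int vertexIndex, int strideInBytes) =>
        (vertexIndex * strideInBytes, (vertexIndex + 1) * strideInBytes);

    // True when the whole range lies inside a buffer of the given size.
    public static bool IsInBounds((int Start, int End) range, int bufferSizeInBytes) =>
        range.Start >= 0 && range.End <= bufferSizeInBytes;
}
```

With a 4 byte stride and a 16 byte color buffer, all four vertex indices stay in bounds. With a 64 byte stride and a 64 byte matrix buffer, only vertex 0 does; vertex 1 asks for bytes 64 up to (but not including) 128, which were never written.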
Since I explained the problem already, it's obvious that one potential fix is to repeat the world matrix four times in Vertex Buffer 1. It'll work, but it's also ridiculous. We're now looking at using 256 + 16 = 272 bytes of data per quad!
It would be *really* nice if we could have two index buffers per quad, right? One to point at the vertex drawing order, the other to point to the ID of the shared world transformation matrix. Unfortunately, this capability is not available in the current DirectX API... and it's almost certainly never going to be available. So, what is our recourse?
*sigh*
Well, I've done a bit of digging and the tentative solution is hardware instancing (MSDN Article). This is exactly what I want to use it for...
[Figure from the MSDN article. Copyright belongs to MSFT, but used under Fair Use]
And, oh hey! Look! I'm not the first person to come up with this novel idea. In fact, there's a whole article and function call designed specifically for doing what I've been trying to do. Fancy that!
Next Day:
Alright, so I was woefully ignorant. I'm sure if you read all of the garbage I posted above, you've probably been scratching your head wondering why I didn't just do what I'm about to describe.
So, the MSDN article describes, in a murky sort of fashion, how to do hardware instancing. The article alone isn't really enough to get something off the ground, so they also included a handy sample project to demo the idea. In a nutshell, what is the gist of hardware instancing?
Well, you want to use two vertex buffers. The first vertex buffer contains all of the vertex data for the object you want to instance. It's sort of like a container for a rubber stamp which you're going to use a lot. The second vertex buffer will contain instance specific vertex data, for each instance you want to use. You'll also have just one index buffer, matching the draw order of your first vertex buffer shape. Since I'm working with quads, my vertex buffer will only contain four vertices. This is very small!
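As a mental model of what the hardware does with those two buffers, here is a small CPU-side emulation. It is only a sketch: InstancingSketch and Fetches are hypothetical names, and the real work happens in the driver and GPU rather than in a loop like this.

```csharp
using System.Collections.Generic;

public static class InstancingSketch
{
    // Lists which element of each stream a vertex shader invocation sees:
    // stream 0 (the shared quad) is addressed by the index buffer,
    // stream 1 (the instance data) advances only once per instance.
    public static IEnumerable<(int VertexIndex, int InstanceIndex)> Fetches(
        int[] indices, int instanceCount)
    {
        for (int instance = 0; instance < instanceCount; instance++)
        {
            foreach (int index in indices)
            {
                yield return (index, instance);
            }
        }
    }
}
```

Feeding it the quad index buffer { 0, 1, 2, 0, 2, 3 } and two instances yields twelve fetches: the first six all pair with instance 0 and the last six with instance 1, so every corner of a quad sees the same instance data.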
In my previous day's work, I was thinking that I'd have to duplicate each vertex for each instance, so I was trying to be very clever in my approach to reduce the vertex byte size to 4 bytes. My total size for a quad would be 16 bytes, and the rest of the quad data would be inferred by my GPU shader. Well, that's workable but it also adds a bit of additional complexity. It turns out that with hardware instancing, I will reuse the same vertex information for each instance, so whether I have one or a thousand instances on the screen, my first vertex buffer will be unchanged in size. Therefore, the reasoning goes, there's no point in trying to skimp on memory footprint size for the first vertex buffer. I can put all the data I need to use into the vertex declaration and remove the portions of shader code which try to infer vertex information based on assumptions. Fewer instructions on the GPU means faster performance!
As for the second vertex buffer, it stores a float4x4 matrix for the world transforms. Automatically, it's already 64 bytes in size per instance. Last night as I was falling asleep, I was thinking that I *could* get away with storing the values used to create a world transformation matrix. It would look something like this:
```csharp
struct InstanceVertex
{
    Vector4 position_scale;
    Vector3 rotations;

    public InstanceVertex(Vector3 position, float scale, float x_rot, float y_rot, float z_rot)
    {
        //xyz holds the position, w holds the uniform scale
        position_scale = new Vector4(position, scale);
        rotations = new Vector3(x_rot, y_rot, z_rot);
        //and then, the world xform matrix would be created in the shader.
    }

    public const int ByteSize = 28;
}
```
Considering that I could have hundreds of thousands of instances, any savings in vertex byte size is huge. 28 bytes looks a lot better than 64 bytes. In addition, if my per-instance memory budget was 64 bytes, I could still afford to add in a bunch of extra miscellaneous data, such as velocity, angular velocities, life spans, etc. and create a sexy GPU based particle system.
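To show what "created in the shader" would roughly involve, here is a CPU-side sketch of rebuilding the world matrix from the packed values, plus the buffer size arithmetic. It assumes System.Numerics instead of XNA, the names are hypothetical, and the yaw/pitch/roll rotation order is purely an assumption for the example.

```csharp
using System.Numerics;

public static class PackedInstanceSketch
{
    public const int MatrixBytes = 64; // float4x4: 16 floats * 4 bytes
    public const int PackedBytes = 28; // Vector4 + Vector3: 7 floats * 4 bytes

    // Rebuilds a world matrix from the packed values: scale, then rotate, then translate.
    // The rotation order used here is an assumption, not something I have settled on.
    public static Matrix4x4 BuildWorld(Vector4 positionScale, Vector3 rotations)
    {
        Matrix4x4 scale = Matrix4x4.CreateScale(positionScale.W);
        Matrix4x4 rotation = Matrix4x4.CreateFromYawPitchRoll(
            rotations.Y, rotations.X, rotations.Z);
        Matrix4x4 translation = Matrix4x4.CreateTranslation(
            positionScale.X, positionScale.Y, positionScale.Z);
        return scale * rotation * translation;
    }

    // Total bytes an instance buffer needs for a given per-instance layout.
    public static long BufferBytes(int instanceCount, int bytesPerInstance) =>
        (long)instanceCount * bytesPerInstance;
}
```

For 100,000 instances this works out to 2,800,000 bytes for the packed layout versus 6,400,000 bytes for full matrices, and a 64 byte budget leaves 36 bytes (nine floats) per instance for extras.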
On the other hand, it is a lot more flexible to use a world transformation matrix. Using a world xform matrix, I could certainly do the common Scale, Rotate, Translate (SRT) transforms, but it would also support variations and additions on the SRT which I'd lose out on with the method mentioned above.
Anyways, I did get hardware instancing to work for my point sprites. The code is pretty straightforward if you look at the Microsoft sample code, so I'll leave that discussion out. There were a few gotchas with the shader code though.
First off, if you send in a float4x4 matrix as a vertex shader input parameter, for some mysterious reason which I cannot fathom, it arrives rotated and needs to be transposed before you use it.
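I haven't confirmed the cause, but one plausible explanation is that the shader treats each incoming float4 as a column of the matrix, while the vertex buffer was filled with rows. The sketch below only demonstrates that this kind of mix-up produces exactly a transposed matrix; it assumes System.Numerics in place of XNA, and TransposeSketch, Rows and FromColumns are hypothetical names.

```csharp
using System.Numerics;

public static class TransposeSketch
{
    // Splits a matrix into the four row vectors that get written to the vertex buffer.
    public static Vector4[] Rows(Matrix4x4 m) => new[]
    {
        new Vector4(m.M11, m.M12, m.M13, m.M14),
        new Vector4(m.M21, m.M22, m.M23, m.M24),
        new Vector4(m.M31, m.M32, m.M33, m.M34),
        new Vector4(m.M41, m.M42, m.M43, m.M44),
    };

    // Builds a matrix whose COLUMNS are the given vectors, mimicking a reader
    // that interprets each uploaded float4 as a column instead of a row.
    public static Matrix4x4 FromColumns(Vector4 c1, Vector4 c2, Vector4 c3, Vector4 c4) =>
        new Matrix4x4(
            c1.X, c2.X, c3.X, c4.X,
            c1.Y, c2.Y, c3.Y, c4.Y,
            c1.Z, c2.Z, c3.Z, c4.Z,
            c1.W, c2.W, c3.W, c4.W);
}
```

Running a translation matrix through Rows and then FromColumns moves the translation out of the fourth row and into the fourth column; Matrix4x4.Transpose puts it back, which matches the fix I ended up using in the shader.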
Since I was writing the shader for point sprites, which have the behavior of always facing the camera, I had to manually process the vertex transforms. The vertex transforms are stored in a world matrix, so I had to figure out how to pull out translations, scaling, and rotations.
The translations are the easiest to pull out. They're simply stored in the Matrix._41, Matrix._42, and Matrix._43 cells.
Scaling values are a bit more challenging. First, you have to grab the determinant of the matrix, which gives you a single float value. For a world matrix built from a uniform scale, a rotation, and a translation, the rotation and translation don't change the determinant, so this value is the scale of the matrix cubed. To get the matrix scale, you have to take the cube root of the determinant (note: you can get roots by raising a value to a fractional power, i.e., 2^(1/3) is the cube root of 2). This only holds when the scale is the same on all three axes.
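Written out, the determinant argument looks like this. It is a sketch under the assumption that the world matrix is a uniform scale s, a proper rotation R (so det R = 1), and a translation t, in the row-vector layout that puts the translation in _41.._43. The symbols s, R and t are introduced here just for the derivation.

```latex
M = \begin{pmatrix} sR & \mathbf{0} \\ \mathbf{t} & 1 \end{pmatrix},
\qquad
\det M = \det(sR) \cdot 1 = s^{3} \det R = s^{3}
\quad\Longrightarrow\quad
s = \sqrt[3]{\det M}
```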
The rotation values are a bit more bullshitty to deal with because you can have multiple rotations stored in the matrix. Luckily, we're dealing with point sprites, so the only rotation we're going to deal with is a rotation around the z-axis. The other two rotations are dependent on the vector from the center position to the camera position. That makes it really easy to deal with: Matrix._11 will contain the scale times the cosine of theta, so getting theta is as straightforward as dividing by the scale and taking the inverse cosine.
Again, none of this would be necessary if I just passed in the instanced values directly instead of through a transformation matrix...
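For reference, the three extractions can be checked on the CPU. This is a minimal sketch assuming System.Numerics instead of XNA and a world matrix made of a uniform scale, a z-rotation and a translation; SpriteTransformSketch and its methods are hypothetical names. Note that it swaps the inverse cosine for atan2 over the _11 and _12 cells, because the inverse cosine alone cannot tell a rotation of +theta from -theta.

```csharp
using System;
using System.Numerics;

public static class SpriteTransformSketch
{
    // Translation lives in the fourth row (_41, _42, _43).
    public static Vector3 Translation(Matrix4x4 world) =>
        new Vector3(world.M41, world.M42, world.M43);

    // For uniform scale + rotation + translation, det = scale^3.
    public static float UniformScale(Matrix4x4 world) =>
        MathF.Cbrt(world.GetDeterminant());

    // _11 = scale * cos(theta) and _12 = scale * sin(theta) for a z-rotation,
    // so atan2 recovers theta including its sign (the scale cancels out).
    public static float RotationZ(Matrix4x4 world) =>
        MathF.Atan2(world.M12, world.M11);
}
```

Building a matrix as CreateScale(2) * CreateRotationZ(0.5f) * CreateTranslation(1, 2, 3) and feeding it through these helpers should give back a scale of about 2, an angle of about 0.5 radians and the original translation, up to float rounding.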
Here is the resulting shader code for point sprites:
```hlsl
//3D Textured point sprites
VSOUT VS_3DPointSpriteTex(VertexShaderInput input, float4x4 instanceTransform : BLENDWEIGHT)
{
    /* SUMMARY: A point sprite is a special type of quad which will always face the camera.
       The point sprite can be scaled and rotated around the camera-sprite axis (normal) by
       any arbitrary angle. Because of these special behaviors, we have to apply some special
       instructions beyond just multiplying a point by the world matrix. */

    //I have no idea why, but the input matrix is on its side. To fix this, we transpose it.
    float4x4 m_world = transpose(instanceTransform);

    //the scale can be found by taking a cube root of the determinant of the world matrix.
    float m_scale = pow(determinant(m_world), (1.0f/3.0f));

    //Derive the radian value of the rotation around the Z-axis in object space
    float m_rotation = acos(clamp((m_world._11 / m_scale), -1.0f, 1.0f));

    //this is the transformed center position for the quad.
    float3 m_center = {m_world._41, m_world._42, m_world._43};

    //the normal is going to be dependent on the camera position and the center position
    float3 m_normal = normalize(CameraPosition - m_center);

    //the left vector can be derived from the camera orientation and quad normal
    float3 m_left = cross(m_normal, CameraUp);

    //the up vector is simply a cross of the left vector and normal vector
    float3 m_up = cross(m_left, m_normal);

    //Create a rotation matrix around the object space normal axis by the given radian amount
    //(CreateRotation is a helper function, not shown here).
    //This rotation matrix must then be applied to the left and up vectors.
    float3x3 m_rot = CreateRotation(m_rotation, m_normal);
    m_left = mul(m_left, m_rot) * m_scale; //apply rotation and scale to the left vector
    m_up = mul(m_up, m_rot) * m_scale;     //apply rotation and scale to the up vector

    //Since we have to orient our quad to always face the camera, we have to change the input
    //position values based on the left and up vectors. The left and up vectors are in
    //untranslated space. We know the translation, so we just set the vertex position to be
    //the translation added to the rotated and scaled left/up vectors.
    float3 pos = (float3)0;
    if(input.Position.x == -1 && input.Position.y == -1) //bottom left corner
    {
        pos = m_center + (m_left - m_up);
    }
    else if(input.Position.x == -1 && input.Position.y == 1) //top left corner
    {
        pos = m_center + (m_left + m_up);
    }
    else if(input.Position.x == 1 && input.Position.y == 1) //top right corner
    {
        pos = m_center - (m_left - m_up);
    }
    else //bottom right corner
    {
        pos = m_center - (m_left + m_up);
    }

    //Since we've already manually applied our world transformations, we can skip that
    //matrix multiplication. Note that we HAVE to use a Vector4 for the world position
    //because our view & projection matrices are 4x4. The "w" value must be 1.
    float4 worldPosition = 1.0f;
    worldPosition.xyz = pos;

    VSOUT output;
    output.Position = mul(mul(worldPosition, View), Projection);
    output.Color = input.Color;
    output.TexCoord = input.TexCoord;
    return output;
}
```
As a proof of concept, I tried to render 1,000 point sprites using instancing. I was able to do it in one draw call at a framerate of 900+!