Riko Ophorst

Recommended byte size of a single vertex


I'm wondering, what is the recommended byte size of a single vertex?

Currently, we're using 80 bytes per vertex in our engine:

struct Vertex
{
  DirectX::XMFLOAT3 position;  // 0  + 12 = 12 bytes
  DirectX::XMFLOAT3 normal;    // 12 + 12 = 24 bytes
  DirectX::XMFLOAT3 tangent;   // 24 + 12 = 36 bytes
  DirectX::XMFLOAT3 bitangent; // 36 + 12 = 48 bytes
  DirectX::XMFLOAT2 uv;        // 48 + 8  = 56 bytes
  DirectX::XMFLOAT4 color;     // 56 + 16 = 72 bytes
  uint32_t bone_id;            // 72 + 4  = 76 bytes
  float bone_weight;           // 76 + 4  = 80 bytes
};

What are the things to keep in mind when setting up a vertex layout?

 


Have it be a multiple of your GPU's cache line size (which typically means a multiple of 32) and be as small as possible.

In your case there is ample opportunity for compression: you can, for example, drop the bitangent and compute it at runtime (in your vertex shader) from your normal and tangent - that's a 12-byte saving (and it may even run faster, since compute is very cheap these days). Similarly, your color could be reduced to a ubyte4 type, which is normally sufficient - another 12 bytes, and now your vertex size is 56.
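For illustration, a minimal sketch of that reconstruction using DirectXMath - the function name and the per-vertex handedness sign are assumptions for the example, not something the engine above necessarily has:

// Sketch: rebuild the bitangent from the normal and tangent. In HLSL the vertex
// shader does the same thing: float3 bitangent = cross(normal, tangent) * handedness;
// 'handedness' (+1 or -1) is only needed if you have mirrored UVs.
DirectX::XMFLOAT3 ReconstructBitangent(const DirectX::XMFLOAT3& normal,
                                       const DirectX::XMFLOAT3& tangent,
                                       float handedness)
{
    using namespace DirectX;
    XMVECTOR b = XMVector3Cross(XMLoadFloat3(&normal), XMLoadFloat3(&tangent));
    XMFLOAT3 out;
    XMStoreFloat3(&out, XMVectorScale(b, handedness));
    return out;
}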

Edited by mhagain



Is that sort of stuff worth it though? Reducing memory usage at the cost of increasing computation?




Computation is cheap - and in this case it's just a cross product per vertex, so it's not something you really need to worry excessively over.

Also worth noting that we're not talking about reducing overall memory usage as a goal here. I need to be absolutely clear on that, because (so long as you're not swapping) the amount of memory used is almost never an arbiter of performance. The goal is instead to reduce the vertex size so that it fits into fewer cache lines.


First a question - it's been a while since I've dealt with skeletal animation, but shouldn't you have 3 bone weights since you have 4 bones?

Second, the optimal layout would be for each stream to be a power of 2 in size. So if you have a full vertex of 48 bytes, split it into one 32-byte stream and one 16-byte stream (see the sketch below). Also, like previously mentioned, the smaller the better.

edit - forget the question, I misread.
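To illustrate that kind of split in D3D11 - the semantics, formats, and which attributes go in which stream are assumptions for the example, not a recommendation from this post - positions and UVs could live in a 16-byte stream in slot 0, with the remaining 32 bytes in slot 1, each slot backed by its own vertex buffer:

// Hypothetical two-stream layout (16 bytes in slot 0, 32 bytes in slot 1).
// BLENDINDICES here packs a 16-bit bone id and a 16-bit quantized weight.
const D3D11_INPUT_ELEMENT_DESC layout[] =
{
    { "POSITION",     0, DXGI_FORMAT_R32G32B32_FLOAT, 0,  0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "TEXCOORD",     0, DXGI_FORMAT_R16G16_FLOAT,    0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "NORMAL",       0, DXGI_FORMAT_R32G32B32_FLOAT, 1,  0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "TANGENT",      0, DXGI_FORMAT_R32G32B32_FLOAT, 1, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "COLOR",        0, DXGI_FORMAT_R8G8B8A8_UNORM,  1, 24, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "BLENDINDICES", 0, DXGI_FORMAT_R16G16_UINT,     1, 28, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};
// Bind one buffer per slot, e.g. context->IASetVertexBuffers(0, 2, buffers, strides, offsets);
// with strides[] = { 16, 32 }.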

Edited by Infinisearch


In your case there is ample opportunity for compression: you can, for example, drop the bitangent and compute it at runtime (in your vertex shader) from your normal and tangent - that's a 12-byte saving (and may even run faster since compute is very cheap these days).

Or go a step smaller and store a quaternion tangent frame instead of a separate normal and tangent. Crytek did a presentation a while ago on this.
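Roughly, the idea looks like this on the CPU side - this is a generic sketch of a quaternion tangent frame, not Crytek's exact scheme, and it assumes the basis has already been orthonormalized:

// Rough sketch: encode an orthonormal tangent/bitangent/normal basis as a quaternion.
// The reflection (mirrored UVs) is carried in the sign of w.
DirectX::XMFLOAT4 EncodeTangentFrame(const DirectX::XMFLOAT3& tangent,
                                     const DirectX::XMFLOAT3& bitangent,
                                     const DirectX::XMFLOAT3& normal)
{
    using namespace DirectX;
    XMVECTOR t = XMLoadFloat3(&tangent);
    XMVECTOR b = XMLoadFloat3(&bitangent);
    XMVECTOR n = XMLoadFloat3(&normal);

    // Detect a mirrored basis and make it a proper rotation before converting.
    bool reflected = XMVectorGetX(XMVector3Dot(XMVector3Cross(n, t), b)) < 0.0f;
    if (reflected) { b = XMVectorNegate(b); }

    XMMATRIX tbn(t, b, n, XMVectorSet(0.0f, 0.0f, 0.0f, 1.0f));
    XMVECTOR q = XMQuaternionNormalize(XMQuaternionRotationMatrix(tbn));

    // q and -q are the same rotation, so borrow the sign of w for the reflection flag.
    if (XMVectorGetW(q) < 0.0f) { q = XMVectorNegate(q); }
    if (reflected)              { q = XMVectorNegate(q); }
    // (Production code also biases w away from zero so the sign survives quantization.)

    XMFLOAT4 out;
    XMStoreFloat4(&out, q);
    return out;
}

The vertex shader then rotates the basis axes by the quaternion to recover normal, tangent and bitangent, flipping the bitangent when w is negative; quantized to 16 bits per component, the whole frame fits in 8 bytes.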


Normals can also be compressed into 4 bytes without quality loss -- 11_11_10 is ok, 16_16 octahedral is better.
That means you can store normal + tangent in 8 bytes.
You probably don't need full 32-bit float precision for UVs, so you could put them in 16-bit half-floats, which reduces them from 8 bytes to 4 bytes total.
If colour is not HDR, then it can be stored at 8 bits per channel, reducing it from 16 bytes to 4 bytes.
Bone ID and weight can often be 8-bit, but we've got space in the structure left over, so let's say they need to be 16-bit each.
This gives you a 32-byte vertex -- a 2.5x improvement!

struct Vertex
{
  float3 position;       // 0  + 12 = 12 bytes
  uint   normal;         // 12 + 4  = 16 bytes
  uint   tangent;        // 16 + 4  = 20 bytes
  half2  uv;             // 20 + 4  = 24 bytes
  uint   color;          // 24 + 4  = 28 bytes
  uint   bone_id_weight; // 28 + 4  = 32 bytes
};

You also need one bit to tell you whether the binormal is cross(normal,tangent) or -cross(normal,tangent), but you could squeeze that into one of the bits in the bone_id_weight field -- 15 bits for weight is still way more than required.
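To make that packing concrete, here is a hypothetical CPU-side sketch - the function names and the exact bit positions for the bone id, weight and handedness flag are assumptions for the example; the 16_16 octahedral, half-float and 8-bit-per-channel formats are the ones described above. The vertex shader decodes each field with the matching inverse:

// Hypothetical packing helpers for the 32-byte layout above (illustrative only).
#include <DirectXPackedVector.h> // DirectX::PackedVector::XMConvertFloatToHalf
#include <algorithm>             // std::clamp
#include <cmath>                 // std::abs, std::lround
#include <cstdint>

// Octahedral-encode a unit normal into two 16-bit values (the 16_16 option).
uint32_t PackOctahedral16(float x, float y, float z)
{
    float sum = std::abs(x) + std::abs(y) + std::abs(z);
    float u = x / sum;
    float v = y / sum;
    if (z < 0.0f) // fold the lower hemisphere onto the upper one
    {
        float uo = u;
        u = (1.0f - std::abs(v))  * (uo >= 0.0f ? 1.0f : -1.0f);
        v = (1.0f - std::abs(uo)) * (v  >= 0.0f ? 1.0f : -1.0f);
    }
    auto q = [](float f) { return (uint32_t)std::lround((f * 0.5f + 0.5f) * 65535.0f); };
    return q(u) | (q(v) << 16);
}

// Two half-precision UVs in one 32-bit word.
uint32_t PackUV(float u, float v)
{
    using DirectX::PackedVector::XMConvertFloatToHalf;
    return XMConvertFloatToHalf(u) | (uint32_t(XMConvertFloatToHalf(v)) << 16);
}

// LDR colour as 8 bits per channel.
uint32_t PackColorRGBA8(float r, float g, float b, float a)
{
    auto q = [](float f) { return (uint32_t)std::lround(std::clamp(f, 0.0f, 1.0f) * 255.0f); };
    return q(r) | (q(g) << 8) | (q(b) << 16) | (q(a) << 24);
}

// 16-bit bone id, 15-bit quantized weight, and the bitangent-handedness flag in the top bit.
uint32_t PackBoneIdWeightSign(uint32_t bone_id, float weight, bool flip_bitangent)
{
    uint32_t w = (uint32_t)std::lround(std::clamp(weight, 0.0f, 1.0f) * 32767.0f);
    return (bone_id & 0xFFFFu) | (w << 16) | (uint32_t(flip_bitangent) << 31);
}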

Is that sort of stuff worth it though? Reducing memory usage at the cost of increasing computation?

Computers double in computation power every two years, and double in memory bandwidth every 10 years... Which is another way to say that every decade the "bytes transferred per FLOP" performance of computers gets 16 times WORSE (32x the FLOPS over only 2x the bandwidth)! :o (on the flipside, "FLOPS available per transferred byte" gets 16 times better each decade! :D )
Optimizing for memory bandwidth has always been important for GPUs, and is getting more important with every new generation of hardware :(

Edited by Hodgman



Well not anymore, Moore's law has technically been dead for around 2+ years now. And HBM is a huge boost in memory bandwidth! Buuuut, that doesn't apply to consoles. Or even to almost any GPU out now for that matter. So for the most part a handful of decompression instructions in order to minimize memory bandwidth and size is an easy tradeoff to make.


Well not anymore, Moore's law has technically been dead for around 2+ years now. And HBM is a huge boost in memory bandwidth!

 [citation needed] :D

http://www.telegraph.co.uk/technology/2017/01/05/ces-2017-moores-law-not-dead-says-intel-boss/
Technically Moore's law is about how many transistors you can fit onto a surface, and Intel is still keeping up the pace in 2017. In practice the number of transistors you can fit onto a surface does roughly correlate with compute performance.
In recent years single-core perf has plateaued, but we can still just keep adding more and more cores to keep the performance charts growing.
On a timescale where bandwidth doubles every decade, sudden leaps like HBM don't really affect that long term trend -- or if they do, you can't be sure until a decade's time when you look back over the data :)


 


Nah, they missed the ship date for 14nm by a few months. Technically it was within "2 years" of their 22nm node, but Moore's law as usually stated is "every 18-24 months", not a flat two years. Of course an Intel PR announcement is going to tell you differently.



22nm to 14nm was a ~2.5x improvement in component density (22² / 14² ≈ 2.5), not a 2x improvement though, so Moore's law says this particular step is expected to take longer than 18-24 months :P

14nm to 10nm is a ~2x improvement (14² / 10² ≈ 2), so it should've been out last year instead of coming out this year. Still, one outlying data point doesn't change the long-term trend, yet.


Nah, they missed the ship date for 14nm by a few months. Technically it was within "2 years" of their 22nm node, but technically Moore's law states "Every 18-24 months" not years. Of course an Intel PR announcement is going to tell you differently.


We're talking about GPUs though.

