Recommended byte size of a single vertex

11 comments, last by 21st Century Moose 7 years ago

I'm wondering, what is the recommended byte size of a single vertex?

Currently, we're using 80 bytes per vertex in our engine:


struct Vertex
{
    DirectX::XMFLOAT3 position;  //  0 + 12 = 12 bytes
    DirectX::XMFLOAT3 normal;    // 12 + 12 = 24 bytes
    DirectX::XMFLOAT3 tangent;   // 24 + 12 = 36 bytes
    DirectX::XMFLOAT3 bitangent; // 36 + 12 = 48 bytes
    DirectX::XMFLOAT2 uv;        // 48 +  8 = 56 bytes
    DirectX::XMFLOAT4 color;     // 56 + 16 = 72 bytes
    uint32_t bone_id;            // 72 +  4 = 76 bytes
    float bone_weight;           // 76 +  4 = 80 bytes
};

What are the things to keep in mind when setting up a vertex layout?


Have it be a multiple of your GPU's cache line size (which typically means a multiple of 32) and be as small as possible.

In your case there is ample opportunity for compression: you can, for example, drop the bitangent and compute it at runtime (in your vertex shader) from your normal and tangent - that's a 12-byte saving (and may even run faster since compute is very cheap these days). Similarly, your color could be reduced to a ubyte4 type which is normally sufficient - another 12 bytes, and now your vertex size is 56.
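
As a rough sketch of where that lands (my illustration, not the poster's code), here is the original struct with those two changes applied, assuming DirectXMath's packed-vector types for the colour:

#include <cstdint>
#include <DirectXMath.h>
#include <DirectXPackedVector.h>

struct Vertex
{
    DirectX::XMFLOAT3 position;             //  0 + 12 = 12 bytes
    DirectX::XMFLOAT3 normal;               // 12 + 12 = 24 bytes
    DirectX::XMFLOAT3 tangent;              // 24 + 12 = 36 bytes
    DirectX::XMFLOAT2 uv;                   // 36 +  8 = 44 bytes
    DirectX::PackedVector::XMUBYTEN4 color; // 44 +  4 = 48 bytes (ubyte4, DXGI_FORMAT_R8G8B8A8_UNORM)
    uint32_t bone_id;                       // 48 +  4 = 52 bytes
    float bone_weight;                      // 52 +  4 = 56 bytes
};

// In the vertex shader, the dropped attribute comes back as one cross product,
// e.g. float3 bitangent = cross(input.normal, input.tangent) * handedness_sign;
// (handedness_sign is a placeholder for the +/-1 flip bit discussed later in this thread).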

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

> Have it be a multiple of your GPU's cache line size (which typically means a multiple of 32) and be as small as possible.
>
> In your case there is ample opportunity for compression: you can, for example, drop the bitangent and compute it at runtime (in your vertex shader) from your normal and tangent - that's a 12-byte saving (and may even run faster since compute is very cheap these days). Similarly, your color could be reduced to a ubyte4 type which is normally sufficient - another 12 bytes, and now your vertex size is 56.

Is that sort of stuff worth it though? Reducing memory usage at the cost of increasing computation?

Profile it and see.

I've seen significant improvements by reducing vertex size before.

> Is that sort of stuff worth it though? Reducing memory usage at the cost of increasing computation?


Computation is cheap - and in this case it's just a cross product per vertex, so it's not something you need to worry about excessively.

Also worth noting that we're not talking about reducing overall memory usage as a goal here. I need to be absolutely clear on that, because (so long as you're not swapping) the amount of memory used is almost never an arbiter of performance. The goal is instead to reduce the vertex size so that it fits into fewer cache lines.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

First, a question - it's been a while since I've dealt with skeletal animation, but shouldn't you have 3 bone weights since you have 4 bones?

Second, the optimal layout would have each stream be a power of two in size. So if you have a full 48-byte vertex, split it into one 32-byte stream and one 16-byte stream. Also, as previously mentioned, the smaller the better.
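
To illustrate that split, here is a hedged D3D11 sketch (the slot assignments and formats are my own example, not from this thread): a 48-byte vertex declared as a 32-byte stream in slot 0 plus a 16-byte stream in slot 1.

#include <d3d11.h>

// Hypothetical 48-byte vertex split across two streams (32 B + 16 B).
// Stream 0: position + normal + uv (32 bytes); stream 1: tangent + color (16 bytes).
const D3D11_INPUT_ELEMENT_DESC layout[] =
{
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0,  0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "NORMAL",   0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT,    0, 24, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "TANGENT",  0, DXGI_FORMAT_R32G32B32_FLOAT, 1,  0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "COLOR",    0, DXGI_FORMAT_R8G8B8A8_UNORM,  1, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};

// Bind both buffers at draw time:
//   UINT strides[] = { 32, 16 }, offsets[] = { 0, 0 };
//   context->IASetVertexBuffers(0, 2, buffers, strides, offsets);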

edit - forget the question, I misread.

-potential energy is easily made kinetic-

> In your case there is ample opportunity for compression: you can, for example, drop the bitangent and compute it at runtime (in your vertex shader) from your normal and tangent - that's a 12-byte saving (and may even run faster since compute is very cheap these days).

Or go a step smaller and store a quaternion tangent frame instead of a separate normal and tangent. Crytek did a presentation a while ago on this.

Normals can also be compressed into 4 bytes without quality loss -- 11_11_10 is OK, 16_16 octahedral is better.
That means you can store normal + tangent in 8 bytes.
You probably don't need full 32-bit float precision for UVs, so you could put them in 16-bit half-floats, which reduces them from 8 bytes to 4 bytes total.
If colour is not HDR, then it can be stored at 8 bits per channel, reducing it from 16 bytes to 4 bytes.
Bone ID and weight can often be 8-bit, but we've got space left over in the structure, so let's say they need to be 16-bit each.
This gives you a 32-byte vertex -- a 2.5x improvement!


struct Vertex
{
    float3 position;       //  0 + 12 = 12 bytes
    uint   normal;         // 12 +  4 = 16 bytes
    uint   tangent;        // 16 +  4 = 20 bytes
    half2  uv;             // 20 +  4 = 24 bytes
    uint   color;          // 24 +  4 = 28 bytes
    uint   bone_id_weight; // 28 +  4 = 32 bytes
};

You also need one bit to tell you whether the binormal is cross(normal,tangent) or -cross(normal,tangent), but you could squeeze that into one of the bits in the bone_id_weight field -- 15 bits for weight is still way more than required.
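
To make that packing concrete, here is a minimal CPU-side sketch of the three less obvious fields (the helper names are mine, and it assumes DirectXMath for the half-float conversion): the 16_16 octahedral normal, the half-float UVs, and the bone id/weight/sign word.

#include <cstdint>
#include <cmath>
#include <DirectXPackedVector.h> // XMConvertFloatToHalf

// 16_16 octahedral encode: project the unit normal onto the octahedron
// (L1-normalize), fold the lower hemisphere over the upper one, then
// quantize each component from [-1,1] to 16 bits.
uint32_t PackOctahedral16(float x, float y, float z)
{
    float invL1 = 1.0f / (std::fabs(x) + std::fabs(y) + std::fabs(z));
    float u = x * invL1;
    float v = y * invL1;
    if (z < 0.0f) // fold the lower hemisphere
    {
        float uo = u, vo = v;
        u = (1.0f - std::fabs(vo)) * (uo >= 0.0f ? 1.0f : -1.0f);
        v = (1.0f - std::fabs(uo)) * (vo >= 0.0f ? 1.0f : -1.0f);
    }
    uint32_t qu = (uint32_t)std::lround((u * 0.5f + 0.5f) * 65535.0f);
    uint32_t qv = (uint32_t)std::lround((v * 0.5f + 0.5f) * 65535.0f);
    return qu | (qv << 16);
}

// Two UVs as 16-bit half-floats in one uint.
uint32_t PackUV(float u, float v)
{
    using DirectX::PackedVector::XMConvertFloatToHalf;
    return XMConvertFloatToHalf(u) | ((uint32_t)XMConvertFloatToHalf(v) << 16);
}

// 16-bit bone id, 15-bit weight, and the binormal-sign bit in the MSB.
uint32_t PackBoneIdWeightSign(uint16_t boneId, float weight, bool flipBinormal)
{
    uint32_t qw = (uint32_t)std::lround(weight * 32767.0f); // weight in [0,1] -> 15 bits
    return boneId | (qw << 16) | ((flipBinormal ? 1u : 0u) << 31);
}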

> Is that sort of stuff worth it though? Reducing memory usage at the cost of increasing computation?

Computers double in computation power every two years, but only double in memory bandwidth every ten years... Which is another way of saying that compute grows 32x per decade while bandwidth grows only 2x, so the "bytes transferred per FLOP" performance of computers gets 16 times WORSE every decade! :o (on the flip side, "FLOPS available per transferred byte" gets 16 times better each decade! :D )
Optimizing for memory bandwidth has always been important for GPUs, and is getting more important with every new generation of hardware :(


> Is that sort of stuff worth it though? Reducing memory usage at the cost of increasing computation?
>
> Computers double in computation power every two years, but only double in memory bandwidth every ten years... Which is another way of saying that compute grows 32x per decade while bandwidth grows only 2x, so the "bytes transferred per FLOP" performance of computers gets 16 times WORSE every decade! :o (on the flip side, "FLOPS available per transferred byte" gets 16 times better each decade! :D )
> Optimizing for memory bandwidth has always been important for GPUs, and is getting more important with every new generation of hardware :(

Well, not anymore - Moore's law has technically been dead for 2+ years now. And HBM is a huge boost in memory bandwidth! Buuuut that doesn't apply to consoles - or to almost any GPU out now, for that matter. So for the most part, a handful of decompression instructions to minimize memory bandwidth and size is an easy tradeoff to make.

> Well, not anymore - Moore's law has technically been dead for 2+ years now. And HBM is a huge boost in memory bandwidth!

[citation needed] :D

http://www.telegraph.co.uk/technology/2017/01/05/ces-2017-moores-law-not-dead-says-intel-boss/
Technically, Moore's law is about how many transistors you can fit onto a surface, and Intel is still keeping up the pace in 2017. In practice, the number of transistors you can fit onto a surface does roughly correlate with compute performance.
In recent years single-core perf has plateaued, but we can still just keep adding more and more cores to keep the performance charts growing.
On a timescale where bandwidth doubles every decade, sudden leaps like HBM don't really affect that long-term trend -- or if they do, you can't be sure until a decade's time, when you look back over the data :)

