Is using vec4 vertex buffer slow?

4 comments, last by japro 11 years, 10 months ago
I'm writing an OpenGL program in core-profile style. I have been using vec3 for my vertex buffers (the input attribute of the shader). Now I want to change to vec4, because I rewrote the CPU-side maths code with SSE/SSE2.

What I'm wondering is whether using vec4 vertex buffers really slows down the shader program. I suspect a vec4 data stream would take up more video memory bandwidth, wouldn't it?

Thanks!!
On the other hand, I also think vec4 vertex buffers waste video memory, since I'm not actually using homogeneous coordinates on the CPU side; the extra component of the vec4 is just a placeholder to let SIMD work. vec3 or vec4, which would you prefer, and why?

Thanks a lot!
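
To put a number on the memory question, here is a minimal sketch of the arithmetic for a position-only buffer. The struct names, the helper bufferBytes and the one-million-vertex figure are made up for illustration, and it assumes 4-byte floats and tightly packed data.

```cpp
#include <cstddef>

// Hypothetical position-only vertex layouts (names are illustrative).
struct PosVec3 { float x, y, z; };      // 3 floats
struct PosVec4 { float x, y, z, w; };   // 4 floats; w is only padding for SIMD

// Assumption: float is 4 bytes and the structs carry no hidden padding.
static_assert(sizeof(PosVec3) == 12, "expected a 12-byte vec3 position");
static_assert(sizeof(PosVec4) == 16, "expected a 16-byte vec4 position");

// Bytes needed to store `count` tightly packed vertices of type V.
template <typename V>
constexpr std::size_t bufferBytes(std::size_t count) {
    return count * sizeof(V);
}

// Example figures for an assumed mesh of one million vertices.
constexpr std::size_t kCount = 1000000;
constexpr std::size_t kVec3Bytes = bufferBytes<PosVec3>(kCount); // 12,000,000
constexpr std::size_t kVec4Bytes = bufferBytes<PosVec4>(kCount); // 16,000,000
```

So the padded layout costs one third more memory for the position stream. Whether that shows up as a measurable slowdown is a separate question that only profiling can answer.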
Each GPU register is 4 floats wide anyway, so GPU-side it doesn't matter how many you send, you're going to be taking up a single 4-float register (disclaimer: your shader compiler may decide to pack 2 vec2s into a single register).

For bandwidth consumption, why don't you try it and see? CPU to GPU data transmission is a complex topic, and what may seem intuitive if you were talking about a pure CPU-only setup can quite often trip you up. So "it has more data, so it uses more memory, so it must be slower" can often turn out to be a completely invalid premise. If your vertex size still works out to a multiple of 16 or 32 bytes, there's a very high chance that you won't even notice any difference at all. It may even run faster if the change brings your vertex size to exactly a multiple of 16 or 32 bytes.

You've got a clear tradeoff happening here. You're accepting some extra memory overhead in exchange for (hopefully!) faster operations CPU-side, which in turn (and also hopefully) gives a measurable overall performance gain, so I don't think it's valid to use words like "waste" when describing it. The extra memory is clearly being used: you may not be doing anything directly with it, but you're definitely using it to get those faster operations.


I use 3xfloat position streams on desktop OpenGL, and 3xunsigned byte position streams on GLES2.

When I profiled it, using 3xunsigned byte position streams on desktop OpenGL was slower than using 3xfloat on my Intel HD 3000 laptop. On Android Tegra2/3, it's been the other way around: 3xbyte was faster than 3xfloat.
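
For readers who haven't used byte position streams: the post doesn't say how the positions are encoded, so the following is only one plausible scheme, a sketch that maps each coordinate from an assumed range [lo, hi] onto 0..255 and back. The function names and the range are hypothetical.

```cpp
#include <cstdint>

// Clamp t to the range [0, 1].
constexpr float clamp01(float t) {
    return t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
}

// Map a coordinate in [lo, hi] to an unsigned byte in [0, 255],
// rounding to nearest. Out-of-range input is clamped.
constexpr std::uint8_t quantize(float v, float lo, float hi) {
    return static_cast<std::uint8_t>(clamp01((v - lo) / (hi - lo)) * 255.0f + 0.5f);
}

// Inverse mapping: the scale and bias a vertex shader would apply
// to recover an approximate coordinate from the stored byte.
constexpr float dequantize(std::uint8_t q, float lo, float hi) {
    return lo + (static_cast<float>(q) / 255.0f) * (hi - lo);
}
```

With a range of [-1, 1], quantize(0.0f, -1.0f, 1.0f) gives 128 and dequantize(255, -1.0f, 1.0f) gives 1.0f. The price is precision: only 256 distinct values per axis, so this suits small or coarse meshes.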

Using 4xfloat may be a bit faster than 3xfloat on desktop, if there's some alignment benefit to the GPU, but I haven't profiled this. Perhaps that's something to try in practice?
To mhagain:

Thanks for your analysis! I didn't know "Each GPU register is 4 floats wide". If that's so, then the situation on the GPU side should be similar to the CPU side. I mean, a vec4 buffer should be, at least, no slower than vec3.

Your theory about "The extra memory is clearly being used" sounds quite reasonable. Thank you again.

To clb:

Thanks for your reply. If you get 4xfloat profiled one day, don't forget to share your result :)

Thanks for your analysis! I didn't know "Each GPU register is 4 floats wide".

And you shouldn't assume it is. Nvidia hardware, for example, doesn't work that way afaik, while AMD's does... But I guess even there it will vary between architectures. Another thing to consider is how big your vertex format ends up being. You probably want it to be a multiple of 16 bytes or 32 bytes...
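
As a rough illustration of the vertex-size point, here is a sketch with two made-up interleaved formats. The field names and the position/normal/uv layout are assumptions, not the original poster's actual format. It shows that padding the position to vec4 can push a stride off a 16- or 32-byte multiple just as easily as onto one.

```cpp
#include <cstddef>

// Hypothetical interleaved vertex formats (field names are illustrative).
struct VertexA {            // position stored as vec3
    float position[3];
    float normal[3];
    float uv[2];
};                          // 8 floats = 32 bytes

struct VertexB {            // position padded to vec4
    float position[4];
    float normal[3];
    float uv[2];
};                          // 9 floats = 36 bytes

// True when the stride of V is a whole multiple of `boundary` bytes.
template <typename V>
constexpr bool strideIsMultipleOf(std::size_t boundary) {
    return sizeof(V) % boundary == 0;
}

// Assumption: float is 4 bytes, so float-only structs have no padding.
static_assert(sizeof(VertexA) == 32, "VertexA should be 32 bytes");
static_assert(sizeof(VertexB) == 36, "VertexB should be 36 bytes");
static_assert(strideIsMultipleOf<VertexA>(32), "32 is a multiple of 32");
static_assert(!strideIsMultipleOf<VertexB>(16), "36 is not a multiple of 16");
```

For a position-only buffer the effect goes the other way: 12 bytes becomes 16. So it's worth checking the whole format rather than the position attribute alone, and then profiling on the hardware you care about.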

