Texture formats


Reading around on the Internet, there are a few tidbits I have come across several times:

1. Modern GPUs don't have hardware support for uncompressed RGB textures. They will be converted to RGBA internally because the GPU doesn't like 3-component textures.

2. The best transfer format to use (especially on Windows?) is GL_BGRA. If you use GL_RGBA, the driver will have to swizzle your texture data when you call glTexImage2D, slowing performance.

I've read both of these in countless places, so I decided to check them using ARB_internalformat_query2. I made the appropriate glGetInternalformativ calls with GL_INTERNALFORMAT_PREFERRED and GL_TEXTURE_IMAGE_FORMAT, and what I got back was different from what I expected given the things I had read.

According to glGetInternalformativ, the internal format used when you ask for GL_RGB8 is GL_RGB8, the optimum transfer format for GL_RGB8 is GL_RGB, and the optimum transfer format for GL_RGBA8 is GL_RGBA. So is what I read outdated, or is my graphics driver lying to me?

I am using an AMD HD 5850 with the latest drivers.
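For reference, the queries in question look roughly like this (a minimal sketch only: it assumes a current context that exposes ARB_internalformat_query2 or GL 4.3, an already-initialized function loader such as GLEW, and the helper name is just something I made up):

/* Minimal sketch of the ARB_internalformat_query2 queries described above.
 * Assumes a current GL context and an already-initialized loader (e.g. GLEW). */
#include <GL/glew.h>
#include <stdio.h>

static void report_format(GLenum internalformat, const char *name)
{
    GLint preferred = 0, upload_format = 0, upload_type = 0;

    /* The internal format the driver says it will actually use. */
    glGetInternalformativ(GL_TEXTURE_2D, internalformat,
                          GL_INTERNALFORMAT_PREFERRED, 1, &preferred);

    /* The driver's preferred client-side format/type for glTex(Sub)Image uploads. */
    glGetInternalformativ(GL_TEXTURE_2D, internalformat,
                          GL_TEXTURE_IMAGE_FORMAT, 1, &upload_format);
    glGetInternalformativ(GL_TEXTURE_2D, internalformat,
                          GL_TEXTURE_IMAGE_TYPE, 1, &upload_type);

    printf("%s: preferred internal 0x%04X, upload format 0x%04X, upload type 0x%04X\n",
           name, (unsigned)preferred, (unsigned)upload_format, (unsigned)upload_type);
}

/* e.g. report_format(GL_RGB8, "GL_RGB8"); report_format(GL_RGBA8, "GL_RGBA8"); */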


RGBA, BGRA, or some other 32-bit format will be the internal format instead of RGB (24-bit). Processors generally don't operate on 24 bits at a time, so they'd pick a 32-bit format. But it's still faster to send the data across the incredibly slow bus as RGB and let the incredibly fast GPU convert it to RGBA for you.

As for RGBA and BGRA, I'd like to think video cards handle both at the same speed by now. I honestly haven't checked though. You might check if the tools you're using really are giving you RGBA, and not BGRA. I kind of thought BGRA was picked in the first place so we wouldn't have to reorder bytes before we DMA it over to the video card. You probably want to stick with the actual memory layout used by your tools so you can memcpy instead of iterating through every pixel and every color component copying them individually.

Edit: I realized my first paragraph is very misleading. It's only useful to worry about bus traffic being generated if you actually generate it, and the normal case when you have dedicated graphics memory is that color data doesn't change after you load it during the loading screen or whatever. Just keep the 24-bit option in your pocket when you know you'll be passing the data back and forth for any reason since it could easily be your performance bottleneck (obviously test to confirm you are bandwidth-limited).

Regarding RGBA vs BGRA, I would make a strong guess that most modern GPUs would have swizzle functionality built into the texture fetch hardware, so if the data is stored in one order but the shader wants it in another order, the swizzling would be free. GPUs actually have a bunch of neat functionality like this hidden away so that they can implement both the GL and D3D APIs, given their differences -- such as GL vs D3D's difference in Z range, or D3D9 vs everything-else's pixel centre coordinates...

I would guess that the Windows obsession with BGRA is probably a legacy of their software-rendered desktop manager, which probably chose BGRA ordering arbitrarily and then forced all other software to comply with it.

Not sure about your other question. When the driver says that the actual internal format is RGB, maybe it's reporting that because it's actually using "RGBX" or "XRGB" (i.e. RGB with a padding byte), but this format doesn't exist in the GL enumerations?

The best way as always is to test, and it's simple enough to knock up a quick program to test various combinations and see where the performance is.

glTexImage on its own is not good for testing with, because the driver needs to set up a texture, perform various internal checks, allocate storage, etc. The overhead of that is likely to overwhelm any actual transfer performance.

Using glTexSubImage instead lets you isolate transfer performance more reasonably: your test program initially specifies a texture (using glTexImage), then performs a bunch of timed glTexSubImage calls. By swapping out the parameters you can get a good feel for which combinations are the best to use; a sketch of such a test follows.
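Something along these lines will do the job (a rough sketch rather than a complete program: it assumes a current GL context with a loader such as GLEW, an RGBA8 texture, POSIX clock_gettime for timing, and the sizes, iteration count, and function names are arbitrary placeholders):

/* Rough sketch of a transfer benchmark: specify storage once with glTexImage2D,
 * then time repeated glTexSubImage2D calls for a given format/type pair. */
#include <GL/glew.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TEX_SIZE   1024
#define ITERATIONS 200

static double time_uploads(GLenum format, GLenum type, const void *pixels)
{
    struct timespec t0, t1;

    glFinish();                                   /* drain any pending GPU work */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; ++i)
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, TEX_SIZE, TEX_SIZE,
                        format, type, pixels);
    glFinish();                                   /* wait for the transfers to finish */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

void run_transfer_benchmark(void)                 /* call with a GL context current */
{
    void *pixels = calloc((size_t)TEX_SIZE * TEX_SIZE, 4);   /* 4 bytes per texel */
    GLuint tex;

    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    /* Allocate the texture once; this part is deliberately not timed. */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, TEX_SIZE, TEX_SIZE, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);

    printf("RGBA / UNSIGNED_BYTE:            %.4f s\n",
           time_uploads(GL_RGBA, GL_UNSIGNED_BYTE, pixels));
    printf("BGRA / UNSIGNED_INT_8_8_8_8_REV: %.4f s\n",
           time_uploads(GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels));

    glDeleteTextures(1, &tex);
    free(pixels);
}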

For OpenGL the internalFormat parameter is the only one that specifies the format of the texture itself; the format and type parameters have absolutely nothing to do with the texture format, and instead describe the data that you're sending in the last parameter of your glTex(Sub)Image call. This is made clearer if you use the newer glTexStorage API (which only takes internalFormat to describe the texture).

So there are 3 factors at work here:

  • The internal format of the texture as it's stored by the GPU/driver.
  • The format of the data that you send when filling the texture.
  • Any conversion steps that the driver needs to do in order to convert the latter to the former.

In theory the most suitable combination to use is one that allows the driver to do the equivalent of a straight memcpy, whereas the worst is one that forces the driver to allocate a new block of temporary storage, move the data component-by-component into that new block (fixing it up as it goes), then do its "equivalent of memcpy" thing, and finally release the temporary storage.
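To put that in concrete terms, the gap is roughly between the driver being able to do something like the first function below versus having to do the second (purely illustrative code, not taken from any real driver):

/* Illustrative only: the kind of work a driver might do when the incoming data
 * already matches the internal layout, versus when every texel needs fixing up. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Best case: layouts match, so the upload is the equivalent of a straight memcpy. */
void upload_matching_layout(uint8_t *staging, const uint8_t *src, size_t bytes)
{
    memcpy(staging, src, bytes);
}

/* Worst case: allocate temp storage, fix up component-by-component (here RGBA
 * to BGRA), copy the result across, then release the temp storage. */
void upload_with_conversion(uint8_t *staging, const uint8_t *src_rgba, size_t texels)
{
    uint8_t *tmp = malloc(texels * 4);
    for (size_t i = 0; i < texels; ++i) {
        tmp[i * 4 + 0] = src_rgba[i * 4 + 2];   /* B */
        tmp[i * 4 + 1] = src_rgba[i * 4 + 1];   /* G */
        tmp[i * 4 + 2] = src_rgba[i * 4 + 0];   /* R */
        tmp[i * 4 + 3] = src_rgba[i * 4 + 3];   /* A */
    }
    memcpy(staging, tmp, texels * 4);
    free(tmp);
}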

It's a few years since I've written such a program to benchmark transfers, but at the time the combination of format GL_BGRA and type GL_UNSIGNED_INT_8_8_8_8_REV was fastest overall across all hardware. That didn't mean it was measurably faster on every specific piece of hardware, and when I broke the results down by GPU vendor things got more interesting. AMD was seemingly impervious to changes in these two parameters, so it didn't really matter which you used; you got similar performance with all of them. NVIDIA was about 6x faster with format GL_BGRA than with GL_RGBA, but didn't mind so much what you used for type. Intel absolutely required both format and type to be set as described above, being about 40x faster when you used them. With the optimal parameters identified and in place, NVIDIA was overall fastest, then Intel, and finally AMD. What all of this underlines is the danger of testing a single vendor's hardware in isolation; you really do have to test on, and balance things out between, all vendors.

Regarding data volume versus data conversion overheads: at the time I tested, data conversion was by far the largest bottleneck, so it was a more than fair tradeoff to accept the extra 8 unused bits per texel in exchange for a faster transfer. That may have changed on more recent hardware, but I don't have up-to-date figures.


I would guess that the Windows obsession with BGRA is probably a legacy of their software-rendered desktop manager, which probably chose BGRA ordering arbitrarily and then forced all other software to comply with it.

I decided to omit a more detailed description, but IIRC, this (BGR / BGRA / BGRX) was basically what graphics hardware generally used.

Windows likely followed suit, mostly because this was what would be cheapest to draw into the graphics hardware's frame-buffer (no need to swap components, ...).

Also, and I suspect this is related: if you put RGB in a hex number:

0xRRGGBB

and then write it to memory in little-endian ordering, the bytes come out as:

BB GG RR 00

And if the spare bits are used for alpha:

BB GG RR AA
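A quick standalone check makes this easy to see (the output noted in the comment assumes a little-endian machine):

/* Store a 0xAARRGGBB value and inspect the individual bytes in memory. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t pixel = 0xAA112233u;            /* AA = alpha, 11 = R, 22 = G, 33 = B */
    uint8_t bytes[4];

    memcpy(bytes, &pixel, sizeof pixel);
    printf("%02X %02X %02X %02X\n", bytes[0], bytes[1], bytes[2], bytes[3]);
    /* On a little-endian machine this prints: 33 22 11 AA -- i.e. B G R A. */
    return 0;
}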

On newer hardware I don't really know for certain, but it appears that not much has changed here.

Although a little out of date, the following may be of interest - it shows the internal formats used by NVIDIA on a range of different hardware.

https://developer.nvidia.com/content/nvidia-opengl-texture-formats


RGBA, BGRA, or some other 32-bit format will be the internal format instead of RGB (24-bit). Processors generally don't operate on 24 bits at a time, so they'd pick a 32-bit format. But it's still faster to send the data across the incredibly slow bus as RGB and let the incredibly fast GPU convert it to RGBA for you.

GL_INTERNALFORMAT_PREFERRED is supposed to give you the internal format that the driver is actually going to use. So if GL_INTERNALFORMAT_PREFERRED returns GL_RGB, then the driver is saying that is how it plans on storing the data internally. So either newer Radeon cards have hardware support for 24-bit texture formats, or AMD's implementation of ARB_internalformat_query2 has its pants on fire.


Not sure about your other question. When the driver says that the actual internal format is RGB, maybe it's reporting that because it's actually using "RGBX" or "XRGB" (i.e. RGB with a padding byte), but this format doesn't exist in the GL enumerations?

I have no clue. That thought had crossed my mind too, but it is only speculation. Is there some way of actually uploading an RGB texture and then seeing definitively how much GPU memory it is taking up?

For OpenGL the internalFormat parameter is the only one that specifies the format of the texture itself; the format and type parameters have absolutely nothing to do with the texture format, and instead describe the data that you're sending in the last parameter of your glTex(Sub)Image call. This is made clearer if you use the newer glTexStorage API (which only takes internalFormat to describe the texture).

So there are 3 factors at work here:

  • The internal format of the texture as it's stored by the GPU/driver.
  • The format of the data that you send when filling the texture.
  • Any conversion steps that the driver needs to do in order to convert the latter to the former.

This is what confuses me. If the internal format is going to be RGBA, then why would a transfer format of BGRA ever be faster than RGBA? Another source of confusion for me can be found here. Under "Texture only" it says that RGB8 is a "required format" that an OpenGL implementation must support. Is the word "support" very loose? I.e., could your hardware support only 32-bit floating-point RGBA textures, with the driver converting RGB8 textures to that, and it still counts as supported?

I guess I'm going to have to do the testing as you said, but ideally this kind of testing would not be necessary at all, assuming that ARB_internalformat_query2 is present and gives good information. I was under the impression that was the whole point of this extension.

Also, what is the difference between using GL_UNSIGNED_INT_8_8_8_8(_REV) and GL_UNSIGNED_BYTE?

The packed format types store all of a pixel's color components within a single larger data type, while the non-packed types store each color component as a separate value, one after another in memory.

For example, with UNSIGNED_INT_8_8_8_8, the data type of a pixel is assumed to be a GLuint, and the first color component is stored in the most significant bits (bits 24 to 31), the second in bits 16 to 23, and so on. The actual physical order in memory then depends on the endianness of a GLuint, but you always know that the first component is the most significant 8 bits of the GLuint. On the other hand, UNSIGNED_BYTE means that each color component is 8 bits and the components are stored in consecutive memory locations.

The difference between packed and non-packed types is basically that packed types ensure that you can read color components from the value in an endian-safe way, but the physical memory storage is endian-dependent; the non-packed formats ensure that the color components are stored in a specific order in memory, but reading/writing a pixel as a whole is endian-dependent.

Edit: And the _REV variants just reverse the order of the bit ranges, so that bits 0 to 7 hold the first component, bits 8 to 15 the second, and so on.
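A small standalone example of the difference (the memory byte orders noted in the comments depend on the machine's endianness, as described above):

/* Packed vs. non-packed pixel types.
 * A packed type such as UNSIGNED_INT_8_8_8_8 puts each component in a fixed
 * bit range of one 32-bit value, so the byte order in memory depends on
 * endianness. A non-packed type such as UNSIGNED_BYTE stores the components
 * as consecutive bytes, so the memory order is fixed. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* RGBA pixel packed as UNSIGNED_INT_8_8_8_8 would describe it:
       R in bits 24-31, G in bits 16-23, B in bits 8-15, A in bits 0-7. */
    uint32_t packed = (0x11u << 24) | (0x22u << 16) | (0x33u << 8) | 0x44u;

    /* Reading by bit range is endian-safe: the same on every machine. */
    printf("R=%02X G=%02X B=%02X A=%02X\n",
           (packed >> 24) & 0xFFu, (packed >> 16) & 0xFFu,
           (packed >> 8) & 0xFFu, packed & 0xFFu);

    /* But the bytes in memory differ: little-endian gives 44 33 22 11,
       big-endian gives 11 22 33 44. */
    uint8_t mem[4];
    memcpy(mem, &packed, sizeof packed);
    printf("packed in memory:   %02X %02X %02X %02X\n",
           mem[0], mem[1], mem[2], mem[3]);

    /* The same RGBA pixel as UNSIGNED_BYTE data: always R G B A in memory. */
    uint8_t bytes[4] = { 0x11, 0x22, 0x33, 0x44 };
    printf("unpacked in memory: %02X %02X %02X %02X\n",
           bytes[0], bytes[1], bytes[2], bytes[3]);
    return 0;
}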

Here is the spec information that explains why ARB_internalformat_query2 exists, what it does, and such. I'm not sure where you heard the preferred formats are one-to-one mappings of internal formats actually used, but I've never heard that before.

And support for RGB can definitely be implemented internally as RGBA or BGRA or even GBAR if some hardware manufacturer really wanted to go crazy. We've already had video cards that did implement 3-component color data internally as 4-component. I assume the newer cards are more flexible as a result of their support for general-purpose computations.

