
Efficient texture loading on iOS (and other mobile) devices



While working on our current game title for mobile platforms, I ran into the need for extremely efficient texture loading. Now that I have some free time, I figured I would write an article on the most efficient way to load texture data on iOS/Android devices that use a PowerVR chip.

What problem does this solve?
The main driver was the need to stream textures during gameplay because of memory limitations. This could be because you have a complex 3D world and need to stream objects into the level, or, as in our case, because you need to stream a lot of textures for 2D rendering. (We pre-render our world, objects, and character animations into texture atlases and UV-map them onto billboards to achieve high-quality visuals, lighting, and complex environments/characters.)

If you are loading textures on the fly in a 3D world and just need to claw back a few milliseconds, the problem is much less pronounced because you would be using compressed PVRTC textures, but this method still gives a significant performance benefit.

If you are in the same boat I was, needing to load textures for billboards, you are likely loading them from a lossless format such as PNG (the most common one I've seen people use), BMP, TGA, or uncompressed PVR. Using the techniques outlined in this article, you can get massive performance benefits by changing the format you use and how you load it.

Step One: File Format
If you are using your textures on objects in a 3D world, then you will definitely want to use the PVRTC format: it is a lossy, fixed-rate texture compression format that supports both 4bpp and 2bpp ARGB data.

For lossless textures, your best option is the uncompressed PVR file format. This consists of the PVR file header followed by your data in a supported pixel format such as RGBA4444, RGBA8888, BGRA8888, etc. If you are using 32-bit data (as we do for the most part), there is really no reason to use RGBA8888: the driver has to convert it CPU-side, whereas you can easily pre-process your data into BGRA8888 so it can be uploaded to memory as-is.
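Because this conversion happens once in your asset pipeline rather than on the device, it costs nothing at runtime. A minimal sketch of the swizzle, assuming tightly packed 8-bit RGBA pixels already in memory (`swizzle_rgba_to_bgra` is a hypothetical name, and writing the result back out as a PVR file is left to whatever tool you use):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Offline (build-time) conversion of tightly packed RGBA8888 pixels to BGRA8888.
// Only the red and blue bytes trade places; green and alpha stay where they are.
void swizzle_rgba_to_bgra(std::vector<uint8_t>& pixels)
{
    assert(pixels.size() % 4 == 0 && "expected whole RGBA8888 pixels");
    for (std::size_t i = 0; i < pixels.size(); i += 4)
        std::swap(pixels[i], pixels[i + 2]); // R <-> B
}
```

Since only two bytes per pixel move, the pass runs in place over the whole image with no extra buffer.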

(Notes of Interest)
- If you are using an RGB888 data format, you are making the CPU do even more unnecessary work, since it has to add a padding byte to every pixel. Always use RGBA/BGRA, even if you don't need the alpha channel.

- Using an uncompressed texture format obviously means your textures take more space on disk. This won't affect the download size of your application (the IPA/APK is a zip archive, so your content is compressed anyway), but once installed it will take more space. If you find yourself using more disk space than you would like, you can ship your textures in .zip packages, decompress what you need at initial load time or as a threaded job, and delete the extracted files when your application exits/terminates.
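For the RGB888 case in the first note, the padding can likewise be done offline so the device never has to do it. A sketch under the same build-time-tool assumption (`expand_rgb_to_bgra` is hypothetical); it writes the channels in BGRA order and fills the unused alpha byte with 0xFF so the texture stays fully opaque:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Offline conversion of tightly packed RGB888 pixels to BGRA8888.
// The missing alpha channel is filled with 0xFF (fully opaque), so the
// runtime upload never has to pad the data itself.
std::vector<uint8_t> expand_rgb_to_bgra(const std::vector<uint8_t>& rgb)
{
    assert(rgb.size() % 3 == 0 && "expected whole RGB888 pixels");
    std::vector<uint8_t> bgra;
    bgra.reserve(rgb.size() / 3 * 4);
    for (std::size_t i = 0; i < rgb.size(); i += 3)
    {
        bgra.push_back(rgb[i + 2]); // B
        bgra.push_back(rgb[i + 1]); // G
        bgra.push_back(rgb[i]);     // R
        bgra.push_back(0xFF);       // A (opaque padding byte)
    }
    return bgra;
}
```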

Step Two: File I/O
From most of the articles I've read online and the code I've seen, people load their texture data (either on their own or with a library like stb_image) by opening a file, reading the data, decompressing it if needed, and then calling glTexImage2D(...) with the RGBA format. While this approach works fine if you load all your texture data up front, as soon as you need to stream textures while the game is running you'll hit some serious bottlenecks. There are unnecessary allocation/copy operations (and possibly decompression, depending on the format) that you can get rid of extremely easily, noticeably reducing the time spent during the frame.
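For contrast, here is roughly what the conventional path looks like before any decoding happens, as a sketch using only the C standard library (`read_whole_file` is a hypothetical helper, not stb_image's API). The heap allocation and the `fread` copy are exactly the operations the memory-mapped approach removes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// The conventional path: allocate a heap buffer and copy the whole file into it.
// A PNG would additionally need a decode pass into yet another buffer.
std::vector<uint8_t> read_whole_file(const char* path)
{
    assert(path != nullptr);
    std::vector<uint8_t> buffer;
    FILE* f = std::fopen(path, "rb");
    if (!f)
        return buffer; // empty on failure
    if (std::fseek(f, 0, SEEK_END) == 0)
    {
        long size = std::ftell(f);
        if (size > 0 && std::fseek(f, 0, SEEK_SET) == 0)
        {
            buffer.resize(static_cast<std::size_t>(size)); // heap allocation
            std::size_t got = std::fread(buffer.data(), 1, buffer.size(), f); // copy
            buffer.resize(got);
        }
    }
    std::fclose(f);
    return buffer;
}
```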

The way to avoid this is to use memory-mapped file I/O. Instead of reading the file contents into a buffer you allocated, you map the file into your address space; the OS pages the contents in from its file cache on demand and, because these pages are clean and backed by the file, can drop them again under memory pressure. This can add a little latency to file access (in my measurements the kernel takes roughly 0.012ms on average to load a page on access, though I have seen up to 0.135ms), but since it removes an alloc/memcpy (which costs well above the kernel page load time), the performance gains are well worth it. Not to mention you won't need to worry about memory fragmentation if you are using the platform malloc/free calls.

To achieve this, you would do the following (to simplify the example, I pulled out error handling):
[source lang="cpp"]
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int file = open("my_texture_file.pvr", O_RDONLY);
struct stat file_status;
fstat(file, &file_status);
size_t file_size = (size_t)file_status.st_size;
void* data = mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, file, 0);
// The mapping holds its own reference to the file, so the descriptor can be closed now;
// the data stays accessible until it is unmapped.
close(file);

// When finished with this data, unmap it.
munmap(data, file_size);
[/source]
We have now created a virtual mapping and told the kernel that we only need read-only access, which gives it room to optimize: clean, read-only pages can simply be discarded under memory pressure and re-read from the file later.
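Since the example above strips out error handling, here is one way to put it back, as a sketch: a hypothetical `MappedFile` RAII wrapper (not part of any library) that checks each POSIX call and unmaps in its destructor. It closes the descriptor right away, which POSIX permits because the mapping keeps its own reference to the file:

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII wrapper around a read-only file mapping.
class MappedFile
{
public:
    explicit MappedFile(const char* path)
    {
        assert(path != nullptr);
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return; // stays invalid
        struct stat st;
        // mmap rejects a zero length, so empty files are treated as invalid.
        if (fstat(fd, &st) == 0 && st.st_size > 0)
        {
            size_ = static_cast<std::size_t>(st.st_size);
            void* p = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED)
                data_ = p;
            else
                size_ = 0;
        }
        close(fd); // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile()
    {
        if (data_)
            munmap(data_, size_);
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    bool valid() const { return data_ != nullptr; }
    const void* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    void* data_ = nullptr;
    std::size_t size_ = 0;
};
```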

(Note of interest)
Memory-mapped files are generally most effective with large files whose size is a multiple of the page size (e.g. 4096 bytes on many systems), since the unused tail of the last mapped page is wasted. Obviously, textures will not always line up with this, although for our use case it has never been an issue and we have always achieved a net performance gain.
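To put a number on that waste, a small sketch (both function names are hypothetical) computes how many bytes of the final page go unused. For example, a 256x256 BGRA8888 texture is 262144 bytes, exactly 64 pages of 4096 bytes, but adding a 52-byte legacy PVR header pushes the file 52 bytes into a 65th page, leaving 4044 bytes of that page unused:

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>

// Bytes left unused in the final page when a file of `file_size` bytes is mapped
// with pages of `page_size` bytes. Zero means the file fills its last page exactly.
std::size_t mapping_slack(std::size_t file_size, std::size_t page_size)
{
    assert(page_size > 0);
    std::size_t remainder = file_size % page_size;
    return remainder == 0 ? 0 : page_size - remainder;
}

// Page size of the running system, falling back to 4096 if sysconf fails.
std::size_t system_page_size()
{
    long size = sysconf(_SC_PAGESIZE);
    return size > 0 ? static_cast<std::size_t>(size) : 4096;
}
```

Querying the page size at runtime avoids hard-coding 4096, since some devices use larger pages.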

Step Three: Texture Upload
The first thing you want to do is read the PVR header struct at the start of the file, since it contains all the information you need. This lets you verify the format using the magic 4CC ("PVR!") and gives you all the metadata. Once you've done this, you can upload your texture data to the GPU.
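As a sketch of that verification step, assuming the legacy (version 2) PVR header laid out as 13 little-endian `uint32_t` fields (52 bytes, with the 4CC at byte offset 44), the check below reads fields with `memcpy` rather than casting the pointer, so it also works on buffers that are not 4-byte aligned. `looks_like_pvr_v2` and `read_u32` are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Field offsets in the legacy (version 2) PVR header.
constexpr std::size_t kPvrHeaderSize = 52;
constexpr std::size_t kOffsetHeaderLength = 0;
constexpr std::size_t kOffsetPvrTag = 44;
// "PVR!" read as a little-endian uint32: 'P' | 'V' << 8 | 'R' << 16 | '!' << 24.
constexpr uint32_t kPvrMagic = 0x21525650;

// Reads a uint32 with memcpy so the buffer does not need 4-byte alignment.
// Assumes a little-endian host, which holds for iOS and Android ARM devices.
static uint32_t read_u32(const uint8_t* bytes, std::size_t offset)
{
    uint32_t value;
    std::memcpy(&value, bytes + offset, sizeof value);
    return value;
}

// Returns true if `data` (of length `size`) starts with a plausible legacy PVR header.
bool looks_like_pvr_v2(const void* data, std::size_t size)
{
    if (data == nullptr || size < kPvrHeaderSize)
        return false;
    const uint8_t* bytes = static_cast<const uint8_t*>(data);
    return read_u32(bytes, kOffsetPvrTag) == kPvrMagic &&
           read_u32(bytes, kOffsetHeaderLength) == kPvrHeaderSize;
}
```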

Below is a simple way of doing this (to simplify the example, I pulled out error handling, made assumptions about constants, etc.):

[source lang="cpp"]
#include <cstdint>
// Plus your platform's GL ES headers (e.g. <OpenGLES/ES2/gl.h> and <OpenGLES/ES2/glext.h> on iOS).

// Legacy (version 2) PVR header: 13 uint32 fields, 52 bytes.
struct PvrHeader
{
    uint32_t header_length;
    uint32_t height;
    uint32_t width;
    uint32_t mipmap_count;
    uint32_t flags;
    uint32_t data_length;
    uint32_t bpp;
    uint32_t bitmask_red;
    uint32_t bitmask_green;
    uint32_t bitmask_blue;
    uint32_t bitmask_alpha;
    uint32_t pvr_tag;
    uint32_t surface_count;
};

// Pixel-type values stored in the low byte of the header flags.
static const uint32_t kPVRTC2 = 24;
static const uint32_t kPVRTC4 = 25;
static const uint32_t kBGRA8888 = 26;

const PvrHeader* header = (const PvrHeader*)data; // data being your mapped file from step two
uint32_t pvr_tag = header->pvr_tag;
// Here you would check pvr_tag against the 4CC "PVR!" (0x21525650 read little-endian) to verify.

uint32_t format_flag = header->flags & 0xFF;

// Pointer arithmetic is not allowed on void*, so step past the header as bytes.
const uint8_t* data_start = (const uint8_t*)data + header->header_length;
if (format_flag == kBGRA8888)
{
    // Note: I am assuming that you have already generated, bound, and set texture parameters.
    // BGRA uploads on iOS come from the APPLE_texture_format_BGRA8888 extension.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, header->width, header->height, 0, GL_BGRA, GL_UNSIGNED_BYTE, data_start);
}
else if (format_flag == kPVRTC4 || format_flag == kPVRTC2)
{
    // You would do the same as above but using glCompressedTexImage2D(...) with
    // GL_COMPRESSED_RGBA_PVRTC_4BPPV1_IMG or GL_COMPRESSED_RGBA_PVRTC_2BPPV1_IMG and each level's byte size.
}
[/source]
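For the PVRTC branch, glCompressedTexImage2D needs the byte size of each mip level. A sketch, assuming PVRTC1 data where 4bpp textures are stored in 4x4-pixel blocks and 2bpp textures in 8x4-pixel blocks, each block taking 8 bytes, with at least 2x2 blocks per level (`pvrtc_level_size` and `for_each_pvrtc_level` are hypothetical helpers):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte size of one PVRTC1 mip level: 8 bytes per block, minimum 2x2 blocks.
std::size_t pvrtc_level_size(uint32_t width, uint32_t height, bool is_2bpp)
{
    const uint32_t block_w = is_2bpp ? 8 : 4; // 2bpp blocks are 8x4 pixels
    const uint32_t block_h = 4;               // 4bpp blocks are 4x4 pixels
    const std::size_t blocks_x = std::max<uint32_t>(width / block_w, 2);
    const std::size_t blocks_y = std::max<uint32_t>(height / block_h, 2);
    return blocks_x * blocks_y * 8;
}

// Calls upload(level, width, height, bytes, size) for each mip level stored in
// `data`, halving the dimensions each time, until `data_length` bytes are consumed.
// Returns the number of levels visited.
template <typename UploadFn>
uint32_t for_each_pvrtc_level(const uint8_t* data, std::size_t data_length,
                              uint32_t width, uint32_t height, bool is_2bpp, UploadFn upload)
{
    std::size_t offset = 0;
    uint32_t level = 0;
    while (offset < data_length)
    {
        const std::size_t size = pvrtc_level_size(width, height, is_2bpp);
        if (offset + size > data_length)
            break; // truncated data: stop rather than read past the mapping
        upload(level, width, height, data + offset, size);
        offset += size;
        ++level;
        width = std::max<uint32_t>(width >> 1, 1);
        height = std::max<uint32_t>(height >> 1, 1);
    }
    return level;
}
```

In a real loader, `upload` would be a lambda calling glCompressedTexImage2D(GL_TEXTURE_2D, level, format, width, height, 0, size, bytes), with `format` set to GL_COMPRESSED_RGBA_PVRTC_4BPPV1_IMG or GL_COMPRESSED_RGBA_PVRTC_2BPPV1_IMG, starting at data_start with header->data_length as the total. Walking until the data is consumed avoids depending on exactly how mipmap_count counts the top level.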

Step Four: Effective Use
At this point you are using the most efficient texture format for your needs, you have the file mapped, and you can upload the data to the GPU simply by having the kernel load the pages and copy them. It should be obvious, but these are not steps you want to perform back to back every time you load a texture. For efficient usage, build a cache of mapped texture files (storing the header and the pointer to data_start) by mapping them all when the game initially loads. (John Carmack found that on iOS, for whatever reason, you only have about 700MB available, so if you need more you will have to manage your cache more carefully by calling mmap/munmap from your job pool.) When a texture is needed, you call glTexImage2D, the kernel loads the pages, the data is uploaded to the GPU in its native format, and you are ready to go. On termination, you destroy your cache by unmapping all of the files.
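A minimal sketch of such a cache, assuming one mapping per file keyed by path and a header size passed in by the caller (in practice you would read header_length from the header itself). `TextureCache` and `CachedTexture` are hypothetical types, not from any library:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <unordered_map>

// One mapped texture file: the mapping itself (which starts with the header)
// and a pointer to the pixel data that follows the header.
struct CachedTexture
{
    void* mapping = nullptr;
    std::size_t mapping_size = 0;
    const uint8_t* data_start = nullptr;
};

class TextureCache
{
public:
    TextureCache() = default;
    TextureCache(const TextureCache&) = delete;
    TextureCache& operator=(const TextureCache&) = delete;
    ~TextureCache() { clear(); }

    // Maps `path` the first time it is requested and returns the cached entry
    // afterwards; returns nullptr if the file cannot be mapped or is too small.
    const CachedTexture* acquire(const std::string& path, std::size_t header_size)
    {
        auto it = entries_.find(path);
        if (it != entries_.end())
            return &it->second;
        int fd = open(path.c_str(), O_RDONLY);
        if (fd < 0)
            return nullptr;
        struct stat st;
        void* p = MAP_FAILED;
        std::size_t size = 0;
        if (fstat(fd, &st) == 0 && static_cast<std::size_t>(st.st_size) > header_size)
        {
            size = static_cast<std::size_t>(st.st_size);
            p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        }
        close(fd); // the mapping keeps its own reference to the file
        if (p == MAP_FAILED)
            return nullptr;
        CachedTexture entry;
        entry.mapping = p;
        entry.mapping_size = size;
        entry.data_start = static_cast<const uint8_t*>(p) + header_size;
        return &entries_.emplace(path, entry).first->second;
    }

    // Unmaps every file, e.g. on termination.
    void clear()
    {
        for (auto& kv : entries_)
            munmap(kv.second.mapping, kv.second.mapping_size);
        entries_.clear();
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::unordered_map<std::string, CachedTexture> entries_;
};
```

Because `std::unordered_map` stores each element in its own node, the pointers returned by `acquire` stay valid as more textures are added and are only invalidated by `clear()`. Evicting individual entries with munmap when you approach the available address space would be a natural extension.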

What Next?
Depending on how many textures you were streaming and their sizes, you should already see a really great net performance gain from implementing the above, although with iOS 6 you can go even further. I'll save this for a future journal post, but for anyone who wants to implement it: the addition I'm talking about lets you re-upload texture data without going through the usual binding process. This lets the driver manage memory much more efficiently, avoiding allocations, fragmentation, etc. You can build a much more efficient caching system, essentially doing your GPU allocations up front and never having to do them again. If you are interested, you can find a video about this (and other iOS 6 additions, such as cheap, nearly free programmable blending, woohoo!) on Apple's developer site (you need an Apple ID) in the session Advances in OpenGL and OpenGL ES.

If you run across any errors in this post please let me know right away so I can make sure to correct it. Thanks!