I'd like to put all the instance data with the same vertex format into the same buffers. Mainly to save on vertexAttribPointer calls (because they are very slow). However, I can't render all of those instances in one draw call (because of transparency...) so I'd need a way to start drawing with a certain instance offset, so when I draw 1 instance (because I can't batch more) from the middle of the buffer, it pulls (e.g.) the correct matrix from the accompanying vertex buffer that has them. Desktop GL has the BaseInstance versions of all instanced draw calls, but they don't exist in WebGL. Is it possible to emulate them somehow?
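To make it concrete, the only emulation I can think of so far is re-pointing the per-instance attributes at a byte offset of baseInstance × stride right before each draw, which of course reintroduces exactly the vertexAttribPointer calls I'm trying to avoid (though only for the per-instance attributes, not the vertex streams). A sketch of what I mean, with the stride and attribute layout made up for illustration:

```javascript
// Emulating drawArraysInstancedBaseInstance in WebGL 2 (or WebGL 1 with
// ANGLE_instanced_arrays) by offsetting the per-instance attribute pointers.
// INSTANCE_STRIDE and the attribute layout are illustrative, not my real ones.

const INSTANCE_STRIDE = 28; // 16-byte transform + 8-byte UVs + 4-byte color

// Byte offset into the instance buffer where instance `baseInstance` starts.
function baseInstanceByteOffset(baseInstance, stride) {
  return baseInstance * stride;
}

// Re-point every per-instance attribute at the offset, then draw.
// `attribs` = [{loc, size, type, normalized, relOffset}], describing where
// each attribute lives inside one 28-byte instance record.
function drawInstancesWithBase(gl, attribs, baseInstance, instanceCount) {
  const base = baseInstanceByteOffset(baseInstance, INSTANCE_STRIDE);
  for (const a of attribs) {
    gl.vertexAttribPointer(a.loc, a.size, a.type, a.normalized,
                           INSTANCE_STRIDE, base + a.relOffset);
    gl.vertexAttribDivisor(a.loc, 1); // advance once per instance
  }
  gl.drawArraysInstanced(gl.TRIANGLES, 0, 6, instanceCount);
}
```

That works, but it's several vertexAttribPointer calls per draw, which is what I'd hoped to get rid of in the first place.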
I'm optimizing a 2D renderer and I'm mainly looking to optimize how quads are drawn (since they are almost 100% of our rendering load... quads for text, quads for sprites, quads for tiles, etc.).
Currently I batch by duplicating everything that would be per-instance data (e.g. transform data, colors) into per-vertex data. And because I'm not instancing yet, I also have to duplicate the actual quad vertex data for each sprite, even though it's identical: they all use the same basic 1x1 quad, and all stretching and sizing is done with the transformation matrix.
Using instancing would not only let me use less memory in total, it would also let me skip a whole lot of per-frame memory copying. When the order of quads changes, right now I need to copy 4 × (4-byte color + 8-byte UVs + 16-byte transform) = 112 bytes for each rendered quad. I need to transfer positions too, of course, but that only has to happen once, because they are all the same and never change.
With instancing, all instances would share the same position stream (just 6 elements). To simulate per-vertex UVs (which differ per quad), I would add a 1-byte 'vertex ID' stream (also a buffer of 6 elements), upload the 8 bytes of UVs as instance data, and use the vertex ID to index into that UV data. So per quad I would only have to handle 1 × (16-byte transform + 8-byte UVs + 4-byte color) = 28 bytes each frame, versus 112 previously. What I currently don't do is separate all of this into different streams so that I only need to update the ones that actually change (i.e. only upload matrices each frame), for several reasons I won't get into; but if I did that as well, the ratio of improvement should be the same.
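To illustrate the layout I have in mind (field sizes as above, names made up by me):

```javascript
// Shared per-vertex streams: one 1x1 quad as two triangles (6 vertices),
// uploaded once and reused by every instance.
const quadPositions = new Float32Array([
  0, 0,  1, 0,  1, 1,   0, 0,  1, 1,  0, 1,
]);
// 1-byte vertex-ID stream: which corner each vertex is; the shader uses it
// to pick the right corner's UVs out of the per-instance UV data.
const vertexIds = new Uint8Array([0, 1, 2, 0, 2, 3]);

// Per-instance record: 16-byte transform + 8-byte UVs + 4-byte color = 28 bytes.
const INSTANCE_STRIDE = 28;
function packInstances(instances) {
  const buf = new ArrayBuffer(instances.length * INSTANCE_STRIDE);
  const f32 = new Float32Array(buf);
  const u8  = new Uint8Array(buf);
  instances.forEach((inst, i) => {
    const f = i * 7;                              // 28 bytes = 7 floats
    f32.set(inst.transform, f);                   // 4 floats, bytes 0..15
    f32.set(inst.uv, f + 4);                      // 2 floats, bytes 16..23
    u8.set(inst.color, i * INSTANCE_STRIDE + 24); // 4 bytes RGBA, 24..27
  });
  return buf;
}
```

One upload of this packed buffer per frame would replace the four-vertices-worth of copies I do today.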
(I think; I haven't implemented it, so I'm not sure.) I could achieve all of the above using emulated software instancing, but the WebGL API is limited enough that this involves various hassles (data textures, a custom instance-ID stream, etc.), so it would be neater if there were a way to just emulate base instance directly.
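For the data-texture route, the per-instance lookup would be something like this (texture width is an assumption; one RGBA32F texel holds one 16-byte transform):

```javascript
// Software-instancing fallback: transforms live in an RGBA32F data texture,
// and a per-"instance" ID stream tells the vertex shader which texel to fetch.
const TEX_WIDTH = 256; // assumed data-texture width

// Map an instance id to (x, y) texel coordinates in the data texture.
function instanceTexel(instanceId, texelsPerInstance, width) {
  const idx = instanceId * texelsPerInstance;
  return [idx % width, Math.floor(idx / width)];
}

// In the vertex shader the equivalent would be roughly:
//   vec4 t = texelFetch(u_instanceData, ivec2(x, y), 0);
// with (x, y) derived from the instance-ID attribute the same way as above.
```

That's workable, but it's exactly the kind of indirection I'd rather avoid if base instance can be emulated more directly.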