I am working on a 3D viewer with WebGL, and want to optimize the particle emitters that it supports.

The information for each particle is its rectangle in world space, texture coordinates, and a color, all of which can change over time.

At first I did it the simple way, with a big buffer where each vertex had all the above information, for a total of 54 floats per particle (2 triangles, 6 vertices, 9 floats per vertex: [X, Y, Z] [U, V] [R, G, B, A]).

Note that the vertex positions here are already in world space and form a rectangle.
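
To make the layout concrete, here is a minimal sketch of filling that flat buffer for one particle. The helper name `writeParticle` and the shapes of its arguments are made up for illustration and are not the viewer's actual code.

```javascript
// Layout from the description above: 9 floats per vertex, 6 vertices per particle.
const FLOATS_PER_VERTEX = 9; // [X, Y, Z] [U, V] [R, G, B, A]
const VERTICES_PER_PARTICLE = 6; // 2 triangles
const FLOATS_PER_PARTICLE = FLOATS_PER_VERTEX * VERTICES_PER_PARTICLE; // 54

// Hypothetical helper.
// corners: four [x, y, z] world-space positions forming the rectangle
// uvs:     four [u, v] pairs, one per corner
// color:   [r, g, b, a], shared by all six vertices
function writeParticle(buffer, particleIndex, corners, uvs, color) {
  const order = [0, 1, 2, 0, 2, 3]; // two triangles sharing the 0-2 diagonal
  let offset = particleIndex * FLOATS_PER_PARTICLE;
  for (const corner of order) {
    buffer.set(corners[corner], offset); // X, Y, Z
    buffer.set(uvs[corner], offset + 3); // U, V
    buffer.set(color, offset + 5); // R, G, B, A
    offset += FLOATS_PER_VERTEX;
  }
}
```

Every particle touches all 54 floats each frame, which is where the CPU cost comes from.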

This works fine, but is a bit on the slow side on the CPU, simply because updating the buffer each frame takes a lot of work.

So the next step was to implement pseudo-instancing.

WebGL doesn't support anything besides 2D textures (and cube maps), so I have to use a 2D texture for arbitrary storage.

The idea, then, is to make a 2D texture that holds all the per-particle data (so only 9 floats are needed per particle), and to use just the instance ID and vertex ID as vertex attributes.

For example, the first particle is [0, 0, 0, 1, 0, 2, 0, 0, 0, 2, 0, 3], where each pair is the instance and vertex IDs.
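
A sketch of how such a static ID buffer could be generated for any number of particles. The function name `buildIdBuffer` is hypothetical; the vertex order is taken from the example above.

```javascript
// Builds the static attribute data: one [instanceID, vertexID] pair per vertex,
// six vertices per particle, in the order 0, 1, 2, 0, 2, 3.
function buildIdBuffer(particleCount) {
  const vertexOrder = [0, 1, 2, 0, 2, 3];
  const ids = new Float32Array(particleCount * vertexOrder.length * 2);
  let offset = 0;
  for (let instance = 0; instance < particleCount; instance++) {
    for (const vertex of vertexOrder) {
      ids[offset++] = instance;
      ids[offset++] = vertex;
    }
  }
  return ids;
}
```

This array is uploaded once and never changes, so it costs nothing per frame.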

Instead of sending world positions that already form a rectangle, I just send the center and size of each particle. A normalized rectangle is computed once and sent as a uniform, and each vertex adds its corner of that rectangle, scaled by the particle's size, to the particle's center.
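
As a CPU-side reference for that computation, a corner can be reconstructed like this. It is only a sketch: `UNIT_PLANE` is an assumed axis-aligned rectangle chosen for illustration, whereas the real uniform would presumably be oriented however the viewer needs, and `expandCorner` is a hypothetical name.

```javascript
// Assumed normalized rectangle; the real plane is computed by the viewer.
const UNIT_PLANE = [
  [-1, -1, 0],
  [1, -1, 0],
  [1, 1, 0],
  [-1, 1, 0],
];

// corner = center + plane[vertexId] * size
function expandCorner(center, size, vertexId, plane = UNIT_PLANE) {
  const p = plane[vertexId];
  return [
    center[0] + p[0] * size,
    center[1] + p[1] * size,
    center[2] + p[2] * size,
  ];
}
```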

Instead of sending precomputed texture coordinates, I send an ID that says where in the texture this particle is, and the actual coordinates are computed in the shader.

So every particle takes a total of 21 floats instead of 54: 12 for the static instance and vertex IDs (6 vertices, 2 floats each) plus 9 of per-particle data, and only those 9 need updating every frame. (For some reason I can't use non-float attributes. Is this only in WebGL? It has been quite some time since I touched OpenGL.)

For a start, I wanted to get it done quickly and not waste too much time on further optimizations, so I just picked a square power of 2 texture size that fit my needs for a test model, which happened to be 32x32 pixels.

While only 9 floats are really needed for each particle, I chose a 4x4 matrix format (16 floats, i.e. 4 RGBA texels) for now, and padded the data with zeroes.

So a 32x32 RGBA texture, in this scenario, can hold 256 particles (32 * 32 texels / 4 texels per particle).
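
A sketch of the capacity arithmetic and of packing one particle into its 16-float slot. The slot layout (position in texel 0, sub-texture index and size in texel 1, color in texel 2, texel 3 unused) follows what the vertex shader further down reads; the function names and the fields of `p` are hypothetical.

```javascript
const TEXELS_PER_PARTICLE = 4; // one 4x4 matrix = 4 RGBA texels
const FLOATS_PER_SLOT = TEXELS_PER_PARTICLE * 4; // 16 floats, 9 of them used

function particleCapacity(columns, rows) {
  return Math.floor((columns * rows) / TEXELS_PER_PARTICLE);
}

// p = { position: [x, y, z], cell: number, size: number, color: [r, g, b, a] }
function packParticle(hwarray, slot, p) {
  const o = slot * FLOATS_PER_SLOT;
  hwarray.set(p.position, o); // texel 0: xyz (w stays 0)
  hwarray[o + 4] = p.cell; // texel 1: sub-texture index
  hwarray[o + 5] = p.size; // texel 1: particle size
  hwarray.set(p.color, o + 8); // texel 2: rgba
  // texel 3 is never written and keeps its zero padding
}
```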

Even though I chose not to optimize it, it still requires far less bandwidth and far fewer updates than the original design.

Every frame, after updating all the particles, I upload the new texture data and render all the particles.
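
For illustration, the per-frame upload could be wrapped as below. This sketch is a variant that uploads only the texture rows containing live particles rather than the whole texture; `uploadParticles` and its parameters are hypothetical, and nothing here establishes whether uploading fewer rows changes the timing.

```javascript
// Uploads only the rows of the data texture that hold live particles.
// Assumes hwarray is laid out row by row with 4 RGBA texels per particle.
function uploadParticles(gl, texture, hwarray, particleCount, columns) {
  const texels = particleCount * 4;
  const rows = Math.ceil(texels / columns);
  if (rows === 0) return 0;
  gl.bindTexture(gl.TEXTURE_2D, texture);
  gl.texSubImage2D(
    gl.TEXTURE_2D, 0, // target, mip level
    0, 0, columns, rows, // x offset, y offset, width, height
    gl.RGBA, gl.FLOAT,
    hwarray.subarray(0, columns * rows * 4) // a view, no copy
  );
  return rows;
}
```

For example, 10 live particles in a 32-column texture occupy 40 texels, so 2 rows are uploaded.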

But here's the issue: for some reason, this is a whole lot slower than just using the flat, much bigger buffer.

I simply can't understand how that's even possible: I am uploading much less data and doing a lot less work on the CPU.

Is it possible that glTexSubImage2D is somehow much slower than glBufferSubData?

These are the most relevant pieces of code, and after that the vertex shader:

// Setup
this.textureColumns = 32;
this.textureRows = 32;

this.particleTexture = gl.createTexture();

gl.bindTexture(gl.TEXTURE_2D, this.particleTexture);
// Float textures require the OES_texture_float extension in WebGL 1
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, this.textureColumns, this.textureRows, 0, gl.RGBA, gl.FLOAT, null);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
gl.bindTexture(gl.TEXTURE_2D, null);


// After updating all the particles
gl.bindTexture(gl.TEXTURE_2D, this.particleTexture);
// hwarray is a Float32Array object with all the particle data
gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0, this.textureColumns, this.textureRows, gl.RGBA, gl.FLOAT, this.hwarray);
// Bind it to the sampler uniform (viewer is my WebGL wrapper)
viewer.setParameter("u_particles", 3);
// This buffer holds all the instance and vertex IDs, it never changes
gl.bindBuffer(gl.ARRAY_BUFFER, this.buffer);
// Each vertex is two floats (stride of 8 bytes): instance ID at offset 0, vertex ID at offset 4
gl.vertexAttribPointer(viewer.getParameter("a_instanceID"), 1, gl.FLOAT, false, 8, 0);
gl.vertexAttribPointer(viewer.getParameter("a_vertexID"), 1, gl.FLOAT, false, 8, 4);
// Finally draw all the particles, each one has 6 vertices
gl.drawArrays(gl.TRIANGLES, 0, 256 * 6);

The vertex shader:

uniform mat4 u_mvp; // Model-view-projection matrix
uniform mat4 u_plane; // This is the plane that each particle uses
uniform float u_cells; // This is the number of sub-textures inside the particle image
uniform float u_pixel_size; // This is the size of each pixel in relation to the texture size, so 1 / 32 in this scenario.
uniform float u_pixels; // This is the number of pixels for each row in the particle texture, 32 in this scenario
uniform sampler2D u_particles; // The actual particle data
uniform mat4 u_uvs; // This holds a normalized UV rectangle, every column's XY values are a coordinate

attribute float a_instanceID;
attribute float a_vertexID;

varying vec2 v_uv;
varying vec4 v_color;

// Gets the index-th particle as a matrix
mat4 particleAt(float index) {
  float x = u_pixel_size * mod(index, u_pixels);
  float y = u_pixel_size * floor(index / u_pixels);
  // Each particle occupies four consecutive RGBA texels, one per matrix column
  return mat4(
    texture2D(u_particles, vec2(x, y)),
    texture2D(u_particles, vec2(x + u_pixel_size, y)),
    texture2D(u_particles, vec2(x + u_pixel_size * 2., y)),
    texture2D(u_particles, vec2(x + u_pixel_size * 3., y)));
}

void main() {
  mat4 particle = particleAt(a_instanceID);
  vec3 position = particle[0].xyz; // Particle's position
  vec3 offset = u_plane[int(a_vertexID)].xyz; // The plane's vertex for the current vertex ID
  float index = particle[1][0]; // This is the sub-texture index for the texture coordinates
  float scale = particle[1][1]; // The size of this particle
  vec4 color = particle[2]; // The color of this particle
  vec2 uv = u_uvs[int(a_vertexID)].xy; // The texture coordinate for the current vertex ID
  // Final texture coordinate calculations
  vec2 cell = vec2(mod(index, u_cells), index / u_cells);
  v_uv = cell + vec2(1.0 / u_cells, 1.0 / u_cells) * uv;
  v_color = color;
  // And set the vertex to the particle's position offset by the plane and scaled to its size
  gl_Position = u_mvp * vec4(position + offset * scale, 1.0);
}

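For debugging, the address arithmetic of particleAt can be mirrored on the CPU. This is a sketch that copies the shader's math as posted; the function name `particleTexelCoords` is hypothetical.

```javascript
// Returns the four texture coordinates that particleAt(index) samples.
// pixels corresponds to u_pixels; 1 / pixels corresponds to u_pixel_size.
// For non-negative values, JS % matches GLSL mod().
function particleTexelCoords(index, pixels) {
  const pixelSize = 1 / pixels;
  const x = pixelSize * (index % pixels);
  const y = pixelSize * Math.floor(index / pixels);
  return [0, 1, 2, 3].map((i) => [x + pixelSize * i, y]);
}
```

With pixels = 32, index 0 yields x values from 0 to 3/32 and index 1 yields 1/32 to 4/32. So if a_instanceID counts particles rather than texels, neighbouring particles would fetch overlapping texels; the post does not say which it is. The coordinates also land on texel edges rather than texel centers, and adding half a texel (0.5 * u_pixel_size) is a common way to make NEAREST lookups unambiguous.
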
Thanks for any help!

