I've found it's often faster to use glBufferData to overwrite the entire buffer rather than use glBufferSubData, even if only part of the buffer actually needs to be updated. If you have lots of data that can be updated, divide it into multiple buffers of, for example, 256 vertices each, and try to use a method that updates as few buffers as possible each frame.
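As a rough sketch of the chunking idea, here is the bookkeeping side only, with no GL calls. The names ChunkedVertexStore, set_vertex, flush and the upload callback are all made up for illustration; in a real renderer the callback would bind that chunk's buffer object and call glBufferData with the chunk's full contents.

```python
CHUNK_SIZE = 256  # vertices per buffer, the example size mentioned above


class ChunkedVertexStore:
    """Hypothetical helper: tracks which fixed-size chunks changed this frame."""

    def __init__(self, vertex_count, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.vertices = [(0.0, 0.0, 0.0)] * vertex_count
        self.dirty = set()  # indices of chunks that need re-uploading

    def set_vertex(self, index, position):
        """Change one vertex and mark the chunk that contains it as dirty."""
        self.vertices[index] = position
        self.dirty.add(index // self.chunk_size)

    def flush(self, upload):
        """Call upload(chunk_index, chunk_vertices) once per dirty chunk.

        Each dirty chunk is handed over whole, mirroring a full glBufferData
        overwrite of that one buffer. Returns the number of chunks uploaded.
        """
        count = 0
        for chunk in sorted(self.dirty):
            start = chunk * self.chunk_size
            upload(chunk, self.vertices[start:start + self.chunk_size])
            count += 1
        self.dirty.clear()
        return count
```

With 1024 vertices, touching vertices 3, 200 and 700 marks only chunks 0 and 2 as dirty, so two buffers get rewritten instead of four.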
Not sure I understand the question exactly, but mul(vector, matrix) always treats the vector as a single row and dots it with the columns of the matrix, while mul(matrix, vector) dots each row of the matrix with the column vector (of course, as that's how matrix multiplication works). So mul(vector, matrix) == mul(T(matrix), vector).
Then for multiplication order: mul(vector, matrix1 * matrix2) == mul(T(matrix2) * T(matrix1), vector).
Whether the driver then somehow rearranges that behind the scenes to fit its preferred memory layout I don't know, but it won't matter for the calculation itself.
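If you want to convince yourself of the two identities, they are easy to check numerically. This is just a plain-Python sketch with small integer matrices; the helper names (transpose, matmul, mul_vec_mat, mul_mat_vec) are my own, not shader intrinsics, and matmul here is a true matrix product.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """True matrix product a * b (rows of a dotted with columns of b)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mul_vec_mat(v, m):
    """Row vector times matrix: v dotted with each column of m."""
    return [sum(v[k] * m[k][j] for k in range(len(v)))
            for j in range(len(m[0]))]


def mul_mat_vec(m, v):
    """Matrix times column vector: each row of m dotted with v."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


v = [1, 2, 3]
m1 = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
m2 = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]

# mul(vector, matrix) == mul(T(matrix), vector)
same_single = mul_vec_mat(v, m1) == mul_mat_vec(transpose(m1), v)

# mul(vector, matrix1 * matrix2) == mul(T(matrix2) * T(matrix1), vector)
same_product = (mul_vec_mat(v, matmul(m1, m2)) ==
                mul_mat_vec(matmul(transpose(m2), transpose(m1)), v))
```

Both comparisons come out True, and swapping the order of the transposed matrices in the second one breaks the equality for these inputs.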
Point in sphere and distance checks by themselves can be done by comparing the squared distance to the squared radius, thereby avoiding sqrt.
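A minimal sketch of the squared-distance comparison, assuming points are plain (x, y, z) tuples and that a point exactly on the surface counts as inside (that boundary choice is mine, pick whichever you need):

```python
def point_in_sphere(point, center, radius):
    """Return True if point lies inside or on the sphere, without using sqrt."""
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    dz = point[2] - center[2]
    # Compare squared distance to squared radius; both sides are non-negative,
    # so the comparison gives the same answer as comparing the real distances.
    return dx * dx + dy * dy + dz * dz <= radius * radius
```

The same trick works for "which of these two points is closer" checks: compare the squared distances directly.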
When you actually need sqrt, it's not very evil on newer desktop processors, but at the same time the other instructions have also gotten faster, so sqrt can still be slow relative to them.
There are also special instructions on many newer processors for calculating them. One reference I found put sqrt for a single float in SSE at 19 clock cycles, while the instruction for 1 / sqrt, which is only an approximation accurate to some number of bits, takes only 3 cycles. So if the approximation is good enough for your case, it would probably be the fastest way.
First, only Quadro and FirePro GPUs support 10 bit output. NOT GeForce or Radeon.
That actually isn't completely true. In fullscreen it works perfectly fine for GeForce and Radeon, and it works with HDMI. It's only 10 bit desktop modes that require the pro cards (like using it in Photoshop). The DirectX SDK has a '10-bit scanout' sample for D3D10 that shows the difference compared to 8-bit for fullscreen gradients, and it's quite a difference for such cases.
One reason to do what the OP does is that it can give smooth edges on magnified textures, and it can easily be used regardless of rendering order. The reasons for the slowdown seem to have been outlined by others, and I just wanted to point out that I have seen the opposite behavior, where adding discard for alpha < 0.5 increases performance for alpha-blended triangles that have large areas with alpha = 0. However, alpha-blended geometry drawn back to front, as it was in my case, needs no depth writes, so there were no conditional depth writes.
Using the technique only on triangles that need it (and if the reason for it is not rendering order, possibly combined with drawing affected geometry last) should limit the performance impact.
If you live in North Korea I wouldn't recommend letting the players smash your homeland. Other than that you're probably reasonably safe, though there might be similar concerns at a few other places in the world.