I agree that doubles are more expensive than floats on GPUs (not on CPUs, as you already noted)
I didn't quite note just that -- I pointed out that they have the potential to halve performance, because memory bandwidth is usually more of a bottleneck than CPU speed.
how would one calculate (and draw) precise position (of the vertices) on the globe (where even radius cannot be represented with a meter accuracy with floats) without doubles on the CPU side?
Neither floats nor doubles are a great choice for storing globe surface points relative to the globe, because both formats dedicate the bulk of their precision to representing points within the globe's core. What a waste!
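To put rough numbers on that, here is a minimal sketch that measures the gap between adjacent representable values at different distances from the globe's centre. The `spacing` helper and the 6,371 km mean radius are my own assumptions for illustration, not anything from a particular engine.

```python
import struct

def spacing(x, fmt="f"):
    """Gap between x (rounded to the format) and the next representable value above it.

    fmt is "f" for 32-bit floats or "d" for 64-bit doubles.
    x must be positive and finite.
    """
    int_fmt = "<I" if fmt == "f" else "<Q"
    # Reinterpret the float's bit pattern as an unsigned integer.
    bits = struct.unpack(int_fmt, struct.pack("<" + fmt, x))[0]
    here = struct.unpack("<" + fmt, struct.pack(int_fmt, bits))[0]
    # Adding 1 to the bit pattern gives the next float up.
    above = struct.unpack("<" + fmt, struct.pack(int_fmt, bits + 1))[0]
    return above - here

EARTH_RADIUS_M = 6_371_000.0  # assumed mean radius in metres

for distance_m in (1.0, 1_000.0, EARTH_RADIUS_M):
    print(distance_m, spacing(distance_m, "f"), spacing(distance_m, "d"))
```

With metres as the unit, a 32-bit float resolves 0.5 m steps at the surface but about a tenth of a micron one metre from the centre, which is exactly where nobody needs it.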
The surface of the Earth only varies vertically by about 20 km, so if you need sub-metre height accuracy you could use a 16-bit int to store the height difference from average (steps of roughly 0.3 m), or a 32-bit int would give you near-micron accuracy.
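A minimal sketch of that quantisation, assuming heights fall between about -11 km and +9 km. The bounds and function names are hypothetical, and for simplicity it stores an unsigned offset from the lowest point rather than a signed offset from the average; the precision is the same either way.

```python
# Assumed height range in metres: deepest trench to highest peak, about 20 km in total.
MIN_HEIGHT_M = -11_000.0
MAX_HEIGHT_M = 9_000.0
RANGE_M = MAX_HEIGHT_M - MIN_HEIGHT_M

def step_size(bits):
    """Height difference between two adjacent integer codes, in metres."""
    return RANGE_M / (2**bits - 1)

def encode_height(height_m, bits=16):
    """Quantise a height to an unsigned integer code of the given width."""
    levels = 2**bits - 1
    t = (height_m - MIN_HEIGHT_M) / RANGE_M
    t = min(max(t, 0.0), 1.0)  # clamp out-of-range heights
    return round(t * levels)

def decode_height(code, bits=16):
    """Recover the height in metres from an integer code."""
    levels = 2**bits - 1
    return MIN_HEIGHT_M + (code / levels) * RANGE_M

print(step_size(16))  # roughly 0.3 m per step
print(step_size(32))  # a few microns per step
```

The round trip loses at most half a step, so a 16-bit code stays within about 15 cm of the original height.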
If you need the globe vertices displaced horizontally as well as vertically, then you could complement the height with two spherical coordinates, or the smoothed-cube coordinates that are trendy in planetary renderers.
Why? Because a more efficient storage format takes up less space, and efficiency in memory layouts is one of the primary optimisations on modern computers (arguably more important than reducing CPU cycles -- in relative terms of bandwidth per CPU cycle, memory is getting slower and slower every day...).
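As a sketch of the spherical-coordinate variant, the following stores longitude and latitude as 32-bit integer codes and only expands to Cartesian coordinates when needed. It assumes a perfectly spherical globe with a 6,371 km mean radius, and all names are hypothetical; a smoothed-cube mapping would slot in the same way.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean radius, spherical model
ANGLE_LEVELS = 2**32          # number of codes in one full turn

def encode_angle(radians):
    """Map an angle to a 32-bit unsigned code; the value wraps every 2*pi."""
    turns = (radians / (2.0 * math.pi)) % 1.0
    return int(turns * ANGLE_LEVELS) % ANGLE_LEVELS

def decode_angle(code):
    """Recover an angle in [0, 2*pi) from its 32-bit code."""
    return (code / ANGLE_LEVELS) * 2.0 * math.pi

def to_cartesian(lon_code, lat_code, height_m):
    """Expand the compact form into x, y, z in metres from the globe's centre."""
    lon = decode_angle(lon_code)
    lat = decode_angle(lat_code)
    if lat > math.pi:
        lat -= 2.0 * math.pi  # restore negative (southern) latitudes
    r = EARTH_RADIUS_M + height_m
    return (r * math.cos(lat) * math.cos(lon),
            r * math.cos(lat) * math.sin(lon),
            r * math.sin(lat))

# Horizontal step along the equator for one code increment, in metres.
equator_step_m = 2.0 * math.pi * EARTH_RADIUS_M / ANGLE_LEVELS
print(equator_step_m)
```

Two 32-bit angles plus a 16-bit height come to 10 bytes per vertex, against 12 for three floats or 24 for three doubles, with better-than-centimetre horizontal steps at the equator.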
Please, before disapprove something generally, consider cases when and where it might be a better solution.
Keep in mind I only jumped in here because you claimed that
all applications you've developed required double precision -- that seems to be the same generalization on the other side of the fence
The main benefit of SIMD is not computation time (although it's a very nice bonus), but the amount of time you spend reading / writing data. Loading 4xfloat as a packed register is much quicker than the FPU equivalent.
I'm not sure how true that is... Yes, you can load 4 values with one instruction (just as you can do 4 of many other ops with one instruction), but those 16 bytes aren't magically transferred from RAM faster than 16 bytes requested via 'normal' means.
Many applications don't see any performance improvement after porting to SIMD (despite using ~4x fewer CPU cycles) exactly because the memory bandwidth has remained the same.
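A back-of-the-envelope model makes the point. This is only a sketch: it assumes compute and memory transfers overlap perfectly, counts a single streaming read per element, and every number in it (clock speed, bandwidth, cycles per element) is made up for illustration rather than measured.

```python
def estimated_time_s(n_elements, bytes_per_element, cycles_per_element,
                     clock_hz, bandwidth_bytes_per_s):
    """Estimate the run time of a streaming loop as the slower of compute and memory."""
    compute_s = n_elements * cycles_per_element / clock_hz
    memory_s = n_elements * bytes_per_element / bandwidth_bytes_per_s
    return max(compute_s, memory_s)

# Hypothetical machine and workload.
N = 100_000_000      # elements processed
CLOCK_HZ = 3e9       # 3 GHz core
BANDWIDTH = 10e9     # 10 GB/s sustained from RAM

scalar_float = estimated_time_s(N, 4, 1.0, CLOCK_HZ, BANDWIDTH)
simd_float = estimated_time_s(N, 4, 0.25, CLOCK_HZ, BANDWIDTH)   # ~4x fewer cycles
scalar_double = estimated_time_s(N, 8, 1.0, CLOCK_HZ, BANDWIDTH)

print(scalar_float, simd_float, scalar_double)
```

With these assumed figures the scalar and SIMD float loops take the same time because both are waiting on RAM, while switching to doubles doubles the time because twice as many bytes have to move.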