HW accel matrices on Android (NDK)


> I also thought at first glance that he was doing something serious with matrix calculus, but from the rest of his posts I really doubt it.
> Transforming large meshes? On a CPU? Could you give an example? I really think that should be done on a GPU.

This is mobile, and even though mobile GPUs are quite powerful, they are still far from desktop systems, and sometimes you want to spend the GPU cycles on shading and offload some vertex processing to the CPU. In those cases it feels foolish not to use the entire CPU and to ignore the excellent SIMD instructions. You can quite easily get up to 4x the processing throughput by using NEON.

Also, there is more than the graphics pipeline that can use matrix calculations, and with limited resources you don't want to needlessly waste any :)
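(For concreteness, here is a minimal sketch of what a NEON-accelerated transform looks like with the standard <arm_neon.h> intrinsics. The function name and the column-major layout are my own illustrative assumptions, and it must be built for a NEON-capable target, e.g. -mfpu=neon for armeabi-v7a.)

    #include <arm_neon.h>

    // y = M * x for a 4x4 column-major matrix: one multiply plus three
    // multiply-accumulates, each operating on a whole 4-float register.
    static inline void mat4_mul_vec4_neon(const float* m, const float* x, float* y)
    {
        float32x4_t c0 = vld1q_f32(m + 0);    // column 0
        float32x4_t c1 = vld1q_f32(m + 4);    // column 1
        float32x4_t c2 = vld1q_f32(m + 8);    // column 2
        float32x4_t c3 = vld1q_f32(m + 12);   // column 3

        float32x4_t r = vmulq_n_f32(c0, x[0]);    // c0 * x.x
        r = vmlaq_n_f32(r, c1, x[1]);             // + c1 * x.y
        r = vmlaq_n_f32(r, c2, x[2]);             // + c2 * x.z
        r = vmlaq_n_f32(r, c3, x[3]);             // + c3 * x.w
        vst1q_f32(y, r);
    }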



But that's ignoring the fact that you're going to have a lot of data to transfer to the GPU each frame, which will eat bandwidth and force you to deal with dynamic buffer management and contention. That alone (unless the data is truly dynamic to begin with) favours doing it on the GPU; yes, the extra shader instructions will be extra GPU overhead, but I'm betting they'll be nothing compared to the bandwidth and contention overhead of a CPU-based approach.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


> In all applications I have developed, I needed double precision for the matrix calculations.
To quote Tom Forsyth - "Double precision has no place in games. If you think you need double precision, you either need 64-bit fixed point, or you don't understand the algorithm."

He's being a bit facetious; they might occasionally have a use... but they definitely should not be your default choice, especially on 32-bit architectures. In my experience, doubles are very, very, very rarely used in games (and as Tom says, when you do see them used, it's often done without understanding -- "oh float was having trouble so I just changed it to double"). Float and double have a huge range, yes, but their logarithmic precision is usually not the most efficient choice.

Memory bandwidth is a bigger bottleneck than CPU ALU speed these days. A PC x86 CPU will likely crunch through floats and doubles at the exact same speed (by treating them both as 80-bit internally...). The performance impact comes from the fact that doubles double your memory traffic, and if bandwidth is your bottleneck (which it often is these days), then this roughly doubles your execution times.

As for the need for optimized matrix/vector classes - many games do skeletal animation on the CPU, where you might have two dozen characters, each with five dozen bones, which is well over a thousand local->global matrix computations that need to be done each frame. This is enough work to be >1ms and show up on your profiler ;)
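(A minimal sketch of that local->global pass, under my own assumptions: bones sorted so parents precede children, column-major 4x4 matrices, and invented names. Real engines vary.)

    #include <vector>

    struct Mat4 { float m[16]; };   // column-major

    // Plain 4x4 multiply: r = a * b.
    static Mat4 Mul(const Mat4& a, const Mat4& b)
    {
        Mat4 r;
        for (int c = 0; c < 4; ++c)
            for (int row = 0; row < 4; ++row) {
                float s = 0.0f;
                for (int k = 0; k < 4; ++k)
                    s += a.m[k * 4 + row] * b.m[c * 4 + k];
                r.m[c * 4 + row] = s;
            }
        return r;
    }

    struct Skeleton {
        std::vector<int>  parent;   // parent[i] < i; -1 for root bones
        std::vector<Mat4> local;    // per-bone transforms from the animation system
        std::vector<Mat4> global;   // output: model-space transforms
    };

    // ~24 characters x ~60 bones => ~1400 Mul() calls per frame; this inner
    // multiply is exactly the routine that NEON/SSE versions are written for.
    void ComputeGlobals(Skeleton& s)
    {
        for (size_t i = 0; i < s.local.size(); ++i)
            s.global[i] = (s.parent[i] < 0) ? s.local[i]
                                            : Mul(s.global[s.parent[i]], s.local[i]);
    }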

I port a lot of code to mobile. What I'm currently porting used some SIMD, so if I can find a CPU instruction to do the same thing, I'd like to use it. I have my own matrix library for platforms that don't have special CPU instructions; while I haven't optimized it, it is pretty swift. But if I can get 2% of CPU time back by using a special instruction set, I would like to, so I can spend more time processing the rest of the frame.

> To quote Tom Forsyth - "Double precision has no place in games. If you think you need double precision, you either need 64-bit fixed point, or you don't understand the algorithm."
>
> He's being a bit facetious; they might occasionally have a use... but they definitely should not be your default choice, especially on 32-bit architectures. In my experience, doubles are very, very, very rarely used in games (and as Tom says, when you do see them used, it's often done without understanding -- "oh float was having trouble so I just changed it to double"). Float and double have a huge range, yes, but their logarithmic precision is usually not the most efficient choice.

I agree that doubles are more expensive than floats on GPUs (not on CPUs, as you already noted), but they make many things easier, and some of them even faster, on a GPU. When precision is needed, native doubles are significantly faster than emulated double-single (two-float) arithmetic.

In my example, how would one calculate (and draw) the precise positions of vertices on the globe (where even the radius cannot be represented with meter accuracy in floats) without doubles on the CPU side? Maybe with some mathematical gymnastics, but what for? The calculation is done at the same speed on the CPU side; only floats (just several floating-point values for hundreds of thousands of vertices generated on the GPU) are transferred to the GPU, and everything is done using FP arithmetic on the GPU side. With all respect to you and Tom Forsyth, it does not make any sense. Please, before dismissing something in general, consider the cases where it might be the better solution.

Thank you for the link! I'll read it carefully. :)
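(For readers unfamiliar with the two-float emulation mentioned above, here is a minimal sketch of a "double-single" add. It is Knuth's two-sum error compensation, with names of my own invention; it requires strict FP semantics - a fast-math compiler will optimise the error terms away - and costs many more instructions than one native double add, which is exactly the point.)

    // An extended-precision value stored as an unevaluated sum of two floats.
    struct dsfloat { float hi, lo; };

    static dsfloat ds_add(dsfloat a, dsfloat b)
    {
        float s = a.hi + b.hi;
        float v = s - a.hi;
        float e = (a.hi - (s - v)) + (b.hi - v);  // exact rounding error of s
        e += a.lo + b.lo;                         // fold in the low-order parts
        float hi = s + e;                         // renormalise into hi/lo again
        float lo = e - (hi - s);
        dsfloat r = { hi, lo };
        return r;
    }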

> I agree that doubles are more expensive than floats on GPUs (not on CPUs, as you already noted)

That's not quite what I noted -- I pointed out that they have the potential to halve performance, because memory bandwidth is usually more of a bottleneck than CPU speed.

> how would one calculate (and draw) the precise positions of vertices on the globe (where even the radius cannot be represented with meter accuracy in floats) without doubles on the CPU side?

Neither floats nor doubles are a great choice for storing globe surface points relative to the globe, because both formats dedicate the bulk of their precision to representing points within the globe's core. What a waste!
The surface of the earth only varies vertically by about 20km, so if you need sub-metre height accuracy you could use a 16-bit int to store the height difference from average, or a 32-bit int would give you near-micron accuracy.
If you need the globe vertices displaced horizontally as well as vertically, then you could complement the height with two spherical coordinates, or the smoothed-cube coordinates that are trendy in planetary renderers.
Why? Because a more efficient storage format takes up less space, and efficiency in memory layouts is one of the primary optimisations on modern computers (arguably more important than reducing CPU cycles -- in relative terms of bandwidth per CPU cycle, memory is getting slower and slower every day...).
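(A minimal sketch of that 16-bit encoding, with assumed range constants of my own choosing: ~20 km of total vertical span gives 20000 / 65536, or about 0.3 m per step.)

    #include <stdint.h>

    static const float kMinHeight = -11000.0f;  // metres relative to the reference surface
    static const float kMaxHeight =   9000.0f;  // ~20 km total span

    // Quantise a height in metres to 16 bits (~0.3 m per step).
    static uint16_t EncodeHeight(float metres)
    {
        if (metres < kMinHeight) metres = kMinHeight;
        if (metres > kMaxHeight) metres = kMaxHeight;
        float t = (metres - kMinHeight) / (kMaxHeight - kMinHeight);  // 0..1
        return (uint16_t)(t * 65535.0f + 0.5f);
    }

    static float DecodeHeight(uint16_t q)
    {
        return kMinHeight + (q / 65535.0f) * (kMaxHeight - kMinHeight);
    }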

> Please, before dismissing something in general, consider the cases where it might be the better solution.

Keep in mind I only jumped in here because you claimed that all applications you've developed required double precision -- that seems to be the same generalization on the other side of the fence ;)


> The main benefit of SIMD is not computation time (although it's a very nice bonus), but the amount of time you spend reading / writing data. Loading 4xfloat as a packed register is much quicker than the FPU equivalent.

I'm not sure how true that is... Yes, you can load 4 values with one instruction (just as you can do many other ops 4 at a time with one instruction), but those 16 bytes aren't magically transferred from RAM faster than 16 bytes requested via 'normal' means.
Many applications don't see any performance improvement after porting to SIMD (despite using ~4x fewer CPU cycles) exactly because the memory bandwidth has remained the same.
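(To illustrate: both loops below move exactly the same bytes; the NEON one just issues roughly a quarter of the instructions to do it. So if the scalar version was bandwidth-bound, the vector one will be too. A sketch with invented names.)

    #include <arm_neon.h>
    #include <stddef.h>

    void scale_scalar(float* dst, const float* src, size_t n, float k)
    {
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i] * k;                 // one float per iteration
    }

    void scale_neon(float* dst, const float* src, size_t n, float k)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4)               // four floats per iteration
            vst1q_f32(dst + i, vmulq_n_f32(vld1q_f32(src + i), k));
        for (; i < n; ++i)                       // scalar tail
            dst[i] = src[i] * k;
    }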

> That's not quite what I noted -- I pointed out that they have the potential to halve performance, because memory bandwidth is usually more of a bottleneck than CPU speed.

We have again misunderstood each other. I don't "promote" double-precision models, just calculations. There is no impact on bandwidth, since only a few floats are sent to the GPU.

> Neither floats nor doubles are a great choice for storing globe surface points relative to the globe, because both formats dedicate the bulk of their precision to representing points within the globe's core. What a waste!
> The surface of the earth only varies vertically by about 20km, so if you need sub-metre height accuracy you could use a 16-bit int to store the height difference from average, or a 32-bit int would give you near-micron accuracy.
> If you need the globe vertices displaced horizontally as well as vertically, then you could complement the height with two spherical coordinates, or the smoothed-cube coordinates that are trendy in planetary renderers.
> Why? Because a more efficient storage format takes up less space, and efficiency in memory layouts is one of the primary optimisations on modern computers (arguably more important than reducing CPU cycles -- in relative terms of bandwidth per CPU cycle, memory is getting slower and slower every day...).

Can you elaborate on this, please?

I'm already using 16-bit storage for the height map (DEM). It is enough for 0.14 m accuracy at the global level (0.14 m x 2^16 is roughly 9.2 km of representable elevation range), without needing per-block average values or differential coding. Quite enough for the global elevation data currently available.

> Keep in mind I only jumped in here because you claimed that all applications you've developed required double precision -- that seems to be the same generalization on the other side of the fence ;)

You are right about this. Sorry! ;)


> We have again misunderstood each other. I don't "promote" double-precision models, just calculations. There is no impact on bandwidth, since only a few floats are sent to the GPU.
>
> ... Can you elaborate on this, please? I'm already using 16-bit storage for the height map (DEM).

Ah yes, I thought that you were storing vertices in double-precision format.

I guess you're reading in some compact data (e.g. 16-bit elevation), doing a bunch of double-precision transforms on it, then outputting 32-bit floats?

That's much less offensive to performance than what I assumed you were doing 8-)

However, it may still be that double-precision calculations aren't necessary... you may be able to rearrange your order of operations, or the coordinate systems you're working in, so that everything works OK with just 32-bit precision. Whether that's at all worthwhile when you've already got a working solution is another topic entirely, though!

I guess if ALU time was a performance bottleneck for you and you wanted to make use of 4-wide (or 16-wide on new PC CPUs) SIMD, then it might be worthwhile; otherwise, if it ain't broke, don't fix it ;)

While on this topic, it's worth noting that some compilers, such as MSVC, actually output really horribly bad assembly code when you use floats, depending on the compiler settings. MSVC has "Enhanced Instruction Set" and "Floating Point Model" options. With the FP model set to "strict" or "precise", it will produce assembly code with a LOT of redundant instructions to take every 80-bit intermediate value and round it down to 32-bit precision, so that your code behaves as if the FPU actually used 32-bit precision internally. When using double, it doesn't bother with all this redundant rounding code, which can actually make double seem much faster than float!

Personally, I always set the instruction set to SSE2 and the FP model to "fast", which makes MSVC produce more sensible x86 code for floats.
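(For reference, those two project settings correspond to MSVC's command-line switches; /fp:precise is the default, and /fp:fast plus /arch:SSE2 is the combination described above.)

    cl /O2 /arch:SSE2 /fp:fast main.cpp      -- SSE2 scalar FP, no redundant x87 rounding
    cl /O2 /fp:precise main.cpp              -- default: rounds x87 intermediates to declared precision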

> Ah yes, I thought that you were storing vertices in double-precision format.
>
> I guess you're reading in some compact data (e.g. 16-bit elevation), doing a bunch of double-precision transforms on it, then outputting 32-bit floats?
>
> That's much less offensive to performance than what I assumed you were doing 8-)

Nope! In fact I'm generating the terrain completely on the GPU. Only the 16-bit elevation data and the various overlays are sent, through textures. Everything is rendered without a single attribute (in the GLSL sense). The CPU calculates the precise position on the globe and the relevant parameters used for the full ellipsoid calculation and height correction, which are done per vertex on the GPU. Everything on the GPU side is done in FP, but the coefficients are calculated on the CPU in DP, downcast to FP, and sent to the GPU as uniforms. Once again, no attributes are used; the representation cannot be more compact. But I still need DP to do accurate math on the CPU.
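(A rough sketch of that CPU-side pattern - precise math in double, with only a few demoted floats crossing to the GPU as uniforms. The coefficient function and the uniform name are invented for illustration.)

    #include <GLES2/gl2.h>

    // Hypothetical: does the precision-sensitive ellipsoid math entirely in double.
    void ComputeEllipsoidCoefficients(double lon, double lat, double out[4]);

    void UploadPatchCoefficients(GLuint program, double lon, double lat)
    {
        double c[4];
        ComputeEllipsoidCoefficients(lon, lat, c);

        // Only these few floats ever cross the bus - no per-vertex attributes.
        float f[4] = { (float)c[0], (float)c[1], (float)c[2], (float)c[3] };
        glUniform4fv(glGetUniformLocation(program, "u_patchCoeff"), 1, f);
    }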

> While on this topic, it's worth noting that some compilers, such as MSVC, actually output really horribly bad assembly code when you use floats, depending on the compiler settings. [...]
>
> Personally, I always set the instruction set to SSE2 and the FP model to "fast", which makes MSVC produce more sensible x86 code for floats.

Thank you for the advice! Although I've been using VS since version 4.1, I have never needed to tweak the compiler options. I'll try what you've suggested! ;)

