"The Advanced SIMD extension (aka NEON or "MPE" Media Processing Engine) is a combined 64- and 128-bit single instruction multiple data (SIMD) instruction set that provides standardized acceleration for media and signal processing applications. NEON is included in all Cortex-A8 devices but is optional in Cortex-A9 devices."
"In NEON, the SIMD supports up to 16 operations at the same time."
"Devices such as the ARM Cortex-A8 and Cortex-A9 support 128-bit vectors but will execute with 64 bits at a time, whereas newer Cortex-A15 devices can execute 128 bits at a time."
You didn't mention which platform you ran the "raw C++" on. Intel SSE, on most modern x86 CPUs, executes a full 128-bit operation at once.
Also, the "16 operations at the same time" figure probably refers to 16 single-byte integers packed into one 128-bit register, not to the single-precision floats you presumably tested with. A 128-bit register holds only 4 of those.
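To make the lane arithmetic concrete, here is a minimal sketch in portable C++. It uses no NEON intrinsics at all; `kVectorBits`, `lanes` and `check_lane_counts` are names made up for this illustration, and it assumes `float` is 32 bits wide, as it is on ARM and x86.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Width in bits of one full NEON vector (the 128-bit case from the quotes above).
constexpr std::size_t kVectorBits = 128;

// Hypothetical helper: how many elements ("lanes") of type T fit in a
// vector of the given width. This is plain arithmetic, not a NEON API.
template <typename T>
constexpr std::size_t lanes(std::size_t vector_bits = kVectorBits) {
    return vector_bits / (sizeof(T) * CHAR_BIT);
}

// 16 lanes only when each element is a single byte.
static_assert(lanes<std::uint8_t>() == 16, "16 x 8-bit integers per 128 bits");
// Assumes a 32-bit float: only 4 lanes per 128-bit vector.
static_assert(lanes<float>() == 4, "4 x 32-bit floats per 128 bits");
// With a 64-bit datapath, only 2 floats are processed per pass.
static_assert(lanes<float>(64) == 2, "2 x 32-bit floats per 64 bits");

// Runtime version of the same check, for 16-bit integers.
inline void check_lane_counts() {
    assert(lanes<std::int16_t>() == 8);
}
```

So with floats, the best case is 4 operations per instruction, and on a core that only executes 64 bits at a time that is effectively 2 per pass, which is a long way from the headline figure of 16.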