As with any typical SIMD pipeline, there are nasty hidden overheads and pitfalls: register aliasing, processor mode switching, compiler-injected code, and so on. I think NEON benefits from manual instruction pairing as well; see the ARM docs for the details. Microbenchmarking single ops is going to be pointless. Instead, take a reasonably sized algorithm that translates easily (physics integrators, e.g. a softbody sim, are good candidates) and rewrite it in intrinsics. Measure that for X iterations against a plain C implementation. If you have a cycle-timing profiler, that would be best; otherwise just use high-precision timing.
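As a rough sketch of that kind of comparison, here is a trivial explicit Euler step in plain C next to a NEON intrinsics version, plus a small timing helper. All names (`integrate_scalar`, `integrate_simd`, `time_integrator`) are made up for illustration, and the integrator is far simpler than a real softbody sim. The NEON path only compiles when `__ARM_NEON` is defined; elsewhere it falls back to the scalar code. Timing uses POSIX `clock_gettime`, which I'm assuming your platform provides.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* exposes clock_gettime under strict -std=c99/c11 */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Plain C reference: pos += vel * dt, then vel += acc * dt. */
void integrate_scalar(float *pos, float *vel, const float *acc,
                      size_t n, float dt)
{
    for (size_t i = 0; i < n; ++i) {
        pos[i] += vel[i] * dt;
        vel[i] += acc[i] * dt;
    }
}

/* Intrinsics version: four floats per iteration, scalar code for the tail. */
void integrate_simd(float *pos, float *vel, const float *acc,
                    size_t n, float dt)
{
#if defined(__ARM_NEON)
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t p = vld1q_f32(pos + i);
        float32x4_t v = vld1q_f32(vel + i);
        float32x4_t a = vld1q_f32(acc + i);
        p = vmlaq_n_f32(p, v, dt); /* p + v * dt (uses the old velocity) */
        v = vmlaq_n_f32(v, a, dt); /* v + a * dt */
        vst1q_f32(pos + i, p);
        vst1q_f32(vel + i, v);
    }
    integrate_scalar(pos + i, vel + i, acc + i, n - i, dt);
#else
    /* No NEON on this target: fall back so the file still builds. */
    integrate_scalar(pos, vel, acc, n, dt);
#endif
}

typedef void (*integrate_fn)(float *, float *, const float *, size_t, float);

/* Runs fn `iters` times and returns the total elapsed wall time in seconds. */
double time_integrator(integrate_fn fn, float *pos, float *vel,
                       const float *acc, size_t n, float dt, int iters)
{
    struct timespec t0, t1;
    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int k = 0; k < iters; ++k)
        fn(pos, vel, acc, n, dt);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

Call `time_integrator` once with each function on identically initialized arrays and compare the two totals. Use a data set and iteration count large enough that the timer resolution doesn't dominate, and expect the numbers to shift depending on whether the data fits in cache.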
And remember the standard SIMD rules: prefetch your data from nice cache-aligned blocks, don't mode switch between FP and SIMD, manage your data hazards intelligently, etc. Tight ALU work on resident data is going to be best. When dealing with intrinsics, make sure to generate assembly listings and take a look at what you're actually getting; you want to make sure things are being inlined and elided properly.
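For the "cache-aligned blocks" and prefetch part, a minimal sketch could look like the following. The 64-byte line size and the four-line prefetch distance are assumptions you should check against your actual core, and the helper names are hypothetical. `aligned_alloc` is C11, and `__builtin_prefetch` is a GCC/Clang builtin, so it is guarded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64       /* assumed cache line size in bytes */
#define PREFETCH_LINES 4    /* assumed prefetch distance, tune by measuring */

/* Allocates `count` floats starting on a cache line boundary.
   Release the result with free(). */
float *alloc_aligned_floats(size_t count)
{
    assert(count > 0);
    size_t bytes = count * sizeof(float);
    /* aligned_alloc wants the size to be a multiple of the alignment. */
    bytes = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, bytes);
}

/* Sums the array, hinting the cache a few lines ahead of the read position. */
float sum_with_prefetch(const float *data, size_t count)
{
    const size_t per_line = CACHE_LINE / sizeof(float);
    const size_t ahead = PREFETCH_LINES * per_line;
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i) {
#if defined(__GNUC__)
        /* One hint per cache line, and only while it stays inside the array. */
        if (i % per_line == 0 && i + ahead < count)
            __builtin_prefetch(data + i + ahead);
#endif
        sum += data[i];
    }
    return sum;
}
```

Whether an explicit prefetch helps at all depends on the core and the access pattern, so measure with and without it. This is also a good place to practice reading the listing: compile with `-O2 -S` and check that the prefetch instruction shows up where you expect and that the loop body is what you thought you wrote.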