Taking Advantage of SIMD on PlayStation Vita


The PlayStation Vita has an ARM Cortex-A9 MPCore CPU, which comes with the optional NEON SIMD instruction-set extension that can perform up to 16 operations per instruction.

Checking the assembly, I do indeed find a single instruction emitted for an intrinsic such as __vfqadd_32 (or something like that; I don’t have it in front of me right now), but my benchmarking shows that a simple vector += operation in raw C++ is faster than the intrinsic doing the same operation.
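
The shape of the comparison is roughly this (a sketch using the generic arm_neon.h names, since I don’t recall the Vita-specific intrinsic; the struct and function names are only illustrative):

#include <arm_neon.h>

struct Vec4 { float v[4]; };

// Plain C++: four scalar adds per call.
inline void AddScalar( Vec4 &dst, const Vec4 &a, const Vec4 &b ) {
    for ( int i = 0; i < 4; ++i ) {
        dst.v[i] = a.v[i] + b.v[i];
    }
}

// NEON intrinsics: one vaddq_f32 covering all four lanes.
inline void AddNeon( Vec4 &dst, const Vec4 &a, const Vec4 &b ) {
    float32x4_t va = vld1q_f32( a.v );
    float32x4_t vb = vld1q_f32( b.v );
    vst1q_f32( dst.v, vaddq_f32( va, vb ) );
}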

So I am guessing PlayStation Vita is like Nintendo 3DS, where a single instruction can do 16 operations, but those operations are not actually parallel—each operation is sequential.

Can anyone verify this? And if there is a way to get real SIMD parallelism on the PlayStation Vita, how?

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid


"The Advanced SIMD extension (aka NEON or "MPE" Media Processing Engine) is a combined 64- and 128-bit single instruction multiple data (SIMD) instruction set that provides standardized acceleration for media and signal processing applications. NEON is included in all Cortex-A8 devices but is optional in Cortex-A9 devices."

"In NEON, the SIMD supports up to 16 operations at the same time."

"Devices such as the ARM Cortex-A8 and Cortex-A9 support 128-bit vectors but will execute with 64 bits at a time, whereas newer Cortex-A15 devices can execute 128 bits at a time."

Source: http://en.wikipedia.org/wiki/ARM_NEON#Advanced_SIMD_.28NEON.29

You didn't mention what platform you tested the "raw C++" on. Intel's SSE can do all 128 bits at once.

Also, regarding "supports 16 operations at the same time": it might be that those operations are executed on 16 single-byte integers, not on the single-precision floats you probably tested with (only four of which fit in a 128-bit register)?
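
For example (a rough sketch with the generic arm_neon.h intrinsics, nothing Vita-specific), the same 128-bit register holds 16 byte lanes but only 4 single-precision float lanes:

#include <arm_neon.h>

// 16 operations in one instruction: a 128-bit register is 16 x u8 lanes.
uint8x16_t AddBytes( uint8x16_t a, uint8x16_t b ) {
    return vaddq_u8( a, b );
}

// Only 4 operations in one instruction: the same 128 bits are 4 x f32 lanes.
float32x4_t AddFloats( float32x4_t a, float32x4_t b ) {
    return vaddq_f32( a, b );
}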

Everything was tested on a PlayStation Vita.

The thing I do not trust about that article is that the Nintendo 3DS uses a Dual-Core ARM11 MPCore, but even if you use its SIMD instructions it works on only 32 bits at a given time, I believe. SIMD is often misunderstood to mean that all the data the instruction handles will be processed in parallel; it really just means one instruction handles multiple data, and the hardware can work on that data sequentially, partially in parallel, or fully in parallel.

The article says PlayStation Vita’s core should be able to work with 2 floats in parallel, but my benchmarking suggests otherwise. I would expect double performance from the NEON/VFP intrinsics if it is really doing 64 bits at a time.

And finally, as you mentioned, the wording can be tricky. When it refers to what it is doing in parallel, it could be referring to the integer registers only, not the S/D/Q floating-point ones. That’s why I don’t trust the documentation very much.

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Is this for official Vita development? If so, you could ask on Sony's private developer site. They'll get back to you with an authoritative answer.

As far as I know, NEON uses its own float registers.

Let's see what ARM has to say about it: http://www.arm.com/products/processors/technologies/neon.php


  • Registers are considered as vectors of elements of the same data type
  • Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single precision floating point
  • Instructions perform the same operation in all lanes

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator

Is this for official Vita development? If so, you could ask on Sony's private developer site. They'll get back to you with an authoritative answer.

I think I will have to do that, but I won’t be able to have an account made until Monday, so I was hoping someone here knew.

As far as I know, NEON uses its own float registers.

Let's see what ARM has to say about it: http://www.arm.com/products/processors/technologies/neon.php

  • Registers are considered as vectors of elements of the same data type
  • Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single precision floating point
  • Instructions perform the same operation in all lanes

Thank you, TheChubu, but there is too much fine print in the documents. For example, yes, it can treat all types, including floats, as vectors, but it doesn’t mention that the registers for integers and floats are different and that the float side can be handled by a VFP co-processor (for example), which will ultimately end up determining floating-point performance.
It says the operation happens in all lanes, but it does not specify whether the lanes are processed in parallel.
It also says it can execute up to 64 bits at a time, but it does not say that 2 32-bit floats will therefore be run in parallel. It simply states a limit, and nothing more.

So basically the documents are simply not that useful for this question. You can’t infer anything from them except exactly what is written, and it almost seems as if they are specifically trying to be abstract on this issue and use words that lead many to believe parallelism is implied.


A perfect example is on the Nintendo 3DS, where I start writing VFP functions by pushing and popping all registers just for ease of coding, then remove the useless pushes and pops afterwards.
You can push and pop multiple registers at a time:
VPUSH  {S0-S31}
What they don’t tell you is that even though it is pushing multiple registers with a single instruction, each register still takes an extra cycle. In other words, the above example takes twice as long as:
VPUSH  {S0-S15}
I’ve benchmarked it. SIMD doesn’t imply parallelism.
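
For the curious, the benchmark was roughly this shape (a simplified sketch with GCC-style inline assembly and an illustrative timer, not the actual 3DS test code):

#include <chrono>
#include <cstdio>

// Time a callable for a fixed number of iterations and return nanoseconds.
template <typename Fn>
long long TimeNs( Fn fn, int iterations ) {
    auto start = std::chrono::high_resolution_clock::now();
    for ( int i = 0; i < iterations; ++i ) { fn(); }
    auto stop = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>( stop - start ).count();
}

int main() {
    const int kIters = 1000000;
    // Push and pop all 32 single-precision registers each iteration.
    long long all32 = TimeNs( [] { asm volatile( "vpush {s0-s31}\n\tvpop {s0-s31}" ); }, kIters );
    // Push and pop only the first 16.
    long long half16 = TimeNs( [] { asm volatile( "vpush {s0-s15}\n\tvpop {s0-s15}" ); }, kIters );
    std::printf( "32 regs: %lld ns, 16 regs: %lld ns\n", all32, half16 );
    return 0;
}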


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

I assume you already know this but better to double check.

The NEON and VFP registers alias the same hardware. Switching modes requires a pipeline flush. Developers using NEON code should do all of it in batch and ensure no VFP instructions are interleaved.
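
Something along these lines keeps the NEON work batched (a sketch using the generic arm_neon.h intrinsics; the function and its parameters are made up for illustration):

#include <arm_neon.h>

// Any scalar (VFP) math, such as broadcasting 'scale', happens once outside
// the loop, so NEON and VFP instructions are not interleaved in the hot path.
void ScaleBuffer( float *dst, const float *src, int count, float scale ) {
    const float32x4_t vScale = vdupq_n_f32( scale );   // broadcast once
    int i = 0;
    for ( ; i + 4 <= count; i += 4 ) {
        float32x4_t v = vld1q_f32( src + i );          // load 4 floats
        v = vmulq_f32( v, vScale );                    // 4 multiplies in one instruction
        vst1q_f32( dst + i, v );                       // store 4 floats
    }
    for ( ; i < count; ++i ) {                         // scalar tail, after the NEON batch
        dst[i] = src[i] * scale;
    }
}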

That said, my guess is that it will do two sequential 64-bit-parallel operations internally.

As with any typical SIMD pipeline, there are nasty hidden overheads and pitfalls: register aliasing, processor mode switching, compiler-injected code, etc. I think NEON benefits from manual instruction pairing as well; see the ARM docs for the details. Microbenchmarking single ops is going to be pointless. Take a reasonably sized algorithm that translates easily (physics integrators, e.g. a soft-body sim, are good candidates) and rewrite it in intrinsics. Measure that for X iterations against a plain C implementation. If you have a cycle-timing profiler, that would be best; otherwise just use high-precision timing.
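
As a rough illustration of the kind of test I mean (names and data layout are my own invention, nothing from a real engine):

#include <arm_neon.h>

// Structure-of-arrays: contiguous floats so both versions stream the same data.
struct Particles {
    float *pos;
    float *vel;
    int    count;
};

// Plain C Euler step.
void IntegrateC( Particles &p, float dt ) {
    for ( int i = 0; i < p.count; ++i ) {
        p.pos[i] += p.vel[i] * dt;
    }
}

// NEON Euler step: 4 lanes per multiply-accumulate.
void IntegrateNeon( Particles &p, float dt ) {
    const float32x4_t vDt = vdupq_n_f32( dt );
    int i = 0;
    for ( ; i + 4 <= p.count; i += 4 ) {
        float32x4_t pos = vld1q_f32( p.pos + i );
        float32x4_t vel = vld1q_f32( p.vel + i );
        pos = vmlaq_f32( pos, vel, vDt );              // pos += vel * dt, 4 lanes at once
        vst1q_f32( p.pos + i, pos );
    }
    for ( ; i < p.count; ++i ) {                       // scalar tail
        p.pos[i] += p.vel[i] * dt;
    }
}

Run both over the same buffers for the same number of iterations and compare.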

And remember the standard SIMD rules: prefetch your data from nicely cache-aligned blocks, don't mode-switch between FP and SIMD, manage your data hazards intelligently, etc. Tight ALU work on resident data is going to be best. When dealing with intrinsics, make sure to generate assembly listings and take a look at what you're actually getting; you want to make sure things are being inlined and elided properly.
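
A small illustration of those points (the alignment and prefetch details here are GCC/Clang-style assumptions, not anything Vita-specific):

#include <arm_neon.h>

alignas( 16 ) static float gSrc[4096];                 // 16-byte alignment for NEON loads
alignas( 16 ) static float gDst[4096];

void DoubleEachValue() {
    for ( int i = 0; i + 4 <= 4096; i += 4 ) {
        __builtin_prefetch( gSrc + i + 64 );           // hint the next block into cache
        float32x4_t v = vld1q_f32( gSrc + i );
        vst1q_f32( gDst + i, vaddq_f32( v, v ) );      // tight ALU on resident data
    }
}

To see what you're actually getting, emit an assembly listing, e.g. with a GCC-style toolchain: g++ -O2 -mfpu=neon -S file.cpp.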

SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.

I have solved the issue.

When I made my first test I had only 30 minutes, so I quickly put together a test case and found the two to be almost equal in time, but I did not have time to check the assembly. Home duty calls when it calls.

But I remembered a past time when I had used the PlayStation Vita’s SIMD for my DXT algorithm to help make it real-time on the device and it actually made it slower, so I wasn’t especially suspicious of this result.

I came to the office today (Saturday) to finish my investigation, starting with seeing the assembly.

It turns out that the test case I created allowed the C++ to be optimized down to adding a single scalar instead of 4. When I made a case in which it was forced to perform 4 adds on the C++ side, the VFP/NEON side became twice as fast, as expected if it handles 64 bits at a time, 2×32 bits in parallel.
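
For anyone who hits the same pitfall, the fix amounted to feeding the benchmark data the optimizer could not fold away and writing every result back out, something like this (a simplified sketch, not the actual test code), then checking the emitted assembly for both versions:

struct Vec4 { float v[4]; };

// With transparent, constant inputs the optimizer stripped the "4 adds" down
// to a single scalar add, so the C++ loop was not doing the work I thought.
// Pulling each addend from a buffer and storing every result keeps the work real.
void AddArrays( Vec4 *dst, const Vec4 *a, const Vec4 *b, int count ) {
    for ( int i = 0; i < count; ++i ) {
        for ( int j = 0; j < 4; ++j ) {
            dst[i].v[j] = a[i].v[j] + b[i].v[j];
        }
    }
}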

However, I did learn about the 64-bit limit here; otherwise I might still be scratching my head wondering why it is not 4 times faster.

Thank you.

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

This topic is closed to new replies.
