L. Spiro

Taking Advantage of SIMD on PlayStation Vita


PlayStation Vita has an ARM Cortex-A9 MPCore CPU equipped with the optional NEON SIMD instruction-set extension, which can perform up to 16 operations per instruction.

 

Checking the assembly, I do indeed find a single instruction emitted for an intrinsic such as __vfqadd_32 (or such; I don’t have it in front of me right now), but my benchmarking shows that a simple vector += operation in raw C++ is faster than the intrinsics doing the same operation.
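
To make the comparison concrete, here is a minimal sketch of the two versions using the standard <arm_neon.h> intrinsic names (an assumption on my part; the exact intrinsic the Vita toolchain exposes may be spelled differently):

// Illustrative only: assumes the standard <arm_neon.h> intrinsics are available.
#include <arm_neon.h>

// Plain C++: four scalar adds that the compiler may or may not vectorize.
void add_scalar( float * dst, const float * a, const float * b ) {
    for ( int i = 0; i < 4; ++i ) {
        dst[i] = a[i] + b[i];
    }
}

// NEON intrinsics: one 128-bit add across four 32-bit float lanes.
void add_neon( float * dst, const float * a, const float * b ) {
    float32x4_t va = vld1q_f32( a );
    float32x4_t vb = vld1q_f32( b );
    vst1q_f32( dst, vaddq_f32( va, vb ) );
}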

 

So I am guessing PlayStation Vita is like Nintendo 3DS, where a single instruction can do 16 operations, but those operations are not actually parallel—each operation is sequential.

 

Can anyone verify this? And if there is a way to get true SIMD parallelism on PlayStation Vita, how?

 

 

L. Spiro


"The Advanced SIMD extension (aka NEON or "MPE" Media Processing Engine) is a combined 64- and 128-bit single instruction multiple data (SIMD) instruction set that provides standardized acceleration for media and signal processing applications. NEON is included in all Cortex-A8 devices but is optional in Cortex-A9 devices."

"In NEON, the SIMD supports up to 16 operations at the same time."

"Devices such as the ARM Cortex-A8 and Cortex-A9 support 128-bit vectors but will execute with 64 bits at a time, whereas newer Cortex-A15 devices can execute 128 bits at a time."

Source: http://en.wikipedia.org/wiki/ARM_NEON#Advanced_SIMD_.28NEON.29

 

You didn't mention what platform you tested the "raw C++" on. Intel SSE can do all 128 bits at once.

 

Also, "supports 16 operations at the same time" might mean those operations are executed on 16 single-byte integers, not on the single-precision floats you probably tested with (a 128-bit register only holds four of those).
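
For example, with the standard <arm_neon.h> intrinsics (assuming they are what the Vita compiler exposes), one 128-bit add is 16 operations on bytes but only 4 on single-precision floats:

#include <arm_neon.h>

// 16 lanes of 8-bit integers per 128-bit add.
uint8x16_t  add_bytes( uint8x16_t a, uint8x16_t b )     { return vaddq_u8( a, b ); }

// Only 4 lanes of 32-bit floats per 128-bit add.
float32x4_t add_floats( float32x4_t a, float32x4_t b )  { return vaddq_f32( a, b ); }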

Edited by tonemgub

Everything was tested on a PlayStation Vita.

 

The thing I do not trust about that article is this: Nintendo 3DS uses a Dual-Core ARM11 MPCore, and even when you use its SIMD-style instructions it only works on 32 bits at a given time, I believe. SIMD is often misunderstood to mean that all the data an instruction handles will be processed in parallel. It really just means one instruction handles multiple data; the hardware can work through that data sequentially, partially in parallel, or fully in parallel.

The article says PlayStation Vita’s core should be able to work with 2 floats in parallel, but my benchmarking suggests otherwise. I would expect double the performance from the NEON/VFP intrinsics if it were really doing 64 bits at a time.

 

And finally, as you mentioned, the wording can be tricky. When it refers to what it does in parallel, it could be referring to integer registers only, not the S/D/Q floating-point ones. That’s why I am not trusting the documentation so much.

 

 

L. Spiro


Is this for official Vita development? If so, you could ask on Sony's private developer site. They'll get back to you with an authoritative answer.


As far as I know, NEON uses its own float registers.

Let's see what ARM has to say about it: http://www.arm.com/products/processors/technologies/neon.php

 

 


  • Registers are considered as vectors of elements of the same data type
  • Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single precision floating point
  • Instructions perform the same operation in all lanes
Edited by TheChubu

Is this for official Vita development? If so, you could ask on Sony's private developer site. They'll get back to you with an authoritative answer.

I think I will have to do that, but I won’t be able to have an account made until Monday, so I was hoping someone here knew.
 

As far as I know, NEON uses its own float registers.

Let's see what ARM has to say about it: http://www.arm.com/products/processors/technologies/neon.php
 
 

  • Registers are considered as vectors of elements of the same data type
  • Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single precision floating point
  • Instructions perform the same operation in all lanes

Thank you, TheChubu, but there is too much fine print in the documents. For example, yes, it can treat all types, including floats, as vectors, but it doesn’t mention that the registers for integers and floats are different and may be handled by a VFP co-processor (for example), which will ultimately determine floating-point performance.
It mentions that the same operation happens in all lanes, but does not specify parallelism.
It also says it can execute up to 64 bits at a time, but does not say that two 32-bit floats will therefore be run in parallel. It simply means that 64 bits is a limit, and nothing more.

So basically the documents are simply not that useful for this question. You can’t infer anything from them beyond exactly what is written, and it almost seems as if they are deliberately abstract on this point, using words that lead many to believe parallelism is implied.


A perfect example is on Nintendo 3DS, where I start writing VFP functions by pushing and popping all registers just for ease of coding, then remove the useless pushes and pops afterwards.
You can push and pop multiple registers at a time:
PUSH  {S0-S31}
What they don’t tell you is that even though it is pushing multiple registers with a single instruction, each register still takes an extra cycle. In other words, the above example takes twice as long as:
PUSH  {S0-S15}
I’ve benchmarked it. SIMD doesn’t imply parallelism.


L. Spiro

I assume you already know this, but it is better to double-check.

 

The NEON and VFP registers alias the same hardware. Switching modes requires a pipeline flush. Developers using NEON code should do all of it in batch and ensure no VFP instructions are interleaved.
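
A rough sketch of the batching idea (standard <arm_neon.h> intrinsics and a made-up function, nothing Vita-specific): keep the inner loop purely NEON and let the scalar leftovers run afterwards.

#include <arm_neon.h>

// Scale an array in one NEON-only pass; the scalar (VFP) tail runs after
// the NEON work so the two instruction groups are not interleaved.
void scale_batch( float * data, int count, float factor ) {
    float32x4_t vf = vdupq_n_f32( factor );          // broadcast once, up front
    int i = 0;
    for ( ; i + 4 <= count; i += 4 ) {               // pure NEON inside the loop
        vst1q_f32( data + i, vmulq_f32( vld1q_f32( data + i ), vf ) );
    }
    for ( ; i < count; ++i ) {                       // leftover elements, handled afterwards
        data[i] *= factor;
    }
}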

 

That said, my guess is that it will do two sequential 64-bit-parallel operations internally. 


As with any typical SIMD pipeline, there are nasty hidden overheads and pitfalls: register aliasing, processor mode switching, compiler-injected code, etc. I think NEON benefits from manual instruction pairing as well; see the ARM docs for the details. Microbenchmarking single ops is going to be pointless. Take a reasonably sized algorithm that translates easily (physics integrators, e.g. a softbody sim, are good candidates) and rewrite it in intrinsics. Measure that for X iterations against a plain C implementation. If you have a cycle-timing profiler, that would be best; otherwise just use high-precision timing.
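
Something along these lines, for example (plain std::chrono, assuming the toolchain supports it; the iteration count and the two function names are placeholders for your actual scalar and intrinsic implementations):

#include <chrono>
#include <cstdio>

// Time a callable over a number of iterations, in milliseconds.
template <typename Fn>
double time_ms( Fn fn, int iterations ) {
    auto start = std::chrono::high_resolution_clock::now();
    for ( int i = 0; i < iterations; ++i ) {
        fn();
    }
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>( end - start ).count();
}

// Usage sketch (run_scalar_version / run_neon_version are hypothetical names
// for your two implementations of the same algorithm):
//   double tScalar = time_ms( run_scalar_version, 10000 );
//   double tNeon   = time_ms( run_neon_version,   10000 );
//   std::printf( "scalar %.3f ms, neon %.3f ms\n", tScalar, tNeon );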

 

And remember the standard SIMD rules: prefetch your data from nice cache-aligned blocks, don't mode-switch between FP and SIMD, manage your data hazards intelligently, etc. Tight ALU work on resident data is going to be best. When dealing with intrinsics, make sure to generate assembly listings and take a look at what you're actually getting; you want to make sure things are being inlined and elided properly.

Edited by Promit

I have solved the issue.

When I made my first test I had only 30 minutes, so I quickly put together a test case, found the two to be almost equal in time, and did not have time to check the assembly. Home duty calls when it calls.

 

But I remembered a past attempt to use PlayStation Vita’s SIMD in my DXT algorithm to help make it real-time on the device, where it actually made things slower, so the result did not make me suspicious.

 

I came to the office today (Saturday) to finish my investigation, starting with seeing the assembly.

 

It turns out that the test case I created allowed the C++ to be optimized down to adding a single scalar instead of 4.  When I made a case in which it was forced to perform 4 adds on the C++ side, the VFP/NEON side became twice as fast, as expected if it handles 64 bits at a time, 2×32 bits in parallel.
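
For anyone who hits the same trap, one way (of several) to keep the compiler from collapsing the scalar side is to feed it runtime data and consume the result, for example through a volatile sink. This is only a sketch of the idea, not my exact test case:

volatile float g_sink;  // consuming the result here prevents dead-code elimination

void bench_scalar( const float * a, const float * b, int count ) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for ( int i = 0; i + 4 <= count; i += 4 ) {  // four independent adds per iteration
        acc0 += a[i + 0] + b[i + 0];
        acc1 += a[i + 1] + b[i + 1];
        acc2 += a[i + 2] + b[i + 2];
        acc3 += a[i + 3] + b[i + 3];
    }
    g_sink = acc0 + acc1 + acc2 + acc3;
}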

 

 

However, I did learn about the 64-bit limit here; otherwise I might still be scratching my head wondering why it is not 4 times faster.

 

Thank you.

 

 

L. Spiro

