Vortez

Is SIMD worth it?

5 posts in this topic

Hi, I've been experimenting with some SIMD lately, like MMX and SSE/SSE2, and I'm a bit disappointed by the results. I'm doing simple stuff like filling 2 arrays with random numbers, then adding them into a third array, using C++, MMX and SSE (I'm using inline assembly in the last 2 functions, not the intrinsic functions).

ex:
[CODE]
const int NumElements = 10000;
const int NumLoops = 1000;

int a[1000];
int b[1000];
int c[1000];

void CPPTest(){
    for(int i = 0; i < NumLoops; i++){
        for(int j = 0; j < NumElements; j++){
            c[j] = a[j] + b[j];
        }
    }
}
[/CODE]

I don't have the code with me at the moment, but that's basically what I do. The 2 other functions do the same thing in MMX or SSE, replacing the inner loop with assembly code.

Sure, the debug version with no optimization is about 10-12 times faster with SSE, and MMX shows some improvement as well, but in release mode, the MMX version is about 10% slower, and the SSE version is only slightly better, maybe 5%. I have to say I was expecting better results. I also noticed that if I use smaller buffers, I get better results; with bigger ones, the results even out. I suspect the cache is responsible.

So, that's why I'm asking: is it still worth using those instructions when the compiler is so good at optimizing the code?
Short answer: yes. You're simply not testing the right thing.

Long answer: microbenchmarks are useless for this kind of test. You need to know what the actual benefits would be [i]in real usage situations[/i]. Cache pressure, pipelining, etc. etc. can have massive impacts on the performance of a real piece of code. At the end of the day, [b]profile[/b]. Don't guess. Find something that you can [i]prove[/i] is slow - via timing - and then see if SIMD benefits it.

As a side note, compilers can emit pretty good SSE instructions for most code nowadays. If you're building a 64-bit binary you're already using SSE whether you know it or not. Also, writing your own hand-rolled assembly is never a good idea for performance-intensive operations: it inhibits certain compiler optimizations in the vicinity of your code. Use intrinsics instead.
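As an illustration of that last point (this is not the original poster's code), the inner loop from the first post could be expressed with SSE2 intrinsics instead of inline assembly. This sketch assumes 16-byte-aligned arrays and an element count divisible by 4:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

const int NumElements = 1000;
alignas(16) int a[NumElements];
alignas(16) int b[NumElements];
alignas(16) int c[NumElements];

void SSEAdd() {
    // Process 4 ints per iteration; assumes NumElements % 4 == 0
    // and 16-byte-aligned arrays (hence the aligned load/store).
    for (int j = 0; j < NumElements; j += 4) {
        __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(&a[j]));
        __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(&b[j]));
        _mm_store_si128(reinterpret_cast<__m128i*>(&c[j]), _mm_add_epi32(va, vb));
    }
}
```

The compiler can schedule and inline intrinsics like these, which it cannot do around an inline-assembly block.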
In your example, the bottleneck of the program is almost certainly memory bandwidth -- reading/writing your arrays -- so optimizing the ALU-cost of the algorithm should be expected to have little impact.

As with all optimizations, you should profile first.

But putting that aside, one of the main issues with SIMD optimizations is that you need to lay out your data and computations in a way that lets you benefit from SIMD. This is a task the compiler can't do for you. If your data is laid out like in the example above, then adding intrinsics is a no-brainer, and I would assume the compiler already did it to a large degree. In a real-world example (and I think this is what ApochPiQ was getting at), those values would be spread out among random structs somewhere on the heap, and there is nothing the compiler can do. This is where you will see massive speed improvements, not only because of SIMD but also because you are forced to reorganize your data in a way that is more compiler- and CPU/cache-friendly.
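To make the layout point concrete, here is a hypothetical sketch (all names invented for illustration) contrasting the two arrangements:

```cpp
// Array-of-structs: x, y, vx, vy interleaved in memory. Loading four
// consecutive x values for SIMD means gathering from four structs.
struct ParticleAoS { float x, y, vx, vy; };

// Struct-of-arrays: each component is contiguous, so four consecutive
// x values sit in one 16-byte span -- exactly what a SIMD load wants.
struct ParticlesSoA {
    static const int Count = 1024;
    float x[Count], y[Count];
    float vx[Count], vy[Count];
};

// With the SoA layout, these loops are trivial for the compiler (or
// hand-written intrinsics) to process 4 floats at a time.
void Integrate(ParticlesSoA& p, float dt) {
    for (int i = 0; i < ParticlesSoA::Count; ++i) p.x[i] += p.vx[i] * dt;
    for (int i = 0; i < ParticlesSoA::Count; ++i) p.y[i] += p.vy[i] * dt;
}
```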

 

 


The code has a buffer overrun. NumElements should be 1000.

 

For smaller array sizes (i.e. totals < a few MB), the arrays are likely to reside in the cache, and you should notice a performance boost with SSE. For array sizes that exceed the cache size, it's likely that chunks will have to be evicted and re-read (as Hodgman has said). In this case, you can improve performance a little more by using _mm_stream_si128 to write the value to memory without placing a copy in the cache (which leaves more room for the input data, which should help performance a little). Really though, your approach needs to be tuned to the hardware a little better. At the moment you are basically benchmarking how fast your main memory is, which probably isn't that useful as a metric.
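A sketch of that idea (illustrative names, assuming aligned buffers whose length is a multiple of 4 ints):

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_sfence

const int StreamCount = 4096;
alignas(16) int srcA[StreamCount], srcB[StreamCount], dst[StreamCount];

void AddStreamed() {
    for (int j = 0; j < StreamCount; j += 4) {
        __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(&srcA[j]));
        __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(&srcB[j]));
        // Non-temporal store: the result goes to memory without being
        // cached, leaving the cache free for the two input arrays.
        _mm_stream_si128(reinterpret_cast<__m128i*>(&dst[j]), _mm_add_epi32(va, vb));
    }
    _mm_sfence();  // make the streamed stores visible before dst is read
}
```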

 

Memory is slow. The less you read/write, the better the performance will be. Once you have read some memory, try to do as much work on that data as possible BEFORE you write it back out again (i.e. one loop that does lots of work is better than lots of loops that do very little). By doing more work for each SIMD value you read, you will hopefully be able to mask the latency of the memory, and you should get some pretty decent performance from SIMD.
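For example (illustrative code, scalar for clarity -- the same applies to SIMD loops), compare two passes over the data with one fused pass doing the same arithmetic:

```cpp
const int FuseCount = 1000;
float input[FuseCount], output[FuseCount];

// Two passes: output is written, then read back and written again,
// so the memory traffic is paid twice for the same arithmetic.
void TwoPasses() {
    for (int i = 0; i < FuseCount; ++i) output[i] = input[i] * 2.0f;
    for (int i = 0; i < FuseCount; ++i) output[i] = output[i] + 1.0f;
}

// One fused pass: same result, roughly half the memory traffic.
void OnePass() {
    for (int i = 0; i < FuseCount; ++i) output[i] = input[i] * 2.0f + 1.0f;
}
```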


SIMD is very much about knowing your data, its transforms, and how the CPU is going to treat it in terms of memory I/O.

 

Your problem with this test is, as mentioned, that it is memory-bandwidth bound; you are doing very little ALU work, and while the CPU will be prefetching ahead, in this case you don't have enough ALU work to cover the stalls to main memory.

 

With your C++ code, between working on 1 operation at a time and simply working through the data, some of the latency to main memory will be covered anyway.

Assuming your SIMD routines work on 4 values at a time, you are doing the ALU work 4 times faster but bumping into the potential memory stalls that much sooner.

 

Throw some more register-heavy ALU work in there and you'll notice a speed-up.

 

This is where thinking about the data and the transforms comes into it as well. For example, if you have a simple 2D particle system running, you can break up your data so as to split up the processing: do all the 'x' components first, then 'y', then update the velocity for x, then y, and so on, with each section of data in its own nicely aligned chunk of memory. That way you can stream through it, taking advantage of I-cache and D-cache coherency and prefetching, and (on x64) work within the limited set of registers you have to hand.

 

In short: done correctly and with enough ALU work, SIMD is most definitely worth it once you know what you are doing :)

