Performance on multiple threads

8 comments, last by WitchLord 16 years, 10 months ago
Hi, I'm using AngelScript in a context where multiple threads call a function in AngelScript simultaneously. The calls are very frequent, and up to eight threads can be calling at the same time. When I run it on my Core Duo laptop I get pretty nice performance. But when I run it on a 2 x quad core Xeon system, the same operation that took 20 seconds on my laptop takes over a minute. Not what you expect from a machine that is among the fastest PCs there are. Note that this "operation" means thousands of calls to AngelScript (image processing).

Anyway, I've tried a number of things. I've even gone so far as to make sure that each thread uses a different engine instance. That did not help. And when I removed the USE_THREADS define in AngelScript, I got a crash when roughly half of the calls to AngelScript were done. This makes me think that even separate engine instances share some memory and therefore use critical sections in those places, which slows things down when using multiple threads.

As for the script I'm running: I'm basically just doing a loop over 20 iterations. I also pass an object by &in reference to the function, which returns an instance of a class I have (a very simple class, basically a struct with r, g, b and a components). I get no errors, and the result (in this case an image) looks exactly as expected.

Any tips on how to improve the performance? Does it have anything to do with Xeon processors? I've also tried it on a 2 x Core Duo Macintosh running Boot Camp, and there I had good performance as well.
Perhaps the Xeon server processors are not suited to image processing.

Perhaps the additional processors are causing more contention and thus livelock (threads stay busy but make little progress), but I doubt this is the problem.
Well the machines are insanely fast when doing the same operation I do in plain C++, so it is a scripting thing.

An interesting twist is that if I add a critical section around the prepare-execute-return block I get slightly better performance on the systems with 8 cores, but naturally less performance on my laptop (bumped up to 43 seconds). So it appears to be something in the Execute function that makes the processors feel sad.
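
For anyone who wants to repeat the critical-section experiment, it amounts to roughly the following sketch. `callScript` is a hypothetical stand-in for the prepare-execute-return block (it is not an AngelScript function), and `std::mutex` is assumed here in place of a platform critical section.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the prepare-execute-return block.
// A real version would call into the script engine here.
static int callScript(int x) {
    int acc = 0;
    for (int i = 0; i < 20; ++i) acc += x;   // mirrors the 20-iteration script loop
    return acc;
}

static std::mutex g_scriptLock;   // plays the role of the critical section

// Only one thread at a time may be inside the script call.
static int callScriptSerialized(int x) {
    std::lock_guard<std::mutex> guard(g_scriptLock);
    return callScript(x);
}

// Run `threads` workers, each making `callsPerThread` serialized calls,
// and return the sum of all results.
static long long runSerialized(int threads, int callsPerThread) {
    assert(threads > 0 && callsPerThread >= 0);
    std::vector<long long> partial(threads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&partial, t, callsPerThread] {
            for (int c = 0; c < callsPerThread; ++c)
                partial[t] += callScriptSerialized(1);
        });
    }
    for (auto& th : pool) th.join();
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

Serializing the call removes all parallelism in the script part, so if this version is faster on the eight-core machine, that hints that the cores are getting in each other's way during execution rather than the lock itself being the cost.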

[Edited by - Malmer on June 11, 2007 5:16:23 AM]
I think you should prepare the function only once and cache the result, not prepare it every time you need to call it. AngelScript is quite slow at prepare, but rather fast at executing an already prepared function :)

You should cut the time down a lot by moving the prepare out of your expensive loop.

cheers
Well, that would just be an optimization. It still wouldn't explain the reason why the faster machines perform worse than the slower laptop.
There are a few places where critical sections are used to protect the data from simultaneous access from different threads. You can search the source code for ENTERCRITICALSECTION to find these.

The most likely bottleneck for you is the call to asCThreadManager::GetLocalData(), which is used to obtain the data structure used to store thread local data (of course). From your description it would seem that it is most frequently called in relation with Context::Execute calls, since AngelScript needs to keep track of the active contexts in a stack (asPushActiveContext and asPopActiveContext).

The more active threads you have, the more collisions you'll have in the GetLocalData call, so it is reasonable that the overhead is larger on your 2 x quad core Xeon system. Still, I wouldn't have thought the execution time would actually get longer; that would mean nearly all the execution time is spent inside the critical sections, which doesn't make sense to me.

If you feel up to it, you could try removing the GetLocalData overhead during execution, simply by calling it when the context is created instead of in asPushActiveContext and asPopActiveContext. The pointer to the local data must then be stored in the context and passed to asPushActiveContext and asPopActiveContext as a parameter. If the bottleneck is indeed in GetLocalData, you'll almost completely eliminate it like this. Of course, you must also make sure you reuse your contexts instead of creating new ones for each execution.
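
As a rough illustration of the idea (not the actual AngelScript source), the change could look something like this. `ThreadManager`, `ThreadLocalData` and `Context` are simplified, hypothetical stand-ins for the library's internal classes, and the `lookups` counter exists only to show how often the locked lookup runs.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// All types here are simplified, hypothetical stand-ins, not AngelScript's classes.
struct ThreadLocalData {
    std::vector<const void*> activeContexts;  // stack of contexts running on this thread
};

class ThreadManager {
public:
    // Lock-protected lookup: the part suspected of causing contention.
    ThreadLocalData* getLocalData() {
        std::lock_guard<std::mutex> guard(lock_);
        ++lookups_;
        return &perThread_[std::this_thread::get_id()];  // std::map keeps references stable
    }
    int lookups() {
        std::lock_guard<std::mutex> guard(lock_);
        return lookups_;
    }
private:
    std::mutex lock_;
    std::map<std::thread::id, ThreadLocalData> perThread_;
    int lookups_ = 0;
};

class Context {
public:
    // The lookup happens once, at creation, on the thread that will use the context.
    explicit Context(ThreadManager& mgr) : tld_(mgr.getLocalData()) {}

    int execute(int x) {
        pushActive();              // no lock is taken here any more
        int result = x * 2;        // placeholder for running the script
        popActive();
        return result;
    }
private:
    void pushActive() { tld_->activeContexts.push_back(this); }
    void popActive() {
        assert(!tld_->activeContexts.empty() && tld_->activeContexts.back() == this);
        tld_->activeContexts.pop_back();
    }
    ThreadLocalData* tld_;   // cached pointer; only valid on the creating thread
};
```

The catch with this layout is that a context must always be executed on the thread that created it, since the cached pointer refers to that thread's data.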

Let me know how this works out. It sounds like it would be a good improvement to add to the SVN as well.

Regards,
Andreas

AngelCode.com - game development and more - Reference DB - game developer references
AngelScript - free scripting library - BMFont - free bitmap font generator - Tower - free puzzle game

I went a somewhat easier path than you suggested, which should have the same result.

I create a context at first call from a thread. Otherwise it just reuses the context previously used by that thread.
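
A minimal sketch of that per-thread reuse, assuming a C++11 compiler with `thread_local` (which may not match what was actually used here). `ScriptContext` is a hypothetical stand-in for the real context type, and its `created` counter is only there to show that each thread builds exactly one.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for a script context; counts how many get created.
struct ScriptContext {
    static std::atomic<int> created;
    ScriptContext() { ++created; }
    int run(int x) { return x + 1; }   // placeholder for prepare + execute
};
std::atomic<int> ScriptContext::created{0};

// Returns this thread's context, creating it on the first call from that thread.
static ScriptContext& contextForThisThread() {
    thread_local std::unique_ptr<ScriptContext> ctx;
    if (!ctx) ctx.reset(new ScriptContext());
    return *ctx;
}

// Each worker makes many calls; returns how many contexts were created in total.
static int contextsCreatedBy(int threads, int callsPerThread) {
    assert(threads > 0);
    const int before = ScriptContext::created.load();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([callsPerThread] {
            for (int c = 0; c < callsPerThread; ++c)
                contextForThisThread().run(c);
        });
    }
    for (auto& th : pool) th.join();
    return ScriptContext::created.load() - before;
}
```

With eight workers making a thousand calls each, only eight contexts are created, one per thread.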

Then I removed asPushActiveContext/asPopActiveContext from Execute and moved them to the constructor and Release() respectively. That should get rid of all GetLocalData() calls during execution.

Result: no measurable difference at all.

But if, as I said before, I put a critical section around my Execute call, I get better performance on the multicore Xeons than before, although I get worse performance on my dual core laptop.

This makes me believe it has something to do with cache misses or something in the whole ExecuteNext(thingie) that occurs on systems with dual processors or Xeon specifically.
Yeah, if removing the Push/PopActiveContext calls didn't have any noticeable effect then it's pretty safe to assume the problem isn't with the critical sections.

Since AngelScript is based on the use of a VM it will naturally require a lot of memory reads. If multiple cores share the cache then they are highly likely to suffer even more from cache misses due to this. Is the Xeon processor designed with a shared cache, or a separate cache for each core?

Really, my knowledge of high performance software is nowhere near good enough to be of much help to you on that subject. I'll accept any suggestions for improvement on this.

It would be interesting to see if other script engines suffer from the same problem or not. If my guess is right, then I guess they would. Maybe the only salvation is to use native code, either by compiling directly with C++, or using JIT compilation.
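
One way to make such a comparison would be a small timing harness that runs the same per-thread workload at different thread counts. The sketch below is only an assumption about how one might measure it; `timeWorkload` and `Timing` are hypothetical names, and `work` would wrap a call into whichever script engine is being tested.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

struct Timing {
    double seconds;     // wall-clock time for the whole run
    long long calls;    // total number of calls completed
};

// Runs `work` `callsPerThread` times on each of `threads` threads.
// The per-thread workload is fixed, so with perfect scaling the
// elapsed time stays roughly flat as `threads` grows.
static Timing timeWorkload(int threads, int callsPerThread,
                           const std::function<void()>& work) {
    assert(threads > 0 && callsPerThread >= 0);
    std::atomic<long long> done{0};
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&work, &done, callsPerThread] {
            for (int c = 0; c < callsPerThread; ++c) work();
            done += callsPerThread;   // one atomic update per thread, after its loop
        });
    }
    for (auto& th : pool) th.join();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return Timing{elapsed.count(), done.load()};
}
```

Calling it with 1, 2, 4 and 8 threads and comparing the `seconds` values would show whether the time climbs with the thread count, which is the symptom described in this thread.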

Figured that was the case. Thanks for the help.

Anyway, it runs very smoothly in most cases. In general most scripts are pretty short, and then it works fast and well. It is when you enter long for-loops that it gets a bit slower. For simpler scripts it should work OK. AngelScript really is very good. I like it a lot. It is easy to work with and performance is very satisfying (except in the heavy script + multicore scenario).
Which may not be a situation well suited for scripting to begin with. :)

I'm glad to hear you like AngelScript and that it is performing well.

Which version of AS are you using? I've made lots of bug fixes for the upcoming release which should hopefully benefit all users. There are still a few to root out before I release 2.8.1, but hopefully it will be available in a couple of weeks.

Following that I'll continue to make changes that should make AngelScript even better, e.g. far fewer memory allocations during script execution, a more intuitive application interface, and so on. I'm in a phase where I'm improving what's already in AngelScript instead of adding lots of new features.

By the way, did you give up on your game? You are the same Malmer from malmer.nu, aren't you?

Regards,
Andreas

This topic is closed to new replies.
