• FEATURED

View more

View more

View more

Image of the Day Submit

IOTD | Top Screenshots

The latest, straight to your Inbox.

Subscribe to GameDev.net Direct to receive the latest updates and exclusive content.

strange performance patterns

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

22 replies to this topic

#1Storyyeller  Members

Posted 11 January 2012 - 09:23 AM

I have a python script which does some calculations, and I am trying to figure out how fast it is. I invoke it from the command line and it prints out the time taken after completing.

However, there is a mysterious pattern to the timings. When I first run it, it starts out relatively fast. However, subsequent runs become slower and slower. If I leave the computer idle for a while, it becomes fast again. I can't really think of any explanation for this. What is going on?
I trust exceptions about as far as I can throw them.

#2thok  Members

Posted 11 January 2012 - 09:41 AM

What is the difference in time? Have you run your code through a profiler? (Have a look at cProfile: http://docs.python.org/library/profile.html.)

What OS?
What version of Python?
Can you give a basic description of what your code is doing?

#3swiftcoder  Senior Moderators

Posted 11 January 2012 - 10:14 AM

However, there is a mysterious pattern to the timings. When I first run it, it starts out relatively fast. However, subsequent runs become slower and slower. If I leave the computer idle for a while, it becomes fast again. I can't really think of any explanation for this. What is going on?

Does your script use vast amounts of memory? We sometimes observe this kind of behaviour on scripts that consume 8GB+ of memory (and thus overflow into virtual memory) - it seems the OS often takes a while to free up all that memory back to the next invocation.

Tristam MacDonald - Software Engineer @ Amazon - [swiftcoding] [GitHub]

#4Storyyeller  Members

Posted 11 January 2012 - 11:00 AM

My os is Windows 7 and I'm using Python 2.6

As far as memory, according to the task manager, the python.exe process peaks around 6mb of memory, so I doubt that that is a problem.
I trust exceptions about as far as I can throw them.

#5e‍dd  Members

Posted 11 January 2012 - 01:32 PM


import cProfile

cProfile.run('main()')



#6Storyyeller  Members

Posted 11 January 2012 - 02:07 PM

If the timing of the overall script with a simple timer is inconsistent, how would in depth profiling help?
I trust exceptions about as far as I can throw them.

#7swiftcoder  Senior Moderators

Posted 11 January 2012 - 02:10 PM

If the timing of the overall script with a simple timer is inconsistent, how would in depth profiling help?

You profile several runs, making sure to profile both a quick and a slow run. Compare where the majority of time is spent in each, and you may just have your culprit...

Tristam MacDonald - Software Engineer @ Amazon - [swiftcoding] [GitHub]

#8Storyyeller  Members

Posted 11 January 2012 - 02:46 PM

While profiling, I noticed that a large amount of time is spent finding the set bits in an integer. Does anyone know the most effecient way to do this? Here's what I have right now. What would be ideal is a c implementation, but there doesn't appear to be one in the standard library, and I don't want to require other libraries.

[source lang='python']def toList(bs): assert(bs >= 0) #negative numbers have infinitely many bits set return [i for i,b in enumerate(reversed(bin(bs))) if b=='1'][/source]
I trust exceptions about as far as I can throw them.

#9swiftcoder  Senior Moderators

Posted 11 January 2012 - 03:23 PM

Read all of BitManipulation. In particular, you might be able to build some variation of Kernighan/Knuth's bitCount() method, to iterate bits instead.

Your current solution uses a string conversion, a reversal, an enumeration and a list comphrehension all together - that can't be very fast.

Tristam MacDonald - Software Engineer @ Amazon - [swiftcoding] [GitHub]

#10Storyyeller  Members

Posted 11 January 2012 - 03:56 PM

That page looks out of date and inaccurate. Most of the methods involving manual loops are very slow in Python. (This is actually what I tried before I came up with the faster version posted above).
I trust exceptions about as far as I can throw them.

#11swiftcoder  Senior Moderators

Posted 11 January 2012 - 04:32 PM

Well, this problem intrigues me, so I took the time to produce what should be the optimal solution (for compiled languages, at least). It's based on de Bruijn sequences to eliminate as many loop iterations as possible, and it's implemented as an iterator/generator.

_DeBruijnBitPosition = [
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
]

def _trunc32( w ):
w = int( ( w & 0x7fffFFFF ) | -( w & 0x80000000 ) )
return w

def findSetBits(v):
while v:
index = _DeBruijnBitPosition[_trunc32((v & -v) * 0x077CB531) >> 27]
v = (v & v-1)
yield index

# usage: list(findSetBits(4937)) => [0, 3, 6, 8, 9, 12]


I don't happen to have a native python interpreter handy - fancy benchmarking this against your earlier list comprehension solution?

Tristam MacDonald - Software Engineer @ Amazon - [swiftcoding] [GitHub]

#12Storyyeller  Members

Posted 11 January 2012 - 04:33 PM

That won't work in Python, because ints are arbitrary precision. The paper assumes fixed 32bit ints with overflow.
I trust exceptions about as far as I can throw them.

#13swiftcoder  Senior Moderators

Posted 11 January 2012 - 04:40 PM

Is that only for 32bit ints?

Yes - I had assumed you were operating on a maximum of 32-bits. It's not easy to generalise any of the high-performance solutions to a large number of bits.

How big are your bit sets typically?

Tristam MacDonald - Software Engineer @ Amazon - [swiftcoding] [GitHub]

#14Storyyeller  Members

Posted 11 January 2012 - 04:43 PM

Well in this particular test case, the bit sets have length 109, though I hope it generalizes to larger values.
I trust exceptions about as far as I can throw them.

#15swiftcoder  Senior Moderators

Posted 11 January 2012 - 04:49 PM

Well in this particular test case, the bit sets have length 109, though I hope it generalizes to larger values.

One could derive further sequence values, but if you don't have a fixed upper bound, probably not worth it.

My research group attempted a similar thing for a large scale bittorrent simulation, with upwards of 1,000 elements per bitset. In the end, we ported the whole damn thing to C++, because we just couldn't squeeze enough performance out of python in this area.

That said, PyPy's performance has come along way in the last few years - have you considered trying your current solution out on their interpreter?

/aside:

That won't work in Python, because ints are arbitrary precision. The paper assumes fixed 32bit ints with overflow.

That's why I had to explicitly truncate to 32-bits with the _trunc32 function. I think I have handled the overflow correctly... Either way, it seems to pass all my unit tests.

Tristam MacDonald - Software Engineer @ Amazon - [swiftcoding] [GitHub]

#16e‍dd  Members

Posted 11 January 2012 - 04:54 PM

If the timing of the overall script with a simple timer is inconsistent, how would in depth profiling help?

Timing tells you it's slow. Profiling tells you why. You want to know the latter, presumably?

For counting bits, perhaps something like:

def count_bits(i):
return len(bin(i).replace('0', '')) - 1 # -1 for 'b'

?

I would guess that moving the loop in to C land might help.

#17Storyyeller  Members

Posted 11 January 2012 - 04:56 PM

That's why I had to explicitly truncate to 32-bits with the _trunc32 function. I think I have handled the overflow correctly... Either way, it seems to pass all my unit tests.

Sorry, I said that while reading the paper and didn't look carefully at the code. It looks like your code will work as long as the bitset size is at most 32.

For counting bits, perhaps something like:

def count_bits(i):
return len(bin(i).replace('0', '')) - 1 # -1 for 'b'

?

I would guess that moving the loop in to C land might help.

Personally, I use
bin(x).count("1")
for counting bits. But the main problem here is gettign a list of set bits, not counting them.
I trust exceptions about as far as I can throw them.

#18e‍dd  Members

Posted 11 January 2012 - 04:59 PM

Ah, missed that. Are you sure that's the problem? Do you have benchmark results?

#19Storyyeller  Members

Posted 11 January 2012 - 08:09 PM

I figured out a way to rewrite my code so it doesn't require getting individual bits at all. Of course, it does mean more work has to be done in other places, but it is still faster over all.

Here's the profiler results after all the optimizations I've done.
		 5209449 function calls (5047367 primitive calls) in 27.757 CPU seconds
Ordered by: standard name
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    0.004    0.004   27.757   27.757 <string>:1(<module>)
1    0.000    0.000    0.000    0.000 bisect.py:1(<module>)
1904    0.017    0.000    0.036    0.000 bitset.py:10(bitset)
112424    0.640    0.000    1.129    0.000 bitset.py:3(bitcount)
321    0.008    0.000    0.009    0.000 bitset.py:33(toList)
1    0.000    0.000    0.000    0.000 bitset.py:41(makeKeyMap)
656208    2.846    0.000    4.449    0.000 bitset.py:7(union)
73374/13957    0.587    0.000   21.646    0.002 calculation.py:126(getCostSub2)
582249    1.164    0.000    1.164    0.000 calculation.py:14(isSubset)
31678    2.773    0.000   21.387    0.001 calculation.py:140(getSkeletonConnectLists)
112583    1.627    0.000    6.066    0.000 calculation.py:168(getTouchedChildren)
268969    1.865    0.000    3.028    0.000 calculation.py:17(connectKeyTupIsLessEq)
221132    0.478    0.000    0.478    0.000 calculation.py:173(getTouchSet)
751/109    0.018    0.000    0.118    0.001 calculation.py:186(makeCalcNode)
109    0.002    0.000    0.122    0.001 calculation.py:202(makeCalcRoot)
1    0.001    0.001    0.123    0.123 calculation.py:207(buildCalculationTrees)
321    0.008    0.000    0.077    0.000 calculation.py:22(makeConnectDict_Bitset)
109    0.001    0.000    0.003    0.000 calculation.py:31(__init__)
1983    0.005    0.000    0.005    0.000 calculation.py:36(getCost)
109    0.002    0.000    0.003    0.000 calculation.py:41(__init__)
342    0.032    0.000   25.089    0.073 calculation.py:46(getCost)
86698    1.167    0.000    4.451    0.000 calculation.py:6(insertIncomparableList)
321    0.006    0.000    0.097    0.000 calculation.py:68(__init__)
85023/2223    0.653    0.000   25.051    0.011 calculation.py:78(getCost)
12660/1466    1.070    0.000   25.019    0.017 calculation.py:95(getCostSub)
1453/321    0.022    0.000    0.026    0.000 connect.py:1(connectivityDictSub)
321    0.021    0.000    0.052    0.000 connect.py:18(connectivityDict)
215922    1.427    0.000    1.797    0.000 decomposition.py:13(<lambda>)
1    0.004    0.004    2.441    2.441 decomposition.py:33(carvingDecomposition)
236    0.001    0.000    0.002    0.000 decomposition.py:41(<lambda>)
1    0.006    0.006    0.036    0.036 decomposition.py:52(bitsetizeKeys)
1124    0.007    0.000    0.025    0.000 decomposition.py:53(convertEdge)
106    0.008    0.000    2.433    0.023 decomposition.py:6(greedySplitNode)
5989    0.039    0.000    0.073    0.000 decomposition.py:9(<lambda>)
212    0.001    0.000    0.001    0.000 graph.py:12(removeEdge)
107    0.000    0.000    0.001    0.000 graph.py:15(numNeighbors)
325    0.001    0.000    0.001    0.000 graph.py:4(__init__)
545    0.002    0.000    0.002    0.000 graph.py:8(addEdge)
1    0.003    0.003    0.003    0.003 heapq.py:31(<module>)
3    0.000    0.000    0.000    0.000 ntpath.py:122(splitdrive)
3    0.000    0.000    0.000    0.000 ntpath.py:55(isabs)
1    0.000    0.000    0.000    0.000 ntpath.py:63(join)
1    0.000    0.000    2.601    2.601 rsglib.py:27(__init__)
342    0.005    0.000   25.115    0.073 rsglib.py:40(getCost)
1    0.001    0.001    0.002    0.002 test.py:20(makeGraph)
493    0.007    0.000    0.007    0.000 test.py:31(getSuccessorStates)
1    0.007    0.007   25.145   25.145 test.py:38(aStarSearch)
342    0.010    0.000   25.127    0.073 test.py:41(push)
1    0.005    0.005   27.754   27.754 test.py:57(testLevel)
1    0.001    0.001    0.002    0.002 test.py:9(parseLevel)
220466    0.508    0.000    0.508    0.000 unionfind_bitset.py:18(getContainingSet)
406014    5.661    0.000   11.071    0.000 unionfind_bitset.py:3(unionBitsetLists)
153    0.001    0.000    0.001    0.000 {_heapq.heappop}
342    0.002    0.000    0.002    0.000 {_heapq.heappush}
196397    0.370    0.000    0.370    0.000 {any}
112745    0.250    0.000    0.250    0.000 {bin}
361464    0.640    0.000    0.640    0.000 {len}
2005/1097    0.014    0.000    0.033    0.000 {map}
1    0.000    0.000    0.000    0.000 {method '__enter__' of 'file' objects}
216    0.000    0.000    0.000    0.000 {method 'add' of 'set' objects}
574394    1.092    0.000    1.092    0.000 {method 'append' of 'list' objects}
112424    0.240    0.000    0.240    0.000 {method 'count' of 'str' objects}
1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
106    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
215    0.001    0.000    0.001    0.000 {method 'items' of 'dict' objects}
321    0.001    0.000    0.001    0.000 {method 'iteritems' of 'dict' objects}
431    0.001    0.000    0.001    0.000 {method 'pop' of 'list' objects}
1    0.000    0.000    0.000    0.000 {method 'readlines' of 'file' objects}
8    0.000    0.000    0.000    0.000 {method 'strip' of 'str' objects}
106    0.001    0.000    0.001    0.000 {method 'union' of 'frozenset' objects}
322    0.001    0.000    0.001    0.000 {method 'values' of 'dict' objects}
15168    0.593    0.000    2.388    0.000 {min}
1    0.000    0.000    0.000    0.000 {open}
65237    0.169    0.000    0.169    0.000 {range}
656208    1.602    0.000    1.602    0.000 {reduce}
7899/1910    0.064    0.000    0.114    0.000 {sorted}



I trust exceptions about as far as I can throw them.

#20thok  Members

Posted 12 January 2012 - 03:00 AM

I figured out a way to rewrite my code so it doesn't require getting individual bits at all. Of course, it does mean more work has to be done in other places, but it is still faster over all.

Here's the profiler results after all the optimizations I've done.
...

What did the profile report look like before? Profiling only _after_ an optimization is kind of a useless exercise.

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.