
strange performance patterns

I have a Python script which does some calculations, and I am trying to figure out how fast it is. I invoke it from the command line and it prints out the time taken after completing.

However, there is a mysterious pattern to the timings. When I first run it, it starts out relatively fast. However, subsequent runs become slower and slower. If I leave the computer idle for a while, it becomes fast again. I can't really think of any explanation for this. What is going on?

What is the difference in time? Have you run your code through a profiler? (Have a look at cProfile: http://docs.python.org/library/profile.html.)
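To get started with cProfile, something along these lines will dump a profile to a file and print the most expensive calls by cumulative time (assuming, purely for illustration, that your script's entry point is a main() function):

[code]
import cProfile, pstats

# Run the script under the profiler and save the raw stats to a file.
cProfile.run('main()', 'profile.out')

# Load the stats and show the 20 most expensive calls by cumulative time.
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)
[/code]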

What OS?
What version of Python?
Can you give a basic description of what your code is doing?

[quote name='Storyyeller' timestamp='1326295402' post='4901654']
However, there is a mysterious pattern to the timings. When I first run it, it starts out relatively fast. However, subsequent runs become slower and slower. If I leave the computer idle for a while, it becomes fast again. I can't really think of any explanation for this. What is going on?[/quote]
Does your script use vast amounts of memory? We sometimes observe this kind of behaviour on scripts that consume 8GB+ of memory (and thus overflow into virtual memory) - it seems the OS often takes a while to release all that memory before the next invocation.

My OS is Windows 7 and I'm using Python 2.6.

As far as memory goes, according to Task Manager the python.exe process peaks at around 6 MB, so I doubt that is the problem.

If the timing of the overall script with a simple timer is inconsistent, how would in depth profiling help?

[quote name='Storyyeller' timestamp='1326312477' post='4901736']
If the timing of the overall script with a simple timer is inconsistent, how would in depth profiling help?[/quote]
You profile several runs, making sure to profile both a quick and a slow run. Compare where the majority of time is spent in each, and you may just have your culprit...
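A rough sketch of that comparison (the filenames and the main() entry point here are just placeholders): save each run's raw stats to its own file, then use pstats to see where the cumulative time goes in each.

[code]
import cProfile, pstats

# During a fast run and a slow run, dump the stats to differently named files:
cProfile.run('main()', 'fast.prof')    # and 'slow.prof' on the slow run

# Afterwards, print the top 15 entries of each and compare them side by side.
for name in ('fast.prof', 'slow.prof'):
    print name
    pstats.Stats(name).sort_stats('cumulative').print_stats(15)
[/code]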

While profiling, I noticed that a large amount of time is spent finding the set bits in an integer. Does anyone know the most efficient way to do this? Here's what I have right now. What would be ideal is a C implementation, but there doesn't appear to be one in the standard library, and I don't want to require other libraries.

[source lang='python']
def toList(bs):
    assert bs >= 0  # negative numbers have infinitely many bits set
    return [i for i, b in enumerate(reversed(bin(bs))) if b == '1']
[/source]

Read all of [url="http://wiki.python.org/moin/BitManipulation"]BitManipulation[/url]. In particular, you might be able to build some variation of Kernighan/Knuth's bitCount() method to iterate bits instead.

Your current solution uses a string conversion, a reversal, an enumeration and a list comprehension all together - that can't be very fast.
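Roughly, that idea translates to Python as follows (a sketch only, not benchmarked against anything) - the trick is that n & (n - 1) clears the lowest set bit, so the loop runs once per set bit rather than once per bit position:

[code]
def bitCount(n):
    # Kernighan/Knuth-style popcount: each iteration clears one set bit.
    count = 0
    while n:
        n &= n - 1
        count += 1
    return count

def iterLowBits(n):
    # The "iterate bits" variation: yields each set bit isolated as a
    # power of two (1, 8, 64, ...) rather than as an index.
    while n:
        lowest = n & -n
        yield lowest
        n ^= lowest
[/code]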

That page looks out of date and inaccurate. Most of the methods involving manual loops are very slow in Python. (This is actually what I tried before I came up with the faster version posted above).

Well, this problem intrigues me, so I took the time to produce what should be the optimal solution (for compiled languages, at least). It's based on [url="http://supertech.csail.mit.edu/papers/debruijn.pdf"]de Bruijn sequences[/url] to eliminate as many loop iterations as possible, and it's implemented as an iterator/generator.

[CODE]
_DeBruijnBitPosition = [
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
]

def _trunc32(w):
    return int((w & 0x7fffFFFF) | -(w & 0x80000000))

def findSetBits(v):
    while v:
        index = _DeBruijnBitPosition[_trunc32((v & -v) * 0x077CB531) >> 27]
        v &= v - 1
        yield index

# usage: list(findSetBits(4937)) => [0, 3, 6, 8, 9, 12]
[/CODE]

I don't happen to have a native Python interpreter handy - fancy benchmarking this against your earlier list comprehension solution?
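If it helps, timeit from the standard library is probably the least-effort way to do that - something like the following (the import line and the sample value are placeholders for whatever you actually have):

[CODE]
import timeit

setup = 'from bitset import toList, findSetBits'  # adjust to wherever these live
n = 4937  # sample value from the usage comment above

print timeit.timeit('toList(%d)' % n, setup=setup, number=100000)
print timeit.timeit('list(findSetBits(%d))' % n, setup=setup, number=100000)
[/CODE]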

That won't work in Python, because ints are arbitrary precision. The paper assumes fixed 32bit ints with overflow.

[quote name='Storyyeller' timestamp='1326321224' post='4901788']
Is that only for 32bit ints?[/quote]
Yes - I had assumed you were operating on a maximum of 32 bits. It's not easy to generalise any of the high-performance solutions to a large number of bits.

How big are your bit sets typically?

Well in this particular test case, the bit sets have length 109, though I hope it generalizes to larger values.

[quote name='Storyyeller' timestamp='1326321806' post='4901797']
Well in this particular test case, the bit sets have length 109, though I hope it generalizes to larger values.[/quote]
One could derive further sequence values, but if you don't have a fixed upper bound it's probably not worth it.

My research group attempted a similar thing for a large-scale bittorrent simulation, with upwards of 1,000 elements per bitset. In the end, we ported the whole damn thing to C++, because we just couldn't squeeze enough performance out of Python in this area.

That said, PyPy's performance has come a long way in the last few years - have you considered trying your current solution out on their interpreter?
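(If you did want to reuse a fixed-width routine on arbitrarily big ints anyway, one obvious - though untested - approach is to slice the number into 32-bit words and offset the resulting indices, reusing the findSetBits() generator above. Whether the extra Python-level loop beats your string-based version is another question.)

[CODE]
def findSetBitsWide(v, wordbits=32):
    # Walk an arbitrarily large non-negative int one fixed-width word at a
    # time, reusing the 32-bit findSetBits() generator above for each word.
    base = 0
    mask = (1 << wordbits) - 1
    while v:
        for i in findSetBits(v & mask):
            yield base + i
        v >>= wordbits
        base += wordbits
[/CODE]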



/aside:
[quote name='Storyyeller' timestamp='1326321224' post='4901788']
That won't work in Python, because ints are arbitrary precision. The paper assumes fixed 32bit ints with overflow.[/quote]
That's why I had to explicitly truncate to 32-bits with the _trunc32 function. I think I have handled the overflow correctly... Either way, it seems to pass all my unit tests.

[quote name='Storyyeller' timestamp='1326312477' post='4901736']
If the timing of the overall script with a simple timer is inconsistent, how would in depth profiling help?
[/quote]
Timing tells you it's slow. Profiling tells you why. You want to know the latter, presumably?

For counting bits, perhaps something like:

[code]
def count_bits(i):
    return len(bin(i).replace('0', '')) - 1 # -1 for 'b'
[/code]
?

I would guess that moving the loop into C land might help.

[quote name='swiftcoder' timestamp='1326322188' post='4901799']
That's why I had to explicitly truncate to 32-bits with the _trunc32 function. I think I have handled the overflow correctly... Either way, it seems to pass all my unit tests.
[/quote]

Sorry, I said that while reading the paper and didn't look carefully at the code. It looks like your code will work as long as the bitset size is at most 32.



[quote name='edd²' timestamp='1326322461' post='4901802']
For counting bits, perhaps something like:

[code]
def count_bits(i):
    return len(bin(i).replace('0', '')) - 1 # -1 for 'b'
[/code]
?

I would guess that moving the loop into C land might help.
[/quote]

Personally, I use [code]bin(x).count("1")[/code] for counting bits. But the main problem here is getting a list of set bits, not counting them.
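For the list itself, one more thing I could benchmark is a precomputed per-byte table, so the Python-level loop only runs once per byte instead of once per bit (just a sketch, I haven't measured it):

[source lang='python']
# Set-bit positions for every possible byte value, computed once up front.
_BYTE_BITS = [[i for i in range(8) if (b >> i) & 1] for b in range(256)]

def toListBytewise(bs):
    assert bs >= 0
    out = []
    base = 0
    while bs:
        out.extend(base + i for i in _BYTE_BITS[bs & 0xFF])
        bs >>= 8
        base += 8
    return out
[/source]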

Ah, missed that. [s]Are you sure that's the problem? Do you have benchmark results?[/s]
EDIT: doh, and misread question.

I figured out a way to rewrite my code so it doesn't require getting individual bits at all. Of course, it does mean more work has to be done in other places, but it is still faster overall.

Here's the profiler results after all the optimizations I've done.
[CODE]
5209449 function calls (5047367 primitive calls) in 27.757 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 27.757 27.757 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 bisect.py:1(<module>)
1904 0.017 0.000 0.036 0.000 bitset.py:10(bitset)
112424 0.640 0.000 1.129 0.000 bitset.py:3(bitcount)
321 0.008 0.000 0.009 0.000 bitset.py:33(toList)
1 0.000 0.000 0.000 0.000 bitset.py:41(makeKeyMap)
656208 2.846 0.000 4.449 0.000 bitset.py:7(union)
73374/13957 0.587 0.000 21.646 0.002 calculation.py:126(getCostSub2)
582249 1.164 0.000 1.164 0.000 calculation.py:14(isSubset)
31678 2.773 0.000 21.387 0.001 calculation.py:140(getSkeletonConnectLists)
112583 1.627 0.000 6.066 0.000 calculation.py:168(getTouchedChildren)
268969 1.865 0.000 3.028 0.000 calculation.py:17(connectKeyTupIsLessEq)
221132 0.478 0.000 0.478 0.000 calculation.py:173(getTouchSet)
751/109 0.018 0.000 0.118 0.001 calculation.py:186(makeCalcNode)
109 0.002 0.000 0.122 0.001 calculation.py:202(makeCalcRoot)
1 0.001 0.001 0.123 0.123 calculation.py:207(buildCalculationTrees)
321 0.008 0.000 0.077 0.000 calculation.py:22(makeConnectDict_Bitset)
109 0.001 0.000 0.003 0.000 calculation.py:31(__init__)
1983 0.005 0.000 0.005 0.000 calculation.py:36(getCost)
109 0.002 0.000 0.003 0.000 calculation.py:41(__init__)
342 0.032 0.000 25.089 0.073 calculation.py:46(getCost)
86698 1.167 0.000 4.451 0.000 calculation.py:6(insertIncomparableList)
321 0.006 0.000 0.097 0.000 calculation.py:68(__init__)
85023/2223 0.653 0.000 25.051 0.011 calculation.py:78(getCost)
12660/1466 1.070 0.000 25.019 0.017 calculation.py:95(getCostSub)
1453/321 0.022 0.000 0.026 0.000 connect.py:1(connectivityDictSub)
321 0.021 0.000 0.052 0.000 connect.py:18(connectivityDict)
215922 1.427 0.000 1.797 0.000 decomposition.py:13(<lambda>)
1 0.004 0.004 2.441 2.441 decomposition.py:33(carvingDecomposition)
236 0.001 0.000 0.002 0.000 decomposition.py:41(<lambda>)
1 0.006 0.006 0.036 0.036 decomposition.py:52(bitsetizeKeys)
1124 0.007 0.000 0.025 0.000 decomposition.py:53(convertEdge)
106 0.008 0.000 2.433 0.023 decomposition.py:6(greedySplitNode)
5989 0.039 0.000 0.073 0.000 decomposition.py:9(<lambda>)
212 0.001 0.000 0.001 0.000 graph.py:12(removeEdge)
107 0.000 0.000 0.001 0.000 graph.py:15(numNeighbors)
325 0.001 0.000 0.001 0.000 graph.py:4(__init__)
545 0.002 0.000 0.002 0.000 graph.py:8(addEdge)
1 0.003 0.003 0.003 0.003 heapq.py:31(<module>)
3 0.000 0.000 0.000 0.000 ntpath.py:122(splitdrive)
3 0.000 0.000 0.000 0.000 ntpath.py:55(isabs)
1 0.000 0.000 0.000 0.000 ntpath.py:63(join)
1 0.000 0.000 2.601 2.601 rsglib.py:27(__init__)
342 0.005 0.000 25.115 0.073 rsglib.py:40(getCost)
1 0.001 0.001 0.002 0.002 test.py:20(makeGraph)
493 0.007 0.000 0.007 0.000 test.py:31(getSuccessorStates)
1 0.007 0.007 25.145 25.145 test.py:38(aStarSearch)
342 0.010 0.000 25.127 0.073 test.py:41(push)
1 0.005 0.005 27.754 27.754 test.py:57(testLevel)
1 0.001 0.001 0.002 0.002 test.py:9(parseLevel)
220466 0.508 0.000 0.508 0.000 unionfind_bitset.py:18(getContainingSet)
406014 5.661 0.000 11.071 0.000 unionfind_bitset.py:3(unionBitsetLists)
153 0.001 0.000 0.001 0.000 {_heapq.heappop}
342 0.002 0.000 0.002 0.000 {_heapq.heappush}
196397 0.370 0.000 0.370 0.000 {any}
112745 0.250 0.000 0.250 0.000 {bin}
361464 0.640 0.000 0.640 0.000 {len}
2005/1097 0.014 0.000 0.033 0.000 {map}
1 0.000 0.000 0.000 0.000 {method '__enter__' of 'file' objects}
216 0.000 0.000 0.000 0.000 {method 'add' of 'set' objects}
574394 1.092 0.000 1.092 0.000 {method 'append' of 'list' objects}
112424 0.240 0.000 0.240 0.000 {method 'count' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
106 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
215 0.001 0.000 0.001 0.000 {method 'items' of 'dict' objects}
321 0.001 0.000 0.001 0.000 {method 'iteritems' of 'dict' objects}
431 0.001 0.000 0.001 0.000 {method 'pop' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'readlines' of 'file' objects}
8 0.000 0.000 0.000 0.000 {method 'strip' of 'str' objects}
106 0.001 0.000 0.001 0.000 {method 'union' of 'frozenset' objects}
322 0.001 0.000 0.001 0.000 {method 'values' of 'dict' objects}
15168 0.593 0.000 2.388 0.000 {min}
1 0.000 0.000 0.000 0.000 {open}
65237 0.169 0.000 0.169 0.000 {range}
656208 1.602 0.000 1.602 0.000 {reduce}
7899/1910 0.064 0.000 0.114 0.000 {sorted}

[/CODE]

[quote name='Storyyeller' timestamp='1326334144' post='4901863']
I figured out a way to rewrite my code so it doesn't require getting individual bits at all. Of course, it does mean more work has to be done in other places, but it is still faster overall.

Here's the profiler results after all the optimizations I've done.
...
[/quote]

What did the profile report look like before? Profiling only _after_ an optimization is kind of a useless exercise.

So do any parts of that profile change significantly when running your program repeatedly?

[quote name='Storyyeller' timestamp='1326334144' post='4901863']
I figured out a way to rewrite my code so it doesn't require getting individual bits at all. Of course, it does mean more work has to be done in other places, but it is still faster overall.

Here's the profiler results after all the optimizations I've done.
...
[/quote]

Maybe I wasn't clear enough. What I mean is, showing us the POST-optimization profiling report tells us nothing unless you actually have a PRE-optimization report to compare it to.

If you DO have both reports (but just haven't posted the other one), great! You should now know where your bottleneck is and/or if your 'optimizations' had the desired effect.
