strange performance patterns

Started by
21 comments, last by larsbutler 12 years, 3 months ago
Well, this problem intrigues me, so I took the time to produce what should be the optimal solution (for compiled languages, at least). It's based on de Bruijn sequences to eliminate as many loop iterations as possible, and it's implemented as an iterator/generator.


_DeBruijnBitPosition = [
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
]

def _trunc32(w):
    # Reinterpret the low 32 bits of w as a signed C int
    w = int((w & 0x7fffFFFF) | -(w & 0x80000000))
    return w

def findSetBits(v):
    while v:
        # (v & -v) isolates the lowest set bit; multiplying by the
        # de Bruijn constant and taking the top 5 bits gives a unique
        # index into the lookup table. When _trunc32 returns a negative
        # value, the negative index wraps around to the correct entry
        # at the end of the list.
        index = _DeBruijnBitPosition[_trunc32((v & -v) * 0x077CB531) >> 27]
        v &= v - 1  # clear the lowest set bit
        yield index

# usage: list(findSetBits(4937)) => [0, 3, 6, 8, 9, 12]


I don't happen to have a native python interpreter handy - fancy benchmarking this against your earlier list comprehension solution?
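For anyone who wants to run that comparison, a minimal benchmarking sketch with timeit; the list comprehension baseline here (setBitsNaive) is a hypothetical stand-in for the earlier solution, which I don't have in front of me:

```python
import timeit

# Hypothetical baseline: a straightforward list comprehension that
# tests every bit position up to nbits.
def setBitsNaive(v, nbits=32):
    return [i for i in range(nbits) if v & (1 << i)]

# Time the naive version; the de Bruijn generator above can be timed
# the same way, e.g. timeit.timeit(lambda: list(findSetBits(4937)), number=10000)
elapsed = timeit.timeit(lambda: setBitsNaive(4937), number=10000)
print(elapsed)
```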

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

That won't work in Python, because ints are arbitrary precision. The paper assumes fixed 32bit ints with overflow.
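A quick illustration of the difference; mul32 is just a throwaway helper name for the explicit masking a C compiler does for free:

```python
# In C, a 32-bit multiply silently wraps; Python ints never overflow,
# so the de Bruijn hash step must be masked down to 32 bits by hand.
wide = 0xFFFFFFFF * 0x077CB531
print(wide)  # a result far wider than 32 bits, no wraparound

def mul32(a, b):
    # Emulate C's unsigned 32-bit wraparound with an explicit mask
    return (a * b) & 0xFFFFFFFF

print(mul32(0xFFFFFFFF, 0x077CB531))
```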
I trust exceptions about as far as I can throw them.

Is that only for 32bit ints?

Yes - I had assumed you were operating on a maximum of 32-bits. It's not easy to generalise any of the high-performance solutions to a large number of bits.

How big are your bit sets typically?

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Well in this particular test case, the bit sets have length 109, though I hope it generalizes to larger values.
I trust exceptions about as far as I can throw them.

Well in this particular test case, the bit sets have length 109, though I hope it generalizes to larger values.

One could derive further sequence values, but if you don't have a fixed upper bound, probably not worth it.
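In Python specifically, one way to sidestep the fixed-width limit entirely is int.bit_length (available since 2.7), which makes the lookup table unnecessary at the cost of losing the C-style trick; a sketch:

```python
def findSetBitsAnySize(v):
    # Works for ints of any size: (v & -v) isolates the lowest set
    # bit and bit_length() gives its position, so no lookup table or
    # 32-bit truncation is needed.
    while v:
        yield (v & -v).bit_length() - 1
        v &= v - 1  # clear the lowest set bit

print(list(findSetBitsAnySize(4937)))
```

Whether this beats the de Bruijn version under CPython would need measuring; both loop once per set bit.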

My research group attempted a similar thing for a large scale bittorrent simulation, with upwards of 1,000 elements per bitset. In the end, we ported the whole damn thing to C++, because we just couldn't squeeze enough performance out of python in this area.

That said, PyPy's performance has come a long way in the last few years - have you considered trying your current solution out on their interpreter?



/aside:

That won't work in Python, because ints are arbitrary precision. The paper assumes fixed 32bit ints with overflow.

That's why I had to explicitly truncate to 32-bits with the _trunc32 function. I think I have handled the overflow correctly... Either way, it seems to pass all my unit tests.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]



If the timing of the overall script with a simple timer is inconsistent, how would in-depth profiling help?

Timing tells you it's slow. Profiling tells you why. You want to know the latter, presumably?
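A minimal cProfile sketch, with a hypothetical work() standing in for the real script:

```python
import cProfile
import io
import pstats

def work():
    # Hypothetical stand-in for the real script's hot loop
    return sum(i * i for i in range(100000))

# Profile the call and sort by cumulative time: this shows which
# functions the time is actually spent in, not just the total.
pr = cProfile.Profile()
pr.enable()
work()
pr.disable()
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats('cumulative').print_stats(5)
print(buf.getvalue())
```

Even if individual runs vary, the relative ranking of functions tends to be stable across runs, which is what you act on.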

For counting bits, perhaps something like:


def count_bits(i):
    return len(bin(i).replace('0', '')) - 1  # -1 for 'b'

?

I would guess that moving the loop into C land might help.

That's why I had to explicitly truncate to 32-bits with the _trunc32 function. I think I have handled the overflow correctly... Either way, it seems to pass all my unit tests.


Sorry, I said that while reading the paper and didn't look carefully at the code. It looks like your code will work as long as the bitset size is at most 32.




For counting bits, perhaps something like:


def count_bits(i):
    return len(bin(i).replace('0', '')) - 1  # -1 for 'b'

?

I would guess that moving the loop in to C land might help.


Personally, I use bin(x).count("1") for counting bits. But the main problem here is getting a list of set bits, not counting them.
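The same bin() trick does extend to listing the set bits, not just counting them; a sketch (setBitsFromString is a made-up name for illustration):

```python
def setBitsFromString(v):
    # Indices of '1' characters in the reversed binary string;
    # works for ints of any size.
    return [i for i, c in enumerate(reversed(bin(v)[2:])) if c == '1']

print(setBitsFromString(4937))
```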
I trust exceptions about as far as I can throw them.
Ah, missed that. [s]Are you sure that's the problem? Do you have benchmark results?[/s]
EDIT: doh, and misread question.
I figured out a way to rewrite my code so it doesn't require getting individual bits at all. Of course, it does mean more work has to be done in other places, but it is still faster over all.

Here's the profiler results after all the optimizations I've done.

5209449 function calls (5047367 primitive calls) in 27.757 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 27.757 27.757 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 bisect.py:1(<module>)
1904 0.017 0.000 0.036 0.000 bitset.py:10(bitset)
112424 0.640 0.000 1.129 0.000 bitset.py:3(bitcount)
321 0.008 0.000 0.009 0.000 bitset.py:33(toList)
1 0.000 0.000 0.000 0.000 bitset.py:41(makeKeyMap)
656208 2.846 0.000 4.449 0.000 bitset.py:7(union)
73374/13957 0.587 0.000 21.646 0.002 calculation.py:126(getCostSub2)
582249 1.164 0.000 1.164 0.000 calculation.py:14(isSubset)
31678 2.773 0.000 21.387 0.001 calculation.py:140(getSkeletonConnectLists)
112583 1.627 0.000 6.066 0.000 calculation.py:168(getTouchedChildren)
268969 1.865 0.000 3.028 0.000 calculation.py:17(connectKeyTupIsLessEq)
221132 0.478 0.000 0.478 0.000 calculation.py:173(getTouchSet)
751/109 0.018 0.000 0.118 0.001 calculation.py:186(makeCalcNode)
109 0.002 0.000 0.122 0.001 calculation.py:202(makeCalcRoot)
1 0.001 0.001 0.123 0.123 calculation.py:207(buildCalculationTrees)
321 0.008 0.000 0.077 0.000 calculation.py:22(makeConnectDict_Bitset)
109 0.001 0.000 0.003 0.000 calculation.py:31(__init__)
1983 0.005 0.000 0.005 0.000 calculation.py:36(getCost)
109 0.002 0.000 0.003 0.000 calculation.py:41(__init__)
342 0.032 0.000 25.089 0.073 calculation.py:46(getCost)
86698 1.167 0.000 4.451 0.000 calculation.py:6(insertIncomparableList)
321 0.006 0.000 0.097 0.000 calculation.py:68(__init__)
85023/2223 0.653 0.000 25.051 0.011 calculation.py:78(getCost)
12660/1466 1.070 0.000 25.019 0.017 calculation.py:95(getCostSub)
1453/321 0.022 0.000 0.026 0.000 connect.py:1(connectivityDictSub)
321 0.021 0.000 0.052 0.000 connect.py:18(connectivityDict)
215922 1.427 0.000 1.797 0.000 decomposition.py:13(<lambda>)
1 0.004 0.004 2.441 2.441 decomposition.py:33(carvingDecomposition)
236 0.001 0.000 0.002 0.000 decomposition.py:41(<lambda>)
1 0.006 0.006 0.036 0.036 decomposition.py:52(bitsetizeKeys)
1124 0.007 0.000 0.025 0.000 decomposition.py:53(convertEdge)
106 0.008 0.000 2.433 0.023 decomposition.py:6(greedySplitNode)
5989 0.039 0.000 0.073 0.000 decomposition.py:9(<lambda>)
212 0.001 0.000 0.001 0.000 graph.py:12(removeEdge)
107 0.000 0.000 0.001 0.000 graph.py:15(numNeighbors)
325 0.001 0.000 0.001 0.000 graph.py:4(__init__)
545 0.002 0.000 0.002 0.000 graph.py:8(addEdge)
1 0.003 0.003 0.003 0.003 heapq.py:31(<module>)
3 0.000 0.000 0.000 0.000 ntpath.py:122(splitdrive)
3 0.000 0.000 0.000 0.000 ntpath.py:55(isabs)
1 0.000 0.000 0.000 0.000 ntpath.py:63(join)
1 0.000 0.000 2.601 2.601 rsglib.py:27(__init__)
342 0.005 0.000 25.115 0.073 rsglib.py:40(getCost)
1 0.001 0.001 0.002 0.002 test.py:20(makeGraph)
493 0.007 0.000 0.007 0.000 test.py:31(getSuccessorStates)
1 0.007 0.007 25.145 25.145 test.py:38(aStarSearch)
342 0.010 0.000 25.127 0.073 test.py:41(push)
1 0.005 0.005 27.754 27.754 test.py:57(testLevel)
1 0.001 0.001 0.002 0.002 test.py:9(parseLevel)
220466 0.508 0.000 0.508 0.000 unionfind_bitset.py:18(getContainingSet)
406014 5.661 0.000 11.071 0.000 unionfind_bitset.py:3(unionBitsetLists)
153 0.001 0.000 0.001 0.000 {_heapq.heappop}
342 0.002 0.000 0.002 0.000 {_heapq.heappush}
196397 0.370 0.000 0.370 0.000 {any}
112745 0.250 0.000 0.250 0.000 {bin}
361464 0.640 0.000 0.640 0.000 {len}
2005/1097 0.014 0.000 0.033 0.000 {map}
1 0.000 0.000 0.000 0.000 {method '__enter__' of 'file' objects}
216 0.000 0.000 0.000 0.000 {method 'add' of 'set' objects}
574394 1.092 0.000 1.092 0.000 {method 'append' of 'list' objects}
112424 0.240 0.000 0.240 0.000 {method 'count' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
106 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
215 0.001 0.000 0.001 0.000 {method 'items' of 'dict' objects}
321 0.001 0.000 0.001 0.000 {method 'iteritems' of 'dict' objects}
431 0.001 0.000 0.001 0.000 {method 'pop' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'readlines' of 'file' objects}
8 0.000 0.000 0.000 0.000 {method 'strip' of 'str' objects}
106 0.001 0.000 0.001 0.000 {method 'union' of 'frozenset' objects}
322 0.001 0.000 0.001 0.000 {method 'values' of 'dict' objects}
15168 0.593 0.000 2.388 0.000 {min}
1 0.000 0.000 0.000 0.000 {open}
65237 0.169 0.000 0.169 0.000 {range}
656208 1.602 0.000 1.602 0.000 {reduce}
7899/1910 0.064 0.000 0.114 0.000 {sorted}

I trust exceptions about as far as I can throw them.

This topic is closed to new replies.
