strange performance patterns

Started by
21 comments, last by larsbutler 12 years, 3 months ago
Well, this problem intrigues me, so I took the time to produce what should be the optimal solution (for compiled languages, at least). It's based on de Bruijn sequences to eliminate as many loop iterations as possible, and it's implemented as an iterator/generator.


_DeBruijnBitPosition = [
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
]

def _trunc32(w):
    # Reinterpret the low 32 bits of w as a signed C int
    w = int((w & 0x7fffFFFF) | -(w & 0x80000000))
    return w

def findSetBits(v):
    while v:
        # (v & -v) isolates the lowest set bit; multiplying by the
        # de Bruijn constant and taking the top 5 bits gives a unique
        # index into the lookup table. When _trunc32 returns a negative
        # value, the negative index wraps around to the correct entry
        # at the end of the list.
        index = _DeBruijnBitPosition[_trunc32((v & -v) * 0x077CB531) >> 27]
        v &= v - 1  # clear the lowest set bit
        yield index

# usage: list(findSetBits(4937)) => [0, 3, 6, 8, 9, 12]


I don't happen to have a native python interpreter handy - fancy benchmarking this against your earlier list comprehension solution?
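For anyone who wants to run that comparison, a minimal benchmarking sketch with timeit; the list comprehension baseline here (setBitsNaive) is a hypothetical stand-in for the earlier solution, which I don't have in front of me:

```python
import timeit

# Hypothetical baseline: a straightforward list comprehension that
# tests every bit position up to nbits.
def setBitsNaive(v, nbits=32):
    return [i for i in range(nbits) if v & (1 << i)]

# Time the naive version; the de Bruijn generator above can be timed
# the same way, e.g. timeit.timeit(lambda: list(findSetBits(4937)), number=10000)
elapsed = timeit.timeit(lambda: setBitsNaive(4937), number=10000)
print(elapsed)
```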

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

That won't work in Python, because ints are arbitrary precision. The paper assumes fixed 32bit ints with overflow.
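A quick illustration of the difference; mul32 is just a throwaway helper name for the explicit masking a C compiler does for free:

```python
# In C, a 32-bit multiply silently wraps; Python ints never overflow,
# so the de Bruijn hash step must be masked down to 32 bits by hand.
wide = 0xFFFFFFFF * 0x077CB531
print(wide)  # a result far wider than 32 bits, no wraparound

def mul32(a, b):
    # Emulate C's unsigned 32-bit wraparound with an explicit mask
    return (a * b) & 0xFFFFFFFF

print(mul32(0xFFFFFFFF, 0x077CB531))
```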
I trust exceptions about as far as I can throw them.

Is that only for 32bit ints?

Yes - I had assumed you were operating on a maximum of 32-bits. It's not easy to generalise any of the high-performance solutions to a large number of bits.

How big are your bit sets typically?

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Well in this particular test case, the bit sets have length 109, though I hope it generalizes to larger values.
I trust exceptions about as far as I can throw them.

Well in this particular test case, the bit sets have length 109, though I hope it generalizes to larger values.

One could derive further sequence values, but if you don't have a fixed upper bound, probably not worth it.
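In Python specifically, one way to sidestep the fixed-width limit entirely is int.bit_length (available since 2.7), which makes the lookup table unnecessary at the cost of losing the C-style trick; a sketch:

```python
def findSetBitsAnySize(v):
    # Works for ints of any size: (v & -v) isolates the lowest set
    # bit and bit_length() gives its position, so no lookup table or
    # 32-bit truncation is needed.
    while v:
        yield (v & -v).bit_length() - 1
        v &= v - 1  # clear the lowest set bit

print(list(findSetBitsAnySize(4937)))
```

Whether this beats the de Bruijn version under CPython would need measuring; both loop once per set bit.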

My research group attempted a similar thing for a large scale bittorrent simulation, with upwards of 1,000 elements per bitset. In the end, we ported the whole damn thing to C++, because we just couldn't squeeze enough performance out of python in this area.

That said, PyPy's performance has come a long way in the last few years - have you considered trying your current solution out on their interpreter?



/aside:

That won't work in Python, because ints are arbitrary precision. The paper assumes fixed 32bit ints with overflow.

That's why I had to explicitly truncate to 32-bits with the _trunc32 function. I think I have handled the overflow correctly... Either way, it seems to pass all my unit tests.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]



If the timing of the overall script with a simple timer is inconsistent, how would in-depth profiling help?

Timing tells you it's slow. Profiling tells you why. You want to know the latter, presumably?
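A minimal cProfile sketch, with a hypothetical work() standing in for the real script:

```python
import cProfile
import io
import pstats

def work():
    # Hypothetical stand-in for the real script's hot loop
    return sum(i * i for i in range(100000))

# Profile the call and sort by cumulative time: this shows which
# functions the time is actually spent in, not just the total.
pr = cProfile.Profile()
pr.enable()
work()
pr.disable()
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats('cumulative').print_stats(5)
print(buf.getvalue())
```

Even if individual runs vary, the relative ranking of functions tends to be stable across runs, which is what you act on.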

For counting bits, perhaps something like:


def count_bits(i):
    return len(bin(i).replace('0', '')) - 1  # -1 for 'b'

?

I would guess that moving the loop into C land might help.

That's why I had to explicitly truncate to 32-bits with the _trunc32 function. I think I have handled the overflow correctly... Either way, it seems to pass all my unit tests.


Sorry, I said that while reading the paper and didn't look carefully at the code. It looks like your code will work as long as the bitset size is at most 32.




For counting bits, perhaps something like:


def count_bits(i):
    return len(bin(i).replace('0', '')) - 1  # -1 for 'b'

?

I would guess that moving the loop in to C land might help.


Personally, I use bin(x).count("1") for counting bits. But the main problem here is getting a list of set bits, not counting them.
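The same bin() trick does extend to listing the set bits, not just counting them; a sketch (setBitsFromString is a made-up name for illustration):

```python
def setBitsFromString(v):
    # Indices of '1' characters in the reversed binary string;
    # works for ints of any size.
    return [i for i, c in enumerate(reversed(bin(v)[2:])) if c == '1']

print(setBitsFromString(4937))
```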
I trust exceptions about as far as I can throw them.
Ah, missed that. [s]Are you sure that's the problem? Do you have benchmark results?[/s]
EDIT: doh, and misread question.
I figured out a way to rewrite my code so it doesn't require getting individual bits at all. Of course, it does mean more work has to be done in other places, but it is still faster over all.

Here's the profiler results after all the optimizations I've done.

5209449 function calls (5047367 primitive calls) in 27.757 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 27.757 27.757 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 bisect.py:1(<module>)
1904 0.017 0.000 0.036 0.000 bitset.py:10(bitset)
112424 0.640 0.000 1.129 0.000 bitset.py:3(bitcount)
321 0.008 0.000 0.009 0.000 bitset.py:33(toList)
1 0.000 0.000 0.000 0.000 bitset.py:41(makeKeyMap)
656208 2.846 0.000 4.449 0.000 bitset.py:7(union)
73374/13957 0.587 0.000 21.646 0.002 calculation.py:126(getCostSub2)
582249 1.164 0.000 1.164 0.000 calculation.py:14(isSubset)
31678 2.773 0.000 21.387 0.001 calculation.py:140(getSkeletonConnectLists)
112583 1.627 0.000 6.066 0.000 calculation.py:168(getTouchedChildren)
268969 1.865 0.000 3.028 0.000 calculation.py:17(connectKeyTupIsLessEq)
221132 0.478 0.000 0.478 0.000 calculation.py:173(getTouchSet)
751/109 0.018 0.000 0.118 0.001 calculation.py:186(makeCalcNode)
109 0.002 0.000 0.122 0.001 calculation.py:202(makeCalcRoot)
1 0.001 0.001 0.123 0.123 calculation.py:207(buildCalculationTrees)
321 0.008 0.000 0.077 0.000 calculation.py:22(makeConnectDict_Bitset)
109 0.001 0.000 0.003 0.000 calculation.py:31(__init__)
1983 0.005 0.000 0.005 0.000 calculation.py:36(getCost)
109 0.002 0.000 0.003 0.000 calculation.py:41(__init__)
342 0.032 0.000 25.089 0.073 calculation.py:46(getCost)
86698 1.167 0.000 4.451 0.000 calculation.py:6(insertIncomparableList)
321 0.006 0.000 0.097 0.000 calculation.py:68(__init__)
85023/2223 0.653 0.000 25.051 0.011 calculation.py:78(getCost)
12660/1466 1.070 0.000 25.019 0.017 calculation.py:95(getCostSub)
1453/321 0.022 0.000 0.026 0.000 connect.py:1(connectivityDictSub)
321 0.021 0.000 0.052 0.000 connect.py:18(connectivityDict)
215922 1.427 0.000 1.797 0.000 decomposition.py:13(<lambda>)
1 0.004 0.004 2.441 2.441 decomposition.py:33(carvingDecomposition)
236 0.001 0.000 0.002 0.000 decomposition.py:41(<lambda>)
1 0.006 0.006 0.036 0.036 decomposition.py:52(bitsetizeKeys)
1124 0.007 0.000 0.025 0.000 decomposition.py:53(convertEdge)
106 0.008 0.000 2.433 0.023 decomposition.py:6(greedySplitNode)
5989 0.039 0.000 0.073 0.000 decomposition.py:9(<lambda>)
212 0.001 0.000 0.001 0.000 graph.py:12(removeEdge)
107 0.000 0.000 0.001 0.000 graph.py:15(numNeighbors)
325 0.001 0.000 0.001 0.000 graph.py:4(__init__)
545 0.002 0.000 0.002 0.000 graph.py:8(addEdge)
1 0.003 0.003 0.003 0.003 heapq.py:31(<module>)
3 0.000 0.000 0.000 0.000 ntpath.py:122(splitdrive)
3 0.000 0.000 0.000 0.000 ntpath.py:55(isabs)
1 0.000 0.000 0.000 0.000 ntpath.py:63(join)
1 0.000 0.000 2.601 2.601 rsglib.py:27(__init__)
342 0.005 0.000 25.115 0.073 rsglib.py:40(getCost)
1 0.001 0.001 0.002 0.002 test.py:20(makeGraph)
493 0.007 0.000 0.007 0.000 test.py:31(getSuccessorStates)
1 0.007 0.007 25.145 25.145 test.py:38(aStarSearch)
342 0.010 0.000 25.127 0.073 test.py:41(push)
1 0.005 0.005 27.754 27.754 test.py:57(testLevel)
1 0.001 0.001 0.002 0.002 test.py:9(parseLevel)
220466 0.508 0.000 0.508 0.000 unionfind_bitset.py:18(getContainingSet)
406014 5.661 0.000 11.071 0.000 unionfind_bitset.py:3(unionBitsetLists)
153 0.001 0.000 0.001 0.000 {_heapq.heappop}
342 0.002 0.000 0.002 0.000 {_heapq.heappush}
196397 0.370 0.000 0.370 0.000 {any}
112745 0.250 0.000 0.250 0.000 {bin}
361464 0.640 0.000 0.640 0.000 {len}
2005/1097 0.014 0.000 0.033 0.000 {map}
1 0.000 0.000 0.000 0.000 {method '__enter__' of 'file' objects}
216 0.000 0.000 0.000 0.000 {method 'add' of 'set' objects}
574394 1.092 0.000 1.092 0.000 {method 'append' of 'list' objects}
112424 0.240 0.000 0.240 0.000 {method 'count' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
106 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
215 0.001 0.000 0.001 0.000 {method 'items' of 'dict' objects}
321 0.001 0.000 0.001 0.000 {method 'iteritems' of 'dict' objects}
431 0.001 0.000 0.001 0.000 {method 'pop' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'readlines' of 'file' objects}
8 0.000 0.000 0.000 0.000 {method 'strip' of 'str' objects}
106 0.001 0.000 0.001 0.000 {method 'union' of 'frozenset' objects}
322 0.001 0.000 0.001 0.000 {method 'values' of 'dict' objects}
15168 0.593 0.000 2.388 0.000 {min}
1 0.000 0.000 0.000 0.000 {open}
65237 0.169 0.000 0.169 0.000 {range}
656208 1.602 0.000 1.602 0.000 {reduce}
7899/1910 0.064 0.000 0.114 0.000 {sorted}

I trust exceptions about as far as I can throw them.

This topic is closed to new replies.
