# [C++] Need help optimizing performace of a small function

## Recommended Posts

Roots    1625
Hello. I have a relatively simple problem that I wrote a function to solve but I'm running into performance issues when the input size to this function gets exceptionally high. Basically, I'm trying to find the number of ways that I can build a wall of height "n" using a number of different arrangements of lines of blocks. Not all lines are allowed to be placed on top of one another. Each "BlockLine" object has a vector container called "compatible_lines" that contain the indeces of all lines that this line is compatible with.

So for example if line 0 was compatible with lines 1 and 2, line 1 was compatible with 0 and 3, line 2 was compatible with 0, and line 3 was compatible with line 1, then the number of ways I can build a wall of height "3" would be:

0-1-0
0-2-0
0-1-3
1-0-1
1-3-1
1-0-2
2-0-2
2-0-1
3-1-3
3-1-0

For a total of 10 ways. Make sense?

I already have written a function that calculates the correct result when given a wall height (between 1 and 10) and a set of BlockLine objects that already have their compatible_line data entered. The function works great for small to mid-size wall configurations (a wall of height 10 with 65 possible lines is computed in a fraction of a second). But when I change the number of lines to a larger number (3329 in the test that I'm using) the function is exceptionally slow for larger heights. (3329, 5) is still computed in under a second, but (3329, 10) takes much much too long (several minutes). So I'm trying to figure out how to optimize this code so that it performs well for the larger input sets. Here's the source:

[source]
/** \brief Determines the number of possible configurations that can build a block wall of a specific height
*** \param wall_height The height of the wall, as a number of wall lines
*** \param lines A reference to the container where all possible block line permutations are stored
*** \return The number of wall configurations
*** \note The BlockLine objects stored in the lines argument must have all their line compatibilities
*** set before this function is called.
***
*** This function works by starting with constructing the base of the wall with all possible lines. For each
*** line at the base, the compatible lines are placed on top, and those lines have their compatible lines placed
*** on top of them and so on until the wall has reached its maximum height.
**/
int64_t ComputeNumberWallConfigurations(int wall_height, vector<BlockLine>& lines) {
// In the case that the height of the wall is one, the number of configurations is simply the number of line permutations
if (wall_height == 1) {
return lines.size();
}

// A counter that keeps track of the number of wall configurations
int64_t num_configurations = 0;
// A stack of all the lines that build the current configuration being examined
vector<int> line_stack;
// A companion stack that tracks the index of the next compatible line to examine for each line in the stack
vector<int> next_stack;

for (int i = 0; i < lines.size(); i++) {
// Lines that are not compatible with any other line can not be used to form a wall that is two or more lines tall
if (lines[i].compatible_lines.empty() == true) {
continue;
}

line_stack.push_back(i);
next_stack.push_back(0);

int64_t temp = 0;
printf("Constructing wall with base line: %d ... ", i);

// With the base line established, now determine all possible configurations for the wall for the provided base line
while (line_stack.empty() == false) {
int this_line = line_stack.back();

// (Case 1): If the next line pushed on the stack would be the top of the wall, add the number of compatible lines for this line and pop
if (line_stack.size() == (wall_height - 1)) {
temp += lines[this_line].compatible_lines.size();
num_configurations += lines[this_line].compatible_lines.size();
line_stack.pop_back();
next_stack.pop_back();
}
// (Case 2): There are no more compatible lines to examine for this configuration on this level. Pop back to a lower level.
else if (next_stack.back() >= lines[this_line].compatible_lines.size()) {
line_stack.pop_back();
next_stack.pop_back();
}
// (Case 3): Add another line to the top of the stack and increment the current line's "next" index
else {
int next_line = next_stack.back();
next_stack.back() += 1;

line_stack.push_back(lines[this_line].compatible_lines[next_line]);
next_stack.push_back(0);
}
}

printf("number configurations == %lld\n", static_cast<long long int>(temp));
}

return num_configurations;
}
[/source]

Its a pretty simple algorithm as you can see. It used to be a recursive function, but I rewrote it to be an iterative version when I noticed how slow it ran. Basically I build the wall up from the bottom one line at a time until I'm at the second to last line from the top. Then I simply add to my accumulator variable (num_configurations) the amount of lines I can put at the top, then pop back down the stack.

I realize that this algorithm is a prime candidate for multithreading and that could improve performance (launch each while(line_stack.empty()) loop logic in its own thread) but I don't really want to make this code multithreaded if I don't have to. Plus, I'm kind of doubtful that it would give me the degree of performance that I'm seeking unless I'm launching these threads across 64 different processor cores or something. I've been staring at this function for an hour and I haven't though of any way to drastically increase the performance. Is there something that I'm missing?

One behavior that I did notice during my testing with these larger input sets is that the further along I get, the slower it seems to take for each base line. What I mean is, when this function starts running the first few hundred lines are printed out very quickly
[b][i]
(each line computed quickly, hundredths of a second)[/i][/b]
Constructing wall with base line: 0 ... number configurations == 1
Constructing wall with base line: 1 ... number configurations == 252
Constructing wall with base line: 2 ... number configurations == 731808
Constructing wall with base line: 3 ... number configurations == 267227532
Constructing wall with base line: 4 ... number configurations == 13534077843
[b][i](each line computed slowly, several seconds each)[/i][/b]
Constructing wall with base line: 468 ... number configurations == 1733285040
Constructing wall with base line: 469 ... number configurations == 2477179776
Constructing wall with base line: 470 ... number configurations == 2673027300
Constructing wall with base line: 471 ... number configurations == 2108819760
Constructing wall with base line: 472 ... number configurations == 808799196
Constructing wall with base line: 473 ... number configurations == 6573600

I examined my performance monitor and I didn't see anything wrong with the memory usage. The process remains static at 800KB of memory and 100% CPU, so I'm pretty certain that this is a computationally bound performance issue and has nothing to do with inefficient memory access.

##### Share on other sites
alvaro    21246
[quote name='Roots' timestamp='1321893775' post='4886255']
Hello. I have a relatively simple problem that I wrote a function to solve but I'm running into performance issues when the input size to this function gets exceptionally high. Basically, I'm trying to find the number of ways that I can build a wall of height "n" using a number of different arrangements of lines of blocks. Not all lines are allowed to be placed on top of one another. Each "BlockLine" object has a vector container called "compatible_lines" that contain the indeces of all lines that this line is compatible with.

So for example if line 0 was compatible with lines 1 and 2, line 1 was compatible with 0 and 3, line 2 was compatible with 0, and line 3 was compatible with line 1, then the number of ways I can build a wall of height "3" would be:

0-1-0
0-2-0
0-1-3
1-0-1
1-3-1
1-0-2
2-0-2
3-1-3

For a total of 8 ways. Make sense?[/quote]

What's wrong with 2-0-1 or 3-1-0?

##### Share on other sites
Roots    1625
[quote name='alvaro' timestamp='1321894430' post='4886263']
What's wrong with 2-0-1 or 3-1-0?
[/quote]

Oops. Yes, both of those configurations are valid as well. I edited the original post to include them. Thanks for catching that!

##### Share on other sites
Antheus    2409
[quote]I realize that this algorithm is a prime candidate for multithreading and that could improve performance[/quote]

It's prime candidate for [url="http://en.wikipedia.org/wiki/Dynamic_programming"]dynamic programming[/url]. The article probably won't be of immediate help though, so if anyone feels like implementing it...

##### Share on other sites
Roots    1625
Thanks Antheus, that looks like a promising solution. I've also been thinking about ways to divide the problem into smaller sub problems but wasn't sure about how to go about it such that the complexity was reduced. This technique says it can bring exponential time down to linear time so if I can achieve that, I think that will be a suitable answer for me.

Also I did some more testing and its not true that the program slows down with larger base line values. The larger base line values simply take longer to calculate on their own. (If I skip the first 1,000 base lines in the algorithm, calculating base line 1001 takes just as long as if I were to calculate it should I have started the algorithm from base line 1). I'll keep working on this and post a solution if I come up with a vastly improved algorithm. Right now I'm thinking of just calculating the wall for height n = 2, then adding up all those pairs of lines to make the full size wall.

##### Share on other sites
jwezorek    2663
Unless, I'm misunderstanding the problem ... if you think of the BlockLine types as nodes of a graph G and the "is-compatible-with" operation as defining the edges of graph G, then asking how many ways are there to construct a wall of height n is the same as asking how many paths of length n are there in G. I think, you could just perform n depth-first searches starting at each node of G and output all paths you find that are length n.

EDIT: actually standard DFS wouldn't work because you need to enumerate all paths even if they go through nodes you've already visited.

##### Share on other sites
alvaro    21246
Instead of counting walls of height x, count walls of height x whose last line is l. So the result for a particular height x is a vector, not a single number. Now consider the matrix whose columns represent the compatibilities. In your example:
[code] (0 1 1 0)
M := (1 0 0 1)
(1 0 0 0)
(0 1 0 0)[/code]

Now make a column vector v that contains the number of possible walls of height h. The number of walls of height h+1 (as a column vector) is then M*v.

As a last step, you can multiply by the row (1 1 1 1) on the left to compute the sum. The answer to your original problem is then simply
[code]
(1)
(1 1 1 1) * M^(wall_height-1) * (1)
(1)
(1)[/code]

Using some clever exponentiation tricks, this can be computed really, really fast.

So for example if line 0 was compatible with lines 1 and 2, line 1 was compatible with 0 and 3, line 2 was compatible with 0, and line 3 was compatible with line 1, then the number of ways I can build a

##### Share on other sites
alvaro    21246
Hidden
[quote name='alvaro' timestamp='1321905331' post='4886317']
Instead of counting walls of height x, count walls of height x whose last line is l. So the result for a particular height x is a vector, not a single number. Now consider the matrix whose columns represent the compatibilities. In your example:
[code] (0 1 1 0)
M := (1 0 0 1)
(1 0 0 0)
(0 1 0 0)[/code]

Now make a column vector v that contains the number of possible walls of height h. The number of walls of height h+1 is then M*v.

As a last step, you can multiply by the row (1 1 1 1) on the left to compute the sum. The answer to your original problem is then simply
[code]
(1)
(1 1 1 1) * M^(wall_height-1) * (1)
(1)
(1)[/code]

Using some clever exponentiation tricks, this can be computed really, really fast.

So for example if line 0 was compatible with lines 1 and 2, line 1 was compatible with 0 and 3, line 2 was compatible with 0, and line 3 was compatible with line 1, then the number of ways I can build a
[/quote]

Antheus    2409
If I understood the problem correctly, then this is what the correct solution should be (12 configurations for height = 3 with relations given in OP):
[code]#include <iostream>

typedef long long int_t;

const int NBLOCKS = 4;
const int HEIGHT = 3;

int_t connections[NBLOCKS][NBLOCKS] =
{
{ 1, 2, -1, -1 },
{ 0, 3, -1, -1 },
{ 0, -1, -1, -1 },
{ 1, -1, -1, -1 }
};

// int_t cache[HEIGHT][NBLOCKS];

int_t count(int came_from, int height)
{
// int_t & cached = cache[height][came_from];
// if (cached != -1) return cached;

int_t sum = 0;

if (height == 0) {
sum = 1;
} else {
for (int i = 0; i < NBLOCKS; i++) {
if (connections[came_from][i] != -1) sum += count(i, height-1);
}
}

// cached = sum;
return sum;
}

int main(int argc, char* argv[])
{
// for (int h = 0; h < HEIGHT; h++) {
// for (int b = 0; b < NBLOCKS; b++)
// cache[h][b] = -1;
//}

int_t sum = 0;
for (int i = 0; i < NBLOCKS; i++) {
sum += count(i, HEIGHT-1);
}
std::cout << sum << '\n';
return 0;
}

[/code]It's lazy evaluated DP solution, or even just basic memoization. Commented lines show how to convert recursive version to memoized one. Uncomment them for linear time solution.

What doesn't seem to match are the numbers given in original post. The numbers increase, so even for depth of ~30 an int isn't enough. A 64-bit int makes it possible to go slightly beyond that, but for depths of several thousands the numbers will become prohibitive and would require some sort of big int type.

Either way, the solution above, if it even solves the problem, requires NBLOCKS * HEIGHT operations (approximately).

##### Share on other sites
landagen    376
So I think I worked out a much faster solution. I will write the pseudo code and use your example

[source]
int currentCombos[numOfBlocks];
int prevCombos[numOfBlocks]; //for each block that is allowed on the bottom, set it to 1 otherwise, set it to 0

for (int h=1;h<desiredHeight;h++)
{
memset(currentCombos, 0, numOfBlocks*sizeof(int));
for (int x=0;x<numOfBlocks;x++)
{
for (int i=0;i<lines[x].compatibleLines.count;i++)
{
currentCombos[lines[x].compatibleLines[i]] += prevCombos[x];
}
}
SWAP(prevCombos, currentCombos);
}

int total = 0;
for (int x=0;x<numOfBlocks;x++)
{
total += prevCombos[x];
}

[/source]

So basically this works because your next block only depends on the one right before it and nothing else. So let's take your example. So on the first height, there can only be 1 combination of each block so that means that if you chose a height of 1, there can only be 4 combinations, i.e. 0, 1, 2, and 3. On the second height, block 0 can feed into both 1 and 2, 1 can feed into 0 and 3, 2 can feed into 0, and 3 can feed into 1.
So currentCombos becomes 2 2 1 1. On level three current becomes 3 3 2 2 which gives you 10 altogether which is your answer. This should be a lot faster than your current algorithm. Worst case is (x*x*h) where x is the num of block lines and h is the desired height for your loop. I know I didn't explain it well. I hope the code provides enough explanation.

##### Share on other sites
alvaro    21246
[quote name='landagen' timestamp='1321906965' post='4886326']
So I think I worked out a much faster solution. I will write the pseudo code and use your example[/quote]

If you read the matrix formulation I proposed, what your code is doing is its most naive implementation. And yes, this is a very reasonable solution.

##### Share on other sites
Telastyn    3777
Hidden
[quote name='landagen' timestamp='1321906965' post='4886326']
So I think I worked out a much faster solution.
[/quote]

I came up with this as well. I wouldn't be surprised if this could be solved in essentially constant time with respect to the wall's height with some fancy math involving the ratio of incoming to outgoing connections.

landagen    376
[quote name='alvaro' timestamp='1321907485' post='4886332']
[quote name='landagen' timestamp='1321906965' post='4886326']
So I think I worked out a much faster solution. I will write the pseudo code and use your example[/quote]

If you read the matrix formulation I proposed, what your code is doing is its most naive implementation. And yes, this is a very reasonable solution.
[/quote]

I was formulating my post when you posted yours and had not read yours. I think both will work just fine. In the case of sparse line compatibilities I think mine will consume less memory and will be faster, but yours will be quicker in the event of dense line compatibilities.

##### Share on other sites
alvaro    21246
[quote name='landagen' timestamp='1321909960' post='4886346']
[quote name='alvaro' timestamp='1321907485' post='4886332']
[quote name='landagen' timestamp='1321906965' post='4886326']
So I think I worked out a much faster solution. I will write the pseudo code and use your example[/quote]

If you read the matrix formulation I proposed, what your code is doing is its most naive implementation. And yes, this is a very reasonable solution.
[/quote]

I was formulating my post when you posted yours and had not read yours.[/quote]
Sure, I wanted to point out to the OP that we were basically saying the same thing.

[quote]I think both will work just fine. In the case of sparse line compatibilities I think mine will consume less memory and will be faster, but yours will be quicker in the event of dense line compatibilities.
[/quote]

Well, I didn't propose a specific algorithm to compute the product of matrices I posted. Your pseudocode is a perfectly good way to do it. If the height is large, a dense matrix representation will work better. It might even be worth diagonalizing the matrix, so [url="http://en.wikipedia.org/wiki/Diagonalizable_matrix#An_application"]this method[/url] can be used.

##### Share on other sites
alvaro    21246
Ah, one more comment for the OP: If your function is not going to modify the container that it gets passed, make it a const reference.

##### Share on other sites
Roots    1625
Thanks for all your help thus far everyone. landagen, I implemented the algorithm you described and yes, it is much much faster. Easily fast enough for my requirements. Also I am using 64-bit integers to compute this total as that is the maximum size I am expecting for any given input. I still am not quite sure about how the solution works though. This is just something that I need to think about and study on my own though. I think if I work the algorithm through an example data set I will understand it better. Thanks so much, I really appreciate the help on this!