[C++] Need help optimizing performace of a small function

Started by
12 comments, last by Roots 12 years, 5 months ago
Hello. I have a relatively simple problem that I wrote a function to solve but I'm running into performance issues when the input size to this function gets exceptionally high. Basically, I'm trying to find the number of ways that I can build a wall of height "n" using a number of different arrangements of lines of blocks. Not all lines are allowed to be placed on top of one another. Each "BlockLine" object has a vector container called "compatible_lines" that contain the indeces of all lines that this line is compatible with.

So for example if line 0 was compatible with lines 1 and 2, line 1 was compatible with 0 and 3, line 2 was compatible with 0, and line 3 was compatible with line 1, then the number of ways I can build a wall of height "3" would be:

0-1-0
0-2-0
0-1-3
1-0-1
1-3-1
1-0-2
2-0-2
2-0-1
3-1-3
3-1-0

For a total of 10 ways. Make sense?


I already have written a function that calculates the correct result when given a wall height (between 1 and 10) and a set of BlockLine objects that already have their compatible_line data entered. The function works great for small to mid-size wall configurations (a wall of height 10 with 65 possible lines is computed in a fraction of a second). But when I change the number of lines to a larger number (3329 in the test that I'm using) the function is exceptionally slow for larger heights. (3329, 5) is still computed in under a second, but (3329, 10) takes much much too long (several minutes). So I'm trying to figure out how to optimize this code so that it performs well for the larger input sets. Here's the source:

[source]
/** \brief Determines the number of possible configurations that can build a block wall of a specific height
*** \param wall_height The height of the wall, as a number of wall lines
*** \param lines A reference to the container where all possible block line permutations are stored
*** \return The number of wall configurations
*** \note The BlockLine objects stored in the lines argument must have all their line compatibilities
*** set before this function is called.
***
*** This function works by starting with constructing the base of the wall with all possible lines. For each
*** line at the base, the compatible lines are placed on top, and those lines have their compatible lines placed
*** on top of them and so on until the wall has reached its maximum height.
**/
int64_t ComputeNumberWallConfigurations(int wall_height, vector<BlockLine>& lines) {
// In the case that the height of the wall is one, the number of configurations is simply the number of line permutations
if (wall_height == 1) {
return lines.size();
}

// A counter that keeps track of the number of wall configurations
int64_t num_configurations = 0;
// A stack of all the lines that build the current configuration being examined
vector<int> line_stack;
// A companion stack that tracks the index of the next compatible line to examine for each line in the stack
vector<int> next_stack;

for (int i = 0; i < lines.size(); i++) {
// Lines that are not compatible with any other line can not be used to form a wall that is two or more lines tall
if (lines.compatible_lines.empty() == true) {
continue;
}

line_stack.push_back(i);
next_stack.push_back(0);

int64_t temp = 0;
printf("Constructing wall with base line: %d ... ", i);

// With the base line established, now determine all possible configurations for the wall for the provided base line
while (line_stack.empty() == false) {
int this_line = line_stack.back();

// (Case 1): If the next line pushed on the stack would be the top of the wall, add the number of compatible lines for this line and pop
if (line_stack.size() == (wall_height - 1)) {
temp += lines[this_line].compatible_lines.size();
num_configurations += lines[this_line].compatible_lines.size();
line_stack.pop_back();
next_stack.pop_back();
}
// (Case 2): There are no more compatible lines to examine for this configuration on this level. Pop back to a lower level.
else if (next_stack.back() >= lines[this_line].compatible_lines.size()) {
line_stack.pop_back();
next_stack.pop_back();
}
// (Case 3): Add another line to the top of the stack and increment the current line's "next" index
else {
int next_line = next_stack.back();
next_stack.back() += 1;

line_stack.push_back(lines[this_line].compatible_lines[next_line]);
next_stack.push_back(0);
}
}

printf("number configurations == %lld\n", static_cast<long long int>(temp));
}

return num_configurations;
}
[/source]

Its a pretty simple algorithm as you can see. It used to be a recursive function, but I rewrote it to be an iterative version when I noticed how slow it ran. Basically I build the wall up from the bottom one line at a time until I'm at the second to last line from the top. Then I simply add to my accumulator variable (num_configurations) the amount of lines I can put at the top, then pop back down the stack.



I realize that this algorithm is a prime candidate for multithreading and that could improve performance (launch each while(line_stack.empty()) loop logic in its own thread) but I don't really want to make this code multithreaded if I don't have to. Plus, I'm kind of doubtful that it would give me the degree of performance that I'm seeking unless I'm launching these threads across 64 different processor cores or something. I've been staring at this function for an hour and I haven't though of any way to drastically increase the performance. Is there something that I'm missing?



One behavior that I did notice during my testing with these larger input sets is that the further along I get, the slower it seems to take for each base line. What I mean is, when this function starts running the first few hundred lines are printed out very quickly

(each line computed quickly, hundredths of a second)

Constructing wall with base line: 0 ... number configurations == 1
Constructing wall with base line: 1 ... number configurations == 252
Constructing wall with base line: 2 ... number configurations == 731808
Constructing wall with base line: 3 ... number configurations == 267227532
Constructing wall with base line: 4 ... number configurations == 13534077843
(each line computed slowly, several seconds each)
Constructing wall with base line: 468 ... number configurations == 1733285040
Constructing wall with base line: 469 ... number configurations == 2477179776
Constructing wall with base line: 470 ... number configurations == 2673027300
Constructing wall with base line: 471 ... number configurations == 2108819760
Constructing wall with base line: 472 ... number configurations == 808799196
Constructing wall with base line: 473 ... number configurations == 6573600

I examined my performance monitor and I didn't see anything wrong with the memory usage. The process remains static at 800KB of memory and 100% CPU, so I'm pretty certain that this is a computationally bound performance issue and has nothing to do with inefficient memory access.

Hero of Allacrost - A free, open-source 2D RPG in development.
Latest release June, 2015 - GameDev annoucement

Advertisement

Hello. I have a relatively simple problem that I wrote a function to solve but I'm running into performance issues when the input size to this function gets exceptionally high. Basically, I'm trying to find the number of ways that I can build a wall of height "n" using a number of different arrangements of lines of blocks. Not all lines are allowed to be placed on top of one another. Each "BlockLine" object has a vector container called "compatible_lines" that contain the indeces of all lines that this line is compatible with.

So for example if line 0 was compatible with lines 1 and 2, line 1 was compatible with 0 and 3, line 2 was compatible with 0, and line 3 was compatible with line 1, then the number of ways I can build a wall of height "3" would be:

0-1-0
0-2-0
0-1-3
1-0-1
1-3-1
1-0-2
2-0-2
3-1-3

For a total of 8 ways. Make sense?


What's wrong with 2-0-1 or 3-1-0?

What's wrong with 2-0-1 or 3-1-0?


Oops. Yes, both of those configurations are valid as well. I edited the original post to include them. Thanks for catching that!

Hero of Allacrost - A free, open-source 2D RPG in development.
Latest release June, 2015 - GameDev annoucement

I realize that this algorithm is a prime candidate for multithreading and that could improve performance[/quote]

It's prime candidate for dynamic programming. The article probably won't be of immediate help though, so if anyone feels like implementing it...
Thanks Antheus, that looks like a promising solution. I've also been thinking about ways to divide the problem into smaller sub problems but wasn't sure about how to go about it such that the complexity was reduced. This technique says it can bring exponential time down to linear time so if I can achieve that, I think that will be a suitable answer for me.


Also I did some more testing and its not true that the program slows down with larger base line values. The larger base line values simply take longer to calculate on their own. (If I skip the first 1,000 base lines in the algorithm, calculating base line 1001 takes just as long as if I were to calculate it should I have started the algorithm from base line 1). I'll keep working on this and post a solution if I come up with a vastly improved algorithm. Right now I'm thinking of just calculating the wall for height n = 2, then adding up all those pairs of lines to make the full size wall.

Hero of Allacrost - A free, open-source 2D RPG in development.
Latest release June, 2015 - GameDev annoucement

Unless, I'm misunderstanding the problem ... if you think of the BlockLine types as nodes of a graph G and the "is-compatible-with" operation as defining the edges of graph G, then asking how many ways are there to construct a wall of height n is the same as asking how many paths of length n are there in G. I think, you could just perform n depth-first searches starting at each node of G and output all paths you find that are length n.

EDIT: actually standard DFS wouldn't work because you need to enumerate all paths even if they go through nodes you've already visited.
Instead of counting walls of height x, count walls of height x whose last line is l. So the result for a particular height x is a vector, not a single number. Now consider the matrix whose columns represent the compatibilities. In your example:
(0 1 1 0)
M := (1 0 0 1)
(1 0 0 0)
(0 1 0 0)


Now make a column vector v that contains the number of possible walls of height h. The number of walls of height h+1 (as a column vector) is then M*v.

As a last step, you can multiply by the row (1 1 1 1) on the left to compute the sum. The answer to your original problem is then simply

(1)
(1 1 1 1) * M^(wall_height-1) * (1)
(1)
(1)


Using some clever exponentiation tricks, this can be computed really, really fast.

So for example if line 0 was compatible with lines 1 and 2, line 1 was compatible with 0 and 3, line 2 was compatible with 0, and line 3 was compatible with line 1, then the number of ways I can build a

Instead of counting walls of height x, count walls of height x whose last line is l. So the result for a particular height x is a vector, not a single number. Now consider the matrix whose columns represent the compatibilities. In your example:
(0 1 1 0)
M := (1 0 0 1)
(1 0 0 0)
(0 1 0 0)


Now make a column vector v that contains the number of possible walls of height h. The number of walls of height h+1 is then M*v.

As a last step, you can multiply by the row (1 1 1 1) on the left to compute the sum. The answer to your original problem is then simply

(1)
(1 1 1 1) * M^(wall_height-1) * (1)
(1)
(1)


Using some clever exponentiation tricks, this can be computed really, really fast.

So for example if line 0 was compatible with lines 1 and 2, line 1 was compatible with 0 and 3, line 2 was compatible with 0, and line 3 was compatible with line 1, then the number of ways I can build a
If I understood the problem correctly, then this is what the correct solution should be (12 configurations for height = 3 with relations given in OP):
#include <iostream>

typedef long long int_t;

const int NBLOCKS = 4;
const int HEIGHT = 3;

int_t connections[NBLOCKS][NBLOCKS] =
{
{ 1, 2, -1, -1 },
{ 0, 3, -1, -1 },
{ 0, -1, -1, -1 },
{ 1, -1, -1, -1 }
};

// int_t cache[HEIGHT][NBLOCKS];

int_t count(int came_from, int height)
{
// int_t & cached = cache[height][came_from];
// if (cached != -1) return cached;

int_t sum = 0;

if (height == 0) {
sum = 1;
} else {
for (int i = 0; i < NBLOCKS; i++) {
if (connections[came_from] != -1) sum += count(i, height-1);
}
}

// cached = sum;
return sum;
}


int main(int argc, char* argv[])
{
// for (int h = 0; h < HEIGHT; h++) {
// for (int b = 0; b < NBLOCKS; b++)
// cache[h] = -1;
//}

int_t sum = 0;
for (int i = 0; i < NBLOCKS; i++) {
sum += count(i, HEIGHT-1);
}
std::cout << sum << '\n';
return 0;
}

It's lazy evaluated DP solution, or even just basic memoization. Commented lines show how to convert recursive version to memoized one. Uncomment them for linear time solution.


What doesn't seem to match are the numbers given in original post. The numbers increase, so even for depth of ~30 an int isn't enough. A 64-bit int makes it possible to go slightly beyond that, but for depths of several thousands the numbers will become prohibitive and would require some sort of big int type.

Either way, the solution above, if it even solves the problem, requires NBLOCKS * HEIGHT operations (approximately).
So I think I worked out a much faster solution. I will write the pseudo code and use your example

[source]
int currentCombos[numOfBlocks];
int prevCombos[numOfBlocks]; //for each block that is allowed on the bottom, set it to 1 otherwise, set it to 0

for (int h=1;h<desiredHeight;h++)
{
memset(currentCombos, 0, numOfBlocks*sizeof(int));
for (int x=0;x<numOfBlocks;x++)
{
for (int i=0;i<lines[x].compatibleLines.count;i++)
{
currentCombos[lines[x].compatibleLines] += prevCombos[x];
}
}
SWAP(prevCombos, currentCombos);
}

int total = 0;
for (int x=0;x<numOfBlocks;x++)
{
total += prevCombos[x];
}

return total;

[/source]


So basically this works because your next block only depends on the one right before it and nothing else. So let's take your example. So on the first height, there can only be 1 combination of each block so that means that if you chose a height of 1, there can only be 4 combinations, i.e. 0, 1, 2, and 3. On the second height, block 0 can feed into both 1 and 2, 1 can feed into 0 and 3, 2 can feed into 0, and 3 can feed into 1.
So currentCombos becomes 2 2 1 1. On level three current becomes 3 3 2 2 which gives you 10 altogether which is your answer. This should be a lot faster than your current algorithm. Worst case is (x*x*h) where x is the num of block lines and h is the desired height for your loop. I know I didn't explain it well. I hope the code provides enough explanation.


This topic is closed to new replies.

Advertisement