So for example if line 0 was compatible with lines 1 and 2, line 1 was compatible with 0 and 3, line 2 was compatible with 0, and line 3 was compatible with line 1, then the number of ways I can build a wall of height "3" would be:

0-1-0

0-2-0

0-1-3

1-0-1

1-3-1

1-0-2

2-0-2

2-0-1

3-1-3

3-1-0

For a total of 10 ways. Make sense?

I already have written a function that calculates the correct result when given a wall height (between 1 and 10) and a set of BlockLine objects that already have their compatible_line data entered. The function works great for small to mid-size wall configurations (a wall of height 10 with 65 possible lines is computed in a fraction of a second). But when I change the number of lines to a larger number (3329 in the test that I'm using) the function is exceptionally slow for larger heights. (3329, 5) is still computed in under a second, but (3329, 10) takes much much too long (several minutes). So I'm trying to figure out how to optimize this code so that it performs well for the larger input sets. Here's the source:

/** \brief Determines the number of possible configurations that can build a block wall of a specific height *** \param wall_height The height of the wall, as a number of wall lines *** \param lines A reference to the container where all possible block line permutations are stored *** \return The number of wall configurations *** \note The BlockLine objects stored in the lines argument must have all their line compatibilities *** set before this function is called. *** *** This function works by starting with constructing the base of the wall with all possible lines. For each *** line at the base, the compatible lines are placed on top, and those lines have their compatible lines placed *** on top of them and so on until the wall has reached its maximum height. **/ int64_t ComputeNumberWallConfigurations(int wall_height, vector<BlockLine>& lines) { // In the case that the height of the wall is one, the number of configurations is simply the number of line permutations if (wall_height == 1) { return lines.size(); } // A counter that keeps track of the number of wall configurations int64_t num_configurations = 0; // A stack of all the lines that build the current configuration being examined vector<int> line_stack; // A companion stack that tracks the index of the next compatible line to examine for each line in the stack vector<int> next_stack; for (int i = 0; i < lines.size(); i++) { // Lines that are not compatible with any other line can not be used to form a wall that is two or more lines tall if (lines[i].compatible_lines.empty() == true) { continue; } line_stack.push_back(i); next_stack.push_back(0); int64_t temp = 0; printf("Constructing wall with base line: %d ... ", i); // With the base line established, now determine all possible configurations for the wall for the provided base line while (line_stack.empty() == false) { int this_line = line_stack.back(); // (Case 1): If the next line pushed on the stack would be the top of the wall, add the number of compatible lines for this line and pop if (line_stack.size() == (wall_height - 1)) { temp += lines[this_line].compatible_lines.size(); num_configurations += lines[this_line].compatible_lines.size(); line_stack.pop_back(); next_stack.pop_back(); } // (Case 2): There are no more compatible lines to examine for this configuration on this level. Pop back to a lower level. else if (next_stack.back() >= lines[this_line].compatible_lines.size()) { line_stack.pop_back(); next_stack.pop_back(); } // (Case 3): Add another line to the top of the stack and increment the current line's "next" index else { int next_line = next_stack.back(); next_stack.back() += 1; line_stack.push_back(lines[this_line].compatible_lines[next_line]); next_stack.push_back(0); } } printf("number configurations == %lld\n", static_cast<long long int>(temp)); } return num_configurations; }

Its a pretty simple algorithm as you can see. It used to be a recursive function, but I rewrote it to be an iterative version when I noticed how slow it ran. Basically I build the wall up from the bottom one line at a time until I'm at the second to last line from the top. Then I simply add to my accumulator variable (num_configurations) the amount of lines I can put at the top, then pop back down the stack.

I realize that this algorithm is a prime candidate for multithreading and that could improve performance (launch each while(line_stack.empty()) loop logic in its own thread) but I don't really want to make this code multithreaded if I don't have to. Plus, I'm kind of doubtful that it would give me the degree of performance that I'm seeking unless I'm launching these threads across 64 different processor cores or something. I've been staring at this function for an hour and I haven't though of any way to drastically increase the performance. Is there something that I'm missing?

One behavior that I did notice during my testing with these larger input sets is that the further along I get, the slower it seems to take for each base line. What I mean is, when this function starts running the first few hundred lines are printed out very quickly

(each line computed quickly, hundredths of a second)(each line computed quickly, hundredths of a second)

Constructing wall with base line: 0 ... number configurations == 1

Constructing wall with base line: 1 ... number configurations == 252

Constructing wall with base line: 2 ... number configurations == 731808

Constructing wall with base line: 3 ... number configurations == 267227532

Constructing wall with base line: 4 ... number configurations == 13534077843

*(each line computed slowly, several seconds each)*Constructing wall with base line: 468 ... number configurations == 1733285040

Constructing wall with base line: 469 ... number configurations == 2477179776

Constructing wall with base line: 470 ... number configurations == 2673027300

Constructing wall with base line: 471 ... number configurations == 2108819760

Constructing wall with base line: 472 ... number configurations == 808799196

Constructing wall with base line: 473 ... number configurations == 6573600

I examined my performance monitor and I didn't see anything wrong with the memory usage. The process remains static at 800KB of memory and 100% CPU, so I'm pretty certain that this is a computationally bound performance issue and has nothing to do with inefficient memory access.