See the previous post for a fuller overview of what I'm doing, but basically we're developing a distributed test system to cut down on test turnaround. Since we're farming this work out to developers' machines when they are not around, we are using virtual machines to ensure both a) consistency of environment and b) the ability to test arbitrary platforms.
The issue here is that a VM isn't a small thing. The final ones that will be in use will take up tens of gigabytes each, so the cost is high both to transfer them around and to store them (we currently test on something like 9-10 platforms). The criteria for where to put them are, in my mind, the following: a) redundancy: if one of the locations where a VM is available goes offline, the VM should still be available elsewhere. b) transfer capacity: if we need to distribute multiple VMs at the same time, we shouldn't incur too much of a performance hit to do so.
My initial solution to this was to have the client machines themselves store the VMs for distribution. After all, they need to have VMs anyway to actually run the tests, so they ought to be able to redistribute those same VMs. This meets both criteria: several systems will be running tests on each platform, and if we need 3 copies of the same VM, we can grab each copy from a different machine. The problem becomes the "churn" of the platform configuration. Since these are developers' machines, it's possible something might happen and various machines go offline at different times - how does the system reconfigure itself to handle the loss of these machines? As an example, let's suppose that by chance all the machines that go offline were running platform X. Our capacity for running tests on X is now severely reduced. Do we copy VMs for X to other machines that were previously running some other platform? The cost of doing this is quite high, especially if there is any kind of ripple effect. And then when the previously disabled machines come back online, how does the system revert to the old configuration?
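To make the "grab each copy from a different machine" part concrete, here is a rough Python sketch. The function name `pick_sources` and the machine names are made up for illustration, and it assumes we already know which online machines hold a given VM image; it says nothing about how that list is discovered.

```python
def pick_sources(holders, copies_needed):
    """Choose a source machine for each requested copy of one VM image.

    holders: names of online machines that already store the image
             (hypothetical input; discovery is out of scope here).
    copies_needed: how many copies have to be handed out at once.

    Returns one source name per copy, cycling through the holders so that
    no machine serves a second copy until every holder has served one.
    """
    if not holders:
        raise ValueError("no online machine holds this VM image")
    ordered = sorted(holders)  # stable order so the choice is repeatable
    return [ordered[i % len(ordered)] for i in range(copies_needed)]


if __name__ == "__main__":
    print(pick_sources(["dev-c", "dev-a", "dev-b"], 3))
```

If fewer machines hold the image than copies are needed, the sketch simply wraps around and reuses sources, which is exactly the situation where transfer capacity starts to suffer.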
At the moment I'm letting this brew in the back of my head - I don't know exactly what the best solution is. Basically I think a good solution will end up revolving around an idea of minimum and optimal capacity for each platform - if some machines go offline we can accept running below optimal capacity temporarily, but if so many are lost that a platform drops below its minimum, then it is worth the cost to transfer some VMs around.
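One way the minimum/optimal idea might look in code, purely as a sketch: the function `plan_transfers`, the per-platform dictionaries, and the rule for choosing a donor are all my own assumptions, not a worked-out design. It only tops a platform back up to its minimum, since every transfer is expensive, and it does not try to answer the question of reverting once machines come back.

```python
def plan_transfers(online, minimum, optimal):
    """Decide which machines should switch platform after some go offline.

    online:  platform -> number of online machines currently holding its VM
    minimum: platform -> fewest machines we are willing to run with
    optimal: platform -> number of machines we would like to have
    (All three dicts are hypothetical inputs with the same keys.)

    Returns a list of (donor_platform, needy_platform) pairs, one per
    machine that should receive a new VM image.
    """
    counts = dict(online)
    transfers = []
    needy = [p for p in sorted(counts) if counts[p] < minimum[p]]
    for platform in needy:
        while counts[platform] < minimum[platform]:
            # A donor must stay at or above its own minimum after giving one up.
            donors = [d for d in counts
                      if d != platform and counts[d] > minimum[d]]
            if not donors:
                break  # nothing to spare; accept reduced capacity for now
            # Prefer the platform with the most headroom relative to optimal.
            donor = max(donors, key=lambda d: (counts[d] - optimal[d], d))
            counts[donor] -= 1
            counts[platform] += 1
            transfers.append((donor, platform))
    return transfers


if __name__ == "__main__":
    online = {"X": 0, "Y": 4, "Z": 3}
    minimum = {"X": 2, "Y": 2, "Z": 2}
    optimal = {"X": 3, "Y": 3, "Z": 3}
    print(plan_transfers(online, minimum, optimal))
```

A platform that is below optimal but still at or above its minimum triggers no transfers at all, which is the "accept lower capacity temporarily" half of the idea.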
Anyhow, typing that out gave me a better handle on the issue, I think.