You should be able to pack as many bone indices + weights as you want into your vertex. If you're on relatively modern hardware (DX10-class or higher), you may want to branch on the weight being > 0 and only fetch the bone matrix when it is, as an optimization.
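To make the idea concrete, here's a rough CPU-side sketch in Python of what the skinning loop does with and without the weight check. The function names, the 3x4 row-major matrix layout, and the fetch counter are all made up for illustration; a real vertex shader would look different.

```python
# CPU-side sketch of linear blend skinning for a single vertex.
# Everything here is hypothetical; it only mirrors the logic a shader would run.

def transform_point(matrix, point):
    """Apply a 3x4 row-major affine matrix (3 rows of 4 numbers) to a 3D point."""
    x, y, z = point
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3] for row in matrix)


def skin_vertex(position, bone_indices, bone_weights, bone_matrices, skip_zero=True):
    """Blend the position across its bones; optionally skip zero-weight bones.

    Returns the skinned position and the number of bone matrices looked up.
    """
    result = [0.0, 0.0, 0.0]
    fetches = 0  # stands in for bone matrix fetches on the GPU
    for index, weight in zip(bone_indices, bone_weights):
        if skip_zero and weight <= 0.0:
            continue  # the branch under discussion: no fetch, no matrix math
        matrix = bone_matrices[index]
        fetches += 1
        moved = transform_point(matrix, position)
        for axis in range(3):
            result[axis] += weight * moved[axis]
    return tuple(result), fetches


# Two example bones: identity, and a translation of +2 along x.
IDENTITY = ((1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0))
SHIFT_X = ((1, 0, 0, 2), (0, 1, 0, 0), (0, 0, 1, 0))
BONES = (IDENTITY, SHIFT_X)

# A vertex with four influence slots, only two of which carry weight.
with_branch = skin_vertex((1.0, 2.0, 3.0), (0, 1, 0, 0), (0.5, 0.5, 0.0, 0.0), BONES)
without_branch = skin_vertex((1.0, 2.0, 3.0), (0, 1, 0, 0), (0.5, 0.5, 0.0, 0.0),
                             BONES, skip_zero=False)
```

Both calls give the same position; the only difference is how many matrices get looked up. Whether skipping them is actually faster on a GPU depends on how uniformly neighbouring vertices take the branch, so it's worth profiling rather than assuming.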
Eh? I was under the impression that it's generally cheaper to just do the math than to branch like that. Granted, modern hardware is more capable of dealing with such branching, but I thought it was still cheaper overall for the GPU to do the matrix math than to branch on dynamic data.