gcc also has a function called __builtin_popcount, but on my computer the one above is three times faster. You can do it even faster using 64-bit arithmetic.
Talking about performance, here is my personal experience (maybe off topic).
There is another lookup table algorithm for bit counting in my posted url, but be careful to use it.
AFAIR, it may be slower due to the data lookup may invalid the cpu data cache.
The algorithm posted by alvaro should be better because it uses "in-place" data and doesn't access any data in the memory.