Naturally, hemispherical sampling makes the convergence worse, because it no longer does importance sampling. However, I think this at least shows that my methodology is sound, and it's an important precursor to eventually making bidirectional path tracing. The below picture shows importance sampling on left versus hemispherical sampling with 2.0*cos(theta) weighting on right. The image was rendered with 2004 samples. The key thing to note is that the illumination is the same, even if the hemispherical sampling has higher variance.
Assuming that the ray scatters, it must go on the hemisphere. I speculate that, on average, the weight of a scatter in any direction on the hemisphere must be 1.0 (that is, the integral of the weights over the hemisphere must be 2*pi). For importance sampling, all the weights are one, but the probability of scattering away from the normal is lower. The integral is 2*pi. For cosine weighted, all the weights are cos(theta), but they need to be scaled by 2 to that the integral is 2*pi. This provides a nice way to make any BRDF--without importance sampling, simply weight each sample by the BRDF(theta), and then scale to ensure that the integral over the hemisphere is 2*pi.

Thanks,
Ian