Anyway, I ran a quick Photoshop blur on your "original" image again, this time with a radius of 2, and I think it turned out pretty well. I'd say it looks a lot better than your second image, which seems to remove a lot of features without really fixing the jaggedness.

Again, I'm not really well read on this, but it seems to me that putting any significant decision-making into the processing would ruin the real-time quality: features would appear and disappear and behave erratically. Blurring and similar solutions would give a more consistent, fluid look (although individual frames wouldn't look as good), and you'd also get cheaply anti-aliased output that way (if you're not using FXAA).
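
To show what I mean by a blur with no decision-making in it, here is a rough sketch in plain Python. It is not what Photoshop does internally (I don't know which kernel it used); it's just a simple separable box blur, and the names `box_blur_1d` and `box_blur` are made up for this example. Every pixel gets exactly the same treatment, so nothing can pop in or out between frames, and a hard 0/1 edge ends up with in-between values, which is the cheap anti-aliasing part.

```python
def box_blur_1d(row, radius):
    """Average each sample with its neighbours within `radius` (borders clamped)."""
    n = len(row)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(-radius, radius + 1):
            j = min(max(i + k, 0), n - 1)  # clamp the index at the image border
            total += row[j]
        out.append(total / (2 * radius + 1))
    return out


def box_blur(image, radius=2):
    """Separable box blur: blur every row, then every column of the result."""
    rows = [box_blur_1d(r, radius) for r in image]
    cols = [box_blur_1d(list(c), radius) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major


if __name__ == "__main__":
    # Hypothetical test mask: a hard vertical edge, 0 = background, 1 = body.
    mask = [[0] * 4 + [1] * 4 for _ in range(6)]
    for row in box_blur(mask, radius=2):
        print(" ".join(f"{v:.1f}" for v in row))
```

In a real-time setting you would presumably do this in a shader rather than on the CPU, but the idea is the same: a fixed kernel, with no per-feature decisions.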
Just read your update. If you want that "continuous shape" look, it seems to me you'd have to give up the smaller features entirely and just fit some very rough curves over the whole thing (though I think that could easily turn out very "blobby" if you don't tune it carefully), or possibly just use more blur. I'm curious, though: looking at that image, I'd guess they don't use the buffer itself, but rather interpret the positions of the body parts and then render a human model instead.