Haha, like I said: 'How does one even start to code a software renderer?'
Way over my head.
A good way to start is to try to understand how the GPU processes things. That means learning how index/vertex buffers work, how to write shaders, how world/view/projection space works and the math behind it, how the z-buffer works, how alpha blending works, how mip-mapping works, clipping, etc. Once you understand those concepts, you can start to write a rasterizer (alternatively, picking them up as you write one is a great way to learn!). Tackling the entire thing at once is completely overwhelming, but breaking it down into small parts takes care of that. For instance, you could start by writing a simple wireframe rasterizer. That is what I started with. If you're not looking to write a super fast parallelized one, this is actually quite easy.
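To make the wireframe suggestion concrete, here is a minimal C++ sketch of the idea: an integer Bresenham line routine plus a loop that walks an index buffer and draws each triangle as three edges. The `Framebuffer`, `Point2`, `drawLine` and `drawWireframe` names are made up for illustration, and the vertices are assumed to already be in screen space (the world/view/projection step is left out).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical minimal framebuffer: one byte per pixel, 0 = empty, 255 = lit.
struct Framebuffer {
    int width, height;
    std::vector<std::uint8_t> pixels;

    Framebuffer(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0) {}

    void plot(int x, int y) {
        // Crude per-pixel clipping: silently drop anything off screen.
        if (x < 0 || y < 0 || x >= width || y >= height) return;
        pixels[static_cast<std::size_t>(y) * width + x] = 255;
    }

    bool lit(int x, int y) const {
        assert(x >= 0 && y >= 0 && x < width && y < height);
        return pixels[static_cast<std::size_t>(y) * width + x] != 0;
    }
};

struct Point2 { int x, y; };  // screen-space vertex (already projected)

// Integer Bresenham line; handles all octants.
void drawLine(Framebuffer& fb, Point2 a, Point2 b) {
    const int dx = std::abs(b.x - a.x), sx = a.x < b.x ? 1 : -1;
    const int dy = -std::abs(b.y - a.y), sy = a.y < b.y ? 1 : -1;
    int err = dx + dy;
    for (;;) {
        fb.plot(a.x, a.y);
        if (a.x == b.x && a.y == b.y) break;
        const int e2 = 2 * err;
        if (e2 >= dy) { err += dy; a.x += sx; }
        if (e2 <= dx) { err += dx; a.y += sy; }
    }
}

// Treat every three indices as one triangle and draw its three edges.
void drawWireframe(Framebuffer& fb,
                   const std::vector<Point2>& verts,
                   const std::vector<int>& indices) {
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        const Point2 p0 = verts[indices[i]];
        const Point2 p1 = verts[indices[i + 1]];
        const Point2 p2 = verts[indices[i + 2]];
        drawLine(fb, p0, p1);
        drawLine(fb, p1, p2);
        drawLine(fb, p2, p0);
    }
}
```

Calling `drawWireframe` with three vertices and the indices `{0, 1, 2}` lights the outline of a single triangle. From there the usual next steps are adding the projection math in front of it and replacing the edge drawing with filled spans.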
I have never written a path tracer, so take my response with a grain of salt. I don't see why you couldn't have a child node in your BVH be another BVH with a transform associated with it. In fact, that makes the most sense to me and I would venture to guess that production systems do exactly that. What complicates things, like you said, is if you are morphing geometry or otherwise changing/rebuilding the BVH for a dense object with millions of triangles. You should give it a shot and let us know how it goes!
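For what it's worth, the "BVH with a transform as a child node" idea can be sketched in a few lines of C++. This is only an illustration under assumptions, not how any particular production system does it: `LocalBvh` is a stand-in that holds nothing but a root bounding box, and `Affine`, `Instance` and `worldToLocal` are hypothetical names. The point it shows is that the ray gets transformed into the child's local space, and if the direction is not renormalized, the hit distance `t` means the same thing in both spaces.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <utility>

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, dir; };

// Affine transform: a 3x3 linear part plus a translation.
struct Affine {
    float m[3][3];
    Vec3 t;

    Vec3 vector(Vec3 v) const {  // directions: linear part only
        return { m[0][0] * v.x + m[0][1] * v.y + m[0][2] * v.z,
                 m[1][0] * v.x + m[1][1] * v.y + m[1][2] * v.z,
                 m[2][0] * v.x + m[2][1] * v.y + m[2][2] * v.z };
    }
    Vec3 point(Vec3 p) const {   // positions: linear part + translation
        const Vec3 r = vector(p);
        return { r.x + t.x, r.y + t.y, r.z + t.z };
    }
};

// Stand-in for a full bottom-level BVH: just its root box, in local space.
struct LocalBvh {
    Vec3 lo, hi;

    // Slab test; on a hit, tHit receives the entry distance along the ray.
    bool intersect(const Ray& r, float& tHit) const {
        const float o[3]  = { r.origin.x, r.origin.y, r.origin.z };
        const float d[3]  = { r.dir.x, r.dir.y, r.dir.z };
        const float mn[3] = { lo.x, lo.y, lo.z };
        const float mx[3] = { hi.x, hi.y, hi.z };
        float t0 = 0.0f;
        float t1 = std::numeric_limits<float>::infinity();
        for (int a = 0; a < 3; ++a) {
            if (d[a] == 0.0f) {  // ray parallel to this slab
                if (o[a] < mn[a] || o[a] > mx[a]) return false;
                continue;
            }
            float tn = (mn[a] - o[a]) / d[a];
            float tf = (mx[a] - o[a]) / d[a];
            if (tn > tf) std::swap(tn, tf);
            t0 = std::max(t0, tn);
            t1 = std::min(t1, tf);
            if (t0 > t1) return false;
        }
        tHit = t0;
        return true;
    }
};

// A top-level leaf: a pointer to a shared local BVH plus the inverse of
// the transform that places it in the world.
struct Instance {
    const LocalBvh* bvh;
    Affine worldToLocal;

    bool intersect(const Ray& worldRay, float& tHit) const {
        assert(bvh != nullptr);
        // The direction is deliberately NOT renormalized, so a distance t
        // along the local ray equals the distance t along the world ray.
        const Ray local{ worldToLocal.point(worldRay.origin),
                         worldToLocal.vector(worldRay.dir) };
        return bvh->intersect(local, tHit);
    }
};
```

Moving or rotating the object then only means updating `worldToLocal` and refitting the top-level tree, while the local BVH stays untouched. Deforming geometry is the case this does not help with, since the local BVH itself has to be refit or rebuilt.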
Another thing I'm wondering about: I've been thinking of experimenting with rendering techniques like rasterization on GPGPU, using something like CUDA. However, I have no CUDA experience, so I'm wondering whether that would be feasible. Could there be certain advantages over just using DirectX / OpenGL?
They have a paper somewhere that outlines some of the details, but I believe they do very simplistic rasterization to a 320x280 depth buffer (or something like that). It's heavily vectorized. If you own a Core i7 with AVX, I believe those can do 8-wide single-precision float operations, which would speed up something like that considerably. I've considered writing an occlusion library that uses AVX and SSE instructions. I don't think anyone has really used AVX much in production yet (from my very limited viewpoint).
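As a rough picture of what such an occlusion buffer does, here is a scalar C++ sketch. It is not their implementation: the `OcclusionBuffer` and `Rect` names are made up, occluders are simplified to screen-space rectangles with a single depth value instead of rasterized triangles, and depth is assumed to run from 0 (near) to 1 (far). The inner loops are the part you would hand to SSE or AVX, processing 4 or 8 floats per instruction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open screen-space rectangle: [x0, x1) x [y0, y1).
struct Rect { int x0, y0, x1, y1; };

// Hypothetical low-resolution CPU depth buffer for occlusion queries.
class OcclusionBuffer {
public:
    OcclusionBuffer(int w, int h)
        : w_(w), h_(h), depth_(static_cast<std::size_t>(w) * h, 1.0f) {
        assert(w > 0 && h > 0);  // buffer starts cleared to the far plane
    }

    // Write an occluder, keeping the nearest depth at each pixel.
    void addOccluder(Rect r, float z) {
        clip(r);
        for (int y = r.y0; y < r.y1; ++y) {
            float* row = &depth_[static_cast<std::size_t>(y) * w_];
            // Candidate for vectorization: a min over contiguous floats.
            for (int x = r.x0; x < r.x1; ++x) row[x] = std::min(row[x], z);
        }
    }

    // Conservative test: the object counts as visible if its nearest depth
    // is at or in front of the stored depth at any covered pixel.
    bool isVisible(Rect r, float zNearest) const {
        clip(r);
        for (int y = r.y0; y < r.y1; ++y) {
            const float* row = &depth_[static_cast<std::size_t>(y) * w_];
            for (int x = r.x0; x < r.x1; ++x)
                if (zNearest <= row[x]) return true;
        }
        return false;  // fully hidden, or entirely off screen
    }

private:
    void clip(Rect& r) const {
        r.x0 = std::max(r.x0, 0);
        r.y0 = std::max(r.y0, 0);
        r.x1 = std::min(r.x1, w_);
        r.y1 = std::min(r.y1, h_);
    }

    int w_, h_;
    std::vector<float> depth_;
};
```

In use, you would write the big occluders first with `addOccluder`, then call `isVisible` with each object's screen-space bounds and nearest depth, and skip drawing anything that comes back hidden.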