Looking at your technique, you would need to sample memory quite often compared to a clustered/cell deferred rendering pipeline, where the lights are stored as uniforms/shader parameters.
In my experience with clustered, you store all the lights in a buffer in a very similar manner, except each cluster owns a contiguous section of that buffer.
e.g. pseudo-shader:
struct ClusterLightRange { int start, size; };
struct LightInfo { float3 position; /* color, radius, etc. */ };
Buffer<ClusterLightRange> clusters;
Buffer<LightInfo> lights;

ClusterLightRange c = clusters[ClusterIdxFromPixelPosition(pixelPosition)];
for( int i = c.start, end = c.start + c.size; i != end; ++i )
{
    LightInfo l = lights[i];
    DoLight(l);
}
So the main additional cost of your technique is doing a bounding-sphere test per light that you visit, and iterating through the lights as a linked structure rather than a simpler linear array traversal. On modern hardware it should do pretty well: worse than clustered, but probably not by much, especially if the "DoLight" function above is expensive.
It would be interesting to do some comparisons between the two using (A) a simple Lambert lighting model, and (B) a very complex Cook-Torrance/GGX/Smith/etc. fancy new lighting model.
Also, you could merge this technique with clustered shading:
A) CPU creates the BVH as described, and uploads into a GPU buffer.
B) A GPU compute shader traverses the BVH for each cluster, and generates the per-cluster light lists (the 'clusters' and 'lights' buffers from my pseudo example above).
C) Lighting is performed as in clustered shading (as in my pseudo example above).