In my experience with clustered, you store all the lights in the buffer in a very similar manner, except each cluster has a linear section of the buffer.
Yes, I think that this would be the more common way of doing it. I currently use the uniform way to do it in my engine.
So the main additional cost is doing a bounding sphere test per light that you visit, and iterating through the light array as a linked-list rather than a simpler linear array traversal.
It is more like a tree structure, isn't it ? Traversing a tree will, even if you will narrow down to the candidates quite quickly, still need to jump around quite often. The virtual nodes in this tree (which will not represent a light) would add atleast extra costs and if you need to traverse partly through a tree with 3000 leafs, then you could have a few cache misses and this for every pixel.
I would be interested in the expected overhead, would it be roughly x1.5 , x2.0, x3.0. The system could perform better on more realistic game setup (few hundred visible lights) with a much flatter tree structure.
Nevertheless, I found this approach very sexy. I could think about using it to add some extra eye-candy to the forward rendered stuff (particles) as option for people with hi-performance hardware.
PS: if you do the light calculation in the vertex shader you could push out lit particles with ease and with much better performance.
PPS: @fries: how about posting your approach as article on gamedev ?