1. You can certainly handle shadow casting lights, you just need all of the shadow maps to be in memory at the same time. For the last game I worked on we kept spotlight shadow maps in a texture array with 16 elements, and then had a separate 4 element array for the directional light cascades.
2. You can handle other light types, like spotlights. Lights are culled per-tile by computing the planes of a sub-frustum that surrounds the tile, and then testing the light's bounding volume for intersection with that frustum. So for a point light you do a sphere/frustum test, and for a spotlight you can do a cone/frustum test. Just be aware that both sphere/frustum and cone/frustum can have false positives when you're doing the typical "test the volume against each plane" approach.
In case you didn't get this from the paper, the reason they cull per tile is so that each thread group can cull a bunch of lights in parallel. During the culling phase you assign a different light to each of the N threads in your thread group, and then append each intersecting light to a list in thread group shared memory. Then in the second phase each thread loops over the entire list of intersecting lights and computes the light contribution for a single pixel.
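To make the culling phase a bit more concrete, here's a minimal CPU-side sketch in C++ of the "test the volume against each plane" approach for point lights. It's only an illustration, not the code from the paper or any sample: the names (Plane, PointLight, sphereIntersectsFrustum, cullLightsForTile) are made up, the planes are assumed to have unit-length normals pointing into the frustum, and the serial loop stands in for what a thread group would do with one light per thread and a list in shared memory.

```cpp
#include <array>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical plane representation: a point p is on the inside
// when dot(n, p) + d >= 0. The normal n is assumed to be unit length.
struct Plane { Vec3 n; float d; };

// Hypothetical point light: a bounding sphere in the same space as the planes.
struct PointLight { Vec3 center; float radius; };

static float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Per-plane sphere/frustum test. The sphere is rejected only if it lies
// entirely behind at least one plane. This is conservative: a sphere that
// sits just outside a corner or edge of the frustum can still pass.
bool sphereIntersectsFrustum(const PointLight& light,
                             const std::array<Plane, 6>& planes) {
    for (const Plane& p : planes) {
        if (dot(p.n, light.center) + p.d < -light.radius) {
            return false;  // fully outside this plane
        }
    }
    return true;
}

// Phase 1 for a single tile: build the list of light indices whose bounding
// sphere passes the test against the tile's sub-frustum. On the GPU each
// thread would test a different light and append to a shared list.
std::vector<int> cullLightsForTile(const std::vector<PointLight>& lights,
                                   const std::array<Plane, 6>& tilePlanes) {
    std::vector<int> visible;
    for (int i = 0; i < static_cast<int>(lights.size()); ++i) {
        if (sphereIntersectsFrustum(lights[i], tilePlanes)) {
            visible.push_back(i);
        }
    }
    return visible;
}
```

If you feed this a simple box-shaped "frustum" spanning -1 to 1 on each axis, a sphere of radius 0.6 centered at (1.5, 1.5, 0) passes every plane test even though it doesn't actually touch the box. That's the false positive case mentioned above. In phase 2, every pixel in the tile would loop over the returned list and accumulate lighting.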
3. Sure, that works. I'm pretty sure that's how the sample does it as well.
4. One of the main advantages of this approach is that you avoid blending. Basically you combine the light contributions for all lights (or at least, many lights) inside of your compute shader, and then write out the combined result to your texture. This saves a lot of bandwidth, since you don't have to do read/modify/write for every single light source. If you need to do multiple tiled passes, you can still do that with a compute shader approach. Just be aware that in D3D11 you can't read from a RWTexture2D unless it has R32_FLOAT/R32_UINT/R32_INT format*. This means that you can't do a manual read/modify/write for an R16G16B16A16_FLOAT texture. If you want to use an fp16 format, you'll need to ping pong between two textures so that you can read from one and write to the other.
*This restriction was relaxed for FEATURE_LEVEL_12_0 hardware, which now supports typed UAV loads for additional formats.
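The ping pong pattern itself is simple. Here's a rough CPU-side sketch in C++, with plain float buffers standing in for the two textures. The names (Image, passContribution, accumulatePasses) are hypothetical, and passContribution is a placeholder for whatever lighting one tiled pass computes. On the GPU the equivalent would be reading the previous result from one texture and writing the new total to the other, then swapping their roles for the next pass.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for a texture: one float per pixel.
using Image = std::vector<float>;

// Hypothetical placeholder for the lighting computed by one tiled pass.
// Here pass 0 adds 1, pass 1 adds 2, and so on, for every pixel.
float passContribution(int pass, std::size_t /*pixel*/) {
    return static_cast<float>(pass + 1);
}

// Runs passCount accumulation passes over pixelCount pixels. Each pass
// reads only from src and writes only to dst, so no buffer is ever read
// and written within the same pass. The two pointers swap after each pass.
Image accumulatePasses(std::size_t pixelCount, int passCount) {
    Image a(pixelCount, 0.0f);
    Image b(pixelCount, 0.0f);
    Image* src = &a;
    Image* dst = &b;

    for (int pass = 0; pass < passCount; ++pass) {
        for (std::size_t i = 0; i < pixelCount; ++i) {
            (*dst)[i] = (*src)[i] + passContribution(pass, i);
        }
        std::swap(src, dst);  // the buffer just written becomes the next source
    }

    // After the final swap, src points at the most recently written buffer.
    return *src;
}
```

The thing to keep track of is which of the two buffers holds the final result, since that depends on whether you ran an even or odd number of passes.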