I feel like implementing a GPU (or software) raytracer just to solve OIT is a bad idea.
I had the same feeling..
GPU raytracers are extremely powerful, but they have drawbacks as you saw. I would have initially recommended trying to composite together a hybrid approach, but at that point it makes more sense just to jump into a full raytracing solution. You saw the performance figures in those papers; that's a big step for flexibility.
The solutions to OIT are classic and well-known, but more importantly by contrast they are simpler and faster. You're right, on-the-fly sorting is prone to problems, especially with cyclic geometry, and it doesn't scale well. I suggest you reexamine the depth peeling algorithms, in particular.
You can do this in two passes, if you're careful, by using MRT. In the first pass, render opaque fragments normally into buffer 0. For semi-transparent fragments, do single-pass depth peeling, packing premultiplied RGB and depth into RGBA floating point render targets (buffers 1 to n). For the second pass, just blend the color buffers 1 to n onto buffer 0.
Exactly how the alpha channel is stored (premultiplied or not) may be the subject of some consternation, but that general idea would definitely work. Note that for single-pass depth peeling, you'll need shader mutexes. Here's, where I made a GLSL fragment program that does single-pass depth peeling.
I decided to follow your suggestion and go on with the depth peeling. Currently I am trying to implement a simple version of it, that is just the original algorith (no dual) without occlusion query..
I just wonder if I can implement it without shaders.. is it possible? I ask because I have only a rough knowledge about and this would require a break to go deeper through shaders and then moving on