I'm not sure my input will be useful, but I have been programming GPUs for 6 months now (mainly for machine learning problems). I started out writing kernels myself, but then found ArrayFire, which turned out to be much faster than what I was writing. Now I use ArrayFire for almost everything I do and supplement it with my own kernel code only when I have to.
You might want to go low-level just for the exercise. But if you're trying to get work done and want the best speed possible, I don't think anyone does it better than the ArrayFire guys.