Thanks for the tips, Ohforf sake and jms bc. After some research I've learned more about textures, and it turns out CUDA has a feature called surfaces that allows write access to texture memory (more precisely, to the CUDA array behind the texture). It seems like a good solution to my problem.
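For anyone who finds this thread later, this is roughly what the surface approach looks like as far as I understand it. It's only a minimal sketch, not my actual code: the kernel `scale_in_place`, the helper `run_surface_demo`, and the 2D float layout are made up for illustration, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: reads each value through the surface, scales it,
// and writes it back into the same CUDA array.
__global__ void scale_in_place(cudaSurfaceObject_t surf, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float v;
    // The x coordinate of a surface access is given in bytes, not elements.
    surf2Dread(&v, surf, x * (int)sizeof(float), y);
    surf2Dwrite(v * factor, surf, x * (int)sizeof(float), y);
}

// Hypothetical host helper: uploads `data`, runs the kernel, downloads the result.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_surface_demo(std::vector<float> data, int width, int height, float factor)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr = nullptr;
    // The cudaArraySurfaceLoadStore flag is needed so the array can back a surface.
    cudaMallocArray(&arr, &desc, width, height, cudaArraySurfaceLoadStore);

    size_t rowBytes = width * sizeof(float);
    cudaMemcpy2DToArray(arr, 0, 0, data.data(), rowBytes, rowBytes, height,
                        cudaMemcpyHostToDevice);

    // Describe the array as a resource and create a surface object for it.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaSurfaceObject_t surf = 0;
    cudaCreateSurfaceObject(&surf, &res);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale_in_place<<<grid, block>>>(surf, width, height, factor);
    cudaDeviceSynchronize();

    cudaMemcpy2DFromArray(data.data(), rowBytes, arr, 0, 0, rowBytes, height,
                          cudaMemcpyDeviceToHost);

    cudaDestroySurfaceObject(surf);
    cudaFreeArray(arr);
    return data;
}
```

Calling `run_surface_demo(values, width, height, 3.0f)` should return the input with every element tripled. Surface objects need a device of compute capability 3.0 or newer, if I read the docs right.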
My only issue right now is that I've already started implementing the shared memory cache version of my code, so switching to yet another approach would cost me a lot of time (I'd have to implement a workaround for the 130 MB limit on a single 1D texture). So, for now, I'll test shared memory and check the speed-up. If it's not good enough, I'll switch to texture memory.
Also, I didn't know CUDA supported half-precision floats (I thought that was only an OpenCL feature), so I'll switch to half, since it could effectively double the number of mesh values a single CUDA block can reach.
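To make the "double the reach" point concrete: a `__half` takes 2 bytes versus 4 for a `float`, so the same shared memory budget holds twice as many mesh values. Below is a rough sketch of keeping a tile in half while still doing the arithmetic in float. The kernel `smooth_half_tile`, the helper `run_half_demo`, the tile size, and the 3-point average are placeholders I made up, not my real stencil, and error checking is left out.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 256;  // assumed: threads per block, also values per shared tile

// Hypothetical kernel: caches a tile of the input as half, then averages
// each value with its neighbours inside the tile.
__global__ void smooth_half_tile(const float* in, float* out, int n)
{
    // 256 halves take 512 bytes; the same tile in float would take 1024 bytes.
    __shared__ __half tile[TILE];

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    // Store as half: half the space, but also reduced precision and range.
    tile[t] = __float2half(i < n ? in[i] : 0.0f);
    __syncthreads();
    if (i >= n) return;

    // Clamp neighbours to the tile and array edges (no halo, to keep this short).
    int left  = t > 0 ? t - 1 : t;
    int right = (t + 1 < (int)blockDim.x && i + 1 < n) ? t + 1 : t;

    // Convert back to float for the arithmetic.
    out[i] = (__half2float(tile[left]) + __half2float(tile[t]) +
              __half2float(tile[right])) / 3.0f;
}

// Hypothetical host helper; assumes host_in is not empty.
std::vector<float> run_half_demo(const std::vector<float>& host_in)
{
    int n = (int)host_in.size();
    size_t bytes = n * sizeof(float);
    std::vector<float> host_out(n);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    smooth_half_tile<<<blocks, TILE>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(host_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

The trade-off is precision: half has only about three significant decimal digits, so whether this works for my mesh values is something I still have to check.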
Thanks again, learned a lot.