Once you optimize voxel rendering to no longer be memory-bandwidth bound, you'll end up texture-fetch or memory-fetch bound; there are simply not enough texture units to feed all the ALUs, and that's just going to get worse, simply because it's extremely expensive to ramp up the internal bandwidth (it's in the TB/s range already).
NVidia Kepler: 192 CUDA Cores / 16 TMU per SMX
NVidia Maxwell: 128 CUDA Cores / 8 TMU per SMM
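to make the imbalance concrete, here's the back-of-envelope math on those two counts. the "ALUs per texture unit" reading is just an illustration of the ratio, not a vendor spec:

```python
# ALU-to-TMU ratio per shader multiprocessor, using the counts above.
# More ALUs sharing one texture unit means each texture fetch has to
# "feed" more arithmetic to keep the chip busy.
def alu_per_tmu(cores, tmus):
    return cores / tmus

kepler = alu_per_tmu(192, 16)   # Kepler SMX: 12 ALUs per TMU
maxwell = alu_per_tmu(128, 8)   # Maxwell SMM: 16 ALUs per TMU
print(kepler, maxwell)
```

so the ratio went up from one generation to the next, which is the trend the text is pointing at.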
no it's not really. RAM is accessed sequentially: you set an address, you read/write it, and then the next access can happen. the address setup is quite slow and causes most of the latency. it also has mostly electrical limitations; that's why you see RAM that doubles the frequency but at the same time nearly doubles the latency, so you end up with the same real time per access, e.g. typical DDR SDRAM CAS latencies:
DDR1 200-400 : 2-3 cycles
DDR2 400 : 3-4 cycles
DDR2 800 : 4-6 cycles
DDR3 800 : 5-6 cycles
DDR3 1600 : 8-12 cycles
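you can check the "same real time" claim with a quick conversion. the cycle counts are taken from the list above; note the DDR rating is in MT/s while the command clock runs at half that:

```python
# Convert a CAS latency in cycles to wall-clock time.
# DDR "speed" (e.g. 1600) is transfers per second; the command clock
# that CAS latency is counted in runs at half the transfer rate.
def cas_ns(transfer_rate_mts, cas_cycles):
    clock_mhz = transfer_rate_mts / 2
    return cas_cycles / clock_mhz * 1000  # nanoseconds

print(cas_ns(400, 3))    # DDR1-400, CL3   -> 15.0 ns
print(cas_ns(800, 5))    # DDR3-800, CL5   -> 12.5 ns
print(cas_ns(1600, 10))  # DDR3-1600, CL10 -> 12.5 ns
```

double the frequency, roughly double the cycle count, and the access time in nanoseconds barely moves.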
you can improve the situation by splitting the memory area across several memory controllers, but that's like hiring more delivery guys: once everyone has his own delivery guy it's fully parallel, but then your bottleneck just moves to wherever you have to coordinate them. and memory controllers seem to eat up quite a lot of GPU die space, as their size is dictated more by the external interface they have to provide.
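a minimal sketch of what "splitting the memory area across controllers" usually means: consecutive chunks of the address space are striped across controllers so sequential traffic is spread out. the chunk size and controller count here are made up for illustration:

```python
# Address interleaving sketch: stripe cache-line-sized chunks across
# controllers. CHUNK and CONTROLLERS are illustrative, not real values.
CHUNK = 256        # bytes per stripe
CONTROLLERS = 4

def controller_for(addr):
    return (addr // CHUNK) % CONTROLLERS

# four consecutive 256-byte chunks land on four different controllers
print([controller_for(a) for a in range(0, 1024, 256)])  # [0, 1, 2, 3]
```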
some more on ram:
what GPUs try to do, and that's what makes them so much more efficient for graphics than CPUs, is to group memory accesses and execute requests in batches. so your texture fetch for one pixel will be delayed until there are maybe 100 of those, then a big chunk of memory will be read into the L1 TMU cache and then the sampler units start doing all the interpolation. if you organize your cache in a way where half of it can be streamed in simultaneously while the other half provides data to the samplers, you effectively fully parallelize the reads while at the same time being just bandwidth limited.
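the batching idea can be sketched in a few lines. this is a toy model, not how any real TMU is implemented; the batch size and the list-as-RAM backing store are stand-ins:

```python
# Toy request batcher: individual fetches are queued until the batch is
# full, then served from one bulk read that covers their address span.
class BatchedFetcher:
    def __init__(self, memory, batch_size=4):
        self.memory = memory      # backing "RAM": just a list here
        self.batch_size = batch_size
        self.pending = []         # queued fetch addresses
        self.bulk_reads = 0       # how many big reads actually hit memory

    def fetch(self, addr):
        self.pending.append(addr)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return None               # still waiting for more requests

    def flush(self):
        # one bulk read covering the whole span of pending addresses
        lo, hi = min(self.pending), max(self.pending)
        self.bulk_reads += 1
        chunk = self.memory[lo:hi + 1]
        results = [chunk[a - lo] for a in self.pending]
        self.pending = []
        return results

mem = list(range(100))
f = BatchedFetcher(mem)
out = None
for a in (10, 12, 11, 13):
    out = f.fetch(a)
print(out, f.bulk_reads)  # [10, 12, 11, 13] 1 -> four fetches, one read
```

four logical fetches, one physical read; that's the whole trick, and it only pays off when the addresses actually land close together.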
but those are special cases. this only works out because the rasterizer groups nearby pixels together, so they generate quite coherent UVs for texture lookups, and because textures are organized into tiles that need just one addressing operation to get a 32x32-pixel block instead of 32 operations to get 32 lines of 1024 pixels each, where 992 of them are dropped. and on top of that, you use mipmaps to end up with approximately a 1 texel : 1 pixel ratio.
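the tiling win is easy to count. this sketch uses the 1024-wide texture and 32x32 region from the text and just tallies reads and wasted pixels; it assumes the region is tile-aligned:

```python
# Why tiled texture layouts cut addressing operations: count the reads
# and wasted pixels for a 32x32 region of a 1024-wide texture.
TEX_W = 1024

def linear_row_reads(region_w, region_h):
    # row-major layout: each of the 32 rows is a separate addressing
    # operation, and each one drags in the full 1024-pixel line
    reads = region_h
    wasted = region_h * (TEX_W - region_w)
    return reads, wasted

def tiled_reads(region_w, region_h):
    # tiled layout: a 32x32 tile is contiguous in memory, so one
    # addressing operation fetches the whole region with no waste
    return 1, 0

print(linear_row_reads(32, 32))  # (32, 31744): 32 reads, 992*32 pixels wasted
print(tiled_reads(32, 32))       # (1, 0)
```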
for raytracing we do similar reorganizations of rays to make our access patterns more coherent, but it's nowhere near that simplicity and efficiency, and there is quite some overhead in compute code to do the reorganization.
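one common flavor of such a reorganization is sorting rays by a space-filling-curve key so rays that start near each other are processed together. this origin-based Morton sort is a generic illustration of the idea, not the specific scheme from the paper below:

```python
# Sort rays by the Morton (Z-order) code of their quantized 2D origin,
# so spatially close rays end up adjacent in the processing order.
def morton2d(x, y, bits=16):
    # interleave the bits of x and y; nearby points get nearby keys
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def sort_rays(rays):
    # rays: list of (x, y) origins in [0, 1); quantize to a 1024 grid
    def key(r):
        qx, qy = int(r[0] * 1024), int(r[1] * 1024)
        return morton2d(qx, qy)
    return sorted(rays, key=key)

rays = [(0.9, 0.9), (0.1, 0.1), (0.11, 0.12), (0.88, 0.91)]
print(sort_rays(rays))  # the two spatial clusters end up adjacent
```

the sorting itself is the "overhead in compute code" mentioned above: you pay for it up front and hope the coherent traversal afterwards wins it back.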
some basic techniques are wrapped up here: https://mediatech.aalto.fi/~timo/publications/aila2009hpg_paper.pdf