If the data can be fully loaded, just create a proxy sprite (positioned from the 3D box's world position) as the RT for volume rendering (the UVs also depend on the 3D box position), and place this sprite on your 3D box (or map the RT texture onto the box).
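As a minimal sketch of the UV side of this (hypothetical helper, assuming an axis-aligned box for brevity): a world-space sample position inside the box maps to normalized UVW coordinates for the Texture3D fetch.

```python
def world_to_uvw(pos, box_min, box_max):
    """Map a world-space position inside an axis-aligned 3D box to [0,1]^3 UVW."""
    return tuple((p - lo) / (hi - lo) for p, lo, hi in zip(pos, box_min, box_max))

# A box spanning (0,0,0)..(10,10,10): its center maps to the volume's center.
uvw = world_to_uvw((5.0, 5.0, 5.0), (0.0, 0.0, 0.0), (10.0, 10.0, 10.0))
```

In a real shader this is just a transform into the box's local space followed by a normalize by its extents; an oriented box needs the inverse box transform first.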
If the data is too big, then it is trickier:
If your 3D box movement is predictable, and you have a streaming system running in the background, then there are two choices I can think of:
1. 3D tiled resources (maybe only the latest GPUs support them). If you are satisfied with supporting only the latest GPUs, I think they do exactly what you want: only the needed tiles are in VRAM.
2. Brick your huge Texture3D into tons of smaller ones (a handmade tiled resource).
Then, based on your visibility prediction, stream tiles/small Texture3D objects in and out accordingly, and do the aforementioned rendering.
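A CPU-side sketch of option 2 (all names hypothetical): split the volume into fixed-size bricks, compute which brick indices the predicted region of interest touches, then stream in the missing ones and stream out the rest.

```python
def bricks_for_region(region_min, region_max, brick_size):
    """Return the set of brick coordinates overlapping an axis-aligned voxel region."""
    lo = [c // brick_size for c in region_min]
    hi = [c // brick_size for c in region_max]
    return {(x, y, z)
            for x in range(lo[0], hi[0] + 1)
            for y in range(lo[1], hi[1] + 1)
            for z in range(lo[2], hi[2] + 1)}

def update_residency(resident, needed, load, unload):
    """Diff the currently resident brick set against the needed set."""
    for b in needed - resident:
        load(b)    # e.g. upload this brick's small Texture3D to VRAM
    for b in resident - needed:
        unload(b)  # free the brick (or return it to a pool for reuse)
    return set(needed)

# A 64^3 region with 32^3 bricks touches a 2x2x2 block of bricks.
needed = bricks_for_region((0, 0, 0), (63, 63, 63), 32)
```

The `load`/`unload` callbacks stand in for whatever your background streaming system does; the point is that the residency diff is cheap and can run every frame off the visibility prediction.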
If your 3D box movement is not predictable, then I have no idea...
However, if your volume is sparse (like an SDF), take advantage of the sparsity: build a spatial structure (like an octree), so you end up with a much smaller proxy Texture3D (storing offsets into a typed buffer) along with a small typed buffer (storing the actual voxel data). In most cases that will fit into VRAM. Then modify your volume renderer to go through one extra indirection: sample the proxy Texture3D to get the offset, then fetch the actual voxel data from the typed buffer (or just return early if the voxel is empty).
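The extra indirection can be sketched like this (hypothetical layout: a coarse proxy grid stores either an "empty" sentinel or an offset into a flat buffer holding each occupied cell's voxels; in the real shader both lookups are texture/buffer fetches):

```python
EMPTY = -1  # sentinel stored in the proxy grid for cells with no voxel data

def sample_sparse(proxy, voxel_buffer, cell, local_index, voxels_per_cell):
    """Two-level fetch: proxy grid -> offset -> actual voxel value."""
    offset = proxy.get(cell, EMPTY)   # first fetch: sample the proxy Texture3D
    if offset == EMPTY:
        return 0.0                    # early out: sparse region, nothing stored
    return voxel_buffer[offset * voxels_per_cell + local_index]  # second fetch

# Two proxy cells: cell (0,0,0) points at slot 0 of the buffer, (1,0,0) is empty.
proxy = {(0, 0, 0): 0, (1, 0, 0): EMPTY}
voxel_buffer = [1.0, 2.0, 3.0, 4.0]   # one occupied cell's 4 voxels, flattened
```

The early-out on `EMPTY` is where the win comes from: the ray marcher skips whole empty cells with a single cheap proxy fetch instead of stepping through dense texture data.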