I don't understand.
How would the low-order bits of the tile coordinates specify a tile within the chunk?
>Why wouldn't I translate the current tile map coordinates to the local space of the chunk that begins at the current tile map coordinates?
Basically yes -- but by choosing a chunk size that is a power of two, you can do that translation simply by using the low-order bits. Say you have 16x16-tile chunks: the lowest 4 bits of each whole-world tile coordinate (x and y) then select the right tile within the chunk. In essence, the low-order bits are the chunk-local coordinate system.
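To make that concrete, here is a minimal sketch of the masking. It assumes unsigned 32-bit tile coordinates and a flat, row-major tile array inside each chunk; the names (`kChunkBits`, `localCoord`, `localIndex`) are made up for illustration, not taken from any particular engine.

```cpp
#include <cstdint>

// Assumed for illustration: 16x16-tile chunks, i.e. 4 low-order bits per axis.
constexpr std::uint32_t kChunkBits = 4;
constexpr std::uint32_t kChunkSize = 1u << kChunkBits;  // 16
constexpr std::uint32_t kLocalMask = kChunkSize - 1;    // binary 1111

// Chunk-local coordinate: just the low-order bits of the world coordinate.
constexpr std::uint32_t localCoord(std::uint32_t world) {
    return world & kLocalMask;
}

// Index of the tile inside a chunk stored as a flat row-major array
// of kChunkSize * kChunkSize tiles (the storage layout is an assumption).
constexpr std::uint32_t localIndex(std::uint32_t wx, std::uint32_t wy) {
    return (localCoord(wy) << kChunkBits) | localCoord(wx);
}
```

For example, world tile x = 37 is binary 100101; the low four bits are 0101, so it is tile 5 along x within its chunk.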
Why would I organize the chunks according to the high-order bits?
>Shouldn't I organize the chunks according to their start or end positions on the tile map?
It's like your original idea of forming a unique key into the set out of the x/y coordinates of the tile, only now I propose that you have a unique_set of chunks rather than tiles. Following on from the previous question: since the low-order bits form the chunk-local coordinate system, it's the high-order bits that are left to form the whole-world coordinate system. The whole world then contains chunks rather than tiles.
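A rough sketch of that arrangement, under some assumptions of mine: `std::map` stands in for the unique_set, the key packs the two chunk coordinates into one 64-bit integer, coordinates are unsigned, and `Tile`, `Chunk` and `World` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>

constexpr std::uint32_t kChunkBits = 4;                 // assumed 16x16 chunks
constexpr std::uint32_t kChunkSize = 1u << kChunkBits;
constexpr std::uint32_t kLocalMask = kChunkSize - 1;

using Tile = std::uint8_t;  // hypothetical tile type; 0 means "empty" here

struct Chunk {
    std::array<Tile, kChunkSize * kChunkSize> tiles{};  // zero-initialized
};

// Key from the high-order bits: chunk y in the upper 32 bits, chunk x below.
constexpr std::uint64_t chunkKey(std::uint32_t wx, std::uint32_t wy) {
    return (static_cast<std::uint64_t>(wy >> kChunkBits) << 32) |
           (wx >> kChunkBits);
}

// Low-order bits select the tile inside the chunk (row-major).
constexpr std::size_t localIndex(std::uint32_t wx, std::uint32_t wy) {
    return ((wy & kLocalMask) << kChunkBits) | (wx & kLocalMask);
}

class World {
public:
    // Creates the chunk on first write.
    void setTile(std::uint32_t wx, std::uint32_t wy, Tile t) {
        chunks_[chunkKey(wx, wy)].tiles[localIndex(wx, wy)] = t;
    }
    // Tiles in chunks that were never written read back as empty (0).
    Tile getTile(std::uint32_t wx, std::uint32_t wy) const {
        auto it = chunks_.find(chunkKey(wx, wy));
        return it == chunks_.end() ? Tile{0}
                                   : it->second.tiles[localIndex(wx, wy)];
    }
    std::size_t chunkCount() const { return chunks_.size(); }

private:
    std::map<std::uint64_t, Chunk> chunks_;  // stand-in for the unique_set
};
```

Only chunks that have actually been written to exist in the map, so a sparse world costs memory only where there is content.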
Now, you could try to form the unique_set key (from the high-order bits) so that the set's sort function puts nearby chunks close together, but I don't think that's worthwhile: it's unlikely to be more efficient, since a chunk ought to be many times larger than a cache line. Locality of reference is important because it means that cache lines, once loaded, are well-utilized. Loading a cache line has a fixed cost regardless of its location -- sometimes the prefetcher can predict linear access patterns and lower the apparent cost, but that's not really well-suited to the kinds of access patterns you're talking about here. In short, the simple key lookup might take some time, but you're only doing a handful of those per frame -- good enough is good enough.
If you were going to organize the set for spatial locality of reference between neighboring chunks, you'd form the key so that a contiguous, densely populated area of chunks is allocated along a space-filling curve, such as the Z-order curve or the Hilbert curve. Those would allow the prefetcher to hit some of the time, and for the Z-order curve all you have to do is interleave the bits of the chunk's x and y coordinates (that is, the high-order bits of the tile coordinates) to form the key. Even as straightforward as that construction is, I think you'd be disappointed in the gains -- I'd wager they'd be barely noticeable unless your chunks are very small. Better to just have chunks of a reasonable size to begin with.
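If you want to experiment with the Z-order key anyway, the interleave can be sketched as below. This uses the usual shift-and-mask bit-spreading trick and assumes at most 16 bits of chunk coordinate per axis, unsigned coordinates, and C++14 or later for the `constexpr` functions; the function names are made up.

```cpp
#include <cstdint>

constexpr std::uint32_t kChunkBits = 4;  // assumed 16x16 chunks

// Spread the low 16 bits of v so that a zero bit sits between each
// original bit: abcd becomes 0a0b0c0d.
constexpr std::uint32_t spreadBits(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Z-order (Morton) key: x bits land in the even positions, y bits in the odd.
constexpr std::uint32_t mortonKey(std::uint32_t cx, std::uint32_t cy) {
    return spreadBits(cx) | (spreadBits(cy) << 1);
}

// Key for the chunk containing a whole-world tile coordinate: drop the
// low-order (chunk-local) bits first, then interleave what is left.
constexpr std::uint32_t chunkMortonKey(std::uint32_t wx, std::uint32_t wy) {
    return mortonKey(wx >> kChunkBits, wy >> kChunkBits);
}
```

With this key, the four chunks (0,0), (1,0), (0,1), (1,1) get keys 0, 1, 2, 3, so each 2x2 block of chunks sorts together, then each 4x4 block, and so on.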