Everything else you said is correct, perhaps with the exception of:
- a multiprocessor can handle x thread groups, to fully use available computing power, create x*2 groups, so a stalled multiprocessor can fall back to the other thread group. With 16 multiprocessors, this would be (max) 32 thread groups
- shared memory can be max 32kb, so 16kb per thread group, because if you have 2 per multiprocessor, there wouldn't be enough with >2*16kb
The calculation isn't as simple as creating x * 2 thread groups to fully utilise the GPU. Ideally you'd create far more threads than twice the number of processors the GPU has, to give the GPU's scheduler the best possible opportunity to switch to another wave (or thread group) and keep issuing work. An AMD GPU can handle 10 'waves' of work per "SIMD", in their terminology. More threads is generally better; trust the GPU to schedule them properly. It's not like writing CPU code, where creating too many threads can overwhelm the OS's scheduler.
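To put rough numbers on that, here is a small back-of-the-envelope sketch. The `GpuLimits` struct and both function names are made up for illustration. The 16 compute units come from your example and the 10 waves per SIMD from the figure above; the 4 SIMDs per CU and 64 threads per wave are GCN-style values I'm assuming, so check them against your own hardware. Registers and shared memory can lower the real number of resident waves, so treat the result as an upper bound only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a GPU's scheduling limits. The defaults are
// assumptions for illustration, not values queried from a real device.
struct GpuLimits {
    std::uint32_t computeUnits   = 16; // the 16 "multiprocessors" from the question
    std::uint32_t simdsPerCU     = 4;  // assumed: GCN-style CU with 4 SIMDs
    std::uint32_t wavesPerSimd   = 10; // the 10 waves per SIMD mentioned above
    std::uint32_t threadsPerWave = 64; // assumed: GCN-style wave size
};

// Upper bound on the number of waves the whole GPU can keep resident at once.
inline std::uint32_t maxResidentWaves(const GpuLimits& gpu) {
    return gpu.computeUnits * gpu.simdsPerCU * gpu.wavesPerSimd;
}

// Number of thread groups needed to supply that many waves, given the
// number of threads in each group (rounded up to whole waves and groups).
inline std::uint32_t groupsToFillGpu(const GpuLimits& gpu,
                                     std::uint32_t threadsPerGroup) {
    assert(threadsPerGroup > 0);
    const std::uint32_t wavesPerGroup =
        (threadsPerGroup + gpu.threadsPerWave - 1) / gpu.threadsPerWave;
    return (maxResidentWaves(gpu) + wavesPerGroup - 1) / wavesPerGroup;
}
```

Under those assumptions the GPU can hold 640 waves, so with 256 threads per group (4 waves each) you'd want at least 160 groups in flight, not 32.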
Regarding shared memory, it is true that each thread group can only address 32KB of it at once. However, there's nothing to say that the GPU doesn't have a lot more than 32KB of shared memory per "multiprocessor" (aka Compute Unit in GCN speak). GCN GPUs have 64KB per CU, so they can run two thread groups using 32KB each simultaneously. There's no reason future cards couldn't have even more (128KB, say), and in doing so they could run more shared-memory-hungry thread groups at once. Try to keep your use of shared memory to a minimum because it is a scarce resource, but just because each thread group can only address 32KB doesn't *necessarily* mean each "multiprocessor/CU" only has 32KB.
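The shared-memory side of this is simple integer division. A minimal sketch follows; `groupsPerCuByLds` is a name I've made up, and it deliberately ignores every other limit (registers, wave slots), so it only tells you the ceiling that shared memory imposes.

```cpp
#include <cassert>
#include <cstdint>

// Thread groups that can be resident on one CU at the same time when
// shared memory is the only limiting resource (other limits are ignored).
inline std::uint32_t groupsPerCuByLds(std::uint32_t ldsBytesPerCU,
                                      std::uint32_t ldsBytesPerGroup) {
    assert(ldsBytesPerGroup > 0);
    return ldsBytesPerCU / ldsBytesPerGroup; // partial groups don't count
}
```

With 64KB per CU, groups using 32KB each give 2 per CU, while groups using 16KB each give 4. On a hypothetical 128KB part, 32KB groups would also fit 4 at a time.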
You answered your own question in the thread title: use DXT compression.
If SlimDX's ToFile function doesn't support compressing to DXT, do it offline using DirectXTex. Texconv is a tool that can convert/compress uncompressed DDS textures to other compressed formats within the DDS container, or you can use the DirectXTex library directly and write your own tool if you wish.
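If you want a feel for how much space this saves, the size of a block-compressed mip level is easy to estimate. The sketch below is only arithmetic, not a call into SlimDX or DirectXTex, and the function names are mine. It assumes the standard layout where every 4x4 block of texels takes 8 bytes for BC1 (DXT1) or 16 bytes for BC2/BC3 (DXT3/DXT5), and it ignores the DDS header and any further mip levels.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for one mip level of a block-compressed texture.
// bytesPerBlock: 8 for BC1 (DXT1), 16 for BC2/BC3 (DXT3/DXT5).
inline std::uint64_t blockCompressedSize(std::uint32_t width,
                                         std::uint32_t height,
                                         std::uint32_t bytesPerBlock) {
    assert(bytesPerBlock == 8 || bytesPerBlock == 16);
    // Dimensions are rounded up to whole 4x4 blocks.
    const std::uint64_t blocksWide = (std::uint64_t{width} + 3) / 4;
    const std::uint64_t blocksHigh = (std::uint64_t{height} + 3) / 4;
    return blocksWide * blocksHigh * bytesPerBlock;
}

// Bytes needed for the same mip level stored as uncompressed 8-bit RGBA.
inline std::uint64_t uncompressedRgba8Size(std::uint32_t width,
                                           std::uint32_t height) {
    return std::uint64_t{width} * height * 4;
}
```

For a 1024x1024 texture that works out to 4MB uncompressed, 512KB as DXT1 and 1MB as DXT5.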
The only thing I can think of that would look remotely like what you describe is if you've created a two-buffer Swap Chain and have got confused about which one you're supposed to be rendering to / reading from on a given frame.
What you're talking about doing is writing an application that uses the Multi-Adapter functionality added to D3D12. Specifically, since you mentioned one NVIDIA and one Intel GPU, it's Heterogeneous Multi-Adapter (two or more GPUs of different designs).
There is no switch you can flip that will make this 'just work'; it needs to be thought about and designed into the application. I don't think MiniEngine has any multi-adapter code in it yet, although it wouldn't surprise me if it were added some time in the future.