I believe it's a typo, yes.
Everything else you said is correct, perhaps with the exception of:
- a multiprocessor can handle x thread groups; to fully use available computing power, create x*2 groups, so a stalled multiprocessor can fall back to the other thread group. With 16 multiprocessors, this would be (max) 32 thread groups
- shared memory can be max 32KB, so 16KB per thread group, because if you have 2 per multiprocessor, there wouldn't be enough with >2*16KB
The calculation isn't as simple as creating x*2 thread groups in order to fully utilise the GPU. Ideally you'd create far more than 2x as many threads as the GPU has processors, to give the GPU's scheduler the best possible opportunity to switch to another wave (or thread group) so it can continually issue work. An AMD GPU can handle 10 'waves' of work per "SIMD", in their terminology. More threads is generally better; trust the GPU to schedule them properly. It's not like writing CPU code, where creating too many threads can overwhelm the OS's scheduler.
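To put rough numbers on that, here's a back-of-the-envelope sketch. The 16-multiprocessor example and the 10-waves-per-SIMD figure come from the discussion above; the 4-SIMDs-per-CU figure is my assumption based on typical GCN hardware, not something stated in the thread.

```python
# Rough occupancy arithmetic: how many waves can the GPU keep resident,
# versus the "x*2 thread groups" rule of thumb?
# Assumed GCN-like figures: 4 SIMDs per CU (my assumption), 10 waves per SIMD.
def max_resident_waves(num_cus, simds_per_cu=4, waves_per_simd=10):
    """Total waves the GPU's scheduler can keep in flight at once."""
    return num_cus * simds_per_cu * waves_per_simd

cus = 16
print(max_resident_waves(cus))  # 640 resident waves to hide latency with
print(cus * 2)                  # 32 groups from the "2x" rule of thumb
```

The point being: the scheduler can juggle hundreds of waves on a 16-CU part, so a mere 2x oversubscription leaves it little to switch to when a wave stalls.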
Regarding shared memory, it is true that each thread group can only address 32KB of it at once. However, there's nothing to say that the GPU doesn't have a lot more than 32KB of shared memory per "multiprocessor" (aka Compute Unit in GCN speak). GCN GPUs have 64KB per CU, so they can run two thread groups, each using 32KB, simultaneously. There's no reason future cards might not have even more (128KB, say), and in doing so they could run more shared-memory-hungry thread groups at once. Try to keep your use of shared memory to a minimum because it is a scarce resource, but just because each thread group can only address 32KB doesn't *necessarily* mean each "multiprocessor/CU" only has 32KB.
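The shared-memory-limited group count is just a division, but it's worth seeing the numbers side by side. The 64KB-per-CU figure is the GCN number mentioned above; the 128KB case is the hypothetical future card, not real hardware.

```python
# How many thread groups fit on one CU when shared memory (LDS) is the
# limiting resource? 64KB per CU is the GCN figure discussed above;
# 128KB is the hypothetical future card.
def groups_per_cu(lds_per_cu_kb, lds_per_group_kb):
    """Concurrent thread groups per CU, limited only by shared memory."""
    return lds_per_cu_kb // lds_per_group_kb

print(groups_per_cu(64, 32))   # 2: two 32KB groups fit on a 64KB CU
print(groups_per_cu(64, 16))   # 4: trimming use to 16KB doubles occupancy
print(groups_per_cu(128, 32))  # 4: the hypothetical 128KB CU
```

This is also why trimming shared memory use below the addressable maximum can pay off: it raises the number of groups the CU can keep resident, even though each group *could* have addressed the full 32KB.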