So there's a difference between the hardware and the software abstraction.
In the shader you specify a thread-group size, e.g. numthreads(8, 8, 1) declares that for each thread-group that is dispatched, you want 64 threads (8*8*1) to be launched.
On the CPU side, the Dispatch call says how many thread-groups to launch, e.g. Dispatch(240, 135, 1) will launch 32400 thread-groups. With the above shader, that ends up launching 240*8 * 135*8 (= 1920*1080 = 2073600) threads.
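The dispatch arithmetic above can be sketched in a few lines of plain Python (just illustrating the math, not a real API):

```python
# Thread-group size declared in the shader: numthreads(8, 8, 1)
group_size = (8, 8, 1)
# Group counts passed on the CPU side: Dispatch(240, 135, 1)
dispatch = (240, 135, 1)

# Number of thread-groups the Dispatch call launches
groups_launched = dispatch[0] * dispatch[1] * dispatch[2]   # 32400

# Threads per group, as declared by numthreads
threads_per_group = group_size[0] * group_size[1] * group_size[2]  # 64

# Total threads across the whole dispatch: one per 1920x1080 pixel
total_threads = groups_launched * threads_per_group  # 2073600
```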
On the hardware side:
AMD compute units execute threads in wavefronts of 64 at once.
NVidia compute units execute threads in warps of 32 at once.
Intel compute units execute threads in SIMD batches of 8 at once.
So - for good performance across all hardware, you should try to make your shader's thread-group sizes a multiple of 64.
You can't make your thread-group size too big, because there's not enough memory inside a compute unit to maintain state for thousands of threads at a time -- D3D caps a single group at 1024 threads, and in practice a few hundred is often the limit.
The point of a thread-group is:
A) threads within a group can communicate with each other through group-shared memory.
B) they should somewhat match the way that the hardware works (8-64 "threads" being executed in lockstep on a SIMD architecture).
You also don't want just a single mega-thread-group of size 1920*1080, because GPUs have more than one compute unit.
A GPU might have 32 CUs, each of which can be working on 10 thread-groups at once, each made up of 64 threads -- meaning you can have 20480 threads in flight at once across all CUs. But a single thread-group might be limited to something like 640 threads max.
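A quick sanity check on that occupancy math, using the hypothetical figures from the text (32 CUs, 10 groups per CU, 64 threads per group):

```python
# Hypothetical GPU from the text -- these are illustrative numbers,
# not the specs of any particular card.
num_cus = 32
groups_per_cu = 10       # groups each CU can keep in flight
threads_per_group = 64

# Threads in flight across the whole GPU at once
threads_in_flight = num_cus * groups_per_cu * threads_per_group  # 20480

# A full-screen 1920x1080 dispatch with 64-thread groups produces far
# more groups than the GPU can run at once, so every CU stays busy.
total_groups = (1920 * 1080) // threads_per_group  # 32400
```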
So you want your thread-group size to be small enough that your workload spreads out across all the CUs on the GPU, and large enough that you're not wasting any of the SIMD abilities of each CU. As a rule of thumb, 64 or 128 are good sizes to use.
Note that on AMD, a 128-sized thread-group will run on a SIMD-64 machine, which lets AMD's internal compiler basically unroll the code to work on two sets of SIMD-64 registers at once.
On NVidia, a 128-sized thread-group will run on a SIMD-32 machine, which lets NVidia's internal compiler unroll the code to work on four sets of SIMD-32 registers at once.
Sometimes this unrolling is helpful as it helps reduce memory stalls, and sometimes it's harmful as you're increasing register pressure... It takes a lot of profiling to find the optimal thread-group size for a particular GPU :(
What about the case of a texture that is 33x33? How would you have to lay out threads and thread-groups in that case? Or does it depend more on how the algorithm itself groups things?
If you defined your thread-group size to be 8x8, you'd dispatch ceil(33/8) x ceil(33/8) = 5x5 thread-groups. That's enough threads to process a 40x40 sized texture, so a bunch of your threads will end up doing nothing.
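The round-up is just a ceiling division; a small Python sketch of the group-count calculation (the function name is made up for illustration):

```python
import math

def groups_needed(texture_size, group_size):
    """Round up so the grid of thread-groups covers the whole texture."""
    return math.ceil(texture_size / group_size)

gx = groups_needed(33, 8)          # 5 groups along x
gy = groups_needed(33, 8)          # 5 groups along y
covered = (gx * 8, gy * 8)         # (40, 40) texels' worth of threads
wasted = gx * 8 * gy * 8 - 33 * 33 # threads that do no useful work
```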
HLSL is helpful here: out-of-bounds texture reads return 0, and out-of-bounds texture writes are a no-op -- so often your algorithm can be unaware that it's writing outside the bounds of your target texture. In other cases, you need to use an if statement to 'early exit' on any out-of-bounds threads.
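The 'early exit' guard can be sketched like this, with Python loops standing in for the dispatched thread grid (in the actual HLSL it would be a check on the thread id followed by a return):

```python
# Sketch of the per-thread bounds guard for a 33x33 texture
# processed with 8x8 thread-groups. Each (x, y) is one thread's id.
width = height = 33
group = 8
groups_x = -(-width // group)    # ceil(33 / 8) = 5
groups_y = -(-height // group)   # ceil(33 / 8) = 5

processed = 0
for y in range(groups_y * group):        # 40 threads along y
    for x in range(groups_x * group):    # 40 threads along x
        if x >= width or y >= height:    # the 'early exit' guard
            continue                     # in HLSL: return;
        processed += 1                   # ...do the real work here
```

Only 1089 of the 1600 launched threads pass the guard; the rest exit immediately.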