Sounds funny. So the best scenario (at least in the 1D case) would be to make groups whose thread count is a multiple of 32 and then dispatch enough of these groups. Do they state explicitly in their documentation that performance can take that big a hit just because you don't guess the right size? You might think that as long as your shader is OK, you can expect more or less the same performance across differently scaled cases.
It is always better to just try it out and see for yourself how this affects performance. It could very well be that you are completely texture-bandwidth limited and the threadgroup size won't make much difference, but in general what the others are saying is very relevant. All other things being equal, it is better to use a multiple of the IHV's suggested threadgroup size.
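For the 1D case, the arithmetic behind "a multiple of 32, then dispatch enough groups" can be sketched on the CPU side roughly like this. The helper names `roundUpToMultiple` and `groupCountFor` are made up for illustration and are not part of any API. The width of 32 is the figure from this thread; other hardware may suggest a different value, so treat it as a parameter.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round a desired group size up to the next multiple
// of the hardware's suggested width (32 in this thread; an assumption that
// varies by vendor).
inline std::uint32_t roundUpToMultiple(std::uint32_t value, std::uint32_t multiple) {
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper: number of groups to dispatch so that
// groups * groupSize covers every element (ceiling division).
inline std::uint32_t groupCountFor(std::uint32_t elementCount, std::uint32_t groupSize) {
    assert(groupSize > 0);
    return (elementCount + groupSize - 1) / groupSize;
}
```

With 1000 elements and a requested size of 48, `roundUpToMultiple(48, 32)` gives a group size of 64 and `groupCountFor(1000, 64)` gives 16 groups. That is 1024 threads in total, so the last group has 24 threads with nothing to do and the shader needs a bounds check on the thread index. Whether 64 actually beats 128 or 256 for a given shader is still something to measure.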