I would do it roughly like this:
Put your OVERLAPPED and the 4k receive buffer into one structure, like this:
struct recv_buf { OVERLAPPED overlapped; char data[4096]; };
Fire up a handful of threads (say, 5 or 10) and have each one loop forever around GetQueuedCompletionStatus, leaving the loop only when some "special message" is posted to the completion port. The completion port will do the smart one-thread-per-core management for you, as per the documentation.
For each OVL_ACCEPT seen, pull one such recv_buf from an allocator (or list, or whatever). You have accepted a connection, so you want to receive. WSARecv into that structure: point a WSABUF at buf.data with length sizeof(buf.data), and pass &buf.overlapped (zeroed beforehand) as the OVERLAPPED pointer.
For each OVL_RECV seen, you get back an OVERLAPPED* from GetQueuedCompletionStatus, which is, as you know, really a pointer to a recv_buf (you were not 100% truthful when you told Windows that this was an OVERLAPPED, but you were truthful enough!). So you know where to find the received data, and you know that this buffer is used by you and only you. Nobody else could have gotten back this pointer at this time, so there is no need to lock anything, no need to do any memory management, or anything else. The structure was put into the IOCP's queue exactly once, and now it's no longer in there, you have it.
Process the buffer, zero the overlapped member, and then WSARecv again using that same recv_buf structure. You still know for sure that nobody else could possibly be using it, so that's fine.
When a socket is closed, return the recv_buf to your allocator (or push it back to a list, or whatever). Someone else will eventually pull and reuse it.
Repeat forever.
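The recv_buf trick in the steps above can be sketched as follows. This is a minimal illustration, not a complete server: the helper names recv_buf_from_overlapped and recv_buf_prepare are made up for this example, and the non-Windows stand-in for OVERLAPPED exists only so the snippet compiles anywhere. On Windows the real type from <windows.h> is used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-in (assumption) so the sketch builds off Windows;
   the real OVERLAPPED comes from <windows.h>. */
typedef struct { void *reserved[4]; } OVERLAPPED;
#endif

struct recv_buf {
    OVERLAPPED overlapped;  /* what Windows is told about */
    char data[4096];        /* what you actually care about */
};

/* Map the OVERLAPPED* handed back by GetQueuedCompletionStatus to the
   recv_buf that contains it. Using offsetof keeps this correct even if
   the overlapped member is ever moved away from offset 0. */
static struct recv_buf *recv_buf_from_overlapped(OVERLAPPED *ovl)
{
    assert(ovl != NULL);
    return (struct recv_buf *)((char *)ovl - offsetof(struct recv_buf, overlapped));
}

/* Clear the OVERLAPPED part before posting (or re-posting) a receive
   on this buffer. The data bytes are left untouched. */
static void recv_buf_prepare(struct recv_buf *rb)
{
    memset(&rb->overlapped, 0, sizeof rb->overlapped);
}
```

In a worker thread you would call recv_buf_from_overlapped on the pointer that GetQueuedCompletionStatus returns, read the received bytes from data, then call recv_buf_prepare before the next WSARecv.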
When the server should exit, PostQueuedCompletionStatus your "special message" (for example bytes transferred = 0, completion key = 0, OVERLAPPED pointer = NULL) and then WaitForMultipleObjects(thread_count, thread_handle_array, TRUE, INFINITE). Each thread that receives that message from the IOCP re-posts it (so all threads eventually get to see it) and then simply returns from the thread function.
The only thing that needs to be explicitly threadsafe is the allocator (list, whatever) from which you pull your recv_bufs; everything else is made threadsafe automatically by how the completion port works.
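As an illustration of such an allocator, here is a minimal fixed-size pool guarded by a C11 atomic_flag spinlock. Everything in it (POOL_CAP, struct recv_pool, the recv_pool_* functions) is hypothetical, and the OVERLAPPED stand-in is only there so it builds off Windows. In real Windows code a CRITICAL_SECTION or the interlocked SList functions would be the more usual choice.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-in (assumption); the real OVERLAPPED comes from <windows.h>. */
typedef struct { void *reserved[4]; } OVERLAPPED;
#endif

struct recv_buf { OVERLAPPED overlapped; char data[4096]; };

#define POOL_CAP 8  /* hypothetical capacity */

struct recv_pool {
    atomic_flag lock;                     /* spinlock guarding the stack */
    struct recv_buf storage[POOL_CAP];    /* the buffers themselves */
    struct recv_buf *free_items[POOL_CAP];/* stack of unused buffers */
    size_t free_count;
};

static void pool_lock(struct recv_pool *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire)) {
        /* spin; the critical sections below are only a few instructions */
    }
}

static void pool_unlock(struct recv_pool *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Call once, before any worker thread touches the pool. */
static void recv_pool_init(struct recv_pool *p)
{
    atomic_flag_clear(&p->lock);
    for (size_t i = 0; i < POOL_CAP; i++)
        p->free_items[i] = &p->storage[i];
    p->free_count = POOL_CAP;
}

/* Returns an unused buffer, or NULL if the pool is exhausted. */
static struct recv_buf *recv_pool_get(struct recv_pool *p)
{
    struct recv_buf *rb = NULL;
    pool_lock(p);
    if (p->free_count > 0)
        rb = p->free_items[--p->free_count];
    pool_unlock(p);
    return rb;
}

/* Hands a buffer back, e.g. after its socket has been closed. */
static void recv_pool_put(struct recv_pool *p, struct recv_buf *rb)
{
    pool_lock(p);
    assert(p->free_count < POOL_CAP);
    p->free_items[p->free_count++] = rb;
    pool_unlock(p);
}
```

The lock is held only while a pointer is pushed or popped, so workers contend on it briefly when a connection is accepted or closed, never while a buffer is being processed.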
What about sending multiple packets?