So why would TCP block the send if it queues the data into a buffer and sends it with some latency later on? You would think a buffer would smooth out the bandwidth.
TCP takes some time to retransmit a packet after it's lost, and in the meantime the kernel cannot deliver to the application any of the data it receives after the lost packet, because TCP guarantees in-order delivery. Also, if the kernel send buffer fills up because the other end doesn't acknowledge receipt fast enough, the sending side will block until there is room to put more into the buffer.
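To see the second effect in isolation, here is a minimal sketch over loopback: a sender writes to a peer that accepts the connection but never reads. The function name, buffer sizes, and the safety limit are my own choices for illustration, and the exact byte count you get depends on the OS. The socket is non-blocking so that the point where a blocking send() would stall shows up as BlockingIOError instead.

```python
import socket

def fill_send_buffer(chunk_size=4096, bufsize=8192, limit=64 * 1024 * 1024):
    """Send to a peer that never reads.

    Returns (bytes_accepted, would_block). would_block is True if the kernel
    refused more data before `limit` bytes were queued.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask for a small receive buffer before listen(); the OS may round it up.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bufsize)
    sender.connect(listener.getsockname())
    receiver, _ = listener.accept()  # accepted, but recv() is never called

    sender.setblocking(False)
    chunk = b"x" * chunk_size
    total = 0
    would_block = False
    try:
        while total < limit:
            total += sender.send(chunk)  # may accept only part of the chunk
    except BlockingIOError:
        # Kernel buffers are full; a blocking socket would be stuck here.
        would_block = True
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return total, would_block

if __name__ == "__main__":
    accepted, blocked = fill_send_buffer()
    print("accepted", accepted, "bytes; would block:", blocked)
```

The buffer only smooths things out while the receiver drains it at least as fast as you fill it, on average. Once it is full, send() can go no faster than the acknowledgements come back.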
We don't know why TCP is so unstable and non-uniform. Maybe it's by design.
Yes, this is not uncommon in TCP clients and servers. There's a lot of research in this area, including tons of academic papers. You need good knowledge of the actual situation you have, plus the network protocols involved, to attempt to tune it to go faster.
a single core cannot handle 3Gbs of bandwidth too well.
That's entirely dependent on the quality of the network hardware, the network driver software, and the network stack implementation (as well as the application program structure, of course.)
If you're doing this on Windows, you should be using OVERLAPPED sockets, WSARecv(), GetQueuedCompletionStatus(), and multiple outstanding OVERLAPPED I/O requests. Other APIs are less efficient in various ways. You might also want to have a single thread always waiting in the GetQueuedCompletionStatus() function, and hand off the buffers you receive to a processing thread pool. Or have all the thread pool threads gang up on GQCS().
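The "one receiving thread, pool of processing threads" structure can be sketched without any Windows API. This is only a portable analogue of the shape, not IOCP itself: the `receiver` function stands in for the thread sitting in GetQueuedCompletionStatus(), and `run_pipeline`, the queue size, and the "processing" (just counting bytes) are made up for the example.

```python
import queue
import threading

def run_pipeline(buffers, num_workers=4):
    """One receiver thread hands buffers to worker threads.

    Returns the total number of bytes the workers processed.
    """
    # Bounded queue: if the workers fall behind, the receiver is pushed back
    # instead of queueing buffers without limit.
    work = queue.Queue(maxsize=64)
    lock = threading.Lock()
    total = 0

    def receiver():
        # Stand-in for the thread that waits for completed receives.
        for buf in buffers:
            work.put(buf)
        for _ in range(num_workers):
            work.put(None)  # one shutdown sentinel per worker

    def worker():
        nonlocal total
        while True:
            buf = work.get()
            if buf is None:
                return
            n = len(buf)  # placeholder for the real per-buffer processing
            with lock:
                total += n

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    threads.append(threading.Thread(target=receiver))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

if __name__ == "__main__":
    print(run_pipeline([b"a" * 1500] * 1000))
```

With real IOCP, the "gang up" variant drops the hand-off queue entirely, because the completion port already acts as the queue that the pool threads pull from.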
People who want to drive the network as hard as possible sometimes end up using "user space implementations" of the entire network stack, with a memory-mapped ring buffer of packets shared with the network driver for receiving and sending, and running it in polled mode. If the hardware and software are built to facilitate this approach, that can give you additional performance headroom.
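For a feel of what "ring buffer in polled mode" means, here is a toy single-producer/single-consumer ring. It only shows the head/tail bookkeeping and the poll-with-a-budget loop. Real user space stacks share memory-mapped descriptor rings with the NIC driver and rely on careful memory ordering, none of which is modelled here, and the class and method names are invented for the sketch.

```python
class PacketRing:
    """Fixed-size single-producer/single-consumer ring of packet slots.

    One slot is kept empty to tell "full" from "empty", so a ring created
    with `slots` entries holds at most slots - 1 packets.
    """

    def __init__(self, slots):
        self.slots = [None] * slots
        self.head = 0  # next slot the producer (the "driver" side) writes
        self.tail = 0  # next slot the consumer (the "stack" side) reads

    def push(self, packet):
        """Producer side: returns False (packet dropped) if the ring is full."""
        nxt = (self.head + 1) % len(self.slots)
        if nxt == self.tail:
            return False
        self.slots[self.head] = packet
        self.head = nxt
        return True

    def poll(self, budget):
        """Consumer side: take up to `budget` packets without ever blocking."""
        out = []
        while self.tail != self.head and len(out) < budget:
            out.append(self.slots[self.tail])
            self.slots[self.tail] = None
            self.tail = (self.tail + 1) % len(self.slots)
        return out

if __name__ == "__main__":
    ring = PacketRing(8)
    for i in range(5):
        ring.push(b"pkt%d" % i)
    print(ring.poll(budget=4))
```

The consumer never sleeps waiting for an interrupt; it just keeps calling poll() in a loop, which trades a busy core for lower latency and no per-packet system calls.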
If you haven't yet tried with Intel "enterprise level" network cards, now might be a good time to try that. While not always the best, they tend to avoid some of the terrible hardware bugs and software work-arounds that some of the cheapo vendors ship.