TCP stream 300ms spikes due to Nagle algorithm?


So why would TCP block the send if it queues the data into a buffer and sends it later with some latency? You would think a buffer would smooth out the bandwidth.

TCP takes some time to re-transmit a packet after it's lost, and meanwhile, the kernel cannot deliver any of the data it receives after the lost packet, because TCP is guaranteed in-order. Also, if the kernel buffer fills up because the other end doesn't acknowledge the receipt fast enough, the sending side will block, waiting to put more into the buffer.
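If the spikes in the title really are Nagle interacting with delayed ACKs, that part at least is easy to rule out by disabling Nagle on the sending socket. A minimal sketch, assuming s is an already-connected TCP SOCKET (the 4 MB send buffer below is just an example value):

#include <winsock2.h>
#pragma comment(lib, "ws2_32.lib")

// Turn off Nagle so small writes go out immediately instead of being held
// back for coalescing (the classic cause of multi-hundred-millisecond
// hiccups when it interacts with delayed ACKs on the receiving side).
void disableNagle(SOCKET s)
{
    int noDelay = 1;
    setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
               reinterpret_cast<const char*>(&noDelay), sizeof(noDelay));

    // Optionally grow the kernel send buffer so the sender blocks less often
    // while waiting for the receiver to drain and acknowledge data.
    int sndBuf = 4 * 1024 * 1024;
    setsockopt(s, SOL_SOCKET, SO_SNDBUF,
               reinterpret_cast<const char*>(&sndBuf), sizeof(sndBuf));
}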

we don't know why TCP is so unstable and non-uniform. Maybe it's by design.

Yes, this is something that's not uncommon in TCP clients and servers. There's a lot of research in this area, including tons of academic papers. You need good knowledge of the actual situation you have, plus the network protocols involved, to attempt to tune it up to go faster.

a single core cannot handle 3 Gb/s of bandwidth too well.

That's entirely dependent on the quality of the network hardware, the network driver software, and the network stack implementation (as well as the application program structure, of course.)

If you're doing this on Windows, you should be using OVERLAPPED sockets, WSARecv(), GetQueuedCompletionStatus(), and multiple outstanding OVERLAPPED I/O requests. Other APIs are less efficient in various ways. You might also want to have a single thread always waiting in the GetQueuedCompletionStatus() function, and hand off the buffers you receive to a processing thread pool. Or have all the thread pool threads gang up on GQCS().
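A rough sketch of that structure, with error handling, cleanup, and socket creation omitted (sock is assumed to be a SOCKET you have already created with the overlapped flag and bound or connected):

#include <winsock2.h>
#include <windows.h>
#pragma comment(lib, "ws2_32.lib")

// One receive buffer per outstanding OVERLAPPED request.
struct RecvContext
{
    OVERLAPPED ov{};            // must stay first so we can cast back from LPOVERLAPPED
    WSABUF     wsaBuf{};
    char       data[64 * 1024];
};

void runReceiver(SOCKET sock)
{
    // Create a completion port and associate the socket with it.
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 0);
    CreateIoCompletionPort(reinterpret_cast<HANDLE>(sock), iocp, /*key*/ 1, 0);

    // Keep several receives outstanding so the driver always has buffers to fill.
    for (int i = 0; i < 8; ++i)
    {
        auto* ctx = new RecvContext;
        ctx->wsaBuf.buf = ctx->data;
        ctx->wsaBuf.len = sizeof(ctx->data);
        DWORD flags = 0;
        WSARecv(sock, &ctx->wsaBuf, 1, nullptr, &flags, &ctx->ov, nullptr);
    }

    // Completion loop: a single thread (or a whole pool) can sit in GQCS().
    for (;;)
    {
        DWORD       bytes = 0;
        ULONG_PTR   key   = 0;
        OVERLAPPED* ov    = nullptr;
        if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE) || !ov)
            continue;   // inspect GetLastError() in real code

        auto* ctx = reinterpret_cast<RecvContext*>(ov);
        // ... hand ctx->data / bytes off to the processing thread pool here ...

        // Re-post the receive with the same buffer.
        ZeroMemory(&ctx->ov, sizeof(ctx->ov));
        DWORD flags = 0;
        WSARecv(sock, &ctx->wsaBuf, 1, nullptr, &flags, &ctx->ov, nullptr);
    }
}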

People who want to drive the network as hard as possible sometimes end up using "user space implementations" of the entire network stack, with a memory-mapped ring buffer of packets going to/from the network driver, running in polled mode. If the hardware and software are built to facilitate this approach, that can give you additional performance headroom.

If you haven't yet tried with Intel "enterprise level" network cards, now might be a good time to try that. While not always the best, they tend to avoid some of the terrible hardware bugs and software work-arounds that some of the cheapo vendors ship.

 

enum Bool { True, False, FileNotFound };

I stabilized my net client/server by having the client thread never block: it just peeks for available data, and if there is none, it does a bit of other processing and peeks again. If there is data, the client "swallow" thread reads it and sends it, non-blocking, through a local socket to the processing thread.

The "swallow" thread fully utilizes a single core, while the processing thread just blocks on the local read socket; if it does not yet have complete data, it blocks until the rest arrives. The processing thread can buffer up quite a bit if it has a lot to process, but this setup let me see how well the actual net protocols perform for me.

From my experience, the bigger the write to the distant end, the bigger the chance of the data being lost and having to be re-sent.

Ok so I have been working on it lately.

A few things.

On UDP I would get too many packet losses.

However, after disabling Jumbo Packets I was getting far fewer packet losses.

The issue is that with 1500-byte packets, it's nearly impossible to get more than 2 Gb/s with a single socket and thread.

While with Jumbo Packets I could easily get 9 Gb/s with a single thread and socket.
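Which would make sense if the per-packet cost, rather than the byte rate, is the bottleneck; rough math, assuming ~1500-byte vs ~9000-byte payloads and ignoring headers:

// Approximate packet rates a single receive thread has to sustain:
constexpr double pktsPerSecMtu1500 = 2e9 / (1500.0 * 8.0);  // ~167,000 packets/s at 2 Gb/s
constexpr double pktsPerSecJumbo   = 9e9 / (9000.0 * 8.0);  // ~125,000 packets/s at 9 Gb/s
// Both are the same order of packets per second, so the per-packet work
// (recvfrom calls, copies, queueing) looks like the limiting factor.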

Even after disabling Jumbo Packets, one of the edge computers would sometimes get too much packet loss. The CPU also works a lot harder with the 1500-byte packets.

Why does this single PC get worse packet loss than the other PCs? It's supposed to be an identical machine and configuration. The only difference I could think of is different RAM sticks. I also replaced the fiber cable to make sure.

So why do Jumbo packets have so much packet loss?

I am using winsock2 directly with recvfrom, select, sendto, etc.

My current threading is:

For receiving:

A thread per socket. The thread waits for readability of the socket with select, with a 2-second timeout (although I also tried doing the select only the first time).

Then it gets a pointer to a vector from a pool, sized for the packet. It recvfroms the socket and then pushes the vector to a queue.

A different thread polls the queue and composes the frame.

For the send side, I have a thread that decomposes the frame into packets and pushes them to a queue, and then a thread that polls the queue and sendtos them (tried with and without select for writability).
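A simplified sketch of what the receive thread does (the pool and queue here are trivial stand-ins for my actual ones):

#include <winsock2.h>
#include <vector>
#pragma comment(lib, "ws2_32.lib")

// Trivial stand-ins for the real pool / queue shared with the composing thread.
static std::vector<char>* acquireFromPool(size_t size) { return new std::vector<char>(size); }
static void pushToQueue(std::vector<char>* /*packet*/) { /* hand off to the frame composer */ }

void receiveLoop(SOCKET sock)
{
    for (;;)
    {
        // Wait (up to 2 seconds) for the socket to become readable.
        fd_set readSet;
        FD_ZERO(&readSet);
        FD_SET(sock, &readSet);
        timeval timeout{2, 0};
        if (select(0, &readSet, nullptr, nullptr, &timeout) <= 0)
            continue;   // timeout or error: try again

        // Grab a buffer from the pool and read one datagram into it.
        std::vector<char>* packet = acquireFromPool(64 * 1024);
        sockaddr_in from{};
        int fromLen = sizeof(from);
        int n = recvfrom(sock, packet->data(), static_cast<int>(packet->size()), 0,
                         reinterpret_cast<sockaddr*>(&from), &fromLen);
        if (n > 0)
        {
            packet->resize(static_cast<size_t>(n));
            pushToQueue(packet);   // the composing thread drains this queue
        }
    }
}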

Am I doing this right?

Basically I think I have a solution where I can split the frames to send across multiple sockets. Only one edge computer underperforms sometimes, for no apparent reason.

The causes for losses of jumbo frames are the same as for regular losses: Bad hardware, bad drivers, bad wiring, extra load on the CPU from other processes, or something like that. There's really no way to tell for sure without applying high-quality network analysis software and hardware.

If you can get port statistics from your switch, then you can see whether the switch is discarding or forwarding packets. Similarly, if you can get network interface statistics from the computer, you can see whether the network interface itself is discarding packets.

Then you have to replace one piece at a time to see whether that was the problem. You might even want to dual-boot the computer using some other OS (Windows / Linux / BSD / etc) from a USB stick, and see if that other OS gives you different results, which would isolate the problem to either the hardware or somewhere in the software stack.

enum Bool { True, False, FileNotFound };

How do I get port statistics from a NIC or a switch? I couldn't find anything for the Intel X710.

Getting card statistics is driver- and OS-dependent. If you're using Windows, you should be able to get WMI events, which you could for example look at in Perfmon. (Sometimes you need to enable event generation/collection first, before they are gathered.)

For Linux, you can find most events in sysfs, and some of them even in simple utilities like "ifconfig" and the "ip link" command.

For a network switch, you need to log into the management interface for the switch, and use whatever software is provided by the switch manufacturer (e.g. IOS for Cisco, JunOS for Juniper, etc.) to extract the statistics. If you use a cheap unmanaged switch from Netgear or whatever, then you won't be able to get this, and might want to invest in higher-end hardware.

enum Bool { True, False, FileNotFound };

I have been looking in Perfmon on Windows. I just looked at the IPv4 UDP stats and I wasn't able to get anything that helped me from that.

What do you suggest looking at in Perfmon to figure out the source of the problem?

The internet says that Win32_PerfRawData_Tcpip_NetworkInterface is a place to start.

In Perfmon it seems to be the \NetworkInterface*\* or \NetworkAdapter*\* set of counters, I think. I see names like "Packets Received Discarded" and "Packets Received Errors" for example. I haven't done Windows diagnostics in a long time, so I can't be more specific.

enum Bool { True, False, FileNotFound };

Thanks.

This helped solve the problem. Under the network interface there was a packets-discarded counter. This counter showed that the receiving NIC was discarding frames.

I have brought back Jumbo frames, and the actual culprit was that a long time ago we had decided to disable flow control. Ever since, we had kept it disabled, but enabling it, combined with ironing out bugs along the way, is what solved the problem.

I now get nearly zero packet discards on 4 agents running together. Had like a few packet discards after 30 minutes.

The next challenge is streaming four 4K raw videos, one from each agent, as it seems a single socket with a single thread struggles with this.
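For a sense of scale, assuming 8-bit RGB at 30 fps with no chroma subsampling (our actual format may differ):

// Rough bandwidth of one raw 4K stream under the assumptions above:
constexpr double bytesPerFrame = 3840.0 * 2160.0 * 3.0;    // ~24.9 MB per frame
constexpr double bytesPerSec   = bytesPerFrame * 30.0;     // ~746 MB/s
constexpr double gbitsPerSec   = bytesPerSec * 8.0 / 1e9;  // ~6 Gb/s per stream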

Thanks for your patience and help.

This topic is closed to new replies.
