UDP reception - CPU load and packet loss


Howdy,

  • I have an ARM Cortex A9 CPU, the NXP iMX6, dual core,
  • on which runs a bare bones Linux (kernel 4.1.44).
  • I'm trying to receive data over its Gigabit ethernet interface.
  • goal throughput is ~ 40 MByte/s (320 Mbit/s)
  • my default UDP payload is ~ 1350 bytes, I've experimented with sizes from 300 to 16k Bytes
  • the ARM board is connected, via one Gigabit switch, to my PC
  • nothing else on the switch
  • on the PC runs the UDP client, sending data continuously
  • the ARM runs the server, which binds a socket to a port and then basically only does (see the sketch after this list):
    • while(true) { poll( socket, ...); recv( socket, ...); }
    • there is some other stuff going on, like a loop-time jitter histogram and a crude data integrity check, but commenting that out makes no difference to the problematic behavior
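
For completeness, this is roughly what that loop looks like - a minimal sketch only, assuming an already-created and bound UDP socket and an arbitrary 1-second timeout, with error handling stripped out:

    /* Minimal sketch of the receive loop described above.
       "sock" is assumed to be a bound UDP socket. */
    #include <poll.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    static void receive_loop(int sock)
    {
        char buf[2048];                      /* enough for my ~1350-byte payloads */
        struct pollfd pfd = { .fd = sock, .events = POLLIN };

        for (;;) {
            int ready = poll(&pfd, 1, 1000); /* 1000 ms timeout, purely illustrative */
            if (ready < 0)
                break;                       /* real code would handle errors here */
            if (ready == 0)
                continue;                    /* timeout, no data yet */
            ssize_t n = recv(sock, buf, sizeof buf, 0);
            if (n < 0)
                break;
            /* jitter histogram / integrity check would go here */
        }
    }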

That is, I'm using the standard socket API as available on Linux.
Now, according to what I've read, poll() blocks (in a non-busy manner) until there is data. I use it because it has timeout functionality.

So, given that, I should not be seeing the near-maxing-out of the CPU (90..99% for my process) that I unfortunately do see - unless that throughput really is too much for this ARM CPU.
But I doubt that, as iperf (2.0.5) eats "only" 50..56% CPU at a ~40 MB/s data rate. I'm not sure what exactly it does, though; I looked briefly at the code and saw some fiddling with raw sockets, and I'm not sure I want to go that route... I have hardly any experience with network stuff.

There is also the strange effect of high packet loss despite this short connection, while a second PC acting as the UDP receiver sees no packet loss at all.
I'd understand it if the CPU were at 100% all the time, but it's slightly below that - and iperf uses much less CPU, yet can still show something like 2..4% loss.

I then found some Linux settings, net.core.rmem_max and net.core.rmem_default, which I set to 8 MB each via sysctl, instead of the KB-range defaults they had before.
Then the packet loss went to zero (still around 90+% CPU load).
Today I tried to replicate this, but I have packet loss again. I had also moved the cheap plastic switch somewhere else, so now I'm suspecting that thing may be unreliable... yet to be tested.
Can switches be a source of problems like that?
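
For reference, the buffer change was just something like "sysctl -w net.core.rmem_max=8388608" (and the same for rmem_default). From the application side, the per-socket equivalent would be roughly the sketch below; as far as I understand, the kernel caps an SO_RCVBUF request at net.core.rmem_max, so raising the sysctl first is what lets the 8 MB actually take effect:

    /* Sketch: requesting a large per-socket receive buffer.
       The value is capped by net.core.rmem_max, hence the sysctl change. */
    #include <stdio.h>
    #include <sys/socket.h>

    static void enlarge_rcvbuf(int sock)
    {
        int bytes = 8 * 1024 * 1024;         /* 8 MB, matching the sysctl value */
        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes) < 0)
            perror("setsockopt(SO_RCVBUF)");

        socklen_t len = sizeof bytes;
        getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, &len);
        printf("effective receive buffer: %d bytes\n", bytes);  /* Linux reports a doubled value */
    }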

Any ideas about why the CPU load is so high?
Or, in other words, am I doing this totally wrong? What would a proper implementation of relatively high speed UDP reception look like?

Regards,
- UB
 


First, two system calls per packet is never great. You should instead either block on recvfrom(), or use a non-blocking socket and handle the case where there is no packet. If you need a timeout, perhaps do that on another thread? It's worth switching to a single blocking recvfrom() in a simple receive loop, just to measure how much the system-call overhead costs.
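
A minimal sketch of that simpler loop, for illustration - here using SO_RCVTIMEO as one possible way to keep a timeout without poll() or a second thread; treat it as a starting point rather than a finished solution:

    /* Sketch: one blocking recvfrom() per packet, timeout via SO_RCVTIMEO. */
    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    static void receive_loop_blocking(int sock)
    {
        struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };  /* arbitrary 1 s timeout */
        setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

        char buf[2048];
        for (;;) {
            ssize_t n = recvfrom(sock, buf, sizeof buf, 0, NULL, NULL);
            if (n < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    continue;                /* timed out, no packet */
                break;                       /* real error */
            }
            /* process n bytes */
        }
    }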

Second, high packet loss is almost certainly because of system overload on the receiver side in the setup you describe.

Third, is the 50% "percent of one CPU" or "percent of all CPUs"? And is it "total time" or just "your process's user-space time"? From what you say, it sounds quite likely that your system is CPU bound, possibly on a single core. For example, most ARM Linux builds vector all interrupts to one core, which means that with a high interrupt rate, CPU usage will be high on that core. If the driver for the network chip is not highly optimized, or if the chipset logic of your integrated board doesn't support all the fancy offload/DMA methods of a PC system, or if your CPU core needs to flush cache to stay coherent with DMA, then you are almost certainly CPU bound under that load.

You can check what the system is doing overall with other tools, such as "top" and "time" and so forth. There are also kernel-level / sampling profilers that will give you a better overall view of where the CPU is spending its time.

It's almost certain, though, that your problems are all due to "embedded-systems performance woes" rather than "software-level woes." Assuming you're on an i.MX6 Dual of some sort, that part runs at 1 GHz, with anemic caches and an old implementation architecture (compared to x86-style speculative pipelines), on top of a 64-bit RAM interface at 533 MHz (compared to the 256-bit-plus interfaces at 2000+ MHz of most PCs).

 

enum Bool { True, False, FileNotFound };

Hey, thanks for the reply!
50% CPU - now that you're asking, I didn't pay attention to that detail, lol! I was staring at the top of the "htop" list, so perhaps it was only one core. My process was shown at the top, eating most of it, though.
I have to look into "time", and kernel-level profilers sound interesting too, but they're probably quite involved to set up?
Cache coherency, or rather the lack of it... that rings a bell; I think that problem is present here.

While it does seem very plausible that my CPU just hasn't got enough horsepower, I still wonder what iperf is doing better (even if not exactly impressively), and why, on that one day, packet loss seemed to stay consistently at zero after I set the 8 MB buffers (4 MB was just barely not enough; the effect seemed roughly proportional).

I will try out what happens with only one system call!

Edit:
OK, I tried it with only receive, without poll, checking for the timeout in a low-frequency thread. That reduced the CPU load only slightly, at best. I wonder why it's not always the same - sometimes core0 is at 95..99% at my target data rate and there is no packet loss, but when it hits 100%, not surprisingly, there is.
Core1 is mostly < 2% busy.
I disabled the LXDE desktop completely to see whether the barely-sufficient CPU situation (99%) would become more repeatable, but it's not.

I'll look into whether all parts of my gear support jumbo frames, to reduce the number of packets/sec I have to handle...
=> Too bad, the iMX6 won't go above an MTU of 1500.
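
(For anyone curious, raising the MTU programmatically would look roughly like the sketch below; the interface name "eth0" is just an assumption for illustration, and "ip link set <dev> mtu <n>" does the same from the command line. On my i.MX6, anything above 1500 is simply rejected.)

    /* Hypothetical sketch: setting an interface MTU via ioctl (needs root). */
    #include <net/if.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int set_mtu(const char *ifname, int mtu)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for this ioctl */
        if (fd < 0)
            return -1;

        struct ifreq ifr;
        memset(&ifr, 0, sizeof ifr);
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_mtu = mtu;

        int rc = ioctl(fd, SIOCSIFMTU, &ifr);
        if (rc < 0)
            perror("SIOCSIFMTU");                  /* e.g. EINVAL if the driver refuses */
        close(fd);
        return rc;
    }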

Another edit:
I tried two things now; the second is semi-successful.
1) I had an NVIDIA TK1 board in the drawer - unlike my iMX6 device, that thing allowed me to set the MTU to 4000 (no 9000, but hey...). My Windows PC did too, and the switch maker also claims my switch supports it.
But... that ran much worse than the iMX6, for some funny reason, and I saw kernel messages about there being no memory for gluing packets together or something...
Anyway:
2) In addition to setting rmem_max and rmem_default to 8 MB (instead of the KB-range defaults), I also set the CPU core affinity of my process to core1. You mentioned that all those interrupts probably run on core0 only, so I thought: how about doing whatever else is possible on the other core - and bingo:
Now I have 30...55% CPU load on both cores, no maxing out anymore, and no packet loss - unless I use the default buffer values instead of my magical 8 MB.
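
In case it helps anyone, the pinning can also be done from inside the program; here is a sketch of the programmatic equivalent, assuming core 0 is the one handling the NIC interrupts:

    /* Sketch: pin the calling process to core 1 so core 0 is left for interrupts. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    static void pin_to_core1(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(1, &set);                                /* core index 1 = second core */
        if (sched_setaffinity(0, sizeof set, &set) < 0)  /* 0 = the calling process */
            perror("sched_setaffinity");
    }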

I guess getting the 4-core version would then actually leave me room to do some things besides fetching the data :D
Don't know yet if that's a real option.

