wood_brian

Minimizing data copies



I read an article by Dave Abrahams about how Boost.MPI minimizes the number of data copies needed to send and receive data between processes. I had thought about this topic before, and his article brought it back to mind. Now I would like to add similar functionality to the C++ Middleware Writer. I came across an interesting paper on a related topic -- www.info.kochi-tech.ac.jp/yama/papers/ispdc05_active.pdf -- and I'm interested in finding books, papers, and websites on the subject. If you have any thoughts or suggestions, I'd also like to hear them. Thanks in advance.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net

Zero-copy concepts in networking typically apply to kernel-userspace messaging. Ideally, the application would access hardware memory directly.

Under DOS it was common to do graphics directly on a memory-mapped part of RAM: segment 0xB800 or 0xB000 would be accessed directly, and the video card would draw straight from that. Not a perfect example, but the best I can recall.
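For illustration, a minimal sketch of that real-mode trick, assuming a 16-bit DOS compiler (such as Turbo C) where far pointers and MK_FP exist; on a modern protected-mode OS this address simply isn't mapped into the process:

/* Sketch only: writes one character straight into the text-mode video
 * buffer at segment 0xB800. Requires a real-mode DOS compiler with far
 * pointers (MK_FP from <dos.h>); it will not work on a modern OS. */
#include <dos.h>

void put_char_top_left(char c)
{
    /* Each text cell is two bytes: the character, then its attribute. */
    unsigned char far *video = (unsigned char far *) MK_FP(0xB800, 0x0000);
    video[0] = (unsigned char) c;  /* character at row 0, column 0 */
    video[1] = 0x07;               /* light grey on black */
}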

This approach hasn't been practical in a while; hardware has grown in complexity, and security layers and abstractions have made it impossible in just about all of today's OSes.

The only places where one might still encounter such techniques are specialized embedded hardware or perhaps custom Linux kernel hacks. For end-user software it's simply not viable.

One of the biggest problems is that the application needs to be designed around specific hardware, which makes reuse next to impossible (or one relies on abstractions, which defeats the purpose). It also complicates development and adds (what is today) an unreasonable cost to development.


Windows IOCP is supposedly capable of zero-copy in certain circumstances. I don't recall the situation on Linux, but there are certainly patches to do the same, since the epoll reactor model is much better suited for it.

In practice it doesn't matter all that much. Zero-copy used to be a really big thing ten years ago, but it doesn't seem to get mentioned anymore. Workloads and usage patterns have simply changed so much that networking at this level isn't much of a problem anymore.


Userspace MPI may benefit from zero-copy simply by reducing memory bandwidth. But at the same time, replicating data across concurrent systems is often a considerably bigger win. And VM-based languages make almost all allocations on the heap, so the C/C++ issues don't even exist.

Quote:
Original post by Antheus
Zero-copy concepts in networking typically apply to kernel-userspace messaging. Ideally, the application would access hardware memory directly.

Under DOS it was common to do graphics directly on a memory-mapped part of RAM: segment 0xB800 or 0xB000 would be accessed directly, and the video card would draw straight from that. Not a perfect example, but the best I can recall.

This approach hasn't been practical in a while; hardware has grown in complexity, and security layers and abstractions have made it impossible in just about all of today's OSes.

The only places where one might still encounter such techniques are specialized embedded hardware or perhaps custom Linux kernel hacks. For end-user software it's simply not viable.


Not viable unless the library you're using has support for it?

Quote:

One of the biggest problems is that the application needs to be designed around specific hardware, which makes reuse next to impossible (or one relies on abstractions, which defeats the purpose). It also complicates development and adds (what is today) an unreasonable cost to development.

Windows IOCP is supposedly capable of zero-copy in certain circumstances. I don't recall the situation on Linux, but there are certainly patches to do the same, since the epoll reactor model is much better suited for it.

In practice it doesn't matter all that much. Zero-copy used to be a really big thing ten years ago, but it doesn't seem to get mentioned anymore. Workloads and usage patterns have simply changed so much that networking at this level isn't much of a problem anymore.


That paper I mentioned had references from 2004, so it was written after that. I think the work that Abrahams mentions was done even more recently, so I'm unsure about what you're saying here. In a subsequent reply Abrahams says the work "yields huge performance benefits and makes practical some computations that might otherwise not be, due to resource constraints." Perhaps this optimization only matters in scientific applications.

Quote:

Userspace MPI may benefit from zero-copy simply due to reducing memory bandwidth. But at same time, replicating data across concurrent systems is often a considerably bigger win. And VM-based languages make almost all allocations on heap, so the C/C++ issues don't even exist.


What are you getting at with your last sentence?


Brian Wood

Quote:
That paper I mentioned had references from 2004, so it was written after that.


You are aware that not all academics publish on the topics most relevant to real-life systems, right?

Zero-copy was the bee's knees back when CPUs and memory were slow and networks were relatively fast, but it really isn't as important these days.

These days, memory throughput doubles every two years or so, but network throughput doubles maybe every ten years. Data going over a LAN is typically carried at gigabit speeds, which means at most about 125 MB per second (and in practice a *lot* less). With 25 GB/s memory throughput on modern computers, a 1% memory bandwidth hit from copying data just doesn't matter.

When you move to a WAN, it gets even more ridiculous. Let's say you have a really nice cable modem -- 45 Mbps. And let's say you actually get this, because you live somewhere where none of your neighbors use the internet. That's 5 MB/s at most, and generally less than that. That's 1/5000th of the available throughput on the memory bus. A memcpy() of all network data won't even show up on a profiler.

That being said: Zero-copy is up to the driver and OS to implement. You, as an application/library, just need to use the proper API to make it possible. The UNIX read()/write() model is unfortunately really bad at this, so you'd expect the TCP/IP implementation to copy at least once, anyway.
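As a rough illustration of "use the proper API", here is a sketch contrasting the classic read()/write() loop with Linux's sendfile(2), which lets the kernel move file data to a socket without bouncing it through a user-space buffer. This is just one zero-copy-ish path, assumes Linux, and glosses over error handling and partial writes:

#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Classic approach: every byte is copied kernel -> buf -> kernel. */
ssize_t send_with_copies(int sock_fd, int file_fd, size_t len)
{
    char buf[4096];
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = read(file_fd, buf, sizeof buf);
        if (n <= 0) return -1;
        if (write(sock_fd, buf, (size_t) n) != n) return -1;  /* simplified */
        sent += (size_t) n;
    }
    return (ssize_t) sent;
}

/* sendfile(): the data never has to visit a user-space buffer. */
ssize_t send_without_copies(int sock_fd, int file_fd, size_t len)
{
    off_t offset = 0;
    return sendfile(sock_fd, file_fd, &offset, len);
}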

Quote:
Original post by hplus0603
Quote:
That paper I mentioned had references from 2004, so it was written after that.


You are aware that not all academics publish on the topics most relevant to real-life systems, right?

Zero-copy was the bee's knees back when CPUs and memory were slow and networks were relatively fast, but it really isn't as important these days.

These days, memory throughput doubles every two years or so, but network throughput doubles maybe every ten years. Data going over a LAN is typically carried at gigabit speeds, which means at most about 125 MB per second (and in practice a *lot* less). With 25 GB/s memory throughput on modern computers, a 1% memory bandwidth hit from copying data just doesn't matter.

When you move to a WAN, it gets even more ridiculous. Let's say you have a really nice cable modem -- 45 Mbps. And let's say you actually get this, because you live somewhere where none of your neighbors use the internet. That's 5 MB/s at most, and generally less than that. That's 1/5000th of the available throughput on the memory bus. A memcpy() of all network data won't even show up on a profiler.


Thanks for that. It is convincing as far as the performance aspect goes, but what about resource constraints on memory? If your architecture requires that buffers be as large as messages, and messages can be orders of magnitude larger than the typical message in a game, the amount of memory that has to be dedicated to message processing might be a problem. I wonder whether, in some contexts, like a scientific cluster on a LAN, that requirement of buffers as large as the message is unavoidable.


Quote:

That being said: Zero-copy is up to the driver and OS to implement. You, as an application/library, just need to use the proper API to make it possible. The UNIX read()/write() model is unfortunately really bad at this, so you'd expect the TCP/IP implementation to copy at least once, anyway.


What about Windows? I should be able to implement this on one platform and not on others and still be able to transmit messages between programs running on different platforms.

Brian Wood

I don't see why a buffer needs to be as big as a message. Assuming you can send pieces of a message at a time, the buffer only needs to be as big as the largest send window you want to support.
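For illustration, a minimal sketch of that idea, assuming a blocking TCP socket and that the total message length is known up front (say, from a length prefix); the chunk handler callback is hypothetical and just stands in for whatever incremental processing you do:

#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

typedef int (*chunk_fn)(const char *data, size_t len, void *ctx);

/* Consume a msg_len-byte message through a fixed 4 KB working buffer,
 * handing each piece to on_chunk as it arrives. Nothing the size of the
 * whole message is ever held in memory. */
int recv_in_chunks(int fd, size_t msg_len, chunk_fn on_chunk, void *ctx)
{
    char buf[4096];
    size_t remaining = msg_len;

    while (remaining > 0) {
        size_t want = remaining < sizeof buf ? remaining : sizeof buf;
        ssize_t n = recv(fd, buf, want, 0);
        if (n <= 0)
            return -1;                    /* error or peer closed */
        if (on_chunk(buf, (size_t) n, ctx) != 0)
            return -1;                    /* handler asked to abort */
        remaining -= (size_t) n;
    }
    return 0;
}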

When it comes to cross-platform, I don't understand why you think the mechanism for transferring data between an application and the hardware device would affect the format of that data?

Quote:
Original post by hplus0603
I don't see why a buffer needs to be as big as a message. Assuming you can send pieces of a message at a time, the buffer only needs to be as big as the largest send window you want to support.


What is the alternative? We don't want to process a message until it has been completely received. I guess it could be stored on disk and read back in when the whole message has arrived.


Quote:

When it comes to cross-platform, I don't understand why you think the mechanism for transferring data between an application and the hardware device would affect the format of that data?


Where do you get that? I don't think I said that.

Quote:
Original post by wood_brian
What is the alternative? We don't want to process a message until it has been completely received. I guess it could be stored on disk and read back in when the whole message has arrived.


Usually what is done is to have a pool of buffers that can be linked together to form larger packets as needed (essentially just a linked list). The packet writing and reading classes then perform their read/write operations across these buffers, taking into account writes and reads that cross buffer boundaries. As a packet is written, more buffers are linked on so it can grow. You also don't have to worry about buffer lifetime for async operations, since a buffer is never freed and is only reused after you mark it as available.

After a buffer is done being processed, it is simply marked as available again so buffers can be recycled continuously. Since no new allocations ever take place (barring a poor configuration where the client or server keeps running out of available buffers and has to allocate many more), it's 'ok' that much of a buffer might be "wasted" on small packets that don't fill it, since the buffers are just reused. I'm sure you already know the benefits of a pool-based memory manager versus allocating each time you need memory.
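A bare-bones sketch of that pool-of-linked-buffers idea might look something like the following; all the names (buffer_t, pool_acquire, and so on) are made up for illustration, and this is just one of many ways to arrange it:

#include <stddef.h>
#include <string.h>

#define BUF_CAPACITY 1024

typedef struct buffer {
    char           data[BUF_CAPACITY];
    size_t         used;            /* bytes written into this buffer   */
    int            in_use;          /* 0 = free for reuse, 1 = taken    */
    struct buffer *next;            /* next buffer in this packet chain */
} buffer_t;

typedef struct {
    buffer_t *slots;
    size_t    count;
} pool_t;

/* Hand out a free buffer; no new allocation if one is available. */
static buffer_t *pool_acquire(pool_t *p)
{
    for (size_t i = 0; i < p->count; ++i) {
        if (!p->slots[i].in_use) {
            p->slots[i].in_use = 1;
            p->slots[i].used   = 0;
            p->slots[i].next   = NULL;
            return &p->slots[i];
        }
    }
    return NULL;   /* pool exhausted; a real system might grow it here */
}

/* Append bytes to a packet, chaining extra buffers as the write crosses
 * buffer boundaries. */
static int packet_write(pool_t *p, buffer_t *head, const char *src, size_t len)
{
    buffer_t *cur = head;
    while (cur->next) cur = cur->next;        /* find the tail */

    while (len > 0) {
        size_t room = BUF_CAPACITY - cur->used;
        if (room == 0) {
            buffer_t *nb = pool_acquire(p);
            if (!nb) return -1;
            cur->next = nb;
            cur = nb;
            room = BUF_CAPACITY;
        }
        size_t n = len < room ? len : room;
        memcpy(cur->data + cur->used, src, n);
        cur->used += n;
        src += n;
        len -= n;
    }
    return 0;
}

/* When a packet is fully processed, mark every buffer in its chain free. */
static void packet_release(buffer_t *head)
{
    for (buffer_t *b = head; b; b = b->next)
        b->in_use = 0;
}

A packet then starts life as a single acquired buffer, grows by chaining more buffers as packet_write() crosses boundaries, and goes back to the pool in one packet_release() call once it has been fully sent or processed.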

With this type of system, though, you have to be able to send parts of a message, as hplus0603 mentioned. Say you had some compression scheme or encryption protocol that required the full message in a contiguous block of memory before it could be processed. A system like this wouldn't work out well for that, since you'd always have to take the extra step of copying all the data into one contiguous buffer and then process it.

If only a few packets actually needed that, it's not a problem, since you can just mark those packets so your code knows to do the extra copy. However, if most of your packets are like that and are usually larger than the buffer size you picked, then you'd probably want to reconsider. Copies shouldn't really be a problem in those cases, as discussed in this thread, but you'd still want a secondary pool of larger buffers to deal with them, which is certainly doable but adds more complexity to the system.

When it comes to encryption, I'd think most people would want to use a stream or block cipher for network stuff anyway, so using a pool of buffers in this manner should still cater to enough people to be useful. Likewise, there should be plenty of compression methods compatible with this setup, and having a larger pool of buffers for the remaining cases would address that as well if needed.
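Since a stream cipher keeps its own running state, it can be applied to each buffer in the chain in turn without ever needing the message in one contiguous block. A tiny sketch, with a toy XOR keystream standing in for a real cipher (the names and the "cipher" itself are purely illustrative):

#include <stddef.h>

/* Toy keystream state; a real stream cipher's state would go here. */
typedef struct { unsigned char key; } stream_state;

static void stream_apply(stream_state *st, unsigned char *data, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        data[i] ^= st->key;                           /* toy transform, NOT real crypto */
        st->key = (unsigned char)(st->key * 31 + 7);  /* advance keystream state */
    }
}

/* Encrypt a message that is split across several non-contiguous chunks;
 * the running state carries across chunk boundaries. */
static void encrypt_chunks(stream_state *st,
                           unsigned char *chunks[], size_t lens[], size_t n)
{
    for (size_t i = 0; i < n; ++i)
        stream_apply(st, chunks[i], lens[i]);
}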
