packet parsing

Started by
50 comments, last by hplus0603 16 years, 1 month ago
That C# class is pretty lame IMO. It's just a thin wrapper on the user calling "get these bits" and "put those bits." In fact, because it uses calls separately for reading and writing, you can't re-use the same code to read and write, so there's ample opportunity to get typo errors.

In C#, I would assume any marshaling library to use custom attributes to mark up members, and then use reflection to construct a recipe for marshaling to/from each compound type.

In C++, I find that the simplest thing to do is to write a "visit" function for each class, either as a member, or as a template function. You then pass different visitor instances into the visit function to do different things -- reading binary, writing binary, reading/writing XML, creating user interfaces or property pages, etc. The nice thing about that is that each visitor scales for free to each new data structure, and each data structure scales for free to each new visitor -- the NxM problem is broken down into N+M!

class PrintfVisitor {  public:    void Visit(int &i, char const *name) {      printf("%s=%d\n", name, i); // assuming no equals in name    }    void Visit(std::string &s, char const *name) {      printf("%s=%s\n", name, s.c_str()); // assuming no newlines in string    }};class BinaryInputVisitor {  public:    BinaryInputVisitor(char const *buf, size_t bufSize) {      buf_ = buf;      size_ = bufSize;    }    void Visit(int &i, char const *name) {      if (size_ < sizeof(int)) throw std::runtime_error("buffer underflow");      memcpy(&i, buf_, sizeof(int));      buf_ += sizeof(int);      size_ -= sizeof(int);    }    void Visit(std::string &s, char const *name) {      int l = 0;      Visit(l, 0);      if (size_ < l) throw std::runtime_error("buffer underflow");      s.resize(l);      memcpy(&s[0], buf_, l);      buf_ += l;      size_ -= l;    }    char const *buf_;    size_t size_;};struct MyStruct1 {  int i;  int j;};struct MyOtherStruct {  std::string name;  int x;  int y;  int z;};template<typename T, typename V> void Visit(T &instance, V &visitor);template<typename V> void Visit<MyStruct1, V>(MyStruct1 &instance, V &visitor) {  visitor.Visit(instance.i, "i");  visitor.Visit(instance.j, "j");}template<typename V> void Visit<MyOtherStruct, V>(MyOtherStruct &instance, V &visitor) {  visitor.Visit(instance.name, "name");  visitor.Visit(instance.x, "x");  visitor.Visit(instance.y, "y");  visitor.Visit(instance.z, "z");}   MyStruct1 ms1;   ms1.i = 3;   ms1.j = 17;   PrintfVisitor pv;   Visit(ms1, pv);


This mechanism allows you to write only a single set of marshaling code for each data structure, and then re-use it for a number of different needs (including binary input and output). If you want custom ranges for integers, floats, etc, then you can add that as arguments for the Visit() functions on the visitors.

You can also do a little bit of macro niceness to avoid writing too verbose code. Additionally, you can make your data structures derive from a common base, which will allow you to aggregate them and easily visit your aggregate members using the same pattern.

The next step is to take that into a script where you define your data structures, and have the script generate the header and source files, with the visitor data types still provided by you.

I have some code on the web: http://www.mindcontrol.org/~hplus/marshaling.html


[Edited by - hplus0603 on February 11, 2008 10:12:46 PM]
enum Bool { True, False, FileNotFound };
Advertisement
Quote:Original post by hplus0603
That C# class is pretty lame IMO. It's just a thin wrapper on the user calling "get these bits" and "put those bits." In fact, because it uses calls separately for reading and writing, you can't re-use the same code to read and write, so there's ample opportunity to get typo errors.

Yeah it's written as a communication layer to Flash AS3 (which only has int, uint, and Number(aka double)). Maintainability was the first thing I thought of. Setting it up as I did makes the communication convenient.

I just handle all my entities as having a SerializeDelta and SerializeFull which are used to keep track of things. (and a complex scripting/byte code handles the odds and ends)

Reading and writing are two separate operations. I'm not following you.

Hplus, your code looks interesting. How do you handle partial serialization and such? Like serializing structures for delta packets where only changed data is sent?

(note the biggest reason my code is different is because flash doesn't have generics/templates leading to fun an varied ways of handling things to make porting common server and client code a breeze (also maintaining it)).

//edit, I'm not gonna lie. I don't 100% understand what your code does.
The main gain of my code is that you only need to write code that knows about the data structure once, and then re-use it in many contexts. Those contexts could be reading, writing, generating GUIs, reporting, etc. I achieve this by writing one template class that knows about the members of a given data structure, and then calls a variant argument (a functor) with information about each member. An alternative would be, for example, to build an array of pointers-to-member with type information.

The main drawback of this method is that you need to pass all possible metadata about your members into each call. Thus, if there's certain kinds of information you need to generate a GUI from a struct (say, the name and valid range of something) and other data you need when you marshal in/out (say, the scope/relevancy of the field) then the template that calls the visitor has to pass all of that information, because it doesn't know what particular type of visitor is being used.

To do delta encoding, you can pass in a visitor that knows about the previously sent state of the object, and that visitor would only pack a field if it has changed (and set the appropriate bit in some flag).

If you trace through the code at the bottom that causes printf() of a struct, what really happens is that the Visit() function for the MyStruct1 type gets called, passing in a Printf visitor instance. The Visit() function will then, in turn, call Visit() on the printf visitor for each member in turn. The Printf visitor is set up to printf each member as it's being visited, in this case.

If you realistically wanted to extract deltas, then you would have to pass in the base of the data structure, as well as the member, to the visitor (i e, you'd have to add to the protocol of the visitor functions). That way, one kind of visitor could take a reference to the old data, and recover a pointer to the old data so it can be compared to the new data. Something like:

  template<typename Visitor> Visit<MyStruct1>(MyStruct1 &data, Visitor &v) {    v.Visit(&data, data.i, "i");    v.Visit(&data, data.j, "j");  }  template<typename Type>  class DeltaVisitor {    public:      DeltaVisitor(Type const &old, OutStream &os) : old_(old), os_(os), bits_(0), count_(0) {}      void Visit(void *base, int &i, char const *n) {        size_t offset = (char *)&i - (char *)base;        if (*(int*)((char *)&old_ + offset) != i) {          os_.writeInt(i);          bits_ |= (1 << count_);        }        ++count_;      }      void Commit() {        os_.prependBits(bits_);      }      Type &old_;      OutStream &os_;      Bitfield bits_;      size_t count_;    };


This requires the ability to prepend data to your output stream class, because you don't know what's changed until you examined all the members, and you haven't examined all the members until you've already written them. You can work around this by keeping two streams and copying data around, or deferring writes, if you prefer.

The main thing to "get" is that there is a protocol between the Visit<> template function and the Visitor template argument. A member of type "int" will call a certain function on the visitor, a member of type std::vector<> will call another function on the visitor. The function called contains all information that the visitor may need for reading, writing, or doing something else with the data.

Other embellishments may include the ability to tell the visitor when you start/stop sub-structures (if you allow containment).
enum Bool { True, False, FileNotFound };
Quote:Original post by hplus0603
The main gain of my code is that you only need to write code that knows about the data structure once, and then re-use it in many contexts. Those contexts could be reading, writing, generating GUIs, reporting, etc. I achieve this by writing one template class that knows about the members of a given data structure, and then calls a variant argument (a functor) with information about each member. An alternative would be, for example, to build an array of pointers-to-member with type information.

The main drawback of this method is that you need to pass all possible metadata about your members into each call. Thus, if there's certain kinds of information you need to generate a GUI from a struct (say, the name and valid range of something) and other data you need when you marshal in/out (say, the scope/relevancy of the field) then the template that calls the visitor has to pass all of that information, because it doesn't know what particular type of visitor is being used.


I don't know if that is the main drawback. Template-based serialization libraries have a history of being poor performers. If you post the code that would be used to marshall a list<int>, people could compare it to this program.

I don't dispute the N plus M that you mention upthread, but would like to point out that just N is possible. Not having to duplicate the data members is helpful in projects where classes have many data members.

I tried this link (once) and it didn't work.
http://www.mindcontrol.org/~hplus/marshaling.html


Brian Wood
Ebenezer Enterprises

[Edited by - wood_brian on March 14, 2008 11:42:53 AM]
Quote:Original post by wood_brian

I don't know if that is the main drawback. Template-based serialization libraries have a history of being poor performers. If you post the code that would be used to marshall a list<int>, people could compare it to this program.


Or, put differently, your library outperforms boost::serialization.

Let's not generalize too much. Boost::serialization is known to be probably one of the least popular boost projects which suffers from incredible bloat, poor performance and a few additional random problems.

Boost::serialization isn't primarily intended for network serialization.

What ultimately matters is the throughput. To put things in perspective. The serialization library I use achieves between 500 and 1500 Mb/sec throughput. The network connection will typically be 10Mb/sec. In worst case, serialization will consume 2% of CPU. Irnonically, despite gcc being poorer at template code generation, gcc version benchmarks run 5% faster than MSVC equivalent.

On server machine with 1GHz high-end RAM, the throughput is several times higher, falling into fractions of a percent.
Quote:Original post by Antheus
Quote:Original post by wood_brian

I don't know if that is the main drawback. Template-based serialization libraries have a history of being poor performers. If you post the code that would be used to marshall a list<int>, people could compare it to this program.


Or, put differently, your library outperforms boost::serialization.

Let's not generalize too much.


Fair enough, but I think what hplus proposes is similar to Boost Serialization.

Quote:
What ultimately matters is the throughput. To put things in perspective. The serialization library I use achieves between 500 and 1500 Mb/sec throughput. The network connection will typically be 10Mb/sec. In worst case, serialization will consume 2% of CPU. Irnonically, despite gcc being poorer at template code generation, gcc version benchmarks run 5% faster than MSVC equivalent.

On server machine with 1GHz high-end RAM, the throughput is several times higher, falling into fractions of a percent.


I won't argue with that, however, in some circles every little bit helps. Not taking an efficient approach in one area seems to lead to not taking efficient approaches in other areas. Before you know it the application is a pig.


Brian Wood
Quote:I think what hplus proposes is similar to Boost Serialization.


I don't think so. Here's how you would marshal a std::list<int>:

// method 1template<...> void Visit<MyData>(...) {  v.Visit(listMember, "listMember");}


The visitor needs to be specialized for list<N>:

// method 1class VisitorOutput {  template<typename T>  void Visit(std::list<T> &l, char const *name) {    Visit(l.size(), "l.size");    for (std::list<T>::iterator ptr = l.begin(), end = l.end();        ptr != end; ++ptr) {      Visit(*ptr, "*ptr");    }  }}class VisitorInput {  template<typename T>  void Visit(std::list<T> &l, char const *name) {    size_t size;    Visit(size, "l.size");    for (size_t i = 0; i < size; ++i) {      T elem;      Visit(elem, "*ptr");      l.push_back(elem);    }  }}


An alternative would be to do it using iterators, and a "shadow" count member, from the data structure point of view:

// method 2template<...> void Visit<MyData>(...) {  size_t size = listMember.size();  v.Visit(listMember, listMember.begin(), listMember.end(), size, "listMember");}


With the following visitors:

// method 2class VisitorOutput {  template<typename Container, typename Iter = typename Container::iterator>  void Visit(Container &cont, Iter ptr, Iter end, size_t &count, char const *name) {    Visit(count, "count");    while (ptr != end) { Visit(*ptr, "*ptr"); ++ptr; }  }}class VisitorInput {  template<typename Container, typename Iter = typename Container::iterator>  void Visit(Container &cont, Iter ptr, Iter end, size_t &count, char const *name) {    Visit(count, "count");    typename iterator_traits<Iter>::value_type elem;    while (count > 0) { Visit(elem, "*ptr"); cont.push_back(elem); --count; }  }}


You could also first create an empty element, and Visit() that in place, if your assignment operator is expensive. There are other ways to structure the visitor function <-> actor protocol, too -- these are just illustrative examples.


Quote:only N is possible


You can't get "N" while still retaining the ability to add different kinds of actors -- that's where the "M" comes in. If you have the "N" for the number of structs, then you have to have a way to change between marshaling to binary, or marshaling to XML, or marshaling to GUI. That's what the M is, and it's an unavoidable term.

The Perl code I posted makes it so that you can write the data members only once. Because of the problems with C++ syntax, however, there really is no useful way to mention the members only once both for declaration and marshaling without resorting to some kind of custom preprocessor (or #include a file with bunch of macros twice). Also, the Perl code I posted could easily be extended to handle collections transparently.

enum Bool { True, False, FileNotFound };
Quote:Original post by hplus0603
Quote:I think what hplus proposes is similar to Boost Serialization.


I don't think so. Here's how you would marshal a std::list<int>:

// method 1template<...> void Visit<MyData>(...) {  v.Visit(listMember, "listMember");}


The visitor needs to be specialized for list<N>:

// method 1class VisitorOutput {  template<typename T>  void Visit(std::list<T> &l, char const *name) {    Visit(l.size(), "l.size");    for (std::list<T>::iterator ptr = l.begin(), end = l.end();        ptr != end; ++ptr) {      Visit(*ptr, "*ptr");    }  }}class VisitorInput {  template<typename T>  void Visit(std::list<T> &l, char const *name) {    size_t size;    Visit(size, "l.size");    for (size_t i = 0; i < size; ++i) {      T elem;      Visit(elem, "*ptr");      l.push_back(elem);    }  }}



I'm not interested in the input stuff at the moment. If possible please post a version that outputs a list<int> to a file. Like this.


Brian Wood

[Edited by - wood_brian on March 14, 2008 11:34:48 AM]
Brian,

That's what VisitorOutput does, both in Method 1 and Method 2.

The amount of code you have to write per data structure is this line:

  v.Visit(listMember, "listMember");  // method 1


or this line:

  v.Visit(listMember, listMember.begin(), listMember.end(), listMember.size(), "listMember"); // method 2


Btw: I assume you agree that you need to write some amount of code for each kind of input or output you're doing, e g your code can't magically learn to output to XML without at least some adapter being written.

The benefit of the second method is that it doesn't have to be specialized per serial container, although I prefer method 1. Note that method 2 is slightly revisionist; looking at the code I posted before, the size doesn't need to be a reference, and thus doesn't need a separate variable and line of code.
enum Bool { True, False, FileNotFound };
Quote:Original post by hplus0603
Brian,

That's what VisitorOutput does, both in Method 1 and Method 2.

The amount of code you have to write per data structure is this line:

  v.Visit(listMember, "listMember");  // method 1


or this line:

  v.Visit(listMember, listMember.begin(), listMember.end(), listMember.size(), "listMember"); // method 2



Right. I was hoping you would pull the pieces together into one post. I think a BinaryOutputVisitor is another piece that would be needed. Below is the code from the link I mentioned previously. It stores some ints in a list<>, opens a file, does some setup stuff and then marshalls the data to the file. If possible, please post a parallel program. Then people can easily compare the two approaches.


#include <sys/time.h>#include <fcntl.h>#include <iostream>#include <memory>using namespace std;#include <Msgs.hh>intmain(){  list<int> lst;  for (int j = 1; j <= 20000; ++j) {    lst.push_back(j*3);    }  int fd = open("eeout", O_WRONLY);  if (fd < 0) {    cout << "open failed." << "\n";    return 0;  }  auto_ptr<Buffer> buf(new Buffer(8192));  buf->sock_ = fd;  timeval before_tv, after_tv;  struct timezone tz;  Msgs msgs;  gettimeofday(&before_tv, &tz);  msgs.Send(buf.get(), lst);   gettimeofday(&after_tv, &tz);  if (before_tv.tv_sec != after_tv.tv_sec) {    after_tv.tv_usec += 1000000;  }  long diff = after_tv.tv_usec - before_tv.tv_usec;  cout << "That took (microseconds) " << diff << endl;  return 1;}


Quote:
Btw: I assume you agree that you need to write some amount of code for each kind of input or output you're doing, e g your code can't magically learn to output to XML without at least some adapter being written.


I've managed to avoid XML thank G-d.

Brian Wood

This topic is closed to new replies.

Advertisement