# Serialization PHP - C++ (boost?)


## Recommended Posts

What I need to do is this: have a PHP file on a webserver do a POST with a binary payload to a FastCGI written in C++, and the C++ side should deserialize that payload into a known struct. I've been playing around with boost::serialization, but I can't seem to comprehend the binary format it generates (so I could hardcode the PHP side to produce a valid payload for the FastCGI to deserialize). And, less importantly: binary archives from boost::serialization always look like `serialization::archive ÃõH@`. Can I get rid of the "serialization::archive" part, which is obviously not part of the struct I just serialized? Thanks a lot for your help. I'd appreciate any other ideas to accomplish the same result :).

##### Share on other sites
What's the known struct?

##### Share on other sites
It can be anything, like...

struct whatever
{
    int a;
    int b;
    float c;
    std::vector<int> d;
};

(mainly structs consisting of basic types and collections of basic types)

I'm trying to implement something like an RPC or web service, and the struct would mainly be a way to pass all the parameters in binary form, so I could either just cast the payload to the struct I know it is, or deserialize it with boost::serialization, or something like that :).

Did I make myself clear?

Thanks for your time :)

##### Share on other sites
I see. In that case, I would suggest using an XML serialization format, and compressing it. XML is easier to parse (and more standard) than binary formats, yet not really heavier when compressed, plus boost should be able to read it.

Alternatively, you may design your own archive format for boost.

##### Share on other sites
I see.

XML is not an option for now. I'm actually doing all this to bypass using a webservice/SOAP/XML, etc.

This should be really fast :)

I'll take a look at making my own archive type. I hope it's not very complicated; otherwise I'll have to do something uglier, like casting :P.

##### Share on other sites
Quote:
 Original post by ElPeque2: I'll take a look at making my own archive type. I hope it's not very complicated; otherwise I'll have to do something uglier, like casting :P.

Casting wouldn't work, for endianness reasons. This is precisely what boost allows you to circumvent.

##### Share on other sites
Ice.

Quote:
 This should be really fast :)

In what way? You'll need to be on a gigabit network before PHP + C++ combined become a bottleneck.

But either way, the overhead of HTTP (edit) or the size of the payload will determine the cost, very rarely the encoding.

##### Share on other sites
Quote:
 Original post by Antheus: But either way, the overhead of HTML
HTTP?

##### Share on other sites
Would casting still be trouble if I will only ever be using x86 systems? And if I were to migrate, then everything would switch to x64 at once.

And it's interesting what you say about where the bottlenecks would be. My intuition tells me that passing XML around should be very inefficient (much more data, maybe compression, decoding, etc.). Any past experiences?

##### Share on other sites
Quote:
 Original post by ElPeque2: And it's interesting what you say about where the bottlenecks would be. My intuition tells me that passing XML around should be very inefficient (much more data, maybe compression, decoding, etc.). Any past experiences?

Let's say you are sending 3 ints over the wire. They take 12 bytes in binary encoding.

The POST looks like this:
POST /some_path HTTP/1.0
Content-type: multipart/form-data, boundary=ABCD

LKHsadfhiueakjhg
--ABCD

First, you need to MIME encode it, which will likely result in increased size of 4/3 (16 bytes).
Second, you need to form a valid POST request (100 bytes).
Third, you need to send this as TCP packet (120 bytes).
Then the IP header (140 bytes).

And suddenly, your compact, optimal, efficient 12 bytes resulted in 140 bytes on the wire.

HTTP isn't designed for efficiency, so unless you're sending proportionally enough data (tens of kilobytes per request), it will be horribly inefficient.

Even more, encoding data as text and passing it as URL parameters in a POST will be much more efficient in many cases, even if not binary encoded.

##### Share on other sites
I will potentially be passing 1000 int32s in the array, every time.

Maybe more.

##### Share on other sites
Quote:
 Original post by ElPeque2: I will potentially be passing 1000 int32s in the array, every time. Maybe more.

Rather than playing the "guess the problem" one post at a time, how about you simply say what you're doing, and then we can start discussing the real issues.

What are these ints? Why are you passing them in the first place? How about simply not passing them and saving the problem in the first place? Why PHP? Why FastCGI?

The performance/efficiency discussions are completely pointless, without knowing the exact problem to solve.

If all you're doing is passing the data, then ditch HTTP completely, and use something that has least overhead, perhaps even UDP-based transport, to avoid the issues with HTTP.

1000 ints is nothing. That's about 4 KB in binary, maybe 7-10 KB as text; they are in the same range. Someone capable of transferring 4 KB will be capable of transferring 10 KB.

##### Share on other sites
Quote:
Original post by Antheus
Quote:
 Original post by ElPeque2: I will potentially be passing 1000 int32s in the array, every time. Maybe more.

Rather than playing the "guess the problem" one post at a time, how about you simply say what you're doing, and then we can start discussing the real issues.

I'm sorry, you are right about that :).

Let's see.

Why PHP? Because the site using the "webservice" is PHP-based. Changing that is not an option.

Why 1000s of ints? The service provides AI capabilities. You can, for example, request that a list of products (integer IDs) be ordered by how much they appeal to a certain user.

The webservice itself is programmed in C/C++ (another prerequisite beyond my call), is deployed on one or many webservers, makes heavy use of threads, and connects to a distributed MySQL database system. Lots of caching, memcached, etc.

Right now, we already have that running with WSDL/SOAP over Apache/AXIS2C and the PHP5 SoapClient. We expect very high traffic, have few resources, and have good reasons to believe that we have a bottleneck in the transport.

Some simple tests we have done with FastCGI showed some promise (by bypassing the WSDL/SOAP stuff, which works like crap anyway).

Why CGI? Because it has shown some promise in tests we have done. Besides, it's got some support, and it runs under Apache, which is solid. We have not yet discarded bypassing that too and using sockets directly, but we don't want to reinvent the wheel.

I hope I was able to make the problem a little clearer. We are still pretty much in a research phase and don't have much experience in this area.

Thanks, I appreciate your time and advice :).

Edit PS: 1000 integers is just an example. It may be more. We have a long way ahead of benchmarking, redesigns, requirement changes, etc. But one thing is for sure: we are talking about some millions of requests a day. I don't know the exact numbers yet, but at that scale, any resources saved mean a lot of cash.

##### Share on other sites
Quote:
 Original post by ElPeque2: Why 1000s of ints? The service provides AI capabilities. You can, for example, request that a list of products (integer IDs) be ordered by how much they appeal to a certain user.

In this case, the cost of generating the list, and obtaining the IDs and ID-related information will dwarf the cost of serialization.

You have several options here.

One is to generate the end-view in the service itself. Rather than sending user IDs, you just generate the web page.

If you need to post-process the IDs, then keep in mind you'll need to request individual lookups for those 1000 IDs. For example, how will you handle 1000 recommendations if you also need to retrieve the image for each product? That will result in hundreds of pages of HTML.

SQL provides ranged queries for this very purpose. A 1000-product recommendation isn't practical, regardless of how "fast" it is. And even then, users tend to prefer the first or second link. With search engines, users get really annoyed if they have to hunt for their link on the second page.

The AI itself will be the part taking the biggest hit. Such queries will almost certainly require a pre-calculated index. So designing the system to return only relevant results in the first place will save you much more: not all-that-match, but best-10-matches.

Quote:
 Right now, we already have that running with WSDL/Soap over Apache/AXIS2C and PHP5 SoapClient. We expect very high traffic, and have little resources, and have good reasons to believe that we have a bottleneck with the transport.

Good reason? If you have a system running, attaching a profiler is trivial. If bottlenecks come from the transport, you may wish to look at the granularity of requests.

For example, sending 1 request per ID is horrible, but sending them in one batch is manageable. If you're not doing that already, then that's the first place to start looking for problems.

In addition, if this is synchronous RPC, then fine granularity of requests will be choking you. While the RPC approach works, and can scale, it's inherently less scalable than batch processing.

Even more, Google would be the place to look for ensuring scalability. Approaches such as map/reduce ensure scalable processing, even over low-reliability clusters.

Quote:
 Some simple tests with fastcgi we have done showed some promise (by bypassing the wsdl/soap stuff that works like crap anyway).

Actually, those have a high constant cost, but aren't that bad algorithmically. The high constant cost tends to show itself when using blocking calls.

If the traffic is hogging the network, then you have something to go on.

But assuming serialization is the bottleneck is tempting when faced with SOAP. As it turns out, unless you're really hogging the network, it will rarely be the key factor in scalability. YMMV.

##### Share on other sites
Quote:
 Original post by Antheus: In this case, the cost of generating the list, and obtaining the IDs and ID-related information, will dwarf the cost of serialization. You have several options here. One is to generate the end-view in the service itself: rather than sending user IDs, you just generate the web page. If you need to post-process the IDs, then keep in mind you'll need to request individual lookups for those 1000 IDs. For example, how will you handle 1000 recommendations if you also need to retrieve the image for each product? That will result in hundreds of pages of HTML. SQL provides ranged queries for this very purpose. A 1000-product recommendation isn't practical, regardless of how "fast" it is. And even then, users tend to prefer the first or second link. With search engines, users get really annoyed if they have to hunt for their link on the second page. The AI itself will be the part taking the biggest hit. Such queries will almost certainly require a pre-calculated index. So designing the system to return only relevant results in the first place will save you much more: not all-that-match, but best-10-matches.

I agree with all that. But unfortunately, this is a huge mutant that is already in production. We are a different team that was asked to do some AI stuff for them to use. So we just don't have access (for example) to the stock, prices, etc. that they use for a first filtering. So they have to "tell us" what products are available (and maybe related in some way according to their business logic), and we just sort them and return them.

At this point, we can't do anything about it.

They just "feed" our tables through our web interface with all the data mining stuff, and then ask us to rank products according to that "intelligence".

It's not nice, it doesn't make us proud :(, but hey... :P we are the "new guys", we don't decide how things are done.

So... given all these limitations, we still have to get as much juice out of it as we can :D

Quote:
 Original post by Antheus: Good reason? If you have a system running, attaching a profiler is trivial. If bottlenecks come from the transport, you may wish to look at the granularity of requests. For example, sending 1 request per ID is horrible, but sending them in one batch is manageable. If you're not doing that already, then that's the first place to start looking for problems. In addition, if this is synchronous RPC, then fine granularity of requests will be choking you. While the RPC approach works, and can scale, it's inherently less scalable than batch processing. Even more, Google would be the place to look for ensuring scalability. Approaches such as map/reduce ensure scalable processing, even over low-reliability clusters.

Interesting.

As I told you, we have little experience in this; I'll see what I can study about it. If you can recommend any material in particular, I would appreciate it :).

Quote:
 Original post by Antheus: But assuming serialization is the bottleneck is tempting when faced with SOAP. As it turns out, unless you're really hogging the network, it will rarely be the key factor in scalability. YMMV.

That is probably true. I think we'll find out. The problem is that at this point, most benchmarks we can make are synthetic (inserting delays, programs that automatically make requests, fake data, etc.). And the requirements are changing every day. [This is a nightmare!!! :P]

Thanks for your insight, it's been enlightening.

the booze is on me :).

##### Share on other sites
In case anybody ever needs to do something like I was asking about: I was able to send a binary payload from PHP to a FastCGI (C++).

<?php
$r = new HttpRequest('http://localhost/fastcgi/fastcgi.exe', HTTP_METH_POST);
$data = pack("vVVVcccVVVVV", 0, 1, 2, 3, 99, 99, 99, 3, 3, 4, 5, 6);
$r->setRawPostData($data);
try {
    print $r->send()->getBody();
} catch (HttpException $e) {
    print $e;
}
?>

And then the fastcgi was able to correctly deserialize that payload to a struct.

Thanks again to Antheus and ToohrVyk