At my day job, we solve problems that most indies don't have, like "Oh, our gigabit network link is saturating, how can we route more fiber into our data center?" or "we're running out of space on our 50 terabyte storage cluster; can we get new drives from Dell in time?"
One of the things we do to manage this challenge is Continuous Deployment -- as soon as a code commit is done and tests pass, we deploy it in production. Why wait? Get feedback quicker!
Part of continuous deployment is an immune system, which detects when things "go pear shaped" in response to deploying new code, and automatically rolls back the latest commit.
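To make the "immune system" idea concrete, here is a minimal sketch of one way such a check could work. It is not how our system is actually implemented: the `Baseline` struct, the `decide` function, and the 3-standard-deviation threshold are all assumptions made up for illustration.

```cpp
#include <cmath>

// Hypothetical summary of a metric before the deploy (names are illustrative).
struct Baseline {
    double mean;
    double stddev;
};

enum class Action { Keep, RollBack };

// True when `observed` is more than `k` standard deviations from the baseline mean.
// The default k = 3.0 is an assumed threshold, not a recommendation.
inline bool looksAnomalous(const Baseline& b, double observed, double k = 3.0) {
    return std::fabs(observed - b.mean) > k * b.stddev;
}

// Decide what to do with the latest commit, given a metric sampled after deploy.
inline Action decide(const Baseline& b, double observedAfterDeploy, double k = 3.0) {
    return looksAnomalous(b, observedAfterDeploy, k) ? Action::RollBack : Action::Keep;
}
```

A real detector would presumably look at many counters and at several samples in a row before rolling anything back; this only shows the shape of the decision.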
To do that detection with high accuracy, short detection time, and a low number of false positives, we needed better instrumentation of our application and data center than we could get through existing monitoring tools like Graphite, Zabbix, OpenTSDB, Cacti, etc. Specifically, we needed a very high rate of sampling (every 10 seconds), a very long retention time (10 days for 10 second data, 6 years for downsampled data) and statistics in real time about the sampling (mean, standard deviation, min, max, for each sample).
Because we couldn't find anything that did this, we wrote it ourselves, using boost::asio and C++. And we've released it as open source! "istatd" has been used for the last year to gather, report, and chart metrics for over 500,000 counters (including the 3 different retention intervals, so 130,000 names) every 10 seconds. It rocks!
And, because it uses boost::asio, and does high-throughput networking and I/O, it might be useful for those who want to dive into that kind of network programming with a working, real-life example. Or it might be useful if you just want to track a bunch of counters over time. And if you only have three counters you care about, it'll still work, and draw pretty charts for you; it only uses as many resources as it needs :-)
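As a rough illustration of the per-sample statistics mentioned above, here is a small sketch of a bucket that accumulates values over one interval and reports mean, standard deviation, min, and max. This is not istatd's actual code or storage format; `Bucket`, `bucketStart`, and their members are hypothetical names, and the standard deviation here is the population form computed from a running sum of squares.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical accumulator for one counter over one sampling interval.
struct Bucket {
    std::uint64_t count = 0;
    double sum = 0.0;
    double sumSq = 0.0;
    double minValue = std::numeric_limits<double>::infinity();
    double maxValue = -std::numeric_limits<double>::infinity();

    // Record one observation.
    void add(double v) {
        ++count;
        sum += v;
        sumSq += v * v;
        minValue = std::min(minValue, v);
        maxValue = std::max(maxValue, v);
    }

    double mean() const { return count ? sum / count : 0.0; }

    // Population standard deviation; clamped at zero to absorb rounding error.
    double stddev() const {
        if (count == 0) return 0.0;
        const double m = mean();
        const double variance = sumSq / count - m * m;
        return variance > 0.0 ? std::sqrt(variance) : 0.0;
    }

    // Fold another bucket into this one, e.g. when downsampling
    // several 10-second buckets into one coarser bucket.
    void merge(const Bucket& other) {
        count += other.count;
        sum += other.sum;
        sumSq += other.sumSq;
        minValue = std::min(minValue, other.minValue);
        maxValue = std::max(maxValue, other.maxValue);
    }
};

// Round a timestamp (in seconds, assumed non-negative) down to the start of its interval.
inline std::int64_t bucketStart(std::int64_t t, std::int64_t interval = 10) {
    return t - t % interval;
}
```

Keeping count, sum, and sum of squares means coarser buckets can be built by merging finer ones without keeping the raw values around, which is one plausible way to support both the 10-second data and the downsampled data.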
Check it out at http://github.com/imvu-open/istatd/wiki
Also, I wrote a blog post describing the background: http://engineering.imvu.com/2012/09/26/continuous-monitoring-real-time-statistics-for-a-thousand-servers-and-the-application-they-serve/
Edited by hplus0603, 27 September 2012 - 11:25 AM.