
Surge 2012 Recollections

Mark Mzyk | October 8, 2012

I attended Surge 2012 as one of Opscode's representatives; here are my (slightly belated) thoughts from the conference.

The list of sessions can be found here.

The opening keynote was given by a Forrester analyst, who presented four general trends that he said make scalability and performance more important than ever. This talk was forgettable. It was entirely non-technical and covered items such as wearable computing, a trend Forrester sees coming into its own in 2017. I think the intent was for the talk to give the entire weekend a theme and relate it back to the customer. That is my charitable interpretation, however. Apparently the analyst, Mike Gualtieri, is the person responsible for coining the term NoOps, but this talk didn’t touch on that at all.

I attended the Build Your Own Database: BerkeleyDB Java Edition at Yammer talk. Unfortunately it did not live up to its lofty title. I was hoping for a talk on database theory and how Yammer used it to build a database system. Instead the talk was about why Yammer opted to use BerkeleyDB and build tooling around it for their use case. Their use case turned out to be that they didn’t want to shard Postgres. Yet they basically built a sharded system on top of BerkeleyDB, and the speaker admitted as much in the Q&A period. I assume Yammer had their reasons for building on top of BerkeleyDB, but it wasn’t clear from the talk that sharding Postgres might not have worked out just as well, with more community to rely on.

Bryce Howard’s talk, despite its long title and a poor comparison of the internet to pipes and water, was a good one. It focused on the HTTP and SSL stack and the handshake process. Bryce walked through how he debugged the network layer and examined the handshakes that were occurring. By analyzing the process, Bryce tuned his system so it went from 3 or 4 handshakes down to 1, shaving seconds off of each request his system had to make, bringing request times in line with what his customers needed and saving him from having to look at other costly options. It was a good talk to highlight how understanding a domain can bring significant improvement to a system.
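A back-of-the-envelope sketch makes the savings plausible. The round-trip time and handshake counts below are my assumptions, not numbers from the talk: each brand-new HTTPS connection pays TCP setup plus a full TLS handshake before any data moves, while a single reused connection pays that cost once.

```python
# Illustrative cost model only – not measurements from Bryce's system.
RTT_MS = 150.0  # assumed round-trip time between client and server

def request_time_ms(new_connections, requests, rtt=RTT_MS):
    """Rough model: TCP (1 RTT) plus a full TLS handshake (2 RTTs) per new
    connection, then 1 RTT per request once a connection exists."""
    handshake_cost = new_connections * (1 + 2) * rtt
    request_cost = requests * rtt
    return handshake_cost + request_cost

naive = request_time_ms(new_connections=4, requests=4)  # handshake every time
tuned = request_time_ms(new_connections=1, requests=4)  # one reused connection
# naive = 2400.0 ms vs tuned = 1050.0 ms – seconds saved per batch of requests
```

Even with made-up numbers, cutting four handshakes down to one removes over half the total latency, which matches the shape of the improvement Bryce described.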

Tom Daly, of DynDNS, gave a talk on anycast. This was interesting because it dove into the deep end of DNS and showed how DNS providers (and others who might want to take advantage) fool the DNS network by assigning the same IP to multiple machines and taking advantage of routing rules to ensure that DNS lookups go to the geographically closest machine. It was a technically in-depth talk that again showed how mastery of a domain can yield significant performance improvements. It also proved to me just how little I know about operations, and specifically about the network.
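The core trick can be illustrated with a toy model (this is not real BGP, and the site names and path costs are invented): every site announces the same IP address, and routing simply delivers each client's query to whichever announcing site is cheapest to reach.

```python
# Toy illustration of anycast routing – invented sites and path costs.
# Hypothetical path costs from one particular client to each site:
SITES = {"us-east": 10, "eu-west": 40, "ap-south": 80}

def resolve_anycast(announced_sites, path_cost):
    # All sites announce the same IP, so the client doesn't choose anything;
    # routing delivers the packet to the lowest-cost announcing site.
    return min(announced_sites, key=path_cost.get)

# For this client, queries land at "us-east"; a client in Europe would have
# different path costs and land at "eu-west" with no DNS changes at all.
```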

I opted to give the Insights into Enterprise Application Management talk a go. It was forgettable. The talk might have been relevant 10 years ago, but not really today. The best insight was that garbage collection causes pauses in processing, along with some tuning that can be done to mitigate this as much as possible.

Artur Bergman, in Mysteries of a CDN Explained, went deep into the workings of, obviously, CDNs. The long and short of it is that CDNs provide value by acting as caches close to the customer, and it’s worth it to cache all the things. Even if someone doesn’t have a lot of content they want to cache, a CDN can still provide value by serving as a single entry point into a network: if many requests are coming in, they can be routed through the CDN, which can act as a throttle, sending one request on to the origin, caching the result for a short time (say a minute), and serving the answer back out for that minute. This can be a lifesaver during a high-load event. The most entertaining part of Artur’s talk was when he claimed anycast DNS was shit while Tom Daly was sitting in the room. During Q&A Tom got Artur to admit his tests were not the best, but Artur wouldn’t fully back off his statement.
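That throttling behavior can be sketched as a short-TTL cache sitting in front of the origin. This is a hypothetical model of the idea, not how any particular CDN is implemented:

```python
import time

class ShortTTLCache:
    """Sketch of the 'CDN as throttle' idea: a burst of identical requests
    produces only one origin hit per TTL window."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self.fetch = fetch        # function that actually hits the origin
        self.ttl = ttl_seconds    # e.g. cache each answer for a minute
        self.clock = clock        # injectable clock, handy for testing
        self._cache = {}          # key -> (expires_at, value)

    def get(self, key):
        now = self.clock()
        hit = self._cache.get(key)
        if hit and hit[0] > now:
            return hit[1]         # served from cache: the origin never sees it
        value = self.fetch(key)   # only one origin request per TTL window
        self._cache[key] = (now + self.ttl, value)
        return value
```

During a high-load event, a thousand requests for the same hot URL in one minute become a single request to the origin, which is exactly the life-saving property Artur described.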

During the lightning talks, Theo Schlossnagle taught everyone how to tie their shoes, someone showed us what beer brewing looks like, and Bryan Cantrill led a Linux forensics session tracing the history and use of a vestigial Linux program that still lives on in the operating system but would never be used for anything today. The lesson: software hangs around much longer than you probably realize.

For day two, the first talk was Paul Hinze and Tony Pitluga of Braintree talking about how they achieve high availability in their datacenter. It was interesting in that they talked about the tools they use and how they avoid traditional alerting of the form “X has gone outside the normal range.” Instead they prefer a holistic view of their machines and have written custom tools to provide this. The main tool is Litmus Paper, which is how they configure a node’s health. Health currently works by examining multiple attributes on a node and using weights to assign the node an overall health score. The load balancer can then monitor this number and send traffic only to healthy nodes. They did mention that they had to do some tuning to this system to avoid the thundering herd problem. Currently the system is simplistic in that it doesn’t take many factors into account when determining a node’s health, but they plan to keep refining it as they learn more. In addition to Litmus Paper, they make use of several other open source tools, such as Pacemaker, to make the system work. A big takeaway from this talk for me was that we should consider looking at our monitoring and figuring out how to put a more holistic system in place. Just because a node is under load, for instance, doesn’t mean something is wrong and needs action – or if it does, why can’t the system adjust itself? It will take baby steps, but I think there is progress that can be made here.
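The weighted-health idea can be sketched like this. The attribute names, weights, and threshold notion are my invention for illustration, not Braintree’s actual Litmus Paper configuration:

```python
def health_score(metrics, weights):
    """Combine per-attribute checks (each scored 0.0-1.0) into one number."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * metrics.get(name, 0.0) for name in weights)
    return weighted / total_weight

# Invented attributes and weights; a real config would reflect the service.
weights = {"cpu_idle": 2, "disk_free": 1, "service_up": 4}
metrics = {"cpu_idle": 0.6, "disk_free": 0.9, "service_up": 1.0}
score = health_score(metrics, weights)  # (1.2 + 0.9 + 4.0) / 7 ≈ 0.87
# A load balancer could then route traffic only to nodes scoring above
# some threshold, rather than alerting on any single attribute.
```

The appeal is that no single noisy metric drives the decision: a loaded CPU only matters in proportion to its weight and in combination with everything else.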

Bryan Cantrill and Brendan Gregg were up next. They spoke on DTrace and how they used it to debug system problems at a very low level in Joyent’s SmartOS. This was one of the best talks of the conference, and it was amazing to see them work in a system they clearly knew intimately. They did reveal places where the OS was the problem, but I think it is still safe to assume, on a day-to-day basis, that the OS is not the problem. It did highlight how, as the world gets ever more interconnected and systems are expected to be ever faster, problems in the OS can have a ripple effect up to the application level.

This talk had one truly fantastic component: since Bryan and Brendan were working at such a low level, they were graphing everything – every single call through the system (in this case, to access a file) and how long it took. But they didn’t just look at average time. They purposefully created graphs that would plot all the points and then highlight outliers. This showed things such as a system where the average call was sub-second, but the outliers would take ten seconds or more. Investigating the path of the outliers revealed flaws in the system. Another graph they showed was the heat map. This was great in that it showed roughly all the calls in the system, with the ones that took the most time in a darker color. At a glance this showed the average performance of the system but also allowed potential outliers to stand out. A simple average graph loses all this data. This opened my eyes again to how we’re generally approaching monitoring wrong. We need to step back and look at systems more holistically. This is the next realm that tools need to evolve into.
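The point about averages hiding outliers is easy to demonstrate. The latencies below are made up, not data from the talk:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100.0)
    return ordered[rank - 1]

# A mostly sub-second system with a few ten-second-plus stragglers:
latencies = [0.2] * 97 + [10.0, 12.0, 15.0]   # seconds
average = sum(latencies) / len(latencies)      # ~0.56s – looks healthy
p99 = percentile(latencies, 99)                # 12.0s – exposes the stragglers
```

The average says the system is fine; the 99th percentile (or a full scatter plot or heat map, as Bryan and Brendan used) says three percent of callers are having a terrible time, and those are exactly the requests worth investigating.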

The next talk was Xtreme Deployment by Theo Schlossnagle. I’ll be honest, I don’t remember a lot from this talk, except I know I enjoyed it at the time. I do recall thinking that Circonus is a really nice tool and puts Nagios to shame. Given that you have to pay for Circonus, there’s probably a reason it is light years beyond Nagios.

The next session after lunch was Pedro Canahuati from Facebook, but I only caught the tail end of it where he was describing some of Facebook’s monitoring systems. I was late because I opted to explore Baltimore some after lunch. I hate travelling to a city and then not seeing any of it and Baltimore had a book fair going on that day just a few blocks from the conference.

Matt Graham of Etsy gave a talk. Basically, they deploy a lot. This reduces risk because changes are always small, so they can pinpoint bugs quickly and then deploy quickly to fix them. They measure bug-hours – how long a bug stays alive. With continuous deployment they can cut this down drastically because they’re always deploying.
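The metric itself is just the total time bugs spend alive. This is a sketch with invented timestamps, not Etsy’s actual measurement code:

```python
from datetime import datetime

def bug_hours(bugs):
    """Total hours bugs were alive; bugs is a list of
    (introduced_at, fixed_at) pairs."""
    return sum((fixed - born).total_seconds() / 3600.0 for born, fixed in bugs)

# Two hypothetical bugs: one alive for 3 hours, one caught and fixed
# within 30 minutes because a small deploy made it easy to pinpoint.
bugs = [
    (datetime(2012, 10, 1, 9, 0), datetime(2012, 10, 1, 12, 0)),
    (datetime(2012, 10, 2, 9, 0), datetime(2012, 10, 2, 9, 30)),
]
# bug_hours(bugs) -> 3.5; smaller, more frequent deploys shrink this number
```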

Baron Schwartz gave a talk on Automated Fault detection, but at this point I was pretty fried and had quit taking notes. My apologies to Baron, but I don’t recall much from this talk, other than thinking it was all stuff I had heard before (which, I should point out, doesn’t mean it wasn’t good – just because I’ve heard it doesn’t mean everyone at the talk has, or that I couldn’t stand to hear it again).

The conference concluded with Theo giving a final keynote and parting words, which, if I recall correctly, were that we should look at other professions and see how they handle failure, because much of what we’re learning in DevOps has been done before. We just need to look past our own walls and see what the larger world is teaching us. My apologies to Theo if that isn’t what he said, but that’s what I’m recalling right now, although I might be mixing it up with Chris Brown’s talk from Velocity Europe that I just watched.

All in all, Surge was a great conference for showing me just how much I don’t know and highlighting directions the industry is moving in, which is both more specialized and more generalized at the same time. It’s clear that as systems get ever larger and more complicated that communication is going to become ever more important, both from a human to human and from a machine to human perspective. DevOps is just the start.

For another perspective on Surge 2012, @obfuscurity’s write up is here.