Surge 2012 Recollections

Mark Mzyk October 8, 2012

As one of the representatives of Opscode at Surge 2012, I have some (slightly belated) thoughts from having attended the conference.

The list of sessions can be found here.

The opening keynote was given by an analyst from Forrester who presented four general trends that he said made scalability and performance more important than ever. This talk was forgettable. It was entirely non-technical and covered items such as wearable computing, a trend that Forrester saw coming into its own in 2017. I think the intent was for the talk to give the entire weekend a theme and relate it back to the customer. That is my charitable interpretation, however. Apparently the analyst, Mike Gualtieri, is the person responsible for coining no-ops, but this talk didn’t touch on that at all.

I attended Build Your Own Database: BerkeleyDB Java Edition at Yammer. Unfortunately this talk did not live up to its lofty title. I was hoping for a talk on DB theory and how Yammer used that to build a database system. Instead this talk was about why Yammer opted to use BerkeleyDB and build tooling around it for their use case. Their use case turned out to be that they didn’t want to shard Postgres. Yet they basically built a sharded system on top of BerkeleyDB, and the speaker admitted as much in the Q&A period. I assume Yammer had their reasons for building on top of BerkeleyDB, but it wasn’t clear from the talk that sharding Postgres might not have worked out just as well, with more community to rely on.

Bryce Howard’s talk, with a long title and a poor comparison of the internet to pipes and water, was a good talk. It focused on the HTTP and SSL stack and the handshake process. Bryce went through how he debugged the network layer and looked at the handshakes that were occurring. By analyzing the process, Bryce tuned his system so it went from 3 or 4 handshakes to 1, shaving seconds off of each request his system had to make, bringing the request time in line with what his customers needed and saving him from having to look at other costly options. It was a good illustration of how understanding a domain can bring significant improvement to a system.

Tom Daly, of DynDNS, did a talk on AnyCast. This was interesting because it dove into the deep end of DNS and showed how DNS providers (and others who might want to take advantage) are fooling the DNS network by assigning the same IP to multiple machines and taking advantage of routing rules to ensure that DNS lookups go to the geographically closest machine. It was a technically in-depth talk that again showed how mastery of a domain can give significant performance improvements. It also proved to me just how little I know about operations and specifically about the network.

I opted to give the Insights into Enterprise Application Management talk a go. It was forgettable. The talk might have been relevant 10 years ago, but not really today. The best insight was that garbage collections cause a pause in processing and that some tuning can be done to mitigate this as much as possible.

Artur Bergman in Mysteries of a CDN Explained went deep into the workings of, obviously, CDNs. The long and short of it is that CDNs provide value by acting as close caches to the customer and it’s worth it to cache all the things. Even if someone doesn’t have a lot of content they want to cache, CDNs can still provide value by serving as a single entry point into a network. When many requests are coming in, they can be routed through the CDN, which acts as a throttle: it sends one request on, caches the result for a short time (say a minute), and serves that answer back out for that one minute. This can be a life saver during a high load event. The most entertaining part of Artur’s talk was where he claimed AnyCast DNS was shit while Tom Daly was sitting in the room. During Q&A Tom would get Artur to admit his tests were not the best, but Artur wouldn’t fully back off his statement.
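
The throttling idea can be sketched in a few lines. This is only an illustration of short-lived caching, not how any particular CDN works: the `MicroCache` class, its `ttl` and `clock` parameters, and the `/popular` key are all made up for the example, and it is single-threaded, so it does not show how a real CDN collapses concurrent requests.

```ruby
# A tiny time-based cache: the first request for a key calls the origin,
# and later requests within the TTL are served from memory.
class MicroCache
  Entry = Struct.new(:value, :expires_at)

  # ttl:   seconds to keep an answer (60 mirrors the "say a minute" above)
  # clock: callable returning the current time in seconds (injectable for tests)
  def initialize(ttl: 60, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl
    @clock = clock
    @entries = {}
  end

  # Returns the cached value for key, or runs the block (the "origin")
  # once and caches what it returns.
  def fetch(key)
    now = @clock.call
    entry = @entries[key]
    return entry.value if entry && entry.expires_at > now

    value = yield
    @entries[key] = Entry.new(value, now + @ttl)
    value
  end
end

# Simulate a burst of 1000 requests for the same resource.
origin_hits = 0
cache = MicroCache.new(ttl: 60)
1000.times { cache.fetch("/popular") { origin_hits += 1; "page body" } }
puts "origin was hit #{origin_hits} time(s)"
```

During the burst the origin block runs once and every other request is answered from the cache, which is the load-shedding effect described above.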

During the lightning talks, Theo Schlossnagle taught everyone how to tie their shoes, someone showed us what beer brewing looks like, and Bryan Cantrill led a Linux forensics session tracing the history and use of a vestigial Linux program, ta, that still lives on in the operating system but would no longer ever be used for anything. The lesson: software hangs around much longer than you probably realize.

For day two, the first talk was Paul Hinze and Tony Pitluga of Braintree talking about how they achieve HA in their datacenter. It was interesting in that they talked about the tools they use and how they avoid traditional alerting of the form “X has gone outside the normal range.” Instead they prefer a holistic view of their machines and have written custom tools to provide this. The main tool is Litmus Paper, which is how they determine a node’s health. Health is currently computed by examining multiple attributes on a node and using weights to assign the node an overall health score. The load balancer can then monitor this number and send traffic only to healthy nodes. They did mention that they had to do some tuning to this system to avoid the thundering herd problem. Currently the system is simplistic in that it doesn’t take many factors into account in determining a node’s health, but they plan to keep refining the system as they learn more. In addition to Litmus Paper, they make use of several other open source tools, such as Pacemaker, to make the system work. A big takeaway from this talk for me was that we should consider looking at our monitoring and figuring out how to put in place a more holistic system. Just because a node is under load, for instance, doesn’t mean something is wrong and needs action. And if it does, why can’t the system adjust itself? It will take baby steps, but I think there is progress that can be made here.
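
To make the weighted-health idea concrete, here is a rough sketch. It is not Litmus Paper’s actual implementation or API; the `Check` struct, the 0 to 100 score scale, the weights, and the threshold of 50 are all assumptions chosen for illustration.

```ruby
# Hypothetical checks: each reports a score from 0 (failing) to 100 (fine)
# and carries a weight saying how much it matters to overall health.
Check = Struct.new(:name, :weight, :score)

# Weighted average of the check scores, rounded to an integer from 0 to 100.
def node_health(checks)
  total_weight = checks.sum(&:weight)
  return 0 if total_weight.zero?

  (checks.sum { |c| c.weight * c.score }.to_f / total_weight).round
end

# A load balancer could route only to nodes at or above a threshold.
def healthy_nodes(nodes, threshold: 50)
  nodes.select { |_name, checks| node_health(checks) >= threshold }.keys
end

nodes = {
  "web1" => [Check.new("http", 3, 100), Check.new("cpu", 1, 40)],
  "web2" => [Check.new("http", 3, 0),   Check.new("cpu", 1, 100)]
}

nodes.each { |name, checks| puts "#{name}: #{node_health(checks)}" }
puts "routable: #{healthy_nodes(nodes).inspect}"
```

Here a busy CPU alone does not pull web1 out of rotation, while a failing HTTP check on web2 does, because the HTTP check carries more weight.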

Bryan Cantrill and Brendan Gregg were up next. They spoke on DTrace and how they used it to debug system problems at a very low level in Joyent’s SmartOS. This was one of the best talks of the conference and it was amazing to see them work in a system they clearly knew intimately. They did reveal places where the OS was the problem, but I think it is still safe to assume, on a day to day basis, that the OS is not the problem. It did highlight how, as the world gets ever more interconnected and systems are expected to be ever faster, problems in the OS can have a ripple effect up to the application level.

This talk had one truly fantastic component to it: since Bryan and Brendan were working at such a low level, they were graphing everything. Every single call through the system (in this case to access a file) and how long it took. Then they graphed it. But they didn’t just look at average time. They purposefully created graphs that would plot all the points and highlight outliers. This showed things such as a system where the average call was sub-second, but the outliers would take ten seconds or more. Investigating the path of the outliers showed flaws in the system. Another graph they showed was the heat map. This was great in that it showed roughly all the calls in the system, but the ones that took the most time were a darker color. At a glance this showed the average performance of the system but also allowed potential outliers to stand out. A simple average graph loses all this data. This opened my eyes again to how we’re generally approaching monitoring wrong. We need to step back and look at systems more holistically. This is the next realm that tools need to evolve into.
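
A small sketch shows why an average hides what a full distribution reveals. The latency numbers are invented, and the helper names (`mean`, `percentile`, `decade_buckets`) are made up for this example; the bucketing is only a crude stand-in for one column of a heat map.

```ruby
# Hypothetical latencies in milliseconds: 98 fast calls and 2 very slow ones.
latencies_ms = Array.new(98, 20) + [10_000, 12_000]

# Arithmetic mean of the samples.
def mean(samples)
  samples.sum.to_f / samples.size
end

# Nearest-rank percentile (p between 0 and 100).
def percentile(samples, p)
  sorted = samples.sort
  rank = (p * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Count samples per power-of-ten bucket (10, 100, 1000, ...).
# Assumes positive integer milliseconds.
def decade_buckets(samples)
  samples.group_by { |ms| 10**(ms.to_s.length - 1) }
         .transform_values(&:size)
end

puts "mean:    #{mean(latencies_ms)} ms"
puts "median:  #{percentile(latencies_ms, 50)} ms"
puts "p99:     #{percentile(latencies_ms, 99)} ms"
puts "max:     #{latencies_ms.max} ms"
puts "buckets: #{decade_buckets(latencies_ms).inspect}"
```

The mean lands at a few hundred milliseconds, a value no single call actually took, while the median, the high percentiles, and the bucket counts make the two slow calls impossible to miss.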

The next talk was Xtreme Deployment by Theo Schlossnagle. I’ll be honest, I don’t remember a lot from this talk, except I know I enjoyed it at the time. I do recall thinking that Circonus is a really nice tool and puts Nagios to shame. Given that you have to pay for Circonus, there’s probably a reason it is light years beyond Nagios.

The next session after lunch was Pedro Canahuati from Facebook, but I only caught the tail end of it where he was describing some of Facebook’s monitoring systems. I was late because I opted to explore Baltimore some after lunch. I hate travelling to a city and then not seeing any of it and Baltimore had a book fair going on that day just a few blocks from the conference.

Matt Graham of Etsy gave a talk. Basically, they deploy a lot. This reduces risk because changes are always small, so they can pinpoint bugs quickly and then deploy quickly to fix them. They measure bughours: how long a bug is alive. With continuous deployment they can cut this down drastically because they’re always deploying.

Baron Schwartz gave a talk on Automated Fault detection, but at this point I was pretty fried and had quit taking notes. My apologies to Baron, but I don’t recall much from this talk, other than thinking it was all stuff I had heard before (which, I should point out, doesn’t mean it wasn’t good – just because I’ve heard it doesn’t mean everyone at the talk has, or that I couldn’t stand to hear it again).

The conference concluded with Theo giving a final keynote and parting words, which, if I recall correctly, were that we should look at other professions and see how they handle failure, because much of what we’re learning in DevOps has been done before. We just need to look past our own walls and see what the larger world is teaching us. My apologies to Theo if that isn’t what he said, but that’s what I’m recalling right now, although I might be mixing it up with Chris Brown’s talk from Velocity Europe that I just watched.

All in all, Surge was a great conference for showing me just how much I don’t know and highlighting directions the industry is moving in, which is both more specialized and more generalized at the same time. It’s clear that as systems get ever larger and more complicated, communication is going to become ever more important, both from a human to human and from a machine to human perspective. DevOps is just the start.

For another perspective on Surge 2012, @obfuscurity’s write up is here.

Triangle DevOps

Mark Mzyk June 23, 2012


DevOps is not a role. It is not a title. Fundamentally, DevOps is an idea and a way of behaving. It is the belief that development and operations are not domains that have a wall between them, but that development and operations work best when both sides share ideas and understand the concerns and patterns of the other. In practice this can take many forms. Just as everyone interprets and implements agile practices differently, it is the same with DevOps. There is no right or wrong way, because the only way is continually striving to get better.

Sometimes an opportunity comes along and you don’t think much about it. You jump in and start working before you’ve realized what you’ve done.

That’s the situation I find myself in now that I’ve taken over as organizer of the Triangle DevOps meetup. Without much thought I agreed to take over the group. Now I have an obligation to do my best to make the group succeed, because the Triangle has a wealth of DevOps talent and it needs a place to showcase it.

I will do my best to make the group a fun and inclusive place for the exchange of ideas and the development of talks. The group is going to meet on the third Wednesday of the month at 7pm and is going to rotate between three hosting sites: Bronto in Durham, Teradata/Aprimo in Cary, and WebAssign in Raleigh.

DevOps has two parts: Development and Operations. Each part should be interpreted as broadly as possible. I welcome talks on anything that can even be tangentially related to DevOps. The main criterion for a talk is simple: does it contain an idea that will help to make us better? It is only through exposure to new ideas and new ways of thinking that we grow. I can’t know which idea will help someone grow, so the best strategy is to present as many as possible.

It is also a truth that anyone can have an idea. While it’s great to listen to polished speakers present ideas and we should all listen to them, it isn’t true that they are the only ones with good ideas. Anyone will be welcome to present at Triangle DevOps, even if they’ve never given a talk before. Only by giving that first talk can someone start on the road to becoming a polished speaker themselves.

Most of all Triangle DevOps is about always getting better. Better talks, better ideas, better community. Talks and ideas follow from community. Without community there is no Triangle DevOps. I invite you to come out and participate. Share your ideas. You might learn that small trick that makes your life easier. And that would be great. That would make the group a smashing success.

Gemspec: Loading Dependent Gems Based On The User’s System

Mark Mzyk May 21, 2012

It’s a scenario that shouldn’t be hard. When installing a gem, have that gem load dependent gems based on what state the system is in.

Yet RubyGems provides no mechanism for doing this. You won’t find mention of it on RubyGems.org.

When you build a gem, your gemspec is executed at build time, on your machine. When the user installs the gem, nothing that could react to the user’s system is run.
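
A short sketch makes the limitation visible. The gem name, version, and author below are placeholders, and the conditional is the kind of thing one might be tempted to write in a gemspec; it is evaluated on the machine that builds the gem, and only the outcome ends up in the packaged metadata.

```ruby
require "rubygems"

# Builds a hypothetical spec the way a gemspec file would.
def example_spec
  Gem::Specification.new do |s|
    s.name    = "example_gem"        # placeholder name
    s.version = "0.1.0"              # placeholder version
    s.summary = "Shows that gemspec conditionals are resolved at build time"
    s.authors = ["Example Author"]   # placeholder

    # This branch is decided by the Ruby that evaluates the gemspec,
    # not by the Ruby of whoever later installs the gem.
    if RUBY_VERSION < "1.9"
      s.add_dependency "gem_dep_polish"
    else
      s.add_dependency "gem_dep_shine"
    end
  end
end

spec = example_spec

# The serialized metadata lists one fixed dependency and no conditional.
puts spec.runtime_dependencies.map(&:name).inspect
puts spec.to_yaml.include?("RUBY_VERSION")
```

Whichever dependency the builder’s Ruby selects is what every installer gets, regardless of their own Ruby version.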

With one exception: extensions.

If your gem has an extension component (most commonly C), then it has to be compiled on the user’s system.

This process can be used to make your gem install dependencies or execute other ruby code at install time. I discovered this when looking for a way to install dependent gems on a user’s system based on the ruby version they were running.

The wikibook on RubyGems mentions this process (“How to install different version of gems depending on which version of ruby the installee is using”), but doesn’t do a great job going into detail on how it works or exactly what is needed.

This is a hack, in the sense that we’re using a system to do something other than what it was designed for. Due to this, if you use this process, it will appear to the user that your gem is installing an extension, when in fact it isn’t.

To go along with the explanation that follows, example code can be found on GitHub: gem_dependency_example. You can install the gem gem_dep_example to see the process in action. This is the gem built from the example GitHub repo. The gem will install itself along with a dependent gem based on the Ruby version it detects on your system. It will install the gem gem_dep_shine if you have Ruby 1.9 or greater and it will install the gem gem_dep_polish if you have Ruby 1.8.7 (or older).

Let’s step through how this is accomplished. Looking at the code in the GitHub repo, you’ll see in the gem_dep_example folder that it is a very simple gem. There is the standard gemspec and lib folder. There is also the ext, or extension, folder, which you normally don’t see unless a gem is installing an extension.

Look at the gemspec first. Nothing in the gemspec should seem surprising, except for a line near the bottom.

s.extensions = ["ext/mkrf_conf.rb"]

This line tells RubyGems that the gem has an extension to install and that the extension can be found in the file mkrf_conf.rb in the ext folder. This is a RubyGems convention: it looks at the listed file to know that an extension should be installed. In this case, the name of the file also tells RubyGems to use its Rake-based builder, so it treats this as a Ruby extension and not a C extension.
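
For context, a gemspec with that line might look roughly like the sketch below. Only the `s.extensions` line comes from the example above; the version, summary, author, and file list are placeholders, and the real gemspec in the example repo may differ.

```ruby
require "rubygems"

# Sketch of a gemspec carrying the extension entry described above.
def gem_dep_example_spec
  Gem::Specification.new do |s|
    s.name    = "gem_dep_example"
    s.version = "0.0.1"                                          # placeholder
    s.summary = "Installs a dependency chosen at install time"   # placeholder
    s.authors = ["Example Author"]                               # placeholder
    s.files   = ["lib/gem_dep_example.rb", "ext/mkrf_conf.rb"]   # assumed layout

    # RubyGems runs this file on the user's machine during installation.
    s.extensions = ["ext/mkrf_conf.rb"]
  end
end

spec = gem_dep_example_spec
puts spec.extensions.inspect
```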

Go to the ext folder and open mkrf_conf.rb, the only file there. You’ll notice some boilerplate code you’ll have to include, where you load the RubyGems dependency installer. If it doesn’t find this it will fail with an error.
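
The boilerplate is along these lines. This is a sketch from memory of the pattern rather than a copy of the file in the repo, so the exact lines there may differ; the reading of the `rescue` as covering RubyGems versions without the `build_args=` setter is an assumption.

```ruby
require "rubygems"
require "rubygems/command"
require "rubygems/dependency_installer"

# Hand any build arguments through to RubyGems. The rescue is assumed to
# cover RubyGems versions that do not define this setter.
begin
  Gem::Command.build_args = ARGV
rescue NoMethodError
  # Nothing to set on this RubyGems version.
end

# The object whose #install method the rest of the script calls.
installer = Gem::DependencyInstaller.new
```

Nothing is installed yet at this point; the script has only prepared an installer object to use in the conditional logic.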

After loading the installer, write your code to do what you want. This will run on the user’s system, so you can inspect the system and take conditional action based on what you find. In this case the code picks which gem to install based on the user’s Ruby version.

  begin
    if RUBY_VERSION < "1.9"
      installer.install "gem_dep_polish", ">=0"
    else
      installer.install "gem_dep_shine", ">=0"
    end
  rescue
    # Exit with a non-zero value to let rubygems know something went wrong
    exit(1)
  end

The extension file needs to end with these lines:

f = File.open(File.join(File.dirname(__FILE__), "Rakefile"), "w")
f.write("task :default\n")
f.close

The reason for this goes back to this being a hack. RubyGems thinks it is installing a native extension. To finish installing the extension, it will look for the make or rake file that it assumes was generated and run that to complete the installation. In this case, the relevant code has already run, but RubyGems will error out unless it finds a rake file to run. To make RubyGems happy, we create an empty rake file with an empty default task for RubyGems to run so it will exit normally.
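
The same step can be written with the block form of `File.open`, which closes the file automatically. The helper name `write_dummy_rakefile` is made up for this sketch, and it writes into a temporary directory here so it can be run safely; in mkrf_conf.rb the target would be the directory of the script itself.

```ruby
require "tmpdir"

# Writes a Rakefile whose only content is an empty default task.
# The block form of File.open closes the file even if the write raises.
def write_dummy_rakefile(dir)
  path = File.join(dir, "Rakefile")
  File.open(path, "w") { |f| f.write("task :default\n") }
  path
end

Dir.mktmpdir do |dir|
  path = write_dummy_rakefile(dir)
  puts File.read(path)   # prints the single task line
end
```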

Since it isn’t entirely clear to the user installing the gem what is happening, I recommend you avoid using this trick unless you have no other choice. I wish RubyGems included a more intuitive way of achieving this, but it doesn’t, so this will do for now.

A final note: you shouldn’t use this if you’re looking to install different gems based on the user’s platform (Windows, JRuby, Ruby, etc). In this case the accepted best practice is to create different gems for each platform and leave it to the user to install the appropriate one.
