November 17, 2010 - by joyeurelijah
Hi there! I'm Elijah, and I'm a Joyeur. You're most likely to encounter me if you send a support request in to firstname.lastname@example.org -- or if you come up with a very interesting architectural question! I like making things scale nicely for our customers, though that means I tend to worry late at night about all of the things that could possibly go wrong. I'm a helpful sort, that way. I'm based in Tennessee, just east of Nashville, and have been 'at' this cloud thing for, well, a while now…
One of the things that I've been keeping an eye on in the last couple of years is deployment tooling.
If you're a member of the developer, sysadmin or operations tribes -- or, more generally, part of the meta-tribe coming to be known as DevOps -- you're probably nodding your head: You know that deployment is difficult and can be a source of major pain. It can break your application entirely, and sometimes that means breaking your entire business as well.
I want to shoot my mouth off a little to start a conversation about where people think the obvious holes in current tooling are, and where they think we can go. If you get all of the things 'right' that I'm about to mention, then you rock, and clearly have an awesome shop and team!
Deploying code seems like an obvious thing. People have done this in a multitude of ways: Rsync, copied tarballs, and lately with remotely-triggered checkouts from a DVCS (distributed version control system) of some flavor. Deployment can be triggered by a tool like capistrano or not. Perhaps you log into all of your servers and trigger checkouts (or code download-and-install actions) with SSH and a custom shell script.
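The "SSH and a custom shell script" approach boils down to a loop over hosts. Here's a minimal sketch of that pattern in Python -- the hostnames, tarball name, and deploy directory are all placeholders, not a recommendation of any particular layout:

```python
#!/usr/bin/env python
# Minimal sketch of a push-style deploy: copy a release tarball to each
# host and unpack it, recording which hosts failed. Hostnames and paths
# below are hypothetical -- adapt them to your own environment.
import subprocess

HOSTS = ["web1.example.com", "web2.example.com"]   # hypothetical hosts
TARBALL = "myapp-1.2.3.tar.gz"                     # hypothetical release
DEPLOY_DIR = "/opt/myapp/releases"

def deploy(host, tarball=TARBALL, deploy_dir=DEPLOY_DIR,
           runner=subprocess.call):
    """Copy and unpack the release on one host; return True on success."""
    if runner(["scp", tarball, "%s:%s/" % (host, deploy_dir)]) != 0:
        return False
    unpack = "cd %s && tar xzf %s" % (deploy_dir, tarball)
    return runner(["ssh", host, unpack]) == 0

def deploy_all(hosts, runner=subprocess.call):
    """Deploy to every host; return the list of hosts that failed."""
    return [h for h in hosts if not deploy(h, runner=runner)]

if __name__ == "__main__":
    failed = deploy_all(HOSTS)
    print("failed hosts: %r" % failed)
```

The `runner` parameter is there so the logic can be exercised without real SSH access -- which is also exactly the kind of seam a test environment needs.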
This is layer one of pain, largely felt by developers.
How best can we manage the pain here during development, and eliminate surprises? A first-order approximation: continuous integration systems like Hudson or CruiseControl.rb (and all the other flavors); unit tests to show us where the pain points are -- or where we have too few tests to catch our own mistakes; and test environments that approximate the production environment as closely as possible.
How do we expect things in this space to evolve? I'd love to see someone push out a set of services that make it ridiculously easy to plug CI into one's development environment. Auto-testing a node.js app out of a git repo just by running a single command to set things up? Brilliant! Making things more like magic and less like drudgery ("dammit, I don't want to update the CI system today") makes for happy folks.
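Stripped of the magic, the "auto-test out of a git repo" idea is just a loop: watch HEAD, and when it moves, run the tests. A toy version, assuming a local git checkout and a `make test`-style test command (both placeholders) -- this is a sketch of the idea, not a CI server:

```python
# A bare-bones continuous-integration tick: if the git HEAD of a
# checkout has moved since we last looked, run the test suite.
# Repository path and test command are placeholder assumptions.
import subprocess

def head_commit(repo):
    """Return the current HEAD commit hash of a git checkout."""
    out = subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=repo)
    return out.strip().decode()

def run_tests(repo, command=("make", "test")):
    """Run the test command in the checkout; True means it passed."""
    return subprocess.call(list(command), cwd=repo) == 0

def poll(repo, last_seen, get_head=head_commit, test=run_tests):
    """One CI tick. Returns (new_last_seen, result): result is None if
    HEAD hasn't moved, otherwise True/False for the test outcome."""
    current = get_head(repo)
    if current == last_seen:
        return current, None
    return current, test(repo)
```

Everything a real CI system adds -- queueing, history, notification, isolated build environments -- hangs off that one decision point.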
While I've seen a couple of attempts to make test generation easier, what can be done to lower the bar?
What does your environment look like? Do you know who made that change to your MySQL my.cnf file, when, and why? Do you have a rollback plan for reversing a blundered change? What does the disaster recovery plan look like?
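Answering "who changed that, when, and why" takes surprisingly little machinery. A toy append-only change journal, just to make the idea concrete -- the journal path is a placeholder, and in practice you'd get this for free from version control or a config-management tool:

```python
# A minimal change journal: before touching a config file, record who,
# when, what, and why, so "who changed my.cnf?" has an answer. The
# journal location and record shape here are illustrative assumptions.
import getpass
import json
import time

def record_change(journal_path, target, reason, user=None, now=time.time):
    """Append one change record (one JSON object per line)."""
    entry = {
        "time": now(),
        "user": user or getpass.getuser(),
        "target": target,          # e.g. "/etc/mysql/my.cnf"
        "reason": reason,
    }
    with open(journal_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def read_journal(journal_path):
    """Return all change records, oldest first."""
    with open(journal_path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

The rollback plan starts with exactly this: if you can't say what changed, you can't reverse it.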
This is layer two of pain, typically felt by the system administrator, the backend infrastructure engineer, or the manager who's first to hear about a badly handled change or upgrade.
Test environments reduce the pain here, too. Build things in an automated fashion. Build predictability and regular testing into the environment. Use tools like Puppet, Chef, and Marionette Collective, along with recipe-driven monitoring (e.g. classes associated with new nodes that automatically instantiate new checks). Tools like cucumber-nagios help narrow the gap between developer-style unit testing and infrastructure reliability.
Where can this space go? I seem to be seeing more and more automation and "neat tricks" to speed things along -- just today, I saw an mcollective plugin that produces the md5sum of every variant of a config file across your entire infrastructure -- so that you can identify that "base" config, find all the variants, and quickly factor out the commonalities and check a template into your config management system. You could do it by hand, and it's quite a sensible thing to work up. But having someone else come up with the idea, implement a simple version of it, and add it to the library of tricks that are available and public… That's priceless!
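The md5sum trick described above reduces to grouping hosts by the checksum of a file's contents: the biggest group is your "base" config, everything else is a variant. A toy version of that grouping (hostnames and file contents made up for illustration):

```python
# Toy version of the "md5sum every config variant" trick: group hosts
# by the checksum of a config file's contents, so the most common
# variant (the "base" config) and the outliers fall out immediately.
import hashlib
from collections import defaultdict

def group_by_checksum(configs):
    """configs: {hostname: file_contents}. Returns {md5hex: [hosts]}."""
    groups = defaultdict(list)
    for host, contents in sorted(configs.items()):
        digest = hashlib.md5(contents.encode()).hexdigest()
        groups[digest].append(host)
    return dict(groups)

def base_and_variants(configs):
    """Return (base_hosts, variant_hosts): largest group vs the rest."""
    groups = sorted(group_by_checksum(configs).values(),
                    key=len, reverse=True)
    base = groups[0] if groups else []
    variants = [h for g in groups[1:] for h in g]
    return base, variants
```

From there, diffing one member of each group against the base hands you the template and its per-host parameters almost for free.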
Then there's the third layer of things that have an impact on your application or service: The things that are nearby, but just outside of the control you have over the deployment of your application.
Do you know where your packets go? Do you get alerts if the upstream default gateway goes away for ten seconds, then comes back as a totally different flavor of appliance than you suspect? What if the ARP entry (the MAC address) for the NFS server your data lives on changes? Do you know, or does it show up as an unexplained set of failures in your application's logs?
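Even that ARP-change case is detectable with very little code. A sketch, assuming a Linux box (it parses the `/proc/net/arp` table format) and a placeholder gateway IP and MAC:

```python
# Sketch of "did my neighbor's MAC change?" detection: parse the Linux
# kernel ARP table (/proc/net/arp format) and compare a host's current
# MAC against the one we recorded earlier. IP and MAC are placeholders.
def parse_arp_table(text):
    """Parse /proc/net/arp contents into {ip: mac}."""
    table = {}
    for line in text.splitlines()[1:]:          # skip the header line
        fields = line.split()
        if len(fields) >= 4:
            table[fields[0]] = fields[3].lower()
    return table

def mac_changed(arp_text, ip, expected_mac):
    """Return (changed, current_mac) for the given neighbor IP."""
    current = parse_arp_table(arp_text).get(ip)
    return (current is not None and current != expected_mac.lower(),
            current)

if __name__ == "__main__":
    with open("/proc/net/arp") as f:
        changed, mac = mac_changed(f.read(), "10.0.0.1",
                                   "aa:bb:cc:dd:ee:ff")
    print("gateway MAC changed: %s (now %s)" % (changed, mac))
```

Wire a check like that into your monitoring and an upstream appliance swap becomes an alert instead of a mystery in your application logs.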
This can be a 'black hole' for developer and sysadmin/operations time, as there are myriad issues that can occur, some far more difficult to imagine than others. I recently encountered a DNS issue that was baffling. Several of us stared at it together for hours, but it proved a tough nut. Finally we discovered that someone upstream of me had made a change -- and hadn't told anyone. Blame isn't worth assigning, but this is the flavor of thing that can constitute a very real emergency.
Some tools have a start on automatic detection and configuration of monitoring, but making it "easy" is far from widespread.
What do you think? What tools and procedures are you using to ease the pains of iterative deployments? If you have reflections, want to play the futurist, or otherwise think out loud, continue with the mouthing off in the comments.
Photo by Flouille Salads.