Tuesday, May 5, 2015

Stats Should Be A Commodity: Introducing Stat Badger

My favorite part of any job is tracking down the resident data addict(s) and finding out what their favorite visualizations are (and what in the world they mean). If you want to hear some incredible stories about a company's tech stack, find the people who live and breathe stats and logs. They'll blow your freaking mind, I promise.

Needless to say, I'm a bit obsessed with stats myself. Without a nigh-unmanageable volume of data points with which I can paint pictures of what's going on in my stack, I start to feel like a kid at Christmas with a mountain of presents and no name tags. That said, I'm not picky about how I get those stats or where they're stored. I have some opinions, sure, but if I can make use of what I've got and get what I need, I'm fine.

The problem is that all the tools that exist to collect those stats are extremely opinionated. In fact, almost every stats collection tool out there today:
  • comes as part of a larger metrics aggregation and analysis suite / platform
  • comes without the bits necessary to grab basic system stats (cpu % util, disk and network IO, mem util, etc)
  • is not easily extensible to gather custom "non-system" stats (redis, JMX, apache, you name it)
  • is a nightmare to build / install / configure
  • makes heavy-weight assumptions about where (and in what format) stats are going to be shipped
  • does all of the above

This becomes a hindrance when you try to provide multiple teams in an organization with the flexibility to manage their tools and data in their own way. Teams are left with one of a few unpleasant choices:
  • conform to whatever stats collection tool the InfraNerds are using
  • use whatever stats collection tool they want and:
    • leave the InfraNerds blind (read: "induce much pain")
    • stick the InfraNerds with the task of integrating multiple tools
  • deploy multiple stats collection tools to feed the backend they like in addition to the one used by the InfraNerds
It also becomes a major effort to swap out stats backends, should the need arise, as you have to figure out what the new collection tool can or can't do and how best to configure it.

This sucks. A lot.

So, Stat Badger is my attempt to avoid such issues entirely. It imposes zero opinions on your stats and their destinations. The core loop gathers no stats on its own, and defines no outputs on its own. Instead, those decisions are made by the user, by way of defining one or more Modules and Emitters.

The basic philosophy of Stat Badger is that stats should be a commodity - a raw "material" that can be extracted in large volumes and sent anywhere a consumer might want it - ready to be manipulated, refined, and turned by any number of processes into myriad useful products and services.
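
To make that split concrete, here's a toy sketch of the idea in Python. To be clear, this is not Stat Badger's actual plugin interface - just the shape of it: a module hands back data points, an emitter ships them somewhere, and the core loop does nothing but stitch the two together on a timer.

import time

class LoadAvgModule(object):
    # a "module": returns a dict of metric name -> value, and nothing else
    def collect(self):
        with open("/proc/loadavg") as f:
            one, five, fifteen = f.read().split()[:3]
        return {"load.1m": float(one), "load.5m": float(five), "load.15m": float(fifteen)}

class StdoutEmitter(object):
    # an "emitter": takes whatever the modules produced and ships it somewhere
    def emit(self, timestamp, points):
        for name, value in points.items():
            print("{0} {1}={2}".format(timestamp, name, value))

def core_loop(modules, emitters, interval=10):
    # the core gathers nothing and defines no outputs - it only orchestrates
    while True:
        points = {}
        for module in modules:
            points.update(module.collect())
        now = int(time.time())
        for emitter in emitters:
            emitter.emit(now, points)
        time.sleep(interval)

if __name__ == "__main__":
    core_loop([LoadAvgModule()], [StdoutEmitter()])

Swap the stdout emitter for one that POSTs to InfluxDB or publishes to Kafka, and nothing else has to change. That's the whole point.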

Now, even though Stat Badger makes no assumptions about your stats, it does ship with a standard set of modules and emitters to get you started. Specifically, it ships with modules to gather detailed system stats (cpu, memory, network, disk, load, and per-process memory / cpu... so far), and emitters to spit stats out to a number of back-ends (InfluxDB 0.8, Graphite, Kafka, stdout... so far).

Getting started should be as simple as:

git clone https://github.com/cboggs/stat-badger
cd stat-badger
python badger_core.py -f config.json

This should start up a foreground process that spits pretty-fied JSON to stdout. For more interesting experiments, edit config.json and add to the list of emitters to ship your data to InfluxDB, Graphite, Kafka, or all of the above - all at the same time.

Stat Badger is not yet a fully-polished product. It's lacking the required wiring for tests (I'm not a dev by trade, so I've been cheating so far). It's got a few limitations (addressed in the "More to Come" section of the README on GitHub). It also needs more modules and emitters to make it as universal as I envision.

All that said, I hope you like Stat Badger, and I hope even more so that you'll contribute and help make it a strong, solid tool for stats gathering!

Wednesday, February 18, 2015

Why Your Early Start-Up Needs an Infrastructure Engineer

Most early stage start-ups dive straight into hiring as many top-notch Software Engineers as they can get their hands on, and then watch the river of code that flows from their fingertips. What they might not realize is that their engineering teams are missing an important component - an Infrastructure Engineer! To elaborate...

The Situation

Your tech start-up has a marginally functional prototype of your latest brainchild, and it just helped you land your first major round of funding. Whoo!!! Now you're on the hunt for scary good talent to start banging out code so as to commence the Journey to Alpha. It's a good day.

The Need

Rockstar 10x Software Engineers are an absolute must-have. You are, after all, in the business of writing software - who better to hire than, ya know, the people who write software? These fine folks are going to translate your vision from pure thought to actual, usable stuff that you can sell to customers. So, that's your recruitment strategy: go for the super devs and call it a day.

The Actual Need

A group of engineers who can collaboratively hammer out a product that is worthy of putting in front of your potential customers is the real goal here - not simply gathering a collection of awesome Software Engineers. Building out your team early on with nothing but SE's tends to leave you lopsided, weighted too heavily toward the "crank out product code" side. To balance out this early engineering collective, you really want to include an Infrastructure Engineer, whom I'll fondly refer to as an InfraNerd from here on out (because, honestly, no one wants to be called an "IE").

What Usually Happens

Your start-up ramps up to a team of 5-10 SE's, and they get busy building out your product to (mostly) the spec they're handed. Here are a few generalizations about how things might operate in the early stages of product development:
  • Each SE is likely using the toolchain and build routine that has always treated them well in the past
  • Some write tests to cover a respectable percentage of their code
  • The rest write somewhere between 0 and ∅ tests
  • More SE's come on board
  • Everyone pushes to a dev branch, which is merged to master inconsistently at best
  • Master breaks. Every. Damn. Time.
  • Moar SE's!
  • Codebase begins to win recognition as the best Italian food in town
  • Alpha (and possibly Beta) products fall flat on their face when deployed for customer POC / demo
  • Massive architecture rework is considered, then immediately ignored in the interest of Getting Shit Done
  • You realize that the product isn't even remotely Production-ready and begin attempts to recruit the mythical "DevOps Engineer"
  • After determining that "DevOps Engineers" are mostly made up of dreams and disappointment, you end up with an Ops Person (who may or may not be an InfraNerd)
The nastiest pain point of "the usual way" is generally when the Ops Person is given the unenviable task of "fixing" all the things. It's all too easy for ill will to be felt toward the person who is coming in and telling a sizable group of intelligent, competent, experienced SE's that they need to stop doing X and start doing Y. Right now. Because their stuff is BUSTED. 

This is generally a bad time for everyone involved.

How an InfraNerd Changes the Equation

Your friendly neighborhood InfraNerd can, given the appropriate care and feeding, avoid a lot of the heartache-inducing moments enumerated above. Bringing an InfraNerd on-board early in the life of your start-up means you get to deal with an engineering team moving faster than the founders can keep up with, rather than trying to un-break the world.

Here's the straight dope: your SE's are immensely smart folks, and they can work freaking magic at the keyboard pouring forth rivers of feature-packed and shockingly cool software. They casually tweak algorithms that the masses only speak of in hushed reverent tones, they grok unnervingly complex systems that most people couldn't navigate with a GPS and a tour guide, and they build amazing things out of thin air. They know their world forward and backward. But they (usually) don't know the world that sits just below theirs.

Your InfraNerd, however, will know that shadowy world intimately, and they'll draw on that knowledge to augment your engineering team such that it will be able to sustain a much higher level of Awesome. It's not that an InfraNerd can do things that an SE can't grasp, but rather that they think of solutions that your SE's are not accustomed to thinking of (and often don't have time for).

What an InfraNerd Brings to the Table

Without an InfraNerd, your SE's will very likely have something like Jenkins in place to handle some ad-hoc test runs and the like, but I can almost promise it will be broken and underutilized. After all, who has time to fuss with automating the build pipeline? Answer: your InfraNerd does! Their purpose in life is to automate the world, and your team's velocity will ramp up quickly because of it.

An InfraNerd is going to bring in tools of which your SE's may not even be aware, all for the sake of automating the world. They'll help avoid the nightmare of shady shell scripts being used for deployment by calling on one or more of the many available Configuration Management & Orchestration tools (Ansible, Fabric, Chef, Capistrano, Puppet, SaltStack, etc). There's a good chance your SE's know about such things, but have never really seen a need to use them in their day-to-day workflow. There's a 100% chance your SE's will fall in love with such things when they see how easily they can deploy their code changes to arbitrary environments with minimum fuss.

Your InfraNerd will do their damnedest to reduce the amount of code your SE's have to write and maintain. SE's tend to be inventors by nature, and with that nature comes the tendency to see a wheel-shaped hole and immediately set out to find a suitable chunk of rock from which to sculpt a mighty fine wheel. They know the problem and they're capable of building what's needed to fix it, so they set out to do so. Usually the problem that needs solving has already been solved by some existing tools, and your InfraNerd will jump at the chance to use those tools to reduce the team's impromptu Wheel Re-Invention Drills. The real beauty here is that most of the tools an InfraNerd ushers in will quite often open the door to some really cool enhancements and features that might never have crossed anyone's mind up to now.

In the midst of all their other work, your InfraNerd will also weave in tasks to build up what is undeniably the single most important part of your systems: infrastructure you can use to measure everything. They're going to find all the logfiles your SE's have ever dumped anywhere and get them automatically ingested into a centralized, searchable, real-time store by way of something like ELK or Splunk. If there's a metric that can be extracted from anything, they're going to find a way to get to it and pump it into a time series data store like InfluxDB or Graphite. Then they'll take all that data and craft dashboards to display it in all its unapologetic and illuminating glory with something like Grafana. They'll use all of this to tell you stories about your product that you'd never imagined possible. It's kinda great.

The Wrap-Up

In a young start-up looking to hire on Engineer #10, it's going to feel pretty wrong to fill that seat with anyone other than a Software Engineer. Chances are that it's going to feel wrong anytime before #50, to be honest. You should consider, however, that as expensive as that #10 slot might seem, filling it with an InfraNerd can really be a game-changer. You'll end up with an engineering team that turns out a better product in less time and likely doesn't need to rebuild everything from the ground up before you can ship Beta. You'll be faster to market with a superior product and happy SE's - and that slot will suddenly seem so cheap that you'll want 2 or 3 more InfraNerds before you know it.

So now you need to find some InfraNerds and get them on the payroll! How do you find them, and what should they be doing once they're on-board? Tune in next time!

Tuesday, February 3, 2015

Getting gmond metrics into InfluxDB

I'm a huge fan of InfluxDB + Grafana. InfluxDB is on a good path toward making metrics not suck, and Grafana has a great vision for interactive and exceedingly scriptable dashboards.

One problem I've run into recently is that while I can get all kinds of cool custom metrics into InfluxDB without much struggle, I don't have a tool that will spit out nice system metrics into InfluxDB.

My favorite tool for getting useful system metrics so far has been Ganglia's gmond. It gives some decent disk and memory data, but my favorite bits are the built-in CPU % utilization and network bytes in/out metrics. You'd be surprised how few tools actually give that information out-of-the-box.

However, the back-end I really don't want to use with gmond is Ganglia itself. What it does, it does exceedingly well. What I want it to do, it does terribly. So I either needed to adjust my expectations or find some way to get gmond metrics into InfluxDB. It turns out there are tools available to get data from most other monitoring and metrics tools into InfluxDB (collectd, statsd, fluentd, graphite, etc), but nothing that would work with gmond.

Now, you can make gmetad output data in Graphite format and point it at InfluxDB's Graphite input plugin. I don't much care for that approach, for a few reasons:

  • it forces an awkward (and generally performance-hindering) data layout in InfluxDB
  • it requires you to run gmetad, which feels a bit heavy when it would just act as a proxy
  • it introduces more layers when I'd much rather simplify things
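
For reference, that wiring looks roughly like this - gmetad's carbon export on one side, InfluxDB 0.8's graphite input plugin on the other (option names vary a bit between versions, so treat this as a sketch and check your own configs):

# /etc/ganglia/gmetad.conf - push everything gmetad aggregates out in carbon/Graphite format
carbon_server "influxdb-host"
carbon_port 2003
graphite_prefix "ganglia"

# influxdb config (0.8.x) - accept Graphite-formatted metrics on port 2003
[input_plugins.graphite]
enabled = true
port = 2003
database = "ganglia"

It works, but every metric arrives as one long dot-delimited series name, which is exactly the awkward layout I'm complaining about in the first bullet.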

I'd much rather have some simple tool that polls gmond and puts the metrics into InfluxDB in a sane way.

So, I built one. Get it at: https://github.com/cboggs/gmond-influxdb-bridge

It's not *quite* where I want it to be yet, but it should be of some use right now. I've listed some enhancements I want to make in the README.
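
If you're curious what the approach boils down to, here's the gist - a stripped-down sketch, not the bridge's actual code (hostnames, database name, and credentials are placeholders): connect to gmond's XML port, walk the metric tree, and POST the values to InfluxDB 0.8's JSON write endpoint.

import socket
import xml.etree.ElementTree as ET
import requests

GMOND_HOST, GMOND_PORT = "localhost", 8649   # gmond's default XML port
INFLUX_URL = "http://localhost:8086/db/metrics/series?u=root&p=root"   # 0.8-style write endpoint

def fetch_gmond_xml(host, port):
    # gmond dumps its full metric tree as XML to anyone who connects
    sock = socket.create_connection((host, port))
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks)

def to_series(xml_blob):
    # flatten each <HOST><METRIC .../></HOST> element into an InfluxDB 0.8 series payload
    series = []
    for host in ET.fromstring(xml_blob).iter("HOST"):
        for metric in host.iter("METRIC"):
            try:
                value = float(metric.get("VAL"))
            except ValueError:
                continue   # skip string-valued metrics (os_name and friends)
            series.append({
                "name": metric.get("NAME"),
                "columns": ["value", "host"],
                "points": [[value, host.get("NAME")]],
            })
    return series

if __name__ == "__main__":
    requests.post(INFLUX_URL, json=to_series(fetch_gmond_xml(GMOND_HOST, GMOND_PORT)))

InfluxDB 0.8 takes a whole list of series in a single POST, so batching everything a gmond instance knows about into one request keeps the overhead nice and low.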

My thinking behind this tool (and any others I might build) is that, as an infrastructure tool, it should be easy to get started with, easy to automate once you've got the hang of it, and easy to forget about once you've handed it off to your config management tools.

To that end, if any one of those qualities is not present, consider it a bug and file an issue on GitHub - or send me a pull request; I love those things.

Happy graphing!

Friday, January 16, 2015

Omnibus Cache Exploitation

I've been playing with Omnibus a bit lately, in hopes of making our product installable in locations where internet access is not an option. We have a decent list of dependencies that we install at deploy time, and we need some consistent way to get those items put in place when public repos aren't available.

If you've never seen Omnibus before, it's pretty great (though there is a bit of a learning curve).

Predictably, our dependency list gets hairy when we start pulling in the dependencies of our dependencies. That's not so bad, by itself. What sucks is that any change to your project definition (omnibus-my-project/config/projects/my-project.rb) - even whitespace - causes Omnibus to invoke its HopelesslyPessimistic mechanism, which in turn invalidates your software cache faster than you can say, "but it's just a pip install!".

This is done for perfectly rational reasons, sure, but it's annoying when you're working in a container whose only purpose in life is to run Omnibus builds, and your build times start to creep into the tens of minutes.

Fortunately you can route around this pessimism while you're getting your dependency list ironed out by adding your dependencies to a dependency. To elaborate...

You normally add your run-time dependencies to your project definition (my-project.rb), but you can also add them to one of the software definitions you already depend on, over in omnibus-my-project/config/software.

So, for example, instead of a project definition that lists every run-time dependency directly - something like this (the dependency names here are placeholders for whatever your project actually needs):
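
# omnibus-my-project/config/projects/my-project.rb
# (dependency names below are placeholders for whatever your project actually pulls in)
name "my-project"
maintainer "Your Name <you@example.com>"
homepage "https://example.com"

install_dir "/opt/my-project"
build_version "0.1.0"
build_iteration 1

dependency "python"
dependency "pip"
dependency "pycrypto"
dependency "paramiko"
dependency "requests"
# ... one line per run-time dependency, and every edit to this file busts the cache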

You'd have a more minimal definition along these lines:
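
# omnibus-my-project/config/projects/my-project.rb
name "my-project"
maintainer "Your Name <you@example.com>"
homepage "https://example.com"

install_dir "/opt/my-project"
build_version "0.1.0"
build_iteration 1

# just the one dependency - it will drag the rest in for us while we iterate
dependency "pycrypto"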

Then your pycrypto definition might look something like this (again a sketch - adjust the version and build commands to suit your own setup):
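
# omnibus-my-project/config/software/pycrypto.rb
name "pycrypto"
default_version "2.6.1"

dependency "python"
dependency "pip"

build do
  # exact build steps depend on your setup; pip-installing into the embedded Python is typical
  command "#{install_dir}/embedded/bin/pip install pycrypto==2.6.1"
end

# temporary parking spot: pile the rest of the project's run-time dependencies
# down here so that adding one doesn't invalidate everything already built
dependency "paramiko"
dependency "requests"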

Now you can add dependencies to the bottom of pycrypto.rb and take full advantage of the Omnibus software cache! Saves a lifetime of waiting on builds to complete.

NOTE: You really shouldn't leave your project like this - it'd be terribly irresponsible of you, and then you'll say that I told you to do it, and I'll catch all kinds of grief for it. So don't do that.

When you're done adding all your dependencies, you can simply shovel all the dependency "<name>" lines out of (in this example) pycrypto.rb back into my-project.rb, and kick off one more (very long) build to get a sane final package.

Happy packaging!

Thursday, January 15, 2015

Omnibus for Offline Installs

Software repos are wonderful things, man.

Until your servers are offline, of course... that's no good.

Or if they're online, but not allowed to update but once a year.

Or worse yet, they only get whatever updates are considered mission-critical by IT, and there's no way in hell you're going to get the right versions of your dependencies installed at deploy time.

At the end of the day, trying to get anything installed at a customer site that has externally-hosted dependencies is almost guaranteed to be more of a pain than it's really worth.

So what can be done? Well, a few ideas come to mind...

  • Tar up your top-level dependencies, push the tarball out at deploy time, and hope for the best. Then watch as all the installs fail because dpkg or rpm can't satisfy the dependencies of your dependencies
  • Script out something like  'apt-get --print-uris --yes install <stuff>' and wget the results so you can tar 'em up, push 'em, and install 'em. Then watch as your next deploy fails miserably because the customer did their quarterly security patches and broke your dependency chain all to pieces
  • Containerize all the things! This actually might work just dandy depending on what your software does. In some cases though, your software needs more intimacy with the hardware than a container can offer (mostly ultra-low-latency IO and the like). Plus it might mean introducing technologies that your customer is unwilling to accept on their systems...
  • Stick it all in a single package and jail it all in /opt. Winner!
Enter Omnibus. You get to include every dependency above libc in the correct order and specify how to build each one, jailing them in /opt/<name> (or wherever you want, really). Then you get a nice shiny package out of it (for pretty much any platform you want) and can confidently install it nearly anywhere without having to worry about whether or not you can get to any external repos.

I'll see about putting together a Cody-fied (read: spoon-fed) how-to on getting started with Omnibus shortly.

In the meantime, if you're already using Omnibus and your builds are taking forever due to your software cache getting nuked every time you try to add a dependency, check this out.