Monday, June 27, 2016

Let's Build Something: Elixir, Part 6 - Adding Metrics to StatsYard

We're going to take a bit of a detour with this post, and look at what we can do to get some basic metrics spit out of StatsYard! Not only am I a huge metrics geek so this is a pretty strong urge for me, but it's also something that will be helpful as we go forward. Implementing new functionality and fixing up old is generally made more difficult by not having decent measurements, and laying the groundwork for that now will make it easier to keep up with it as we go along.

The old adage that premature optimization is the root of all evil still holds, of course, even for a metrics obsessed type like me. While there's value in checking that we haven't introduced a major performance impact to the system, it just won't do to nitpick every little bit and try to tease out a tiny bit more performance at this stage of the game. I mostly just intend to use it to get an idea of the system's capabilities and as a learning opportunity when we start looking at alternative compilers and organizational structures.

An Overview

So! Let's get into it, shall we? The high level view of what we're going to implement is this:
    • Custom stats should be super easy to instrument as we build new functions and features
    • VM stats are awesome, but not something we should have to gather ourselves
    • Stats need to be shipped somewhere, preferably quickly and easily
    All of this will evolve over time, for sure, but this seems like a good start.

    O Back-End, Where Art Thou?

    Before we get too far, we need a place to stick our metrics! I'm actually using Sysdig Cloud for this, because they have an agent that beats all the rest, and their UI will suit my purposes nicely. The key feature of their agent that I'm leaning on is the built-in statsd server that will ship my stats to their back-end so that I don't have to worry about it. Fret not, though - nothing in this post will demand that anyone else use Sysdig, that's just where I'm going to throw data for now.

    Wiring Up Our New Dependencies

    First things first, let's add all of our dependencies and start up the apps as needed. We'll just lump them all in for now, and spend the rest of this post looking at how to make use of them. Crack open mix.exs and add :folsom:statixand :ex_vmstats dependencies:

    Still in mix.exs, we also need to tell our project to start these new applications, otherwise they won't do their job at startup:

    Our First Custom Metric

    Like I mentioned earlier, we're going to be using Folsom for custom stats. It's an Erlang app, and it's a pretty good analog to Java's codahale.metrics, now known as DropWizard. The Github readme has pretty much all the info you could need about how it works, so we'll jump straight into instrumenting a call to one of our functions - DataPoint.validate/1Open up lib/stats_yard/ingest_consumer.ex, and we'll wrap our function call with a :folsom_metrics function:

    We first have to define our metrics so that Folsom can get the appropriate structures setup in memory, so we add a private function create_metrics/0 and then call it when we init our GenServer.

    Next we wrap our validate/1 function with Folsom's histogram_timed_update/2. This Folsom function is basically a pass-through function, in that it accepts a function of our choosing as one of its arguments, evaluates that function and does some kind of work in relation to it, and then returns the value of the function that it evaluated. In this case, the work done in addition to our validation function is timing its run time and updating an ETS table that Folsom maintains for "doing math" on and maintaining a history for our metric. From our perspective, though, almost nothing has changed. We don't get any new return values or anything from validate/1,  and so we can still use it in the same manner we wanted to originally.

    After we've validated the data point and gotten a timing for the invocation, we increment the counter validate.rate, and then move on!

    Dude, Where's My Data?

    So we created some metrics and updated their values - why don't we see anything happening in our statsd server??? As it turns out, Folsom's only goal in life is to keep track of our stats, not actually send them anywhere. The beauty of Folsom is that it's optimized to keep track of our stats and keep them accessible in ETS tables, which reside in memory and are extremely fast. With Folsom doing the math and handling the history and access paths of our metrics, we're left with all the flexibility in the world when it comes to choosing what to do with the data that's tracked. So, to that end, let's wire up Statix so we can actually send these bits somewhere!

    Setting up Statix

    It doesn't take much to get started with Statix. At a minimum, we need to:

    • Define a config for Statix in config/config.exs
    • use Statix in a module (any module), which will pull in a handful of useful functions that we can use to ship metrics to a statsd server
    • Make a quick call to the connect/0 function that our Statix-using module now has, and we're off to the races with such functions as gauge/2, increment/2, decrement/2etc.

    Our config is simple enough. Clearly one should change the host value to something appropriate as needed, but generally this is fine:

    That's all good and well, but we need to wire up our in-memory Folsom data with Statix so that it's actually of some use. Enter our shiny new StatsYard.StatShipper module!

    In the file lib/stats_yard/stat_shipper.ex we'll place the following:

    Whew! Lots going on here, so let's talk about it at a high level, then we'll jump into the line-by-line breakdown. The general idea here is to build a supervised GenServer that will, on a certain interval, iterate across all metrics known to Folsom and ship them out via Statix. This keeps the shipping of metrics out of the data path for actually collecting the metrics, and since it's supervised separately at the top level, its workload isn't tied to any other part of the app.

    Let's dive in and see what exactly is happening here:
    • Line 9-10: Folsom's histogram and meter metric types are actually collections of various related values, but we don't necessarily always want all of them. Here we define the ones we want for now
    • Line 12: This GenServer needs to do work every N milliseconds, and we default N to 1000
    • Line 17: Our init/1 function does nothing more than schedule the first metrics shipment, then lets everything run itself from here on out
    • Line 21-23: This function shows how easy it is to do something after a set time. Process.send_after/3 does exactly what it sounds like - sends a message to an arbitrary PID after the specified time. In this case, the GenServer is sending itself a message telling itself to ship metrics, but waits ship_interval milliseconds before doing so
    • Line 25: We're using handle_info/2 here, because it's the GenServer callback that can handles all arbitrary messages send to the GenServer that are NOT handled by any other callback. Our message telling the server to ship metrics is, indeed, one such arbitrary message
    • Line 26: Here we call the function that actually kicks off the iteration over Folsom's metrics and then ships them off. We also set the GenServer's state to the return value of this function, which we could use (but aren't currently) to terminate the server if metrics shipping failed
    • Line 27: Once metrics have been shipped, schedule the next shipment for ship_interval from now
    • Line 32-41: For every metric that Folsom has registered, deconstruct its metadata tuple and take appropriate action based on the metric's type
    • Line 50-54: Break apart a meter metric and only ship the bits that we've allowed with @shippable_meter_stats
    • Line 60-71: Histogram metrics in Folsom are a bit of of a mix of data types, so we need to walk every keyword in the list and when we come across the :percentile keyword, handle it appropriately. All other keywords are shippable as-is
    I know, it's a mouthful (keyboardful?), but hopefully it helps to explain things a bit. Now we need to start up our GenServer and get it supervised, so off we go to lib/stats_yard.ex where we'll add a new function to be called from start/2:

    A Note On Running Multiple Aggregators

    I can hear you asking now - why so many gauges in the stat shipper module? Well, there's a bit of a headache that can crop up when you run multiple stats aggregators like we're doing here with Folsom and StatsD. Both tools are going to store incoming values and do some math on them, and maintain some rolling / windowed values as well. If we let both tools do this, however, then by the time our metric data makes it to the back-end, we're going to be seeing skewed (and often, watered-down) values. We can generally avoid this by allowing Folsom to do the math and windowing, and ship the results to StatsD as gauge values. This causes StatsD to consider the value to be a point-in-time value that's not dependent on any past events, and thus passes it through as-is.

    Last But Not Least - VM Stats

    For grabbing VM stats (memory utilization breakdown, process counts, queue sizes, etc.), we're going to start out with ex_vmstats. It does its job right out of the box, but we need to make it work with Statix first. As is mentioned in the Github readme for ex_vmstats, we need to create our own Statix backend and then specify it in our mix config. Let's take care of the backend module by creating a new file named ex_vmstats_statix_backend.ex:

    Now all that's left is to start up the app and we'll start sending VM stats to the back-end!

    Checking It All Out

    To help test out our shiny new stats without having to do a lot of super repetitive typing, I've added a small module called StatsYard.Util.IngestStress. It can be used as such from an iex -S mix session: 2_000_000

    This will lazily generate 2 million valid DataPoint structs and shovel them into our ingest queue. I took the liberty of doing this locally and grabbing a screenshot of my Sysdig dashboard to give an idea of what's going on:

    Cool! Looks like my current setup can validate some ~15,000 data points per second, with each generally taking under 20 microseconds. Excellent! There are some other cool VM stats that I've graphed here too, which are definitely worth taking a look at as we move forward.

    Next Time

    A bit of a departure from what was promised last time, we'll just be looking at the data path next time 'round. Worker pools will have to wait until we actually start writing some data to disk!

    Until next time, feel free to peruse the code for this post at