Monday, June 27, 2016

Let's Build Something: Elixir, Part 6 - Adding Metrics to StatsYard

We're going to take a bit of a detour with this post and look at what we can do to get some basic metrics spit out of StatsYard! Not only am I a huge metrics geek (so this is a pretty strong urge for me), but it's also something that will be helpful as we go forward. Implementing new functionality and fixing up old code is generally made more difficult by not having decent measurements, and laying the groundwork for that now will make it easier to keep up with as we go along.

The old adage that premature optimization is the root of all evil still holds, of course, even for a metrics-obsessed type like me. While there's value in checking that we haven't introduced a major performance impact to the system, it just won't do to nitpick every little bit and try to tease out a tiny bit more performance at this stage of the game. I mostly just intend to use these metrics to get an idea of the system's capabilities, and as a learning opportunity when we start looking at alternative compilers and organizational structures.

An Overview

So! Let's get into it, shall we? The high level view of what we're going to implement is this:
    • Custom stats should be super easy to instrument as we build new functions and features
    • VM stats are awesome, but not something we should have to gather ourselves
    • Stats need to be shipped somewhere, preferably quickly and easily
    All of this will evolve over time, for sure, but this seems like a good start.

    O Back-End, Where Art Thou?

    Before we get too far, we need a place to stick our metrics! I'm actually using Sysdig Cloud for this, because they have an agent that beats all the rest, and their UI will suit my purposes nicely. The key feature of their agent that I'm leaning on is the built-in statsd server that will ship my stats to their back-end so that I don't have to worry about it. Fret not, though - nothing in this post will demand that anyone else use Sysdig, that's just where I'm going to throw data for now.

    Wiring Up Our New Dependencies

    First things first, let's add all of our dependencies and start up the apps as needed. We'll just lump them all in for now, and spend the rest of this post looking at how to make use of them. Crack open mix.exs and add the :folsom, :statix, and :ex_vmstats dependencies:
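
    The gist of that change looks something like this - don't take the version requirements as gospel, just pin whatever's current for you:

        defp deps do
          [
            {:folsom, "~> 0.8"},
            {:statix, "~> 0.8"},
            {:ex_vmstats, "~> 0.0.1"}
          ]
        end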


    Still in mix.exs, we also need to tell our project to start these new applications, otherwise they won't do their job at startup:
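
    In an Elixir 1.2-style mix.exs that means adding them to the :applications list in application/0 - roughly like so (your existing entries may differ a bit):

        def application do
          [
            applications: [:logger, :folsom, :statix, :ex_vmstats],
            mod: {StatsYard, []}
          ]
        end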


    Our First Custom Metric

    Like I mentioned earlier, we're going to be using Folsom for custom stats. It's an Erlang app, and it's a pretty good analog to Java's codahale.metrics, now known as Dropwizard. The Github readme has pretty much all the info you could need about how it works, so we'll jump straight into instrumenting a call to one of our functions - DataPoint.validate/1. Open up lib/stats_yard/ingest_consumer.ex, and we'll wrap our function call with a :folsom_metrics function:
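
    Give or take the exact metric names (I'm using validate.time and validate.rate here), the instrumented consumer ends up looking something like this sketch:

        defp create_metrics do
          :folsom_metrics.new_histogram(:"validate.time")
          :folsom_metrics.new_counter(:"validate.rate")
        end

        def init(queue_pid) do
          create_metrics()
          spawn_link(fn -> consume_data_points(queue_pid) end)
          {:ok, queue_pid}
        end

        def consume_data_points(queue_pid) do
          queue_pid
          |> BlockingQueue.pop_stream()
          |> Stream.map(fn item ->
            # Time the validation, record it in the histogram, and get back
            # validate/1's own return value untouched
            result = :folsom_metrics.histogram_timed_update(:"validate.time", fn ->
              StatsYard.DataPoint.validate(item)
            end)

            # Bump the counter once per data point validated
            :folsom_metrics.notify({:"validate.rate", {:inc, 1}})

            case result do
              {:ok, data_point} -> data_point
              {:invalid, _}     -> nil
            end
          end)
          |> Stream.run
        end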


    We first have to define our metrics so that Folsom can get the appropriate structures set up in memory, so we add a private function create_metrics/0 and then call it when we init our GenServer.

    Next we wrap our validate/1 function with Folsom's histogram_timed_update/2. This Folsom function is essentially a pass-through: it accepts a function of our choosing as one of its arguments, evaluates that function, does some extra work along the way, and then returns the evaluated function's value. In this case, the extra work is timing the validation and updating the ETS table where Folsom does its math and keeps the history for our metric. From our perspective, though, almost nothing has changed. We don't get any new return values from validate/1, so we can still use it in the same manner we wanted to originally.

    After we've validated the data point and gotten a timing for the invocation, we increment the counter validate.rate, and then move on!

    Dude, Where's My Data?

    So we created some metrics and updated their values - why don't we see anything happening in our statsd server??? As it turns out, Folsom's only goal in life is to keep track of our stats, not to send them anywhere. The beauty of Folsom is that it keeps our stats, and the math it does on them, in ETS tables, which reside in memory and are extremely fast. With Folsom handling the history and access paths of our metrics, we're left with all the flexibility in the world when it comes to choosing what to do with the data that's tracked. So, to that end, let's wire up Statix so we can actually send these bits somewhere!

    Setting up Statix

    It doesn't take much to get started with Statix. At a minimum, we need to:

    • Define a config for Statix in config/config.exs
    • use Statix in a module (any module), which will pull in a handful of useful functions that we can use to ship metrics to a statsd server
    • Make a quick call to the connect/0 function that our Statix-using module now has, and we're off to the races with such functions as gauge/2, increment/2, decrement/2, etc.

    Our config is simple enough. Clearly one should change the host value to something appropriate as needed, but generally this is fine:
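
    Something along these lines in config/config.exs does the trick (the prefix is just my choice, and host/port should point at wherever your statsd server lives):

        config :statix,
          prefix: "stats_yard",
          host: "127.0.0.1",
          port: 8125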


    That's all well and good, but we need to wire up our in-memory Folsom data with Statix so that it's actually of some use. Enter our shiny new StatsYard.StatShipper module!

    In the file lib/stats_yard/stat_shipper.ex we'll place the following:


    Whew! Lots going on here, so let's talk about it at a high level, then we'll jump into the line-by-line breakdown. The general idea here is to build a supervised GenServer that will, on a certain interval, iterate across all metrics known to Folsom and ship them out via Statix. This keeps the shipping of metrics out of the data path for actually collecting the metrics, and since it's supervised separately at the top level, its workload isn't tied to any other part of the app.
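
    The interval-driven skeleton at the heart of it looks roughly like this - I've elided the per-metric-type shipping details, and the line-by-line notes below refer to the full module:

        defmodule StatsYard.StatShipper do
          use GenServer

          @default_ship_interval 1000

          def start_link(opts \\ []) do
            GenServer.start_link(__MODULE__, opts)
          end

          def init(opts) do
            ship_interval = Keyword.get(opts, :ship_interval, @default_ship_interval)
            schedule_shipment(ship_interval)
            {:ok, %{ship_interval: ship_interval, last_result: nil}}
          end

          defp schedule_shipment(ship_interval) do
            # Tell ourselves to ship metrics once ship_interval milliseconds have elapsed
            Process.send_after(self(), :ship_metrics, ship_interval)
          end

          def handle_info(:ship_metrics, %{ship_interval: ship_interval} = state) do
            result = ship_metrics()
            schedule_shipment(ship_interval)
            {:noreply, %{state | last_result: result}}
          end

          defp ship_metrics do
            # Walk every metric Folsom has registered and ship the appropriate
            # values via Statix gauges (per-type handling elided in this sketch)
            Enum.each(:folsom_metrics.get_metrics_info(), fn {_name, _info} -> :ok end)
          end
        end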

    Let's dive in and see what exactly is happening here:
    • Line 9-10: Folsom's histogram and meter metric types are actually collections of various related values, but we don't necessarily always want all of them. Here we define the ones we want for now
    • Line 12: This GenServer needs to do work every N milliseconds, and we default N to 1000
    • Line 17: Our init/1 function does nothing more than schedule the first metrics shipment, then lets everything run itself from here on out
    • Line 21-23: This function shows how easy it is to do something after a set time. Process.send_after/3 does exactly what it sounds like - sends a message to an arbitrary PID after the specified time. In this case, the GenServer is sending itself a message telling itself to ship metrics, but waits ship_interval milliseconds before doing so
    • Line 25: We're using handle_info/2 here, because it's the GenServer callback that handles all arbitrary messages sent to the GenServer that are NOT handled by any other callback. Our message telling the server to ship metrics is, indeed, one such arbitrary message
    • Line 26: Here we call the function that actually kicks off the iteration over Folsom's metrics and then ships them off. We also set the GenServer's state to the return value of this function, which we could use (but aren't currently) to terminate the server if metrics shipping failed
    • Line 27: Once metrics have been shipped, schedule the next shipment for ship_interval from now
    • Line 32-41: For every metric that Folsom has registered, deconstruct its metadata tuple and take appropriate action based on the metric's type
    • Line 50-54: Break apart a meter metric and only ship the bits that we've allowed with @shippable_meter_stats
    • Line 60-71: Histogram metrics in Folsom are a bit of a mix of data types, so we need to walk every keyword in the list and, when we come across the :percentile keyword, handle it appropriately. All other keywords are shippable as-is
    I know, it's a mouthful (keyboardful?), but hopefully it helps to explain things a bit. Now we need to start up our GenServer and get it supervised, so off we go to lib/stats_yard.ex where we'll add a new function to be called from start/2:
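
    That new function is just another small supervisor - something like this, give or take the naming:

        def start_stat_shipper do
          # Supervisor.Spec is already imported at the top of this module
          children = [
            worker(StatsYard.StatShipper, [])
          ]

          opts = [strategy: :one_for_one, name: StatsYard.StatShipperSupervisor]
          Supervisor.start_link(children, opts)
        end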


    A Note On Running Multiple Aggregators

    I can hear you asking now - why so many gauges in the stat shipper module? Well, there's a bit of a headache that can crop up when you run multiple stats aggregators like we're doing here with Folsom and StatsD. Both tools are going to store incoming values and do some math on them, and maintain some rolling / windowed values as well. If we let both tools do this, however, then by the time our metric data makes it to the back-end, we're going to be seeing skewed (and often, watered-down) values. We can generally avoid this by allowing Folsom to do the math and windowing, and ship the results to StatsD as gauge values. This causes StatsD to consider the value to be a point-in-time value that's not dependent on any past events, and thus passes it through as-is.

    Last But Not Least - VM Stats

    For grabbing VM stats (memory utilization breakdown, process counts, queue sizes, etc.), we're going to start out with ex_vmstats. It does its job right out of the box, but we need to make it work with Statix first. As is mentioned in the Github readme for ex_vmstats, we need to create our own Statix backend and then specify it in our mix config. Let's take care of the backend module by creating a new file named ex_vmstats_statix_backend.ex:
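
    I won't swear to the exact shape ex_vmstats expects here - its readme is the authority - but the core of the backend is just a module that pulls in Statix, along these lines:

        defmodule StatsYard.ExVmstatsStatixBackend do
          # use Statix gives this module gauge/2, increment/2, and friends,
          # plus the connect/0 we need to call at startup
          use Statix
        end

    Then the mix config gets pointed at that module (again, check the ex_vmstats readme for the exact key it wants):

        config :ex_vmstats, backend: StatsYard.ExVmstatsStatixBackend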


    Now all that's left is to start up the app and we'll start sending VM stats to the back-end!

    Checking It All Out

    To help test out our shiny new stats without having to do a lot of super repetitive typing, I've added a small module called StatsYard.Util.IngestStress. It can be used as such from an iex -S mix session:

    StatsYard.Util.IngestStress.run 2_000_000
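
    Under the hood it's nothing fancy - roughly this, with the struct contents and queue name being arbitrary choices on my part:

        defmodule StatsYard.Util.IngestStress do
          alias StatsYard.DataPoint

          def run(count) do
            Stream.repeatedly(fn ->
              %DataPoint{metric: "stress.test.metric", entity: "stress.test.entity", value: 42}
            end)
            |> Stream.take(count)
            |> Enum.each(fn data_point ->
              # :main_ingest_queue is the registered name of our BlockingQueue process
              BlockingQueue.push(:main_ingest_queue, data_point)
            end)
          end
        end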

    This will lazily generate 2 million valid DataPoint structs and shovel them into our ingest queue. I took the liberty of doing this locally and grabbing a screenshot of my Sysdig dashboard to give an idea of what's going on:


    Cool! Looks like my current setup can validate some ~15,000 data points per second, with each generally taking under 20 microseconds. Excellent! There are some other cool VM stats that I've graphed here too, which are definitely worth taking a look at as we move forward.

    Next Time

    In a bit of a departure from what was promised last time, we'll just be looking at the data path next time 'round. Worker pools will have to wait until we actually start writing some data to disk!

    Until next time, feel free to peruse the code for this post at https://github.com/strofcon/stats-yard/tree/lets-build-6.

    Wednesday, May 11, 2016

    Let's Build Something: Elixir, Part 5b - Testing Data Validation and Logging

    Let's wrap up this installment by writing some unit tests to safeguard us against bad changes to our data validation code!

    First we'll put together a couple of tests to ensure that validation and invalidation are working correctly:
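
    Those first-pass tests look more or less like this - the struct contents are arbitrary, and the {:ok, ...} / {:invalid, ...} shapes are what validate/1 hands back:

        defmodule DataPointTest do
          use ExUnit.Case

          alias StatsYard.DataPoint

          test "a well-formed DataPoint is validated" do
            good_data_point = %DataPoint{metric: "cpu.util", entity: "host1", value: 99.9}
            assert {:ok, ^good_data_point} = DataPoint.validate(good_data_point)
          end

          test "a malformed DataPoint is invalidated" do
            bad_data_point = %DataPoint{metric: :not_a_string, entity: "host1", value: 99.9}
            assert {:invalid, ^bad_data_point} = DataPoint.validate(bad_data_point)
          end
        end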


    Let's see how we did:


    Well, it certainly worked, but it's a bit clunky. Two things stand out:
    1. We should probably test multiple invalid DataPoints, since there are multiple guards that we want to make sure we got right. 
    2. We probably don't want those log messages showing up in the test output. We likely want to try and capture those to ensure that we are indeed logging appropriately, without cluttering the output of the test run.
    So let's do that! First up, let's add some more bad DataPoint structs to test. We'll switch to using a list instead of individually-named structs for our tests, and we'll also do the same for the valid DataPoints - just in case we decide to test multiple ways there. Consistency FTW!


    And the run:


    Cool! We made our logging problem even worse, though. Let's take care of that. To make this work, we're going to use ExUnit's capture_log/2 to capture the output from Logger, spit out by our DataPoint module:
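
    Here's roughly where that lands (the log message text and struct lists are stand-ins) - note the :info level on the first test, which comes up again in just a second:

        # inside DataPointTest, which already aliases StatsYard.DataPoint
        import ExUnit.CaptureLog

        @valid_data_points [
          %DataPoint{metric: "cpu.util", entity: "host1", value: 99.9},
          %DataPoint{metric: "mem.used", entity: "host2", value: 1024}
        ]

        @invalid_data_points [
          %DataPoint{metric: :not_a_string, entity: "host1", value: 99.9},
          %DataPoint{metric: "cpu.util", entity: 42, value: 99.9},
          %DataPoint{metric: "cpu.util", entity: "host1", value: "nope"}
        ]

        test "valid DataPoints are validated and logged" do
          for data_point <- @valid_data_points do
            assert capture_log([level: :info], fn ->
              assert {:ok, ^data_point} = DataPoint.validate(data_point)
            end) =~ "Valid DataPoint"
          end
        end

        test "invalid DataPoints are invalidated and logged" do
          for data_point <- @invalid_data_points do
            assert capture_log([level: :warn], fn ->
              assert {:invalid, ^data_point} = DataPoint.validate(data_point)
            end) =~ "Invalid DataPoint"
          end
        end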


    I inserted a small bug into the test for valid DataPoints - do you see it? No? It's pretty subtle, no worries. I told capture_log/2 that I'm expecting a Logger message with level :info, but our validation function is actually logging valid structs as :debug messages. Let's see how this works out:


    So capture_log/2 swallows the log event, but the assertion fails since the level was mis-matched. Handy!

    Switch the valid capture_log/2 call back to :debug and we're good to go:


    (Note: I'm not sure yet how to make it stop complaining about unused variables here. They're certainly used, so I'm not sure why it's not happy about my use. I'll dig into that and iron it out in a later commit.)
    Shout-out to DNNX for pointing out the problem with the unused variables in the DataPoint tests!  I've also updated the gist in this post to reflect his change. Check out his pull request for an explanation of what was happening here. Many thanks Viktar!

    Next Time

    We'll continue next time by introducing a worker pool that we can use to distribute our data validation workload over, and start sketching out how the write path will look when we start persisting data to disk.
    Until then, feel free to peruse this post's code on Github at https://github.com/strofcon/stats-yard/tree/lets-build-5

    Monday, April 18, 2016

    Let's Build Something: Elixir, Part 5a - Data Ingest, Consumption, and Validation

    Whew! Been a little while, but let's keep cruisin'! This installment will tackle defining a custom data type for our data points, providing a means by which we can queue up ingested data and consume it continuously, and validate it as we go along. We'll also write some tests to keep our sanity.

    First let's define what our data points should look like. We'll keep it simple for now, and can expand later. Open up a new file at lib/stats_yard/data_point.ex:
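
    Pieced together, the module looks something like this (the log messages and default field values are just my choices):

        defmodule StatsYard.DataPoint do
          require Logger

          @type t :: %__MODULE__{metric: String.t, entity: String.t, value: number}
          defstruct metric: nil, entity: nil, value: 0

          def validate(%__MODULE__{metric: metric, entity: entity, value: value} = data_point)
              when is_binary(metric)
              and  is_binary(entity)
              and  is_number(value) do
            Logger.debug "Valid DataPoint: #{inspect data_point}"
            {:ok, data_point}
          end

          def validate(bad_value) do
            Logger.warn "Invalid DataPoint: #{inspect bad_value}"
            {:invalid, bad_value}
          end
        end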


    There are a couple of things happening here that you might be interested in:

    • Line 4: We define a type for our DataPoint struct. Note that __MODULE__ is just a safe way to reference the name of the current module (StatsYard.DataPoint). We can later reference this as StatsYard.DataPoint.t when the need arises.
    • Line 5: Define our basic DataPoint struct. Note that structs always inherit the name of the module in which they are defined. If we wanted it to be referenced as something other than %StatsYard.DataPoint{} we would need to define a module within the outer module, such as StatsYard.DataPoint.Thing. There's no real need for that in this case.
    • Line 7-10: Set up a validation function that will only work when the argument passed is one of our shiny new structs, and the various keys therein pass our guards. Specifically we want the metric and entity fields to be strings (or binaries, in Elixir/Erlang land), and the value to be some type of number.
    • Line 11-12: If we end up in this version of the validate function, log a message and return a tuple to indicate success and spit back out the provided struct.
    • Line 15: Define the "fall-through" validate function that will match on any argument that doesn't pass the above guards. In this case, log a warning and return an :invalid tuple with the provided value included.
    This module is intended to wrap up the structure of the data we want to use and the functions that are relevant in inspecting and validating it.

    Next up, let's add something to let us queue up incoming data points. I like Joe Kain's BlockingQueue. It's a statically-sized GenServer'd FIFO queue that blocks when full or empty. Super simple, and very effective. NB: Joe also has a super awesome blog called Learning Elixir that you really should check out.

    First up we need to add it to our deps list in mix.exs:
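
    The addition to deps/0 is a one-liner (the version requirement is approximate - grab whatever's current):

        defp deps do
          [
            {:blocking_queue, "~> 1.2"}
          ]
        end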


    Then follow that up with a mix deps.get, and we're ready to roll.

    First let's walk through the idea here. I want to have a BlockingQueue GenServer that catches our ingested DataPoints, and a separate consumer GenServer that will pop those DataPoints off the ingest queue and validate them.  The most important part of all this is that I don't want to do things in batches, nor do I want to have to explicitly trigger the consumption of data from the ingest queue. Enter supervision and streams!

    BlockingQueue's API gives two functions for popping events off a queue: pop/1 and pop_stream/1. As you might have guessed, pop/1 removes and returns the oldest single value from the queue, while pop_stream/1 returns a Stream function - specifically, Stream.repeatedly/1. If you're unfamiliar with Streams, they're effectively a safe mechanism by which you can perform actions on arbitrarily large data sets without having to pull the entire thing into memory. I'm not the best to describe these in great detail, but Joe Kain and Elixir's Getting Started guide have some good descriptions and applications.

    So the layout of these bits is going to be something like this:

    1. Start up a named and supervised BlockingQueue GenServer
    2. Start up a named and supervised consumer GenServer that can find the above BlockingQueue process
    3. Consumer process grabs hold of the Stream function returned by BlockingQueue.pop_stream/1 and proceeds to validate every DataPoint that gets pushed into the queue

    Here's our consumer:
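
    Modulo the exact stream plumbing, it looks like this:

        defmodule StatsYard.IngestConsumer do
          use GenServer

          alias StatsYard.DataPoint

          def consume_data_points(queue_pid) do
            queue_pid
            |> BlockingQueue.pop_stream()
            |> Stream.map(fn item ->
              case DataPoint.validate(item) do
                {:ok, data_point} -> data_point
                {:invalid, _}     -> nil
              end
            end)
            |> Stream.run
          end

          def start_link(queue_pid, opts \\ []) do
            GenServer.start_link(__MODULE__, queue_pid, opts)
          end

          def init(queue_pid) do
            spawn_link(fn ->
              consume_data_points(queue_pid)
            end)

            {:ok, queue_pid}
          end
        end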


    So here's what's going on in this module:

    1. Line 6: This is my nifty way of telling this function "keep an eye out for things popping up in this stream of data, and take appropriate action when you see a new item." We'll test this out shortly in iex
    2. Line 7: Try to validate, as a DataPoint, every item that comes off the stream
    3. Line 8-9: If the validation succeeds, return the DataPoint, otherwise discard it entirely (remember that our StatsYard.DataPoint.validate/1 function will log a warning when a value fails validation)
    4. Line 18: Note that our public start_link/2 function expects to receive the PID of our ingest queue, which we'll provide in lib/stats_yard.ex when we set up our supervision tree
    5. Line 23-25: Start up a linked process that will kick off our queue consumption loop in consume_data_points/1
    6. Line 27: Set our GenServer's state to the PID of our ingest queue, and we're done!
    Notice that this is a very simple GenServer - so simple, in fact, that it doesn't even have any direct means of interaction. For now, this is more than sufficient - we just want something that we can supervise and organize appropriately, with the option to extend it for more robust behavior in the future. (For the studious, you're right - there's always Elixir's GenEvent, but that's for future posts!)

    Now let's rig all this up in our supervision tree, and then we'll poke around in iex to see if it's all working as expected. Notice that this has been cleaned up a bit to accommodate our TimestampWriter bits without getting too cluttered:
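
    Give or take names, queue size, and exactly where things land, the restructured file looks like this:

        defmodule StatsYard do
          use Application
          import Supervisor.Spec, warn: false

          def start(_type, _args) do
            start_timestamp_writer()
            start_main_ingest()
          end

          def start_timestamp_writer do
            children = [
              worker(StatsYard.TimestampWriter, [])
            ]

            opts = [strategy: :one_for_one, name: StatsYard.TimestampWriterSupervisor]
            Supervisor.start_link(children, opts)
          end

          def start_main_ingest do
            children = [
              # Max queue depth of 5_000 is arbitrary here; the registered queue
              # name works anywhere a queue pid is expected
              worker(BlockingQueue, [5_000, [name: main_ingest_queue_name()]]),
              worker(StatsYard.IngestConsumer, [main_ingest_queue_name(), [name: main_ingest_consumer_name()]])
            ]

            opts = [strategy: :one_for_one, name: StatsYard.MainIngestSupervisor]
            Supervisor.start_link(children, opts)
          end

          # Simple helpers for the atoms we use to name our GenServer processes
          def main_ingest_queue_name, do: :main_ingest_queue
          def main_ingest_consumer_name, do: :main_ingest_consumer
        end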


    (I know that's a big chunk of code to dump into a blog post - my apologies. I mostly wanted to be sure to point out that the structure of this stuff changed significantly. Newer changes will be limited to just the diffs. :-) )

    Nothing super exciting here, other than a few things to note:
    1. All supervisors are now started up in their own independent functions, which are called from the pared-down start/2 function
    2. Our supervisors are now named appropriately (Lines 20 and 36)
    3. Our start_main_ingest/0 function lists two workers to be started up under the appropriate supervisor, which will start in the order listed (this will be on the quiz)
    4. Atoms used to name our GenServer processes are pulled out and returned from simple functions at the bottom of the file, so as to avoid headaches later
    Enough work, let's play with it! Fire up iex and we'll see if our stuff works:


    Cool! We're able to push things into our BlockingQueue without having to know much about it, and our IngestConsumer immediately received the pushed values and attempted to validate them, the results of which are spit back out via log messages.

    Now for that quiz I mentioned earlier: in what order were our two ingest GenServers started? Yup, the order listed in our supervision tree definition - the queue first, then the consumer. Why does this matter? 

    There's a failure case that we need to recognize and accommodate. Specifically, if our ingest queue process dies, it will indeed be restarted by the supervisor... but our consumer process will merrily chug along holding onto a Stream function that references a now-dead process! That sounds like bad news, but let's verify that I'm not making stuff up:


    I know that's a bit dense, but the gist (har har) of it is that we used Supervisor.which_children/1 to see what the PIDs of our two GenServers were, stopped the main_ingest_queue process (rather rudely, too, a la :kill), then checked to see that the expected PID had indeed updated in the supervisor's state. Then we tried to push a value into the main ingest queue, which did indeed work since it had been restarted, but our ingest consumer process never knew about it, because it's waiting for events to flow in from a dead process. That's lame, so let's fix it!

    Turns out, this is a super simple one-line fix, but reading the docs is a must in order to understand why this fix is appropriate (head over to the Supervisor docs, then search for "rest_for_one"). In lib/stats_yard.ex:
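
    In other words, the strategy in our main ingest supervisor's options changes from :one_for_one to :rest_for_one:

        opts = [strategy: :rest_for_one, name: StatsYard.MainIngestSupervisor]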


    And now to test it out in iex:


    Woohoo! Worked like a charm. What's happening here? First, read the docs. :-) Second, in a nutshell, using the strategy rest_for_one causes the supervisor to consider the position of any process that dies under its supervision and then kill and restart all supervised processes that were started subsequent to the original dead process. In our case, the queue process is the first one, so if it dies, everything in the supervision tree of our MainIngestSupervisor is restarted. If it were, for example, the 3rd process started by this supervisor, then the 4th, 5th, ..., nth processes would be restarted, while the 1st and 2nd processes would be left alone. Super handy stuff here!

    To Be Continued...

    So now we're in a good place from a supervision point of view. This post is already pretty lengthy, so I'm going to title it "Part 5a," and we'll continue with some unit tests and documentation in Part 5b.

    Til next time!

    Thursday, April 14, 2016

    Let's Build Something: Elixir, Part 4 - Better Tests, TypeSpecs, and Docs

    We left off with our first test case working, but less-than-ideal. Specifically, it's leaving the timestamp file it writes sitting on the disk, and in an inappropriate location (the root of our project). This is super lame, and we should fix that.

    Enter ExUnit callbacks and tags! ExUnit allows us to pass configuration data into and out of our tests by means of a dictionary, usually referred to as the "context". We can make good use of this context data by way of setup callbacks and tags. These are described well in the docs, and we'll lean on their examples for what we need to accomplish here.

    So our test is currently testing TimestampWriter's ability to... ya know... write timestamps. And it works great, other than leaving the temp file sitting in the root of our project. While we could just add some code to our test to explicitly handle this, a better (and less-repetitious) approach is to modify our overall test case to do some setup and tear-down tasks for us automatically!

    First, remove the junk file leftover at stats_yard/tstamp_write.test, and then we'll add a setup callback to our test case that will force our writes to happen in an appropriate directory:
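
    With the tag in place, the test case ends up shaped like this (the assertion itself is a rough stand-in):

        defmodule TimestampWriterTest do
          use ExUnit.Case, async: false

          setup context do
            if cd = context[:cd] do
              original_dir = File.cwd!
              File.cd!(cd)
              on_exit fn -> File.cd!(original_dir) end
            end

            :ok
          end

          @tag cd: "fixtures"
          test "writes a timestamp to a file" do
            assert StatsYard.TimestampWriter.write_timestamp("tstamp_write.test", 1_459_134_853_371) == :ok
          end
        end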


    Here we're exercising a one-way communication from our test to the setup callback by way of a tag. What's happening here is that ExUnit will call our setup callback before execution of every test within our test case. By preceding a test definition with @tag, we are specifying that a context dictionary should be passed to our setup callback that contains a key-value pair of { cd: "fixtures" }. This is mostly copy-pasta'd straight out of the ExUnit docs, but a bit of explanation can't hurt:
    • Line 2: We need to make sure our tests don't run in parallel since we're going to be switching directories. It sucks, but it's the nature of the beast
    • Line 4: Define our setup callback, which will be executed prior to every test that is run
    • Line 5: Check to see if our cd tag is present in the callback's current context dict. This is necessary because the same callback is executed for every test, but not every test will necessarily use this particular tag
    • Line 6-7: Store the current directory and switch to the directory specified in the context
    • Line 8: When the test exits (whether success or fail), switch back to our original directory
    • Line 14: Our handy-dandy tag for the test that immediately follows
    Let's see if it works!


    Nope! ExUnit apparently doesn't create directories for you. Oops. Easy fix, and again:


    Much better. Now we should do some cleanup after the fact, because let's face it - no one wants temp files committed to their repo.


    A quick rundown of the updates (slightly out of order):

    • Line 23: Add a `tempfile` tag to our test to indicate that we're going to (attempt to) write a transient file
    • Line 24: Make our test accept a dict argument called `context` (which will contain our `tempfile` key)
    • Line 25: To keep things DRY, refer to the context's value for :tempfile instead of repeating the filename explicitly
    • Line 9: When the test is done, check to see if a tempfile was specified for the test that's being set up
    • Line 10: Make sure the tempfile actually got written, otherwise Line 11 will blow up
    • Line 11: Call the "dirty" version of File.rm/1, just in case there are any weird permissions issues that prevent deletion of the file
    So now we should be able to run our test, and see precisely zero remnants of it:


    Perfect! Now we can write more tests in here and gain some nice organization and cleanup bits without having to provide anything beyond a couple of appropriate tags. (And after all that, yes, I do realize that a tempfile doesn't necessarily need to go into its own directory if it's just going to be immediately deleted. This just makes me feel better.)

    To wrap up, let's make our GenServer's module and API a bit more legit with a typespec and docstrings:
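
    For write_timestamp/2 that boils down to a fragment like this - the wording and the exact spec are mine, so season to taste:

        @doc """
        Writes the given `timestamp` to the file at `path`.

        Returns `:ok` on success, or `{:error, reason}` if the write fails.
        """
        @spec write_timestamp(String.t, integer) :: :ok | {:error, atom}
        def write_timestamp(path, timestamp) do
          GenServer.call(__MODULE__, {:write_timestamp, path, timestamp})
        end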


    The @moduledoc and @doc directives are pretty straightforward - wrap up your docstrings in the """ markers, and get yo' docs on. Keep in mind that the docs are in markdown, so you can (and really should) make them look pretty.

    The @spec directive on Line 22 is simply a way to specify the types of the arguments our function can accept, and the type of value it will return. Easy stuff, and super helpful when we start looking into static analysis - it can help iron out a ton of bugs early on.

    Next Time

    Now that we've spent some time on some of the basics that we'll be seeing over and over again, the next post will get into more of the meat of our project and start doing stuff that's more interesting than writing a timestamp to a file. Specifically, we'll define the first iteration of our data format and figure out a way to represent that in code such that we can validate incoming requests for appropriate structure.

    Until then, feel free to peruse the source for this post at: https://github.com/strofcon/stats-yard/tree/lets-build-4









    Monday, April 4, 2016

    Let's Build Something: Elixir, Part 3 - Getting Started with ExUnit for Testing

    NOTE: Before you get too far into this one, I want to mention that I realized I wasn't following convention in my Elixir file names, so there's a commit at the beginning of this post's branch that fixes it (and it's been merged to master as the others have as well). Just a heads-up in case it seems weird all of a sudden. :-)

    Last time we made our TimestampWriter GenServer a supervised process to make it more resilient to bad inputs and other process-killing events. Now it's time to protect our GenServer from a much more sneaky and persistent assailant - us! This seems like a good time to get familiar with ExUnit and build our first test case for StatsYard.

    Defining ExUnit test cases is pretty similar to defining any other module in our Elixir project. If you pop into the stats_yard/test directory and take a look at stats_yard_test.exs, you'll see a simple example test:
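
    It's the stock file that mix generates, something very close to this:

        defmodule StatsYardTest do
          use ExUnit.Case
          doctest StatsYard

          test "the truth" do
            assert 1 + 1 == 2
          end
        end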


    Running this test is as easy as a quick mix test:


    Let's break that test down just a bit:
    • Line 1: As mentioned above, a test case is simply an Elixir module
    • Line 2: Pull in ExUnit's test case bits
    • Line 3: This line will cause ExUnit to do some magic that we'll discuss at a later date
    • Line 5: Defines a unit test with an arbitrary string label
    • Line 6: Makes the bold assertion that 1 + 1 does in fact equal 2
      • assert is basically just shorthand (or more accurately, a macro) that says "everything after the keyword 'assert' better be true, otherwise I'm gonna blow up and fail spectacularly". (There's a bit more to it, and we'll tackle that next.)
    To see assert do its thing in a less-than-true situation, we can just change the 2 to a 3 on Line 6 and run  mix test:


    In a nutshell, what happened here is that we insisted 1 + 1 = 3, and assert totally called us on it. What were we thinking???

    There's some interesting stuff in that output block. First it tells us which test failed ("the truth"), what file that test lives in (test/stats_yard_test.exs), and the line number that the test definition starts on (:5). After that, it tells us the general type of 'thing' we were trying to do (assert with ==) and shows us the specific assertion that failed.

    Next up are two interesting and very helpful lines: lhs and rhs. These acronyms stand for "left-hand side" and "right-hand side" respectively, and these lines actually give us some insight into the way the test actually works under the hood. If you haven't encountered them before, lhs and rhs are hyper-relevant to one of Elixir's most powerful features: pattern matching!

    These two lines are telling us that ExUnit took our assert expression and made an attempted pattern match expression out of it, with the actual evaluated value on the left-hand side of the match, and the asserted value on the right-hand side, like so:
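
    Here's that same idea, reproduced with the lhs and rhs values from our two runs:

        iex(1)> 2 = 2
        2
        iex(2)> 2 = 3
        ** (MatchError) no match of right hand side value: 3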


    In this iex session we can see an example of both of the test attempts we've tried so far - the first one being the successful test, and the second being the intentional failure. Hopefully this provides a bit of clarity around how ExUnit is actually accomplishing this particular test.

    So that's all fine and dandy, but we really should work on testing our TimestampWriter. Go ahead and switch that pesky 3 back to a 2, and we'll get started!

    First let's create a directory that will hold our tests - it's cleaner, and seems to be the convention used in most projects. Then we'll create a file in there to hold our first real test case (note that test files have to be named "<stuff>_test.exs", otherwise the mix test task will skip them):


    In timestamp_writer_test.exs we'll start out with the bare-bones first increment of our test, BUT we'll try to make it fail first by passing a bad argument to our public API function write_timestamp/2:
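
    The first cut is tiny - a tuple stands in for the known-bad timestamp argument, and the :ok assertion is a rough stand-in:

        defmodule TimestampWriterTest do
          use ExUnit.Case

          test "writes a timestamp to a file" do
            assert StatsYard.TimestampWriter.write_timestamp("tstamp_write.test", {:not, :a, :timestamp}) == :ok
          end
        end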


    (Note that I stopped naming these modules StatsYardTest.*, no point to it as far as I can tell.)

    And a quick run to see what's up:


    Huh... well that's... um... awesome? Not really. The GenServer process did indeed fail like expected, but the tests still technically passed. What gives?

    As it turns out, we tested our public API function, write_timestamp/2, not so much our GenServer. Our function is simply calling GenServer.cast/2 which then asynchronously sends a message to our TimestampWriter process. That send is indeed successful and returns the :ok atom - even though our process dies shortly thereafter - and that's exactly how GenServer.cast/2 is intended to operate.

    So how do we fix that? Well to be entirely honest, I don't know yet. BUT! There is a silver lining - this experience has made me re-think whether or not this particular activity is best handled as a cast or a call, which basically boils down to "should it be asynchronous with no response, or synchronous with a response?" Given the intended purpose of this particular function, I think a call is more appropriate: we're going to need some manner of acknowledgement that our data has indeed been written to disk before moving on to whatever our next task might be.

    So! Back to our TimestampWriter code:
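
    The changes boil down to these two functions (I'm treating the GenServer's state as a simple write counter here):

        # Public API: a synchronous call instead of a fire-and-forget cast
        def write_timestamp(path, timestamp) do
          GenServer.call(__MODULE__, {:write_timestamp, path, timestamp})
        end

        # Callback: handle_call/3 gains the _from argument and replies with the write's result
        def handle_call({:write_timestamp, path, timestamp}, _from, count) do
          result = File.write(path, "#{timestamp}\n")
          {:reply, result, count + 1}
        end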


    To recap the changes here:
    • Line 9: Switch the public function's GenServer.cast/2 to GenServer.call/2
    • Line 24: Switch the callback from handle_cast/2 to handle_call/3, which adds an (unused) argument, _from, for the sender's PID (we don't particularly need this right now, hence the underscore to keep the compiler happy)
    • Line 25: Bind the result of our file write operation to result
    • Line 26: Use the appropriate response from a call, which is to return the :reply atom, a result of some sort and the new state of the GenServer
    Notice a cool thing here, too: our interface to the GenServer didn't change at all, so we don't need to update our test! We should be able to run mix test and see our test fail appropriately:

    (Note: There will still be some extra output after this as a result of our GenServer tanking on bad inputs. We'll try to fix that another time.)

    Perfect! Now if we stop passing a known-bad argument to our public function in the test, we should get a nice passing test:


    Success! Whew. That was a bit of a runaround to get a simple test in place, but I learned a lot, so it doesn't seem like a wasted effort to me.

    As a final cleanup step (for now), I'm going to remove the simple truth test from the out-of-the-box test file that mix creates, because I don't really care for the clutter.

    Experiment

    We left a bit of an unpleasant side effect in place with our test. Hint: our timestamp writer spits out a timestamp somewhere. Figure out where it's landing, then peruse the ExUnit docs and see if you can figure out how to make that stop happening. No need for that clutter! The next blog post will cover how to fix this.

    Next Time

    We're not quite doing TDD, but hey, it's a start! Next time 'round we'll clean all of this up a bit more (as mentioned in the Experiment above) with some typespecs and docs. Exciting stuff, eh? At the end of the day, it's worlds easier to do these things up front, rather than trying to retrofit them later - plus we can make use of them for some testing convenience (or at least that's the hope!)

    For now, you can peruse the source for this post at: https://github.com/strofcon/stats-yard/tree/lets-build-3

    Wednesday, March 30, 2016

    Let's Build Something: Elixir, Part 2 - Supervising Our GenServer

    In Part 1 we built a simple GenServer and worked our way through making it write a timestamp to a local file.

    During that process, we ran into a usage issue - we had to explicitly start the GenServer before we could invoke its write_timestamp/2 function. It makes sense, but it's inconvenient and dependent on human interaction. We should (and can!) fix that.

    Automatically Starting Our GenServer


    The file lib/stats_yard.ex is the main entry point for our app, so that seems like a good place to add a call to StatsYard.TimestampWriter.start_link/0 (ignoring the supervisor bits for now):


    Now to verify that it acts as we'd expect:


    Much better! Now let's see what happens when it dies, as most everything on a computer is likely to do at the worst possible time.

    There are a few ways to easily kill a process in Elixir, but I always enjoy throwing a value into a function that it can't handle and watching the poor process get shredded. I'm a jerk like that. Let's pass a tuple via the timestamp argument and see what happens:


    Excellent! Our GenServer is dead as a hammer. Soooo now what?

    Well, we can restart it manually in our iex session, but that's a non-starter when you consider your app running unattended. We could also just drop out of our session and then start it back up, but that's still pretty lame.

    Enter supervisors! If we take another glance at the chunk of code we saw above, paying special attention to the bits I said to ignore, we'll see the structure we need to use to start up our GenServer as a supervised process:

    • Line 11-14: A list of your Supervisor's child processes, which can be either a regular process or another Supervisor (sub-Supervisor)
      • Each item in this list will contain the module name for a process and will call that module's start_link function
      • Notice that the arguments passed to your start_link function are provided here as a list. This is done so that the Supervisor's start_link function doesn't have to handle arbitrary argument counts (arities) based on your process's start_link arity
    • Line 18: A list of Supervisor options. For now, the provided values are more than sufficient, and we'll talk more about what each one means in a later post
    • Line 19: The start_link call for your Supervisor process, which (shockingly!) looks awfully similar to the call we make for our own GenServer
    Now let's move our GenServer startup bits into the Supervisor's children list, and see what we can make of it:
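
    After the move, start/2 in lib/stats_yard.ex is essentially the stock template with our worker dropped in:

        def start(_type, _args) do
          import Supervisor.Spec, warn: false

          children = [
            worker(StatsYard.TimestampWriter, [])
          ]

          opts = [strategy: :one_for_one, name: StatsYard.Supervisor]
          Supervisor.start_link(children, opts)
        end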


    And the test:


    Woohoo! Our application is now considerably more "self-healing" than it was before. We can start up iex, and immediately start writing timestamps without explicitly starting our GenServer. Then we can crash the GenServer process, and immediately continue on writing timestamps without having to explicitly restart the process. Excellent!


    Experiments

    If you're the curious sort (and you really should be! It's awesome!), you may want to try and poke around with this a bit more by trying to crash our GenServer several times and seeing if the behavior persists. See if you can find a limit to what we've done here, and see if the docs for Elixir's Supervisors can guide you toward a fix.

    Next Time

    We've got a nice simple GenServer automated and supervised, but there's no guarantee that our next batch of code changes won't break something in some unforeseen way. We'll poke around with ExUnit next time, and see if we can start a basic unit test suite for our project.

    For now, you can peruse the source for this post at: https://github.com/strofcon/stats-yard/tree/lets-build-2

    Monday, March 21, 2016

    Let's Build Something: Elixir, Part 1 - A Simple GenServer

    I'm an "ops guy" by trade. That's my main strength, tackling things at all layers of a SaaS stack to fix broken things and automate my way out of painful places. That said, I find that my mind wanders into the realm of, "wouldn't be cool if I could build <insert cool thing here>?" I like to tinker, and I love solving problems, so I tend to poke around with some development projects on the side for fun and for my own education. I also find that I'm better at my "ops day job" for having understood some of what makes the software development machine tick.

    I've been playing with Elixir a bit lately, and it's incredibly attractive for a lot of reasons that I won't get into here. I learn faster when others peruse my code, and I like sharing what I've learned as I go along... Sooooo, let's build something! 


    Erlang (and by extension, Elixir) lends itself very well to highly-concurrent and distributed systems, so I want to build something that leverages that strength and bends my brain in some weird ways. To that end, I want to build a time series database. A very simple one, mind you, but one that provides some measure of useful functionality and takes advantage of the things Elixir, OTP, and BEAM bring to the table.

    This first part will be pretty simple - we'll spin up a new project with mix, define a GenServer that writes the current timestamp to a file, and manually test it out. We'll get into supervision, ExUnit, and such in the next couple of posts.

    Note: My bash commands aren't terribly copy-pasta-friendly in these posts, and I'm OK with that. By simply pasting my actual shell output, you can more easily figure out what directory I'm in at any given time. Fortunately we don't spend much time in bash, at least not in the code snippets.


    Create a New Project

    I'm going to call my awesome new world-changing time series database "StatsYard". We'll use mix to get kicked off:


    This lays down the bones we need to start building stuff. (Note: Even though we named our project "stats_yard", the actual namespace will be StatsYard, as shown below.)

    Define Our GenServer

    Let's create a GenServer that writes the current timestamp plus some message to a file upon request.

    Mosey on over to stats_yard/lib and create a new directory with the same name as our project, stats_yard. Within that directory, create a file named timestamp_writer.ex :


    Crack that bad boy open and let's build a GenServer!

    I like to name my files the same as the module they contain wherever possible, but it isn't required - I just find it eases troubleshooting. The convention I've seen everywhere else is to name your Elixir files after the modules they contain, but in all lower case and with words separated by an underscore ( _ ). Now let's define our module StatsYard.TimestampWriter:
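
    It comes out roughly like this - the file contents written and the counter-as-state are just my choices:

        defmodule StatsYard.TimestampWriter do
          use GenServer

          ##############
          # Public API #
          ##############

          def write_timestamp(path, timestamp) do
            GenServer.cast(__MODULE__, {:write_timestamp, path, timestamp})
          end

          def start_link do
            GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
          end

          #######################
          # GenServer Callbacks #
          #######################

          def init(:ok) do
            {:ok, 0}
          end

          def handle_cast({:write_timestamp, path, timestamp}, count) do
            File.write(path, "#{timestamp}\n")
            {:noreply, count + 1}
          end
        end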



    Not too much special here if you're generally familiar with GenServer:
    • Line 8: A public function to make it easier to invoke the GenServer's callbacks
    • Line 12: start_link is the function we'll use to startup our process running the GenServer
    • Line 20: init is called by start_link and is the way we set the initial state of our GenServer (for now, just the integer 0)
    • Line 24: Our cast callback for actually doing the work. We could have made the timestamp a value that's calculated every time we cast to the GenServer, but for the sake of a TSDB we'll want to be able to accept arbitrary timestamps, not solely "right now" timestamps
    Let's see if it works! From the top level of our project, we'll hop into iex:


    Now let's run the appropriate commands and hopefully we'll see a timestamp written to the file we specify:



    Oops. We must have missed something - GenServer.do_send/2 is unhappy with our arguments. This particular function accepts two arguments: the Process ID (PID) of the GenServer whose callback you're trying to invoke, and the body of the message you want to send. In our case, our public write_timestamp/2 function is actually not calling a private function in our GenServer, but is instead sending a message to the GenServer's PID. That message contains a tuple that the GenServer should pattern match appropriately upon receipt.

    So, where did we go wrong? The message payload ({:write_timestamp, "/tmp/foo.tstamp", 1459134853371}) certainly looks correct, so it seems to be the first argument, the GenServer's PID.

    When you name a GenServer you're effectively mapping an atom to a PID, which allows you to reference the process without having to know its PID in advance. In our case, __MODULE__ equates to :"StatsYard.TimestampWriter", which should then map to our PID.... Oh, right! We don't have a PID, because we never started our GenServer process. Easy fix!




    Now we just need to check our output file and make sure it actually did what it was supposed to:



    Success!

    Next Time

    That's it for now, nothing too interesting just yet. Next time we'll see what happens when our GenServer dies, and what we can do about that.

    For now, you can peruse the source for this post at: https://github.com/strofcon/stats-yard/tree/lets-build-1

    Tuesday, March 15, 2016

    Sysdig Cloud - Monitoring Made Awesome Part 1: Metrics Collection

    Just in case you hadn't noticed before now, I'm a tiny bit obsessed with ops metrics. Most of my preoccupation with metrics seems to stem from the fact that... well... it's hard. "Metrics," for all that term entails, can be a difficult problem to solve. That might sound a bit odd given that there are a thousand-and-one tools and products available to tackle metrics, but with a sufficiently broad perspective the problem becomes pretty clear.

    I've worked with a lot of different metrics tools while trying to solve lots of different problems, but I always tend to come across the same pain points no matter what tool it is that I'm using. As a result, I'm pretty hesitant to really recommend most of those tools and products.

    Fortunately, that may very well be changing with the advent of Sysdig Cloud. Sysdig has really nailed the collection mechanism and is doing some great work on the storage and visualization fronts. This is the first in a series of posts describing how I think Sysdig is changing the game when it comes to metrics and metrics-based monitoring.

    Disclaimer: I am not currently (or soon to be) employed by Sysdig Cloud, nor am I invested in Sysdig Cloud. I'm just genuinely impressed as an ops guy who's got a thing for metrics. I've been using Sysdig for a few months, and I only plan to brag on features that I've used myself. I'll also point out areas where Sysdig is a bit weak and could improve, because let's face it - no one's perfect. :-)

    The Problem(s)

    There are essentially three stages to the life-cycle of a metric: Collection, Storage, and Visualization. Believe it or not, they're all pretty hard to solve in light of the evolving tech landscape (farewell static architectures!). This post will tackle Collection, and is a bit long since they do it so well.

    When Metrics Collection Gets Painful

    For metrics to be of any use you need some mechanism by which to extract / catch / proxy / transport them. The frustrating part is that there are actually several layers of collection that need to be considered. If we take the most complex case - a container-based Platform as a Service - three categories of metric should do the trick: host, container, and application. Handling all of these categories well is difficult - I tried with Stat Badger, and it was... well... a bit unpleasant.

    Host metrics are generally fairly easy to collect and most collectors grab the same stuff, albeit with varying degrees of efficiency and ease of configuration.

    Container metrics aren't too terribly hard to collect, though there are certainly fewer collectors available. The actual values we care about here are usually a subset of our host metrics (CPU % util, memory used, network traffic, etc) scoped to each individual container. This starts to uncover the need for orchestration metadata when considered within a PaaS environment.

    Application metrics can easily become the bane of your existence. Do we have a way to reliably poll metrics out of the app (for example, JMX)? If so, does our collector handle this well or do we need to shove a sidecar container into our deployments to provide each instance with its own purpose-built collector? If not, do we try to get the developers to emit metrics in some particular format? Should they emit straight to the back-end, or should they go through a proxy? If they push to some common endpoint(s), how best can we configure that endpoint per environment? Then once we've answered these questions, how on earth do we correlate the metrics from a particular application instance with the container it's running in, the host the container is running on, the service the app instance belongs to, the deployment and replication / HA policies associated with that service, and so on and so on...???

    Enter the Sysdig agent.

    How Sysdig Makes Metrics Collection Easy

    The Sysdig agent is, right out of the gate, uncommon in its ambition to tackle all layers of the collection problem.

    Most collectors rely exclusively on polling mechanisms to get their data, whether it's by reading /proc data on some interval, hitting an API endpoint to grab stats, or running some basic Linux command from which to scrape details. This works, but it's generally prone to errors when things get upgraded or tweaked, and can be fairly inefficient.

    Sysdig does have the ability to do some of those things to monitor systems such as HAProxy and whatnot, but that's not its main mechanism. Instead, the Sysdig agent watches the constant stream of system events as they flow through the host's kernel and gleans volumes of information from said events. Pretty much anything that happens on a host - with the exception of what happens inside language VMs such as the JVM or BEAM - will result in (or be the result of) a system event that the host kernel handles. This has a couple of huge benefits: it's very low-overhead, and it's immensely hard to hide from the agent. These two core benefits of the base collection mechanism allow for a number of pretty cool features.

    Fine-Grained Per-Process Metrics

    Watching system events allows the Sysdig agent to avoid having to track the volatile list of PIDs in /proc and traverse that virtual filesystem to get the data you want. All of the relevant information is already present in the system events and this opens the door to some really nifty visualization capabilities.

    "Container Native" Collection

    By inspecting every system event that flows by, there's no middle man in snagging container metrics. No need to hit the Docker /stats endpoint and process its output, no worries about Docker version changes breaking your collection, and ultimately no need to limit yourself to Docker for your container needs. This also combines beautifully with the fine-grained per-process metrics to give visibility into the processes within your containers in addition to basic container-wide metrics. It's pretty awesome.

    Automatic Process Detection

    The above two features combine very nicely to allow the Sysdig agent to automatically detect a wide variety of services by their process attributes, simply by having seen a relevant system event flow past on its way to the host kernel. This allows some amazing convenience in monitoring applications since the agent immediately sees when a recognized process has started up - even when it's inside a container.

    For example, if you're running a Kafka container, the Sysdig agent will detect the container starting up, see the JVM process start up, notice that the JVM is exposing port 9092, spin up the custom Sysdig Java agent, inject it into the container's namespace, attach directly to the Kafka JVM process (from within the container, mind you), and start collecting some basic JVM JMX metrics (heap usage, GC stats, etc) along with some Kafka-specific JMX metrics - all for free, and without you needing to intervene at all. That's awesome.

    StatsD Teleport

    I'm not going to dig into this one here since this is already a lengthy post. Just read this post from Sysdig's blog - and be amazed.

    Orchestration Awareness Baked In

    Orchestration metadata is 100% crucial to monitoring any PaaS or PaaS-like environment. One simply cannot have any legitimate confidence in their understanding of the health of the services running in their stack without being able to trace where any given instance of a service is running and where it lives in the larger ecosystem. Sysdig seems to have a strong focus on integration with a number of orchestration mechanisms. If you configure your orchestration integration correctly, then any metric collected via any of the paths will automatically be tagged with ALL of that metadata on its way to the Sysdig back-end. Even better? The agents on individual nodes in a cluster - for instance with Kubernetes - don't need to know that they're part of the cluster; only the masters need to know. The masters will ship orchestration metadata to the back-end, and then when metrics come in from a node that is part of that master's cluster, they will automatically be correlated and immediately visible in the appropriate context. Seriously, that's been making me a VERY happy metrics geek.

    To Be Continued (Later)

    At some point after this series expounding the ways Sysdig is trying to tackle metrics and monitoring the right way (in my opinion), I'll get around to posting some more technical how-to pieces showing how best to make use of some of these features.

    For now, happy hacking!

    Monday, March 7, 2016

    Deploying a Private PaaS: The Good, the Meh, and the Aw Crap

    Moving your organization's dev and prod deployment architecture to a PaaS model can bring a lot of benefits, and a good chunk of these benefits can be realized very quickly by leaning on public PaaS providers such as RedHat's OpenShift Online, Amazon's ECS, or Google's Container Engine. Sometimes though, a public PaaS offering isn't a viable option, possibly due to technical limitations, security concerns, or enterprise policies.

    Enter the private PaaS! There are a number of options here, all of which offer varying degrees of feature abundance, technical capability, baked-in security goodies, operational viability, and other core considerations. As with anything in the tech world, the only clear winner among the various tools is the one that best fits your environment and your needs. No matter what you choose, however, there are some key things to consider during evaluation and implementation that span the PaaS field independent of any particular tool.

    Let's walk through some of the ups and downs of deploying a private PaaS. Please keep in mind that this post isn't about "Private PaaS vs. Public PaaS". Instead it assumes you're in a situation where public cloud offerings aren't a viable option, so it's more about "Private PaaS vs. No PaaS."


    The Good

    A full listing of all the good things a PaaS can bring would be pretty lengthy, so I'll focus on what I think are the most high-value benefits compared to a "legacy" deployment paradigm.


    Increased Plasticity and Reduced Mean Time to Recovery

    Plasticity: the degree to which workloads can be moved around a pool of physical resources with minimal elapsed time and resource consumption
    Mean Time to Recovery (MTTR): the time taken to recover from a failure, particularly failure of hardware resources or bad configurations resulting in a set of inoperable hosts (at least in the context of this post)

    Legacy architectures typically see one application instance (sometimes more, but it's rare) residing on a single (generally static) host. Even with some manner of private IaaS in place, you're still deploying each app instance to a full-blown VM running its own OS. This makes it difficult to attain any reasonable plasticity in your production environments due to the time and resources needed to migrate a compute workload, which in turn forces MTTR upward as you scale up.

    PaaS workloads generally eschew virtualization in favor of containerization, which can dramatically reduce the time and resources needed to spin up and tear down application instances. This allows for greatly increased plasticity, which consequently drags MTTR down and helps keep it reasonably low as you scale up.


    Reduced Configuration Management Surface Area

    Configuration management is not only sufficient for handling large swaths of your core infrastructure, but has effectively become necessary for such. That said, there's a lot of value in reducing the surface area touched by whatever config management tool(s) you're using. A particularly unhealthy pattern that some organizations find themselves following is one in which every application instance host (virtual or not) is "config managed." In the case of bare metal hosting this makes good sense, but in the event that you're deploying to VMs... it's no good. At all. 

    Reducing this surface area can act as a major painkiller by requiring much simpler and more consistently applied configuration management... er... configurations. With a PaaS, you only need to automate configuration for the hosts that run the PaaS (and hosts that run non-PaaS-worthy workloads), not the individual application instance containers. This makes life suck a lot less.


    Consistency and Control

    No one likes an overzealous gatekeeper - they're generally considered antithetical to continuous integration and deployment. On the other hand, an environment that isn't in a position to deploy to a public cloud platform is also very unlikely to be in a position to live without some fairly high level of control over its code deployments. The key here is automated gate-keeping, and a PaaS gives you a decent path to accomplishing this.

    Running a private PaaS for both staging and production environments gives you ample opportunity to funnel all code deploys through a consistent pipeline that has sufficient (and ideally minimal-friction) automated controls in place to protect production. This allows Infrastructure Engineers to provide a well-defined and mutually-acceptable contract for getting things into the staging PaaS, and consequently provides a consistent and high-confidence path to production - all without necessitating manual intervention by us pesky humans. Basically, if the developer's code and enveloping container are up to snuff by the staging environment's standards, they're clear to push on to production.

    Regarding consistency, developers also gain confidence in their ability to deploy services into an existing ecosystem by being able to rely on a staging environment that mirrors production with far fewer variables than would otherwise be present in a legacy architecture. Dependencies should all be continually deployed to staging, and thus verifying API contracts should be much closer to trivial than not.

    Making DevOps a Real Thing

    Everyone's throwing around DevOps these days and it's exhausting, I know - but I'm still going to shamelessly throw my definition on the pile.

    My current take on DevOps is that an engineering organization most likely contains ops folks who are brilliant within the realm of infrastructure, and developer folks who are brilliant within the realm of application code. If you consider a Venn diagram with two circles - one for infrastructure and one for code - most organizations are likely to see those circles sitting miles apart from one another, or overlapped so heavily as to be indecent. The former diagram could be called "the DevOps desert," and the latter "choked with DevOps". Neither of these is particularly attractive to me.

    A well-devised and even-better-implemented PaaS has the potential to adjust that diagram such that the two circles overlap, but only narrowly. Ops people focus hard on infrastructure, and dev people focus hard on code, with a narrow and remarkably well-defined interface between the two disciplines. There's still plenty of room for collaboration and mutually-beneficial contracts, but dev doesn't have to muddy their waters with infrastructure concerns, and ops doesn't have to muddy their waters with code concerns. I think that could be a beautiful thing.

    The Meh

    There aren't a ton of objectively "bad" things about running a PaaS vs. a legacy architecture, but there are a few necessary unpleasantries to consider.

    Introducing Additional Complexity

    A PaaS is an inherently complex beast. There are mitigating factors here and the complexity is generally worth it, but it's additional complexity all the same. Adding moving parts to any system is, by default, a risky move. But in today's market and at today's scale, complexity is a manageable necessity - provided that you choose your PaaS wisely and hire truly intelligent engineers over those who can throw out the most buzzwords.

    Network Configurations

    Containers need to be able to route to one another across hosts in addition to being able to reach (and be reachable by) other off-PaaS systems. Most PaaS products handle this or provide good patterns for it, but it still introduces a new layer of networking to consider during troubleshooting, automation, and optimization.

    Maintenance and Upgrades

    If you've built your PaaS and its integration points well, you'll end up with a compelling deployment target on which lots and lots of folks are going to run their code. This can make it tricky to handle host patching and upgrades without impacting service availability. That staging environment (and maybe even a pre-staging environment used solely for infrastructure automation tests) becomes very important here.

    Data Stores

    Anything that needs local persistent storage (think databases and durable message buses) is unlikely to be a good candidate for a PaaS workload. It can be made to work in some cases, but unless you're a big fan of your SAN, you're probably best off keeping these kinds of things off the PaaS. Even if you can make it work, I'm not yet convinced that there's much value in having such a workload reside in such a fluid environment.

    Capacity Planning and Resource Limits

    Container explosions can bite you very quickly, most likely due to a PaaS' self-healing mechanism glitching out. This is a complex component and you're guaranteed to find really... um... "special"... ways of blowing out your resource capacity until you get a good pattern figured out. A clear pattern for determining and enforcing resource limits is going to be incredibly helpful.
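    Just to make the resource-limit angle concrete, here's a tiny back-of-the-envelope sketch (in Elixir, purely illustrative - the numbers and the function itself are made up, not anything your PaaS provides) of the kind of math you end up doing once per-container limits are in place: how many containers can the cluster hold while still tolerating the loss of a node?

        # Hypothetical capacity check: given a per-container memory limit and
        # uniform node sizes, how many containers fit if we insist on keeping
        # `failure_tolerance` nodes' worth of headroom free?
        defmodule CapacitySketch do
          def schedulable_containers(node_count, node_mem_gb, container_limit_gb, failure_tolerance \\ 1) do
            usable_nodes = node_count - failure_tolerance
            div(usable_nodes * node_mem_gb, container_limit_gb)
          end
        end

        # iex> CapacitySketch.schedulable_containers(10, 64, 2)
        # 288   # 10 nodes at 64 GB each, 1 node held back as headroom, 2 GB limit per container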

    Capacity planning also demands relevant data, which means you'll need some solid metrics, and that brings us to...

    The "Aw, Crap..."

    There are a couple of problems that a PaaS introduces which will, at some point, demand your attention - and punish you harshly should you neglect them.

    Single-Point-of-Failure Avoidance

    SPoFs are anathema to high availability. If you're going to introduce something like a PaaS, you would do well to think very hard about every piece of the system and how its sudden absence might impact the larger platform. Does the cluster management API need to be highly available? Then you'll need to stick a load balancer in front of those nodes - but what happens if the LB goes down?

    Or what if you're running your PaaS nodes as VMs? That's fine, until you factor in hypervisor affinities - you can't afford to have too many PaaS nodes residing on a single hypervisor. Even if you account for node -> hypervisor affinities, can you ensure that your PaaS won't shovel a large portion of a service's containers onto a small subset of nodes that happen to reside on the same hypervisor? The extra layer of abstraction away from bare metal here is likely to introduce new SLA-impacting failure scenarios that you may not have considered.
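    To make that last concern a bit more concrete, here's a minimal sketch (in Elixir; the data shapes and the 50% threshold are assumptions for illustration, not the API of any particular PaaS) of the kind of placement sanity check you might run against scheduler output: given which node each of a service's containers landed on, and which hypervisor backs each node, flag any hypervisor carrying an outsized share of the service.

        # Hypothetical placement check: container_nodes is the node each of a
        # service's containers landed on; node_to_hypervisor maps node -> backing
        # hypervisor. Returns true if any single hypervisor ends up carrying more
        # than `max_share` of the service's instances.
        defmodule PlacementCheck do
          def over_concentrated?(container_nodes, node_to_hypervisor, max_share \\ 0.5) do
            total = length(container_nodes)

            container_nodes
            |> Enum.map(&Map.fetch!(node_to_hypervisor, &1))
            |> Enum.group_by(& &1)
            |> Enum.any?(fn {_hypervisor, placements} -> length(placements) / total > max_share end)
          end
        end

        # iex> PlacementCheck.over_concentrated?(
        # ...>   ["node1", "node1", "node2", "node3"],
        # ...>   %{"node1" => "hv-a", "node2" => "hv-a", "node3" => "hv-b"})
        # true   # 3 of the 4 containers are backed by hypervisor hv-a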

    You're very likely to be able to mitigate any SPoF you come across, but they're likely to be slightly different in some cases than what you've handled before, and it's worth considering how to avoid them up front.

    Metrics and Monitoring

    Disclaimer: I obsess over metrics. It's almost unhealthy, really, but hey - it works for me. So it's not too surprising that introducing a PaaS makes me very curious about how one can go about gathering relevant metrics from the various layers involved. There are effectively three layers or classes of metrics you need to concern yourself with in a typical PaaS: host, container, and application.

    Host metrics are just as easy as they've ever been, so that's not of much concern. Collection and storage of metrics for semi-static hosts is largely a solved problem.

    Container metrics (things like CPU, memory, and network used per container) aren't too terribly difficult to collect, though storage and effective visualization can be more difficult due to the transient nature of containers - particularly when those containers are orchestrated across multiple hosts.

    Application metrics (metrics exposed by the actual application process within each container) are potentially a real bear of a problem. Polling metrics out of each instance from a central collection tool isn't too attractive since it's fairly difficult to track transient instances such as those that reside in a PaaS. On the flip side, having each application instance emit metrics to some central collector or proxy is feasible, but storage and visualization are still difficult since you'll inevitably need to see those metrics in the context of orchestration metadata that is unlikely to be available to your application process.
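    For the push-based approach, the rough shape I have in mind looks something like the sketch below (Elixir, raw statsd-over-UDP; the collector address, env var names, and tag scheme are all assumptions for illustration - a real setup would use a proper statsd client and a long-lived socket):

        # Minimal sketch: each application instance pushes statsd-format metrics
        # over UDP to a central collector/proxy, tagging every datapoint with
        # orchestration metadata pulled from (hypothetical) environment variables
        # so it can later be lined up with container- and host-level metrics.
        defmodule AppMetrics do
          @collector_host 'metrics-proxy.internal'   # assumed central statsd proxy
          @collector_port 8125

          def increment(metric, value \\ 1), do: send_packet("#{metric}:#{value}|c|##{tags()}")
          def timing(metric, ms), do: send_packet("#{metric}:#{ms}|ms|##{tags()}")

          defp tags do
            # DogStatsD-style tags; the env var names are made up for illustration.
            [
              "service:" <> (System.get_env("SERVICE_NAME") || "unknown"),
              "container:" <> (System.get_env("CONTAINER_ID") || "unknown")
            ]
            |> Enum.join(",")
          end

          defp send_packet(payload) do
            # Opening a socket per datapoint is wasteful but keeps the sketch short;
            # a real implementation would hold the socket in a supervised process.
            {:ok, socket} = :gen_udp.open(0, [:binary])
            :gen_udp.send(socket, @collector_host, @collector_port, payload)
            :gen_udp.close(socket)
          end
        end

    The statsd bits there are trivial; the part that matters is stamping every datapoint with whatever orchestration metadata the platform exposes to the container, because that's the context you'll be desperate for at visualization time.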

    These are not entirely insurmountable problems, but there are definitely not many viable products currently available that have a good grasp on how best to present the data generated within a PaaS. In fact, the only "out of the box" solution I've found so far that handles this well is Sysdig Cloud. These folks are onto something pretty awesome, and I plan to elaborate on that in my next post.

    Most People Are Doing Containers All Wrong

    Containers should run a single process wherever possible, and should be treated as entirely immutable application instances. There should be no need to log into a container and run commands once it's running on the production PaaS. Images should be built layer upon layer, where each layer has been "blessed" by the security and compliance gods. Ignoring these principles is worthy of severe punishment in the PaaS world, and your PaaS will be all too happy to dish out said punishment. Don't go in unarmed.


    Summary

    Implementing a PaaS in your environment is most likely well worth the effort and risk involved, just so that your organization can push forward into a more scalable and modern architecture. While there are certainly a few potential "gotchas," there's a lot to be gained by going into this kind of project with a level head and enough information to make wise decisions. Just keep in mind that this landscape is changing pretty rapidly, and moving to a PaaS is certainly not a decision to be made lightly - but it's still probably the right decision.