Thursday, September 7, 2017

An Ops Nerd Learns Probabilistic Data Structures, Part 1: Intro

I have a friend who is scary smart. I mean, I'm a reasonably smart guy, sure... but whoa. As it turns out (very much in my favor) he's also super nice and happy to help me learn new things. So when he said something about a "probabilistic data structure," and I promptly responded with, "right, I know all three of those words... but... huh?", he kindly gave me a gentle introduction to what I would soon learn is a super cool category of Computer Science.

What the heck is a "probabilistic data structure"?

Well, to be glib, it's a data structure that relies on probabilities and stuff. More usefully, it's (usually) an efficient way to store massive quantities of data in such a way that you get, at most, "fairly confident" answers about that data.

I know, that's not terribly helpful either, but we'll get there! I aim to make this series easy to digest and relate to. I'm not a math whiz nor am I a software engineer, which means I have to think about things in way simpler terms than either of those people would. Then when I go to explain it to folks smarter than myself (read: you), I like to do it in terms that are less muddied than the symbol-filled abstract articles and videos I've seen so far. I figure if I can't explain it to someone else in simpler terms than those in which I learned it, I probably don't understand it as well as I think.

Let's get some context: Why would I need such magic?

Shell Game - Distributed Data Stores

Imagine that we have a pile of records that we need to store. We want the full data set to be stored on disk, but we can't possibly fit it all on one machine. So we split it up across 3 machines, and life's good!

Eventually though, we'll want to retrieve some of those records. It would be pretty wasteful to go to each machine, ask for a record, and force the machine to make an expensive series of disk accesses just to tell us that the princess is in another castle. We might get lucky and land on the right machine the first time, but my (sketchy) math skills tell me that this is unlikely to happen with more than, roughly, one-third of our attempts.

Instead it would be handy if, as we ingest each record into our multi-machine data store, we write a highly-compressed version of that record to an in-memory data structure that can tell us - with some tunable degree of confidence - whether the data we want does or does not reside on a particular machine. One way of doing this with an impressive level of space efficiency is by implementing a Bloom filter, which we'll cover in the next post in this series.

#Hashtag #Frequency #Tracking #For #Fun #And (#Probably #No) #Profit

Imagine that we wanted to track the approximate frequency of various hashtags from the live Twitter stream. We can't really expect to store a full set of this data in memory with perfect fidelity, as we'd never have enough space. We also aren't likely to be successful dumping it all to disk, as that's a lot of writes, and they never freaking slow down.

We can, however, sacrifice a small bit of accuracy for a massive increase in space efficiency. We would do this with a data structure called a count-min sketch, which can tell us that a particular hashtag has most certainly not been seen more than N times, but may have been seen slightly fewer than N times. We'll cover this one in Part 3 of the series.

An understanding of HyperLogLog is its own reward

Don't worry, we'll get there. For now, it should suffice to say "it's weird, and it lets us achieve nigh-ludicrous space efficiency."

Up Next!

Part 2 will cover the Bloom filter. We'll talk about its perks, drawbacks, mechanisms, and use-cases. Then we'll build one and benchmark it! It should be all kinds of fun. See ya then!

Monday, June 27, 2016

Let's Build Something: Elixir, Part 6 - Adding Metrics to StatsYard

We're going to take a bit of a detour with this post, and look at what we can do to get some basic metrics spit out of StatsYard! Not only am I a huge metrics geek (so this is a pretty strong urge for me), but it's also something that will be helpful as we go forward. Implementing new functionality and fixing up old code is generally made more difficult by not having decent measurements, and laying the groundwork for that now will make it easier to keep up with it as we go along.

The old adage that premature optimization is the root of all evil still holds, of course, even for a metrics-obsessed type like me. While there's value in checking that we haven't introduced a major performance impact to the system, it just won't do to nitpick every little bit and try to tease out a tiny bit more performance at this stage of the game. I mostly just intend to use it to get an idea of the system's capabilities and as a learning opportunity when we start looking at alternative compilers and organizational structures.

An Overview

So! Let's get into it, shall we? The high level view of what we're going to implement is this:
    • Custom stats should be super easy to instrument as we build new functions and features
    • VM stats are awesome, but not something we should have to gather ourselves
    • Stats need to be shipped somewhere, preferably quickly and easily
    All of this will evolve over time, for sure, but this seems like a good start.

    O Back-End, Where Art Thou?

    Before we get too far, we need a place to stick our metrics! I'm actually using Sysdig Cloud for this, because they have an agent that beats all the rest, and their UI will suit my purposes nicely. The key feature of their agent that I'm leaning on is the built-in statsd server that will ship my stats to their back-end so that I don't have to worry about it. Fret not, though - nothing in this post will demand that anyone else use Sysdig, that's just where I'm going to throw data for now.

    Wiring Up Our New Dependencies

    First things first, let's add all of our dependencies and start up the apps as needed. We'll just lump them all in for now, and spend the rest of this post looking at how to make use of them. Crack open mix.exs and add the :folsom, :statix, and :ex_vmstats dependencies:
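    Something along these lines should do the trick - the version constraints here are just illustrative, so grab whatever's current:

```elixir
# mix.exs -- add these alongside any existing deps; versions are illustrative
defp deps do
  [
    {:folsom, "~> 0.8"},
    {:statix, "~> 1.0"},
    {:ex_vmstats, "~> 0.2"}
  ]
end
```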


    Still in mix.exs, we also need to tell our project to start these new applications, otherwise they won't do their job at startup:
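    Roughly speaking (the exact shape depends on your Elixir version), the application list ends up looking like this:

```elixir
# mix.exs -- make sure our new deps' apps get started along with our own
def application do
  [
    applications: [:logger, :folsom, :statix, :ex_vmstats],
    mod: {StatsYard, []}
  ]
end
```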


    Our First Custom Metric

    Like I mentioned earlier, we're going to be using Folsom for custom stats. It's an Erlang app, and it's a pretty good analog to Java's codahale.metrics, now known as DropWizard. The Github readme has pretty much all the info you could need about how it works, so we'll jump straight into instrumenting a call to one of our functions - DataPoint.validate/1. Open up lib/stats_yard/ingest_consumer.ex, and we'll wrap our function call with a :folsom_metrics function:
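    In broad strokes, the instrumented consumer looks something like this - the metric names and a few structural details may differ from the actual code in the repo:

```elixir
defmodule StatsYard.IngestConsumer do
  use GenServer

  def start_link(queue_pid, opts \\ []) do
    GenServer.start_link(__MODULE__, queue_pid, opts)
  end

  def init(queue_pid) do
    create_metrics()
    spawn_link(fn -> consume_data_points(queue_pid) end)
    {:ok, queue_pid}
  end

  # Register our metrics up front so Folsom can set up its ETS tables
  defp create_metrics do
    :folsom_metrics.new_histogram(:"validate.timing")
    :folsom_metrics.new_counter(:"validate.rate")
  end

  defp consume_data_points(queue_pid) do
    queue_pid
    |> BlockingQueue.pop_stream()
    |> Stream.map(fn item ->
      # Time the validate/1 call and feed the timing into Folsom's histogram;
      # the return value is still whatever validate/1 returned
      result =
        :folsom_metrics.histogram_timed_update(:"validate.timing",
          fn -> StatsYard.DataPoint.validate(item) end)

      :folsom_metrics.notify({:"validate.rate", {:inc, 1}})
      result
    end)
    |> Stream.run()
  end
end
```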


    We first have to define our metrics so that Folsom can get the appropriate structures set up in memory, so we add a private function create_metrics/0 and then call it when we init our GenServer.

    Next we wrap our validate/1 function with Folsom's histogram_timed_update/2. This Folsom function is basically a pass-through function, in that it accepts a function of our choosing as one of its arguments, evaluates that function and does some kind of work in relation to it, and then returns the value of the function that it evaluated. In this case, the work done in addition to our validation function is timing its run time and updating an ETS table that Folsom maintains for "doing math" on and maintaining a history for our metric. From our perspective, though, almost nothing has changed. We don't get any new return values or anything from validate/1,  and so we can still use it in the same manner we wanted to originally.

    After we've validated the data point and gotten a timing for the invocation, we increment the counter validate.rate, and then move on!

    Dude, Where's My Data?

    So we created some metrics and updated their values - why don't we see anything happening in our statsd server??? As it turns out, Folsom's only goal in life is to keep track of our stats, not actually send them anywhere. The beauty of Folsom is that it's optimized to keep track of our stats and keep them accessible in ETS tables, which reside in memory and are extremely fast. With Folsom doing the math and handling the history and access paths of our metrics, we're left with all the flexibility in the world when it comes to choosing what to do with the data that's tracked. So, to that end, let's wire up Statix so we can actually send these bits somewhere!

    Setting up Statix

    It doesn't take much to get started with Statix. At a minimum, we need to:

    • Define a config for Statix in config/config.exs
    • use Statix in a module (any module), which will pull in a handful of useful functions that we can use to ship metrics to a statsd server
    • Make a quick call to the connect/0 function that our Statix-using module now has, and we're off to the races with such functions as gauge/2, increment/2, decrement/2, etc.

    Our config is simple enough. Clearly one should change the host value to something appropriate as needed, but generally this is fine:
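    Something like this in config/config.exs will do (the prefix and port here are just examples):

```elixir
# config/config.exs -- point Statix at the local statsd server (the Sysdig agent, in my case)
config :statix,
  prefix: "stats_yard",
  host: "127.0.0.1",
  port: 8125
```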


    That's all well and good, but we need to wire up our in-memory Folsom data with Statix so that it's actually of some use. Enter our shiny new StatsYard.StatShipper module!

    In the file lib/stats_yard/stat_shipper.ex we'll place the following:
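    Here's a condensed sketch of the module's shape - note that the line numbers in the breakdown below refer to the full version in the repo, not this abridged listing (meter handling and the :percentile special case are elided here):

```elixir
defmodule StatsYard.StatShipper do
  use GenServer
  use Statix   # assumption in this sketch: we ship via Statix functions mixed into this module

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, Keyword.get(opts, :ship_interval, 1_000))
  end

  def init(ship_interval) do
    connect()
    schedule_shipment(ship_interval)
    {:ok, ship_interval}
  end

  defp schedule_shipment(ship_interval) do
    Process.send_after(self(), :ship_metrics, ship_interval)
  end

  def handle_info(:ship_metrics, ship_interval) do
    ship_all_metrics()
    schedule_shipment(ship_interval)
    {:noreply, ship_interval}
  end

  # Walk every metric Folsom knows about and ship it according to its type
  defp ship_all_metrics do
    for {name, info} <- :folsom_metrics.get_metrics_info() do
      ship_metric(name, Keyword.get(info, :type))
    end
  end

  defp ship_metric(name, :counter) do
    gauge("#{name}", :folsom_metrics.get_metric_value(name))
  end

  defp ship_metric(name, :histogram) do
    # Ship Folsom's computed stats as point-in-time gauges
    # (see the note on running multiple aggregators below)
    for {stat, value} <- :folsom_metrics.get_histogram_statistics(name), is_number(value) do
      gauge("#{name}.#{stat}", value)
    end
  end

  defp ship_metric(_name, _type), do: :ok
end
```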


    Whew! Lots going on here, so let's talk about it at a high level, then we'll jump into the line-by-line breakdown. The general idea here is to build a supervised GenServer that will, on a certain interval, iterate across all metrics known to Folsom and ship them out via Statix. This keeps the shipping of metrics out of the data path for actually collecting the metrics, and since it's supervised separately at the top level, its workload isn't tied to any other part of the app.

    Let's dive in and see what exactly is happening here:
    • Line 9-10: Folsom's histogram and meter metric types are actually collections of various related values, but we don't necessarily always want all of them. Here we define the ones we want for now
    • Line 12: This GenServer needs to do work every N milliseconds, and we default N to 1000
    • Line 17: Our init/1 function does nothing more than schedule the first metrics shipment, then lets everything run itself from here on out
    • Line 21-23: This function shows how easy it is to do something after a set time. Process.send_after/3 does exactly what it sounds like - sends a message to an arbitrary PID after the specified time. In this case, the GenServer is sending itself a message telling itself to ship metrics, but waits ship_interval milliseconds before doing so
    • Line 25: We're using handle_info/2 here, because it's the GenServer callback that handles all arbitrary messages sent to the GenServer that are NOT handled by any other callback. Our message telling the server to ship metrics is, indeed, one such arbitrary message
    • Line 26: Here we call the function that actually kicks off the iteration over Folsom's metrics and then ships them off. We also set the GenServer's state to the return value of this function, which we could use (but aren't currently) to terminate the server if metrics shipping failed
    • Line 27: Once metrics have been shipped, schedule the next shipment for ship_interval from now
    • Line 32-41: For every metric that Folsom has registered, deconstruct its metadata tuple and take appropriate action based on the metric's type
    • Line 50-54: Break apart a meter metric and only ship the bits that we've allowed with @shippable_meter_stats
    • Line 60-71: Histogram metrics in Folsom are a bit of a mix of data types, so we need to walk every keyword in the list and when we come across the :percentile keyword, handle it appropriately. All other keywords are shippable as-is
    I know, it's a mouthful (keyboardful?), but hopefully it helps to explain things a bit. Now we need to start up our GenServer and get it supervised, so off we go to lib/stats_yard.ex where we'll add a new function to be called from start/2:
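    The new function is small - it just stands up another top-level supervisor with the shipper as its only worker (the supervisor name here is illustrative):

```elixir
# lib/stats_yard.ex -- called from start/2
defp start_stat_shipper do
  import Supervisor.Spec, warn: false

  children = [
    worker(StatsYard.StatShipper, [])
  ]

  Supervisor.start_link(children, strategy: :one_for_one, name: StatsYard.StatShipperSupervisor)
end
```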


    A Note On Running Multiple Aggregators

    I can hear you asking now - why so many gauges in the stat shipper module? Well, there's a bit of a headache that can crop up when you run multiple stats aggregators like we're doing here with Folsom and StatsD. Both tools are going to store incoming values and do some math on them, and maintain some rolling / windowed values as well. If we let both tools do this, however, then by the time our metric data makes it to the back-end, we're going to be seeing skewed (and often, watered-down) values. We can generally avoid this by allowing Folsom to do the math and windowing, and ship the results to StatsD as gauge values. This causes StatsD to consider the value to be a point-in-time value that's not dependent on any past events, and thus passes it through as-is.

    Last But Not Least - VM Stats

    For grabbing VM stats (memory utilization breakdown, process counts, queue sizes, etc.), we're going to start out with ex_vmstats. It does its job right out of the box, but we need to make it work with Statix first. As is mentioned in the Github readme for ex_vmstats, we need to create our own Statix backend and then specify it in our mix config. Let's take care of the backend module by creating a new file named ex_vmstats_statix_backend.ex:


    Now all that's left is to start up the app and we'll start sending VM stats to the back-end!

    Checking It All Out

    To help test out our shiny new stats without having to do a lot of super repetitive typing, I've added a small module called StatsYard.Util.IngestStress. It can be used as such from an iex -S mix session:

    StatsYard.Util.IngestStress.run 2_000_000

    This will lazily generate 2 million valid DataPoint structs and shovel them into our ingest queue. I took the liberty of doing this locally and grabbing a screenshot of my Sysdig dashboard to give an idea of what's going on:


    Cool! Looks like my current setup can validate some ~15,000 data points per second, with each generally taking under 20 microseconds. Excellent! There are some other cool VM stats that I've graphed here too, which are definitely worth taking a look at as we move forward.
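    For reference, the stress helper itself is nothing fancy - something along these lines (the queue name and field values here are illustrative):

```elixir
defmodule StatsYard.Util.IngestStress do
  @moduledoc "Quick-and-dirty load generator for the ingest queue."

  alias StatsYard.DataPoint

  def run(count) do
    # Lazily generate `count` valid DataPoints and push them into the ingest queue
    Stream.repeatedly(fn ->
      %DataPoint{metric: "stress.metric", entity: "stress.entity", value: :rand.uniform(1000)}
    end)
    |> Stream.take(count)
    |> Enum.each(fn dp -> BlockingQueue.push(:main_ingest_queue, dp) end)
  end
end
```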

    Next Time

    In a bit of a departure from what was promised last time, we'll just be looking at the data path next time 'round. Worker pools will have to wait until we actually start writing some data to disk!

    Until next time, feel free to peruse the code for this post at https://github.com/strofcon/stats-yard/tree/lets-build-6.

    Wednesday, May 11, 2016

    Let's Build Something: Elixir, Part 5b - Testing Data Validation and Logging

    Let's wrap up this installment by writing some unit tests to safeguard us against bad changes to our data validation code!

    First we'll put together a couple of tests to ensure that validation and invalidation are working correctly:
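    The first cut of the test module looks roughly like this (the field values are arbitrary, and the file and module names are what you'd expect):

```elixir
# test/stats_yard/data_point_test.exs
defmodule StatsYard.DataPointTest do
  use ExUnit.Case

  alias StatsYard.DataPoint

  test "a well-formed DataPoint is valid" do
    good = %DataPoint{metric: "cpu.util", entity: "host01", value: 42}
    assert {:ok, ^good} = DataPoint.validate(good)
  end

  test "a malformed DataPoint is invalid" do
    bad = %DataPoint{metric: :nope, entity: "host01", value: 42}
    assert {:invalid, ^bad} = DataPoint.validate(bad)
  end
end
```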


    Let's see how we did:


    Well, it certainly worked, but it's a bit clunky. Two things stand out:
    1. We should probably test multiple invalid DataPoints, since there are multiple guards that we want to make sure we got right. 
    2. We probably don't want those log messages showing up in the test output. We likely want to try and capture those to ensure that we are indeed logging appropriately, without cluttering the output of the test run.
    So let's do that! First up, let's add some more bad DataPoint structs to test. We'll switch to using a list instead of individually-named structs for our tests, and we'll also do the same for the valid DataPoints - just in case we decide to test multiple ways there. Consistency FTW!


    And the run:


    Cool! We made our logging problem even worse, though. Let's take care of that. To make this work, we're going to use ExUnit's capture_log/2 to capture the output from Logger, spit out by our DataPoint module:


    I inserted a small bug into the test for valid DataPoints - do you see it? No? It's pretty subtle, no worries. I told capture_log/2 that I'm expecting a Logger message with level :info, but our validation function is actually logging valid structs as :debug messages. Let's see how this works out:


    So capture_log/2 captures the log event, but swallows it and the assertion fails since the level was mis-matched. Handy!

    Switch the valid capture_log/2 call back to :debug and we're good to go:
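    The relevant assertion ends up looking something like this (the exact log message isn't important, just that something gets logged at :debug):

```elixir
# inside StatsYard.DataPointTest, with `import ExUnit.CaptureLog` at the top of the module
test "a well-formed DataPoint is valid and logged at :debug" do
  good = %DataPoint{metric: "cpu.util", entity: "host01", value: 42}

  captured = capture_log([level: :debug], fn ->
    assert {:ok, ^good} = DataPoint.validate(good)
  end)

  # We only check that *something* was logged; the exact message is up to validate/1
  refute captured == ""
end
```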


    (Note: I'm not sure yet how to make it stop complaining about unused variables here. They're certainly used, so I'm not sure why it's not happy about my use. I'll dig into that and iron it out in a later commit.)
    Shout-out to DNNX for pointing out the problem with the unused variables in the DataPoint tests!  I've also updated the gist in this post to reflect his change. Check out his pull request for an explanation of what was happening here. Many thanks Viktar!

    Next Time

    We'll continue next time by introducing a worker pool that we can use to distribute our data validation workload over, and start sketching out how the write path will look when we start persisting data to disk.
    Until then, feel free to peruse this post's code on Github at https://github.com/strofcon/stats-yard/tree/lets-build-5

    Monday, April 18, 2016

    Let's Build Something: Elixir, Part 5a - Data Ingest, Consumption, and Validation

    Whew! Been a little while, but let's keep cruisin'! This installment will tackle defining a custom data type for our data points, providing a means by which we can queue up ingested data and consume it continuously, and validating it as we go along. We'll also write some tests to keep our sanity.

    First let's define what our data points should look like. We'll keep it simple for now, and can expand later. Open up a new file at lib/stats_yard/data_point.ex:
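    Here's roughly what that module looks like - keep in mind that the line numbers called out below refer to the actual file in the repo, and the log messages here are illustrative:

```elixir
defmodule StatsYard.DataPoint do
  @moduledoc "A single metric observation, plus the functions that validate it."

  require Logger

  @type t :: %__MODULE__{metric: String.t, entity: String.t, value: number}
  defstruct metric: nil, entity: nil, value: nil

  def validate(%__MODULE__{metric: metric, entity: entity, value: value} = data_point)
      when is_binary(metric) and is_binary(entity) and is_number(value) do
    Logger.debug "Valid DataPoint: #{inspect data_point}"
    {:ok, data_point}
  end

  def validate(bad_value) do
    Logger.warn "Invalid DataPoint: #{inspect bad_value}"
    {:invalid, bad_value}
  end
end
```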


    There are a couple of things happening here that you might be interested in:

    • Line 4: We define a type for our DataPoint struct. Note that __MODULE__ is just a safe way to reference the name of the current module (StatsYard.DataPoint). We can later reference this as StatsYard.DataPoint.t when the need arises.
    • Line 5: Define our basic DataPoint struct. Note that structs always inherit the name of the module in which they are defined. If we wanted it to be referenced as something other than %StatsYard.DataPoint{} we would need to define a module within the outer module, such as StatsYard.DataPoint.Thing. There's no real need for that in this case.
    • Line 7-10: Set up a validation function that will only work when the argument passed is one of our shiny new structs, and the various keys therein pass our guards. Specifically we want the metric and entity fields to be strings (or binaries, in Elixir/Erlang land), and the value to be some type of number.
    • Line 11-12: If we end up in this version of the validate function, log a message and return a tuple to indicate success and spit back out the provided struct.
    • Line 15: Define the "fall-through" validate function that will match on any argument that doesn't pass the above guards. In this case, log a warning and return an :invalid tuple with the provided value included.
    This module is intended to wrap up the structure of the data we want to use and the functions that are relevant in inspecting and validating it.

    Next up, let's add something to let us queue up incoming data points. I like Joe Kain's BlockingQueue. It's a statically-sized GenServer'd FIFO queue that blocks when full or empty. Super simple, and very effective. NB: Joe also has a super awesome blog called Learning Elixir that you really should check out.

    First up we need to add it to our deps list in mix.exs:
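    That's a one-liner in the deps list (the version constraint is illustrative):

```elixir
# mix.exs
defp deps do
  [
    {:blocking_queue, "~> 1.0"}
  ]
end
```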


    Then follow that up with a mix deps.get, and we're ready to roll.

    First let's walk through the idea here. I want to have a BlockingQueue GenServer that catches our ingested DataPoints, and a separate consumer GenServer that will pop those DataPoints off the ingest queue and validate them.  The most important part of all this is that I don't want to do things in batches, nor do I want to have to explicitly trigger the consumption of data from the ingest queue. Enter supervision and streams!

    BlockingQueue's API gives two functions for popping events off a queue: pop/1 and pop_stream/1. As you might have guessed, pop/1 removes and returns the oldest single value from the queue, while pop_stream/1 returns a Stream function - specifically, Stream.repeatedly/1. If you're unfamiliar with Streams, they're effectively a safe mechanism by which you can perform actions on arbitrarily large data sets without having to pull the entire thing into memory. I'm not the best to describe these in great detail, but Joe Kain and Elixir's Getting Started guide have some good descriptions and applications.

    So the layout of these bits is going to be something like this:

    1. Start up a named and supervised BlockingQueue GenServer
    2. Start up a named and supervised consumer GenServer that can find the above BlockingQueue process
    3. Consumer process grabs hold of the Stream function returned by BlockingQueue.pop_stream/1 and proceeds to validate every DataPoint that gets pushed into the queue

    Here's our consumer:
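    (As before, the line numbers in the breakdown below refer to the full file in the repo; this is just the general shape of it.)

```elixir
# lib/stats_yard/ingest_consumer.ex (abridged)
defmodule StatsYard.IngestConsumer do
  use GenServer

  defp consume_data_points(queue_pid) do
    queue_pid
    |> BlockingQueue.pop_stream()
    |> Stream.map(&StatsYard.DataPoint.validate/1)
    |> Stream.filter(fn
         {:ok, _data_point} -> true
         _ -> false
       end)
    |> Stream.map(fn {:ok, data_point} -> data_point end)
    |> Stream.run()
  end

  def start_link(queue_pid, opts \\ []) do
    GenServer.start_link(__MODULE__, queue_pid, opts)
  end

  def init(queue_pid) do
    spawn_link(fn -> consume_data_points(queue_pid) end)
    {:ok, queue_pid}
  end
end
```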


    So here's what's going on in this module:

    1. Line 6: This is my nifty way of telling this function "keep an eye out for things popping up in this stream of data, and take appropriate action when you see a new item." We'll test this out shortly in iex
    2. Line 7: Try to validate, as a DataPoint, every item that comes off the stream
    3. Line 8-9: If the validate succeeds, return the DataPoint, otherwise discard it entirely (remember that our StatsYard.DataPoint.validate/1 function will log a warning when a value fails validation)
    4. Line 18: Note that our public start_link/2 function expects to receive the PID of our ingest queue, which we'll provide in lib/stats_yard.ex when we set up our supervision tree
    5. Line 23-25: Start up a linked process that will kick off our queue consumption loop in consume_data_points/1
    6. Line 27: Set our GenServer's state to the PID of our ingest queue, and we're done!
    Notice that this is a very simple GenServer - so simple, in fact, that it doesn't even have any direct means of interaction. For now, this is more than sufficient - we just want something that we can supervise and organize appropriately, with the option to extend it for more robust behavior in the future. (For the studious, you're right - there's always Elixir's GenEvent, but that's for future posts!)

    Now let's rig all this up in our supervision tree, and then we'll poke around in iex to see if it's all working as expected. Notice that this has been cleaned up a bit to accommodate our TimestampWriter bits without getting too cluttered:


    (I know that's a big chunk of code to dump into a blog post - my apologies. I mostly wanted to be sure to point out that the structure of this stuff changed significantly. Newer changes will be limited to just the diffs. :-) )

    Nothing super exciting here, other than a few things to note:
    1. All supervisors are now started up in their own independent functions, which are called from the pared-down start/2 function
    2. Our supervisors are now named appropriately (Lines 20 and 36)
    3. Our start_main_ingest/0 function lists two workers to be started up under the appropriate supervisor, which will start in the order listed (this will be on the quiz)
    4. Atoms used to name our GenServer processes are pulled out and returned from simple functions at the bottom of the file, so as to avoid headaches later
    Enough work, let's play with it! Fire up iex and we'll see if our stuff works:


    Cool! We're able to push things into our BlockingQueue without having to know much about it, and our IngestConsumer immediately received the pushed values and attempted to validate them, the results of which are spit back out via log messages.

    Now for that quiz I mentioned earlier: in what order were our two ingest GenServers started? Yup, the order listed in our supervision tree definition - the queue first, then the consumer. Why does this matter? 

    There's a failure case that we need to recognize and accommodate. Specifically, if our ingest queue process dies, it will indeed be restarted by the supervisor... but our consumer process will merrily chug along holding onto a Stream function that references a now-dead process! That sounds like bad news, but let's verify that I'm not making stuff up:


    I know that's a bit dense, but the gist (har har) of it is that we used Supervisor.which_children/1 to see what the PIDs of our two GenServers were, stopped the main_ingest_queue process (rather rudely, too, a la :kill), then checked to see that the expected PID had indeed updated in the supervisor's state. Then we tried to push a value into the main ingest queue, which did indeed work since it had been restarted, but our ingest consumer process never knew about it, because it's waiting for events to flow in from a dead process. That's lame, so let's fix it!

    Turns out, this is a super simple one-line fix, but reading the docs is a must in order to understand why this fix is appropriate (head over to the Supervisor docs, then search for "rest_for_one"). In lib/stats_yard.ex:
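    The change is just the strategy in our main ingest supervisor's options - roughly:

```elixir
# in start_main_ingest/0 -- :rest_for_one is the only change
Supervisor.start_link(children, strategy: :rest_for_one, name: StatsYard.MainIngestSupervisor)
```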


    And now to test it out in iex:


    Woohoo! Worked like a charm. What's happening here? First, read the docs. :-) Second, in a nutshell, using the strategy rest_for_one causes the supervisor to consider the position of any process that dies under its supervision and then kill and restart all supervised processes that were started subsequent to the original dead process. In our case, the queue process is the first one, so if it dies, everything in the supervision tree of our MainIngestSupervisor is restarted. If it were, for example, the 3rd process started by this supervisor, then the 4th, 5th, ..., nth processes would be restarted, while the 1st and 2nd processes would be left alone. Super handy stuff here!

    To Be Continued...

    So now we're in a good place from a supervision point of view. This post is already pretty lengthy, so I'm going to title it "Part 5a," and we'll continue with some unit tests and documentation in Part 5b.

    Til next time!

    Thursday, April 14, 2016

    Let's Build Something: Elixir, Part 4 - Better Tests, TypeSpecs, and Docs

    We left off with our first test case working, but less-than-ideal. Specifically, it's leaving the timestamp file it writes sitting on the disk, and in an inappropriate location (the root of our project). This is super lame, and we should fix that.

    Enter ExUnit callbacks and tags! ExUnit allows us to pass configuration data into and out of our tests by means of a dictionary, usually referred to as the "context". We can make good use of this context data by way of setup callbacks and tags. These are described well in the docs, and we'll lean on their examples for what we need to accomplish here.

    So our test is currently testing TimestampWriter's ability to... ya know... write timestamps. And it works great, other than leaving the temp file sitting in the root of our project. While we could just add some code to our test to explicitly handle this, a better (and less-repetitious) approach is to modify our overall test case to do some setup and tear-down tasks for us automatically!

    First, remove the junk file leftover at stats_yard/tstamp_write.test, and then we'll add a setup callback to our test case that will force our writes to happen in an appropriate directory:
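    With the callback and tag in place, the test case looks something like this (the assertion itself may differ a bit from the repo, but the shape is what matters):

```elixir
defmodule TimestampWriterTest do
  use ExUnit.Case, async: false

  setup context do
    if cd = context[:cd] do
      original_dir = File.cwd!()
      File.cd!(cd)
      on_exit(fn -> File.cd!(original_dir) end)
    end

    :ok
  end

  @tag cd: "fixtures"
  test "writes a timestamp" do
    assert StatsYard.TimestampWriter.write_timestamp("tstamp_write.test", 1234567890) == :ok
  end
end
```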


    Here we're exercising a one-way communication from our test to the setup callback by way of a tag. What's happening here is that ExUnit will call our setup callback before execution of every test within our test case. By preceding a test definition with @tag, we are specifying that a context dictionary should be passed to our setup callback that contains a key-value pair of { cd: "fixtures" }. This is mostly copy-pasta'd straight out of the ExUnit docs, but a bit of explanation can't hurt:
    • Line 2: We need to make sure our tests don't run in parallel since we're going to be switching directories. It sucks, but it's the nature of the beast
    • Line 4: Define our setup callback, which will be executed prior to every test that is run
    • Line 5: Check to see if our cd tag is present in the callback's current context dict. This is necessary because the same callback is executed for every test, but not every test will necessarily use this particular tag
    • Line 6-7: Store the current directory and switch to the directory specified in the context
    • Line 8: When the test exits (whether success or fail), switch back to our original directory
    • Line 14: Our handy-dandy tag for the test that immediately follows
    Let's see if it works!


    Nope! ExUnit apparently doesn't create directories for you. Oops. Easy fix, and again:


    Much better. Now we should do some cleanup after the fact, because let's face it - no one wants temp files committed to their repo.


    A quick rundown of the updates (slightly out of order):

    • Line 23: Add a `tempfile` tag to our test to indicate that we're going to (attempt to) write a transient file
    • Line 24: Make our test accept a dict argument called `context` (which will contain our `tempfile` key)
    • Line 25: To keep things DRY, refer to the context's value for :tempfile instead of repeating the filename explicitly
    • Line 9: When the test is done, check to see if a tempfile was specified for the test that's being set up
    • Line 10: Make sure the tempfile actually got written, otherwise Line 11 will blow up
    • Line 11: Call the "dirty" version of File.rm/1, just in case there are any weird permissions issues that prevent deletion of the file
    So now we should be able to run our test, and see precisely zero remnants of it:


    Perfect! Now we can write more tests in here and gain some nice organization and cleanup bits without having to provide anything beyond a couple of appropriate tags. (And after all that, yes, I do realize that a tempfile doesn't necessarily need to go into its own directory if it's just going to be immediately deleted. This just makes me feel better.)

    To wrap up, let's make our GenServer's module and API a bit more legit with a typespec and docstrings:
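    The annotated public function ends up looking roughly like this (the docstring wording is illustrative, and the rest of the module is unchanged):

```elixir
# inside StatsYard.TimestampWriter
@moduledoc """
A simple GenServer that writes timestamps to files on request.
"""

@doc """
Writes `timestamp` to the file at `path`.

Returns the result of the underlying `File.write/2` call: `:ok` on success,
or `{:error, reason}` otherwise.
"""
@spec write_timestamp(String.t, integer) :: :ok | {:error, atom}
def write_timestamp(path, timestamp) do
  # Assumes the server is registered under the module name
  GenServer.call(__MODULE__, {:write_timestamp, path, timestamp})
end
```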


    The @moduledoc and @doc directives are pretty straightforward - wrap up your docstrings in the """ markers, and get yo' docs on. Keep in mind that the docs are in markdown, so you can (and really should) make them look pretty.

    The @spec directive on Line 22 is simply a way to specify the types of the arguments our function can accept, and the type of value it will return. Easy stuff, and super helpful when we start looking into static analysis - it can help iron out a ton of bugs early on.

    Next Time

    Now that we've spent some time on some of the basics that we'll be seeing over and over again, the next post will get into more of the meat of our project and start doing stuff that's more interesting than writing a timestamp to a file. Specifically, we'll define the first iteration of our data format and figure out a way to represent that in code such that we can validate incoming requests for appropriate structure.

    Until then, feel free to peruse the source for this post at: https://github.com/strofcon/stats-yard/tree/lets-build-4

    Monday, April 4, 2016

    Let's Build Something: Elixir, Part 3 - Getting Started with ExUnit for Testing

    NOTE: Before you get too far into this one, I want to mention that I realized I wasn't following convention in my Elixir file names, so there's a commit at the beginning of this post's branch that fixes it (and it's been merged to master as the others have as well). Just a heads-up in case it seems weird all of a sudden. :-)

    Last time we made our TimestampWriter GenServer a supervised process to make it more resilient to bad inputs and other process-killing events. Now it's time to protect our GenServer from a much more sneaky and persistent assailant - us! This seems like a good time to get familiar with ExUnit and build our first test case for StatsYard.

    Defining ExUnit test cases is pretty similar to defining any other module in our Elixir project. If you pop into the stats_yard/test directory and take a look at stats_yard_test.exs, you'll see a simple example test:
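    It should look more or less like this (it's just what mix generated for us):

```elixir
defmodule StatsYardTest do
  use ExUnit.Case
  doctest StatsYard

  test "the truth" do
    assert 1 + 1 == 2
  end
end
```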


    Running this test is as easy as a quick mix test:


    Let's break that test down just a bit:
    • Line 1: As mentioned above, a test case is simply an Elixir module
    • Line 2: Pull in ExUnit's test case bits
    • Line 3: This line will cause ExUnit to do some magic that we'll discuss at a later date
    • Line 5: Defines a unit test with an arbitrary string label
    • Line 6: Makes the bold assertion that 1 + 1 does in fact equal 2
      • assert is basically just shorthand (or more accurately, a macro) that says "everything after the keyword 'assert' better be true, otherwise I'm gonna blow up and fail spectacularly". (There's a bit more to it, and we'll tackle that next.)
    To see assert do its thing in a less-than-true situation, we can just change the 2 to a 3 on Line 6 and run  mix test:


    In a nutshell, what happened here is that we insisted 1 + 1 = 3, and assert totally called us on it. What were we thinking???

    There's some interesting stuff in that output block. First it tells us which test failed ("the truth"), what file that test lives in (test/stats_yard_test.exs), and the line number that the test definition starts on (:5). After that, it tells us the general type of 'thing' we were trying to do (assert with ==) and shows us the specific assertion that failed.

    Next up are two interesting and very helpful lines: lhs and rhs. These abbreviations stand for "left-hand side" and "right-hand side" respectively, and these lines actually give us some insight into the way the test actually works under the hood. If you haven't encountered them before, lhs and rhs are hyper-relevant to one of Elixir's most powerful features: pattern matching!

    These two lines are telling us that ExUnit took our assert expression and made an attempted pattern match expression out of it, with the actual evaluated value on the left-hand side of the match, and the asserted value on the right-hand side, like so:


    In this iex session we can see an example of both of the test attempts we've tried so far - the first one being the successful test, and the second being the intentional failure. Hopefully this provides a bit of clarity around how ExUnit is actually accomplishing this particular test.

    So that's all fine and dandy, but we really should work on testing our TimestampWriter. Go ahead and switch that pesky 3 back to a 2, and we'll get started!

    First let's create a directory that will hold our tests - it's cleaner, and seems to be the convention used in most projects. Then we'll create a file in there to hold our first real test case (note that test files have to be named "<stuff>_test.exs", otherwise the mix test task will skip them):


    In timestamp_writer_test.exs we'll start out with the bare-bones first increment of our test, BUT we'll try to make it fail first by passing a bad argument to our public API function write_timestamp/2:
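    Something like this - the bogus argument here (a tuple where a timestamp should be) is just one convenient way to upset the GenServer:

```elixir
# timestamp_writer_test.exs
defmodule TimestampWriterTest do
  use ExUnit.Case

  test "writes a timestamp" do
    # A tuple is most definitely not a valid timestamp, so surely this will fail... right?
    assert StatsYard.TimestampWriter.write_timestamp("tstamp_write.test", {:bogus, :value}) == :ok
  end
end
```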


    (Note that I stopped naming these modules StatsYardTest.*, no point to it as far as I can tell.)

    And a quick run to see what's up:


    Huh... well that's... um... awesome? Not really. The GenServer process did indeed fail like expected, but the tests still technically passed. What gives?

    As it turns out, we tested our public API function, write_timestamp/2, not so much our GenServer. Our function is simply calling GenServer.cast/2 which then asynchronously sends a message to our TimestampWriter process. That send is indeed successful and returns the :ok atom - even though our process dies shortly thereafter - and that's exactly how GenServer.cast/2 is intended to operate.

    So how do we fix that? Well to be entirely honest, I don't know yet. BUT! There is a silver lining - this experience has made me re-think whether or not this particular activity is best handled as a cast or a call, which basically boils down to "should it be asynchronous with no response, or synchronous with a response?" Given the intended purpose of this particular function, I think a call is more appropriate: we're going to need some manner of acknowledgement that our data has indeed been written to disk before moving on to whatever our next task might be.

    So! Back to our TimestampWriter code:
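    Here's the reworked GenServer - again, the line numbers in the breakdown below refer to the actual file, and the registered name and file contents here are assumptions of this sketch:

```elixir
defmodule StatsYard.TimestampWriter do
  use GenServer

  ## Public API

  def start_link do
    # Registering under the module name is an assumption of this sketch
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  def write_timestamp(path, timestamp) do
    # Now a synchronous call, so the caller gets the result of the write back
    GenServer.call(__MODULE__, {:write_timestamp, path, timestamp})
  end

  ## GenServer callbacks

  def init(:ok), do: {:ok, %{}}

  def handle_call({:write_timestamp, path, timestamp}, _from, state) do
    result = File.write(path, "#{timestamp}\n")
    {:reply, result, state}
  end
end
```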


    To recap the changes here:
    • Line 9: Switch from GenServer.cast/2 to GenServer.call/2
    • Line 24: Switch from handle_cast/2 to handle_call/3 and add an (unused) argument, _from, for the sender's PID (we don't particularly need this right now, hence the underscore to keep the compiler happy)
    • Line 25: Bind the result of our file write operation to result
    • Line 26: Use the appropriate response from a call, which is to return the :reply atom, a result of some sort and the new state of the GenServer
    Notice a cool thing here, too: our interface to the GenServer didn't change at all, so we don't need to update our test! We should be able to run mix test and see our test fail appropriately:

    (Note: There will still be some extra output after this as a result of our GenServer tanking on bad inputs. We'll try to fix that another time.)

    Perfect! Now if we stop passing a known-bad argument to our public function in the test, we should get a nice passing test:


    Success! Whew. That was a bit of a runaround to get a simple test in place, but I learned a lot, so it doesn't seem like a wasted effort to me.

    As a final cleanup step (for now), I'm going to remove the simple truth test from the out-of-the-box test file that mix creates, because I don't really care for the clutter.

    Experiment

    We left a bit of an unpleasant side effect in place with our test. Hint: our timestamp writer spits out a timestamp somewhere. Figure out where it's landing, then peruse the ExUnit docs and see if you can figure out how to make that stop happening. No need for that clutter! The next blog post will cover how to fix this.

    Next Time

    We're not quite doing TDD, but hey, it's a start! Next time 'round we'll clean all of this up a bit more (as mentioned in the Experiment above) with some typespecs and docs. Exciting stuff, eh? At the end of the day, it's worlds easier to do these things up front, rather than trying to retrofit them later - plus we can make use of them for some testing convenience (or at least that's the hope!)

    For now, you can peruse the source for this post at: https://github.com/strofcon/stats-yard/tree/lets-build-3

    Wednesday, March 30, 2016

    Let's Build Something: Elixir, Part 2 - Supervising Our GenServer

    In Part 1 we built a simple GenServer and worked our way through making it write a timestamp to a local file.

    During that process, we ran into a usage issue - we had to explicitly start the GenServer before we could invoke its write_timestamp/2 function. It makes sense, but it's inconvenient and dependent on human interaction. We should (and can!) fix that.

    Automatically Starting Our GenServer


    The file lib/stats_yard.ex is the main entry point for our app, so that seems like a good place to add a call to StatsYard.TimestampWriter.start_link/0 (ignoring the supervisor bits for now):
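    For reference, here's roughly what that looks like - it's just the file that mix generated for us, with our start_link call dropped in:

```elixir
defmodule StatsYard do
  use Application

  # See http://elixir-lang.org/docs/stable/elixir/Application.html
  # for more information on OTP Applications
  def start(_type, _args) do
    import Supervisor.Spec, warn: false

    StatsYard.TimestampWriter.start_link()

    children = [
      # Define workers and child supervisors to be supervised
      # worker(StatsYard.Worker, [arg1, arg2, arg3]),
    ]

    # See http://elixir-lang.org/docs/stable/elixir/Supervisor.html
    # for other strategies and supported options
    opts = [strategy: :one_for_one, name: StatsYard.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
```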


    Now to verify that it acts as we'd expect:


    Much better! Now let's see what happens when it dies, as most everything on a computer is likely to do at the worst possible time.

    There are a few ways to easily kill a process in Elixir, but I always enjoy throwing a value into a function that it can't handle and watching the poor process get shredded. I'm a jerk like that. Let's pass a tuple via the timestamp argument and see what happens:


    Excellent! Our GenServer is dead as a hammer. Soooo now what?

    Well, we can restart it manually in our iex session, but that's a non-starter when you consider your app running unattended. We could also just drop out of our session and then start it back up, but that's still pretty lame.

    Enter supervisors! If we take another glance at the chunk of code we saw above, paying special attention to the bits I said to ignore, we'll see the structure we need to use to start up our GenServer as a supervised process:

    • Line 11-14: A list of your Supervisor's child processes, each of which can either be a regular process or another Supervisor (sub-Supervisor)
      • Each item in this list will contain the module name for a process and will call that module's start_link function
      • Notice that the arguments passed to your start_link function are provided here as a list. This is done so that the Supervisor's start_link function doesn't have to handle arbitrary argument counts (arities) based on your process's start_link arity
    • Line 18: A list of Supervisor options. For now, the provided values are more than sufficient, and we'll talk more about what each one means in a later post
    • Line 19: The start_link call for your Supervisor process, which (shockingly!) looks awfully similar to the call we make for our own GenServer
    Now let's move our GenServer startup bits into the Supervisor's children list, and see what we can make of it:
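    The change itself is tiny - our worker goes into the children list instead of being started by hand:

```elixir
# lib/stats_yard.ex, inside start/2
# (and the manual StatsYard.TimestampWriter.start_link() call goes away)
children = [
  worker(StatsYard.TimestampWriter, [])
]
```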


    And the test:


    Woohoo! Our application is now considerably more "self-healing" than it was before. We can start up iex, and immediately start writing timestamps without explicitly starting our GenServer. Then we can crash the GenServer process, and immediately continue on writing timestamps without having to explicitly restart the process. Excellent!


    Experiments

    If you're the curious sort (and you really should be! It's awesome!), you may want to try and poke around with this a bit more by trying to crash our GenServer several times and seeing if the behavior persists. See if you can find a limit to what we've done here, and see if the docs for Elixir's Supervisors can guide you toward a fix.

    Next Time

    We've got a nice simple GenServer automated and supervised, but there's no guarantee that our next batch of code changes won't break something in some unforeseen way. We'll poke around with ExUnit next time, and see if we can start a basic unit test suite for our project.

    For now, you can peruse the source for this post at: https://github.com/strofcon/stats-yard/tree/lets-build-2