tag:blogger.com,1999:blog-68545736577842405552024-02-20T14:13:56.317-06:00Ramblings of a GeekMy name is Cody Boggs, and I'm a geek. I like to build things, break them, and then build them better!
<br>
Github: <a href="https://github.com/cboggs">cboggs</a>
::
Twitter: <a href="https://twitter.com/Strofcon">@Strofcon</a>Unknownnoreply@blogger.comBlogger17125tag:blogger.com,1999:blog-6854573657784240555.post-74346703939375235532017-09-07T08:00:00.000-05:002017-09-07T08:00:04.670-05:00An Ops Nerd Learns Probabilistic Data Structures, Part 1: IntroI have a friend who is <i>scary </i>smart. I mean, I'm a reasonably smart guy, sure... but <i>whoa.</i> As it turns out (very much in my favor) he's also super nice and happy to help me learn new things. So when he said something about a "probabilistic data structure," and I promptly responded with, "right, I know all three of those words... but... huh?", he kindly gave me a gentle introduction to what I would soon learn is a super cool category of Computer Science.<br />
<br />
<h4>
What the heck is a "probabilistic data structure?"</h4>
<div>
Well, to be glib, it's a data structure that relies on probabilities and stuff. More usefully, it's (usually) an efficient way to store massive quantities of data in such a way that you get, at most, "fairly confident" answers about that data.<br />
<br />
I know, that's not terribly helpful either, but we'll get there! I aim to make this series easy to digest and relate to. I'm not a math whiz nor am I a software engineer, which means I have to think about things in <i>way</i> simpler terms than either of those people would. Then when I go to explain it to folks smarter than myself (read: you), I like to do it in terms that are less muddied than the symbol-filled abstract articles and videos I've seen so far. I figure if I can't explain it to someone else in simpler terms than those in which I learned it, I probably don't understand it as well as I think.<br />
<br />
<h4>
Let's get some context: Why would I need such magic?</h4>
</div>
<h3>
Shell Game - Distributed Data Stores</h3>
<div>
Imagine that we have a pile of records that we need to store. We want the full data set stored on disk, but we can't possibly fit it all on one machine. So we split it up across 3 machines, and life's good!<br /><br />Eventually, though, we'll want to retrieve some of those records. It would be pretty wasteful to go to each machine, ask for a record, and force that machine to make an expensive series of disk accesses just to tell us that the princess is in another castle. We might get lucky and land on the right machine the first time, but my (sketchy) math skills tell me that will only happen on roughly one-third of our attempts.</div>
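To put a quick number on that intuition, here's a back-of-the-envelope check in Python (the three-machine setup and the uniform placement of records are just the assumptions from the story above):

```python
def expected_probes(n_machines: int) -> float:
    """Expected number of machines probed before finding a record,
    assuming the record is equally likely to live on any one machine
    and we probe machines one at a time in some fixed order."""
    return sum(k * (1.0 / n_machines) for k in range(1, n_machines + 1))

# With 3 machines we hit the right one first only 1/3 of the time,
# and on average we burn disk accesses on 2 machines per lookup.
print(expected_probes(3))
```

As the cluster grows, the waste grows with it - which is exactly why we'd like a cheap way to ask "could this record even be here?" before touching the disk.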
<div>
<br /></div>
<div>
Instead it would be handy if, as we ingest each record into our multi-machine data store, we wrote a highly-compressed version of that record to an in-memory data structure that can tell us - with some tunable degree of confidence - whether the data we want does or does not reside on a particular machine. One way of doing this with an impressive level of space efficiency is by implementing a <b>Bloom filter</b>, which we'll cover in the next post in this series.</div>
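Part 2 will build one properly, but as a teaser, here's a deliberately tiny Python sketch of the idea. The sizes and the salted-SHA-256 hashing scheme are arbitrary choices for illustration, not tuned or canonical:

```python
import hashlib

class ToyBloomFilter:
    """A deliberately tiny Bloom filter: k hash positions over m bits."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a plain int doubles as a bit array

    def _positions(self, item):
        # Derive k independent-ish positions by salting one hash function.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False is definitive ("definitely not here"); True only means
        # "probably here" - some other items may have set the same bits.
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```

A "no" answer is guaranteed correct; a "yes" answer is only probably correct. That asymmetry is the whole trade, and it's exactly what we want for "skip this machine entirely" decisions.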
<div>
<br /></div>
<h3>
#Hashtag #Frequency #Tracking #For #Fun #And (#Probably #No) #Profit</h3>
<div>
Imagine that we wanted to track the approximate frequency of various hashtags from the live Twitter stream. We can't really expect to store a full set of this data in memory with perfect fidelity, as we'd never have enough space. We also aren't likely to be successful dumping it all to disk, as that's a <i>lot</i> of writes that <i>never freaking slow down</i>.</div>
<div>
<br /></div>
<div>
We can, however, sacrifice a small bit of accuracy for a massive increase in space efficiency. We would do this with a data structure called a <b>count-min sketch</b>, which can tell us that a particular hashtag has most certainly <i>not</i> been seen more than N times, but <i>may</i> have been seen slightly fewer than N times. We'll cover this one in Part 3 of the series.</div>
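Part 3 will do this justice; for now, a bare-bones Python sketch of the shape of it (width, depth, and the hashing scheme are invented for illustration):

```python
import hashlib

class ToyCountMinSketch:
    """A bare-bones count-min sketch: depth rows of width counters."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across
        # rows is an upper bound: the true count is never higher.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))
```

Note the direction of the error: the estimate can overshoot the true count (when hashtags collide in every row), but it can never undershoot it - which matches the "certainly not more than N, maybe slightly fewer" guarantee above.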
<div>
<br /></div>
<h3>
An understanding of HyperLogLog is its own reward</h3>
<div>
Don't worry, we'll get there. For now, it should suffice to say "it's weird, and it lets us achieve <i>nigh-ludicrous</i> space efficiency."</div>
<div>
<br /></div>
<h4>
Up Next!</h4>
<div>
Part 2 will cover the <b>Bloom filter</b>. We'll talk about its perks, drawbacks, mechanisms, and use-cases. Then we'll build one and benchmark it! It should be all kinds of fun. See ya then!</div>
Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-6854573657784240555.post-65955071145974877992016-06-27T00:23:00.002-05:002016-06-27T11:55:29.539-05:00Let's Build Something: Elixir, Part 6 - Adding Metrics to StatsYardWe're going to take a bit of a detour with this post, and look at what we can do to get some basic metrics spit out of StatsYard! Not only am I a huge metrics geek so this is a pretty strong urge for me, but it's also something that will be helpful as we go forward. Implementing new functionality and fixing up old is generally made more difficult by not having decent measurements, and laying the groundwork for that now will make it easier to keep up with it as we go along.<br />
<br />
The old adage that <a href="http://c2.com/cgi/wiki?PrematureOptimization" target="_blank">premature optimization is the root of all evil</a> still holds, of course, even for a metrics-obsessed type like me. While there's value in checking that we haven't introduced a major performance impact to the system, it just won't do to nitpick every little bit and try to tease out a tiny bit more performance at this stage of the game. I mostly just intend to use it to get an idea of the system's capabilities and as a learning opportunity when we start looking at alternative compilers and organizational structures.<br />
<br />
<h2>An Overview</h2><div></div>So! Let's get into it, shall we? The high level view of what we're going to implement is this:<br />
<ul><li>Custom stats should be super easy to instrument as we build new functions and features</li>
<ul><li>Enter <a href="https://github.com/folsom-project/folsom" target="_blank">Folsom</a></li>
</ul><li>VM stats are awesome, but not something we should have to gather ourselves</li>
<ul><li>Enter <a href="https://github.com/fanduel/ex_vmstats" target="_blank">ex_vmstats</a></li>
</ul><li>Stats need to be shipped somewhere, preferably quickly and easily</li>
<ul><li>Enter <a href="https://github.com/lexmag/statix" target="_blank">Statix (statsd)</a></li>
</ul></ul><div>All of this will evolve over time, for sure, but this seems like a good start.</div><div><br />
</div><h2>O Back-End, Where Art Thou?</h2><div></div><div>Before we get too far, we need a place to stick our metrics! I'm actually using <a href="https://sysdig.com/" target="_blank">Sysdig Cloud</a> for this, because they have an <a href="http://tech.strofcon.org/2016/03/sysdig-cloud-monitoring-made-awesome-p1.html" target="_blank">agent that beats all the rest</a>, and their UI will suit my purposes nicely. The key feature of their agent that I'm leaning on is the built-in <a href="https://github.com/armon/statsite" target="_blank">statsd</a> server that will ship my stats to their back-end so that I don't have to worry about it. Fret not, though - nothing in this post will demand that anyone else use Sysdig, that's just where I'm going to throw data for now.</div><div><br />
<h2>Wiring Up Our New Dependencies</h2><div></div>First things first, let's add all of our dependencies and start up the apps as needed. We'll just lump them all in for now, and spend the rest of this post looking at how to make use of them. Crack open <span style="font-family: "courier new" , "courier" , monospace;">mix.exs</span><span style="font-family: inherit;"> and add </span><span style="font-family: "courier new" , "courier" , monospace;">:folsom</span><span style="font-family: inherit;">, </span><span style="font-family: "courier new" , "courier" , monospace;">:statix</span><span style="font-family: inherit;">, </span><span style="font-family: inherit;">and </span><span style="font-family: "courier new" , "courier" , monospace;">:ex_vmstats</span><span style="font-family: inherit;"> dependencies:</span></div><div><!-- lets-build-elixir-6_deps --><br />
<script src="https://gist.github.com/cboggs/b165496087f1af61fded1cc1a43582bb.js"></script><br />
</div><div>Still in <span style="font-family: "courier new" , "courier" , monospace;">mix.exs</span><span style="font-family: inherit;">, we also need to tell our project to start these new applications, otherwise they won't do their job at startup:</span></div><div><!-- lets-build-elixir-6_apps --><br />
<script src="https://gist.github.com/cboggs/87b68621532d4aba527140f8d5d0e24d.js"></script><br />
</div><h2>Our First Custom Metric</h2><div></div><div>Like I mentioned earlier, we're going to be using <a href="https://github.com/folsom-project/folsom" target="_blank">Folsom</a> for custom stats. It's an Erlang app, and it's a pretty good analog to Java's codahale.metrics, now known as <a href="http://metrics.dropwizard.io/3.1.0/" target="_blank">DropWizard</a>. The Github readme has pretty much all the info you could need about how it works, so we'll jump straight into instrumenting a call to one of our functions - <span style="font-family: "courier new" , "courier" , monospace;">DataPoint.validate/1</span><span style="font-family: inherit;">. </span><span style="font-family: inherit;">Open up </span><span style="font-family: "courier new" , "courier" , monospace;">lib/stats_yard/ingest_consumer.ex</span><span style="font-family: inherit;">, and we'll wrap our function call with a </span><span style="font-family: "courier new" , "courier" , monospace;">:folsom_metrics</span><span style="font-family: inherit;"> function:</span></div><div><!-- lets-build-elixir-6_validation-metric --><br />
<script src="https://gist.github.com/cboggs/8f0e4317e7c014da4e760ef8bcacac5b.js"></script><br />
We first have to define our metrics so that Folsom can get the appropriate structures setup in memory, so we add a private function <span style="font-family: "courier new" , "courier" , monospace;">create_metrics/0</span><span style="font-family: inherit;"> and then call it when we </span><span style="font-family: "courier new" , "courier" , monospace;">init</span><span style="font-family: inherit;"> our GenServer.</span><br />
<br />
Next we wrap our <span style="font-family: "courier new" , "courier" , monospace;">validate/1</span> function with Folsom's <span style="font-family: "courier new" , "courier" , monospace;">histogram_timed_update/2.</span><span style="font-family: inherit;"> This Folsom function is basically a pass-through function, in that it accepts a function of our choosing as one of its arguments, evaluates that function and does some kind of work in relation to it, and then returns the value of the function that it evaluated. In this case, the work done in addition to our validation function is timing its run time and updating an ETS table that Folsom maintains for "doing math" on and maintaining a history for our metric. From our perspective, though, almost nothing has changed. We don't get any new return values or anything from </span><span style="font-family: "courier new" , "courier" , monospace;">validate/1</span><span style="font-family: inherit;">, and so we can still use it in the same manner we wanted to originally.</span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">After we've validated the data point and gotten a timing for the invocation, we increment the counter </span><span style="font-family: "courier new" , "courier" , monospace;">validate.rate</span><span style="font-family: inherit;">, and then move on!</span></div><div><br />
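The pass-through pattern Folsom uses here is language-agnostic, and it's worth internalizing. A loose Python analogy (the <code>timings</code> dict and <code>timed_update</code> are invented names for illustration - this is not Folsom's actual API) looks like this:

```python
import time

timings = {}  # stand-in for the ETS tables Folsom maintains

def timed_update(metric_name, func, *args):
    """Evaluate func(*args), record how long it took, and hand the
    result straight back to the caller, unchanged."""
    start = time.perf_counter()
    result = func(*args)
    timings.setdefault(metric_name, []).append(time.perf_counter() - start)
    return result

# The caller sees exactly what calling sum directly would return:
total = timed_update("validate.time", sum, [1, 2, 3])
print(total)  # 6
```

Because the wrapper returns the wrapped function's value untouched, instrumentation stays invisible to the call site - which is why we could wrap <code>validate/1</code> without changing anything downstream.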
</div><h2>Dude, Where's My Data?</h2><div></div>So we created some metrics and updated their values - why don't we see anything happening in our statsd server??? As it turns out, Folsom's only goal in life is to keep track of our stats, not actually send them anywhere. The beauty of Folsom is that it's optimized to keep track of our stats and keep them accessible in ETS tables, which reside in memory and are extremely fast. With Folsom doing the math and handling the history and access paths of our metrics, we're left with all the flexibility in the world when it comes to choosing what to do with the data that's tracked. So, to that end, let's wire up Statix so we can actually send these bits somewhere!<br />
<div><br />
</div><h2>Setting up Statix</h2><div></div><div>It doesn't take much to get started with Statix. At a minimum, we need to:<br />
<br />
<ul><li>Define a config for Statix in <span style="font-family: "courier new" , "courier" , monospace;">config/config.exs</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">use Statix</span><span style="font-family: inherit;"> in a module (any module), which will pull in a handful of useful functions that we can use to ship metrics to a statsd server</span></li>
<li><span style="font-family: inherit;">Make a quick call to the </span><span style="font-family: "courier new" , "courier" , monospace;">connect/0</span><span style="font-family: inherit;"> function that our Statix-using module now has, and we're off to the races with such functions as </span><span style="font-family: "courier new" , "courier" , monospace;">gauge/2</span><span style="font-family: inherit;">, </span><span style="font-family: "courier new" , "courier" , monospace;">increment/2</span><span style="font-family: inherit;">, </span><span style="font-family: "courier new" , "courier" , monospace;">de</span><span style="font-family: "courier new" , "courier" , monospace;">crement/2</span><span style="font-family: inherit;">, </span><span style="font-family: inherit;">etc.</span></li>
</ul><br />
<div>Our config is simple enough. Clearly one should change the host value to something appropriate as needed, but generally this is fine:<br />
<div><!-- lets-build-elixir-6_statix-config --><br />
<script src="https://gist.github.com/cboggs/63ab7cfd61ef242357f1a053a76c64d0.js"></script><br />
</div></div><span style="font-family: inherit;">That's all good and well, but we need to wire up our in-memory Folsom data with Statix so that it's actually of some use. Enter our shiny new </span><span style="font-family: "courier new" , "courier" , monospace;">StatsYard.StatShipper</span><span style="font-family: inherit;"> module!</span><br />
<div><br />
</div><span style="font-family: inherit;">In the file </span><span style="font-family: "courier new" , "courier" , monospace;">lib/stats_yard/stat_shipper.ex</span><span style="font-family: inherit;"> we'll place the following:</span></div><div><!-- lets-build-elixir-6_stat_shipper --><br />
<script src="https://gist.github.com/cboggs/f770c368e57e62b1838002b933089021.js"></script><br />
</div><div>Whew! Lots going on here, so let's talk about it at a high level, then we'll jump into the line-by-line breakdown. The general idea here is to build a supervised GenServer that will, on a certain interval, iterate across all metrics known to Folsom and ship them out via Statix. This keeps the shipping of metrics out of the data path for actually <i>collecting</i> the metrics, and since it's supervised separately at the top level, its workload isn't tied to any other part of the app.</div><div><br />
</div><div>Let's dive in and see what exactly is happening here:</div><div><ul><li>Lines 9-10: Folsom's <span style="font-family: "courier new" , "courier" , monospace;">histogram</span> and <span style="font-family: "courier new" , "courier" , monospace;">meter</span><span style="font-family: inherit;"> metric types are actually collections of various related values, but we don't necessarily always want all of them. Here we define the ones we want for now</span></li>
<li>Line 12: This GenServer needs to do work every N milliseconds, and we default N to 1000</li>
<li>Line 17: Our <span style="font-family: "courier new" , "courier" , monospace;">init/1</span><span style="font-family: inherit;"> function does nothing more than schedule the first metrics shipment, then lets everything run itself from here on out</span></li>
<li><span style="font-family: inherit;">Line 21-23: This function shows how easy it is to do something after a set time. </span><span style="font-family: "courier new" , "courier" , monospace;">Process.send_after/3</span><span style="font-family: inherit;"> does exactly what it sounds like - sends a message to an arbitrary PID after the specified time. In this case, the GenServer is sending itself a message telling itself to ship metrics, but waits </span><span style="font-family: "courier new" , "courier" , monospace;">ship_interval</span><span style="font-family: inherit;"> milliseconds before doing so</span></li>
<li><span style="font-family: inherit;">Line 25: We're using </span><span style="font-family: "courier new" , "courier" , monospace;">handle_info/2</span><span style="font-family: inherit;"> here, because it's the GenServer callback that can handles all arbitrary messages send to the GenServer that are NOT handled by any other callback. Our message telling the server to ship metrics is, indeed, one such arbitrary message</span></li>
<li><span style="font-family: inherit;">Line 26: Here we call the function that actually kicks off the iteration over Folsom's metrics and then ships them off. We also set the GenServer's state to the return value of this function, which we could use (but aren't currently) to terminate the server if metrics shipping failed</span></li>
<li>Line 27: Once metrics have been shipped, schedule the next shipment for <span style="font-family: "courier new" , "courier" , monospace;">ship_interval</span><span style="font-family: inherit;"> from now</span></li>
<li><span style="font-family: inherit;">Line 32-41: For every metric that Folsom has registered, deconstruct its metadata tuple and take appropriate action based on the metric's type</span></li>
<li><span style="font-family: inherit;">Line 50-54: Break apart a meter metric and only ship the bits that we've allowed with </span><span style="font-family: "courier new" , "courier" , monospace;">@shippable_meter_stats</span></li>
<li><span style="font-family: inherit;">Line 60-71: Histogram metrics in Folsom are a bit of of a mix of data types, so we need to walk every keyword in the list and when we come across the </span><span style="font-family: "courier new" , "courier" , monospace;">:percentile</span><span style="font-family: inherit;"> keyword, handle it appropriately. All other keywords are shippable as-is</span></li>
</ul></div><div>I know, it's a mouthful (keyboardful?), but hopefully it helps to explain things a bit. Now we need to start up our GenServer and get it supervised, so off we go to <span style="font-family: "courier new" , "courier" , monospace;">lib/stats_yard.ex</span><span style="font-family: inherit;"> where we'll add a new function to be called from </span><span style="font-family: "courier new" , "courier" , monospace;">start/2</span><span style="font-family: inherit;">:</span></div><div><!-- lets-build-elixir-6_start-stat-shipper --><br />
<script src="https://gist.github.com/cboggs/92a80dc2c09b4470a115d91ebcb03ac3.js"></script><br />
</div><h2>A Note On Running Multiple Aggregators</h2><div></div><div>I can hear you asking now - why so many gauges in the stat shipper module? Well, there's a bit of a headache that can crop up when you run multiple stats aggregators like we're doing here with Folsom and StatsD. Both tools are going to store incoming values and do some math on them, and maintain some rolling / windowed values as well. If we let both tools do this, however, then by the time our metric data makes it to the back-end, we're going to be seeing skewed (and often, watered-down) values. We can generally avoid this by allowing Folsom to do the math and windowing, and ship the results to StatsD as <span style="font-family: "courier new" , "courier" , monospace;">gauge</span><span style="font-family: inherit;"> values. This causes StatsD to consider the value to be a point-in-time value that's not dependent on any past events, and thus passes it through as-is.</span></div><div><br />
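The skew is easy to demonstrate with made-up latency numbers: if a second aggregator naively re-averaged Folsom's already-averaged windows, a nearly idle window would count just as much as a busy one.

```python
# Two aggregation windows: one busy, one nearly idle (numbers invented).
window_a = [10] * 100  # 100 requests at 10ms
window_b = [500] * 2   # 2 requests at 500ms

true_mean = sum(window_a + window_b) / (len(window_a) + len(window_b))

# What a second aggregator averaging the per-window means would report:
mean_of_means = (sum(window_a) / len(window_a)
                 + sum(window_b) / len(window_b)) / 2

print(round(true_mean, 1))  # 19.6
print(mean_of_means)        # 255.0
```

Shipping Folsom's already-computed values as gauges sidesteps the whole problem: StatsD treats each value as a point-in-time reading and passes it through rather than doing its own math on top.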
</div><h2>Last But Not Least - VM Stats</h2><div></div><div>For grabbing VM stats (memory utilization breakdown, process counts, queue sizes, etc.), we're going to start out with <a href="https://github.com/fanduel/ex_vmstats">ex_vmstats</a>. It does its job right out of the box, but we need to make it work with Statix first. As is mentioned in the Github readme for ex_vmstats, we need to create our own Statix backend and then specify it in our mix config. Let's take care of the backend module by creating a new file named <span style="font-family: "courier new" , "courier" , monospace;">ex_vmstats_statix_backend.ex</span><span style="font-family: inherit;">:</span></div><div><!-- lets-build-elixir-6_statix-backend --><br />
<script src="https://gist.github.com/cboggs/4c65b6d6d7afa3a3f719c60205276dc9.js"></script><br />
</div><div>Now all that's left is to start up the app and we'll start sending VM stats to the back-end!</div><div><br />
</div><h2>Checking It All Out</h2><div></div><div>To help test out our shiny new stats without having to do a lot of super repetitive typing, I've added a small module called <span style="font-family: Courier New, Courier, monospace;">StatsYard.Util.IngestStress</span><span style="font-family: inherit;">. It can be used as such from an </span><span style="font-family: Courier New, Courier, monospace;">iex -S mix</span><span style="font-family: inherit;"> session:</span></div><div><span style="font-family: inherit;"><br />
</span></div><div><span style="font-family: Courier New, Courier, monospace;">StatsYard.Util.IngestStress.run 2_000_000</span></div><div><span style="font-family: Courier New, Courier, monospace;"><br />
</span></div><div><span style="font-family: inherit;">This will lazily generate 2 million valid </span><span style="font-family: Courier New, Courier, monospace;">DataPoint</span><span style="font-family: inherit;"> structs and shovel them into our ingest queue. I took the liberty of doing this locally and grabbing a screenshot of my Sysdig dashboard to give an idea of what's going on:</span></div><div><br />
</div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://raw.githubusercontent.com/strofcon/stats-yard/lets-build-6/screenshots/SysdigDashboard1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="313" src="https://raw.githubusercontent.com/strofcon/stats-yard/lets-build-6/screenshots/SysdigDashboard1.png" width="640" /></a></div></div><div><br />
</div><div>Cool! Looks like my current setup can validate roughly 15,000 data points per second, with each validation generally taking under 20 microseconds. Excellent! There are some other cool VM stats that I've graphed here too, which are definitely worth taking a look at as we move forward.<br />
</div><div><br />
</div><h2>Next Time</h2><div></div><div>A bit of a departure from what was promised last time, we'll just be looking at the data path next time 'round. Worker pools will have to wait until we actually start writing some data to disk!</div><div><br />
</div><div>Until next time, feel free to peruse the code for this post at <a href="https://github.com/strofcon/stats-yard/tree/lets-build-6" target="_blank">https://github.com/strofcon/stats-yard/tree/lets-build-6</a>.</div>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-6854573657784240555.post-3709421955618566702016-05-11T23:18:00.002-05:002016-05-16T21:54:03.955-05:00Let's Build Something: Elixir, Part 5b - Testing Data Validation and Logging<div>
Let's wrap up <a href="http://tech.strofcon.org/2016/04/lets-build-something-elixir-part-5a.html" target="_blank">this installment</a> by writing some unit tests to safeguard us against bad changes to our data validation code!</div>
<br />
<div>
First we'll put together a couple of tests to ensure that validation and invalidation are working correctly:</div>
<div>
<!-- lets-build-elixir-5_test-take-1 --><br />
<script src="https://gist.github.com/cboggs/0188178dcbe7655e0df48131b4a336c5.js"></script><br /></div>
<div>
Let's see how we did:</div>
<div>
<!-- lets-build-elixir-5_test-run-1 --><br />
<script src="https://gist.github.com/cboggs/85afd501a70afa6619d3f2eb9c6b1924.js"></script><br /></div>
<div>
Well, it certainly worked, but it's a bit clunky. Two things stand out:<br />
<ol>
<li>We should probably test multiple invalid <span style="font-family: "courier new" , "courier" , monospace;">DataPoint</span>s, since there are multiple guards that we want to make sure we got right. </li>
<li>We probably don't want those log messages showing up in the test output. We likely want to try and capture those to ensure that we are indeed logging appropriately, without cluttering the output of the test run.</li>
</ol>
</div>
<div>
So let's do that! First up, let's add some more bad <span style="font-family: "courier new" , "courier" , monospace;">DataPoint</span> structs to test. We'll switch to using a list instead of individually-named structs for our tests, and we'll also do the same for the valid <span style="font-family: "courier new" , "courier" , monospace;">DataPoint</span>s - just in case we decide to test multiple ways there. Consistency FTW!</div>
<div>
<!-- lets-build-elixir-5_test-take-2 --><br />
<script src="https://gist.github.com/cboggs/cc0ea7f0ea7690954c12d064099dcbac.js"></script><br /></div>
<div>
And the run:</div>
<div>
<!-- lets-build-elixir-5_test-run-2 --><br />
<script src="https://gist.github.com/cboggs/80eaa7397273eb98d30c12b06faf4311.js"></script><br /></div>
<div>
Cool! We made our logging problem even worse, though. Let's take care of that. To make this work, we're going to use ExUnit's <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://elixir-lang.org/docs/stable/ex_unit/ExUnit.CaptureLog.html" target="_blank">capture_log/2</a></span><span style="font-family: inherit;"> to capture the output from </span><span style="font-family: "courier new" , "courier" , monospace;">Logger</span><span style="font-family: inherit;">, spit out by our </span><span style="font-family: "courier new" , "courier" , monospace;">DataPoint</span><span style="font-family: inherit;"> module:</span></div>
<div>
<!-- lets-build-elixir-5_test-take-3 --><br />
<script src="https://gist.github.com/cboggs/4179299a60c032ade49722de4942364b.js"></script><br /></div>
<div>
I inserted a small bug into the test for valid <span style="font-family: "courier new" , "courier" , monospace;">DataPoint</span>s - do you see it? No? It's pretty subtle, no worries. I told <span style="font-family: "courier new" , "courier" , monospace;">capture_log/2</span><span style="font-family: inherit;"> that I'm expecting a </span><span style="font-family: "courier new" , "courier" , monospace;">Logger</span><span style="font-family: inherit;"> message with level </span><span style="font-family: "courier new" , "courier" , monospace;">:info</span><span style="font-family: inherit;">, but our validation function is actually logging valid structs as </span><span style="font-family: "courier new" , "courier" , monospace;">:debug</span><span style="font-family: inherit;"> messages. Let's see how this works out:</span></div>
<div>
<!-- lets-build-elixir-5_test-run-3 --><br />
<script src="https://gist.github.com/cboggs/eefeeaf57c26a6b31a1578e969ca034b.js"></script><br /></div>
<div>
So <span style="font-family: "courier new" , "courier" , monospace;">capture_log/2</span> captures the log event but swallows it, and the assertion fails since the level was mismatched. Handy!</div>
<br />
<div>
Switch the valid <span style="font-family: "courier new" , "courier" , monospace;">capture_log/2</span> call back to :debug and we're good to go:</div>
<div>
<!-- lets-build-elixir-5_test-run-3 --><br />
<script src="https://gist.github.com/cboggs/d4e8a028ccf435850d4b35bedb9f0983.js"></script><br />
<strike>(Note: I'm not sure yet how to make it stop complaining about unused variables here. They're certainly used, so I'm not sure why it's not happy about my use. I'll dig into that and iron it out in a later commit.)</strike><br />
Shout-out to <a href="https://github.com/DNNX" target="_blank">DNNX</a> for pointing out the problem with the unused variables in the <span style="font-family: "courier new" , "courier" , monospace;">DataPoint</span> tests! I've also updated the gist in this post to reflect his change. Check out his <a href="https://github.com/strofcon/stats-yard/pull/8" target="_blank">pull request</a> for an explanation of what was happening here. Many thanks Viktar!</div>
<div>
<br />
<h2>
Next Time</h2>
</div>
<div>
We'll continue next time by introducing a worker pool that we can use to distribute our data validation workload over, and start sketching out how the write path will look when we start persisting data to disk.</div>
<div>
</div>
<div>
Until then, feel free to peruse this post's code on Github at <a href="https://github.com/strofcon/stats-yard/tree/lets-build-5">https://github.com/strofcon/stats-yard/tree/lets-build-5</a>. </div>
Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-6854573657784240555.post-48363723829252010352016-04-18T22:11:00.001-05:002016-05-16T21:54:03.959-05:00Let's Build Something: Elixir, Part 5a - Data Ingest, Consumption, and ValidationWhew! Been a <a href="http://tech.strofcon.org/2016/04/lets-build-something-elixir-part-4.html" target="_blank">little while</a>, but let's keep cruisin'! This installment will tackle defining a custom data type for our data points, providing a means by which we can queue up ingested data and consume it continuously, and validate it as we go along. We'll also write some tests to keep our sanity.<br />
<br />
First let's define what our data points should look like. We'll keep it simple for now, and can expand later. Open up a new file at <span style="font-family: "courier new" , "courier" , monospace;">lib/stats_yard/data_point.ex</span><span style="font-family: inherit;">:</span><br />
<div>
</div>
<div>
<!-- lets-build-elixir-5_DataPoint-struct --><br />
<script src="https://gist.github.com/cboggs/2acacc1422bdf5a91a0c4cf0da60e0d3.js"></script><br /></div>
There are a couple of things happening here that you might be interested in:<br />
<br />
<ul>
<li>Line 4: We define a type for our DataPoint <a href="http://elixir-lang.org/getting-started/structs.html" target="_blank">struct</a>. Note that <span style="font-family: "courier new" , "courier" , monospace;">__MODULE__</span> is just a safe way to reference the name of the current module (<span style="font-family: "courier new" , "courier" , monospace;">StatsYard.DataPoint</span>). We can later reference this as <span style="font-family: "courier new" , "courier" , monospace;">StatsYard.DataPoint.t</span><span style="font-family: inherit;"> when the need arises.</span></li>
<li><span style="font-family: inherit;">Line 5: Define our basic DataPoint struct. Note that structs always inherit the name of the module in which they are defined. If we wanted it to be referenced as something other than </span><span style="font-family: "courier new" , "courier" , monospace;">%StatsYard.DataPoint{}</span><span style="font-family: inherit;"> we would need to define a module within the outer module, such as </span><span style="font-family: "courier new" , "courier" , monospace;">StatsYard.DataPoint.Thing</span><span style="font-family: inherit;">. There's no real need for that in this case.</span></li>
<li>Line 7-10: Set up a validation function that will only work when the argument passed is one of our shiny new structs, <i>and</i> the various keys therein pass our guards. Specifically we want the metric and entity fields to be strings (or binaries, in Elixir/Erlang land), and the value to be some type of number.</li>
<li>Line 11-12: If we end up in this version of the validate function, log a message and return a tuple to indicate success and spit back out the provided struct.</li>
<li>Line 15: Define the "fall-through" validate function that will match on any argument that doesn't pass the above guards. In this case, log a warning and return an :invalid tuple with the provided value included.</li>
</ul>
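In case the embedded gist doesn't load for you, the module those bullets describe comes out roughly like this (a sketch reconstructed from the bullets; the exact guards and log messages are my assumptions, not the gist verbatim):

```elixir
defmodule StatsYard.DataPoint do
  require Logger

  # Line 4: a type for our struct; __MODULE__ safely expands to
  # StatsYard.DataPoint, so this is referenceable as StatsYard.DataPoint.t
  @type t :: %__MODULE__{metric: String.t, entity: String.t, value: number}

  # Line 5: the struct itself, which inherits the enclosing module's name
  defstruct metric: nil, entity: nil, value: nil

  # Lines 7-10: only matches a DataPoint struct whose metric and entity
  # fields are binaries and whose value is some type of number
  def validate(%__MODULE__{metric: m, entity: e, value: v} = data_point)
      when is_binary(m) and is_binary(e) and is_number(v) do
    # Lines 11-12: log a message and spit the struct back out in a success tuple
    Logger.debug "valid data point: #{inspect data_point}"
    {:ok, data_point}
  end

  # Line 15: fall-through clause for anything that fails the guards above
  def validate(bad_value) do
    Logger.warn "invalid data point: #{inspect bad_value}"
    {:invalid, bad_value}
  end
end
```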
<div>
This module is intended to wrap up the structure of the data we want to use and the functions that are relevant in inspecting and validating it.<br />
<br />
Next up, let's add something to let us queue up incoming data points. I like <a href="https://github.com/joekain/blockingqueue" target="_blank">Joe Kain's BlockingQueue</a>. It's a statically-sized GenServer'd <a href="https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics)" target="_blank">FIFO</a> queue that blocks when full or empty. Super simple, and very effective. NB: Joe also has a super awesome blog called <a href="http://learningelixir.joekain.com/" target="_blank">Learning Elixir</a> that you really should check out.<br />
<br />
First up we need to add it to our deps list in <span style="font-family: "courier new" , "courier" , monospace;">mix.exs</span><span style="font-family: inherit;">:</span></div>
<div>
<!-- lets-build-elixir-5_mix-deps-snippet --><br />
<script src="https://gist.github.com/cboggs/6f962f8725f0153195c87c4b03bcc39d.js"></script><br /></div>
Then follow that up with a <span style="font-family: "courier new" , "courier" , monospace;">mix deps.get</span>, and we're ready to roll.<br />
<div>
<br /></div>
<div>
First let's walk through the idea here. I want to have a <span style="font-family: "courier new" , "courier" , monospace;">BlockingQueue</span> GenServer that catches our ingested <span style="font-family: "courier new" , "courier" , monospace;">DataPoint</span><span style="font-family: inherit;">s</span>, and a separate consumer GenServer that will pop those <span style="font-family: "courier new" , "courier" , monospace;">DataPoint</span>s off the ingest queue and validate them. The most important part of all this is that I don't want to do things in batches, nor do I want to have to explicitly trigger the consumption of data from the ingest queue. Enter supervision and streams!<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">BlockingQueue</span>'s API gives two functions for popping events off a queue: <span style="font-family: "courier new" , "courier" , monospace;">pop/1</span><span style="font-family: inherit;"> and </span><span style="font-family: "courier new" , "courier" , monospace;">pop_stream/1</span><span style="font-family: inherit;">. As you might have guessed, </span><span style="font-family: "courier new" , "courier" , monospace;">pop/1</span><span style="font-family: inherit;"> removes and returns the oldest single value from the queue, while </span><span style="font-family: "courier new" , "courier" , monospace;">pop_stream/1</span><span style="font-family: inherit;"> returns a Stream function - specifically, </span><a href="http://elixir-lang.org/docs/stable/elixir/Stream.html#repeatedly/1" style="font-family: 'Courier New', Courier, monospace;">Stream.repeatedly/1</a><span style="font-family: inherit;">. If you're unfamiliar with Streams, they're effectively a safe mechanism by which you can perform actions on arbitrarily large data sets without having to pull the entire thing into memory. I'm not the best to describe these in great detail, but <a href="http://learningelixir.joekain.com/stream-patterns-in-elixir/">Joe Kain</a> and <a href="http://elixir-lang.org/getting-started/enumerables-and-streams.html">Elixir's Getting Started guide</a> have some good descriptions and applications.</span><br />
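To make that concrete, here's a hedged sketch of pop_stream/1 in action (the DataPoint plumbing is illustrative, not the final consumer code):

```elixir
# Start a queue that holds at most 5 items; pushes block when it's full,
# pops block when it's empty
{:ok, queue} = BlockingQueue.start_link(5)

# A producer pushes values in as they're ingested...
BlockingQueue.push(queue, %{metric: "cpu.load", entity: "web-01", value: 0.7})

# ...while a consumer lazily walks the stream, validating each item as it
# arrives, without ever pulling the whole data set into memory
queue
|> BlockingQueue.pop_stream()
|> Stream.map(&StatsYard.DataPoint.validate/1)
|> Stream.run()   # blocks here, consuming items indefinitely as they show up
```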
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">So the layout of these bits is going to be something like this:</span><br />
<br />
<ol>
<li>Start up a named and supervised <span style="font-family: "courier new" , "courier" , monospace;">BlockingQueue</span> GenServer</li>
<li>Start up a named and supervised consumer GenServer that can find the above <span style="font-family: "courier new" , "courier" , monospace;">BlockingQueue</span> process</li>
<li>Consumer process grabs hold of the <span style="font-family: "courier new" , "courier" , monospace;">Stream</span><span style="font-family: inherit;"> function returned by </span><span style="font-family: "courier new" , "courier" , monospace;">BlockingQueue.pop_stream/1</span><span style="font-family: inherit;"> and proceeds to validate every </span><span style="font-family: "courier new" , "courier" , monospace;">DataPoint</span><span style="font-family: inherit;"> that gets pushed into the queue</span></li>
</ol>
</div>
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">Here's our consumer:</span><br />
<div>
<!-- lets-build-elixir-5_ingest-consumer --><br />
<script src="https://gist.github.com/cboggs/a66c020820179d7a579897128688b1bd.js"></script><br /></div>
<div>
So here's what's going on in this module:<br />
<br />
<ol>
<li>Line 6: This is my nifty way of telling this function "keep an eye out for things popping up in this stream of data, and take appropriate action when you see a new item." We'll test this out shortly in iex</li>
<li>Line 7: Try to validate, as a <span style="font-family: "courier new" , "courier" , monospace;">DataPoint</span>, every item that comes off the stream</li>
<li>Line 8-9: If the validate succeeds, return the <span style="font-family: "courier new" , "courier" , monospace;">DataPoint</span><span style="font-family: inherit;">, otherwise discard it entirely (remember that our </span><span style="font-family: "courier new" , "courier" , monospace;">StatsYard.DataPoint.validate/1</span><span style="font-family: inherit;"> function will log a warning when a value fails validation)</span></li>
<li><span style="font-family: inherit;">Line 18: Note that our public </span><span style="font-family: "courier new" , "courier" , monospace;">start_link/2</span><span style="font-family: inherit;"> function expects to receive the PID of our ingest queue, which we'll provide in </span><span style="font-family: "courier new" , "courier" , monospace;">lib/stats_yard.ex</span><span style="font-family: inherit;"> when we set up our supervision tree</span></li>
<li><span style="font-family: inherit;">Line 23-25: Start up a <a href="http://elixir-lang.org/getting-started/processes.html#links">linked process</a> that will kick off our queue consumption loop in </span><span style="font-family: "courier new" , "courier" , monospace;">consume_data_points/1</span></li>
<li><span style="font-family: inherit;">Line 27: Set our GenServer's state to the PID of our ingest queue, and we're done!</span></li>
</ol>
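Pulling those bullets together, the consumer comes out roughly like this (a reconstruction from the bullets; names and details may differ slightly from the gist):

```elixir
defmodule StatsYard.IngestConsumer do
  use GenServer

  # Line 18: the public API expects the PID of our ingest queue
  def start_link(queue_pid, opts \\ []) do
    GenServer.start_link(__MODULE__, queue_pid, opts)
  end

  def init(queue_pid) do
    # Lines 23-25: kick off the queue consumption loop in a linked process
    spawn_link(fn -> consume_data_points(queue_pid) end)
    # Line 27: our GenServer's state is just the queue's PID
    {:ok, queue_pid}
  end

  # Lines 6-9: keep an eye on the stream, try to validate everything that
  # pops up on it, and keep only the values that validate cleanly
  defp consume_data_points(queue_pid) do
    queue_pid
    |> BlockingQueue.pop_stream()
    |> Stream.map(&StatsYard.DataPoint.validate/1)
    |> Stream.filter(fn
         {:ok, _data_point} -> true
         {:invalid, _bad_value} -> false
       end)
    |> Stream.run()
  end
end
```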
<div>
Notice that this is a very simple GenServer - so simple, in fact, that it doesn't even have any direct means of interaction. For now, this is more than sufficient - we just want something that we can supervise and organize appropriately, with the option to extend it for more robust behavior in the future. (For the studious, you're right - there's always Elixir's <a href="http://elixir-lang.org/docs/stable/elixir/GenEvent.html">GenEvent</a>, but that's for future posts!)</div>
<div>
<br /></div>
<div>
Now let's rig all this up in our supervision tree, and then we'll poke around in <span style="font-family: "courier new" , "courier" , monospace;">iex</span> to see if it's all working as expected. Notice that this has been cleaned up a bit to accommodate our <span style="font-family: "courier new" , "courier" , monospace;">TimeStampWriter</span><span style="font-family: inherit;"> bits without getting too cluttered:</span></div>
<div>
<!-- lets-build-elixir-5_supervision-tree --><br />
<script src="https://gist.github.com/cboggs/b4db9eb5e27ac926b7e79bb097185b6c.js"></script><br /></div>
<div>
</div>
(I know that's a big chunk of code to dump into a blog post - my apologies. I mostly wanted to be sure to point out that the structure of this stuff changed significantly. Newer changes will be limited to just the diffs. :-) )<br />
<br />
Nothing super exciting here, other than a few things to note:<br />
<ol>
<li>All supervisors are now started up in their own independent functions, which are called from the pared-down <span style="font-family: "courier new" , "courier" , monospace;">start/2</span><span style="font-family: inherit;"> function</span></li>
<li>Our supervisors are now named appropriately (Lines 20 and 36)</li>
<li>Our <span style="font-family: "courier new" , "courier" , monospace;">start_main_ingest/0</span><span style="font-family: inherit;"> function lists two workers to be started up under the appropriate supervisor, <b>which will start in the order listed </b>(this will be on the quiz)</span></li>
<li>Atoms used to name our GenServer processes are pulled out and returned from simple functions at the bottom of the file, so as to avoid headaches later</li>
</ol>
<div>
Enough work, let's play with it! Fire up <span style="font-family: "courier new" , "courier" , monospace;">iex</span> and we'll see if our stuff works:</div>
<div>
<!-- lets-build-elixir-5_iex-push-vals --><br />
<script src="https://gist.github.com/cboggs/67b36b40e3bb9a780e80d8aabcc403fc.js"></script><br /></div>
<div>
Cool! We're able to push things into our <span style="font-family: "courier new" , "courier" , monospace;">BlockingQueue</span><span style="font-family: inherit;"> without having to know much about it, and our </span><span style="font-family: "courier new" , "courier" , monospace;">IngestConsumer</span><span style="font-family: inherit;"> immediately received the pushed values and attempted to validate them, the results of which are spit back out via log messages.</span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">Now for that quiz I mentioned earlier: in what order were our two ingest GenServers started? Yup, the order listed in our supervision tree definition - the queue first, then the consumer. Why does this matter? </span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">There's a failure case that we need to recognize and accommodate. Specifically, if our ingest queue process dies, it will indeed be restarted by the supervisor... <i>but</i> our consumer process will merrily chug along holding onto a Stream function that references a now-dead process! That sounds like bad news, but let's verify that I'm not making stuff up:</span></div>
<div>
<!-- lets-build-elixir-5_iex-kill-queue --><br />
<script src="https://gist.github.com/cboggs/6d9c951f40e41ac5e03b6476fc0ad4c1.js"></script><br /></div>
I know that's a bit dense, but the gist (har har) of it is that we used <span style="font-family: "courier new" , "courier" , monospace;">Supervisor.which_children/1</span><span style="font-family: inherit;"> to see what the PIDs of our two GenServers were, stopped the </span><span style="font-family: "courier new" , "courier" , monospace;">main_ingest_queue</span><span style="font-family: inherit;"> process (rather rudely, too, a la </span><span style="font-family: "courier new" , "courier" , monospace;">:kill</span><span style="font-family: inherit;">), then checked to see that the expected PID had indeed updated in the supervisor's state. Then we tried to push a value into the main ingest queue, which did indeed work since it had been restarted, but our ingest <i>consumer</i> process never knew about it, because it's waiting for events to flow in from a dead process. That's lame, so let's fix it!</span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">Turns out, this is a super simple one-line fix, but reading the docs is a must in order to understand why this fix is appropriate (head over to the <a href="http://elixir-lang.org/docs/stable/elixir/Supervisor.html">Supervisor docs</a>, then search for "rest_for_one"). In </span><span style="font-family: "courier new" , "courier" , monospace;">lib/stats_yard.ex</span><span style="font-family: inherit;">:</span></div>
<div>
<!-- lets-build-elixir-5_change-sup-strategy --><br />
<script src="https://gist.github.com/cboggs/b4e139b21ed0217183c2a179ff86c35c.js"></script><br /></div>
<div>
And now to test it out in <span style="font-family: "courier new" , "courier" , monospace;">iex</span>:<br />
<div>
<!-- lets-build-elixir-5_retry-kill-queue --><br />
<script src="https://gist.github.com/cboggs/4510397279dcff72506c68530df3b871.js"></script><br /></div>
Woohoo! Worked like a charm. What's happening here? First, read the docs. :-) Second, in a nutshell, using the strategy <span style="font-family: "courier new" , "courier" , monospace;">rest_for_one</span><span style="font-family: inherit;"> causes the supervisor to consider the position of any process that dies under its supervision and then kill and restart <i>that process along with all supervised processes that were started after it</i>. In our case, the queue process is the first one, so if it dies, everything in the supervision tree of our </span><span style="font-family: "courier new" , "courier" , monospace;">MainIngestSupervisor</span><span style="font-family: inherit;"> is restarted. If it were, for example, the 3rd process started by this supervisor, then the 3rd, 4th, 5th, ..., nth processes would be restarted, while the 1st and 2nd processes would be left alone. Super handy stuff here!</span><br />
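In sketch form, the change amounts to this (the child specs and names here are illustrative, using the Supervisor.Spec helpers that were current when this was written, and the exact arguments are assumptions):

```elixir
import Supervisor.Spec

# Children start in the order listed: the queue first, then its consumer.
# With :rest_for_one, if the queue dies, every child started after it
# (here, just the consumer) is also killed and restarted, so the consumer
# never keeps streaming from a dead queue process.
children = [
  worker(BlockingQueue, [5, [name: :main_ingest_queue]]),
  worker(StatsYard.IngestConsumer, [:main_ingest_queue])
]

Supervisor.start_link(children, strategy: :rest_for_one, name: StatsYard.MainIngestSupervisor)
```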
<br />
<h2>
<span style="font-family: inherit;">To Be Continued...</span></h2>
<span style="font-family: inherit;">So now we're in a good place from a supervision point of view. This post is already pretty lengthy, so I'm going to title it "Part 5a," and we'll continue with some unit tests and documentation in Part 5b.</span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">Til next time!</span></div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6854573657784240555.post-50467362474395540092016-04-14T22:33:00.000-05:002016-05-16T21:54:03.963-05:00Let's Build Something: Elixir, Part 4 - Better Tests, TypeSpecs, and DocsWe <a href="http://tech.strofcon.org/2016/04/lets-build-something-elixir-part-3.html">left off</a> with our first test case working, but less-than-ideal. Specifically, it's leaving the timestamp file it writes sitting on the disk, and in an inappropriate location (the root of our project). This is super lame, and we should fix that.<br />
<br />
Enter ExUnit callbacks and tags! ExUnit allows us to pass configuration data into and out of our tests by means of a dictionary, usually referred to as the "context". We can make good use of this context data by way of setup callbacks and tags. These are <a href="http://elixir-lang.org/docs/stable/ex_unit/ExUnit.Case.html" target="_blank">described well in the docs</a>, and we'll lean on their examples for what we need to accomplish here.<br />
<br />
So our test is currently testing <span style="font-family: "courier new" , "courier" , monospace;">TimestampWriter</span><span style="font-family: inherit;">'s ability to... ya know... write timestamps. And it works great, other than leaving the temp file sitting in the root of our project. While we could just add some code to our test to explicitly handle this, a better (and less-repetitious) approach is to modify our overall test case to do some setup and tear-down tasks for us automatically!</span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">First, remove the junk file leftover at </span><span style="font-family: "courier new" , "courier" , monospace;">stats_yard/tstamp_write.test</span><span style="font-family: inherit;">, and then we'll add a setup callback to our test case that will force our writes to happen in an appropriate directory:</span><br />
<div><!-- lets-build-elixir-4_1 --><br />
<script src="https://gist.github.com/cboggs/bf4a1c1942f6d1608500b356401025b3.js"></script><br />
</div>Here we're exercising a one-way communication from our test to the <span style="font-family: "courier new" , "courier" , monospace;">setup</span> callback by way of a tag. What's happening here is that ExUnit will call our <span style="font-family: "courier new" , "courier" , monospace;">setup</span> callback before execution of every test within our test case. By preceding a test definition with <span style="font-family: "courier new" , "courier" , monospace;">@tag</span><span style="font-family: inherit;">, we are specifying that a context dictionary should be passed to our </span><span style="font-family: "courier new" , "courier" , monospace;">setup</span><span style="font-family: inherit;"> callback that contains a key-value pair of </span><span style="font-family: "courier new" , "courier" , monospace;">{ cd: "fixtures" }</span><span style="font-family: inherit;">. </span>This is mostly copy-pasta'd straight out of the ExUnit docs, but a bit of explanation can't hurt:<br />
<ul><li>Line 2: We need to make sure our tests don't run in parallel since we're going to be switching directories. It sucks, but it's the nature of the beast</li>
<li>Line 4: Define our setup callback, which will be executed prior to every test that is run</li>
<li>Line 5: Check to see if our <span style="font-family: "courier new" , "courier" , monospace;">cd</span><span style="font-family: inherit;"> tag is present in the callback's current context dict. This is necessary because the same callback is executed for every test, but not every test will necessarily use this particular tag</span></li>
<li><span style="font-family: inherit;">Line 6-7: Store the current directory and switch to the directory specified in the context</span></li>
<li><span style="font-family: inherit;">Line 8: When the test exits (whether success or fail), switch back to our original directory</span></li>
<li><span style="font-family: inherit;">Line 14: Our handy-dandy tag for the test that immediately follows</span></li>
</ul><div>Let's see if it works!</div><div><!-- lets-build-elixir-4_2 --><br />
<script src="https://gist.github.com/cboggs/46d33373e076ca695389a9afefca62b0.js"></script><br />
</div>Nope! ExUnit apparently doesn't create directories for you. Oops. Easy fix, and again:<br />
<div><!-- lets-build-elixir-4_3 --><br />
<script src="https://gist.github.com/cboggs/ba70bf2d1f3009f38a95f4897d7e5004.js"></script><br />
</div>Much better. Now we should do some cleanup after the fact, because let's face it - no one wants temp files committed to their repo.<br />
<div><!-- lets-build-elixir-4_4 --><br />
<script src="https://gist.github.com/cboggs/ec1568a25cc4c7c83f51ba4f7c692549.js"></script><br />
</div>A quick rundown of the updates (slightly out of order):<br />
<br />
<ul><li>Line 23: Add a `tempfile` tag to our test to indicate that we're going to (attempt) to write a transient file</li>
<li>Line 24: Make our test accept a dict argument called `context` (which will contain our `tempfile` key)</li>
<li>Line 25: To keep things <a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself" target="_blank">DRY</a>, refer to the context's value for :tempfile instead of repeating the filename explicitly</li>
<li>Line 9: When the test is done, check to see if a tempfile was specified for the test that's being set up</li>
<li>Line 10: Make sure the tempfile actually got written, otherwise Line 11 will blow up</li>
<li>Line 11: Call the "dirty" version of <span style="font-family: "courier new" , "courier" , monospace;">File.rm/1</span><span style="font-family: inherit;">, just in case there are any weird permissions issues that prevent deletion of the file</span></li>
</ul><div>So now we should be able to run our test, and see precisely zero remnants of it:</div><div><!-- lets-build-elixir-4_5 --><br />
<script src="https://gist.github.com/cboggs/685dfd95881ad4aff60ec266bd1f34b3.js"></script><br />
</div>Perfect! Now we can write more tests in here and gain some nice organization and cleanup bits without having to provide anything beyond a couple of appropriate tags. (And after all that, yes, I do realize that a tempfile doesn't necessarily need to go into its own directory if it's just going to be immediately deleted. This just makes me feel better.)<br />
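For reference, here's the whole test case described above, with line numbers matching the bullets (a reconstruction; defer to the gist if they differ):

```elixir
defmodule StatsYard.TimestampWriterTest do
  use ExUnit.Case, async: false

  setup context do
    if cd = context[:cd] do
      prev_cd = File.cwd!
      File.cd!(cd)
      on_exit fn ->
        if tempfile = context[:tempfile] do
          if File.exists?(tempfile) do
            File.rm!(tempfile)
          end
        end

        File.cd!(prev_cd)
      end
    end

    :ok
  end

  @tag cd: "test/fixtures"
  @tag tempfile: "tstamp_write.test"
  test "can write timestamp to a file", context do
    assert :ok == StatsYard.TimestampWriter.write_timestamp(context[:tempfile], :os.system_time(1000))
  end

end
```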
<br />
To wrap up, let's make our GenServer's module and API a bit more legit with a typespec and docstrings:<br />
<div><!-- lets-build-elixir-4_6 --><br />
<script src="https://gist.github.com/cboggs/070cb2c0bd330c96da4e436e495ea290.js"></script><br />
</div>The <span style="font-family: "courier new" , "courier" , monospace;">@moduledoc</span><span style="font-family: inherit;"> and </span><span style="font-family: "courier new" , "courier" , monospace;">@doc</span><span style="font-family: inherit;"> directives are pretty straightforward - wrap up your docstrings in the """ markers, and get yo' docs on. Keep in mind that the docs are in markdown, so you can (and really should) make them look pretty.</span><br />
<br />
<span style="font-family: inherit;">The </span><span style="font-family: "courier new" , "courier" , monospace;">@spec</span><span style="font-family: inherit;"> directive on Line 22 is simply a way to specify the types of the arguments our function can accept, and the type of value it will return. Easy stuff, and super helpful when we start looking into static analysis - it can help iron out a ton of bugs early on.</span><br />
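A quick sketch of what this looks like on our write_timestamp/2 API (the signature, return values, and docstrings here are assumptions based on earlier posts, not copied from the gist):

```elixir
defmodule StatsYard.TimestampWriter do
  @moduledoc """
  A GenServer that writes timestamps to files on request.

  Docs are markdown, so *emphasis*, `backticks`, and lists all render nicely.
  """
  use GenServer

  @doc """
  Writes `timestamp` (milliseconds) to the file at `path`.

  Returns `:ok` on success, or `{:error, reason}` on failure.
  """
  @spec write_timestamp(String.t, integer) :: :ok | {:error, any}
  def write_timestamp(path, timestamp) do
    # assumes the server was started with name: __MODULE__
    GenServer.call(__MODULE__, {:write_timestamp, path, timestamp})
  end
end
```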
<br />
<h2>Next Time</h2><div>Now that we've spent some time on some of the basics that we'll be seeing over and over again, the next post will get into more of the meat of our project and start doing stuff that's more interesting than writing a timestamp to a file. Specifically, we'll define the first iteration of our data format and figure out a way to represent that in code such that we can validate incoming requests for appropriate structure.</div><div><br />
</div><div>Until then, feel free to peruse the source for this post at: <a href="https://github.com/strofcon/stats-yard/tree/lets-build-4" target="_blank">https://github.com/strofcon/stats-yard/tree/lets-build-4</a></div><br />
<br />
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6854573657784240555.post-21402467653669428362016-04-04T21:44:00.000-05:002016-05-16T21:54:03.947-05:00Let's Build Something: Elixir, Part 3 - Getting Started with ExUnit for Testing<div>NOTE: Before you get too far into this one, I want to mention that I realized I wasn't following convention in my Elixir file names, so there's a commit at the beginning of this post's branch that fixes it (and it's been merged to master as the others have as well). Just a heads-up in case it seems weird all of a sudden. :-)</div><div><br />
</div><div><a href="http://tech.strofcon.org/2016/03/lets-build-something-elixir-part-2.html">Last time</a> we made our TimestampWriter GenServer a supervised process to make it more resilient to bad inputs and other process-killing events. Now it's time to protect our GenServer from a much more sneaky and persistent assailant - us! This seems like a good time to get familiar with ExUnit and build our first test case for StatsYard.</div><div><br />
</div><div>Defining ExUnit test cases is pretty similar to defining any other module in our Elixir project. If you pop into the <span style="font-family: "courier new" , "courier" , monospace;">stats_yard/test </span><span style="font-family: inherit;">directory and take a look at </span><span style="font-family: "courier new" , "courier" , monospace;">stats_yard_test.exs</span><span style="font-family: inherit;">, you'll see a simple example test:</span></div><div></div><div><!-- lets-build-elixir-3_1 --><br />
<script src="https://gist.github.com/cboggs/1c5ec33ff380f189f206a954995629ac.js"></script><br />
</div>Running this test is as easy as a quick <span style="font-family: "courier new" , "courier" , monospace;">mix test:</span><br />
<div><!-- lets-build-elixir-3_2 --><br />
<script src="https://gist.github.com/cboggs/8493d8c274591d6f350890af639f3e98.js"></script><br />
</div>Let's break that test down just a bit:<br />
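That generated test is the standard <span style="font-family: "courier new" , "courier" , monospace;">mix new</span> scaffold, reconstructed here so the line numbers below have something to point at:

```elixir
defmodule StatsYardTest do
  use ExUnit.Case
  doctest StatsYard

  test "the truth" do
    assert 1 + 1 == 2
  end
end
```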
<div></div><ul><li>Line 1: As mentioned above, a test case is simply an Elixir module</li>
<li>Line 2: Pull in ExUnit's test case bits</li>
<li>Line 3: This line will cause <a href="http://elixir-lang.org/docs/stable/ex_unit/ExUnit.DocTest.html">ExUnit to do some magic</a> that we'll discuss at a later date</li>
<li>Line 5: Defines a unit test with an arbitrary string label</li>
<li>Line 6: Makes the bold assertion that 1 + 1 does in fact equal 2</li>
<ul><li><span style="font-family: "courier new" , "courier" , monospace;">assert</span><span style="font-family: inherit;"> is basically just shorthand (or more accurately, a macro) that says "everything after the keyword 'assert' better be true, otherwise I'm gonna blow up and fail spectacularly". (There's a bit more to it, and we'll tackle that next.)</span></li>
</ul></ul><div>To see <span style="font-family: "courier new" , "courier" , monospace;">assert</span><span style="font-family: inherit;"> </span>do its thing in a less-than-true situation, we can just change the 2 to a 3 on Line 6 and run <span style="font-family: "courier new" , "courier" , monospace;">mix test:</span></div><div><!-- lets-build-elixir-3_3 --><br />
<script src="https://gist.github.com/cboggs/a7da71931129e7b80768b9baad5a04b7.js"></script><br />
</div>In a nutshell, what happened here is that we <i>insisted</i> 1 + 1 = 3, and <span style="font-family: "courier new" , "courier" , monospace;">assert </span><span style="font-family: inherit;">totally called us on it. What were we thinking???</span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">There's some interesting stuff in that output block. First it tells us which test failed ("the truth"), what file that test lives in (test/stats_yard_test.exs), and the line number that the test definition starts on (:5). After that, it tells us the general type of 'thing' we were trying to do (assert with ==) and shows us the specific assertion that failed.</span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">Next up are two interesting and very helpful lines: lhs and rhs. These acronyms stand for "left-hand side" and "right-hand side" respectively, and these lines actually give us some insight into the way the test actually works under the hood. If you haven't encountered them before, lhs and rhs are hyper-relevant to one of Elixir's most powerful features: <a href="http://elixir-lang.org/getting-started/pattern-matching.html">pattern matching</a>!</span><br />
<span style="font-family: inherit;"><br />
</span> These two lines are telling us that ExUnit took our <span style="font-family: "courier new" , "courier" , monospace;">assert </span><span style="font-family: inherit;">expression and made an attempted pattern match expression out of it, with the actual evaluated value on the left-hand side of the match, and the asserted value on the right-hand side, like so:</span><br />
<div><!-- lets-build-elixir-3_4 --><br />
<script src="https://gist.github.com/cboggs/3cc08b1c31587eedc91ee16ab1d85425.js"></script><br />
</div>In this <span style="font-family: "courier new" , "courier" , monospace;">iex</span> session we can see an example of both of the test attempts we've tried so far - the first one being the successful test, and the second being the intentional failure. Hopefully this provides a bit of clarity around how ExUnit is actually accomplishing this particular test.<br />
<br />
So that's all fine and dandy, but we really should work on testing our <span style="font-family: "courier new" , "courier" , monospace;">TimestampWriter. </span><span style="font-family: inherit;">Go ahead and switch that pesky 3 back to a 2, and we'll get started!</span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">First let's create a directory that will hold our tests - it's cleaner, and seems to be the convention used in most projects. Then we'll create a file in there to hold our first real test case (note that test files have to be named "&lt;stuff&gt;_test.exs", otherwise the </span><span style="font-family: "courier new" , "courier" , monospace;">mix test</span><span style="font-family: inherit;"> task will skip them)</span><span style="font-family: inherit;">:</span><br />
<div><!-- lets-build-elixir-3_5 --><br />
<script src="https://gist.github.com/cboggs/379d9cedbfa94f0142779a07b7723af1.js"></script><br />
</div>In <span style="font-family: "courier new" , "courier" , monospace;">timestamp_writer_test.exs</span><span style="font-family: inherit;"> we'll start out with the bare-bones first increment of our test, BUT we'll try to make it fail first by passing a bad argument to our public API function </span><span style="font-family: "courier new" , "courier" , monospace;">write_timestamp/2</span><span style="font-family: inherit;">:</span><br />
<div><!-- lets-build-elixir-3_6 --><br />
<script src="https://gist.github.com/cboggs/b1a3d3a6c6d45852012181e967c96560.js"></script><br />
</div><span style="font-family: inherit;">(Note that I stopped naming these modules </span><span style="font-family: "courier new" , "courier" , monospace;">StatsYardTest.*</span><span style="font-family: inherit;">; there's no point to it as far as I can tell.)</span><br />
<div><br />
</div>And a quick run to see what's up:<br />
<div><!-- lets-build-elixir-3_7 --><br />
<script src="https://gist.github.com/cboggs/d4ef848a32caafefe1961ee9b7b72740.js"></script><br />
</div>Huh... well that's... um... awesome? Not really. The GenServer process did indeed fail as expected, but the tests still technically passed. What gives?<br />
<br />
As it turns out, we tested our public API function, <span style="font-family: "courier new" , "courier" , monospace;">write_timestamp/2</span><span style="font-family: inherit;">, not so much our GenServer. Our function is simply calling </span><span style="font-family: "courier new" , "courier" , monospace;">GenServer.cast/2</span><span style="font-family: inherit;"> which then <i>asynchronously</i> sends a message to our TimestampWriter process. That <i>send</i> is indeed successful and returns the </span><span style="font-family: "courier new" , "courier" , monospace;">:ok</span><span style="font-family: inherit;"> atom - even though our process dies shortly thereafter - and that's exactly how </span><span style="font-family: "courier new" , "courier" , monospace;">GenServer.cast/2</span> <span style="font-family: inherit;">is intended to operate.</span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">So how do we fix that? Well to be entirely honest, I don't know yet. BUT! There is a silver lining - this experience has made me re-think whether or not this particular activity is best handled as a </span><span style="font-family: "courier new" , "courier" , monospace;">cast</span><span style="font-family: inherit;"> or a </span><span style="font-family: "courier new" , "courier" , monospace;">call</span><span style="font-family: inherit;">, which basically boils down to "should it be asynchronous with no response, or synchronous with a response?" Given the intended purpose of this particular function, I think a </span><span style="font-family: "courier new" , "courier" , monospace;">call</span><span style="font-family: inherit;"> is more appropriate: we're going to need some manner of acknowledgement that our data has indeed been written to disk before moving on to whatever our next task might be.</span><br />
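To make the cast-versus-call distinction concrete, here's a hedged, stripped-down sketch (the module and message shapes are simplified stand-ins, not our actual TimestampWriter) showing the same write handled both ways:

```elixir
defmodule WriterSketch do
  use GenServer

  def start_link do
    GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  end

  def init(state), do: {:ok, state}

  # cast: fire-and-forget - the caller gets :ok back as soon as the message
  # is sent, no matter what happens to the write (or the process) afterward
  def handle_cast({:write, path, tstamp}, state) do
    File.write(path, "#{tstamp}\n")
    {:noreply, state + 1}
  end

  # call: synchronous - the caller blocks until we reply, so it sees the
  # actual result of the write instead of a blind :ok
  def handle_call({:write, path, tstamp}, _from, state) do
    result = File.write(path, "#{tstamp}\n")
    {:reply, result, state + 1}
  end
end
```

With the call flavor, `GenServer.call(WriterSketch, {:write, "/no/such/dir/foo.tstamp", 123})` hands back `{:error, :enoent}` rather than a cheerful `:ok` - which is exactly the acknowledgement we're after.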
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">So! Back to our </span><span style="font-family: "courier new" , "courier" , monospace;">TimestampWriter</span><span style="font-family: inherit;"> code:</span><br />
<div><!-- lets-build-elixir-3_8 --><br />
<script src="https://gist.github.com/cboggs/0d2eb025e6f9855790983a3f5137869c.js"></script><br />
</div>To recap the changes here:<br />
<ul><li>Line 9: Switch from <span style="font-family: "courier new" , "courier" , monospace;">GenServer.cast/2</span> to <span style="font-family: "courier new" , "courier" , monospace;">GenServer.call/2</span></li>
<li><span style="font-family: inherit;">Line 24: Switch from </span><span style="font-family: "courier new" , "courier" , monospace;">handle_cast/2</span><span style="font-family: inherit;"> to </span><span style="font-family: "courier new" , "courier" , monospace;">handle_call/3</span><span style="font-family: inherit;"> and add an (unused) argument, </span><span style="font-family: "courier new" , "courier" , monospace;">_from</span><span style="font-family: inherit;">, identifying the caller (we don't particularly need this right now, hence the underscore to keep the compiler happy)</span></li>
<li><span style="font-family: inherit;">Line 25: Bind the result of our file write operation to </span><span style="font-family: "courier new" , "courier" , monospace;">result</span></li>
<li>Line 26: Use the appropriate response from a <span style="font-family: "courier new" , "courier" , monospace;">call</span><span style="font-family: inherit;">, which is to return the </span><span style="font-family: "courier new" , "courier" , monospace;">:reply</span><span style="font-family: inherit;"> atom, a result of some sort, and the new state of the GenServer</span></li>
</ul><div>Notice a cool thing here, too: our interface to the GenServer didn't change at all, so we don't need to update our test! We should be able to run <span style="font-family: "courier new" , "courier" , monospace;">mix test</span><span style="font-family: inherit;"> and see our test fail appropriately:</span></div><div><!-- lets-build-elixir-3_9 --><br />
<script src="https://gist.github.com/cboggs/fa9ec98eeb5c3c8f4b2cc46ce301f4ca.js"></script>(Note: There will still be some extra output after this as a result of our GenServer tanking on bad inputs. We'll try to fix that another time.)<br />
</div><br />
Perfect! Now if we stop passing a known-bad argument to our public function in the test, we should get a nice passing test:<br />
<div><!-- lets-build-elixir-3_10 --><br />
<script src="https://gist.github.com/cboggs/065afafd530aee388193663098e453ae.js"></script><br />
</div>Success! Whew. That was a bit of a runaround to get a simple test in place, but I learned a lot, so it doesn't seem like a wasted effort to me.<br />
<br />
As a final cleanup step (for now), I'm going to remove the simple truth test from the out-of-the-box test file that mix creates, because I don't really care for the clutter.<br />
<div><br />
</div><h2>Experiment</h2><div>We left a bit of an unpleasant side effect in place with our test. Hint: our timestamp writer spits out a timestamp <i>somewhere</i>. Figure out where it's landing, then peruse the <a href="http://elixir-lang.org/docs/stable/ex_unit/ExUnit.Case.html">ExUnit docs</a> and see if you can figure out how to make that stop happening. No need for that clutter! The next blog post will cover how to fix this.</div><div><br />
</div><h2>Next Time</h2><div>We're not quite doing <a href="https://en.wikipedia.org/wiki/Test-driven_development">TDD</a>, but hey, it's a start! Next time 'round we'll clean all of this up a bit more (as mentioned in the Experiment above) with some <a href="http://elixir-lang.org/getting-started/typespecs-and-behaviours.html">typespecs</a> and <a href="http://elixir-lang.org/docs/v1.2/elixir/writing-documentation.html">docs</a>. Exciting stuff, eh? At the end of the day, it's worlds easier to do these things up front, rather than trying to retrofit them later - plus we can make use of them for some testing convenience (or at least that's the hope!).<br />
<br />
For now, you can peruse the source for this post at: <a href="https://github.com/strofcon/stats-yard/tree/lets-build-3" target="_blank">https://github.com/strofcon/stats-yard/tree/lets-build-3</a></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6854573657784240555.post-8279877379167471252016-03-30T21:42:00.000-05:002016-05-16T21:54:03.966-05:00Let's Build Something: Elixir, Part 2 - Supervising Our GenServerIn <a href="http://tech.strofcon.org/2016/03/lets-build-elixir-part-1.html">Part 1</a> we built a simple GenServer and worked our way through making it write a timestamp to a local file.<br />
<br />
During that process, we ran into a usage issue - we had to explicitly start the GenServer before we could invoke its <span style="font-family: "courier new" , "courier" , monospace;">write_timestamp/2</span><span style="font-family: inherit;"> function. It makes sense, but it's inconvenient and dependent on human interaction. We should (and can!) fix that.</span><br />
<br />
<h2>Automatically Starting Our GenServer</h2><div></div><span style="font-family: inherit;"><br />
</span> The file <span style="font-family: "courier new" , "courier" , monospace;">lib/stats_yard.ex</span><span style="font-family: inherit;"> is the main entry point for our app, so that seems like a good place to add a call to </span><span style="font-family: "courier new" , "courier" , monospace;">StatsYard.TimestampWriter.start_link/0</span><span style="font-family: inherit;"> (ignoring the supervisor bits for now):</span><br />
<div><!-- lets-build-elixir-2_1 --><br />
<script src="https://gist.github.com/cboggs/f81db9a330191d9d521224919b85f01a.js"></script><br />
</div><div>Now to verify that it acts as we'd expect:<br />
<div><!-- lets-build-elixir-2_2 --><br />
<script src="https://gist.github.com/cboggs/be80a78117442739cef9a5569acd16ea.js"></script><br />
</div>Much better! Now let's see what happens when it dies, as most everything on a computer is likely to do at the worst possible time.<br />
<br />
There are a few ways to easily kill a process in Elixir, but I always enjoy throwing a value into a function that it can't handle and watching the poor process get shredded. I'm a jerk like that. Let's pass a tuple via the <span style="font-family: "courier new" , "courier" , monospace;">timestamp</span> argument and see what happens:<br />
<div><!-- lets-build-elixir-2_3 --><br />
<script src="https://gist.github.com/cboggs/d440c570d7e4a9f883d76597637cd6e8.js"></script><br />
</div>Excellent! Our GenServer is dead as a hammer. Soooo now what?<br />
<br />
Well, we can restart it manually in our <span style="font-family: "courier new" , "courier" , monospace;">iex</span> session, but that's a non-starter when you consider your app running unattended. We could also just drop out of our session and then start it back up, but that's still pretty lame.<br />
<br />
Enter supervisors! If we take another glance at the chunk of code we saw above, paying special attention to the bits I said to ignore, we'll see the structure we need to use to start up our GenServer as a supervised process:<br />
<br />
<ul><li>Lines 11-14: A list of your Supervisor's child processes, each of which can be either a regular process or another Supervisor (sub-Supervisor)</li>
<ul><li>Each item in this list will contain the module name for a process and will call that module's <span style="font-family: "courier new" , "courier" , monospace;">start_link</span><span style="font-family: inherit;"> function</span></li>
<li><span style="font-family: inherit;">Notice that the arguments passed to your </span><span style="font-family: "courier new" , "courier" , monospace;">start_link</span><span style="font-family: inherit;"> function are provided here as a list. This is done so that the Supervisor's </span><span style="font-family: "courier new" , "courier" , monospace;">start_link </span>function doesn't have to handle arbitrary argument counts (arities) based on <i>your</i> process's <span style="font-family: "courier new" , "courier" , monospace;">start_link </span>arity</li>
</ul><li>Line 18: A list of Supervisor options. For now, the provided values are more than sufficient, and we'll talk more about what each one means in a later post</li>
<li>Line 19: The <span style="font-family: "courier new" , "courier" , monospace;">start_link</span> call for your Supervisor process, which (shockingly!) looks awfully similar to the call we make for our own GenServer</li>
</ul><div>Now let's move our GenServer startup bits into the Supervisor's <span style="font-family: "courier new" , "courier" , monospace;">children</span><span style="font-family: inherit;"> list, and see what we can make of it:</span></div><div></div><div><!-- lets-build-elixir-2_4 --><br />
<script src="https://gist.github.com/cboggs/df54bf574c1b2ee9ecf77deb2aa0855e.js"></script><br />
</div>And the test:<br />
<div></div><div><!-- lets-build-elixir-2_5 --><br />
<script src="https://gist.github.com/cboggs/1d4bc50f6a9c4d61b186423899eabd1a.js"></script><br />
</div>Woohoo! Our application is now considerably more "self-healing" than it was before. We can start up <span style="font-family: "courier new" , "courier" , monospace;">iex</span><span style="font-family: inherit;">, and immediately start writing timestamps without explicitly starting our GenServer. Then we can crash the GenServer process, and immediately continue on writing timestamps without having to explicitly restart the process. Excellent!</span><br />
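For reference, the whole supervised setup boils down to something like this hedged sketch of lib/stats_yard.ex (using the Elixir 1.2-era Supervisor.Spec helpers this series is built on, and assuming the TimestampWriter module from Part 1 is present):

```elixir
defmodule StatsYard do
  use Application

  def start(_type, _args) do
    import Supervisor.Spec, warn: false

    children = [
      # Our lone child process; the empty list means the Supervisor will
      # invoke StatsYard.TimestampWriter.start_link() with no arguments
      worker(StatsYard.TimestampWriter, [])
    ]

    # :one_for_one - when a child dies, restart only that child
    opts = [strategy: :one_for_one, name: StatsYard.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
```

When the TimestampWriter crashes, the :one_for_one strategy starts a fresh instance under the same registered name, which is why we can keep casting to it without lifting a finger.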
<span style="font-family: inherit;"><br />
</span> <br />
<h2>Experiments</h2><div>If you're the curious sort (and you really should be! It's awesome!), you may want to try and poke around with this a bit more by trying to crash our GenServer several times and seeing if the behavior persists. See if you can find a limit to what we've done here, and see if the docs for <a href="http://elixir-lang.org/docs/stable/elixir/Supervisor.html">Elixir's Supervisors</a> can guide you toward a fix.</div><div><br />
</div><h2>Next Time</h2><div>We've got a nice simple GenServer automated and supervised, but there's no guarantee that our next batch of code changes won't break something in some unforeseen way. We'll poke around with <a href="http://elixir-lang.org/docs/stable/ex_unit/ExUnit.html">ExUnit</a> next time, and see if we can start a basic unit test suite for our project.<br />
<br />
For now, you can peruse the source for this post at: <a href="https://github.com/strofcon/stats-yard/tree/lets-build-2" target="_blank">https://github.com/strofcon/stats-yard/tree/lets-build-2</a></div></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6854573657784240555.post-22194548738685799262016-03-21T22:59:00.001-05:002016-05-16T21:54:03.951-05:00Let's Build Something: Elixir, Part 1 - A Simple GenServerI'm an "ops guy" by trade. That's my main strength, tackling things at all layers of a SaaS stack to fix broken things and automate my way out of painful places. That said, I find that my mind wanders into the realm of, "wouldn't it be cool if I could build &lt;insert cool thing here&gt;?" I like to tinker, and I love solving problems, so I tend to poke around with some development projects on the side for fun and for my own education. I also find that I'm better at my "ops day job" for having understood some of what makes the software development machine tick.<br />
<br />
I've been playing with Elixir a bit lately, and it's incredibly attractive for a lot of reasons that I won't get into here. I learn faster when others peruse my code, and I like sharing what I've learned as I go along... Sooooo, let's build something! <br />
<br />
Erlang (and by extension, Elixir) lends itself very well to highly-concurrent and distributed systems, so I want to build something that leverages that strength and bends my brain in some weird ways. To that end, I want to build a time series database. A very simple one, mind you, but one that provides some measure of useful functionality and takes advantage of the things Elixir, OTP, and BEAM bring to the table.<br />
<br />
This first part will be pretty simple - we'll spin up a new project with <span style="font-family: "courier new" , "courier" , monospace;">mix</span><span style="font-family: inherit;">, define a GenServer that writes the current timestamp to a file, and manually test it out. We'll get into supervision and </span><span style="font-family: "courier new" , "courier" , monospace;">ExUnit</span><span style="font-family: inherit;"> in later posts.</span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;"><b>Note:</b> My bash commands aren't terribly copy-pasta-friendly in these posts, and I'm OK with that. By simply pasting my actual shell output, you can more easily figure out what directory I'm in at any given time. Fortunately we don't spend much time in bash, at least not in the code snippets.</span><br />
<span style="font-family: inherit;"><br />
</span> <br />
<h2>Create a New Project</h2><div>I'm going to call my awesome new world-changing time series database "StatsYard". We'll use <span style="font-family: "courier new" , "courier" , monospace;">mix</span> to get kicked off:</div><!-- lets-build-elixir-1_1 --><br />
<script src="https://gist.github.com/cboggs/f59fc5c57e8dd9a209d692e8c339c40b.js"></script><br />
<div>This lays down the bones we need to start building stuff. (<b>Note:</b> Even though we named our project "stats_yard", the actual namespace will be StatsYard as shown below.)<br />
<br />
<h2>Define Our GenServer</h2><div>Let's create a GenServer that writes the current timestamp plus some message to a file upon request.</div><div><br />
</div><div>Mosey on over to <span style="font-family: "courier new" , "courier" , monospace;">stats_yard/lib </span><span style="font-family: inherit;">and create a new directory with the same name as our project, </span><span style="font-family: "courier new" , "courier" , monospace;">stats_yard</span><span style="font-family: inherit;">. Within that directory, create a file named timestamp_writer.ex :</span></div><div><!-- lets-build-elixir-1_2 --><br />
<script src="https://gist.github.com/cboggs/8f03685d6f108907b11ac23818d2f3dc.js"></script><br />
<div>Crack that bad boy open and let's build a GenServer!</div><br />
<strike>I like to name my files the same as the module they contain wherever possible, but it isn't required - I just find it eases troubleshooting.</strike> The convention I've seen everywhere else is to name your Elixir files after the modules they contain, but in all lower case and with words separated by an underscore ( _ ). Now let's define our module <span style="font-family: "courier new" , "courier" , monospace;">StatsYard.TimestampWriter</span><span style="font-family: inherit;">:</span><br />
<!-- lets-build-elixir-1_3 --><br />
<script src="https://gist.github.com/cboggs/5c9e087ed4af58f9e784694193720840.js"></script><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">Not too much special here if you're generally familiar with GenServer:</span><br />
<ul><li>Line 8: A public function to make it easier to invoke the GenServer's callbacks </li>
<li>Line 12: <span style="font-family: "courier new" , "courier" , monospace;">start_link</span> is the function we'll use to startup our process running the GenServer</li>
<li>Line 20: <span style="font-family: "courier new" , "courier" , monospace;">init</span><span style="font-family: inherit;"> is called by </span><span style="font-family: "courier new" , "courier" , monospace;">start_link </span><span style="font-family: inherit;">and is the way we set the initial state of our GenServer (for now, just the integer 0)</span></li>
<li><span style="font-family: inherit;">Line 24: Our cast callback for actually doing the work. We could have made the </span><span style="font-family: "courier new" , "courier" , monospace;">timestamp</span><span style="font-family: inherit;"> a value that's calculated every time we cast to the GenServer, but for the sake of a TSDB we'll want to be able to accept arbitrary timestamps, not solely "right now" timestamps</span></li>
</ul><div>Let's see if it works! From the top level of our project, we'll hop into <span style="font-family: "courier new" , "courier" , monospace;">iex</span>:<br />
</div><!-- lets-build-elixir-1_4 --><br />
<script src="https://gist.github.com/cboggs/dda01979a0484c1f5a8531ec43936f52.js"></script><br />
<div>Now let's run the appropriate commands and hopefully we'll see a timestamp written to the file we specify:<br />
</div><!-- lets-build-elixir-1_5 --><br />
<script src="https://gist.github.com/cboggs/9b2db398a7eb13909bc188661d06081b.js"></script><br />
<br />
<span style="font-family: inherit;">Oops. We must have missed something - </span><span style="font-family: inherit;"><span style="font-family: "courier new" , "courier" , monospace;">GenServer.do_send/2</span><span style="font-family: inherit;"> is unhappy with our arguments. This particular function accepts two arguments: the Process ID (PID) of the GenServer whose callback you're trying to invoke, and the body of the message you want to send. In our case, our public </span><span style="font-family: "courier new" , "courier" , monospace;">write_timestamp/2</span><span style="font-family: inherit;"> function is actually not calling a private function in our GenServer, but is instead sending a message to the GenServer's PID. That message contains a tuple that the GenServer should pattern match appropriately upon receipt.</span></span><br />
<span style="font-family: inherit;"><span style="font-family: inherit;"><br />
</span></span> <span style="font-family: inherit;"><span style="font-family: inherit;">So, where did we go wrong? The message payload ( </span></span><span style="font-family: "courier new" , "courier" , monospace;">{:write_timestamp, "/tmp/foo.tstamp", 1459134853371}</span> ) <span style="font-family: inherit;">certainly looks correct, so it seems to be the first argument, the GenServer's PID.</span><br />
<span style="font-family: inherit;"><br />
</span> <span style="font-family: inherit;">When you name a GenServer you're effectively mapping an atom to a PID, which allows you to reference the process without having to know its PID in advance. In our case, </span><span style="font-family: "courier new" , "courier" , monospace;">__MODULE__</span><span style="font-family: inherit;"> equates to </span><span style="font-family: "courier new" , "courier" , monospace;">:"StatsYard.TimestampWriter"</span><span style="font-family: inherit;">, which should then map to our PID.... Oh, right! We don't have a PID, because we never started our GenServer process. Easy fix!</span><br />
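That name-to-PID mapping is easy to see for yourself. Here's a hedged sketch using a hypothetical stand-in module (so it's self-contained), but the same applies to our real TimestampWriter:

```elixir
defmodule NamedSketch do
  use GenServer

  # Registering under name: __MODULE__ maps the atom NamedSketch to our PID
  def start_link, do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  def init(state), do: {:ok, state}
end

# Before start_link, the registered name resolves to nothing at all
nil = Process.whereis(NamedSketch)

# start_link/0 registers the new process under the module's name...
{:ok, pid} = NamedSketch.start_link

# ...so the atom now maps to a live PID, and our messages have somewhere to go
^pid = Process.whereis(NamedSketch)
```

`Process.whereis/1` is the standard way to peek at that registry, and it's a handy sanity check whenever a named process seems to have gone missing.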
<br />
<!-- lets-build-elixir-1_6 --><br />
<script src="https://gist.github.com/cboggs/37e575b9aa2c60d98c275adf9ae51851.js"></script><br />
<br />
Now we just need to check our output file and make sure it actually did what it was supposed to:<br />
<!-- lets-build-elixir-1_7 --><br />
<script src="https://gist.github.com/cboggs/100fd0b2d1f5ca9d11a86834b6678ecc.js"></script><br />
<br />
Success!<br />
<br />
<h2>Next Time</h2><div>That's it for now, nothing too interesting just yet. Next time we'll see what happens when our GenServer dies, and what we can do about that.<br />
<br />
For now, you can peruse the source for this post at: <a href="https://github.com/strofcon/stats-yard/tree/lets-build-1" target="_blank">https://github.com/strofcon/stats-yard/tree/lets-build-1</a></div></div></div>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-6854573657784240555.post-51267424113334943532016-03-15T00:36:00.000-05:002016-04-04T23:09:23.987-05:00Sysdig Cloud - Monitoring Made Awesome Part 1: Metrics CollectionJust in <u><a href="http://tech.strofcon.org/2016/03/deploying-private-paas-good-meh-and-aw.html" target="_blank">case</a></u> <u><a href="http://tech.strofcon.org/2015/05/stats-should-be-commodity-introducing.html" target="_blank">you</a></u> <u><a href="http://tech.strofcon.org/2015/03/why-your-young-start-up-needs.html" target="_blank">hadn't</a></u> <u><a href="http://tech.strofcon.org/2015/02/getting-gmond-metrics-into-influxdb.html" target="_blank">noticed</a></u> <u><a href="http://tech.strofcon.org/2014/08/influxdb-and-grafana-scripted-dashboards.html" target="_blank">before</a></u> <u><a href="http://tech.strofcon.org/2013/02/nagios-and-ganglia-checkgangliametricsh.html" target="_blank">now</a></u>, I'm a <i>tiny bit</i> obsessed with ops metrics. Most of my preoccupation with metrics seems to stem from the fact that... well... it's <i>hard</i>. "Metrics," for all that term entails, can be a difficult problem to solve. That might sound a bit odd given that there are a thousand-and-one tools and products available to tackle metrics, but with a sufficiently broad perspective the problem becomes pretty clear.<br />
<br />
I've worked with a lot of different metrics tools while trying to solve lots of different problems, but I always tend to come across the same pain points no matter what tool it is that I'm using. As a result, I'm pretty hesitant to really recommend most of those tools and products.<br />
<br />
Fortunately, that may very well be changing with the advent of Sysdig Cloud. Sysdig has <i>really </i>nailed the collection mechanism and is doing some great work on the storage and visualization fronts. This is the first in a series of posts describing how I think Sysdig is changing the game when it comes to metrics and metrics-based monitoring.<br />
<br />
<b>Disclaimer: </b>I am not currently (or soon to be), employed by Sysdig Cloud, nor am I invested in Sysdig Cloud. I'm just genuinely impressed as an ops guy who's got a thing for metrics. I've been using Sysdig for a few months, and I only plan to brag on features that I've used myself. I'll also point out areas where Sysdig is a bit weak and could improve, because let's face it - no one's perfect. :-)<br />
<div><br />
</div><h4>The Problem(s)</h4><div>There are essentially three stages to the life-cycle of a metric: Collection, Storage, and Visualization. Believe it or not, they're all pretty hard to solve in light of the evolving tech landscape (farewell static architectures!). This post will tackle Collection, and is a bit long since they do it <i>so</i> well.</div><div><br />
</div><h4>When Metrics Collection Gets Painful</h4><div>For metrics to be of any use you need some mechanism by which to extract / catch / proxy / transport them. The frustrating part is that there are actually several layers of collection that need to be considered. If we take the most complex case - a container-based Platform as a Service - three categories of metric should do the trick: host, container, and application. Handling all of these categories well is difficult - I tried with <a href="http://tech.strofcon.org/2015/05/stats-should-be-commodity-introducing.html" target="_blank">Stat Badger</a>, and it was... well... a bit unpleasant.<br />
<br />
Host metrics are generally fairly easy to collect and most collectors grab the same stuff, albeit with varying degrees of efficiency and ease of configuration.<br />
<br />
Container metrics aren't too terribly hard to collect, though there are certainly fewer collectors available. The actual values we care about here are usually a subset of our host metrics (CPU % util, memory used, network traffic, etc) scoped to each individual container. This starts to uncover the need for orchestration metadata <a href="http://tech.strofcon.org/2016/03/deploying-private-paas-good-meh-and-aw.html" target="_blank">when considered within a PaaS environment</a>.<br />
<br />
Application metrics can easily become the bane of your existence. Do we have a way to reliably poll metrics out of the app (for example, JMX)? If so, does our collector handle this well or do we need to shove a sidecar container into our deployments to provide each instance with its own purpose-built collector? If not, do we try to get the developers to emit metrics in some particular format? Should they emit straight to the back-end, or should they go through a proxy? If they push to some common endpoint(s), how best can we configure that endpoint per environment? Then once we've answered these questions, how on earth do we correlate the metrics from a particular application instance with the container it's running in, the host the container is running on, the service the app instance belongs to, the deployment and replication / HA policies associated with that service, and so on and so on...???<br />
<br />
Enter the Sysdig agent.</div><div><br />
</div><h4>How Sysdig Makes Metrics Collection Easy</h4><div>The Sysdig agent is, right out of the gate, uncommon in its ambition in approaching to tackling all layers of the collection problem.<br />
<br />
Most collectors rely exclusively on polling mechanisms to get their data, whether it's by reading <span style="font-family: "courier new" , "courier" , monospace;">/proc</span> data on some interval, hitting an API endpoint to grab stats, or running some basic Linux command whose output they scrape for details. This works, but is generally prone to errors when things get upgraded / tweaked, and can be fairly inefficient.<br />
<br />
Sysdig does have the ability to do some of those things to monitor systems such as HAProxy and whatnot, but that's not its main mechanism. Instead, the Sysdig agent watches the constant stream of system events as they flow through the host's kernel and gleans <i>volumes </i>of information from said events. Pretty much anything that happens on a host, with the exception of apps that run on a VM such as the JVM or BEAM, will result in (or be the result of) a system event that the host kernel handles. This has a couple of huge benefits: it's very low-overhead, and it's immensely hard to hide from the agent. These two core benefits of the base collection mechanism allow for a number of pretty cool features.</div>
<h2>Fine-Grained Per-Process Metrics</h2><div>Watching system events allows the Sysdig agent to avoid having to track the volatile list of PIDs in <span style="font-family: "courier new" , "courier" , monospace;">/proc</span> and traverse that virtual filesystem to get the data you want. All of the relevant information is already present in the system events and this opens the door to some really nifty visualization capabilities.</div><div><br />
</div><div><h2>"Container Native" Collection</h2><div>By inspecting every system event that flows by, there's no middle man in snagging container metrics. No need to hit the Docker <span style="font-family: "courier new" , "courier" , monospace;">/stats</span> endpoint and process its output, no worries about Docker version changes breaking your collection, and ultimately no need to relegate yourself to Docker for your container needs. This also combines beautifully with the fine-grained per-process metrics to give visibility into the processes within your containers in addition to basic container-wide metrics. It's pretty awesome.</div></div><br />
<h2>Automatic Process Detection</h2><div>The above two features combine very nicely to allow the Sysdig agent to automatically detect a wide variety of services by their process attributes, simply by having seen a relevant system event flow past on its way to the host kernel. This allows some amazing convenience in monitoring applications since the agent immediately sees when a recognized process has started up - even when it's inside a container.</div><div><br />
</div><div>For example, if you're running a Kafka container, the Sysdig agent will detect the container starting up, see the JVM process start up, notice that the JVM is exposing port 9092, spin up the custom Sysdig Java agent, inject it into the container's namespace, attach directly to the Kafka JVM process (from<i> within the container</i>, mind you), and start collecting some basic JVM JMX metrics (heap usage, GC stats, etc) along with some Kafka-specific JMX metrics - all for free, and without you needing to intervene at all. That's <i>awesome</i>.</div><div><br />
</div><h2>StatsD Teleport</h2><div>I'm not going to dig into this one here since this is already a lengthy post. Just read <a href="https://sysdig.com/blog/see-statsd-custom-metrics-inside-containers-automagically-with-sysdig-cloud/" target="_blank">this post from Sysdig's blog</a> - and be amazed.</div><div><br />
</div><h2>Orchestration Awareness Baked In</h2><div>Orchestration metadata is 100% crucial to monitoring any PaaS or PaaS-like environment. One simply cannot have any legitimate confidence in their understanding of the health of the services running in their stack without being able to trace where any given instance of a service is running and where it lives in the larger ecosystem. Sysdig seems to have a strong focus on integration with a number of orchestration mechanisms. If you configure your orchestration integration correctly, then any metric collected via <i>any</i> of the paths will automatically be tagged with ALL of that metadata on its way to the Sysdig back-end. Even better? Individual nodes in a cluster - with Kubernetes, for instance - don't need to be configured as cluster members; only the masters need to know. The masters ship orchestration metadata to the back-end, and when metrics come in from a node that belongs to that master's cluster, they're automatically correlated and immediately visible in the appropriate context. Seriously, that's been making me a VERY happy metrics geek.</div><div><br />
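To make "tagged with ALL of that metadata" concrete, here's an illustrative sketch of what the correlation amounts to. All field names here are hypothetical - this is not Sysdig's actual schema or wire format:

```python
# Illustrative only: field names are hypothetical, not Sysdig's actual schema.
raw_metric = {"host": "node-07", "name": "cpu.used.percent", "value": 42.5}

# Metadata the cluster masters shipped to the back-end separately:
orchestration_metadata = {
    "kubernetes.namespace": "payments",
    "kubernetes.deployment": "api-gateway",
    "kubernetes.pod": "api-gateway-5d9f7c-x2kq8",
}

# After correlation, the stored metric carries both - so it can be charted
# per-namespace, per-deployment, etc., without the node knowing any of this.
enriched = {**raw_metric, "tags": orchestration_metadata}
```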
</div><h4>To Be Continued (Later)</h4><div>At some point after this series expounding the ways Sysdig is trying to tackle metrics and monitoring the right way (in my opinion), I'll get around to posting some more technical how-to pieces showing how best to make use of some of these features.</div><div><br />
</div><div>For now, happy hacking!</div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6854573657784240555.post-19275248565499168052016-03-07T01:02:00.000-06:002016-04-04T23:09:33.776-05:00Deploying a Private PaaS: The Good, the Meh, and the Aw CrapMoving your organization's dev and prod deployment architecture to a <a href="https://en.wikipedia.org/wiki/Platform_as_a_service" target="_blank">PaaS</a> model can bring a lot of benefits, and a good chunk of these benefits can be realized very quickly by leaning on public PaaS providers such as RedHat's <a href="https://www.openshift.com/" target="_blank">OpenShift Online</a>, Amazon's <a href="https://aws.amazon.com/ecs/" target="_blank">ECS</a>, or Google's <a href="https://cloud.google.com/container-engine/" target="_blank">Container Engine</a>. Sometimes though, a public PaaS offering isn't a viable option, possibly due to technical limitations, security concerns, or enterprise policies.<br />
<br />
Enter the private PaaS! There are a number of options here, all of which offer varying degrees of feature abundance, technical capability, baked-in security goodies, operational viability, and other core considerations. As with anything in the tech world, the only clear winner among the various tools is the one that best fits your environment and your needs. No matter what you choose, however, there are some key things to consider during evaluation and implementation that span the PaaS field independent of any particular tool.<br />
<br />
Let's walk through some of the ups and downs of deploying a private PaaS. Please keep in mind that this post isn't about "Private PaaS vs. Public PaaS". Instead, it assumes you're in a situation where public cloud offerings aren't a viable option, so it's more about "Private PaaS vs. No PaaS."<br />
<h4><br />
</h4><h4>The Good</h4><div>A full listing of all the good things a PaaS can bring would be pretty lengthy, so I'll focus on what I think are the most high-value benefits compared to a "legacy" deployment paradigm.</div><h3><br />
</h3><h2>Increased Plasticity and Reduced Mean Time to Recovery</h2><div>Plasticity: the degree to which workloads can be moved around a pool of physical resources with minimal elapsed time and resource consumption</div><div>Mean Time to Recovery (MTTR): the time taken to recover from a failure, particularly failure of hardware resources or bad configurations resulting in a set of inoperable hosts (at least in the context of this post)</div><div><br />
</div><div>Legacy architectures typically see one application instance (sometimes more, but it's rare) residing on a single (generally static) host. Even with some manner of private IaaS in place, you're still deploying each app instance to a full-blown VM running its own OS. This makes it difficult to attain any reasonable plasticity in your production environments due to the time and resources needed to migrate a compute workload, which in turn forces MTTR upward as you scale up.</div><div><br />
</div><div>PaaS workloads generally eschew virtualization in favor of containerization, which can dramatically reduce the time and resources needed to spin up and tear down application instances. This allows for greatly increased plasticity, which consequently drags MTTR down and helps keep it reasonably low as you scale up.</div><h3><br />
</h3><h2>Reduced Configuration Management Surface Area</h2><div>Configuration management is not only sufficient for handling large swaths of your core infrastructure, but has effectively become necessary for such. That said, there's a lot of value in reducing the surface area touched by whatever config management tool(s) you're using. A particularly unhealthy pattern that some organizations find themselves following is one in which every application instance host (virtual or not) is "config managed." In the case of bare metal hosting this makes good sense, but in the event that you're deploying to VMs... it's no good. At all. </div><div><br />
</div><div>Reducing this surface area can act as a major painkiller by requiring much simpler and more consistently applied configuration management... er... configurations. With a PaaS, you only need to automate configuration for the hosts that run the PaaS (and hosts that run non-PaaS-worthy workloads), not the individual application instance containers. This makes life suck a lot less.</div><h3><br />
</h3><h2>Consistency and Control</h2><div>No one likes an overzealous gatekeeper - they're generally considered antithetical to continuous integration and deployment. On the other hand, an environment that isn't in a position to deploy to a public cloud platform is also very unlikely to be in a position to live without some fairly high level of control over its code deployments. The key here is automated gate-keeping, and a PaaS gives you a decent path to accomplishing this.</div><div><br />
</div><div>Running a private PaaS for both staging and production environments gives you ample opportunity to funnel all code deploys through a consistent pipeline that has sufficient (and ideally minimal-friction) automated controls in place to protect production. This allows Infrastructure Engineers to provide a well-defined and mutually-acceptable contract for getting things into the staging PaaS, and consequently provides a consistent and high-confidence path to production - all without necessitating manual intervention by us pesky humans. Basically, if the developer's code and enveloping container are up to snuff by the staging environment's standards, they're clear to push on to production.</div><div><br />
</div><div>Regarding consistency, developers also gain confidence in their ability to deploy services into an existing ecosystem by being able to rely on a staging environment that mirrors production with far fewer variables than might otherwise have been present in a legacy architecture. Dependencies should all be continually deployed to staging, and thus assurance of API contracts should be much closer to trivial than not.<br />
<br />
<h2>Making DevOps a Real Thing</h2><div>Everyone's throwing around DevOps these days and it's exhausting, I know - but I'm still going to shamelessly throw my definition on the pile.</div><div><br />
</div><div>My current take on DevOps is that an engineering organization most likely contains ops folks who are brilliant within the realm of infrastructure, and developer folks who are brilliant within the realm of application code. If you consider a Venn diagram with two circles - one for infrastructure and one for code - most organizations are likely to see those circles sitting miles apart from one another, or overlapped so heavily as to be indecent. The former diagram could be called "the DevOps desert," and the latter "choked with DevOps". Neither of these are particularly attractive to me.</div><div><br />
</div><div>A well-devised and even-better-implemented PaaS has the potential to adjust that diagram such that the two circles overlap, but only narrowly. Ops people focus hard on infrastructure, and dev people focus hard on code, with a narrow and remarkably well-defined interface between the two disciplines. There's still plenty of room for collaboration and mutually-beneficial contracts, but dev doesn't have to muddy their waters with infrastructure concerns, and ops doesn't have to muddy their waters with code concerns. I think that could be a beautiful thing.</div><div><br />
</div><h4>The Meh</h4></div><div>There aren't a ton of objectively "bad" things about running a PaaS vs. a legacy architecture, but there are a few necessary unpleasantries to consider.</div><div><br />
</div><h2>Introducing Additional Complexity</h2><div>A PaaS is an inherently complex beast. There are mitigating factors here and the complexity is generally worth it, but it's additional complexity all the same. Adding moving parts to any system is, by default, a risky move. But in today's market and at today's scale, complexity is a manageable necessity - provided that you choose your PaaS wisely and hire truly intelligent engineers over those who can throw out the most buzzwords.</div><div><br />
</div><h2>Network configurations</h2><div>Containers need to be able to route to one another across hosts in addition to being able to reach (and be reachable by) other off-PaaS systems. Most PaaS products handle this or provide good patterns for it, but it still introduces a new layer of networking to consider during troubleshooting, automation, and optimization.</div><div><br />
</div><h2>Maintenance and Upgrades</h2><div>If you've built your PaaS and its integration points well, you'll end up with a compelling deployment target on which lots and lots of folks are going to run their code. This can make it tricky to handle host patching and upgrades without impacting service availability. That staging environment (and maybe even a pre-staging environment solely used for infrastructure automation tests) becomes very important here.</div><div><br />
</div><h2>Data Stores</h2><div>Anything that needs local persistent storage (think databases and durable message buses) is unlikely to be a good candidate for a PaaS workload. It can be made to work in some cases, but unless you're a big fan of your SAN, you're likely best to keep these kinds of things off the PaaS. Even if you can make it work, I'm not yet convinced that there is much value in having such a workload reside in such a fluid environment.</div><div><br />
</div><h2>Capacity Planning and Resource Limits</h2><div>Container explosions can bite you very quickly, most likely due to a PaaS' self-healing mechanism glitching out. This is a complex component and you're guaranteed to find really... um... "special"... ways of blowing out your resource capacity until you get a good pattern figured out. A clear pattern for determining and enforcing resource limits is going to be incredibly helpful.</div><div><br />
</div><div>Capacity planning also demands relevant data, which means you'll need some solid metrics, and that brings us to...<br />
<div><br />
</div><h4>The "Aw, Crap..."</h4><div>There are a couple of problems that a PaaS introduces which will, at some point, demand your attention - and punish you harshly should you neglect them.</div><div><br />
</div><h2>Single-Point-of-Failure Avoidance</h2><div>SPoFs are anathema to high availability. If you're going to introduce something like a PaaS, you would do well to think very hard about every piece of the system and how its sudden absence might impact the larger platform. Does the cluster management API need to be highly available? Then you'll need to stick a load balancer in front of those nodes. But what happens if the LB goes down?</div><div><br />
</div><div>Or what if you're running your PaaS nodes as VMs? That's fine, until you factor in hypervisor affinities - you can't afford to have too many PaaS nodes residing on a single hypervisor. Even if you account for node -> hypervisor affinities, can you ensure that your PaaS won't shovel a large portion of a service's containers onto a small subset of nodes that <i>do</i> reside on the same hypervisor? The extra layer of abstraction away from bare metal here is likely to introduce new SLA-impacting failure scenarios that you may not have considered.</div><div><br />
</div><div>You're very likely to be able to mitigate any SPoF you come across, but they're likely to be slightly different in some cases than what you've handled before, and it's worth considering how to avoid them up front.</div><div><br />
</div><h2>Metrics and Monitoring</h2><div>Disclaimer: I obsess over metrics. It's almost unhealthy, really, but hey - it works for me. So it's not too surprising that introducing a PaaS makes me <b>very</b> curious about how one can go about gathering relevant metrics from the various layers involved. There are effectively three layers or classes of metrics you need to concern yourself with in a typical PaaS: host, container, and application.</div><div><br />
</div><div>Host metrics are just as easy as they've ever been, so that's not of much concern. Collection and storage of metrics for semi-static hosts is largely a solved problem.</div><div><br />
</div><div>Container metrics (things like CPU, memory, network, etc. used per container) aren't too terribly difficult to collect, though storage and effective visualization can be more difficult due to the transient nature of containers - particularly when those containers are orchestrated across multiple hosts.</div><div><br />
</div><div>Application metrics (metrics exposed by the actual application process within each container) are potentially a real bear of a problem. Polling metrics out of each instance from a central collection tool isn't too attractive since it's fairly difficult to track transient instances such as those that reside in a PaaS. On the flip side, having each application instance emit metrics to some central collector or proxy is feasible, but storage and visualization are still difficult since you'll inevitably need to see those metrics in the context of orchestration metadata that is unlikely to be available to your application process.</div></div><div><br />
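As a concrete example of the "emit to a central collector" path mentioned above, here's a minimal StatsD-style emitter. The name:value|type datagram format is standard StatsD, though the collector address below is just a placeholder:

```python
import socket

def emit_statsd(name, value, metric_type="c", host="127.0.0.1", port=8125):
    # StatsD's wire format is a single "name:value|type" datagram;
    # "c" = counter, "g" = gauge, "ms" = timer. Fire-and-forget over UDP,
    # so a down collector never blocks the application.
    payload = f"{name}:{value}|{metric_type}".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload  # returned so callers can inspect what was sent
```

Note that this solves only the transport half of the problem - the datagram carries no orchestration context, which is exactly the gap described above.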
</div><div>These are not entirely insurmountable problems, but there are definitely not many viable products currently available that have a good grasp on how best to present the data generated within a PaaS. In fact, the only "out of the box" solution I've found so far that handles this well is <a href="https://sysdig.com/landing-page/" target="_blank">Sysdig Cloud</a>. These folks are onto something pretty awesome, and I plan to elaborate on that in my next post.</div><div><br />
</div><h2>Most People Are Doing Containers All Wrong</h2><div>Containers should run a single process wherever possible, and should be treated as entirely immutable application instances. There should be no need to log into a container and run commands once it's running on the production PaaS. Images should be built layer upon layer, where each layer has been "blessed" by the security and compliance gods. Ignoring these principles is worthy of severe punishment in the PaaS world, and your PaaS will be all too happy to dish out said punishment. Don't go in unarmed.</div><div><br />
</div><div><br />
</div><h4>Summary</h4><div>Implementing a PaaS in your environment is most likely well worth the effort and risk involved, just so that your organization can push forward into a more scalable and modern architecture. While there are certainly a few potential "gotchas," there's a lot to be gained by going into this kind of project with a level head and enough information to make wise decisions. Just keep in mind that this landscape is changing pretty rapidly, and moving to a PaaS is certainly not a decision to be made lightly - but it's still probably the <i>right</i> decision.</div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6854573657784240555.post-25669682267276324112015-05-05T00:03:00.000-05:002015-05-05T23:00:46.442-05:00Stats Should Be A Commodity: Introducing Stat BadgerMy favorite part of any job is finding the resident data addict(s) and learning what their favorite visualizations are (and what in the world they mean). If you want to be told some incredible stories about a company's tech stack, find the people who live and breathe stats and logs. They'll blow your freaking mind, I promise.<br />
<br />
Needless to say, I'm a bit obsessed with stats myself. Without a nigh-unmanageable volume of data points with which I can paint pictures of what's going on in my stack, I start to feel like a kid at Christmas with a mountain of presents and no name tags. That said, I'm not picky about how I get those stats or where they're stored. I have some opinions, sure, but if I can make use of what I've got and get what I need, I'm fine.<br />
<br />
The problem is that all the tools that exist to <i>collect</i> those stats are extremely opinionated. In fact, almost every stats collection tool out there today:<br />
<ul>
<li>comes as part of a larger metrics aggregation and analysis suite / platform</li>
<li>comes without the bits necessary to grab basic system stats (cpu % util, disk and network IO, mem util, etc)</li>
<li>is not easily extensible to gather custom "non-system" stats (redis, JMX, apache, you name it)</li>
<li>is a nightmare to build / install / configure</li>
<li>makes heavy-weight assumptions about where (and in what format) stats are going to be shipped</li>
<li>does all of the above</li>
</ul>
<br />
This becomes a hindrance when you try to provide multiple teams in an organization with the flexibility to manage their tools and data in their own way. Teams are left with a few unpleasant choices:<br />
<ul>
<li>conform to whatever stats collection tool the <a href="http://tech.strofcon.org/2015/03/why-your-young-start-up-needs.html" target="_blank">InfraNerds</a> are using</li>
<li>use whatever stats collection tool they want and:</li>
<ul>
<li>leave the InfraNerds blind (read: "induce much pain")</li>
<li>stick the InfraNerds with the task of integrating multiple tools</li>
</ul>
<li>deploy multiple stats collection tools to feed the backend they like in addition to the one used by the InfraNerds</li>
</ul>
It also becomes a major effort to swap out stats backends, should the need arise, as you have to figure out what the new collection tool can or can't do and how best to configure it.<br />
<br />
This sucks. A lot.<br />
<br />
So, Stat Badger is my attempt to avoid such issues entirely. It imposes zero opinions on your stats and their destinations. The core loop gathers no stats on its own, and defines no outputs on its own. Instead, those decisions are made by the user, by way of defining one or more Modules and Emitters.<br />
<br />
The basic philosophy of Stat Badger is that stats should be a commodity - a raw "material" that can be extracted in large volumes and sent to any place a consumer might want it - intended to be manipulated, refined, and produced by any number of processes into myriad useful products and services.<br />
<br />
Now, even though Stat Badger makes no assumptions about your stats, it does ship with a standard set of modules and emitters to get you started. Specifically, it ships with modules to gather basic detailed system stats (cpu, memory, network, disk, load, and per-process memory / cpu... so far), and emitters to spit stats out to a number of back-ends (InfluxDB 0.8, Graphite, Kafka, stdout... so far).<br />
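To illustrate the module/emitter split, here's a toy sketch of how a core loop, one module, and one emitter might fit together. The class and method names here are purely illustrative - see the repository for Stat Badger's actual plugin interface, which may differ:

```python
import json

class LoadModule:
    """Toy module: gathers a single stat (1-minute load average, Linux)."""
    def collect(self):
        with open("/proc/loadavg") as f:
            return {"load.1min": float(f.read().split()[0])}

class StdoutEmitter:
    """Toy emitter: ships a batch of stats to stdout as JSON."""
    def emit(self, payload):
        line = json.dumps(payload, sort_keys=True)
        print(line)
        return line

def poll_once(modules, emitters):
    # The core loop's only job: run every module, then fan the combined
    # payload out to every emitter. It knows nothing about stats or back-ends.
    payload = {}
    for module in modules:
        payload.update(module.collect())
    for emitter in emitters:
        emitter.emit(payload)
    return payload
```

The point of the design is visible even in the toy version: adding a new back-end means writing one small emitter class, not touching the collection side at all.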
<br />
Getting started should be as simple as:<br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">git clone <a href="https://github.com/cboggs/stat-badger">https://github.com/cboggs/stat-badger</a></span><br />
<span style="font-family: Courier New, Courier, monospace;">cd stat-badger</span><br />
<span style="font-family: Courier New, Courier, monospace;">python badger_core.py -f config.json</span><br />
<br />
<span style="font-family: Times, Times New Roman, serif;">This should start up a foreground process that spits pretty-fied JSON to stdout. For more interesting experiments, edit config.json and add to the list of emitters to ship your data to InfluxDB, Graphite, Kafka, or all of the above - all at the same time.</span><br />
<span style="font-family: Times, Times New Roman, serif;"><br /></span>
<span style="font-family: Times, Times New Roman, serif;">Stat Badger is not yet a fully-polished product. It's lacking the required wiring for tests (I'm not a dev by trade, so I've been cheating so far). It's got a few limitations (addressed in the "More to Come" section of the readme in Github). It also needs more modules and emitters added to make it as universal as I envision. </span><br />
<span style="font-family: Times, Times New Roman, serif;"><br /></span><span style="font-family: Times, Times New Roman, serif;">All that said, I hope you like Stat Badger, and I hope even more so that you'll contribute and help make it a strong, solid tool for stats gathering!</span>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-6854573657784240555.post-8996901019175187162015-02-18T22:02:00.000-06:002016-04-18T22:02:47.722-05:00Why Your Early Start-Up Needs an Infrastructure EngineerMost early stage start-ups dive straight into hiring as many top-notch Software Engineers as they can get their hands on, and then watch the river of code that flows from their fingertips. What they might not realize is that their engineering teams are missing an important component - an Infrastructure Engineer! To elaborate...<br />
<br />
<h4>
The Situation</h4>
<div>
Your tech start-up has a marginally functional prototype of your latest brainchild, and it just helped you land your first major round of funding. Whoo!!! Now you're on the hunt for scary good talent to start banging out code so as to commence the Journey to Alpha. It's a good day.</div>
<div>
<br /></div>
<h4>
The Need</h4>
<div>
Rockstar 10x Software Engineers are an absolute must-have. You are, after all, in the business of writing software - who better to hire than, ya know, the people who write software? These fine folks are going to translate your vision from pure thought to actual, usable stuff that you can sell to customers. So, that's your recruitment strategy: go for the super devs and call it a day.</div>
<div>
<br /></div>
<h4>
The <i>Actual</i> Need</h4>
<div>
A group of engineers who can collaboratively hammer out a product that is worthy of putting in front of your potential customers is the real goal here - not simply gathering a collection of awesome Software Engineers. Building out your team early on with nothing but SE's tends to leave you lopsided, weighted too heavily toward the "crank out product code" side. To balance out this early engineering collective, you really want to include an Infrastructure Engineer, which I'll fondly refer to as an InfraNerd from here on out (because, honestly, no one wants to be called an "IE").</div>
<div>
<br /></div>
<h4>
What Usually Happens</h4>
<div>
Your start-up ramps up to a team of 5-10 SE's, and they get busy building out your product to (mostly) the spec they're handed. Here are a few generalizations about how things might operate in the early stages of product development:</div>
<div>
<ul>
<li>Each SE is likely using the toolchain and build routine that has always treated them well in the past</li>
<li>Some write tests to cover a respectable percentage of their code</li>
<li>The rest write somewhere between 0 and ∅ tests</li>
<li>More SE's come on board</li>
<li>Everyone pushes to a dev branch, which is merged to master inconsistently at best</li>
<li>Master breaks. Every. Damn. Time.</li>
<li>Moar SE's!</li>
<li>Codebase begins to win recognition as the best <a href="https://en.wikipedia.org/wiki/Spaghetti#/media/File:Spaghettata.JPG" target="_blank">Italian food</a> in town</li>
<li>Alpha (and possibly Beta) products fall flat on their face when deployed for customer POC / demo</li>
<li>Massive architecture rework is considered, then immediately ignored in the interest of <a href="http://whatspinksthinks.com/2013/11/04/get-shit-done-the-worst-startup-culture-ever/" target="_blank">Getting Shit Done</a></li>
<li>You realize that the product isn't even <i>remotely</i> Production-ready and begin attempts to recruit the mythical "DevOps Engineer"</li>
<li>After determining that "DevOps Engineers" are mostly made up of dreams and disappointment, you end up with an Ops Person (who may or may not be an InfraNerd)</li>
</ul>
<div>
The nastiest pain point of "the usual way" is generally when the Ops Person is given the unenviable task of "fixing" all the things. It's all too easy for ill will to be felt toward the person who is coming in and telling a sizable group of intelligent, competent, experienced SE's that they need to stop doing X and start doing Y. Right now. Because their stuff is BUSTED. </div>
<div>
<br /></div>
<div>
This is generally a bad time for everyone involved.<br />
<br /></div>
</div>
<div>
<h4>
How an InfraNerd Changes the Equation</h4>
</div>
<div>
Your friendly neighborhood InfraNerd can, given the appropriate care and feeding, avoid a lot of the heartache-inducing moments enumerated above. Bringing an InfraNerd on-board early in the life of your start-up means you get to deal with an engineering team moving faster than the founders can keep up, rather than trying to un-break the world.</div>
<div>
<br />
Here's the straight dope: your SE's are immensely smart folks, and they can work freaking <i>magic</i> at the keyboard pouring forth rivers of feature-packed and shockingly cool software. They casually tweak algorithms that the masses only speak of in hushed reverent tones, they grok unnervingly complex systems that most people couldn't navigate with a GPS and a tour guide, and they build amazing things out of thin air. They know their world forward and backward. But they (usually) don't know the world that sits just below theirs.<br />
<br />
Your InfraNerd, however, will know that shadowy world intimately, and they'll draw on that knowledge to augment your engineering team such that it will be able to sustain a much higher level of Awesome. It's not that an InfraNerd can do things that a SE can't grasp, but rather that they think of solutions that your SE's are not accustomed to thinking of (and often don't have time for).<br />
<br />
<h4>
What an InfraNerd Brings to the Table</h4>
</div>
<div>
Without an InfraNerd, your SE's will very likely have something like Jenkins in place to handle some ad-hoc test runs and the like, but I can almost promise it will be broken and underutilized. After all, who has time to fuss with automating the build pipeline? Answer: your InfraNerd does! Their purpose in life is to automate the world, and your team's velocity will ramp up quickly because of it.</div>
<div>
<br /></div>
<div>
An InfraNerd is going to bring in tools of which your SE's may not even be aware, all for the sake of automating the world. They'll help avoid the nightmare of shady shell scripts being used for deployment by calling on one or more of the many available Configuration Management & Orchestration tools (Ansible, Fabric, Chef, Capistrano, Puppet, SaltStack, etc). There's a good chance your SE's know about such things, but have never really seen a need to use them in their day-to-day workflow. There's a 100% chance your SE's will fall in love with such things when they see how easily they can deploy their code changes to arbitrary environments with minimum fuss.</div>
<div>
<br /></div>
<div>
Your InfraNerd will do their damnedest to reduce the amount of code your SE's have to write and maintain. SE's tend to be inventors by nature, and with that nature comes the tendency to see a wheel-shaped hole and immediately set out to find a suitable chunk of rock from which to sculpt a mighty fine wheel. They know the problem and they're capable of building what's needed to fix it, so they set out to do so. Usually the problem that needs solving has already been solved by some existing tools, and your InfraNerd will jump at the chance to use those tools to reduce the team's impromptu Wheel Re-Invention Drills. The real beauty here is that most of the tools an InfraNerd ushers in will quite often open the door to some really cool enhancements and features that might never have crossed anyone's mind up to now.<br />
<br />
In the midst of all their other work, your InfraNerd will also weave in tasks to build up what is undeniably the single most important part of your systems: infrastructure you can use to measure <i style="font-weight: bold;">everything</i>. They're going to find every logfile your SE's have ever dumped anywhere and get them automatically ingested into a real-time, centralized, and searchable format by way of a service like ELK or Splunk. If there's a metric that can be extracted from anything, they're going to find a way to get to it and pump it into a time series data store like InfluxDB or Graphite. Then they'll take all those data and craft dashboards to display them in all their unapologetic and illuminating glory with something like Grafana. They'll use all of this to tell you stories about your product that you'd never imagined possible. It's kinda great.<br />
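Pumping a metric into something like Graphite (or a time series store that speaks its protocol) really is that approachable - the plaintext protocol is a single "path value timestamp" line over TCP. A minimal sketch, where the Graphite hostname is a placeholder for your own:

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    # Graphite's plaintext protocol: one newline-terminated
    # "path value timestamp" line per metric.
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"

def send_to_graphite(path, value, host="graphite.example.internal", port=2003):
    # Port 2003 is Graphite's conventional plaintext listener.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value).encode())
```

A protocol this simple is a big part of why so many tools can feed the same metrics back-end.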
<br />
<h4>
The Wrap-Up</h4>
<div>
In a young start-up looking to hire on Engineer #10, it's going to feel pretty wrong to fill that seat with anyone other than a Software Engineer. Chances are that it's going to feel wrong anytime before #50, to be honest. You should consider, however, that as expensive as that #10 slot might seem, filling it with an InfraNerd can really be a game-changer. You'll end up with an engineering team that turns out a better product in less time and likely doesn't need to rebuild everything from the ground up before you can ship Beta. You'll be faster to market with a superior product and happy SE's - and that slot will suddenly seem so cheap that you'll want 2 or 3 more InfraNerds before you know it.</div>
<br />
So now you need to find some InfraNerds and get them on the payroll! How do you find them, and what should they be doing once they're on board? Tune in next time!</div>
Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-6854573657784240555.post-35111758507944031852015-02-03T16:31:00.000-06:002015-02-03T16:47:15.740-06:00Getting gmond metrics into InfluxDBI'm a huge fan of InfluxDB + Grafana. <a href="http://influxdb.com/" target="_blank">InfluxDB</a> is on a good path toward making metrics not suck, and <a href="http://grafana.org/" target="_blank">Grafana</a> has a great vision for interactive and exceedingly scriptable dashboards.<br />
<br />
One problem I've run into recently is that while I can get all kinds of cool custom metrics into InfluxDB without much struggle, I don't have a tool that will spit out nice system metrics into InfluxDB.<br />
<br />
My favorite tool for getting useful system metrics so far has been Ganglia's gmond. It gives some decent disk and memory data, but my favorite bit is the built-in CPU % utilization and network bytes in/out metrics. You'd be surprised how few tools actually give that information out-of-the-box.<br />
<br />
However, the back-end I don't really want to use for gmond is Ganglia. What it does, it does exceedingly well. What I <i>want</i> it to do, it does terribly. So I either needed to adjust my expectations or find some way to get gmond metrics into InfluxDB. Turns out there are tools available to get data from most other monitoring and metrics tools into InfluxDB (collectd, statsd, fluentd, Graphite, etc.), but nothing that works with gmond.<br />
<br />
Now, you can make gmetad output data in Graphite format and point it at InfluxDB's Graphite input plugin. I don't much care for that approach, for a few reasons:<br />
<br />
<ul>
<li>it forces an awkward (and generally performance-hindering) data layout in InfluxDB</li>
<li>it requires you to run gmetad, which feels a bit heavy when it would just act as a proxy</li>
<li>it introduces more layers when I'd much rather simplify things</li>
</ul>
<br />
I'd much rather have some simple tool that polls gmond and puts the metrics into InfluxDB in a sane way.<br />
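For the curious, polling gmond is pleasantly simple: connect to its TCP port (8649 by default) and it dumps its entire metric tree as XML, no request needed. Below is a minimal sketch of that half of such a bridge - purely illustrative, far simpler than any real tool, and assuming an InfluxDB version that accepts line-protocol writes on its /write endpoint:

```python
import socket
import xml.etree.ElementTree as ET

def fetch_gmond_xml(host="localhost", port=8649):
    """gmond dumps its whole metric tree as XML to anyone who connects."""
    with socket.create_connection((host, port), timeout=10) as sock:
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

def to_line_protocol(xml_text):
    """Flatten gmond's HOST/METRIC tree into InfluxDB line-protocol points."""
    lines = []
    for host in ET.fromstring(xml_text).iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("TYPE") == "string":
                continue  # only numeric metrics map cleanly onto a value field
            lines.append("{0},host={1} value={2}".format(
                metric.get("NAME"), host.get("NAME"), float(metric.get("VAL"))))
    return lines
```

A loop or cron job could then POST the joined lines to something like http://influxdb:8086/write?db=system_metrics. The real bridge handles tagging, batching, and failure cases far better than this sketch does.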
<br />
So, I built one. Get it at: <a href="https://github.com/cboggs/gmond-influxdb-bridge">https://github.com/cboggs/gmond-influxdb-bridge</a><br />
<br />
It's not <i>quite</i> where I want it to be yet, but it should be of some use right now. I've listed some enhancements I want to make in the Readme.<br />
<br />
My thinking behind this tool (and any others I might build) is that, as an infrastructure tool, it should be easy to get started with, easy to automate once you've got the hang of it, and easy to forget about once you've handed it off to your config management tools.<br />
<br />
To that end, if any one of those qualities is not present, consider it a bug and file an issue on Github - or send me a pull request, I love those things.<br />
<br />
Happy graphing!Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-6854573657784240555.post-20059006014502795872015-01-16T00:07:00.003-06:002015-01-16T00:07:42.308-06:00Omnibus Cache ExploitationI've been playing with Omnibus a bit lately, in hopes of making our product <a href="http://tech.strofcon.org/2015/01/omnibus-for-offline-installs.html" target="_blank">installable in locations where internet access is not an option</a>. We have a decent list of dependencies that we install at deploy time, and we need some consistent way to get those items put in place when public repos aren't available.<br />
<br />
If you've never seen <a href="https://github.com/opscode/omnibus" target="_blank">Omnibus</a> before, it's pretty great (though there is a bit of a learning curve).<br />
<br />
Expectedly, our dependency list gets hairy when we start pulling in the dependencies of our dependencies. That's not so bad, by itself. What sucks is that any change to your project definition (omnibus-my-project/config/projects/my-project.rb) - even whitespace - causes Omnibus to invoke its HopelesslyPessimistic mechanism, which in turn invalidates your software cache faster than you can say, "but it's just a pip install!".<br />
<br />
This is done for perfectly rational reasons, sure, but it's annoying when you're working in a container whose only purpose in life is to run Omnibus builds, and your build times start to creep into the tens of minutes.<br />
<br />
Fortunately you can route around this pessimism while you're getting your dependency list ironed out by adding your dependencies to a dependency. To elaborate...<br />
<br />
You normally add your run-time dependencies to your project definition (my-project.rb), but you can also add those dependencies to another software definition that you depend on in omnibus-my-project/config/software.<br />
<br />
So for example, instead of:<br />
<script src="https://gist.github.com/cboggs/cd6581ebdccf9c1da3f5.js"></script><br />
You'd have a more minimal definition:<br />
<script src="https://gist.github.com/cboggs/1e45e342c89a905d516e.js"></script><br />
Then your pycrypto definition might look something like this:<br />
<script src="https://gist.github.com/cboggs/d2aec5fd5cc8df192c94.js"></script><br />
Now you can add dependencies to the bottom of pycrypto.rb and take full advantage of the Omnibus software cache! Saves a lifetime of waiting on builds to complete.<br />
<br />
NOTE: You really shouldn't leave your project like this, it'd be terribly irresponsible of you, and then you'll say that I told you to do it, and I'll catch all kinds of grief for it. So don't do that.<br />
<br />
When you're done adding all your dependencies, you can simply shovel all the <span style="font-family: Courier New, Courier, monospace;">dependency "&lt;name&gt;"</span> lines out of (in this example) pycrypto.rb back into my-project.rb, and kick off one more (very long) build to get a sane final package.<br />
<br />
Happy packaging!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6854573657784240555.post-86700999816013727812015-01-15T23:55:00.001-06:002015-01-16T00:09:37.422-06:00Omnibus for Offline InstallsSoftware repos are wonderful things, man.<br />
<br />
Until your servers are offline, of course... that's no good.<br />
<br />
Or if they're online, but not allowed to update but once a year.<br />
<br />
Or worse yet, they only get whatever updates are considered mission-critical by IT, and there's no way in hell you're going to get the right versions of your dependencies installed at deploy time.<br />
<br />
At the end of the day, trying to get anything installed at a customer site that has externally-hosted dependencies is almost guaranteed to be more of a pain than it's really worth.<br />
<br />
So what can be done? Well, a few ideas come to mind...<br />
<br />
<ul>
<li>Tar up your top-level dependencies, push the tarball out at deploy time, and hope for the best. Then watch as all the installs fail because dpkg or rpm can't satisfy the dependencies of your dependencies</li>
<li>Script out something like '<span style="background-color: white; color: #222222; font-family: monospace, monospace; font-size: 13px;">apt-get --print-uris --yes install &lt;stuff&gt;</span>' and wget the results so you can tar 'em up, push 'em, and install 'em. Then watch as your <i>next</i> deploy fails miserably because the customer did their quarterly security patches and broke your dependency chain all to pieces</li>
<li>Containerize all the things! This actually might work just dandy depending on what your software does. In some cases though, your software needs more intimacy with the hardware than a container can offer (mostly ultra-low-latency IO and the like). Plus it might mean introducing technologies that your customer is unwilling to accept on their systems...</li>
<li>Stick it all in a single package and jail it all in /opt. <b>Winner!</b></li>
</ul>
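For what it's worth, that second approach is easy to sketch even though it ends in tears. Something like the following (package names purely illustrative, and the quoted-URI output format shown in the comment is an assumption worth verifying against your own apt) harvests the full .deb list, sub-dependencies included:

```python
import re
import subprocess

# apt-get --print-uris emits one line per .deb, shaped roughly like:
#   'http://archive.example.com/pool/main/c/curl/curl_7.68.0-1_amd64.deb' curl_7.68.0-1_amd64.deb 161420 MD5Sum:abcd...
URI_LINE = re.compile(r"^'(?P<uri>[^']+)'\s+(?P<filename>\S+)\s+(?P<size>\d+)")

def parse_print_uris(output):
    """Pull (uri, filename, size) tuples out of apt-get --print-uris output."""
    debs = []
    for line in output.splitlines():
        m = URI_LINE.match(line)
        if m:
            debs.append((m.group("uri"), m.group("filename"), int(m.group("size"))))
    return debs

def deb_uris(*packages):
    """Ask apt which .debs an install would fetch, without installing anything."""
    out = subprocess.run(
        ["apt-get", "--print-uris", "--yes", "install", *packages],
        check=True, capture_output=True, text=True).stdout
    return parse_print_uris(out)
```

From there you'd wget each URI, tar the lot, and ship it - and then, as noted above, watch it break the first time the target's package database drifts from the one you built against.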
<div>
Enter <a href="https://github.com/opscode/omnibus" target="_blank">Omnibus</a>. You get to include every dependency above libc in the correct order and specify how to build each one, jailing them in /opt/&lt;name&gt; (or wherever you want, really). Then you get a nice shiny package out of it (pretty much any platform you want) and can confidently install it nearly anywhere without having to worry about whether or not you can get to any external repos.</div>
<div>
<br />
I'll see about putting together a Cody-fied (read: spoon-fed) how-to on getting started with Omnibus shortly.<br />
<br />
In the meantime, if you're already using Omnibus and your builds are taking forever due to your software cache getting nuked every time you try to add a dependency, check <a href="http://tech.strofcon.org/2015/01/omnibus-cache-exploitation.html" target="_blank">this</a> out.</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6854573657784240555.post-72779425857303923612014-08-10T23:36:00.000-05:002014-08-24T23:30:07.203-05:00InfluxDB and Grafana Scripted DashboardsI've been spending a lot of time lately playing with <a href="https://github.com/influxdb/influxdb" target="_blank">InfluxDB</a>, using <a href="https://github.com/grafana/grafana" target="_blank">Grafana</a> as the front-end dashboarding tool. If you haven't checked these out, you really should - they're a lot of fun and have a ton of potential.<br />
<br />
One of the most attractive features of Grafana to me is that of "Scripted Dashboards." Basically you can write up your own semi-arbitrary JavaScript to generate a dashboard that meets almost any need you might have.<br />
<br />
I spent some time trying to figure out how to build something fun with my InfluxDB-backed Grafana, and after some stumbling about (thanks for the pointers <a href="https://github.com/torkelo" target="_blank">@torkelo</a>!), I came up with a decent starting point!<br />
<br />
NOTE: I are not developer - if my JS is shoddy, feel free to give me some pointers! Always happy to learn how to write better code.<br />
<br />
You can see the code at <a href="https://github.com/cboggs/grafana-templates">https://github.com/cboggs/grafana-templates</a>. It's basically a simple wrapper script that sources in whatever script you specify in the <b>template</b> URL arg. (I know, sourcing in arbitrary scripts is a major security no-no, but I'm mostly operating under the assumption that your Grafana + InfluxDB installation will be inside a properly locked-down network and used by non-malicious folks within your organization.)<br />
<br />
If you want to know where to find various options for your panels, there are a couple of places that will likely have what you seek:<br />
<br />
<ul>
<li>the exported JSON representation of what you're trying to accomplish (mock it up in the GUI and export it)</li>
<li>if not in the JSON export, then in `/src/app/services/influxdb/influxdbDatasource.js`</li>
</ul>
<div>
I think there is some major potential for tools like these in "building a better metrics trap," as I like to say. Anomaly detection becomes dead simple with tools like InfluxDB, and something like scripted dashboards in Grafana lets you play with the idea of "situational dashboards" that might only show you the graphs that are relevant to some anomalous behavior in your cluster. No more 30-node nightmare dashboards, I'd hope!</div>
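The scripted dashboards themselves are JavaScript, but the idea generalizes: a dashboard is just a JSON document, so anything that can build JSON from a list of "interesting" hosts can drive one. A toy sketch of the concept - field names trimmed way down from what Grafana actually expects, and the InfluxDB query is illustrative only:

```python
import json

def situational_dashboard(hosts, metric="cpu_idle"):
    """Build a (greatly simplified) dashboard definition that only
    includes panels for the hosts currently misbehaving."""
    rows = []
    for host in hosts:
        rows.append({
            "title": host,
            "panels": [{
                "title": "{} on {}".format(metric, host),
                "type": "graph",
                # Real Grafana target objects carry many more fields than this.
                "targets": [{"query": "select mean(value) from {} where host = '{}'"
                                      .format(metric, host)}],
            }],
        })
    return {"title": "Situational: {}".format(metric), "rows": rows}

# Only the two anomalous hosts get panels - not all 30.
dash = situational_dashboard(["web07", "web21"])
print(json.dumps(dash, indent=2))
```

Feed the anomalous-host list in from whatever does your anomaly detection, and the dashboard draws itself.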
<div>
<br /></div>
<div>
Enjoy!</div>
Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-6854573657784240555.post-27369585362500378852013-02-24T16:58:00.000-06:002013-02-24T20:55:20.917-06:00Nagios and Ganglia: check_ganglia_metric.sh FailingAt work, we use Ganglia to meet the majority of our metrics needs. One handy thing that Ganglia brings with it is a set of scripts and PHP pages that allow Nagios to check metrics in a variety of ways and alert as necessary.<br />
<div>
<br /></div>
<div>
Last week I stumbled across an odd situation - our primary datacenter's Nagios instance was reporting "null" for all of our Ganglia checks (most of which call check_ganglia_metric.sh). This was particularly bad, because the null values were considered "OK" by Nagios, which left us blind to some potentially harmful circumstances.<br />
<br />
<h3>
<b>Short Version</b></h3>
Bump the amount of memory a single PHP script can consume by editing /etc/php.ini and looking for something like <span style="font-family: 'Courier New', Courier, monospace;">memory_limit = 128M</span>. An exhausted memory limit is most likely why the check_metric.php page is choking and returning nothing.<br />
<br />
<h3>
Long Version</h3>
</div>
<div>
To see what was going on under the covers, I ran check_ganglia_metric.sh by hand on each host. One side ran fine and gave me legitimate Nagios-style feedback; the other gave me nothing at all. The shell script was simple enough - it just cobbled the passed arguments together into a URL that it then curl'd. The page it was curl'ing was check_metric.php, so that's where I dug next. After throwing in a bunch of extra debug statements, I was at least able to figure out where the PHP was dying:</div>
<div>
<br /></div>
<div>
<b><span style="font-family: Courier New, Courier, monospace;">check_metric.php</span></b></div>
<br />
<script src="https://gist.github.com/cboggs/383bf6f2e562c84f51aa.js"></script><br />
<br />
The if statement that checks the nagios_cache_time was the last successful statement in my debug output. So I started digging to see where the nagios_cache_file lived and just how stale it was.<br />
<br />
The default file name is nagios_ganglia.cache in your Ganglia install's conf_dir (check out your conf.php in ganglia-web's root directory). In my case, the file was a couple of weeks old, which is considerably more than the default 45-second age it was configured for! When I looked at the working system in the other DC, I saw the file being updated multiple times per minute - more often than the 45 seconds configured in conf.php - so I began to suspect that the Nagios checks themselves were triggering the cache refresh. A quick check of <span style="font-family: Courier New, Courier, monospace;">/var/log/httpd/ganglia_error_log</span> made it blindingly obvious:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><b>[Thu Feb 21 03:23:26 2013] [error] [client 10.1.0.34] PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 1625427 bytes) in /var/www/html/ganglia-qd/nagios/check_metric.php on line 62</b></span><br />
<br />
What that line does isn't terribly important at this point (serializing the data from the cache file), but what <b>is</b> important is that the global PHP memory limit is insufficient. Checking <span style="font-family: Courier New, Courier, monospace;">/etc/php.ini </span>revealed: <span style="font-family: Courier New, Courier, monospace;">memory_limit = 128M</span><br />
<br />
Evidently, that just won't do for a > 7M cache file. Bumping it to 256M and bouncing httpd did the trick!Unknownnoreply@blogger.com0