Tuesday, March 15, 2016

Sysdig Cloud - Monitoring Made Awesome Part 1: Metrics Collection

Just in case you hadn't noticed before now, I'm a tiny bit obsessed with ops metrics. Most of my preoccupation with metrics seems to stem from the fact that... well... it's hard. "Metrics," for all that term entails, can be a difficult problem to solve. That might sound a bit odd given that there are a thousand-and-one tools and products available to tackle metrics, but with a sufficiently broad perspective the problem becomes pretty clear.

I've worked with a lot of different metrics tools while trying to solve lots of different problems, but I tend to come across the same pain points no matter which tool I'm using. As a result, I'm pretty hesitant to recommend most of those tools and products.

Fortunately, that may very well be changing with the advent of Sysdig Cloud. Sysdig has really nailed the collection mechanism and is doing some great work on the storage and visualization fronts. This is the first in a series of posts describing how I think Sysdig is changing the game when it comes to metrics and metrics-based monitoring.

Disclaimer: I am not currently (or soon to be) employed by Sysdig Cloud, nor am I invested in Sysdig Cloud. I'm just genuinely impressed as an ops guy who's got a thing for metrics. I've been using Sysdig for a few months, and I only plan to brag on features that I've used myself. I'll also point out areas where Sysdig is a bit weak and could improve, because let's face it - no one's perfect. :-)

The Problem(s)

There are essentially three stages to the life-cycle of a metric: Collection, Storage, and Visualization. Believe it or not, they're all pretty hard to solve in light of the evolving tech landscape (farewell, static architectures!). This post tackles Collection, and it runs a bit long because Sysdig does that stage so well.

When Metrics Collection Gets Painful

For metrics to be of any use you need some mechanism by which to extract / catch / proxy / transport them. The frustrating part is that there are actually several layers of collection that need to be considered. If we take the most complex case - a container-based Platform as a Service - three categories of metric should do the trick: host, container, and application. Handling all of these categories well is difficult - I tried with Stat Badger, and it was... well... a bit unpleasant.

Host metrics are generally fairly easy to collect and most collectors grab the same stuff, albeit with varying degrees of efficiency and ease of configuration.
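Just to make "easy" concrete, here's a minimal sketch (Python, assuming a standard Linux /proc layout) of what most host-level collectors boil down to: parse a couple of virtual files on an interval and ship the numbers somewhere.

```python
# A minimal sketch of traditional host-metric collection: most
# collectors end up parsing virtual files like these on a fixed
# interval. Linux-only; field layout is stable on modern kernels.

def read_loadavg():
    """Return the 1/5/15-minute load averages from /proc/loadavg."""
    with open("/proc/loadavg") as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)

def read_meminfo():
    """Return (MemTotal, MemAvailable) in kB from /proc/meminfo."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])  # drop the 'kB' suffix
    return info["MemTotal"], info["MemAvailable"]

print("load averages:", read_loadavg())
print("memory (total kB, available kB):", read_meminfo())
```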

Container metrics aren't too terribly hard to collect, though there are certainly fewer collectors available. The actual values we care about here are usually a subset of our host metrics (CPU % util, memory used, network traffic, etc) scoped to each individual container. This starts to uncover the need for orchestration metadata when considered within a PaaS environment.

Application metrics can easily become the bane of your existence. Do we have a way to reliably poll metrics out of the app (for example, JMX)? If so, does our collector handle this well or do we need to shove a sidecar container into our deployments to provide each instance with its own purpose-built collector? If not, do we try to get the developers to emit metrics in some particular format? Should they emit straight to the back-end, or should they go through a proxy? If they push to some common endpoint(s), how best can we configure that endpoint per environment? Then once we've answered these questions, how on earth do we correlate the metrics from a particular application instance with the container it's running in, the host the container is running on, the service the app instance belongs to, the deployment and replication / HA policies associated with that service, and so on and so on...???
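To ground just one of those questions: if the answer is "have the app emit StatsD," the wire format really is as simple as plaintext over UDP. Here's a hedged sketch - the host and port are hypothetical, and wiring them up per environment is exactly the pain point above.

```python
# A sketch of the "have the app emit metrics" answer: StatsD's wire
# format is just "name:value|type" in plaintext over UDP. The endpoint
# below is an assumption -- in practice it would come from
# per-environment config.
import socket

STATSD_HOST, STATSD_PORT = "127.0.0.1", 8125  # hypothetical endpoint

def statsd_incr(name, value=1):
    """Fire-and-forget a counter increment to a StatsD endpoint."""
    payload = ("%s:%d|c" % (name, value)).encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (STATSD_HOST, STATSD_PORT))
    sock.close()

statsd_incr("myapp.requests.handled")
```

And even once that's settled, none of the correlation questions above go away.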

Enter the Sysdig agent.

How Sysdig Makes Metrics Collection Easy

The Sysdig agent is, right out of the gate, uncommon in its ambition: it sets out to tackle every layer of the collection problem.

Most collectors rely exclusively on polling mechanisms to get their data, whether that's reading /proc data on some interval, hitting an API endpoint to grab stats, or running some basic Linux command and scraping its output. This works, but it's prone to breakage when things get upgraded or tweaked, and it can be fairly inefficient.

Sysdig does have the ability to do some of those things to monitor systems such as HAProxy and the like, but that's not its main mechanism. Instead, the Sysdig agent watches the constant stream of system events as they flow through the host's kernel and gleans volumes of information from those events. Pretty much anything that happens on a host, with the exception of apps that run on a VM such as the JVM or BEAM, will result in (or be the result of) a system event that the host kernel handles. This has a couple of huge benefits: it's very low-overhead, and it's immensely hard to hide from the agent. These two core strengths of the base collection mechanism allow for a number of pretty cool features.
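To get a feel for the event-stream idea, here's a rough sketch that tails syscall events from Python using the open-source sysdig CLI (this is an illustration, not the agent's internals). It assumes sysdig is installed and you're running as root; the format string and trailing filter follow sysdig's documented syntax.

```python
# A rough illustration of the event-stream idea via the open-source
# sysdig CLI. Requires sysdig installed and root privileges.
import subprocess

# Stream every 'open' syscall as it happens, with process and file name.
cmd = ["sysdig", "-p", "%proc.name %fd.name", "evt.type=open"]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
for line in proc.stdout:
    print("event:", line.strip())
```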

Fine-Grained Per-Process Metrics

Watching system events allows the Sysdig agent to avoid having to track the volatile list of PIDs in /proc and traverse that virtual filesystem to get the data you want. All of the relevant information is already present in the system events and this opens the door to some really nifty visualization capabilities.
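For contrast, here's roughly what the agent gets to skip - a naive Python sketch that walks /proc and parses each process's stat file, racing against PIDs that vanish mid-scan:

```python
# The polling approach the agent avoids: enumerate PIDs under /proc and
# parse each one's stat file. PIDs can vanish between the directory
# listing and the read, hence the try/except -- the volatility problem
# mentioned above.
import os

def list_process_rss():
    """Yield (pid, comm, rss_in_pages) for every live process."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry) as f:
                fields = f.read().split()
        except (IOError, OSError):
            continue  # the process exited mid-scan
        # Field 2 is comm in parens, field 24 is resident set size in
        # pages. (Naive split -- a comm containing spaces would break
        # this, which is yet another polling pitfall.)
        yield int(entry), fields[1].strip("()"), int(fields[23])

for pid, comm, rss in list_process_rss():
    print(pid, comm, rss)
```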

"Container Native" Collection

By inspecting every system event that flows by, the agent cuts out the middle man when snagging container metrics. There's no need to hit the Docker /stats endpoint and process its output, no worrying about Docker version changes breaking your collection, and ultimately no need to limit yourself to Docker for your container needs. This also combines beautifully with the fine-grained per-process metrics to give you visibility into the processes within your containers, in addition to basic container-wide metrics. It's pretty awesome.
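For comparison, the "middle man" approach looks something like this sketch (using the Docker SDK for Python - my choice of client here is an assumption, and the agent doesn't work this way):

```python
# The polling approach Sysdig avoids: hit Docker's stats API for each
# container with the docker SDK (pip install docker). One blocking
# stats() round-trip per container, per interval, forever.
import docker

client = docker.from_env()
for container in client.containers.list():
    stats = container.stats(stream=False)  # one-shot stats snapshot
    mem_bytes = stats.get("memory_stats", {}).get("usage", 0)
    print(container.name, "memory bytes:", mem_bytes)
```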

Automatic Process Detection

The above two features combine very nicely to let the Sysdig agent automatically detect a wide variety of services by their process attributes, simply by having seen a relevant system event flow past on its way to the host kernel. This makes monitoring applications amazingly convenient, since the agent immediately sees when a recognized process starts up - even when it's inside a container.

For example, if you're running a Kafka container, the Sysdig agent will detect the container starting up, see the JVM process start up, notice that the JVM is exposing port 9092, spin up the custom Sysdig Java agent, inject it into the container's namespace, attach directly to the Kafka JVM process (from within the container, mind you), and start collecting some basic JVM JMX metrics (heap usage, GC stats, etc) along with some Kafka-specific JMX metrics - all for free, and without you needing to intervene at all. That's awesome.
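If you want a feel for the detection heuristic, here's a toy sketch - emphatically not Sysdig's actual implementation - that maps well-known listening ports to service types with psutil; the port table is purely illustrative.

```python
# A toy illustration of port-based service detection -- NOT Sysdig's
# actual implementation. Uses psutil (pip install psutil); the port
# table is purely illustrative.
import psutil

KNOWN_PORTS = {9092: "kafka", 6379: "redis", 5432: "postgres"}

for conn in psutil.net_connections(kind="tcp"):
    if conn.status != psutil.CONN_LISTEN or conn.laddr.port not in KNOWN_PORTS:
        continue
    name = psutil.Process(conn.pid).name() if conn.pid else "?"
    print("detected %s: process %s (pid %s) listening on port %d"
          % (KNOWN_PORTS[conn.laddr.port], name, conn.pid, conn.laddr.port))
```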

StatsD Teleport

I'm not going to dig into this one here since this is already a lengthy post. Just read this post from Sysdig's blog - and be amazed.

Orchestration Awareness Baked In

Orchestration metadata is 100% crucial to monitoring any PaaS or PaaS-like environment. One simply cannot have any legitimate confidence in their understanding of the health of the services running in their stack without being able to trace where any given instance of a service is running and where it lives in the larger ecosystem. Sysdig seems to have a strong focus on integration with a number of orchestration mechanisms. If you configure your orchestration integration correctly, any metric collected via any of the paths above is automatically tagged with ALL of that metadata on its way to the Sysdig back-end. Even better? The agents on individual nodes in a cluster - with Kubernetes, for instance - don't need to know that they're part of the cluster; only the masters do. The masters ship orchestration metadata to the back-end, and when metrics come in from a node that belongs to a master's cluster, they're automatically correlated and immediately visible in the appropriate context. Seriously, that's been making me a VERY happy metrics geek.
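To picture what that tagging buys you, here's a hand-wavy sketch of a raw container sample being enriched with Kubernetes context before shipping; the tag names and metadata lookup are illustrative, not Sysdig's actual schema.

```python
# A hand-wavy sketch of orchestration-aware tagging: the same raw
# sample, enriched with Kubernetes context before it's shipped. The
# tag names and metadata source are illustrative assumptions.

def enrich(sample, k8s_meta):
    """Attach orchestration context to a raw container metric sample."""
    sample["tags"] = {
        "kubernetes.namespace":  k8s_meta["namespace"],
        "kubernetes.pod":        k8s_meta["pod"],
        "kubernetes.deployment": k8s_meta["deployment"],
    }
    return sample

raw = {"metric": "container.cpu.used.percent", "value": 42.0}
meta = {"namespace": "prod", "pod": "kafka-0", "deployment": "kafka"}
print(enrich(raw, meta))
```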

To Be Continued (Later)

Once this series has finished expounding the ways Sysdig is tackling metrics and monitoring the right way (in my opinion), I'll get around to posting some more technical how-to pieces showing how best to make use of some of these features.

For now, happy hacking!
