Thursday, September 7, 2017

An Ops Nerd Learns Probabilistic Data Structures, Part 1: Intro

I have a friend who is scary smart. I mean, I'm a reasonably smart guy, sure... but whoa. As it turns out (very much in my favor) he's also super nice and happy to help me learn new things. So when he said something about a "probabilistic data structure," and I promptly responded with, "right, I know all three of those words... but... huh?", he kindly gave me a gentle introduction to what I would soon learn is a super cool category of Computer Science.

What the heck is a "probabilistic data structure?"

Well, to be glib, it's a data structure that relies on probabilities and stuff. More usefully, it's (usually) an efficient way to store massive quantities of data in such a way that you get, at most, "fairly confident" answers about that data.

I know, that's not terribly helpful either, but we'll get there! I aim to make this series easy to digest and relate to. I'm not a math whiz nor am I a software engineer, which means I have to think about things in way simpler terms than either of those people would. Then when I go to explain it to folks smarter than myself (read: you), I like to do it in terms that are less muddied than the symbol-filled abstract articles and videos I've seen so far. I figure if I can't explain it to someone else in simpler terms than those in which I learned it, I probably don't understand it as well as I think.

Let's get some context: Why would I need such magic?

Shell Game - Distributed Data Stores

Imagine that we have a pile of records that we need to store. We want the full data set to be stored on disk, but we can't possibly fit it all on one machine. So we split it up across 3 machines, and life's good!

Eventually though, we'll want to retrieve some of those records. It would be pretty wasteful to go to each machine, ask for a record, and force the machine to make an expensive series of disk accesses just to tell you that the princess is in another castle. We might get lucky and land on the right machine the first time, but my (sketchy) math skills tell me that with 3 machines, that only happens on roughly one-third of our attempts.

Instead, it would be handy if, as we ingest each record into our multi-machine data store, we write a highly-compressed version of that record to an in-memory data structure that can tell us - with some tunable degree of confidence - whether the data we want does or does not reside on a particular machine. One way of doing this with an impressive level of space efficiency is by implementing a Bloom filter, which we'll cover in the next post in this series.
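To make that concrete before the deep dive, here's a minimal Bloom filter sketch in Python. All the names (`BloomFilter`, `might_contain`, the bit-array size, the trick of salting one hash function to fake k independent hashes) are my own illustrative choices, not anything from a real library - it's a toy to show the core idea: a "no" is definite, a "yes" is only probable.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k salted hashes over an m-slot bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not stored here."
        # True means "probably stored here" (false positives possible).
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("record-42")
print(bf.might_contain("record-42"))  # True: anything added always says yes
print(bf.might_contain("record-99"))  # False unless all its bits collide (rare)
```

In the distributed-store scenario, each machine would keep one of these in memory; before paying for a disk seek, you ask the filter first, and a `False` lets you skip that machine entirely.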

#Hashtag #Frequency #Tracking #For #Fun #And (#Probably #No) #Profit

Imagine that we wanted to track the approximate frequency of various hashtags from the live Twitter stream. We can't really expect to store a full set of this data in memory with perfect fidelity, as we'd never have enough space. We also aren't likely to have much luck dumping it all to disk, as that's a firehose of writes that never freaking slows down.

We can, however, sacrifice a small bit of accuracy for a massive increase in space efficiency. We would do this with a data structure called a count-min sketch, which can tell us that a particular hashtag has most certainly not been seen more than N times, but may have been seen slightly fewer than N times. We'll cover this one in Part 3 of the series.
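Here's a minimal count-min sketch in Python to show the shape of the thing. Again, the names and sizes (`CountMinSketch`, width 256, depth 4) are illustrative assumptions of mine, not a real library's API. The key property matches the description above: collisions only ever inflate counters, so the estimate is an upper bound - it can never undercount.

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch: d rows of w counters, one salted hash per row."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _position(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.rows[row][self._position(row, item)] += count

    def estimate(self, item):
        # Every row can only over-count (hash collisions add, never subtract),
        # so the minimum across rows is the tightest upper bound we have.
        return min(self.rows[row][self._position(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch()
for _ in range(5):
    cms.add("#caturday")
print(cms.estimate("#caturday"))  # 5 here; in general, never less than the truth
```

Notice the space win: no matter how many distinct hashtags flow through, memory stays fixed at `width * depth` counters.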

An understanding of HyperLogLog is its own reward

Don't worry, we'll get there. For now, it should suffice to say "it's weird, and it lets us achieve nigh-ludicrous space efficiency."

Up Next!

Part 2 will cover the Bloom filter. We'll talk about its perks, drawbacks, mechanisms, and use-cases. Then we'll build one and benchmark it! It should be all kinds of fun. See ya then!

Sunday, January 8, 2017

Retiring StatsYard... But Starting Something New!

Short and sweet for this post: I'm retiring the StatsYard project for the following reasons:

  1. It's a bit too ambitious for a project that I don't plan to ever turn into a viable product
  2. "Next steps" are hard to settle on (for me), largely because of point #1
  3. It doesn't stand to help society in any meaningful way

So, instead, I'm going to pivot to a new project, which I'll discuss here as I go. It's something I hope will provide at least a little bit of societal value, and will be much more relevant to a wider audience of developers looking to learn more about Elixir and Phoenix!

With a little luck, I'll have a bit more to show and tell in the coming weeks. I just wanted to officially close out the StatsYard project for anyone who might have been following along. Sorry for any inconvenience, but hopefully this next project will be considerably easier for me to keep moving forward!

'Til next time, happy hacking!