Thursday, September 7, 2017

An Ops Nerd Learns Probabilistic Data Structures, Part 1: Intro

I have a friend who is scary smart. I mean, I'm a reasonably smart guy, sure... but whoa. As it turns out (very much in my favor) he's also super nice and happy to help me learn new things. So when he said something about a "probabilistic data structure," and I promptly responded with, "right, I know all three of those words... but... huh?", he kindly gave me a gentle introduction to what I would soon learn is a super cool category of Computer Science.

What the heck is a "probabilistic data structure?"

Well, to be glib, it's a data structure that relies on probabilities and stuff. More usefully, it's (usually) an efficient way to store massive quantities of data in such a way that you get, at most, "fairly confident" answers about that data.

I know, that's not terribly helpful either, but we'll get there! I aim to make this series easy to digest and relate to. I'm not a math whiz nor am I a software engineer, which means I have to think about things in way simpler terms than either of those people would. Then when I go to explain it to folks smarter than myself (read: you), I like to do it in terms that are less muddied than the symbol-filled abstract articles and videos I've seen so far. I figure if I can't explain it to someone else in simpler terms than those in which I learned it, I probably don't understand it as well as I think.

Let's get some context: Why would I need such magic?

Shell Game - Distributed Data Stores

Imagine that we have a pile of records that you need to store. We want the full data set to be stored on disk, but we can't possibly fit it all on one machine. So we split it up across 3 machines, and life's good!

Eventually though, we'll want to retrieve some of those records. It would be pretty wasteful to go to each machine, ask for a record, and force the machine to make an expensive series of disk accesses just to tell you that the princess is another castle. We might get lucky and land on the right machine the first time, but my (sketchy) math skills tell me that this is unlikely to happen with more than, roughly, one-third of our attempts.

Instead it would be handy if, as we ingest each record into our multi-machine data store, we write a highly-compressed version of that record to an in-memory data structure that can tell us - with some tune-able degree of confidence - wheter the data we want does or does not reside on a particular machine. One way of doing this with an impressive level of space efficiency is by implementing a Bloom filter, which we'll cover in the next post in this series.

#Hashtag #Frequency #Tracking #For #Fun #And (#Probably #No) #Profit

Imagine that we wanted to track the approximate frequency of various hashtags from the live Twitter stream. We can't really expect to store a full set of this data in memory with perfect fidelity, as we'd never have enough space. We also aren't likely to be successful dumping it all to disk, as that's a lot of writes that never freaking slow down.

We can, however, sacrifice a small bit of accuracy for a massive increase in space efficiency. We would do this with a data structure called a count-min sketch, which can tell us that a particular hashtag has most certainly not been seen more than N times, but may have been seen slightly fewer than N times. We'll cover this one in Part 3 of the series.

An understanding of HyperLogLog is its own reward

Don't worry, we'll get there. For now, it should suffice to say "it's weird, and it lets us achieve nigh-ludicrous space efficiency."

Up Next!

Part 2 will cover the Bloom filter. We'll talk about its perks, drawbacks, mechanisms, and use-cases. Then we'll build one and benchmark it! It should be all kinds of fun. See ya then!

3 comments:

  1. RACING IN CASINO | DR MCD
    RACING 영천 출장샵 IN CASINO. 대전광역 출장안마 RACING IN CASINO in Columbus 평택 출장안마 is an exciting casino 아산 출장마사지 resort located off Interstate 15. 전주 출장마사지 It features a variety of slots, live action,

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. To gain in-depth knowledge on the same, let’s understand everything about Dashboard software with a special note on the free and open source Dashboard software solutions. https://www.inetsoft.com

    ReplyDelete