Although the shop I’m at now handles a lot of data, we don’t qualify as a big data shop, because we never keep the data: we run constant analyses of big feeds coming into our system, and once we’ve handled the data, we’re done with it. Keeping it around just costs money and brings no benefit.
Now, we could pour all that data into a big, open-ended datastore like Hadoop, and crunch it there. But the fact is, once we’ve analyzed the data, we don’t really care about it anymore. We’ve got it characterized and encoded, and from that point forward, that’s what we really care about.
So the Big Data label doesn’t suit us. Instead, we fit the model of Big Streaming.
Just as there are a bunch of surprising discoveries to be made with Big Data, there are a lot of surprising discoveries with Big Streaming. Some of them are:
Deleting Data is Hard
Because we don’t want to accumulate a bunch of documents the way we would in a Big Data solution, we want to receive a document, process it, then after some time get rid of it to free up the space. Big static datastores are powerful and wonderful, but they are also very expensive and slow. Extremely slow — when I was working with Hadoop a while back, a simple “hello world” map-reduce could take over a minute to run.
We need data to drop into a fast datastore, get pulled out in batches for processing, maybe stick around a while to refer back to in the UI, and then get thrown away.
We’re using Mongo. In our first go at it, we had as much contention from deletes as we did from inserts — in a steady state, we were constantly deleting records from the collections as fast as we were inserting them. There was a brief and painful attempt to use TTL indexes to time the records out, but in each case the insert/delete load on the collection was so high that the TTL deletes fell behind. At one point, a 3-day collection was holding 10 days’ worth of data, and losing ground. Ack!
That led to a couple of scrambles where we had to implement a new solution in place as the flames grew — there was so much contention that any attempt to delete data just added more contention and made things worse. Fortunately, we pulled out of it each time.
We could have moved to a capped collection, but because of the difficulty estimating the size of collection we’d need, we didn’t go down that path. Which is too bad, because capped collections in Mongo are completely awesome.
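For the curious, what makes capped collections awesome here: once the collection hits its fixed size, new inserts silently evict the oldest documents, so deletion costs nothing. A `deque` with a `maxlen` models those semantics; the commented pymongo call (collection name and size are placeholders, not our real config) shows how one would actually be created:

```python
from collections import deque

# A capped collection behaves like a fixed-size ring buffer: once full,
# the oldest documents are overwritten by new inserts, no deletes needed.
# With pymongo the real thing would be created with something like:
#   db.create_collection("events", capped=True, size=512 * 1024 * 1024)
capped = deque(maxlen=3)
for doc in ({"n": i} for i in range(5)):
    capped.append(doc)  # inserting past the cap evicts the oldest entry

# Only the newest three documents survive.
surviving = [d["n"] for d in capped]
```

The catch, as noted above, is that the cap is a byte size fixed at creation time, which is exactly the number we couldn’t estimate.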
We needed a batch delete option, and ended up settling on a scheme where we eliminate entire collections at a time, wiping away hours’ worth of data in a single stroke.
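A minimal sketch of that scheme, with hypothetical names and a six-hour bucket width: each document’s timestamp maps to a time-bucketed collection name, and expiry means handing whole collection names to `drop_collection` instead of deleting documents one at a time:

```python
from datetime import datetime, timedelta

def collection_for(ts, hours_per_bucket=6):
    """Name of the time-bucketed collection a document belongs in."""
    bucket = ts.replace(minute=0, second=0, microsecond=0)
    bucket -= timedelta(hours=bucket.hour % hours_per_bucket)
    return "events_" + bucket.strftime("%Y%m%d%H")

def buckets_to_drop(now, retention_hours=72,
                    hours_per_bucket=6, lookback_buckets=8):
    """Collection names old enough to wipe wholesale; the caller would
    hand each one to db.drop_collection()."""
    cutoff = now - timedelta(hours=retention_hours)
    return [collection_for(cutoff - timedelta(hours=i * hours_per_bucket))
            for i in range(1, lookback_buckets + 1)]
```

Dropping a collection is a metadata operation, so it sidesteps the per-document delete contention entirely.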
The upshot: if you’re going Big Streaming, you need a way to wipe out a lot of data fast.
Inserting Data is Hard
The other aspect that we struggle with is just getting all the data from the different sources into the system. After a while, all that insert activity does start to create write contention.
So far, we’ve been lucky with this one, and we’re still able to use a naive solution of pouring it all into a big collection. But we’re going to hit the wall on that in the next few months as our data stream grows.
Backing up Data is … Interesting
Backing up transitory data becomes an interesting problem. Even if we did push the data onto some more permanent medium, if we ever needed to restore it, it would be hugely outdated.
So, instead of relying on backups, we just use replication (in our case, provided by Mongo) to make sure we always have a moving copy of the information we’re getting. We back up some of the results of the analysis, but that wouldn’t be appropriate for the transitory data that we want to get, process, then get rid of.
Batch Processing Works OK
Right now the system that I walked into uses a process of saving the data down to a datastore, then pulling it out in batches for the different processing steps. Each of the batch processes is (fortunately) designed to scale horizontally: each process plays nice, creating pessimistic lock artifacts and shared batch documents. So the system is coded for the kind of distributed processing that Hadoop generalizes with map-reduce.
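The claim-a-batch step can be sketched in miniature. This toy registry uses an in-process lock where the real system would rely on an atomic Mongo update that stamps an owner field onto the batch document; the class and field names here are illustrative, not the production code:

```python
import threading

class BatchClaims:
    """Toy pessimistic-lock registry. In Mongo, the same effect comes
    from an atomic update that sets an 'owner' field on the batch
    document, so exactly one worker wins the claim."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owners = {}

    def claim(self, batch_id, worker):
        # Atomically take ownership; losers back off and try another batch.
        with self._lock:
            if batch_id in self._owners:
                return False
            self._owners[batch_id] = worker
            return True

    def release(self, batch_id):
        # Done processing: free the batch for reprocessing or cleanup.
        self._owners.pop(batch_id, None)
```

The essential property is that claiming is a single atomic test-and-set, so two horizontally scaled workers can’t grab the same batch.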
I think, however, that in the long run that approach is going to hit a limit. My hope is that we’ll be able to shift to a sort of pipeline processing, where inputs come in through a bank of homogeneous processing boxes and come out the other side, with any trimming, characterizing, and analysis done along the way.
Just like in the Big Data world, where Write Once / Read Many is a central paradigm, that model helps in the Big Streaming world. Updates are hard, slow, and expensive. So it works out well to write temporary artifacts as read-only artifacts, then pull them out for processing, and delete them (in broad strokes) when they’re not useful any more.
Particularly in Mongo, inserts are fast and can be done in batch, whereas updates specific to a document require a write lock, a lookup, and a write — even if you take advantage of Mongo’s update mechanisms to speed things up.
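Since inserts are cheap and updates are expensive, it pays to buffer write-once documents and push them in bulk, the way you’d hand a list to pymongo’s `insert_many` rather than inserting one at a time. A minimal sketch, with a plain list standing in for the collection:

```python
class BatchInserter:
    """Buffer write-once documents and emit them in bulk. 'sink' stands
    in for the Mongo collection (a list here, for illustration); each
    flush models a single insert_many round trip."""

    def __init__(self, sink, batch_size=1000):
        self.sink = sink
        self.batch_size = batch_size
        self.buf = []

    def add(self, doc):
        self.buf.append(doc)
        if len(self.buf) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buf:
            self.sink.append(list(self.buf))  # one round trip, many docs
            self.buf = []
```

One bulk write amortizes the per-operation overhead across the whole batch, which is exactly what per-document updates can’t do.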
Everything is a Time Interval
When data like this pours in, it’s all about breaking the data into time intervals, and handling the intervals in the right way. So there’s a lot of date math involved. Deciding what “now” means at any given moment in the processing can also be tricky.
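A taste of the date math, as a minimal sketch (the five-minute interval and two-minute lateness allowance are made-up figures): one helper snaps timestamps down to interval boundaries, and the other defines a “now” that deliberately lags the wall clock so late-arriving records still land in an open interval:

```python
from datetime import datetime, timedelta

def floor_to_interval(ts, minutes=5):
    """Snap a timestamp down to the start of its enclosing interval."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

def processing_now(wall_clock, allowed_lateness=timedelta(minutes=2)):
    """The pipeline's 'now' trails the wall clock so that stragglers
    can still be filed into an interval that hasn't been closed yet."""
    return wall_clock - allowed_lateness
```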
Latency and Capacity Rule
The upshot of the whole thing is that working on a Big Streaming system makes you really aware of the clock, and whether your processing is keeping pace with your inputs. The ideas of *latency* and *capacity* suddenly jump to the forefront, because that’s what determines if you can handle more, or if you’re falling behind, even more than the traditional “processing time”.
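A back-of-the-envelope way to see it, with illustrative numbers: capacity is how many documents per second the batch machinery can absorb, latency is roughly how long a document waits on its batch, and the question that matters is whether capacity covers the input rate:

```python
def pipeline_health(input_rate, batch_size, batch_seconds):
    """input_rate: docs/sec arriving; batch_size docs processed every
    batch_seconds. If capacity falls below input_rate, the backlog
    grows without bound no matter how fast any single step is."""
    capacity = batch_size / batch_seconds   # docs/sec we can absorb
    return {"capacity": capacity,
            "latency": batch_seconds,       # rough wait for a doc's batch
            "keeping_pace": capacity >= input_rate}
```

With 500 docs/sec arriving and 10,000-doc batches taking 30 seconds, capacity is about 333 docs/sec: each step may be “fast,” but the system is falling behind.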
It’s really fun, though. Standing in front of a firehose of data and running analysis on the data is exhilarating — when it works.