Squeeze More into MongoDB with Compressed Docs

If you’re using the write-once strategy for a collection in MongoDB, then there’s an opportunity to optimize space usage by compressing the documents in that collection. Compressing the documents has some nice consequences:

  • faster transmission time over the network
  • smaller footprint in RAM (i.e. a smaller working set)
  • faster reads/writes to disk
  • less space taken up on disk — although, practically, we hardly care about that any more

There are some tradeoffs, of course:

  • can’t use the nice update operations like $set or $push (but we’ve already stipulated this is write-once data)
  • troubleshooting is a little harder because you can’t just find() the docs and look at them — you have to use a script to extract and decompress them.
  • more CPU usage — essentially this is a strategy for trading RAM for CPU. (In our case, CPU is cheaper and more easily scaled.)

Fortunately, Python makes it easy to pickle (serialize) and zlib.compress (compress) documents going into the database, and the same tools can restore the original documents on the way out.
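
Roughly, the round trip looks like this (a minimal sketch of the same steps the module below performs; the helper names are just for illustration):

import cPickle as pickle
import zlib
import base64 as b64

def squash(obj):
    # serialize, compress, then base64-encode so the result is a plain string
    return b64.urlsafe_b64encode(zlib.compress(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)))

def unsquash(data):
    # reverse the steps: decode, decompress, unpickle
    return pickle.loads(zlib.decompress(b64.urlsafe_b64decode(data)))

doc = {'user': 'alice', 'scores': range(100)}
assert unsquash(squash(doc)) == doc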

We want the process to be smart enough to allow an easy transition: if we fetch a document and it’s not compressed, we just use it as-is; if it is compressed, we open it up. We accomplish that by storing the compressed data in a field with a special name: "!". There’s a low chance of collision with that. 😉 If we pull a document and it contains a bang field, we decompress that field and return its contents as the document. If the bang field is missing, we return the document as-is. That allows compressed and uncompressed documents to live side by side, and hopefully over time the collection will contain more compressed documents as we write them in.
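
Side by side, the two kinds of documents look roughly like this (the encoded value here is made up for illustration):

plain_doc = {'_id': 1, 'user': 'alice', 'total': 42}    # returned as-is
compressed_doc = {'_id': 2, '!': 'eJxrYE1n...'}          # '!' holds the pickled, zlib-compressed, base64-encoded original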

Two details I caught in testing were:

  • If the document has an _id field, we want to keep it visible outside the compressed field, because we want the document’s identifier to remain the same (see the sketch after this list).
  • We have to provide for a push_batch operation for the times when we’re pushing an array of documents and not just individuals.
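
For example, here’s a quick sketch using the module defined below:

compressor = DocCompressor()

doc = {'_id': 'abc123', 'user': 'alice', 'total': 42}
wrapped = compressor.compress(doc)

# the identifier rides outside the '!' field, so lookups by _id still work
assert wrapped['_id'] == doc['_id']
assert compressor.decompress(wrapped) == doc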

This Python module is written in a generic way. It’s easy to apply it any time you want to compress a stream of dicts, or any other Python object that can be pickled.
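
For instance, nothing in it is Mongo-specific; push will hand compressed docs to any callable, and pull will round-trip them back (a sketch):

compressor = DocCompressor()

archive = []
compressor.push(archive.append, ({'n': i} for i in xrange(3)))  # any callable works as the target
originals = list(compressor.pull(archive))                      # back to plain dicts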

Usage with a mongo collection looks like this:

compressor = DocCompressor()

# inserting documents

collection.insert(compressor.compress(one_doc))

# ...or...

docs = [some list of docs]
compressor.push(collection.insert, docs)

# getting documents

docs = collection.find()
for doc in compressor.pull(docs):
    ....
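
When we have a whole batch to write at once, push_batch hands the list to a single call (a sketch; pymongo’s legacy insert accepts a list of documents):

# ...or, in one batch insert...

compressor.push_batch(collection.insert, docs)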

And here’s the source code for the module:

import cPickle as pickle
import zlib
import base64 as b64


class DocCompressor(object):
    """
    Responsible for compressing and decompressing documents to and from
    Mongo (or any datastore, really).

    The class also provides a generator, decompressing documents as they
    come from the datastore as well.
    """

    COMPRESSED_KEY = '!'

    def compress(self, doc):
        """
        Compress a document and return a new document that wraps the compressed one.
        """
        # squash and encode it
        pickled = pickle.dumps(doc, pickle.HIGHEST_PROTOCOL)
        squished = zlib.compress(pickled)
        encoded = b64.urlsafe_b64encode(squished)

        # return a doc containing the compressed doc; the id gets a free ride
        compressed = {self.COMPRESSED_KEY: encoded}
        self._copy_id(doc, compressed)

        return compressed

    def decompress(self, doc):
        """
        If the document contains a compressed document, return that. Otherwise
        return the original.
        """
        if self.COMPRESSED_KEY not in doc:
            # not compressed; it is what it is
            return doc

        # the compressed doc is there, pull it out
        unencoded = b64.urlsafe_b64decode(str(doc[self.COMPRESSED_KEY]))
        decompressed = zlib.decompress(unencoded)
        orig_doc = pickle.loads(decompressed)
        self._copy_id(doc, orig_doc)

        # return the original doc
        return orig_doc

    def _copy_id(self, a, b):
        """If the id is there, it gets a free ride from a to b"""
        if '_id' in a:
            b['_id'] = a['_id']

    def pull(self, source):
        """
        When acting as a generator, get the next doc and decompress it.

        Usage:
            for doc in compressor().pull(source):
                ...
        """
        for doc in source:
            yield self.decompress(doc)

    def push(self, target, source):
        """
        Pull all the docs from the source and pass them one at a time as the
        (single) parameter to the target function. Follows the semantics of map.

        Usage:
            target = lambda x: ...
            compressor().push(target, source)
        """
        for doc in source:
            target(self.compress(doc))

    def push_batch(self, target, source):
        """
        Pull all the docs from the source and send them as a list as the
        (single) parameter to the target function.

        Usage:
            target = lambda x: ...
            compressor().push_batch(target, source)
        """
        batch = [self.compress(doc) for doc in source]
        target(batch)
