Clean up in Hadoop!

Or, as my little girl would say, “You need a vacuum!”

We have a logging service that uses Scribe to accept huge amounts of data into Hadoop. Well, potentially huge, once someone starts using it. The problem is that, because append mode is broken in the current version, we have to roll the logs every minute, the way the guys at Facebook do.

That means we end up with a ton of small or zero-byte files in the logging topics that don’t get much traffic.
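To put a rough number on that, here is a back-of-the-envelope sketch. It assumes each topic gets exactly one new file per roll, which is a simplification for illustration (the real count depends on how many writers there are), and the `files_per_day` name is made up.

```python
# Rough count of files created by rolling the logs every minute.
# Assumption (for illustration only): one file per topic per roll.
ROLLS_PER_HOUR = 60
HOURS_PER_DAY = 24

def files_per_day(topics, files_per_roll=1):
    """Number of files created per day across all topics."""
    return topics * files_per_roll * ROLLS_PER_HOUR * HOURS_PER_DAY

if __name__ == '__main__':
    # A single quiet topic: 60 * 24 files a day.
    print(files_per_day(1))
    # A hypothetical 20 topics.
    print(files_per_day(20))
```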

I’ve come up with a Python script that finds the zero-byte files in a named directory and deletes them. Eventually I want to evolve it into a script that smashes the messages together into large files that match the block size, writes those out, and deletes the source files.
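I haven't written the smashing-together part yet, but the planning half might look something like the sketch below. It only decides which source files would go into which merged file; it doesn't touch HDFS at all. The 64 MB block size, the `plan_batches` name, and the sample file names are all assumptions for illustration.

```python
# Sketch: group small files into batches whose total size fits in one block.
# BLOCK_SIZE is an assumption; use whatever the cluster is configured with.
BLOCK_SIZE = 64 * 1024 * 1024

def plan_batches(files, block_size=BLOCK_SIZE):
    """Greedily group (name, size) pairs, in order, into batches of at most
    block_size bytes.

    Zero-byte files are left out of the plan because they add nothing to the
    merged output. A file larger than block_size gets a batch to itself.
    """
    batches = []
    current, current_size = [], 0
    for name, size in files:
        if size == 0:
            continue
        # Start a new batch when adding this file would overflow the block.
        if current and current_size + size > block_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

if __name__ == '__main__':
    # Hypothetical (name, size) pairs with a tiny block size to show the grouping.
    sample = [('a', 40), ('b', 0), ('c', 50), ('d', 30), ('e', 200)]
    print(plan_batches(sample, block_size=100))
```

Keeping the files in listing order, rather than packing them as tightly as possible, means the merged output keeps the messages in the order the source files were listed.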

But first things first 😉

It was too slow deleting the files one at a time, so I modified the script to delete 10 at a time. The Python around that batching could be better, but it works, so I should get some points for that.
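One way to tidy the batching, as a sketch rather than what the script below actually does, is to pull the chunking into a small generator so that the last, short batch comes out too. The `batches` name is mine.

```python
from itertools import islice

def batches(items, size=10):
    """Yield lists of up to `size` items from any iterable,
    including the final short one."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

if __name__ == '__main__':
    # Hypothetical file names, just to show the chunk sizes.
    names = ['file%d' % i for i in range(23)]
    for chunk in batches(names):
        print(len(chunk))
```

Each chunk can then be joined with spaces and handed to a single `hadoop fs -rm` call.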

Usage: clean.py <path>

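# Python 2 only: the commands module was removed in Python 3.
import sys
import commands

def hadoop(args):
    # Run 'hadoop fs <args>' and return (exit status, combined output).
    return commands.getstatusoutput('/usr/bin/hadoop fs %s' % args)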


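def next_empty():
    # Yield the names of zero-byte files under the path given on the command line.
    path = sys.argv[1]
    status, result = hadoop('-ls %s' % path)

    if status != 0:
        print('failed: %s' % result)
    else:
        lines = result.splitlines()
        # Skip the first line of the listing and leave the last 30 entries untouched.
        for line in lines[1:len(lines) - 30]:
            word = line.split()
            kind = word[0][0]    # first character of the permissions; '-' is a plain file
            size = word[4]
            name = word[7]

            if kind == '-' and int(size) == 0:
                yield name
            else:
                print('skip %s %s %s' % (kind, size, name))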


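if __name__ == '__main__':

    bunch = []
    for name in next_empty():
        bunch.append(name)
        if len(bunch) >= 10:
            hadoop('-rm ' + ' '.join(bunch))
            bunch = []

    # Delete whatever is left over when the count isn't a multiple of 10.
    if bunch:
        hadoop('-rm ' + ' '.join(bunch))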
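
The script above is Python 2 (the `commands` module is gone in Python 3). If I were porting it, the listing parsing might look like the sketch below. It assumes, as the script does, that each entry from `hadoop fs -ls` has eight whitespace-separated columns with the size in the fifth and the path in the eighth. The `parse_ls_line` and `empty_files` names and the sample listing are made up for illustration, and unlike the script it does not leave the last 30 entries alone.

```python
from collections import namedtuple

Entry = namedtuple('Entry', 'kind size name')

def parse_ls_line(line):
    """Parse one entry line; return None for lines without the 8 expected
    columns (such as the header line of the listing)."""
    words = line.split()
    if len(words) < 8 or not words[4].isdigit():
        return None
    return Entry(kind=words[0][0], size=int(words[4]), name=words[7])

def empty_files(listing):
    """Yield the paths of zero-byte plain files in the listing text."""
    for line in listing.splitlines():
        entry = parse_ls_line(line)
        if entry and entry.kind == '-' and entry.size == 0:
            yield entry.name

# Hypothetical listing; the owner, group, dates and paths are invented.
SAMPLE = """Found 3 items
-rw-r--r--   3 scribe supergroup          0 2010-01-01 12:00 /logs/topic/part-0001
-rw-r--r--   3 scribe supergroup       1234 2010-01-01 12:01 /logs/topic/part-0002
drwxr-xr-x   - scribe supergroup          0 2010-01-01 12:02 /logs/topic/sub
"""

if __name__ == '__main__':
    print(list(empty_files(SAMPLE)))
```

Like the original, this splits on whitespace, so a path containing spaces would be cut short.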