
Clean up in Hadoop!

Or, as my little girl would say, “You need a vacuum!”

We have a logging service that uses Scribe to accept huge amounts of data into Hadoop. Well, potentially huge, once someone starts using it. The problem is that append mode is broken in the current version, so we have to roll the logs every minute, the way the guys at Facebook do.

That means we end up with a ton of small or zero-byte files in the logging topics that don’t get much traffic.

I’ve come up with a Python script that finds the zero-byte files in a named directory and deletes them. Eventually I want to evolve it into a script that smashes the messages together into large files that match the block size, writes those out, and deletes the source files.
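For what it's worth, here is a rough sketch of how the grouping step of that future script might look. It only plans which files would be merged together; it does not touch HDFS at all. The function name plan_merges and the 64 MB block size are my assumptions for illustration, so check the block size your cluster is actually configured with.

```python
# Assumed block size for illustration; use your cluster's real setting.
BLOCK_SIZE = 64 * 1024 * 1024


def plan_merges(files, block_size=BLOCK_SIZE):
    """Group (name, size) pairs, in order, into lists of names whose
    combined size stays at or under block_size. A single file that is
    already larger than block_size ends up in a group of its own."""
    groups = []
    current, total = [], 0
    for name, size in files:
        # Close the current group if this file would push it over.
        if current and total + size > block_size:
            groups.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        groups.append(current)
    return groups


if __name__ == '__main__':
    # Made-up names and sizes, with a tiny block size to keep it readable.
    sample = [('a', 40), ('b', 40), ('c', 30), ('d', 200), ('e', 10)]
    for group in plan_merges(sample, block_size=100):
        print(' '.join(group))
```

Each group would then become one merged file. Keeping the files in listing order means the messages stay in roughly the order they were logged.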

But first things first 😉

Deleting the files one at a time was too slow, so I modified the script to delete 10 at a time. The Python around that batching could be better, but it works, so I should get some points for that.

Usage: clean.py <path>

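import sys
import subprocess  # Python 3 stand-in for the old 'commands' module

BATCH = 10  # files per 'hadoop fs -rm' call
KEEP = 30   # entries at the end of the listing that are left alone

def hadoop(args):
    # Run 'hadoop fs <args>'; returns (exit status, output).
    return subprocess.getstatusoutput('/usr/bin/hadoop fs %s' % args)


def next_empty(path):
    # Yield the name of each zero-byte file listed under path.
    status, result = hadoop('-ls %s' % path)

    if status != 0:
        print('failed:', result)
        return

    lines = result.splitlines()
    # lines[0] is the "Found N items" header. max() stops the slice end
    # from going negative when the listing has fewer than KEEP entries.
    for line in lines[1:max(1, len(lines) - KEEP)]:
        word = line.split()
        kind = word[0][0]   # '-' for a file, 'd' for a directory
        size = word[4]
        name = word[7]

        if kind == '-' and int(size) == 0:
            yield name
        else:
            print('skip', kind, size, name)


if __name__ == '__main__':

    if len(sys.argv) != 2:
        sys.exit('Usage: clean.py <path>')

    bunch = []
    for name in next_empty(sys.argv[1]):
        bunch.append(name)
        if len(bunch) >= BATCH:
            hadoop('-rm ' + ' '.join(bunch))
            bunch = []
    if bunch:
        # Remove the leftovers that did not fill a whole batch.
        hadoop('-rm ' + ' '.join(bunch))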
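
Since I admitted the batching could be better, here is one way it might be tidied up: pull it into a small generator that hands back lists of up to N names, including a short final list so nothing gets left behind. The helper name batches is just something I made up for this sketch, and the names in the example are placeholders, not real HDFS paths.

```python
def batches(names, size=10):
    """Yield lists of up to `size` names; the final list may be shorter."""
    bunch = []
    for name in names:
        bunch.append(name)
        if len(bunch) >= size:
            yield bunch
            bunch = []
    if bunch:
        # Hand back the leftovers so a partial batch is not dropped.
        yield bunch


if __name__ == '__main__':
    # Placeholder names, batched two at a time.
    for bunch in batches(['a', 'b', 'c', 'd', 'e'], size=2):
        print(' '.join(bunch))
```

With something like this, the main loop shrinks to a single for loop over batches(...) that issues one -rm call per bunch.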