Category Archives: computer algorithms

Genetic vs. Exhaustive

Genetic algorithms are basically a guided trial-and-error methodology. The only advantage I can think of for a GA over an exhaustive search is that since a GA optimizes a fitness function in steps, you might get to an optimal solution faster, because the GA will favor solutions that are incrementally better. An exhaustive search that’s guaranteed to find a solution might go the long way around.

  • If “computation resources” means CPU but not memory, then the GA gene pool might give it a smaller memory footprint than the bookkeeping an exhaustive search needs.
  • On the other hand, because of hill climbing, a GA might get stuck on a local maximum & any mutations might not be sufficient to shake it loose.
  • Also, a GA’s search time grows rapidly with the size of the input, so an exhaustive search may end up being faster after all, depending on the size of the space you’re searching and whether you can constrain that space by excluding possibilities.

Lately I’ve been thinking about GAs in terms of “entropy per second” and measuring the progress of my GA as a measure of how many distinct possibilities it runs through per second. Then I can compare that to the total entropy of the problem space. If a brute force search can beat that entropy per second with parallelization or fast processors, then a GA’s “best score” isn’t any better than a discovered “best score”.

But I’ve just been noodling that over; I haven’t actually instrumented a GA that way yet.

(Entropy here is ln(N) for N equally likely states, or more generally ln(N) - sum(n * ln(n)) / N, where n is the number of ways to achieve a given result out of N possible outcomes.)
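Here’s a quick sketch of that calculation in Python (the function is mine, just to illustrate the formula, not anything from my GA code):

import math
from collections import Counter

def entropy(outcomes):
    """ln(N) - sum(n * ln(n)) / N, where n counts the ways to reach each distinct result."""
    N = len(outcomes)
    counts = Counter(outcomes)   # n for each distinct result
    return math.log(N) - sum(n * math.log(n) for n in counts.values()) / N

print(entropy(range(8)))   # 8 equally likely states: ln(8), about 2.08
print(entropy([1] * 8))    # 8 ways to reach the same single result: 0.0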



Filed under computer algorithms, machine learning, opinionizing

Monadic Downcast

lol that’s kind of a cool title.

Downcasts are always a pain in the butt in an OO language. If you’ve designed your code correctly, you should never need to do a downcast. But sometimes you get stuck where you have to.

An upcast is the good kind of object cast, for example casting an ArrayList up to a List so you can handle it more generically. That’s a good practice in OO, something you should aim for.

A downcast goes the other way, for example saying, “I know this Object is really a BigInteger, so I’m going to cast it down the class hierarchy to a more specific type.” The problem is, the compiler can’t guarantee that an Object is a BigInteger, so you’re taking a chance that an Integer or String or whatever slipped in at runtime. You’re risking a Class Cast Exception.

The classic downcast situation is parsing a JSON structure into a nested Map hierarchy, and then pulling values out of it. Each Object in the map could be a String, a List, or another Map. Sometimes, rather than encapsulating the JSON structure as a class hierarchy, it’s more convenient to express the data operation as an operation on the nested Map structure.

Anyway…

C# has a nice “as” operator that will perform a cast, and return null if the cast isn’t valid. I always really liked that operator. So, inspired by that and my latest investigation of Monads, I came up with this utility.

It’s a monadic downcast utility class. The idea is that you instantiate one with the class that you want to cast to, and then invoke that instance to perform the down-cast.

It’s monadic because:

  • It “lifts” any arbitrary class into a monadic casting space.
  • It also provides a bind operation that allows you to perform arbitrary functions on the data, without recompiling the original code.

I’ve taken shortcuts with convention to make the code shorter. Also, note the @SuppressWarnings that quiets the compiler’s warning about the downcast you’re making.

The usage is to instantiate an As object, and then use “of” to perform the cast. The result has an “isA” property (here it’s just a public field, for brevity) that tells you whether the cast is possible, and a “get” property that returns the object as the type that you want.

Example:

// use a prototype for a custom class
As<Thing> asThing = As.proto(Thing.class);
As<Thing> t = asThing.of(o);
if (t.isA) {
    t.get.doSomething();
}

// or use one of the convenience prototypes:
As<Map<String, Object>> a = As.MAP.of(source);
if (a.isA) {
    a.get.put("zoom", "pop");
}

// or do a type conversion
As<String> s = As.STRING.of("123");
assert s.isA;

As<Integer> i = s.bind(TO_INTEGER);
assert i.isA;
assert i.get == 123;

// parse int
static final Function<String, As<Integer>> TO_INTEGER = new Function<String, As<Integer>>() {
    public As<Integer> apply(String b) {
        try {
            return As.INTEGER.of(Integer.parseInt(b));
        }
        catch (NumberFormatException e) {
            return As.no();
        }
    }
};

And here’s the implementation.

public class As<A> {

    // info we're tracking
    public final A get;
    public final boolean isA;
    public final Class<?> clazz;

    // some utility prototypes
    public static final As<Map<String, Object>> MAP = As.proto(Map.class);
    public static final As<List<Object>> LIST = As.proto(List.class);
    public static final As<String> STRING = As.proto(String.class);
    public static final As<Integer> INTEGER = As.proto(Integer.class);

    /** create a prototype */
    public static <A> As<A> proto(Class<?> clazz) {
        return new As<A>(clazz, null, false);
    }

    /** no of a certain type */
    public static final <X> As<X> no() {
        return new As<X>(null, null, false);
    };

    /** generic function interface -- note no monadic restrictions here */
    public interface Function<A, B> {
        B apply(A a);
    }

    /** internal constructor */
    As(Class<?> clazz, A get, boolean isA) {
        this.get = get;
        this.isA = isA;
        this.clazz = clazz;
    }

    /** of === unit ... lift a value into monadic space */
    @SuppressWarnings("unchecked")
    public As<A> of(Object a) {
        if (clazz == null) {
            return no();
        }

        boolean is = clazz.isInstance(a);
        As<A> that = new As<A>(clazz, is ? (A)a : null, is);
        return that;
    }

    /** bind :: m a -> (a -> m b) -> m b */
    public <B> As<B> bind(Function<A, As<B>> f) {
        if (! isA) {
            // I'd like to return a prototype, but can't because of type erasure
            return As.no();
        }

        return f.apply(get);
    }

    /** a nice to string */
    public String toString() {
        StringBuilder builder = new StringBuilder();
        builder.append("{As");

        builder.append(" " + get);
        builder.append(" isA=");
        builder.append(isA);

        builder.append(" }");
        return builder.toString();
    }
}

And here’s a sample, using the classic case of traversing a generic map hierarchy to pull out data.

public class Sample0 {

    public static void main(String[] args) {
        Map<String, Object> source = sample();

        assert !As.MAP.of(null).isA;

        assert !As.MAP.of("booya").isA;

        As<Map<String, Object>> c = As.MAP.of(source.get("ma"));
        if (c.isA) {
            c.get.put("po", "op");
            System.err.println(c);

            As<String> d = As.STRING.of(c.get.get("li"));
            assert !d.isA;

            As<String> e = As.STRING.of(c.get.get("str"));
            if (e.isA) {
                System.err.println("String: " + e.get.substring(0, 1));
            }

            As<List<Object>> f = As.LIST.of(c.get.get("li"));
            if (f.isA) {
                System.err.println("List: " + f.get.get(0));
            }
        }
    }

    private static Map<String, Object> sample() {
        Map<String, Object> source = new HashMap<String, Object>();
        Map<String, Object> hoppop = new HashMap<String, Object>();
        List<String> zing = Arrays.asList("1", "2", "3");
        String booya = "booya";
        String zoopup = "zoopup";

        source.put("str", booya);
        source.put("li", zing);
        source.put("ma", hoppop);
        hoppop.put("str", zoopup);
        hoppop.put("li", zing);
        return source;
    }
}

BTW here is a demonstration that the implementation meets the Monadic Rules. Maybe I’ll post more about that later.

public class Rules {

    /** int to string */
    static final Function<Integer, As<String>> f = new Function<Integer, As<String>>() {
        public As<String> apply(Integer b) {
            return As.STRING.of("" + b);
        }
    };

    /** string to int */
    static final Function<String, As<Integer>> g = new Function<String, As<Integer>>() {
        public As<Integer> apply(String b) {
            return As.INTEGER.of(Integer.parseInt(b));
        }
    };

    /** unit int */
    static final Function<Integer, As<Integer>> u = new Function<Integer, As<Integer>>() { 
        public As<Integer> apply(Integer b) {
            return As.INTEGER.of(b);
        }
    };

    /** associative */
    static final Function<Integer, As<Integer>> r = new Function<Integer, As<Integer>>() { 
        public As<Integer> apply(Integer b) {
            return f.apply(b).bind(g);
        }
    };

    /** test the monadic rules */
    public static void main(String[] args) {

        As<Integer> m = As.INTEGER.of(1);

        leftIdentity(m);
        rightIdentity(m);
        associativity(m);
    }

    /** Left identity: return a >>= f ≡ f a */
    private static void leftIdentity(As<Integer> m) {
        As<String> left = m.bind(f);
        As<String> right = f.apply(1);
        System.err.println(left + " " + right);
        assert left.toString().equals(right.toString());
    }

    /** Right identity: m >>= return ≡ m */
    private static void rightIdentity(As<Integer> m) {
        As<Integer> left = m.bind(u);
        As<Integer> right = m;
        System.err.println(left + " " + right);
        assert left.toString().equals(right.toString());
    }

    /** Associativity: (m >>= f) >>= g ≡ m >>= (\x -> f x >>= g) */
    private static void associativity(As<Integer> m) {
        As<Integer> left = m.bind(f).bind(g);
        As<Integer> right = m.bind(r);
        System.err.println(left + " " + right);
        assert left.toString().equals(right.toString());
    }

}


Filed under computer algorithms, computer OOD, design patterns, utility

Squeeze More into MongoDB with Compressed Docs

If you’re using the write-once strategy for a collection in MongoDB, then there’s an opportunity to optimize space usage by compressing the documents in that collection. Compressing the documents has some nice consequences:

  • faster transmission time over the network
  • smaller footprint in RAM (i.e. a smaller working set)
  • faster reads/writes to disk
  • less space taken up on disk — although, practically, we hardly care about that any more

There are some tradeoffs, of course:

  • can’t use the nice update operations like set or push (but we’ve already stipulated this is write-once data)
  • troubleshooting is a little harder because you can’t just find() the docs and look at them — you have to use a script to extract and decompress them.
  • more CPU usage — essentially this is a strategy for trading RAM for CPU. (In our case, CPU is cheaper and more easily scaled.)

Fortunately, Python makes it easy to pickle (serialize) and zlib.compress (compress) documents going into the database. And the same tools can restore the original documents.

We want the process to be smart to provide for an easy transition. If we fetch a document and it’s not compressed, just use it as-is. If it is compressed, then open it up. We accomplish that by pushing the compressed data into a field with a special name: “!”. There’s a low chance of collision with that. 😉 If we pull a document and it contains a bang, we decompress that field and return it as the document. If the bang attribute is missing, then we return the document as-is. That allows compressed and non-compressed documents to live side by side, and hopefully over time the collection will contain more compressed documents as we write them in.

Two details I caught in testing were:

  • If the document has an _id field, we want to keep that visible outside the compressed field, because we want the document’s identifier to remain the same.
  • We have to provide for a push_batch operation for the times when we’re pushing an array of documents and not just individuals.

This python module is written in a generic way. It’s easy to apply it anytime you want to compress a stream of dicts, or any other python object that can be pickled.

Usage with a mongo collection looks like this:

compressor = DocCompressor()

# inserting documents

collection.insert(compressor.compress(one_doc))

# ...or...

docs = [some list of docs]
compressor.push(collection.insert, docs)

# getting documents

docs = collection.find()
for doc in compressor.pull(docs):
    ....

And here’s the source code for the module:

import cPickle as pickle
import zlib
import base64 as b64


class DocCompressor(object):
    """
    Responsible for compressing and decompressing documents to and from
    Mongo (or any datastore, really).

    The class also provides a generator, decompressing documents as they
    come from the datastore as well.
    """

    COMPRESSED_KEY = '!'

    def compress(self, doc):
        """
        Compress a document and return a new document containing the compressed one.
        """
        # squash and encode it
        pickled = pickle.dumps(doc, pickle.HIGHEST_PROTOCOL)
        squished = zlib.compress(pickled)
        encoded = b64.urlsafe_b64encode(squished)

        # return a doc containing the compressed doc; the _id gets a free ride
        compressed = {self.COMPRESSED_KEY: encoded}
        self._copy_id(doc, compressed)

        return compressed

    def decompress(self, doc):
        """
        If the document contains a compressed document, return that. Otherwise return the original.
        """
        if self.COMPRESSED_KEY not in doc:
            # not compressed; it is what it is
            return doc

        # the compressed doc is there, pull it out
        unencoded = b64.urlsafe_b64decode(str(doc[self.COMPRESSED_KEY]))
        decompressed = zlib.decompress(unencoded)
        orig_doc = pickle.loads(decompressed)
        self._copy_id(doc, orig_doc)

        # return the original doc
        return orig_doc

    def _copy_id(self, a, b):
        """If the _id is there, it gets a free ride from a to b."""
        if '_id' in a:
            b['_id'] = a['_id']

    def pull(self, source):
        """When acting as a generator, get the next doc and decompress it.

        Usage:
            for doc in compressor().pull(source):
                ...
        """
        for doc in source:
            yield self.decompress(doc)

    def push(self, target, source):
        """
        Pull all the docs from the source and pass each one as the (single)
        parameter to the target function. Follows the semantics of map.

        Usage:
            target = lambda x: ...
            compressor().push(target, source)
        """
        for doc in source:
            target(self.compress(doc))

    def push_batch(self, target, source):
        """
        Pull all the docs from the source and send them as a single list
        parameter to the target function.

        Usage:
            target = lambda x: ...
            compressor().push_batch(target, source)
        """
        batch = [self.compress(doc) for doc in source]
        target(batch)


Filed under computer algorithms, computer scaling, mongodb, python

What is a Genetic Algorithm?

“Genetic algorithm” is also called “genetic search”. It’s modeled after an idealized view of natural selection, but very, very loosely. Really it’s a method of making guesses at an answer to a question, based on combinations of previous answers and how well they did.

 A genetic search starts with two things:

 1) A population of solutions encoded as “gene” strings. These are usually randomly generated to start with.

 2) One or more methods for combining and mutating those strings.

Implicit in both of those is a way to “score” the strings, so that you can tell good answers from bad ones.

The genetic algorithm is just a process:

a) select two gene strings from the population

b) combine them into a new solution

c) score that solution to see how good it is

d) put that solution into the population & remove one that has a lower score

So, over time, the population contains strings that reflect the best parts of their ancestors & have the highest scores. If there’s a single “right” solution, eventually the algorithm will probably find it. If there’s no “right” solution, the algorithm will ruthlessly optimize the scores to find ones that work best.
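To make that loop concrete, here’s a toy version in Python that guesses a target word. The population size, mutation rate, and fitness function are all just stand-ins I made up for illustration:

import random
import string

TARGET = "booga"
GENES = string.ascii_lowercase

def score(chromo):
    # fitness: how many positions already match the target
    return sum(1 for a, b in zip(chromo, TARGET) if a == b)

def combine(a, b):
    # crossover: pick each gene from one parent or the other, with occasional mutation
    child = [x if random.random() < 0.5 else y for x, y in zip(a, b)]
    if random.random() < 0.1:
        child[random.randrange(len(child))] = random.choice(GENES)
    return ''.join(child)

# start with a randomly generated population of gene strings
pool = [''.join(random.choice(GENES) for _ in TARGET) for _ in range(50)]

while max(score(c) for c in pool) < len(TARGET):
    mom, dad = random.sample(pool, 2)      # a) select two gene strings
    child = combine(mom, dad)              # b) combine them into a new solution
    worst = min(pool, key=score)
    if score(child) > score(worst):        # c) score it against the weakest member
        pool[pool.index(worst)] = child    # d) put it in, drop a lower scorer

print(max(pool, key=score))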

It all sounds vaguely racist, maybe not so vaguely, but you have to remember that a) it’s just a computer algorithm and not a metaphor for natural processes, and b) animals and humans don’t have a “score” associated with them.

The search can get hung up on a local maximum, so sometimes you have to blow away the population and start over. But there’s no rule against keeping a log of the best solutions so you can see how they did over time.

One of the neat things about genetic algorithms is that, because the algorithm is mindlessly and ruthlessly optimizing that score, it will often discover solutions you didn’t consider, or even don’t like, but which meet the criteria best. That can be fun, and it can also be frustrating. But I think it’s more fun than frustrating.

 


Filed under computer algorithms, machine learning

Jumping through a Large Database for a Fair Sample

I have a really big database that I want to sample from. The traditional way to sample a set of values goes like this:

To sample n values from a set of values of size T:

+ if uniform() < n / T, take the current value, and subtract 1 from n
+ subtract 1 from T
+ repeat until either n or T is 0

That will produce a “fair” or “unbiased” sample.
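For reference, a bare-bones version of that classic pass looks like this (throwaway code of mine, not from any library):

import random

def classic_sample(values, n):
    """Walk the whole list once, keeping each value with probability n / T."""
    T = len(values)
    picked = []
    for v in values:
        if T > 0 and random.uniform(0, 1) < float(n) / T:
            picked.append(v)
            n -= 1
        T -= 1
    return picked

print(classic_sample(range(100), 10))   # ten values, unbiased, but it touches all 100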

The problem is, if my original set is really huge (mine is 60 million and growing), I don’t have time to roll the die 60 million times. (!) Instead, I want to roll the die and pick an item — exactly n times. Then keep skipping and picking until I have all n values.

So I thought about it, and got some help from http://math.stackexchange.com

It boils down to this:

+ If I picked n items randomly *all at once*, where would the first one land? That is, what is min({r_1 … r_n})? A helpful fellow at math.stackexchange boiled it down to this equation:

x = 1 - (1 - r) ** (1 / n)

The long story (omitted here) is that the CDF of the minimum of n uniform draws is 1 - (1 - x) ** n; set that equal to r and solve for x. Pretty easy.

Then….

+ If I generate a uniform random number and plug it in for r, this is distributed the same as min({r_1 … r_n}) — the same way that the lowest item would fall. Voila! I’ve just simulated picking the first item as if I had randomly selected all n.

+ So I skip over that many items in the list, pick that one, and then….

+ Repeat until n is 0

That way, if I have a big database (like Mongo), I can skip, find_one, skip, find_one, etc. Until I have all the items I need.

The only problem I’m having is that my implementation favors the first and last element in the list. But I can live with that.

In Python 2.7, my implementation looks like:

import numpy
import pprint

def skip(n):
    """
    Produce a random number with the same distribution as
    min({r_0, ... r_n}) to see where the next smallest one is
    """
    r = numpy.random.uniform()
    return 1.0 - (1.0 - r) ** (1.0 / n)

def sample(T, n):
    """
    Take n items from a list of size T
    """
    t = T
    i = 0
    while t > 0 and n > 0:
        s = skip(n) * (t - n + 1)
        i += s
        yield int(i) % T
        i += 1
        t -= s + 1
        n -= 1

if __name__ == '__main__':

    t = [0] * 100
    for c in xrange(10000):
        for i in sample(len(t), 10):
            try:
                t[i] += 1
            except:
                print c, i

    pprint.pprint(t)
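To run that against Mongo, the skip-and-pick loop could look something like this with pymongo, using the sample() generator above (the collection handling here is my assumption, not code from the project):

def sample_collection(collection, n):
    """Jump through the collection, pulling one doc at each randomly chosen offset."""
    total = collection.count()   # count_documents({}) on newer pymongo
    for index in sample(total, n):
        for doc in collection.find().skip(index).limit(1):
            yield doc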


Filed under computer algorithms, computer scaling, mongodb, python, utility

Genetic Algorithms: The Scitter Combiner

So I’ve been playing with a genetic algorithm to find small neural networks for simple problems. Mostly I can’t get the damn things to converge, but it’s still fun to watch.

I noticed one thing: when the pool is getting close to a solution, it’s filled with very similar results. The pool has mostly settled on a solution — one that scores very high — but it’s trying to find the last piece that will complete the picture. It can spend thousands or tens of thousands of cycles waiting for a mutation that corrects the last bit.

So I came up with a “scitter” combination for just that situation. The idea is that given two chromosomes, where the genes match I leave them alone. But where they differ is where the action is. So I just randomize those.

Hm. Maybe I should have called it “directed mutation”.

I was amazed at how rapidly it converges for certain types of problems. I’m in the process of trying it on more of an “artificial life” problem next.

There’s one catch. Let’s take a simple example where I’m using a genetic algorithm to guess a word. Let’s say the word is “booga”. Let’s say the gene pool selects these two words to combine:

target: booga
chromo0: xooga
chromo1: tooga

Now, the scitter combiner is going to randomize the first gene (i.e. the first letter), and leave the rest alone. But if it randomly selects “x” or “t” instead of “b”, then instead of finding a solution, it will lose information. And it’s twice as likely to do that as to guess right! I found that scitter can flood the gene pool with identical strings. When that happens, that, of course, is the end of that.

So what I do is, if I hand two identical strings to scitter, it returns a completely random string. It sort of re-rolls all the dice to try to figure out the one that counts.

In Python, scitter looks like this:

def scitter(brain0, brain1):
    """Randomize the genes that don't match.

    The idea is, as the solution converges, there will be a small number of
    holdouts that we need to guess. By keeping the matches the same and
    scrambling the mis-matches, we're taking a shot at guessing the correct
    values.
    """
    if brain0 == brain1:
        # otherwise the pool becomes a single pattern;
        # random() is the helper (defined elsewhere) that builds a whole new chromosome
        return random()

    # build the new brain, gene by gene
    new_genes = []

    for i in xrange(len(brain0)):
        if brain0[i] == brain1[i]:
            # keep the genes that already agree
            g = brain0[i]
        else:
            # scramble the ones that don't; random_gene() is defined elsewhere
            g = random_gene()
        new_genes.append(g)

    return ''.join(new_genes)


Filed under computer algorithms, utility

Percentile Estimation on a Stream

For a while I’ve had a cheesy way of estimating percentile values for measurements taken on streaming data. That is, as data streams in, I want to be able to periodically get estimates of the percentile values for values coming in on the stream.

For example, what is the 90th percentile value of byte sizes for messages coming into my service?

Or another example is, what is the 99th percentile of elapsed times to process these messages?

It’s easy to capture min, max, and average for a set of values over some time range. But percentiles, including the median (the 50th percentile), are much harder. To get them exactly right, you have to keep all the individual values, sort them, and then find the values 50%, 90%, or 99% of the way through the sorted list.

It might be impractical to set aside memory or disk space to hold all the values coming in — there could be thousands or tens of thousands per second. And even a fast sort could be prohibitive as well. Who has that kind of time on a loaded system? That’s why some sort of estimation technique for percentiles is useful.

So far, the implementation I’ve used (suggested by one of my favorite managers), is to set aside a set of buckets, as for a histogram. As you capture values coming in, increment a counter for the right bucket. When you’re ready to estimate the percentiles, sum up all the buckets, then walk through the buckets adding the values, and when you’ve reached 50% of the total, use the value associated with that bucket as the 50% percentile. Keep going until you hit all the percentiles that you care about.

There are some problems with that, but for quick, broad estimates of percentile values, it works OK. You just have to decide how many buckets you need and carry that many longs or ints in memory — depending on how worried you are about overflow.
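A bare-bones version of that bucket scheme might look like this (the fixed bucket width, the class name, and the choice to report the bucket’s upper edge are all my own stand-ins):

class BucketPercentiles(object):
    """Histogram-style percentile estimator: fixed-width buckets, one counter each."""

    def __init__(self, bucket_width=10, bucket_count=1000):
        self.width = bucket_width
        self.counts = [0] * bucket_count

    def record(self, value):
        # clamp into the last bucket if the value is off the top of the histogram
        index = min(int(value / self.width), len(self.counts) - 1)
        self.counts[index] += 1

    def percentile(self, p):
        """Walk the buckets until we've covered p percent of the observations."""
        target = sum(self.counts) * p / 100.0
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                # report the upper edge of this bucket as the estimate
                return (index + 1) * self.width
        return len(self.counts) * self.width

Calling percentile(50) or percentile(99) then walks the counters until the running total crosses the threshold, so the estimate is only as accurate as the bucket width, which is exactly the quick, broad estimate described above.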

My next enhancement was going to provide for larger buckets at the bottom than at the top, since usually we don’t care about percentile values below the median. Also, combine the buckets with some sort of circular array to hold the highest set of values, since the 99.99th or 99.999th value is probably going to represent a small number of measurements. Rather than carrying buckets to go that high, it’s probably cheaper to just keep the top handful of values.

In my Google prep, I did a heap implementation, and that looked like a very promising possibility. Rather than have fixed buckets, I thought I might have buckets that live in a heap. Adding a value means either adding a new bucket if the heap isn’t full, or else incrementing an existing bucket if it is. The buckets lower in the heap could be bigger, and I might even have some logic to merge buckets if one gets pushed past the middle of the heap.

All that said….

This paper looks like it addresses that problem directly. Now I just need to understand it enough to convert it into code 😉

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.4993


Filed under computer algorithms, computer reading, utility