After a long hiatus, I’ve added some more items to the service feature checklist. A couple of them are still very blurry, so I might pin them down better later.
I added some items around monitoring. It starts with the basic “alive” monitor that just acts like a hello world link. Note that the version link serves nicely for this. Then the monitoring can get deeper into internal processes, to make sure all the necessary queues, threads, and such are up and running correctly.
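To make the “alive” monitor concrete, here’s a minimal sketch. The service name, version string, and the idea of serving them as JSON are all my own assumptions; the point is just that if this handler answers at all, the process is up, and serving the version info makes it double as the hello-world link.

```python
import json

# Hypothetical build info; a real service would bake this in at build time.
SERVICE_VERSION = {"name": "example-service", "version": "1.4.2", "build": "abc123"}

def handle_alive():
    """The basic "alive" monitor: if this returns at all, the process is up.
    Serving the version info lets the version link double as the hello-world link."""
    return 200, json.dumps(SERVICE_VERSION)

status, body = handle_alive()   # wire this to GET /alive in your web framework
```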
The next one might bear further discussion. The idea is to provide monitors for remote systems our system is dependent on. For services that we own ourselves, we only need to provide a connection check. The idea is, we should be setting up the same sort of monitoring for all our services, so if one starts misbehaving, its monitors should speak up. If a queue dies or somesuch, I don’t need my entire infrastructure to light up and start flashing alarms.
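A connection check for a service we own can be as shallow as “can I open a socket at all?” — anything deeper is that service’s own monitors’ job. A sketch, assuming plain TCP dependencies:

```python
import socket

def connection_check(host, port, timeout=1.0):
    """Shallow dependency check: can we open a TCP connection at all?
    If the dependency is unhealthy internally, its own monitors should say so."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:           # refused, timed out, unreachable, ...
        return False
```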
Now, if I have a dependency on a remote system that I don’t own and don’t monitor, then it might make sense to have a little deeper check. But hey, the checklist items are prescriptions, not proscriptions.
A really good service should be able to report on error conditions that it’s seeing. You shouldn’t have to scrape a log file to find out if something’s going wrong. The error reports could be a part of monitoring. Or they could be for human consumption, as a part of the troubleshooting process. Keeping some sort of rolling log in memory would help reveal some history of the problems.
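The rolling in-memory log can be as simple as a bounded deque; old entries fall off the back automatically. The class name and capacity here are made up for illustration:

```python
import collections
import time

class RollingErrorLog:
    """Keep the last N error reports in memory, so a human (or a monitor)
    can ask "what's been going wrong lately?" without scraping log files."""
    def __init__(self, capacity=100):
        self._entries = collections.deque(maxlen=capacity)  # oldest drop off

    def record(self, message):
        self._entries.append((time.time(), message))

    def recent(self):
        return list(self._entries)

log = RollingErrorLog(capacity=3)
for msg in ["timeout A", "timeout B", "queue full", "retry failed"]:
    log.record(msg)
# Only the newest three survive; "timeout A" has been discarded.
```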
Consider enabling your system to push metrics and alarms to a remote system. That way, in a huge bank of servers, you don’t have to do brutal configuration of monitoring for hundreds of machines.
Whether the numbers are gathered inside the application, or dumped to an external system (like a database) and crunched, you really need a way to see the performance of the system over time. So a service should report numbers about what it’s doing. The two basic numbers it should provide are counts and timings. Counts are just the number of times something happened: e.g. how many POST calls we made to a remote system. Timings are how long it took to do it.
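Counts and timings don’t need much machinery in-process. Here’s a minimal sketch (the class and metric names are mine, not any particular library’s): a thread-safe counter plus a context manager that records elapsed time.

```python
import collections
import threading
import time
from contextlib import contextmanager

class Metrics:
    """Tiny in-process metrics: counts (how many times something happened)
    and timings (how long it took)."""
    def __init__(self):
        self._lock = threading.Lock()
        self.counts = collections.Counter()
        self.timings = collections.defaultdict(list)

    def count(self, name, n=1):
        with self._lock:
            self.counts[name] += n

    @contextmanager
    def timed(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            with self._lock:
                self.timings[name].append(elapsed)

metrics = Metrics()
with metrics.timed("remote_post"):   # timing: how long did the POST take?
    metrics.count("remote_post")     # count: how many POSTs did we make?
```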
For timing, milliseconds used to be just fine. But in the modern world, things happen much faster. For example, we had a hash-based authorization system that I timed to run at 250-400 *micro*seconds.
Also, watch out for CPUs. If the cores on your box have clocks that aren’t perfectly synchronized (and that can happen), then you can actually get negative elapsed times if your process jumps from one CPU to another.
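Both problems — microsecond resolution and clocks that jump around — are why a monotonic, high-resolution clock is the right tool for elapsed-time measurement. A sketch (the helper name is mine):

```python
import time

def elapsed_us(fn, *args):
    """Time a call using a monotonic, high-resolution clock.
    Unlike wall-clock time or raw per-CPU counters, a monotonic clock
    can't run backwards, so you never see negative elapsed times."""
    start = time.perf_counter()
    fn(*args)
    return (time.perf_counter() - start) * 1_000_000  # microseconds

us = elapsed_us(sum, range(1000))
```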
I found another measure to be really useful: a “count per second” measure. For extremely fast transactions, it can be useful to count the number of transactions that happened in the past second, and then add that number to your metrics store. It’s interesting to see the “count per second” numbers jump up or down.
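A count-per-second measure boils down to: tally hits for the current second, then once a second push the tally somewhere and reset. Here’s a minimal sketch, assuming something external (a timer thread, say) calls the rollover; the names are made up:

```python
import threading

class PerSecondCounter:
    """Tally events for the current second; on rollover, publish the
    finished count to a metrics store and start a fresh second."""
    def __init__(self, publish):
        self._publish = publish        # callback that receives each second's count
        self._lock = threading.Lock()
        self._count = 0

    def hit(self, n=1):
        with self._lock:
            self._count += n

    def rollover(self):
        """Call once per second, e.g. from a timer thread."""
        with self._lock:
            count, self._count = self._count, 0
        self._publish(count)

history = []
counter = PerSecondCounter(history.append)
for _ in range(5):
    counter.hit()
counter.rollover()   # history now holds this second's count: [5]
```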
While most metrics can be aggregated outside the service, aggregating some critical metrics right in memory and making them available for viewing is a really handy thing to do. So even if you’re going to dump your numbers into a data warehouse and crunch them there, you can get some really important statistics by tallying the numbers right in memory and having them current to within microseconds.
Over time, your service is going to pick up a lot of mechanisms. Some of these mechanisms need to start up in an orderly way, and need to be shut down in an orderly way. For example, execution thread pools need time to do a graceful shutdown so that items still in their queue remain in an acceptable state.
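The thread-pool case can be sketched directly with Python’s standard executor: shutting down with `wait=True` drains the queue instead of dropping work on the floor.

```python
from concurrent.futures import ThreadPoolExecutor

# A pool that drains its queue on shutdown, so queued work items end up
# in an acceptable state instead of being abandoned mid-flight.
results = []
pool = ThreadPoolExecutor(max_workers=2)
for i in range(10):
    pool.submit(results.append, i)

# wait=True blocks until every already-submitted task has run.
pool.shutdown(wait=True)
```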
A more subtle aspect of this is remote systems that know about your service. For example, your service might tell your monitoring service that it’s going down for maintenance, and save everyone some trouble and grief, especially as your number of services increases.
Async and Caching
I’m surprised how often the need for these items comes up in my work, and it seems like when the need for asynchronous processing or data caching comes up, it almost instantly sparks a whole series of religious wars. The NoSql guys and the Sql guys go at it. The Stateful vs. Stateless war rises in flames. And above the din of battle, the “We don’t need that!” crowd can always be heard.
Come on. A mature system needs asynchronous processing. And a mature system needs a way to maintain lightweight state, without jamming it into a full-blown relational database. So figure it out early, and then use it a lot. My experience is that once a shop makes that leap, then suddenly things become a lot quieter. Tough issues can just be handled matter-of-factly, because the service already has the infrastructure it needs.
Note that both asynchronous processing and distributed caching do add overhead. You have to write code for them, monitor them, and build out your data center for them. Nothing is free. But once they’re in place, then you can use them for all kinds of things.
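To give the “lightweight state” idea some shape, here’s a toy in-memory cache with expiring entries — a sketch of the concept, not a substitute for a real distributed cache. The class name, TTL parameter, and the injectable clock (handy for testing) are all my own choices:

```python
import time

class TTLCache:
    """Lightweight in-memory state: entries expire after ttl seconds.
    A real shop would likely put this behind a shared caching layer."""
    def __init__(self, ttl=60.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock    # injectable so tests can fake the passage of time
        self._store = {}

    def put(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)

    def get(self, key, default=None):
        item = self._store.get(key)
        if item is None:
            return default
        value, expires = item
        if self._clock() >= expires:
            del self._store[key]   # lazily evict on read
            return default
        return value
```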
That’s it for now. I might come back and refine the async and caching bits.