Service Base Feature Checklist

Version Reporting

Logging

provide a trace ID, either generated or taken from a request header, on each log message to correlate a thread’s activities
for monitoring & alerting, use levels as follows:
- INFO *sparse* informative messages e.g. one-time initialization, periodic milestones, cycle start/stop, periodic metrics
- WARN events to watch over time or that might bear investigation e.g. client errors, retries
- ERROR a condition that *requires* someone to look at it and evaluate it; if you can ignore it, it’s a WARN
- CRITICAL a system failure that needs immediate resolution
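
As an illustration of the trace ID item above, here is a minimal sketch using Python's standard `logging` and `contextvars` modules. The header name `X-Trace-Id` and the helpers `start_trace` and `build_logger` are assumptions made for this example, not part of the checklist. Note that Python's `logging` names the WARN level `WARNING`.

```python
import contextvars
import logging
import uuid

# Holds the trace ID for the current thread or async task.
_trace_id = contextvars.ContextVar("trace_id", default="-")


def start_trace(headers):
    """Reuse the caller's trace ID if present, otherwise generate one.

    "X-Trace-Id" is a hypothetical header name chosen for this sketch.
    """
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = _trace_id.get()
        return True


def build_logger(name, stream=None):
    """Return a logger whose every line carries the trace ID."""
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s [%(trace_id)s] %(message)s")
    )
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling `start_trace(request_headers)` at the start of each request means every later `logger.info(...)` or `logger.error(...)` call on that thread includes the same ID, so one request's messages can be grouped when searching logs.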

Monitoring

provide an “alive” URL for monitoring
provide a URL for deeper health checks (“healthy” internal processes)
provide a URL to validate *connectivity* to remote system dependencies
Note that we just care about connectivity here. The remote system should have its own health monitor.
provide a URL that reports error conditions, current and past
report errors (push) to a remote system (in addition to logging)
a good rule of thumb: if it sets off someone’s pager, it’s an ERROR; otherwise it’s a WARN
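
The "alive", "healthy", and connectivity checks above could be backed by functions like the following, which a web framework would then expose at URLs. This is a sketch under assumptions: the function names, the report layout, and the `probe` parameter are all invented for illustration. The connectivity check only opens a TCP connection, because the remote system is expected to have its own health monitor.

```python
import socket


def check_alive():
    """'Alive' check: the process is up and able to answer."""
    return {"status": "ok"}


def check_connectivity(host, port, timeout=1.0):
    """Connectivity only: can a TCP connection to the dependency be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_report(internal_checks, dependencies, probe=check_connectivity):
    """Combine internal checks and dependency connectivity into one report.

    internal_checks: name -> zero-argument callable, truthy when healthy
    dependencies:    name -> (host, port)
    probe:           callable(host, port) -> bool, replaceable in tests
    """
    report = {"internal": {}, "dependencies": {}}
    for name, check in internal_checks.items():
        try:
            report["internal"][name] = bool(check())
        except Exception:
            # A check that raises counts as unhealthy.
            report["internal"][name] = False
    for name, (host, port) in dependencies.items():
        report["dependencies"][name] = bool(probe(host, port))
    healthy = all(report["internal"].values()) and all(
        report["dependencies"].values()
    )
    report["status"] = "ok" if healthy else "degraded"
    return report
```

Keeping the "alive" check trivial means a load balancer can poll it cheaply, while the deeper report can be polled less often.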

Runtime Metrics

identify tasks; each task has an elapsed time and at least one count associated with it
provide running counts & elapsed time measurements of important system functions
provide running counts of error conditions
provide running counts aggregated into categories based on processing data
provide in-memory aggregation & statistics about key running counts
provide system-wide counts of errors and warnings
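
One possible shape for the running counts and elapsed-time measurements above is a small in-memory registry. The class name `Metrics` and the key naming scheme (`<task>.count`, `<task>.errors`, `errors.total`) are assumptions made for this sketch.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """In-memory running counts and elapsed times, keyed by task name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)
        self._elapsed = defaultdict(float)

    def incr(self, name, amount=1):
        """Add to a running count."""
        with self._lock:
            self._counts[name] += amount

    @contextmanager
    def task(self, name):
        """Time one run of a task; count runs and failures."""
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.incr(f"{name}.errors")
            self.incr("errors.total")  # system-wide error count
            raise
        finally:
            elapsed = time.perf_counter() - start
            with self._lock:
                self._counts[f"{name}.count"] += 1
                self._elapsed[name] += elapsed

    def snapshot(self):
        """Return counts plus total and mean elapsed seconds per task."""
        with self._lock:
            stats = {}
            for name, total in self._elapsed.items():
                runs = self._counts[f"{name}.count"]
                stats[name] = {"total_s": total, "mean_s": total / runs}
            return {"counts": dict(self._counts), "elapsed": stats}
```

Wrapping a unit of work in `with metrics.task("import"):` records its elapsed time and run count, and `metrics.incr("import.records", n)` adds a data-driven count inside it.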

Configuration

a single mechanism for configuration values, documentation, and ingestion/usage so they don’t get out of sync
Don’t repeat yourself!
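
A minimal sketch of the single-mechanism idea: each setting is declared once, and both the loader and the generated documentation read from that one declaration. The setting names, defaults, and the `Setting` class are hypothetical and exist only to illustrate the pattern.

```python
import os
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Setting:
    name: str                        # environment variable name
    default: str                     # raw value used when the variable is unset
    parse: Callable[[str], object]   # converts the raw string
    doc: str                         # one-line description for generated docs


# Single source of truth: names, defaults, types, and docs live together.
# These particular settings are hypothetical examples.
SETTINGS = [
    Setting("SERVICE_PORT", "8080", int, "TCP port the service listens on"),
    Setting("WORKER_COUNT", "4", int, "Number of worker threads"),
    Setting("LOG_LEVEL", "INFO", str, "Minimum log level"),
]


def load_config(env=None):
    """Ingest values from the environment (or a supplied mapping)."""
    env = os.environ if env is None else env
    return {s.name: s.parse(env.get(s.name, s.default)) for s in SETTINGS}


def describe_config():
    """Generate documentation from the same declarations."""
    return "\n".join(
        f"{s.name} (default {s.default}): {s.doc}" for s in SETTINGS
    )
```

Because `describe_config()` is derived from `SETTINGS`, adding or renaming a setting updates its documentation and its ingestion in the same place.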

Documentation

Lifecycle Management

All instantiated objects should have their lifecycle managed by the application context: create, initialize, destroy
It should be easy to hook objects into the application context lifecycle, either declaratively or through discovery.
Note: things like object pools or execution pools need to be shut down gracefully and cleaned up!
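
The lifecycle idea above can be sketched as a small context object that initializes registered objects in order and destroys them in reverse. The `ApplicationContext` class here is a hypothetical stand-in written for this example, not a reference to any specific framework, and it assumes managed objects expose `initialize()` and `destroy()` methods.

```python
import logging

log = logging.getLogger("lifecycle")


class ApplicationContext:
    """Owns managed objects: initialize in order, destroy in reverse."""

    def __init__(self):
        self._registered = []
        self._initialized = []

    def register(self, obj):
        """Hook an object into the lifecycle; returns it for convenience."""
        self._registered.append(obj)
        return obj

    def start(self):
        for obj in self._registered:
            obj.initialize()
            self._initialized.append(obj)

    def stop(self):
        # Reverse order, and keep going if one destroy() fails,
        # so every pool still gets a chance to shut down.
        while self._initialized:
            obj = self._initialized.pop()
            try:
                obj.destroy()
            except Exception:
                log.exception("destroy failed for %r", obj)

    def __enter__(self):
        try:
            self.start()
        except Exception:
            self.stop()  # clean up whatever did initialize
            raise
        return self

    def __exit__(self, exc_type, exc, tb):
        self.stop()
        return False  # do not swallow exceptions from the body
```

Running the service inside `with context:` means pools and connections are destroyed even when the body raises.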

Runtime Asynchronous Processing

WIP the system needs a plan for how to handle asynchronous processing in a way that’s standard throughout the system.
- e.g. JMS Queue? Execution thread pool? Batch from database table? In-memory? Persistent?
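
To make one of the options above concrete, here is a sketch of the execution-thread-pool choice behind a single shared entry point. It is in-memory and non-persistent, so queued work is lost if the process dies. The `AsyncRunner` name is an assumption for this example, and the sketch does not suggest this option is the right one for the system.

```python
from concurrent.futures import ThreadPoolExecutor


class AsyncRunner:
    """One shared entry point for background work, backed by a thread pool."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(
            max_workers=max_workers, thread_name_prefix="async"
        )

    def submit(self, fn, *args, **kwargs):
        """Schedule fn(*args, **kwargs); returns a concurrent.futures.Future."""
        return self._pool.submit(fn, *args, **kwargs)

    def shutdown(self):
        # wait=True lets queued and in-flight work finish before returning.
        self._pool.shutdown(wait=True)
```

Routing all background work through one such object keeps the mechanism swappable later, for example for a persistent queue.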

Offline Batch Processing

standard mechanism for parallelizing offline tasks by breaking them into batches that can be distributed across machines
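
The batching step might look like the sketch below: split the work into fixed-size batches, then assign the batches round-robin to machines. The helper names and the machine identifiers are hypothetical, and actually shipping a batch to a machine is out of scope here.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def assign(batch_list, workers):
    """Round-robin batches across worker names (hypothetical machine IDs)."""
    plan = {w: [] for w in workers}
    for i, batch in enumerate(batch_list):
        plan[workers[i % len(workers)]].append(batch)
    return plan
```

For example, `assign(list(batches(range(7), 3)), ["m1", "m2"])` gives `m1` the batches `[0, 1, 2]` and `[6]`, and `m2` the batch `[3, 4, 5]`.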

State Management / Distributed Caching

WIP the system should have a highly available, horizontally scalable mechanism for saving processing state for longer transactions (such as user sessions or inputs to a long-lived messaging conversation)
- e.g. external (memcached), internal with broadcasting (Ehcache), NoSQL, relational, in-memory, persistent
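
Whichever backend is chosen, callers can be kept independent of it with a small interface. The sketch below assumes a hypothetical `StateStore` interface with an in-memory implementation and per-entry expiry. The in-memory version is neither highly available nor horizontally scalable; it only shows the contract an external store would implement.

```python
import time
from abc import ABC, abstractmethod


class StateStore(ABC):
    """Backend-neutral contract for saving processing state."""

    @abstractmethod
    def put(self, key, value, ttl_seconds):
        """Save value under key, expiring after ttl_seconds."""

    @abstractmethod
    def get(self, key):
        """Return the saved value, or None if missing or expired."""


class InMemoryStateStore(StateStore):
    """Single-process stand-in, useful for tests and local runs."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested
        self._data = {}

    def put(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # drop expired state lazily
            return None
        return value
```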