Service Base Feature Checklist

Version Reporting

  • report version dynamically
  • report version in a text file in the deployment directory — if applicable
  • report version inside the .war file — probably in the Manifest
  • embed version in any generated client-facing pages
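
As a minimal sketch of reading the version from the Manifest, the helper below parses a manifest stream and returns its Implementation-Version attribute. The class name VersionInfo and the "unknown" fallback are assumptions made for illustration. In a .war the stream would typically come from something like ServletContext.getResourceAsStream("/META-INF/MANIFEST.MF"), which is left out here to keep the example free of servlet dependencies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper: the name and the "unknown" fallback are illustrative choices.
final class VersionInfo {
    private VersionInfo() {}

    /** Returns the Implementation-Version main attribute, or "unknown" if it is absent. */
    static String fromManifest(InputStream in) {
        try {
            Manifest manifest = new Manifest(in);
            String version = manifest.getMainAttributes()
                    .getValue(Attributes.Name.IMPLEMENTATION_VERSION);
            return version != null ? version : "unknown";
        } catch (IOException e) {
            // Surface read failures without forcing callers to handle a checked exception.
            throw new UncheckedIOException(e);
        }
    }
}
```

The same string can then back the dynamic version report and any client-facing pages, so the version is defined in one place only.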

Logging

  • attach a trace ID, either generated or taken from a request header, to each log message so that a thread’s activities can be correlated
  • for monitoring & alerting, use levels as follows
    • INFO *sparse* informative messages e.g. one-time initialization, periodic milestones, cycle start/stop, periodic metrics
    • WARN events to watch over time or that might bear investigation e.g. client errors, retries
    • ERROR a condition that *requires* someone to look at and evaluate; if you can ignore it, it’s a WARN
    • CRITICAL a system failure that needs immediate resolution
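
One way the trace ID could be handled, sketched with only the JDK: take the ID from an incoming header value when present, generate one otherwise, and keep it in a thread-local so every message logged by that thread carries it. TraceContext and its method names are hypothetical, and the header the value comes from (for example something like X-Trace-Id) is an assumption. With a logging framework, the same idea is usually expressed through its mapped diagnostic context rather than a hand-rolled formatter.

```java
import java.util.UUID;

// Hypothetical helper; not part of any logging library.
final class TraceContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    private TraceContext() {}

    /** Starts a trace on this thread, reusing the caller's ID if one was supplied. */
    static String begin(String headerValue) {
        boolean missing = headerValue == null || headerValue.trim().isEmpty();
        String id = missing ? UUID.randomUUID().toString() : headerValue.trim();
        CURRENT.set(id);
        return id;
    }

    /** Prefixes a message with the current trace ID, or "-" when no trace is active. */
    static String format(String level, String message) {
        String id = CURRENT.get();
        return "[" + (id != null ? id : "-") + "] " + level + " " + message;
    }

    /** Clears the trace so pooled threads do not leak IDs between requests. */
    static void end() {
        CURRENT.remove();
    }
}
```

Calling end() in a finally block matters when threads are pooled, since a stale ID would otherwise be attached to an unrelated request.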

Monitoring

  • provide an “alive” URL for monitoring
  • provide a URL for deeper monitoring (“healthy” internal processes)
  • provide a URL to validate *connectivity* to remote system dependencies
    • Note that we only care about connectivity here. The remote system should have its own health monitor.
  • provide a URL that reports error conditions, current and past
  • report errors (push) to a remote system (in addition to logging)
  • for monitoring & alerting, use levels as follows
    • INFO *sparse* informative messages e.g. one-time initialization, periodic milestones, cycle start/stop, periodic metrics
    • WARN events to watch over time or that might bear investigation e.g. client errors, retries
    • ERROR a condition that *requires* someone to look at and evaluate; if you can ignore it, it’s a WARN
    • CRITICAL a system failure that needs immediate resolution
  • a good rule of thumb is, if it sets off someone’s pager, it’s ERROR, otherwise it’s a WARN
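
The “alive” and connectivity URLs above might look roughly like the sketch below. It uses the JDK's built-in com.sun.net.httpserver.HttpServer purely to stay self-contained; a service deployed as a .war would more likely expose these as servlets. The paths /alive and /connectivity, the class name MonitorEndpoints, and the one-second timeout are assumptions. The connectivity check only opens a TCP connection, in line with the note that the remote system owns its own health monitoring.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; endpoint paths and names are illustrative.
final class MonitorEndpoints {
    private MonitorEndpoints() {}

    /** Connectivity only: can a TCP connection be opened within the timeout? */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Starts a server with /alive and /connectivity; port 0 picks a free port. */
    static HttpServer start(int port, String depHost, int depPort) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
            server.createContext("/alive", ex -> reply(ex, 200, "alive"));
            server.createContext("/connectivity", ex -> {
                boolean ok = canConnect(depHost, depPort, 1000);
                reply(ex, ok ? 200 : 503, ok ? "connected" : "unreachable");
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void reply(HttpExchange ex, int status, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        ex.sendResponseHeaders(status, bytes.length);
        try (OutputStream out = ex.getResponseBody()) {
            out.write(bytes);
        }
    }
}
```

Returning 503 rather than 200 with an error body lets simple monitors alert on the status code alone.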

Runtime Metrics

  • identify tasks. Each task will have an elapsed time, and at least one count associated with it
  • provide running counts & elapsed time measurements of important system functions
  • provide running counts of error conditions
  • provide running counts aggregated into categories based on processing data
  • provide in-memory aggregation & statistics about key running counts
  • provide a system-wide count of errors and warnings
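
A minimal in-memory sketch of the task idea above: each named task accumulates a run count, an error count, and total elapsed time, from which simple statistics such as a mean can be derived. TaskMetrics and its method names are hypothetical, and a real service might well use an existing metrics library instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of per-task running counts and elapsed-time totals.
final class TaskMetrics {
    private final ConcurrentMap<String, LongAdder> counts = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, LongAdder> errors = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, LongAdder> elapsedNanos = new ConcurrentHashMap<>();

    /** Runs the work, recording one count, its elapsed time, and any failure. */
    void time(String task, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } catch (RuntimeException e) {
            add(errors, task, 1);
            throw e;
        } finally {
            add(counts, task, 1);
            add(elapsedNanos, task, System.nanoTime() - start);
        }
    }

    long count(String task) { return get(counts, task); }

    long errorCount(String task) { return get(errors, task); }

    /** Mean elapsed time per run in milliseconds; 0.0 if the task never ran. */
    double meanMillis(String task) {
        long n = count(task);
        return n == 0 ? 0.0 : get(elapsedNanos, task) / (double) n / 1_000_000.0;
    }

    private static void add(ConcurrentMap<String, LongAdder> map, String key, long delta) {
        map.computeIfAbsent(key, k -> new LongAdder()).add(delta);
    }

    private static long get(ConcurrentMap<String, LongAdder> map, String key) {
        LongAdder adder = map.get(key);
        return adder == null ? 0 : adder.sum();
    }
}
```

LongAdder is used instead of AtomicLong because counters like these are written far more often than they are read.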

Configuration

  • a single mechanism for configuration values, documentation, and ingestion/usage so they don’t get out of sync
  • Don’t repeat yourself!
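
One possible shape for such a single mechanism is an enum in which each setting carries its key, default, and description, so lookup and generated documentation come from the same declaration. The settings shown (pool.size, timeout.ms) and the Setting type itself are invented for illustration.

```java
import java.util.Properties;

// Hypothetical settings; the keys, defaults, and descriptions are examples only.
enum Setting {
    POOL_SIZE("pool.size", "4", "Number of worker threads."),
    TIMEOUT_MS("timeout.ms", "5000", "Remote call timeout in milliseconds.");

    private final String key;
    private final String defaultValue;
    private final String description;

    Setting(String key, String defaultValue, String description) {
        this.key = key;
        this.defaultValue = defaultValue;
        this.description = description;
    }

    /** Looks the setting up in the given properties, falling back to its default. */
    String value(Properties props) {
        return props.getProperty(key, defaultValue);
    }

    int intValue(Properties props) {
        return Integer.parseInt(value(props).trim());
    }

    /** Generates one documentation line per setting from the same declarations. */
    static String describeAll() {
        StringBuilder doc = new StringBuilder();
        for (Setting s : values()) {
            doc.append(s.key).append(" (default ").append(s.defaultValue).append("): ")
               .append(s.description).append('\n');
        }
        return doc.toString();
    }
}
```

Because the documentation is generated from the declarations that the code reads, adding a setting cannot leave the docs behind.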

Documentation

  • documentation in the code, not separate. Don’t repeat yourself!
  • generated API documentation
  • generated configuration documentation
  • generated metrics documentation
  • generated logging documentation

Lifecycle Management

  • All instantiated objects should have their lifecycle managed by the application context: create, initialize, destroy
  • It should be easy to hook objects into the application context lifecycle, either declaratively or through discovery.
  • Note: things like object pools or execution pools need to be shut down gracefully and cleaned up!
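
To illustrate the destroy side of this, here is a small sketch of a registry that closes managed resources in reverse order of registration and shuts execution pools down gracefully. The Lifecycle class and its methods are hypothetical stand-ins for whatever the application context provides. Frameworks with dependency injection usually offer equivalent hooks.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for an application context's shutdown handling.
final class Lifecycle implements AutoCloseable {
    private final Deque<AutoCloseable> managed = new ArrayDeque<>();

    /** Registers a resource to be closed when the lifecycle ends. */
    synchronized void manage(AutoCloseable resource) {
        managed.push(resource);
    }

    /** Registers a pool that is drained gracefully, then forced after the grace period. */
    synchronized void managePool(ExecutorService pool, long graceMillis) {
        managed.push(() -> shutdownGracefully(pool, graceMillis));
    }

    static void shutdownGracefully(ExecutorService pool, long graceMillis) {
        pool.shutdown(); // stop accepting work, let queued tasks finish
        try {
            if (!pool.awaitTermination(graceMillis, TimeUnit.MILLISECONDS)) {
                pool.shutdownNow(); // grace period elapsed: interrupt what is left
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    /** Closes everything in reverse registration order; one failure does not stop the rest. */
    @Override
    public synchronized void close() {
        while (!managed.isEmpty()) {
            try {
                managed.pop().close();
            } catch (Exception e) {
                System.err.println("close failed: " + e);
            }
        }
    }
}
```

Reverse order is chosen so that a resource registered later, which may depend on earlier ones, is released first.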

Runtime Asynchronous Processing

  • WIP the system needs a plan for how to handle asynchronous processing in a way that’s standard throughout the system.
    • e.g. JMS Queue? Execution thread pool? Batch from database table? In-memory? Persistent?

Offline Batch Processing

  • standard mechanism for parallelizing offline tasks by breaking tasks into batches that can be distributed across machines
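
The batching step could be sketched as below: split the work items into fixed-size batches and assign each batch to a machine. The round-robin assignment in workerFor is only one assumed distribution policy, and Batches is a hypothetical helper. How batches actually reach other machines (queue, database table, scheduler) is outside this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for splitting offline work into distributable batches.
final class Batches {
    private Batches() {}

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch can be handed off independently.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Round-robin assignment of a batch to one of workerCount machines. */
    static int workerFor(int batchIndex, int workerCount) {
        return batchIndex % workerCount;
    }
}
```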

State Management / Distributed Caching

  • WIP the system should have a highly available, horizontally scalable mechanism for saving processing state for longer transactions (such as user sessions or inputs to a long-lived messaging conversation)
    • e.g. external (memcached), internal with broadcasting (ehcache), NoSQL, relational, in-memory, persistent
