Version Reporting
- report version dynamically
- report version in a text file in the deployment directory — if applicable
- report version inside the .war file — probably in the Manifest
- embed version in any generated client-facing pages
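A minimal sketch of reading the version from the Manifest, assuming the build writes an `Implementation-Version` attribute into `META-INF/MANIFEST.MF`. The class and method names (`VersionInfo`, `fromManifest`, `banner`) are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.jar.Manifest;

// Hypothetical helper: reads the version the build wrote into the Manifest.
final class VersionInfo {
    static final String UNKNOWN = "unknown";

    private VersionInfo() {}

    // Returns the Implementation-Version main attribute, or "unknown" if it is absent.
    static String fromManifest(InputStream manifestStream) {
        try {
            Manifest manifest = new Manifest(manifestStream);
            String version = manifest.getMainAttributes().getValue("Implementation-Version");
            return version != null ? version : UNKNOWN;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One-line text that could be served dynamically or embedded in a page footer.
    static String banner(String appName, String version) {
        return appName + " " + version;
    }
}
```

In a deployed .war, the stream would typically come from `ServletContext.getResourceAsStream("/META-INF/MANIFEST.MF")`. The same string can then feed the dynamic report, the text file, and the generated pages so the three cannot disagree.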
Logging
- attach a trace ID (generated, or taken from a request header) to each log message so a thread’s activities can be correlated
- for monitoring & alerting, use levels as follows:
  - INFO *sparse* informative messages e.g. one-time initialization, periodic milestones, cycle start/stop, periodic metrics
  - WARN events to watch over time or that might bear investigation e.g. client errors, retries
  - ERROR a condition that *requires* someone to look at and evaluate it; if you can ignore it, it’s a WARN
  - CRITICAL a system failure that needs immediate resolution
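The trace ID and level rules above can be sketched as follows. This is an illustration only: `LogLevel` and `TraceLog` are hypothetical names, and the output goes to standard output rather than to a real logging framework. Frameworks usually provide an equivalent per-thread context, for example SLF4J's `MDC`.

```java
import java.util.UUID;

// Severity levels as described in the list above.
enum LogLevel { INFO, WARN, ERROR, CRITICAL }

// Hypothetical helper that keeps one trace ID per thread and stamps it on every message.
final class TraceLog {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    private TraceLog() {}

    // Adopts the caller-supplied ID (e.g. from a request header) or generates a new one.
    static String begin(String headerValue) {
        String id = (headerValue == null || headerValue.trim().isEmpty())
                ? UUID.randomUUID().toString()
                : headerValue.trim();
        TRACE_ID.set(id);
        return id;
    }

    // Clears the ID so a pooled thread does not carry it into the next request.
    static void end() {
        TRACE_ID.remove();
    }

    // Formats "LEVEL [traceId] message"; "-" marks a thread with no active trace.
    static String format(LogLevel level, String message) {
        String id = TRACE_ID.get();
        return level + " [" + (id != null ? id : "-") + "] " + message;
    }

    static void log(LogLevel level, String message) {
        System.out.println(format(level, message));
    }
}
```

A request handler would call `TraceLog.begin(...)` on entry and `TraceLog.end()` in a `finally` block, so every message logged in between carries the same ID.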
Monitoring
- provide an “alive” URL for monitoring
- provide a URL for deeper monitoring that validates “healthy” internal processes
- provide a URL to validate *connectivity* to remote system dependencies
  - Note that we just care about connectivity here. The remote system should have its own health monitor.
- provide a URL that reports error conditions, current and past
- report errors (push) to a remote system (in addition to logging)
- a good rule of thumb for the log levels: if it sets off someone’s pager, it’s an ERROR; otherwise it’s a WARN
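A sketch of the “alive” and “healthy” URLs, using the HTTP server bundled with the JDK (`com.sun.net.httpserver`). The paths `/alive` and `/healthy`, the class names, and the 200/503 convention are assumptions for illustration. A servlet container would expose the same checks through its own handlers.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical registry of named checks behind the "healthy" URL.
final class HealthChecks {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    HealthChecks register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    // Runs every check in registration order; a check that throws counts as failed.
    Map<String, Boolean> run() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, BooleanSupplier> e : checks.entrySet()) {
            boolean ok;
            try {
                ok = e.getValue().getAsBoolean();
            } catch (RuntimeException ex) {
                ok = false;
            }
            results.put(e.getKey(), ok);
        }
        return results;
    }

    // 200 when every check passed, 503 otherwise (assumed convention).
    static int statusFor(Map<String, Boolean> results) {
        return results.containsValue(false) ? 503 : 200;
    }

    // Connectivity only: can a TCP connection be opened? Says nothing about remote health.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}

final class MonitoringServer {
    private MonitoringServer() {}

    // Starts the server; passing port 0 lets the OS pick a free port.
    static HttpServer start(int port, HealthChecks checks) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/alive", exchange -> respond(exchange, 200, "alive"));
        server.createContext("/healthy", exchange -> {
            Map<String, Boolean> results = checks.run();
            respond(exchange, HealthChecks.statusFor(results), results.toString());
        });
        server.start();
        return server;
    }

    private static void respond(HttpExchange exchange, int status, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(status, bytes.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(bytes);
        }
    }
}
```

Keeping the check registry separate from the HTTP wiring means the same checks can also drive the pushed error reports.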
Runtime Metrics
- identify tasks; each task has an elapsed time and at least one count associated with it
- provide running counts & elapsed time measurements of important system functions
- provide running counts of error conditions
- provide running counts aggregated into categories based on processing data
- provide in-memory aggregation & statistics about key running counts
- provide a system-wide count of errors and warnings
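One way to keep running counts and elapsed times per task in memory, sketched with `LongAdder` for cheap concurrent updates. `TaskMetrics` and its method names are hypothetical, and warnings and category aggregation are left out for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical in-memory metrics: a count, an error count, and total elapsed time per task.
final class TaskMetrics {
    private static final class Stat {
        final LongAdder count = new LongAdder();
        final LongAdder errors = new LongAdder();
        final LongAdder elapsedNanos = new LongAdder();
    }

    private final Map<String, Stat> stats = new ConcurrentHashMap<>();

    private Stat stat(String task) {
        return stats.computeIfAbsent(task, k -> new Stat());
    }

    // Runs the body, recording one invocation, its elapsed time, and any failure.
    <T> T time(String task, Supplier<T> body) {
        Stat s = stat(task);
        long start = System.nanoTime();
        try {
            return body.get();
        } catch (RuntimeException e) {
            s.errors.increment();
            throw e;
        } finally {
            s.count.increment();
            s.elapsedNanos.add(System.nanoTime() - start);
        }
    }

    long count(String task) { return stat(task).count.sum(); }

    long errors(String task) { return stat(task).errors.sum(); }

    // Mean elapsed time per invocation in milliseconds; 0 if the task never ran.
    double meanMillis(String task) {
        Stat s = stat(task);
        long n = s.count.sum();
        return n == 0 ? 0.0 : s.elapsedNanos.sum() / 1_000_000.0 / n;
    }

    // System-wide error count across all tasks.
    long totalErrors() {
        return stats.values().stream().mapToLong(s -> s.errors.sum()).sum();
    }
}
```

Wrapping a call as `metrics.time("parse", () -> parse(input))` records the count and the timing in one place, so the two cannot drift apart.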
Configuration
- use a single mechanism for configuration values, their documentation, and their ingestion/usage so they don’t get out of sync
  - Don’t repeat yourself!
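A sketch of one such mechanism: an enum in which each constant carries its key, default value, and description, so lookup and generated documentation come from the same definition. The setting names and keys shown are invented for illustration.

```java
import java.util.Properties;

// Hypothetical settings; each constant is the single source for key, default, and docs.
enum Setting {
    HTTP_PORT("http.port", "8080", "Port the service listens on"),
    RETRY_LIMIT("retry.limit", "3", "Maximum retries for a failed remote call");

    final String key;
    final String defaultValue;
    final String description;

    Setting(String key, String defaultValue, String description) {
        this.key = key;
        this.defaultValue = defaultValue;
        this.description = description;
    }

    // Returns the configured value, falling back to the declared default.
    String get(Properties props) {
        return props.getProperty(key, defaultValue);
    }

    int getInt(Properties props) {
        return Integer.parseInt(get(props).trim());
    }

    // Generates documentation from the same definitions used for lookup.
    static String describeAll() {
        StringBuilder sb = new StringBuilder();
        for (Setting s : values()) {
            sb.append(s.key)
              .append(" (default ").append(s.defaultValue).append("): ")
              .append(s.description)
              .append('\n');
        }
        return sb.toString();
    }
}
```

Code reads a value with `Setting.HTTP_PORT.getInt(props)`, and a build step or admin page can print `Setting.describeAll()`.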
Documentation
- keep documentation in the code, not separate from it. Don’t repeat yourself!
- generated API documentation
- generated configuration documentation
- generated metrics documentation
- generated logging documentation
Lifecycle Management
- All instantiated objects should have their lifecycle (create, initialize, destroy) managed by the application context
- Should be easy to hook objects into the application context lifecycle, either declaratively or through discovery.
- Note: things like object pools or execution pools need to be shut down gracefully and cleaned up!
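A sketch of a context that starts components and stops them in reverse order, with a thread pool as the example of a resource that needs a graceful shutdown. `Managed`, `AppContext`, and `WorkerPool` are hypothetical names, registration is explicit rather than declarative or discovered, and the 5-second drain timeout is an arbitrary choice.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical contract for anything whose lifecycle the context manages.
interface Managed {
    void start();
    void stop();
}

final class AppContext {
    private final Deque<Managed> started = new ArrayDeque<>();

    // Starts the component and remembers it for shutdown.
    <T extends Managed> T manage(T component) {
        component.start();
        started.push(component);
        return component;
    }

    // Stops components in reverse start order; one failure does not block the rest.
    List<RuntimeException> shutdown() {
        List<RuntimeException> failures = new ArrayList<>();
        while (!started.isEmpty()) {
            try {
                started.pop().stop();
            } catch (RuntimeException e) {
                failures.add(e);
            }
        }
        return failures;
    }
}

// Example component: a worker pool that drains queued work before exiting.
final class WorkerPool implements Managed {
    private ExecutorService executor;

    @Override
    public void start() {
        executor = Executors.newFixedThreadPool(2);
    }

    void submit(Runnable task) {
        executor.submit(task);
    }

    @Override
    public void stop() {
        executor.shutdown(); // stop accepting work, let queued tasks finish
        try {
            if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
                executor.shutdownNow(); // give up and interrupt stragglers
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    boolean isTerminated() {
        return executor.isTerminated();
    }
}
```

Stopping in reverse order means a component is shut down before the things it depends on, provided dependencies were started first.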
Runtime Asynchronous Processing
- WIP: the system needs a plan for handling asynchronous processing in a way that’s standard throughout the system.
  - e.g. JMS Queue? Execution thread pool? Batch from database table? In-memory? Persistent?
Offline Batch Processing
- provide a standard mechanism for parallelizing offline tasks by breaking them into batches that can be distributed across machines
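The batching step could look like the following sketch, which splits a task list into fixed-size batches and assigns each batch to a worker round-robin. `Batcher` is a hypothetical name, and shipping the batches to other machines is out of scope here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for splitting offline work into distributable batches.
final class Batcher {
    private Batcher() {}

    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    // Assigns batch i to worker (i mod workerCount), i.e. simple round-robin.
    static int workerFor(int batchIndex, int workerCount) {
        return Math.floorMod(batchIndex, workerCount);
    }
}
```

Fixed-size batches keep the unit of retry small: if one machine fails, only its batches need to be rerun.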
State Management / Distributed Caching
- WIP: the system should have a highly available, horizontally scalable mechanism for saving processing state for longer transactions (such as user sessions or inputs to a long-lived messaging conversation)
  - e.g. external (memcached), internal with broadcasting (Ehcache), NoSQL, relational, in-memory, persistent