Software Architecture Books Summary And Highlights -- Part 5 Availability
Availability
Highlights
How to detect faults?
Many ways; an interesting one:
voting (3 forms)
- replication: identical copies of a component process the same input
- functional redundancy: the same function is implemented by different pieces of code
- analytic redundancy: an input to a function can come from different sources, so the voter can pick the most reliable source to use
SAP Ch 4 Availability
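The voting idea above can be sketched as a small majority voter. This is a minimal illustration, not taken from the book: the names `majority_vote`, `faulty_replicas`, and the three `square_*` functions are hypothetical, and a strict majority is assumed as the decision rule.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of replicas.

    Raises ValueError when no value wins a strict majority, which the
    caller can treat as a detected fault.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError(f"no majority among outputs: {list(outputs)}")
    return value


def faulty_replicas(outputs: Sequence[Hashable]) -> List[int]:
    """Return the indexes of replicas that disagree with the majority."""
    winner = majority_vote(outputs)
    return [i for i, out in enumerate(outputs) if out != winner]


# Functional redundancy: the same function written three different ways
# (all three are made up for this example).
def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_buggy(x: int) -> int:
    return x + x  # deliberately wrong for most inputs


outputs = [f(3) for f in (square_a, square_b, square_buggy)]
result = majority_vote(outputs)       # value agreed on by the majority
suspects = faulty_replicas(outputs)   # replicas that were outvoted
```

With replication, the three callables would be identical copies running on separate nodes; the voter itself stays the same.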
How to recover from faults?
preparation and repair:
many ways, some interesting ones:
- ignore faulty behavior, e.g. ignore bad sensor output
- graceful degradation: This tactic maintains the most critical system functions in the presence of component failures, while dropping less critical functions.
- reconfiguration: reassign responsibilities to the resources that are still functioning
Reintroduction tactics:
- shadow: run a previously failed component in shadow mode for a predefined duration before returning it to an active role
- state re-sync
- escalating restart: restart at different levels of granularity to minimize the service affected
- nonstop forwarding
- if a router’s supervisor fails, the router can continue forwarding packets along known routes
SAP Ch 4 Availability
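The escalating restart tactic can be sketched as a loop over restart levels, cheapest first. This is only an illustration under assumptions: the level names (`reload-config`, `restart-process`, `reboot-host`) and the helper `make_level` are hypothetical, and each level is assumed to report success as a boolean.

```python
from typing import Callable, List, Sequence, Tuple

# A restart level is a name plus an action that returns True on recovery.
RestartLevel = Tuple[str, Callable[[], bool]]


def escalating_restart(levels: Sequence[RestartLevel]) -> str:
    """Try each restart level in order and return the name of the first
    level that recovers the component."""
    for name, restart in levels:
        if restart():
            return name
    raise RuntimeError("all restart levels failed")


attempted: List[str] = []


def make_level(name: str, succeeds: bool) -> RestartLevel:
    """Build a fake restart level that records when it is attempted."""
    def restart() -> bool:
        attempted.append(name)
        return succeeds
    return name, restart


levels = [
    make_level("reload-config", False),    # least disruptive, fails here
    make_level("restart-process", True),   # recovers the component
    make_level("reboot-host", True),       # most disruptive, never reached
]
recovered_at = escalating_restart(levels)
```

Ordering the levels from least to most disruptive keeps the affected service as small as possible for the common, easily fixed faults.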
How to prevent faults?
many ways, some interesting ones:
- removal from service: temporarily take a service out of operation so that its own faults do not affect others
- transactions
- predict system faults based on system status
SAP Ch 4 Availability
How to improve availability?
- redundancy
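- active redundancy (hot spare): all nodes receive and process identical inputs in parallel
- passive redundancy (warm spare): only the active nodes process inputs, and they periodically sync their state to the spares
- spare (cold spare): spares stay out of service until a failover occurs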
The tradeoff among the three alternatives is the time to recover from a failure versus the runtime cost incurred to keep a spare up-to-date. A hot spare carries the highest cost but leads to the fastest recovery time, for example.
- replicas voting
- circuit breaker: stops callers from retrying endlessly until it is reset
SAP Ch 4 Availability
Elasticity
the ability of a system to remain responsive during significantly high instantaneous and erratic spikes in user load.
How to improve elasticity?
- have very small, fine-grained services so that the mean time to startup (MTTS) is very small
- keep synchronous communication among services to a minimum.
- The more services communicate with one another to complete a single business transaction, the greater the negative impact on scalability and elasticity.
SAH Ch 3 Architectural Modularity
Circuit breaker
- With a circuit breaker, after a certain number of requests to the downstream resource have failed, the circuit breaker is blown.
- The failure can be a timeout or another error
- It is tricky to decide the criteria: you don’t want to blow too frequently or wait too long to blow
- All further requests fail fast while the circuit breaker is in its blown state.
- for async calls, just queue up the requests
- for sync calls, just fail fast
- After a certain period of time, the client sends a few requests through to see if the downstream service has recovered, and if it gets enough healthy responses it resets the circuit breaker.
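The three steps above (blow after repeated failures, fail fast while blown, probe and reset) can be sketched as a small state machine. This is a minimal single-threaded sketch, not the book's code: the class and parameter names (`CircuitBreaker`, `failure_threshold`, `reset_timeout`, `success_threshold`) and the default values are assumptions, failures are counted consecutively, and any exception counts as a failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is blown."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 success_threshold: int = 2,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # failures before blowing
        self.reset_timeout = reset_timeout          # seconds to stay blown
        self.success_threshold = success_threshold  # healthy probes to reset
        self._clock = clock
        self.state = "closed"  # "closed", "open" (blown), or "half_open"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                # Blown: fail fast without touching the downstream service.
                raise CircuitOpenError("circuit is blown; failing fast")
            # Timeout elapsed: let a few trial requests through.
            self.state = "half_open"
            self._successes = 0
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self) -> None:
        self._failures += 1
        # A failed probe, or too many consecutive failures, blows the breaker.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
            self._failures = 0

    def _on_success(self) -> None:
        if self.state == "half_open":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "closed"  # enough healthy responses: reset
                self._failures = 0
        else:
            self._failures = 0  # only consecutive failures count
```

A caller would wrap each downstream request, for example `breaker.call(lambda: fetch_price(item_id))`, where `fetch_price` is a hypothetical client function. For async calls, the `CircuitOpenError` branch is where requests would be queued instead of rejected.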
Bulkhead
- In shipping, a bulkhead is a part of the ship that can be sealed off to protect the rest of the ship. So if the ship springs a leak, you can close the bulkhead doors. You lose part of the ship, but the rest of it remains intact.
- e.g. separate connection pool
MSV Ch 11 Microservices at Scale
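One way to picture the bulkhead idea is a separate, bounded pool of slots per downstream service, so that exhausting one pool cannot starve calls to another. The sketch below uses a semaphore in place of a real connection pool; the `Bulkhead` class and the `payments` and `search` names are hypothetical.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when every slot of a bulkhead is already in use."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func: Callable[[], T]) -> T:
        # Reject immediately instead of waiting, so a saturated downstream
        # service cannot tie up the caller's threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead {self.name!r} is full")
        try:
            return func()
        finally:
            self._slots.release()


# One bulkhead per downstream service (sizes are arbitrary).
payments = Bulkhead("payments", max_concurrent=1)
search = Bulkhead("search", max_concurrent=1)
```

If `payments` is saturated, further payment calls are rejected, but calls through `search` still get their own slots.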
Types of Failure in cloud and how to handle
- timeout: if a service times out, you may need to request a new VM in the cloud, which is expensive
- Most systems do not trigger failure recovery after a single missed response.
- the typical approach is to look for some number of missed responses over a longer time interval
- long tail latency; solution:
- Hedged requests. Make more requests than are needed and then cancel the requests (or ignore responses) after sufficient responses have been received.
SAP Ch 17 Cloud and distributed computation
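Hedged requests, as described above, can be sketched with a thread pool: send the same request to several replicas, return as soon as enough responses arrive, and cancel or ignore the rest. This is an assumption-laden sketch: `hedged_request` and its `needed` parameter are hypothetical names, replicas are modeled as plain callables, and requests that are already running cannot be interrupted, so their responses are simply ignored.

```python
import concurrent.futures as cf
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def hedged_request(replicas: Sequence[Callable[[], T]], needed: int = 1) -> List[T]:
    """Call every replica in parallel and return the first `needed`
    successful results, in completion order."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica) for replica in replicas]
    results: List[T] = []
    try:
        for fut in cf.as_completed(futures):
            if fut.exception() is None:  # skip replicas that failed
                results.append(fut.result())
            if len(results) >= needed:
                return results
        raise RuntimeError("not enough replicas responded successfully")
    finally:
        # Cancel requests that have not started; ignore those still running.
        for fut in futures:
            fut.cancel()
        pool.shutdown(wait=False)
```

The price of cutting the long tail this way is extra load on the replicas, since most of the duplicate work is thrown away.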
Architecting for Failure
Treating your containers or servers as cattle means that your service can get back to a healthy state automatically. Because the scheduler can kill nodes, additional effort is needed to make sure the service functions smoothly while experiencing a moderate rate of failures.
Solution: smaller tasks and a better load balancer
SWG Ch 25 compute as a service
An essential part of building a resilient system, especially when your functionality is spread over a number of different microservices that may be up or down, is the ability to safely degrade functionality, e.g. with a fallback mechanism.
MSV Ch 11 Microservices at Scale
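A fallback mechanism can be as small as a wrapper that serves a degraded answer when the primary call fails. The sketch below is illustrative only: `with_fallback`, `personalized_recommendations`, and `bestsellers` are hypothetical names, and every exception is treated as a reason to degrade.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Return the primary result, or the fallback result if the primary fails."""
    try:
        return primary()
    except Exception:
        return fallback()


def personalized_recommendations() -> List[str]:
    # Stand-in for a call to a microservice that is currently down.
    raise ConnectionError("recommendation service unavailable")


def bestsellers() -> List[str]:
    # Degraded but still useful: a static list that needs no other service.
    return ["book-1", "book-2"]


shown = with_fallback(personalized_recommendations, bestsellers)
```

The page still renders with generic content instead of failing outright, which is the point of degrading safely.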
Related Chapters
SAP Ch 4 Availability
SAP Ch 17 Cloud and distributed computation
SAH Ch 3 Architectural Modularity
MSV Ch 11 Microservices at Scale
SWG Ch 25 compute as a service