Software Architecture Books Summary And Highlights -- Part 5 Availability

 

Availability

Highlights

How to detect faults?

There are many ways; an interesting one is voting, which comes in three forms:

  1. replication: identical copies of a component process the same input and their outputs are compared
  2. functional redundancy: the same function is implemented by different pieces of code
  3. analytic redundancy: the same input to a function can come from different sources, so the function can pick the most reliable source
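
A minimal sketch of a majority voter, assuming each replica returns a hashable value for the same input. The names `majority_vote` and `faulty_replicas` are illustrative and not taken from the book; treating "no strict majority" as a detected fault is also an assumption of this sketch.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, or None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # Assumption: a tie or a mere plurality counts as "no agreement".
    return value if count > len(outputs) / 2 else None


def faulty_replicas(outputs: Sequence[Hashable]) -> List[int]:
    """Return the indexes of replicas that disagree with the majority."""
    winner = majority_vote(outputs)
    if winner is None:
        # No consensus, so every replica is suspect.
        return list(range(len(outputs)))
    return [i for i, out in enumerate(outputs) if out != winner]
```

With three replicas reporting `[42, 42, 7]`, the voter accepts `42` and flags replica 2 as the one that disagrees.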

SAP Ch 4 Availability


How to recover from faults?

Preparation and repair:

There are many ways; some interesting ones:

  • ignore faulty behavior, e.g. ignore the output of a sensor known to be bad
  • graceful degradation: maintain the most critical system functions in the presence of component failures, while dropping less critical functions
  • reconfiguration: reassign responsibilities to the resources that are still functioning

Reintroduction tactics:

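  1. shadow: run a previously failed component in shadow mode for a predefined duration before returning it to an active role
  2. state resynchronization: bring the recovered component’s state back in line with the active one
  3. escalating restart: restart at different levels of granularity, starting with the least disruptive
  4. nonstop forwarding
    1. if a router’s supervisor fails, the router can continue forwarding packets along known routes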
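
The escalating restart idea can be sketched as trying restart actions from cheapest to most disruptive and stopping as soon as a health check passes. This is only an illustration: `escalating_restart`, the level names, and the `is_healthy` callback are hypothetical and do not come from the book.

```python
from typing import Callable, List, Optional, Tuple

# A restart level is a (name, action) pair; both are supplied by the caller.
RestartLevel = Tuple[str, Callable[[], None]]


def escalating_restart(
    levels: List[RestartLevel],
    is_healthy: Callable[[], bool],
) -> Optional[str]:
    """Run restart actions in order until the component is healthy.

    `levels` must be ordered from least to most disruptive.
    Returns the name of the level that restored health, or None
    if every level was tried without success.
    """
    for name, restart in levels:
        restart()
        if is_healthy():
            return name
    return None
```

A caller might pass levels such as "restart thread", "restart process", and "reboot host", so that a full reboot is attempted only when the cheaper options have failed.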

SAP Ch 4 Availability


How to Prevent faults?

There are many ways; some interesting ones:

  1. removal from service: temporarily take a service out of operation so that its own faults cannot affect others
  2. transactions
  3. predictive model: predict system faults from the monitored system status

SAP Ch 4 Availability


How to improve availability?

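  • redundancy
    1. active redundancy (hot spare): all nodes receive and process identical inputs in parallel
    2. passive redundancy (warm spare): only the active nodes process inputs, and the spares receive periodic state updates
    3. spare (cold spare): spares stay out of service until a failover occurs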

The tradeoff among the three alternatives is the time to recover from a failure versus the runtime cost incurred to keep a spare up-to-date. A hot spare carries the highest cost but leads to the fastest recovery time, for example.

  • replicas voting
  • circuit breaker: stops callers from retrying over and over against a failing service until the breaker is reset

SAP Ch 4 Availability


Elasticity

The ability of a system to remain responsive during significantly high, instantaneous, and erratic spikes in user load.

How to improve elasticity?

  1. keep services very small and fine-grained so that the mean time to startup (MTTS) is very small
  2. keep synchronous communication among services to a minimum.
    1. The more services communicate with one another to complete a single business transaction, the greater the negative impact on scalability and elasticity.

SAH Ch 3 Architectural Modularity


Circuit breaker

  • With a circuit breaker, after a certain number of requests to the downstream resource have failed, the circuit breaker is blown.
    • The failure can be a timeout or another kind of error
    • Choosing the criteria is tricky: you don’t want the breaker to blow too readily, or to take too long to blow
  • All further requests fail fast while the circuit breaker is in its blown state.
    • for asynchronous calls, queue up the requests
    • for synchronous calls, fail fast
  • After a certain period of time, the client sends a few requests through to see if the downstream service has recovered, and if it gets enough healthy responses it resets the circuit breaker.
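
A minimal single-threaded sketch of the behavior described above, under some simplifying assumptions that are not in the book: failures are counted consecutively, any exception counts as a failure, and a single successful trial call after `reset_timeout` resets the breaker (the book describes sending a few trial requests). The class and parameter names are illustrative.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is blown."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # trial calls are allowed through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("failing fast; downstream presumed unhealthy")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Blow on reaching the threshold, or re-blow at once if a
            # half-open trial call fails.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # A healthy response resets the breaker.
        self._failures = 0
        self._opened_at = None
        return result
```

A synchronous caller would catch `CircuitOpenError` and fail fast, while an asynchronous caller could catch it and queue the request for later.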

Bulkhead

  • In shipping, a bulkhead is a part of the ship that can be sealed off to protect the rest of the ship. So if the ship springs a leak, you can close the bulkhead doors. You lose part of the ship, but the rest of it remains intact.
  • e.g. separate connection pool
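
One way to sketch a bulkhead in code is a fixed number of slots per downstream dependency, so that one slow dependency cannot use up every worker. This sketch assumes a semaphore-based limiter rather than a real connection pool; `Bulkhead` and `BulkheadFullError` are hypothetical names.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slot."""


class Bulkhead:
    """Caps the number of concurrent calls to one downstream dependency."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func: Callable[[], T]) -> T:
        # Reject immediately instead of blocking, so that callers of
        # other dependencies are not held up by this one.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment is full; rejecting call")
        try:
            return func()
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance means that exhausting the slots for one service leaves the slots for the others untouched.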

MSV Ch 11 Microservices at Scale


Types of Failure in cloud and how to handle

  1. timeout: if a service times out, recovery may mean requesting a new VM in the cloud, which is expensive
    1. Most systems do not trigger failure recovery after a single missed response.
    2. The typical approach is to look for some number of missed responses over a longer time interval.
  2. long tail latency; solution:
    1. Hedged requests. Make more requests than are needed and then cancel the requests (or ignore responses) after sufficient responses have been received.
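
Hedged requests can be sketched with a thread pool: send the same request to several replicas, return as soon as enough of them have answered, and ignore the stragglers. The function name `hedged_request` and the `needed` parameter are assumptions for illustration; replicas are modeled as zero-argument callables.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def hedged_request(replicas: Sequence[Callable[[], T]], needed: int = 1) -> List[T]:
    """Send the request to every replica and return the first `needed` results."""
    if not replicas:
        raise ValueError("at least one replica is required")
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica) for replica in replicas]
    results: List[T] = []
    try:
        for future in as_completed(futures):
            if future.exception() is None:
                results.append(future.result())
            if len(results) >= needed:
                return results
        raise RuntimeError("not enough successful responses")
    finally:
        # Cancel requests that have not started; responses from requests
        # already in flight are simply ignored.
        for future in futures:
            future.cancel()
        pool.shutdown(wait=False)
```

The cost is extra load on the replicas, which is the trade made for shorter tail latency.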

SAP ch 17 Cloud and distributed computation


Architecting for Failure

Treating your containers or servers as cattle means that your service can get back to a healthy state automatically, but additional effort is needed to make sure that it can function smoothly while experiencing a moderate rate of failures, because the scheduler can kill nodes at any time.

Solution: smaller tasks and a better load balancer

SWG Ch 25 compute as a service

An essential part of building a resilient system, especially when your functionality is spread over a number of different microservices that may be up or down, is the ability to safely degrade functionality, e.g. through a fallback mechanism.
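
A fallback mechanism can be as small as a wrapper that returns a degraded result when the primary call fails. This is a sketch under assumptions: `with_fallback` is a hypothetical helper, and which exceptions count as "service unavailable" is left to the caller.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    handled: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Return primary(); if it raises a handled error, return fallback()."""
    try:
        return primary()
    except handled:
        # Degrade instead of failing the whole request.
        return fallback()
```

For example, a page could show a static list of bestsellers when a personalized recommendation service is down, rather than failing to render at all.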

MSV Ch 11 Microservices at Scale


Related Chapters

SAP Ch 4 Availability

SAP ch 17 Cloud and distributed computation

SAH Ch 3 Architectural Modularity

MSV Ch 11 Microservices at Scale

SWG Ch 25 compute as a service
