Software Architecture Books Summary And Highlights -- Part 5 Availability

- January 15, 2023

Availability

Highlights

How to detect faults?

many ways, interesting one:

voting (3 ways)

replication: identical copies of the same component
functional redundancy: the same function is implemented by different pieces of code
analytic redundancy: a function's input can come from several different sources, so the most reliable source can be picked
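
A minimal sketch of the voting idea, assuming three replicas whose outputs can be compared for equality. The names `majority_vote` and `faulty_replicas` are hypothetical and do not come from the book; the point is only that a disagreeing replica is detected by comparing it against the majority.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of replicas."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No value is backed by more than half of the replicas.
        raise RuntimeError("no majority among replicas")
    return value


def faulty_replicas(outputs):
    """Return the indexes of replicas that disagree with the majority."""
    winner = majority_vote(outputs)
    return [i for i, out in enumerate(outputs) if out != winner]
```

With outputs `[42, 42, 41]`, the voter returns 42 and flags replica 2 as suspect. The same voter works for all three variants above; they differ only in how the replicas are produced.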

SAP Ch 4 Availability

How to recover from faults?

preparation and repair:

many ways, some interesting ones:

ignore faulty behavior, e.g. ignore bad sensor output
graceful degradation: This tactic maintains the most critical system functions in the presence of component failures, while dropping less critical functions.
reconfiguration: reassign responsibilities to the resources that are still functioning

Reintroduction tactics:

shadow: run a previously failed component in shadow mode for a predefined duration before returning it to an active role
state re-sync
escalating restart: different levels of restart, from the least to the most disruptive
nonstop forwarding
1. if a router’s supervisor fails, the router can continue forwarding packets along known routes
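
The escalating restart tactic listed above can be sketched as follows. This is only an illustration under assumptions: the restart levels (thread, process, host) and the `is_healthy` check are hypothetical stand-ins, and a real system would wait and re-probe between levels.

```python
def escalating_restart(levels, is_healthy):
    """Try restart actions from least to most disruptive.

    `levels` is an ordered list of (name, restart_function) pairs.
    Returns the name of the level that restored health, or None.
    """
    for name, restart in levels:
        restart()
        if is_healthy():
            return name
    return None


# Simulated component that only recovers after a process-level restart.
state = {"healthy": False}
attempted = []


def restart_thread():
    attempted.append("thread")  # too weak to fix this fault


def restart_process():
    attempted.append("process")
    state["healthy"] = True


def reboot_host():
    attempted.append("host")
    state["healthy"] = True


levels = [
    ("thread", restart_thread),
    ("process", restart_process),
    ("host", reboot_host),
]
recovered_at = escalating_restart(levels, lambda: state["healthy"])
```

Here the loop stops at the process level, so the most disruptive action (rebooting the host) is never taken.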

SAP Ch 4 Availability

How to Prevent faults?

many ways, some interesting ones:

removal from service: temporarily take a service out of service so that its own faults do not affect others
transactions
predict system faults based on system status

SAP Ch 4 Availability

How to improve availability?

redundancy

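active redundancy (hot spare): all nodes receive and process identical inputs in parallel
passive redundancy (warm spare): only the active nodes process inputs, and they periodically sync their state to the spares
spare (cold spare): redundant nodes stay out of service until a failover occurs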

The tradeoff among the three alternatives is the time to recover from a failure versus the runtime cost incurred to keep a spare up-to-date. A hot spare carries the highest cost but leads to the fastest recovery time, for example.

replicas voting
circuit breaker: stops callers from retrying too many times until it is reset

SAP Ch 4 Availability

Elasticity

the ability of a system to remain responsive during significantly high instantaneous and erratic spikes in user load.

How to improve elasticity?

have very small, fine-grained services so that the mean time to startup (MTTS) is very small
keep synchronous communication among services to a minimum.
1. The more services communicate with one another to complete a single business transaction, the greater the negative impact on scalability and elasticity.

SAH Ch 3 Architectural Modularity

Circuit breaker

With a circuit breaker, after a certain number of requests to the downstream resource have failed, the circuit breaker is blown.
- The failure can be a timeout or another kind of error.
- It is tricky to decide the criteria: you don’t want to blow too frequently, or to wait too long to blow.
All further requests fail fast while the circuit breaker is in its blown state.
- For async calls, just queue up the requests.
- For sync calls, just fail fast.
After a certain period of time, the client sends a few requests through to see if the downstream service has recovered, and if it gets enough healthy responses it resets the circuit breaker.
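
A minimal sketch of the three phases described above (blow after repeated failures, fail fast while blown, probe and reset), covering only the synchronous fail-fast case. The class name, the thresholds, and the injectable `clock` are all assumptions made for illustration; production libraries differ in detail.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is blown."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 probe_successes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before blowing
        self.reset_timeout = reset_timeout          # seconds to wait before probing
        self.probe_successes = probe_successes      # healthy probes needed to reset
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            # Waited long enough: let a few probe requests through.
            self.state = "half-open"
            self.successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0

    def _on_success(self):
        if self.state == "half-open":
            self.successes += 1
            if self.successes >= self.probe_successes:
                self.state = "closed"
        else:
            self.failures = 0  # only consecutive failures count
```

Counting consecutive failures is one possible criterion; a failure rate over a time window is another, which is exactly the tuning difficulty noted above.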

Bulkhead

In shipping, a bulkhead is a part of the ship that can be sealed off to protect the rest of the ship. So if the ship springs a leak, you can close the bulkhead doors. You lose part of the ship, but the rest of it remains intact.
e.g. a separate connection pool for each downstream service
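
A minimal sketch of a bulkhead as a separate, bounded set of slots per downstream service, so that one slow dependency cannot use up capacity meant for the others. The `Bulkhead` class, the service names, and the limits are hypothetical; a real connection pool would hold actual connections rather than bare semaphores.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a downstream's own compartment has no free slot."""


class Bulkhead:
    def __init__(self, limits):
        # One independent semaphore per downstream service.
        self._slots = {
            name: threading.BoundedSemaphore(limit)
            for name, limit in limits.items()
        }

    @contextmanager
    def slot(self, downstream):
        semaphore = self._slots[downstream]
        if not semaphore.acquire(blocking=False):
            # Reject instead of waiting, so callers are not dragged down too.
            raise BulkheadFullError(downstream)
        try:
            yield
        finally:
            semaphore.release()
```

Usage would look like `with bulkhead.slot("payments"): call_payments()`, where `call_payments` is whatever client call the service makes. If "payments" is saturated, calls to "search" still get through.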

MSV Ch 11 Microservices at Scale

Types of Failure in cloud and how to handle

timeout: if a service times out, recovery may require requesting a new VM in the cloud, which is expensive
1. Most systems do not trigger failure recovery after a single missed response.
2. The typical approach is to look for some number of missed responses over a longer time interval.
long tail latency; solution:
1. Hedged requests. Make more requests than are needed and then cancel the requests (or ignore responses) after sufficient responses have been received.
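
A minimal sketch of hedged requests using `asyncio`, assuming each replica is exposed as a zero-argument function that returns a coroutine. The names `hedged_request` and `replica` and the delays are hypothetical; the sketch sends the request to every replica, keeps the first `needed` successful responses, and cancels the rest.

```python
import asyncio


async def hedged_request(replicas, needed=1):
    """Call every replica concurrently; return the first `needed` results."""
    tasks = [asyncio.create_task(call()) for call in replicas]
    results = []
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                results.append(await finished)
            except Exception:
                continue  # a failed replica simply does not count
            if len(results) >= needed:
                break
    finally:
        for task in tasks:
            task.cancel()  # no effect on tasks that already finished
        # Wait for the cancellations to be processed.
        await asyncio.gather(*tasks, return_exceptions=True)
    return results


async def replica(name, delay):
    """Stand-in for a remote call with the given latency in seconds."""
    await asyncio.sleep(delay)
    return name


async def demo():
    return await hedged_request([
        lambda: replica("slow", 0.5),
        lambda: replica("fast", 0.01),
    ])
```

Running `asyncio.run(demo())` returns the fast replica's answer without waiting for the slow one. The price is extra load on the replicas, so this trades cost for tail latency.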

SAP ch 17 Cloud and distributed computation

Architecting for Failure

Treating your containers or servers as cattle means that your service can get back to a healthy state automatically, but additional effort is needed to make sure that it can function smoothly while experiencing a moderate rate of failures, because the scheduler can kill nodes.

Solution: smaller tasks and better load balancer

SWG Ch 25 compute as a service

An essential part of building a resilient system, especially when your functionality is spread over a number of different microservices that may be up or down, is the ability to safely degrade functionality. e.g. fallback mechanism
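
A minimal sketch of a fallback mechanism, assuming a recommendation feature that can degrade to a static list when its service is down. All function names and return values here are hypothetical and only illustrate the shape of the idea.

```python
def with_fallback(primary, fallback):
    """Wrap `primary` so that any failure is answered by `fallback`."""
    def guarded(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade: serve a less capable answer instead of an error.
            return fallback(*args, **kwargs)
    return guarded


def personalized_recommendations(user_id):
    # Stand-in for a call to a microservice that is currently down.
    raise ConnectionError("recommendation service is down")


def bestsellers(user_id):
    # Static, non-personalized result that needs no remote call.
    return ["book-1", "book-2"]


recommend = with_fallback(personalized_recommendations, bestsellers)
```

Calling `recommend("u42")` returns the bestseller list, so the page still renders while the recommendation service is unavailable.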

MSV Ch 11 Microservices at Scale

Related Chapters

SAP Ch 4 Availability

SAP ch 17 Cloud and distributed computation

SAH Ch 3 Architectural Modularity

MSV Ch 11 Microservices at Scale

SWG Ch 25 compute as a service
