A Philosophical Summary of the Book ‘Software Engineering at Google’

Recently I finished reading the book Software Engineering at Google (Winters, Manshreck and Wright, 2020). It is a very interesting book that shares many experiences, lessons learned, ideas and practices in software development culture, processes and tools. The parts that impressed me most are the high-level philosophical ideas scattered across its chapters. So instead of a regular summary, which you can easily obtain from each chapter’s tl;dr, this article reorganizes the book’s content (as Google often does) and summarizes it in a philosophical sense.

Scalability

Scalability is very important for large organizations with a large code base, a large workforce and software that must keep functioning for a long time.

Scale over a long functioning period

Software Lifespan

  • The lifespan of software (whether long or short) makes a big difference in its design.
  • Software with a long lifespan should be made easy to adapt to new dependency versions or changes.
  • Software development means managing the whole life cycle of software.
  • Unit tests are valuable for short-lived code.
  • Larger tests are valuable for the long-term health of long-running code.

Maintenance

  • When importing a dependency, we need to think about the maintenance cost, the reliability of the dependency (e.g. tests, author’s credibility…), and the cost of developing it ourselves.

Migration and Iteration

  • Migration can be expensive. For example, in build systems, an artifact-based approach (e.g. Bazel) is more scalable but less flexible than a task-based approach (e.g. Gradle). Migrating from an existing task-based system can be difficult and is not always worth it if the build isn’t already showing problems in terms of speed or correctness.
  • It’s always easier to design a better interface in the absence of legacy constraints.
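To make the artifact-based idea more concrete, here is a minimal sketch in Python. It is not how Bazel is implemented. The names `Target`, `ArtifactBuilder` and `fake_compile` are hypothetical, and a dictionary stands in for the file system. The point is only that engineers declare what to build, while the system decides when a rebuild is needed by hashing the inputs.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Target:
    """A declared artifact: its name, its source files and the targets it depends on."""
    name: str
    sources: Tuple[str, ...]
    deps: Tuple[str, ...] = ()


class ArtifactBuilder:
    """Rebuilds a target only when the content of its inputs has changed."""

    def __init__(self, targets: List[Target], files: Dict[str, str],
                 compile_fn: Callable[[str, List[str], List[str]], str]):
        self.targets = {t.name: t for t in targets}
        self.files = files            # path -> content (stand-in for the file system)
        self.compile_fn = compile_fn  # hypothetical compile step
        self.cache: Dict[str, str] = {}
        self.compile_count = 0

    def build(self, name: str) -> str:
        target = self.targets[name]
        dep_artifacts = [self.build(dep) for dep in target.deps]
        source_contents = [self.files[path] for path in target.sources]
        # The cache key covers everything that can influence the output.
        key = hashlib.sha256(
            "\0".join([name, *source_contents, *dep_artifacts]).encode()
        ).hexdigest()
        if key not in self.cache:
            self.compile_count += 1
            self.cache[key] = self.compile_fn(name, source_contents, dep_artifacts)
        return self.cache[key]


def fake_compile(name: str, sources: List[str], deps: List[str]) -> str:
    # Stand-in for a real compiler: just combines the inputs into a string.
    return f"{name}({'+'.join(sources + deps)})"


files = {"lib.c": "int f();", "app.c": "int main();"}
builder = ArtifactBuilder(
    [Target("lib", ("lib.c",)), Target("app", ("app.c",), ("lib",))],
    files,
    fake_compile,
)
builder.build("app")                       # compiles lib and app
builder.build("app")                       # nothing changed, nothing recompiles
files["app.c"] = "int main() { return 0; }"
builder.build("app")                       # only app recompiles; lib comes from the cache
```

Because targets cannot run arbitrary steps, the builder is free to cache and skip work. That is the flexibility given up in exchange for scalability.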

Make the system evolve over time.

For example

  • Drive decision making by data. Change the decision when the underlying data changes over time.
  • Code style rules can change when usage patterns change.
  • A reader can update or correct a doc while reading it, even if they are not the author.


My takeaways and thoughts:

Consider the lifespan of a thing, its maintenance cost and use different strategies accordingly.

For example, you need to carefully consider the HOA fee and its future increase when buying a condo and planning to hold it for a long time. Otherwise, if you are just renting a condo, you probably don’t need to pay too much attention to future maintenance costs.

Migration can be expensive.

For example, moving from one country to another can be expensive in terms of physical moving cost, energy spent, new environment ramp up and so on.

Sometimes abandoning the old thing and starting a completely new thing is easier than maintaining the old thing

For example, if a house is too old, it might be better to tear it down and construct a new one.

Keep improving and adjusting based on the latest conditions.

For example, when planning for a long period such as one year or several years, it might be better to define an open, abstract goal for the whole period, define detailed tasks only for the first short term (e.g. one quarter), and then review and adjust the detailed tasks for later terms.


Scale with huge codebase

Consistency

  • Making code style consistent across the whole organization eases the development of common code formatters, static analysis tools and large-scale refactoring.
  • Use a monorepo, which allows only a single version of all code and dependencies to exist at any time.

At larger scales, the cost of manually managing dependencies is much less than the cost of dealing with issues caused by automatic dependency management.

  1. A distributed version control system like Git allows each organization to have its own source-of-truth branch. For example, Google and Oracle have different master branches of Linux, and different orgs can have their own policies and processes on their branches.
  2. Most problems in dependency management stop being problems when you can see exactly how your code is being used and know exactly the impact of any given change.
  3. Branches are a drag on productivity. In many cases we think complex branch and merge strategies are a perceived safety crutch—an attempt to keep the trunk stable. As we see throughout this book, there are other ways to achieve that outcome.

Simple and less is better for a component

  • If it is hard to keep a code file’s overview comment short, the API design in the file is probably bad and the file needs to be broken into multiple components.
  • We should focus on how much functionality is delivered per unit of code and try to maximize that metric. More code means more complexity.
  • To improve a document, often less is better.
  • Narrowly scoped tests are encouraged.
  • In tests, to reduce the number of fakes that need to be maintained, a fake should typically be created only at the root API level.
  • In tests, if a fake’s clients span multiple languages, create a single fake service implementation and have tests configure the client libraries to send requests to this fake service.
  • For build systems, one of the more surprising lessons that Google has learned is that limiting engineers’ power and flexibility can improve their productivity. So Google uses a Bazel-like build tool, which is an artifact-based build system.
  • We should treat resources such as deployment containers like cattle, which can be managed at scale, rather than like pets.

Treating your containers or servers as cattle means that your service can get back to a healthy state automatically. But because the scheduler can kill nodes, additional effort is needed to make sure that the service functions smoothly while experiencing a moderate rate of failures.

  • So we need to have smaller tasks and better load balancers for the containers.
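The "cattle" idea can be sketched as a small reconciliation loop. This is only an illustration under simplifying assumptions: `Replica` and `ReplicaPool` are hypothetical names, and a real scheduler does far more. What it shows is that replicas are interchangeable, so an unhealthy one is discarded and replaced rather than repaired by hand.

```python
import itertools
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    replica_id: int
    healthy: bool = True


class ReplicaPool:
    """Keeps a fixed number of interchangeable replicas alive."""

    def __init__(self, desired: int):
        self._ids = itertools.count(1)
        self.desired = desired
        self.replicas: List[Replica] = []
        self.reconcile()

    def reconcile(self) -> int:
        """Discard unhealthy replicas and start fresh ones. Returns how many were started."""
        self.replicas = [r for r in self.replicas if r.healthy]
        started = 0
        while len(self.replicas) < self.desired:
            self.replicas.append(Replica(next(self._ids)))
            started += 1
        return started

    def healthy_count(self) -> int:
        return sum(r.healthy for r in self.replicas)


pool = ReplicaPool(desired=3)
pool.replicas[0].healthy = False   # e.g. the scheduler killed the node
degraded = pool.healthy_count()    # the service briefly runs with one replica fewer
replaced = pool.reconcile()        # a fresh, identical replica takes its place
```

Between the failure and the next `reconcile()` call the pool runs below capacity, which is why smaller tasks and good load balancing matter.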


My takeaways and thoughts:

Keeping things consistent could improve efficiency

For example, when you get a new mobile phone, it is always easier to migrate your data and applications from your existing phone to a phone of the same brand than to a phone of a different brand.

Communicate simply and concisely

For example, when giving other people tasks or explaining something to them, it is often better to give concise points even though some details may be lost. Too much information would confuse them.

Divide your tasks into simple ones and better organize them.

Large and ambiguous tasks are not only hard to handle as a whole but also create a heavy mental burden. Dividing them into small, simple tasks and organizing those in an orderly manner makes it easier to finish the large task.


Scale in human productivity

For teams

  • Set clear goals for the team.
  • Set up a central team that has more incentive and is more efficient at tackling difficult or boring projects such as building tech infra, removing tech debt and building engineering productivity tools.

For individuals

  • Failing to keep a test suite deterministic and fast ensures it will become a roadblock to productivity.
  • Engineering productivity tools should aim to create recommendations that are built into the developer workflow and incentive structures.
  • Build actionable and relevant deprecation warnings.

Actionable means the warning should offer developers easy ways to replace the deprecated library or function with the latest library or functions.

Relevant means the warning should only occur when a developer makes any change in places related to the deprecated code. Otherwise, the warning will usually be ignored.

  • Make static code analysis part of the core developer workflow
  • Focus on developer happiness for static code analysis
  • Make it easy for developers to contribute to custom static analysis plugins.
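As a small illustration of an actionable deprecation warning, here is a sketch using Python's standard `warnings` module. The functions `fetch_user` and `fetch_user_v2` are hypothetical. The message names the exact replacement, and `stacklevel=2` attributes the warning to the caller's line, so it shows up where the developer is actually working.

```python
import warnings


def fetch_user_v2(user_id: int) -> dict:
    """Hypothetical replacement API."""
    return {"id": user_id}


def fetch_user(user_id: int) -> dict:
    """Deprecated: use fetch_user_v2 instead."""
    warnings.warn(
        "fetch_user() is deprecated; call fetch_user_v2(user_id) instead, "
        "which takes the same argument.",
        DeprecationWarning,
        stacklevel=2,  # report the caller's line, not this one
    )
    return fetch_user_v2(user_id)
```

A warning that only says "this is deprecated" leaves the developer to research the fix. Naming the drop-in replacement makes the migration a one-line change.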


My takeaways and thoughts:

Responsibility with incentive

When we want to ask a person or a team to take on some responsibility, we need to provide them with incentives. Incentive involves not only reward and ownership; it is also affected by the following factors.

  • If the task is simple to do

For example, if you want other people to clean the room, give them enough cleaning tools.

  • If the task is actionable and relevant

For example, if you don’t want to go to the grocery store and want someone to buy some food for you, you had better tell them exactly what food you need and ask them to buy it from a store on their regular commute route.

  • If the underlying resources are deterministic and cheap to use

For example, if you want someone to deliver packages, choose a delivery route with deterministic and cheap buses for him.

  • If the task is required in major workflow

For example, if you want your children or students to practice handwriting, you can require good handwriting in their regular homework.

  • If a clear goal is set for the task

If you want your kids to behave better in school, you need to give them clear goals such as making more friends, learning more math etc.

  • If the task gives its executors happiness

If you want your kids to develop some special skills, you should choose a skill they are interested in and have a talent for.


Reliability

Hyrum’s law

Definition: With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

For example

  • APIs, containers, services and libraries can all have implicit dependencies.
  • Containers can also have many implicit dependencies, which are an antipattern and hard to remove. For example, it is currently hard to migrate Google’s Borg (task scheduler and container) to behave differently and support more features.
  • Some teams don’t even know they depend on certain tools, libraries or APIs. One way to find such dependencies is dynamically, through tests of increasing frequency and duration during which the old system is turned off temporarily. These intentional changes provide a mechanism for discovering unintended dependencies by seeing what breaks, thus alerting teams to a need to prepare for the upcoming deadline.
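A tiny Python sketch may help make Hyrum’s law tangible. All names here are hypothetical. The function’s documented contract promises no ordering, but version 1 happens to return sorted names, and a caller quietly comes to rely on that.

```python
from typing import Callable, Dict, List


def active_users(users: Dict[str, bool]) -> List[str]:
    """Contract: return the names of active users. No order is promised."""
    return sorted(name for name, active in users.items() if active)


def active_users_v2(users: Dict[str, bool]) -> List[str]:
    """Same contract, but the incidental alphabetical order is gone."""
    return [name for name, active in users.items() if active]


def first_in_report(list_active: Callable[[Dict[str, bool]], List[str]],
                    users: Dict[str, bool]) -> str:
    # Hypothetical caller that quietly assumes the result is alphabetical.
    return list_active(users)[0]


users = {"zoe": True, "adam": True, "bob": False}

# Both versions honor the written contract ...
same_contract = set(active_users(users)) == set(active_users_v2(users))

# ... yet the caller observes different behavior after the "safe" change.
before = first_in_report(active_users, users)      # "adam"
after = first_in_report(active_users_v2, users)    # "zoe"
```

Nothing in the contract was broken, yet a dependent changed behavior. That is exactly the kind of implicit dependency that only shows up when the old behavior is switched off.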


My takeaways and thoughts:

Everything which is possible even with a tiny probability will eventually happen.

For example, a stock market crash…


Fast and short feedback loop makes things more reliable

Collecting as much feedback as early as possible makes things more reliable.

Development

  • More exposure and feedback in development flow results in better design and implementation.

Developers tend to hide their work until it is launched due to their strong ego. But more exposure in all stages of their development actually improves their work’s quality.

  • Allow your team members to try and fail at small things, but be careful with important things.
  • Aim to create engineering productivity recommendations that are built into the developer workflow and incentive structures, so that developers learn best practices while doing their regular work.
  • The earlier a bug is caught, the lower the cost of fixing it.
  • Visibility into test history empowers engineers to share and collaborate on feedback, an essential requirement for disparate teams to diagnose and learn from integration failures between their systems.

Release

  • The more changes made to a codebase, the stronger it becomes.
  • More small releases are better.
  • Faster is safer: Ship early and often and in small batches to reduce the risk of each release and to minimize time to market.
  • At scale, increased complexity usually manifests as increased release latency. Frequent release trains allow for minimal divergence from a known good position, with the recency of changes aiding in resolving issues.


My takeaways and thoughts:

Look for frequent feedback with few new changes

For example, when learning a new language, it is better to use what you learn immediately, by writing or speaking, and to collect feedback right after using it.

You become stronger if you exercise more

For example, though you may not become a top player even if you play a lot of basketball, your body will at least become stronger and more reliable as long as you don’t overexercise.

Try and fail small things

For example, if you like a community, before you buy a house there, maybe live there for a short period or even visit there several times first to check whether you really like it.


Trade-off

As software engineers, we are asked to make more complex decisions with higher-stakes outcomes, often based on imprecise estimates of time and growth. So there are many trade-offs we need to make in our development workflow.

Documentation

  • Documentation audience: novice vs expert

Novice readers want details to understand your documentation, while expert readers want a concise doc they can skim quickly. So you need to make trade-offs on your document’s length to satisfy both. One way to achieve that is to write your doc in 2 passes.

  1. Write descriptively to explain complex topics clearly for novices.
  2. Remove duplicate information where you can to shorten the doc for experts.

  • Trade-off among completeness, accuracy and clarity based on the document’s purpose.

For example, if the document is a guideline, then clarity should be prioritized. If it is a reference, then completeness should be prioritized.

Leadership

  • Working in isolation is needed for coding but can make people think you are hard to communicate with. So be careful about the trade-off.
  • Leadership Always Be Deciding: Ambiguous problems have no magic answer; they’re all about finding the right trade-offs of the moment, and iterating.

Tests

  • Before writing a fake, a trade-off needs to be made on whether the productivity improvements that will result from the use of the fake outweigh the costs of writing and maintaining it.
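To give a feel for what writing and maintaining a fake involves, here is a minimal sketch. `KeyValueStore`, `FakeKeyValueStore` and `increment` are hypothetical names. The real store is imagined to talk to a remote service, while the fake keeps everything in memory.

```python
class KeyValueStore:
    """Hypothetical real API; in production this would call a remote service."""

    def put(self, key: str, value: int) -> None:
        raise NotImplementedError

    def get(self, key: str) -> int:
        raise NotImplementedError


class FakeKeyValueStore(KeyValueStore):
    """In-memory fake that must mirror the real store's observable behavior."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: int) -> None:
        self._data[key] = value

    def get(self, key: str) -> int:
        # Assumption: the real store also raises KeyError for a missing key.
        if key not in self._data:
            raise KeyError(key)
        return self._data[key]


def increment(store: KeyValueStore, key: str) -> int:
    """Code under test: bump a counter, treating a missing key as zero."""
    try:
        current = store.get(key)
    except KeyError:
        current = 0
    store.put(key, current + 1)
    return current + 1


store = FakeKeyValueStore()
increment(store, "visits")
total = increment(store, "visits")
```

The productivity gain is a fast test with no network. The cost is that every behavior of the real store that callers rely on, such as the error raised for a missing key, has to be reproduced in the fake and kept in sync as the real API changes.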

Release

  • There is no perfect binary—decisions and trade-offs have to be made every time a new change is released into production. Key performance indicator metrics with clear thresholds allow features to launch even if they aren’t perfect and can also create clarity in otherwise contentious launch decisions.
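A launch gate built on metrics with clear thresholds could look roughly like the sketch below. The metric names and limits are invented for illustration, and `Threshold` and `launch_blockers` are hypothetical helpers, not something taken from the book.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool


def launch_blockers(metrics: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return a readable list of metrics that miss their threshold (empty means go)."""
    blockers = []
    for t in thresholds:
        value = metrics[t.metric]
        ok = value >= t.limit if t.higher_is_better else value <= t.limit
        if not ok:
            blockers.append(f"{t.metric}={value} (limit {t.limit})")
    return blockers


# Hypothetical thresholds agreed on before the launch discussion starts.
thresholds = [
    Threshold("crash_free_rate", 0.995, higher_is_better=True),
    Threshold("p95_latency_ms", 300, higher_is_better=False),
]

blockers = launch_blockers(
    {"crash_free_rate": 0.997, "p95_latency_ms": 340}, thresholds
)
```

With the limits written down in advance, the launch conversation shifts from "is it good enough?" to "which threshold is missed, and by how much?".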


My takeaways and thoughts:

To make a good decision based on trade-offs, probably the following steps are needed.

  1. Clearly define your trade-off qualities, for example availability vs consistency, and clearly define the metrics for the two qualities.
  2. Meet the bottom line of each quality. For example, determine the minimum level of availability and consistency your system needs and satisfy them.
  3. Pick and maximize one quality for your major goal. For example, if your system is a messaging app, then availability should be prioritized so that messages are delivered immediately.
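The three steps above can be sketched as a small selection function. The scores, the floor values and the option names are made up for illustration, and `choose` is a hypothetical helper.

```python
from typing import Dict, List, Optional


def choose(options: List[Dict], floors: Dict[str, float], primary: str) -> Optional[Dict]:
    """Keep options that meet every minimum, then maximize the primary quality."""
    viable = [
        option for option in options
        if all(option[quality] >= floor for quality, floor in floors.items())
    ]
    if not viable:
        return None  # no option meets the bottom line; revisit the floors
    return max(viable, key=lambda option: option[primary])


# Step 1: qualities and their metrics (hypothetical scores between 0 and 1).
options = [
    {"name": "A", "availability": 0.999, "consistency": 0.5},
    {"name": "B", "availability": 0.99, "consistency": 0.7},
    {"name": "C", "availability": 0.95, "consistency": 0.9},
]

# Step 2: the bottom line for each quality.
floors = {"availability": 0.9, "consistency": 0.6}

# Step 3: maximize the quality that matters most for the major goal.
best = choose(options, floors, primary="availability")
```

Option A has the highest availability but falls below the consistency floor, so the choice lands on B.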
