I recently finished reading Software Engineering at Google (Winters, Manshreck and Wright, 2020). It is a very interesting book that shares many experiences, lessons learned, ideas and practices covering software development culture, processes and tools. The parts that impressed me most are the high-level philosophical ideas scattered across its chapters. So instead of a regular summary, which you can easily get from each chapter’s TL;DR, this article tries to reorganize the book’s content (as Google often does) and summarize it in a philosophical sense.

Scalability

Scalability matters a great deal for large organizations with a large codebase, a large workforce and software that must keep running for a long time.

Scale over a long operating period

Software Lifespan

The lifespan of a piece of software (whether long or short) makes a big difference to its design.
Software with a long lifespan should be easy to adapt to new dependency versions and other changes.
Software development means managing the whole life cycle of the software.
Unit tests are valuable for short-lived code.
Larger tests are valuable for long-running code and its long-term health.

Maintenance

When importing a dependency, we need to think about the maintenance cost, the reliability of the dependency (e.g. tests, author’s credibility…), and the cost of developing it ourselves.

Migration and Iteration

Migration can be expensive. For example, among build systems, the artifact-based approach (e.g. Bazel) scales better but is less flexible than the task-based approach (e.g. Gradle). Migrating from an existing task-based system can be difficult and is not always worth it if the build isn’t already showing problems in speed or correctness.
It’s always easier to design a better interface in the absence of legacy constraints.

Make the system evolve over time.

For example

Drive decision making with data, and revisit a decision when the underlying data changes.
Code style rules can change as usage patterns change.
A reader can update or correct a doc while reading it, even if they are not the author.

My takeaways and thoughts:

Consider the lifespan of a thing, its maintenance cost and use different strategies accordingly.

For example, you need to carefully consider the HOA fee and its future increase when buying a condo and planning to hold it for a long time. Otherwise, if you are just renting a condo, you probably don’t need to pay too much attention to future maintenance costs.

Migration can be expensive.

For example, moving from one country to another can be expensive in terms of physical moving cost, energy spent, new environment ramp up and so on.

Sometimes abandoning the old thing and starting a completely new thing is easier than maintaining the old thing

For example, if a house is too old, it might be better to tear it down and construct a new one.

Keep improving and adjusting based on the latest conditions.

For example, when planning for a long period such as one or more years, it may be better to set an open, abstract goal for the whole period, define detailed tasks only for the first short term (e.g. one quarter), and then review and adjust the detailed tasks for later terms.

Scale with a huge codebase

Consistency

Making code style consistent across the whole organization eases the development of common code formatters, static analysis tools and large-scale refactoring.
Use a monorepo, which allows only a single version of all code and dependencies to exist at any time.

At larger scales, the cost of manually managing dependencies is much less than the cost of dealing with issues caused by automatic dependency management.

A distributed version control system like Git allows each organization to have its own source-of-truth branch. For example, Google and Oracle have different master branches of Linux, and each org can apply its own policies and processes to its branch.
Most problems in dependency management stop being problems when you can see exactly how your code is being used and know exactly the impact of any given change.
Branches are a drag on productivity. In many cases we think complex branch and merge strategies are a perceived safety crutch—an attempt to keep the trunk stable. As we see throughout this book, there are other ways to achieve that outcome.

Simple and less is better for a component

If it is hard to keep a code file’s overview comment short, the API design in the file is probably bad and the file should be broken into multiple components.
We should focus on how much functionality is delivered per unit of code and try to maximize that metric. More code means more complexity.
To improve a document, often less is better.
Narrowly scoped tests are encouraged.
In tests, to reduce the number of fakes that need to be maintained, a fake should typically be created only at the root API level.
In tests, if a fake’s clients span multiple languages, create a single fake service implementation and have tests configure the client libraries to send requests to this fake service.
For build systems, one of the more surprising lessons Google has learned is that limiting engineers’ power and flexibility can improve their productivity. So Google uses a Bazel-like build tool, which is an artifact-based build system.
We should treat resources such as deployment containers like cattle, which can be managed at scale, rather than like pets.

Treating your containers or servers as cattle means that your service can get back to a healthy state automatically. But additional effort is needed to make sure it can function smoothly under a moderate rate of failures, because the scheduler can kill nodes.

So we need smaller tasks and better load balancers for the containers.
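To illustrate the "smaller tasks" half of that point, here is a minimal sketch in Python. It is not from the book: `WorkerKilled`, `chunked` and `run_chunks` are hypothetical names, and the scheduler is only simulated by an exception. The idea is that when work is split into small chunks, a killed node costs one chunk of rework instead of the whole job.

```python
from typing import Callable, List


class WorkerKilled(Exception):
    """Hypothetical signal that the scheduler killed the node running a chunk."""


def chunked(items: List[int], size: int) -> List[List[int]]:
    """Split the work into small, independently retryable chunks."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_chunks(
    items: List[int],
    size: int,
    process: Callable[[List[int]], int],
    max_attempts: int = 3,
) -> int:
    """Process every chunk, retrying only the chunk that was interrupted."""
    total = 0
    for chunk in chunked(items, size):
        for attempt in range(1, max_attempts + 1):
            try:
                total += process(chunk)
                break  # chunk finished, move on to the next one
            except WorkerKilled:
                if attempt == max_attempts:
                    raise  # give up after repeated failures
    return total
```

A real system would also route each retry to a healthy replica, which is where the load balancer comes in. That part is left out of this sketch.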

My takeaways and thoughts:

Keeping things consistent could improve efficiency

For example, when you want to have a new mobile phone, it is always easier to migrate your data and application information from your existing phone to a phone with the same brand than to a different brand’s phone.

Communicate simply and concisely

For example, when giving other people tasks or explaining something to them, it is often better to give concise points, even though some details may be lost. Too much information tends to confuse people.

Divide your tasks into simple ones and better organize them.

Large and ambiguous tasks are not only hard to handle as a whole but also create a lot of mental burden to work on them. So dividing your tasks into simple and small ones and organizing them in an orderly manner could make it easy for you to finish the large task.

Scale in human productivity

For teams

Set clear goals for the team.
Set up a central team that has more incentive and is more efficient at tackling difficult or boring projects such as building tech infrastructure, removing tech debt and building engineering productivity tools.

For individuals

Failing to keep a test suite deterministic and fast ensures it will become a roadblock to productivity.
Engineering productivity tools should aim to create recommendations that are built into the developer workflow and incentive structures.
Build actionable and relevant deprecation warnings.

Actionable means the warning should offer developers easy ways to replace the deprecated library or function with the latest library or functions.

Relevant means the warning should only occur when a developer makes any change in places related to the deprecated code. Otherwise, the warning will usually be ignored.

Make static code analysis part of the core developer workflow
Focus on developer happiness for static code analysis
Make it easy for developers to contribute to custom static analysis plugins.
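The "actionable and relevant" deprecation warning described above can be sketched with Python’s standard `warnings` module. This is only an illustration under assumptions: the `deprecated` decorator and the `parse_config` / `parse_config_v2` functions are hypothetical names, not something the book prescribes. The message names the replacement (actionable), and `stacklevel=2` attributes the warning to the caller’s own line, so it appears next to the code the developer is working on (relevant).

```python
import functools
import warnings


def deprecated(replacement: str):
    """Mark a function as deprecated and name the function to use instead."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__}() is deprecated; call {replacement}() instead.",
                category=DeprecationWarning,
                stacklevel=2,  # report the caller's line, not this wrapper
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


def parse_config_v2(text: str) -> dict:
    """Hypothetical replacement API: parses 'key=value' lines."""
    return dict(line.split("=", 1) for line in text.splitlines() if line)


@deprecated(replacement="parse_config_v2")
def parse_config(text: str) -> dict:
    """Hypothetical old API, kept only for existing callers."""
    return parse_config_v2(text)
```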

My takeaways and thoughts:

Responsibility with incentive

When we want to ask a person or a team to take on some responsibility, we need to provide them with incentives. Incentive includes not only reward and ownership; it is also affected by the following factors.

If the task is simple to do

For example, if you want other people to clean the room, give them enough cleaning tools.

If the task is actionable and relevant

For example, if you don’t want to go to the grocery and want someone to buy some food for you, you had better tell him what food you exactly need and ask him to buy from a grocery which is on his regular commute route.

If the underlying resources are deterministic and cheap to use

For example, if you want someone to deliver packages, choose a delivery route with deterministic and cheap buses for him.

If the task is required in major workflow

For example, if you want your children or students to practice handwriting, you can require good handwriting in their regular homework.

If a clear goal is set for the task

If you want your kids to behave better in school, you need to give them clear goals such as making more friends, learning more math etc.

If the task makes the people doing it happy

If you want your kids to develop some special skills, you should choose a skill your children are interested in and have talents in.

Reliability

Hyrum’s law

Definition: With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

For example

APIs, containers, services and libraries can all have implicit dependencies.
Containers can also accumulate many implicit dependencies, which are an antipattern and hard to remove. For example, it is currently hard to migrate Google’s Borg (task scheduler and container) to behave differently and support more features.
Some teams don’t even know they depend on certain tools, libraries or APIs. One way to find these dependencies dynamically is through tests of increasing frequency and duration during which the old system is temporarily turned off. These intentional changes provide a mechanism for discovering unintended dependencies by seeing what breaks, thus alerting teams to a need to prepare for the upcoming deadline.
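A small Python sketch may make Hyrum’s law more concrete. Everything here is a made-up example (`find_user`, `fragile_caller` and `robust_caller` are hypothetical names): the documented contract only promises a `KeyError`, but one caller has quietly come to depend on the exact message text, which is an observable behavior nobody promised.

```python
def find_user(users: dict, name: str) -> str:
    """Documented contract: raises KeyError if the user is missing."""
    if name not in users:
        raise KeyError(f"user not found: {name}")
    return users[name]


def fragile_caller(users: dict, name: str) -> str:
    """Depends on the exact message text, which the contract never promised."""
    try:
        return find_user(users, name)
    except KeyError as err:
        if "user not found" in str(err):
            return "guest"
        raise  # any other wording falls through as an unexpected error


def robust_caller(users: dict, name: str) -> str:
    """Depends only on the documented exception type."""
    try:
        return find_user(users, name)
    except KeyError:
        return "guest"
```

If the library author later rewords the message, `fragile_caller` starts re-raising the error while `robust_caller` keeps working, even though the documented contract never changed.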

My takeaways and thoughts:

Everything which is possible even with a tiny probability will eventually happen.

For example, stock market crash…

Fast and short feedback loops make things more reliable

Collecting as much feedback as early as possible makes things more reliable.

Development

More exposure and feedback in development flow results in better design and implementation.

Developers tend to hide their work until it is launched due to their strong ego. But more exposure in all stages of their development actually improves their work’s quality.

Allow your team members to try and fail at small things, but be careful with important things.
Aim to create engineering productivity recommendations that are built into the developer workflow and incentive structures, so developers learn best practices while doing their regular work.
The earlier a bug is caught, the lower the cost to fix it.
Visibility into test history empowers engineers to share and collaborate on feedback, an essential requirement for disparate teams to diagnose and learn from integration failures between their systems.

Release

The more often a codebase is changed, the stronger it becomes.
More small releases are better.
Faster is safer: Ship early and often and in small batches to reduce the risk of each release and to minimize time to market.
At scale, increased complexity usually manifests as increased release latency. Frequent release trains allow for minimal divergence from a known good position, with the recency of changes aiding in resolving issues.

My takeaways and thoughts:

Look for frequent feedback with few new changes

For example, when learning a new language, it is better to use what you learn immediately, by either writing or speaking, and to collect feedback right after using it.

You become stronger if you exercise more

For example, though you may not become a top player even if you play a lot of basketball, your body will at least become stronger and more reliable as long as you don’t over exercise.

Try and fail small things

For example, if you like a community, before you buy a house there, maybe live there for a short period or even visit there several times first to check whether you really like it.

Trade-off

As software engineers, we are asked to make more complex decisions with higher-stakes outcomes, often based on imprecise estimates of time and growth. So there are many trade-offs we need to make in our development workflow.

Documentation

Documentation audience: novice vs expert

A novice audience wants details in order to understand your documentation, while an expert audience wants a concise doc they can skim quickly. So you need to make a trade-off on your document’s length to satisfy both parties. One way to achieve that is to write your doc in two passes.

Write descriptively to explain complex topics clearly for novices.
Remove duplicate information where you can to shorten the doc for experts.

Trade off completeness, accuracy and clarity based on the document’s purpose.

For example, if the document is a guideline, then clarity should be prioritized. If it is a reference, then completeness should be prioritized.

Leadership

Working in isolation is needed for coding, but it can make people think you are hard to communicate with. So be careful about the trade-off.
Leadership Always Be Deciding: Ambiguous problems have no magic answer; they’re all about finding the right trade-offs of the moment, and iterating.

Tests

Before writing a fake, a trade-off needs to be made on whether the productivity improvements that will result from the use of the fake outweigh the costs of writing and maintaining it.
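To give a feel for what is being weighed here, this is a minimal sketch of a fake in Python. The names `KeyValueStore`, `FakeKeyValueStore` and `copy_key` are hypothetical and only for illustration. The fake is a lightweight in-memory implementation of the same interface as the real dependency, so tests run fast and deterministically, but it has to be kept in sync with the real implementation’s behavior, which is the maintenance cost mentioned above.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Hypothetical interface shared by the real store and its fake."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class FakeKeyValueStore:
    """In-memory fake: behaves like the real store, without network or disk."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


def copy_key(store: KeyValueStore, source: str, target: str) -> bool:
    """Code under test: copy a value to a new key if the source exists."""
    value = store.get(source)
    if value is None:
        return False
    store.put(target, value)
    return True
```

A test then passes `FakeKeyValueStore()` wherever the production code would receive the real store.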

Release

There is no perfect binary—decisions and trade-offs have to be made every time a new change is released into production. Key performance indicator metrics with clear thresholds allow features to launch even if they aren’t perfect and can also create clarity in otherwise contentious launch decisions.

My takeaways and thoughts:

To make a good decision based on trade-offs, probably the following steps are needed.

Clearly define the qualities you are trading off, for example availability vs consistency, and define a metric for each.
Meet the bottom line of each quality. For example, determine the minimum level of availability and consistency your system needs and satisfy both.
Pick one quality to maximize for your major goal. For example, if your system is a messaging app, then availability should be prioritized so that messages are delivered immediately.
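The three steps above can be sketched as a small selection routine. This is just one way to express the idea, under assumptions: `Option`, `choose` and the two numeric scores are hypothetical, and any numbers you feed in are your own metrics rather than real measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Option:
    """A candidate design, scored on the two qualities being traded off."""
    name: str
    availability: float  # step 1: a clearly defined metric for each quality
    consistency: float


def choose(
    options: List[Option],
    min_availability: float,
    min_consistency: float,
) -> Optional[Option]:
    """Keep options that meet both bottom lines, then maximize availability."""
    # Step 2: drop anything below the minimum acceptable level of either quality.
    viable = [
        o for o in options
        if o.availability >= min_availability and o.consistency >= min_consistency
    ]
    # Step 3: among the rest, maximize the quality that serves the major goal.
    return max(viable, key=lambda o: o.availability, default=None)
```

Returning `None` when nothing meets the bottom lines is a deliberate choice in this sketch: it signals that the requirements themselves need to be revisited instead of silently picking a bad option.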

