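I spent the weekend reading through "On Designing and Deploying Internet-Scale Services" and checked it against my own years of experience. I got a lot out of it. Without further ado, here are my excerpts: * marks strong agreement, ? marks something I do not fully understand yet. The original paper can be downloaded as a PDF.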
1 Expect failures.
2 Keep things simple.
3 Automate everything.
- The basic design tenets and considerations we have laid out above are:
1 design for failure
2 implement redundancy and fault recovery
3 depend upon a commodity hardware slice
4 support single-version software.
5 implement multi-tenancy.
- More specific best practices for designing operations-friendly services are:
1 Quick service health check
2 Develop in the full environment
3 Zero trust of underlying components:
common techniques are to:
I) continue to operate on cached data in read-only mode or
II) continue to provide service to all but a tiny fraction of the user base during the short time while the service is accessing the redundant copy of the failed component.
4 Do not build the same functionality in multiple components.
5 One pod or cluster should not affect another pod or cluster.
6 Allow (rare) emergency human intervention. It’s very interesting.
7 Keep things simple and robust.
8 Enforce admission control at all levels.
9 Partition the service.
The paper recommends using a look-up table at the mid-tier that maps fine-grained entities, typically users, to the system where their data is managed.
10 Understand the network design. ???
11 Analyze throughput and latency.****
12 Treat operations utilities as part of the service.
13 Understand access patterns.???
What impacts will this feature have on the rest of the infrastructure?
14 Version everything.
15 Keep the unit/functional tests from the last release.
16 Avoid single points of failure. ****
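Item 9 above recommends a mid-tier look-up table that maps fine-grained entities to the system holding their data. The following is only a minimal sketch of that idea, assuming an in-memory dict and made-up partition names; `PartitionDirectory`, `assign`, `locate`, and `move` are hypothetical names for illustration, not something the paper defines.

```python
class PartitionDirectory:
    """Hypothetical mid-tier look-up table: user id -> partition name."""

    def __init__(self, partitions):
        if not partitions:
            raise ValueError("at least one partition is required")
        self._partitions = list(partitions)
        self._table = {}  # user_id -> partition name

    def assign(self, user_id):
        """Place a new user on the least-loaded partition (ties: first listed)."""
        if user_id in self._table:
            return self._table[user_id]
        load = {p: 0 for p in self._partitions}
        for p in self._table.values():
            load[p] += 1
        target = min(self._partitions, key=lambda p: load[p])
        self._table[user_id] = target
        return target

    def locate(self, user_id):
        """Return the partition that currently holds this user's data."""
        return self._table[user_id]

    def move(self, user_id, partition):
        """Repoint a single user; no other user's mapping changes."""
        if partition not in self._partitions:
            raise ValueError(f"unknown partition: {partition}")
        if user_id not in self._table:
            raise KeyError(user_id)
        self._table[user_id] = partition
```

Because the mapping is explicit rather than computed from a hash, one hot user can be moved to another partition without touching anyone else, which is presumably why the paper prefers a table here.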
- Automatic Management and Provisioning:
1 Be restartable and redundant
2 Support geo-distribution.
3 Automatic provisioning and installation
4 Configuration and code as a unit.
5 Manage server roles or personalities rather than servers.
6 Multi-system failures are common. ****
7 Recover at the service level.
8 Never rely on local storage for non-recoverable information.
9 Keep deployment simple.
10 Fail services regularly. ****
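For item 4 above (configuration and code as a unit), one way to picture it is to bundle the code version and its configuration into a single immutable value with one fingerprint, so the pair is deployed and rolled back together. This is a sketch under my own assumptions: `ReleaseUnit`, `build`, and `fingerprint` are hypothetical names, and the config values are assumed to be JSON-serializable with string keys.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseUnit:
    """Hypothetical bundle: one code version plus the config it ships with."""
    code_version: str
    config: tuple  # sorted (key, value) pairs, so the unit cannot be mutated

    @classmethod
    def build(cls, code_version, config):
        # Sort the items so that dict ordering never changes the identity.
        return cls(code_version, tuple(sorted(config.items())))

    def fingerprint(self):
        """Single identifier covering code and config together."""
        payload = json.dumps(
            {"code": self.code_version, "config": list(self.config)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to either half yields a different fingerprint, so "what is running on this server" has one answer instead of two.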
- Dependency Management
1 Expect latency.
Ensure all interactions have appropriate timeouts. ***
2 Isolate failures. ***
3 Use shipping and proven components. ???
4 Implement inter-service monitoring and alerting.
5 Dependent services require the same design point.
Same SLA as the depending service.
6 Decouple components.
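Items 1 and 2 above (timeouts on every interaction, isolating failures) can be illustrated with a small wrapper that bounds how long a caller waits on a dependency and falls back to a default instead of propagating the failure. This is a sketch of one possible approach using Python's standard `concurrent.futures`; `call_with_timeout` and the fallback value are my own illustrative names, not from the paper.

```python
import concurrent.futures
import time

# Shared worker pool for outbound dependency calls (the size is an assumption).
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() but wait at most timeout_s seconds for its result.

    On timeout or on any exception raised by fn, return `fallback`,
    so a slow or broken dependency does not take the caller down.
    """
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a call that is already running keeps running.
        future.cancel()
        return fallback
    except Exception:
        return fallback
```

Note the limitation in the comment: the timeout bounds the caller's wait, not the work itself, so the pool can still fill up if the dependency stays slow for long.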
- Release Cycle and Testing
The goal is to minimize the number of engineering and operations interactions.
1 Ship often
2 Use production data to find problems.
3 Invest in engineering
4 Support version roll-back
5 Maintain forward and backward compatibility.
6 Single-server deployment.
7 Stress test for load.
8 Perform capacity and performance testing prior to new releases.
9 Build and deploy shallowly and iteratively.
10 Test with real data. ***
11 Run system-level acceptance tests.
12 Test and develop in full environments.
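For item 5 above (forward and backward compatibility), a common pattern is a reader that fills in defaults for fields an older writer did not send and ignores fields a newer writer added. The record layout below (`name`, `locale`) and the function `parse_profile` are invented for illustration only.

```python
import json

# Fields this version of the reader understands, with their defaults.
DEFAULTS = {"name": "", "locale": "en-US"}


def parse_profile(raw):
    """Parse a JSON profile written by an older or newer release.

    Missing known fields fall back to DEFAULTS (backward compatibility);
    unknown extra fields are silently ignored (forward compatibility).
    """
    data = json.loads(raw)
    profile = dict(DEFAULTS)
    for key in DEFAULTS:
        if key in data:
            profile[key] = data[key]
    return profile
```

With a reader like this, old and new versions can run side by side during a rollout, which is also what makes version roll-back (item 4) practical.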
- Hardware Selection and Standardization
1 Use only standard SKUs.
2 Purchase full racks.
3 Write to a hardware abstraction.
4 Abstract the network and naming.
- Operations and Capacity Planning
1 Make the development team responsible.
2 Soft delete only.
3 Track resource allocation.
4 Make one change at a time.
5 Make Everything Configurable. ****
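Item 2 above (soft delete only) roughly means a delete marks data as gone instead of destroying it, so an operator mistake can be undone. A minimal in-memory sketch, assuming a retention window after which data is really purged; `SoftDeleteStore` and its methods are hypothetical names.

```python
import time


class SoftDeleteStore:
    """Hypothetical key-value store where delete only hides a row."""

    def __init__(self, clock=time.time):
        self._rows = {}
        self._deleted_at = {}  # key -> time of the soft delete
        self._clock = clock

    def put(self, key, value):
        self._rows[key] = value
        self._deleted_at.pop(key, None)

    def get(self, key):
        if key in self._deleted_at:
            raise KeyError(key)  # hidden from readers, still on disk
        return self._rows[key]

    def delete(self, key):
        if key not in self._rows:
            raise KeyError(key)
        self._deleted_at.setdefault(key, self._clock())

    def undelete(self, key):
        if self._deleted_at.pop(key, None) is None:
            raise KeyError(key)

    def purge(self, older_than_s):
        """Really remove rows soft-deleted at least older_than_s seconds ago."""
        cutoff = self._clock() - older_than_s
        expired = [k for k, t in self._deleted_at.items() if t <= cutoff]
        for k in expired:
            del self._rows[k]
            del self._deleted_at[k]
        return expired
```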
- Auditing, Monitoring and Alerting
Alerting is an art.
1 Instrument everything.
2 Data is the most valuable asset.
3 Have a customer view of service.
4 Instrumentation required for production testing.
5 Latencies are the toughest problem.
6 Have sufficient production data.
7 Configurable logging.
8 Expose health information for monitoring
9 Make all reported errors actionable.****
Give enough information to diagnose.
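Item 7 above (configurable logging) can be as simple as letting operations change log levels per component without a code change. A small sketch with Python's standard `logging` module; the logger name `svc.storage` and the function `apply_log_config` are made up for the example.

```python
import logging


def apply_log_config(config):
    """Apply a {logger_name: level_name} mapping, e.g. loaded from a config file.

    Logger.setLevel accepts level names as strings and raises ValueError
    for an unknown name, so a typo in the config fails loudly.
    """
    for logger_name, level_name in config.items():
        logging.getLogger(logger_name).setLevel(level_name.upper())
```

Usage: call `apply_log_config({"svc.storage": "debug"})` while chasing a problem, then apply `{"svc.storage": "warning"}` again to quiet it down.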
- Graceful Degradation and Admission Control
1 Support a "big red switch".
2 Control admission.
3 Meter admission.
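The three items above can be sketched together: a token bucket meters how fast requests are admitted, and a "big red switch" sheds everything that is not critical. This is my own illustration under simple assumptions (single process, a binary critical flag); `AdmissionController` and its parameters are hypothetical names, and the token bucket is a technique I chose, not one the paper prescribes.

```python
import time


class AdmissionController:
    """Hypothetical admission gate: token bucket plus a big red switch."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self._rate = rate_per_s      # tokens refilled per second
        self._burst = burst          # maximum tokens held at once
        self._tokens = float(burst)
        self._clock = clock
        self._last = clock()
        self.big_red_switch = False  # True: shed all non-critical load

    def admit(self, critical=False):
        """Return True if the request may enter the system."""
        if self.big_red_switch and not critical:
            return False
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self._burst, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

After an outage, starting with a low `rate_per_s` and raising it gradually is one way to let users back in slowly instead of all at once.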
- Customer and Press Communication Plan
Even without a client, if users interact with the system via web pages, the system state can still be communicated to them.