<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE FL_Course SYSTEM "https://www.flane.de/dtd/fl_course095.dtd"><?xml-stylesheet type="text/xsl" href="https://www.fastlanetraining.ca/css/xml-course.xsl"?><course productid="32704" language="en" source="https://www.fastlanetraining.ca/xml-course/cprime-isre" lastchanged="2024-08-22T11:47:32-04:00" parent="https://www.fastlanetraining.ca/xml-courses"><title>Implementing Site Reliability Engineering</title><productcode>ISRE</productcode><vendorcode>CM</vendorcode><vendorname>CPrime</vendorname><fullproductcode>CM-ISRE</fullproductcode><version>1.0</version><objective>&lt;ul&gt;
&lt;li&gt;Identify what SRE is and what it is not.&lt;/li&gt;&lt;li&gt;Compare SRE to DevOps.&lt;/li&gt;&lt;li&gt;Understand the difference between service-level indicators (SLI), service-level objectives (SLO), and service-level agreements (SLA).&lt;/li&gt;&lt;li&gt;Develop the technical and professional skills an SRE needs.&lt;/li&gt;&lt;li&gt;Determine what makes up a good SRE team.&lt;/li&gt;&lt;li&gt;Practice common ceremonies like blameless postmortems and production readiness reviews.&lt;/li&gt;&lt;li&gt;Gain an understanding of error budgets and how to calculate reliability costs.&lt;/li&gt;&lt;li&gt;Embed SREs within development teams to increase operational stability.&lt;/li&gt;&lt;/ul&gt;</objective><audience>&lt;p&gt;This site reliability engineering training course is perfect for anyone in the IT/SDLC field looking to implement SRE teams and practices in their organization.&lt;/p&gt;
&lt;p&gt;Professionals who may benefit include:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Software Engineers&lt;/li&gt;&lt;li&gt;Systems Engineers&lt;/li&gt;&lt;li&gt;Network Engineers&lt;/li&gt;&lt;li&gt;Technical Program Managers&lt;/li&gt;&lt;li&gt;Anyone in an IT Leadership role&lt;/li&gt;&lt;li&gt;CIOs / CTOs&lt;/li&gt;&lt;li&gt;Anyone involved with IT infrastructure&lt;/li&gt;&lt;li&gt;IT Operations Staff&lt;/li&gt;&lt;/ul&gt;</audience><outline>&lt;h4&gt;Part 1 &amp;ndash; Introduction&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;1. Introduction&lt;/li&gt;&lt;li&gt;2. The Production Environment at Google, From the Viewpoint of an SRE&lt;/li&gt;&lt;li&gt;3. Exercise: Mapping Your Production Environment&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Part 2 &amp;ndash; Principles&lt;/h4&gt;&lt;p&gt;
&lt;strong&gt;1. Embracing Risk&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Managing Risk&lt;/li&gt;&lt;li&gt;Measuring Service Risk&lt;/li&gt;&lt;li&gt;Risk Tolerance of Services&lt;/li&gt;&lt;li&gt;Motivation for Error Budgets&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;2. Service-Level Objectives&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Service Level Terminology&lt;/li&gt;&lt;li&gt;Indicators in Practice&lt;/li&gt;&lt;li&gt;Objectives in Practice&lt;/li&gt;&lt;li&gt;Agreements in Practice&lt;/li&gt;&lt;li&gt;Exercise: Setting Service-Level Objectives&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;3. Eliminating Toil&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What Is Toil?&lt;/li&gt;&lt;li&gt;Why Less Toil is Better&lt;/li&gt;&lt;li&gt;What Qualifies as Engineering?&lt;/li&gt;&lt;li&gt;Is Toil Always Bad?&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;4. Monitoring Distributed Systems&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Definitions&lt;/li&gt;&lt;li&gt;Why Monitor?&lt;/li&gt;&lt;li&gt;Setting Reasonable Expectations&lt;/li&gt;&lt;li&gt;Symptoms Versus Causes&lt;/li&gt;&lt;li&gt;Black Box Versus White Box&lt;/li&gt;&lt;li&gt;The Four Golden Signals&lt;/li&gt;&lt;li&gt;Worrying About Your Tail&lt;/li&gt;&lt;li&gt;Choosing an Appropriate Resolution for Measurements&lt;/li&gt;&lt;li&gt;As Simple as Possible, No Simpler&lt;/li&gt;&lt;li&gt;Tying These Principles Together&lt;/li&gt;&lt;li&gt;Monitoring for the Long Term&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;5. The Evolution of Automation at Google&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Value of Automation&lt;/li&gt;&lt;li&gt;The Value for Google SRE&lt;/li&gt;&lt;li&gt;Use Cases for Automation&lt;/li&gt;&lt;li&gt;Automate Yourself Out of a Job&lt;/li&gt;&lt;li&gt;Soothing the Pain: Applying Automation to Cluster Turnups&lt;/li&gt;&lt;li&gt;Borg: Birth of the Warehouse-Scale Computer&lt;/li&gt;&lt;li&gt;Reliability is the Fundamental Feature&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;6. Release Engineering&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Role of a Release Engineer&lt;/li&gt;&lt;li&gt;Philosophy&lt;/li&gt;&lt;li&gt;Continuous Build and Deployment&lt;/li&gt;&lt;li&gt;Configuration Management&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;7. Simplicity&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;System Stability Versus Agility&lt;/li&gt;&lt;li&gt;The Virtue of Boring&lt;/li&gt;&lt;li&gt;I Won&amp;#039;t Give Up My Code!&lt;/li&gt;&lt;li&gt;The &amp;quot;Negative Lines of Code&amp;quot; Metric&lt;/li&gt;&lt;li&gt;Minimal APIs&lt;/li&gt;&lt;li&gt;Modularity&lt;/li&gt;&lt;li&gt;Release Simplicity&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Part 3 &amp;ndash; Practices&lt;/h4&gt;&lt;p&gt;&lt;strong&gt;1. Practical Alerting&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Time-Series Monitoring Outside of Google&lt;/li&gt;&lt;li&gt;Instrumentation of Applications&lt;/li&gt;&lt;li&gt;Exporting Variables&lt;/li&gt;&lt;li&gt;Collection of Exported Data&lt;/li&gt;&lt;li&gt;Storage in the Time-Series Arena&lt;/li&gt;&lt;li&gt;Rule Evaluation&lt;/li&gt;&lt;li&gt;Alerting&lt;/li&gt;&lt;li&gt;Sharding the Monitoring Topology&lt;/li&gt;&lt;li&gt;Black-Box Monitoring&lt;/li&gt;&lt;li&gt;Maintaining the Configuration&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;2. Being On-Call&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Life of an On-Call Engineer&lt;/li&gt;&lt;li&gt;Balanced On-Call&lt;/li&gt;&lt;li&gt;Feeling Safe&lt;/li&gt;&lt;li&gt;Avoiding Inappropriate Operational Load&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;3. Effective Troubleshooting&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Theory&lt;/li&gt;&lt;li&gt;In Practice&lt;/li&gt;&lt;li&gt;The Magic of Negative Results&lt;/li&gt;&lt;li&gt;Making Troubleshooting Easier&lt;/li&gt;&lt;li&gt;Exercise: Distributed System Troubleshooting&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;4. Emergency Response&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What to Do When Systems Break&lt;/li&gt;&lt;li&gt;Test-Induced Emergency&lt;/li&gt;&lt;li&gt;Challenge-Induced Emergency&lt;/li&gt;&lt;li&gt;Process-Induced Emergency&lt;/li&gt;&lt;li&gt;Don&amp;#039;t Repeat the Past&amp;mdash;Learn From It&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;5. Managing Incidents&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Unmanaged Incidents&lt;/li&gt;&lt;li&gt;Managed Incidents&lt;/li&gt;&lt;li&gt;When to Declare an Incident&lt;/li&gt;&lt;li&gt;Elements of Incident Management Process&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;6. Postmortem Culture: Learning from Failure&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Google&amp;#039;s Postmortem Philosophy&lt;/li&gt;&lt;li&gt;Collaborate and Share Knowledge&lt;/li&gt;&lt;li&gt;Introducing a Postmortem Culture&lt;/li&gt;&lt;li&gt;Exercise: Blameless Postmortem&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;7. Tracking Outages&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Escalator&lt;/li&gt;&lt;li&gt;Outalator&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;8. Testing for Reliability&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Types of Software Testing&lt;/li&gt;&lt;li&gt;Creating a Test and Build Environment&lt;/li&gt;&lt;li&gt;Testing at Scale&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;9. Software Engineering in SRE&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Why is Software Engineering Within SRE Important?&lt;/li&gt;&lt;li&gt;Auxon Case Study&lt;/li&gt;&lt;li&gt;Intent-Based Capacity Planning&lt;/li&gt;&lt;li&gt;Fostering Software Engineering in SRE&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;10. Load Balancing at the Front End&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Load Balancing Using DNS&lt;/li&gt;&lt;li&gt;Load Balancing at the Virtual IP Address&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;11. Load Balancing in the Datacenter&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Identifying Bad Tasks: Flow Control and Lame Ducks&lt;/li&gt;&lt;li&gt;Limiting the Connections Pool with Subsetting&lt;/li&gt;&lt;li&gt;Load-Balancing Policies&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;12. Handling Overload&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Pitfalls of &amp;quot;Queries Per Second&amp;quot;&lt;/li&gt;&lt;li&gt;Per-Customer Limits&lt;/li&gt;&lt;li&gt;Client-Side Throttling&lt;/li&gt;&lt;li&gt;Criticality&lt;/li&gt;&lt;li&gt;Utilization Signals&lt;/li&gt;&lt;li&gt;Handling Overload Errors&lt;/li&gt;&lt;li&gt;Load from Connections&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;13. Addressing Cascading Failures&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Causes of Cascading Failures and Designing to Avoid Them&lt;/li&gt;&lt;li&gt;Preventing Server Overload&lt;/li&gt;&lt;li&gt;Slow Startup and Cold Caching&lt;/li&gt;&lt;li&gt;Triggering Conditions for Cascading Failures&lt;/li&gt;&lt;li&gt;Testing for Cascading Failures&lt;/li&gt;&lt;li&gt;Immediate Steps to Address Cascading Failures&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;14. Managing Critical State: Distributed Consensus for Reliability&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Motivating the Use of Consensus: Distributed Systems Coordination Failure&lt;/li&gt;&lt;li&gt;How Distributed Consensus Works&lt;/li&gt;&lt;li&gt;System Architecture Patterns for Distributed Consensus&lt;/li&gt;&lt;li&gt;Distributed Consensus Performance&lt;/li&gt;&lt;li&gt;Deploying Distributed Consensus-Based Systems&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;15. Distributed Periodic Scheduling with Cron&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cron Jobs and Idempotency&lt;/li&gt;&lt;li&gt;Cron at Large Scale&lt;/li&gt;&lt;li&gt;Building Cron at Google&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;16. Data Processing Pipelines&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Origin of the Pipeline Design Pattern&lt;/li&gt;&lt;li&gt;Initial Effect of Big Data on the Simple Pipeline Pattern&lt;/li&gt;&lt;li&gt;Challenges with the Periodic Pipeline Pattern&lt;/li&gt;&lt;li&gt;Trouble Caused by Uneven Work Distribution&lt;/li&gt;&lt;li&gt;Drawbacks of Periodic Pipelines in Distributed Environments&lt;/li&gt;&lt;li&gt;Introduction to Google Workflow&lt;/li&gt;&lt;li&gt;Stages of Execution in Workflow&lt;/li&gt;&lt;li&gt;Ensuring Business Continuity&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;17. Data Integrity: What You Read Is What You Wrote&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data Integrity&amp;#039;s Strict Requirements&lt;/li&gt;&lt;li&gt;Google SRE Objectives in Maintaining Data Integrity and Availability&lt;/li&gt;&lt;li&gt;How Google SRE Faces the Challenges of Data Integrity&lt;/li&gt;&lt;li&gt;1T Versus 1E: Not &amp;quot;Just&amp;quot; a Bigger Backup&lt;/li&gt;&lt;li&gt;Knowing that Data Recovery Will Work&lt;/li&gt;&lt;li&gt;Case Studies&lt;/li&gt;&lt;li&gt;General Principles of SRE as Applied to Data Integrity&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;18. Reliable Product Launches at Scale&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Launch Coordination Engineering&lt;/li&gt;&lt;li&gt;Setting Up a Launch Process&lt;/li&gt;&lt;li&gt;Developing a Launch Checklist&lt;/li&gt;&lt;li&gt;Selected Techniques for Reliable Launches&lt;/li&gt;&lt;li&gt;Development of LCE&lt;/li&gt;&lt;li&gt;Exercise: Develop a Production Readiness Review&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Part 4 &amp;ndash; Management&lt;/h4&gt;&lt;p&gt;&lt;strong&gt;1. Accelerating SREs to On-Call and Beyond&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You&amp;#039;ve Hired Your Next SRE, Now What?&lt;/li&gt;&lt;li&gt;Initial Learning Experiences: The Case for Structure Over Chaos&lt;/li&gt;&lt;li&gt;Creating Stellar Reverse Engineers and Improvisational Thinkers&lt;/li&gt;&lt;li&gt;Reverse Engineering a Production Service&lt;/li&gt;&lt;li&gt;Five Practices for Aspiring On-Callers&lt;/li&gt;&lt;li&gt;On-Call and Beyond: Rites of Passage and Practicing Continuing Education&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;2. Dealing with Interrupts&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Managing Operational Load&lt;/li&gt;&lt;li&gt;Factors in Determining How Interrupts Are Handled&lt;/li&gt;&lt;li&gt;Imperfect Machines&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;3. Embedding an SRE to Recover from Operational Overload&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Phase 1: Learn the Service and Get Context&lt;/li&gt;&lt;li&gt;Phase 2: Sharing Context&lt;/li&gt;&lt;li&gt;Phase 3: Driving Change&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;4. Communication and Collaboration in SRE&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Communications: Production Meetings&lt;/li&gt;&lt;li&gt;Collaboration Within SRE&lt;/li&gt;&lt;li&gt;Case Study: Viceroy&lt;/li&gt;&lt;li&gt;Collaboration Outside SRE&lt;/li&gt;&lt;li&gt;Case Study: Migrating DFP to F1&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;5. The Evolving SRE Engagement Model&lt;/strong&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SRE Engagement: What, How, and Why&lt;/li&gt;&lt;li&gt;The PRR Model&lt;/li&gt;&lt;li&gt;The SRE Engagement Model&lt;/li&gt;&lt;li&gt;Production Readiness Reviews: Simple PRR Model&lt;/li&gt;&lt;li&gt;Evolving the Simple PRR Model: Early Engagement&lt;/li&gt;&lt;li&gt;Evolving Services Development: Frameworks and SRE Platform&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Part 5 &amp;ndash; Conclusions&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;1. Lessons Learned From Other Industries&lt;/li&gt;&lt;li&gt;2. Conclusion&lt;/li&gt;&lt;/ul&gt;</outline><objective_plain>- Identify what SRE is and what it is not.
- Compare SRE to DevOps.
- Understand the difference between service-level indicators (SLI), service-level objectives (SLO), and service-level agreements (SLA).
- Develop the technical and professional skills an SRE needs.
- Determine what makes up a good SRE team.
- Practice common ceremonies like blameless postmortems and production readiness reviews.
- Gain an understanding of error budgets and how to calculate reliability costs.
- Embed SREs within development teams to increase operational stability.</objective_plain><audience_plain>This site reliability engineering training course is perfect for anyone in the IT/SDLC field looking to implement SRE teams and practices in their organization.

Professionals who may benefit include:



- Software Engineers
- Systems Engineers
- Network Engineers
- Technical Program Managers
- Anyone in an IT Leadership role
- CIOs / CTOs
- Anyone involved with IT infrastructure
- IT Operations Staff</audience_plain><outline_plain>Part 1 – Introduction


- 1. Introduction
- 2. The Production Environment at Google, From the Viewpoint of an SRE
- 3. Exercise: Mapping Your Production Environment
Part 2 – Principles


1. Embracing Risk



- Managing Risk
- Measuring Service Risk
- Risk Tolerance of Services
- Motivation for Error Budgets
2. Service-Level Objectives



- Service Level Terminology
- Indicators in Practice
- Objectives in Practice
- Agreements in Practice
- Exercise: Setting Service-Level Objectives
3. Eliminating Toil



- What Is Toil?
- Why Less Toil is Better
- What Qualifies as Engineering?
- Is Toil Always Bad?
4. Monitoring Distributed Systems



- Definitions
- Why Monitor?
- Setting Reasonable Expectations
- Symptoms Versus Causes
- Black Box Versus White Box
- The Four Golden Signals
- Worrying About Your Tail
- Choosing an Appropriate Resolution for Measurements
- As Simple as Possible, No Simpler
- Tying These Principles Together
- Monitoring for the Long Term
5. The Evolution of Automation at Google



- The Value of Automation
- The Value for Google SRE
- Use Cases for Automation
- Automate Yourself Out of a Job
- Soothing the Pain: Applying Automation to Cluster Turnups
- Borg: Birth of the Warehouse-Scale Computer
- Reliability is the Fundamental Feature
6. Release Engineering



- The Role of a Release Engineer
- Philosophy
- Continuous Build and Deployment
- Configuration Management
7. Simplicity



- System Stability Versus Agility
- The Virtue of Boring
- I Won't Give Up My Code!
- The &quot;Negative Lines of Code&quot; Metric
- Minimal APIs
- Modularity
- Release Simplicity
Part 3 – Practices

1. Practical Alerting



- Time-Series Monitoring Outside of Google
- Instrumentation of Applications
- Exporting Variables
- Collection of Exported Data
- Storage in the Time-Series Arena
- Rule Evaluation
- Alerting
- Sharding the Monitoring Topology
- Black-Box Monitoring
- Maintaining the Configuration
2. Being On-Call



- The Life of an On-Call Engineer
- Balanced On-Call
- Feeling Safe
- Avoiding Inappropriate Operational Load
3. Effective Troubleshooting



- Theory
- In Practice
- The Magic of Negative Results
- Making Troubleshooting Easier
- Exercise: Distributed System Troubleshooting
4. Emergency Response



- What to Do When Systems Break
- Test-Induced Emergency
- Challenge-Induced Emergency
- Process-Induced Emergency
- Don't Repeat the Past—Learn From It
5. Managing Incidents



- Unmanaged Incidents
- Managed Incidents
- When to Declare an Incident
- Elements of Incident Management Process
6. Postmortem Culture: Learning from Failure



- Google's Postmortem Philosophy
- Collaborate and Share Knowledge
- Introducing a Postmortem Culture
- Exercise: Blameless Postmortem
7. Tracking Outages



- Escalator
- Outalator
8. Testing for Reliability



- Types of Software Testing
- Creating a Test and Build Environment
- Testing at Scale
9. Software Engineering in SRE



- Why is Software Engineering Within SRE Important?
- Auxon Case Study
- Intent-Based Capacity Planning
- Fostering Software Engineering in SRE
10. Load Balancing at the Front End



- Load Balancing Using DNS
- Load Balancing at the Virtual IP Address
11. Load Balancing in the Datacenter



- Identifying Bad Tasks: Flow Control and Lame Ducks
- Limiting the Connections Pool with Subsetting
- Load-Balancing Policies
12. Handling Overload



- The Pitfalls of &quot;Queries Per Second&quot;
- Per-Customer Limits
- Client-Side Throttling
- Criticality
- Utilization Signals
- Handling Overload Errors
- Load from Connections
13. Addressing Cascading Failures



- Causes of Cascading Failures and Designing to Avoid Them
- Preventing Server Overload
- Slow Startup and Cold Caching
- Triggering Conditions for Cascading Failures
- Testing for Cascading Failures
- Immediate Steps to Address Cascading Failures
14. Managing Critical State: Distributed Consensus for Reliability



- Motivating the Use of Consensus: Distributed Systems Coordination Failure
- How Distributed Consensus Works
- System Architecture Patterns for Distributed Consensus
- Distributed Consensus Performance
- Deploying Distributed Consensus-Based Systems
15. Distributed Periodic Scheduling with Cron



- Cron Jobs and Idempotency
- Cron at Large Scale
- Building Cron at Google
16. Data Processing Pipelines



- Origin of the Pipeline Design Pattern
- Initial Effect of Big Data on the Simple Pipeline Pattern
- Challenges with the Periodic Pipeline Pattern
- Trouble Caused by Uneven Work Distribution
- Drawbacks of Periodic Pipelines in Distributed Environments
- Introduction to Google Workflow
- Stages of Execution in Workflow
- Ensuring Business Continuity
17. Data Integrity: What You Read Is What You Wrote



- Data Integrity's Strict Requirements
- Google SRE Objectives in Maintaining Data Integrity and Availability
- How Google SRE Faces the Challenges of Data Integrity
- 1T Versus 1E: Not &quot;Just&quot; a Bigger Backup
- Knowing that Data Recovery Will Work
- Case Studies
- General Principles of SRE as Applied to Data Integrity
18. Reliable Product Launches at Scale



- Launch Coordination Engineering
- Setting Up a Launch Process
- Developing a Launch Checklist
- Selected Techniques for Reliable Launches
- Development of LCE
- Exercise: Develop a Production Readiness Review
Part 4 – Management

1. Accelerating SREs to On-Call and Beyond



- You've Hired Your Next SRE, Now What?
- Initial Learning Experiences: The Case for Structure Over Chaos
- Creating Stellar Reverse Engineers and Improvisational Thinkers
- Reverse Engineering a Production Service
- Five Practices for Aspiring On-Callers
- On-Call and Beyond: Rites of Passage and Practicing Continuing Education
2. Dealing with Interrupts



- Managing Operational Load
- Factors in Determining How Interrupts Are Handled
- Imperfect Machines
3. Embedding an SRE to Recover from Operational Overload



- Phase 1: Learn the Service and Get Context
- Phase 2: Sharing Context
- Phase 3: Driving Change
4. Communication and Collaboration in SRE



- Communications: Production Meetings
- Collaboration Within SRE
- Case Study: Viceroy
- Collaboration Outside SRE
- Case Study: Migrating DFP to F1
5. The Evolving SRE Engagement Model



- SRE Engagement: What, How, and Why
- The PRR Model
- The SRE Engagement Model
- Production Readiness Reviews: Simple PRR Model
- Evolving the Simple PRR Model: Early Engagement
- Evolving Services Development: Frameworks and SRE Platform
Part 5 – Conclusions


- 1. Lessons Learned From Other Industries
- 2. Conclusion</outline_plain><duration unit="d" days="3">3 days</duration><pricelist><price country="US" currency="USD">2450.00</price><price country="CA" currency="CAD">3295.00</price></pricelist><miles/></course>