Implementing Site Reliability Engineering

ISRECMCPrimeCM-ISRE1.0 <ul> <li>Identify what SRE is and what it is not.</li><li>Compare SRE to DevOps.</li><li>Understand the difference between service-level indicators (SLI), service-level objectives (SLO), and service-level agreements (SLA).</li><li>Develop the technical and professional skills an SRE needs.</li><li>Determine what makes up a good SRE team.</li><li>Practice common ceremonies like blameless postmortems and production readiness reviews.</li><li>Gain an understanding of error budgets and how to calculate reliability costs.</li><li>Embed SREs within development teams to increase operational stability.</li></ul>This site reliability engineering training course is perfect for anyone in the IT/SDLC field looking to implement SRE teams and practices in their organization. Professionals who may benefit include: <ul> <li>Software Engineers</li><li>Systems Engineers</li><li>Network Engineers</li><li>Technical Program Managers</li><li>Anyone in an IT Leadership role</li><li>CIOs / CTOs</li><li>Anyone involved with IT infrastructure</li><li>IT Operations Staff</li></ul><h4>Part 1 – Introduction</h4><ul> <li>1. Introduction</li><li>2. The Production Environment at Google, From the Viewpoint of an SRE</li><li>3. Exercise: Mapping Your Production Environment</li></ul><h4>Part 2 – Principles</h4> 1. Embracing Risk <ul> <li>Managing Risk</li><li>Measuring Service Risk</li><li>Risk Tolerance of Services</li><li>Motivation for Error Budgets</li></ul>2. Service-Level Objectives <ul> <li>Service Level Terminology</li><li>Indicators in Practice</li><li>Objectives in Practice</li><li>Agreements in Practice</li><li>Exercise: Setting Service-Level Objectives</li></ul>3. Eliminating Toil <ul> <li>What Is Toil?</li><li>Why Less Toil is Better</li><li>What Qualifies as Engineering?</li><li>Is Toil Always Bad?</li></ul>4. 
Monitoring Distributed Systems <ul> <li>Definitions</li><li>Why Monitor?</li><li>Setting Reasonable Expectations</li><li>Symptoms Versus Causes</li><li>Black Box Versus White Box</li><li>The Four Golden Signals</li><li>Worrying About Your Tail</li><li>Choosing an Appropriate Resolution for Measurements</li><li>As Simple as Possible, No Simpler</li><li>Tying These Principles Together</li><li>Monitoring for the Long Term</li></ul>5. The Evolution of Automation at Google <ul> <li>The Value of Automation</li><li>The Value for Google SRE</li><li>Use Cases for Automation</li><li>Automate Yourself Out of a Job</li><li>Soothing the Pain: Applying Automation to Cluster Turnups</li><li>Borg: Birth of the Warehouse-Scale Computer</li><li>Reliability is the Fundamental Feature</li></ul>6. Release Engineering <ul> <li>The Role of a Release Engineer</li><li>Philosophy</li><li>Continuous Build and Deployment</li><li>Configuration Management</li></ul>7. Simplicity <ul> <li>System Stability Versus Agility</li><li>The Virtue of Boring</li><li>I Won't Give Up My Code!</li><li>The "Negative Lines of Code" Metric</li><li>Minimal APIs</li><li>Modularity</li><li>Release Simplicity</li></ul><h4>Part 3 – Practices</h4>1. Practical Alerting <ul> <li>Time-Series Monitoring Outside of Google</li><li>Instrumentation of Applications</li><li>Exporting Variables</li><li>Collection of Exported Data</li><li>Storage in the Time-Series Arena</li><li>Rule Evaluation</li><li>Alerting</li><li>Sharding the Monitoring Topology</li><li>Black-Box Monitoring</li><li>Maintaining the Configuration</li></ul>2. Being On-Call <ul> <li>The Life of an On-Call Engineer</li><li>Balanced On-Call</li><li>Feeling Safe</li><li>Avoiding Inappropriate Operational Load</li></ul>3. Effective Troubleshooting <ul> <li>Theory</li><li>In Practice</li><li>The Magic of Negative Results</li><li>Making Troubleshooting Easier</li><li>Exercise: Distributed System Troubleshooting</li></ul>4. 
Emergency Response <ul> <li>What to Do When Systems Break</li><li>Test-Induced Emergency</li><li>Challenge-Induced Emergency</li><li>Process-Induced Emergency</li><li>Don't Repeat the Past—Learn From It</li></ul>5. Managing Incidents <ul> <li>Unmanaged Incidents</li><li>Managed Incidents</li><li>When to Declare an Incident</li><li>Elements of Incident Management Process</li></ul>6. Postmortem Culture: Learning from Failure <ul> <li>Google's Postmortem Philosophy</li><li>Collaborate and Share Knowledge</li><li>Introducing a Postmortem Culture</li><li>Exercise: Blameless Postmortem</li></ul>7. Tracking Outages <ul> <li>Escalator</li><li>Outalator</li></ul>8. Testing for Reliability <ul> <li>Types of Software Testing</li><li>Creating a Test and Build Environment</li><li>Testing at Scale</li></ul>9. Software Engineering in SRE <ul> <li>Why is Software Engineering Within SRE Important?</li><li>Auxon Case Study</li><li>Intent-Based Capacity Planning</li><li>Fostering Software Engineering in SRE</li></ul>10. Load Balancing at the Front End <ul> <li>Load Balancing Using DNS</li><li>Load Balancing at the Virtual IP Address</li></ul>11. Load Balancing in the Datacenter <ul> <li>Identifying Bad Tasks: Flow Control and Lame Ducks</li><li>Limiting the Connections Pool with Subsetting</li><li>Load-Balancing Policies</li></ul>12. Handling Overload <ul> <li>The Pitfalls of "Queries Per Second"</li><li>Per-Customer Limits</li><li>Client-Side Throttling</li><li>Criticality</li><li>Utilization Signals</li><li>Handling Overload Errors</li><li>Load from Connections</li></ul>13. Addressing Cascading Failures <ul> <li>Causes of Cascading Failures and Designing to Avoid Them</li><li>Preventing Server Overload</li><li>Slow Startup and Cold Caching</li><li>Triggering Conditions for Cascading Failures</li><li>Testing for Cascading Failures</li><li>Immediate Steps to Address Cascading Failures</li></ul>14. 
Managing Critical State: Distributed Consensus for Reliability <ul> <li>Motivating the Use of Consensus: Distributed Systems Coordination Failure</li><li>How Distributed Consensus Works</li><li>System Architecture Patterns for Distributed Consensus</li><li>Distributed Consensus Performance</li><li>Deploying Distributed Consensus-Based Systems</li></ul>15. Distributed Periodic Scheduling with Cron <ul> <li>Cron Jobs and Idempotency</li><li>Cron at Large Scale</li><li>Building Cron at Google</li></ul>16. Data Processing Pipelines <ul> <li>Origin of the Pipeline Design Pattern</li><li>Initial Effect of Big Data on the Simple Pipeline Pattern</li><li>Challenges with the Periodic Pipeline Pattern</li><li>Trouble Caused by Uneven Work Distribution</li><li>Drawbacks of Periodic Pipelines in Distributed Environments</li><li>Introduction to Google Workflow</li><li>Stages of Execution in Workflow</li><li>Ensuring Business Continuity</li></ul>17. Data Integrity: What You Read Is What You Wrote <ul> <li>Data Integrity's Strict Requirements</li><li>Google SRE Objectives in Maintaining Data Integrity and Availability</li><li>How Google SRE Faces the Challenges of Data Integrity</li><li>1T Versus 1E: Not "Just" a Bigger Backup</li><li>Knowing that Data Recovery Will Work</li><li>Case Studies</li><li>General Principles of SRE as Applied to Data Integrity</li></ul>18. Reliable Product Launches at Scale <ul> <li>Launch Coordination Engineering</li><li>Setting Up a Launch Process</li><li>Developing a Launch Checklist</li><li>Selected Techniques for Reliable Launches</li><li>Development of LCE</li><li>Exercise: Develop a Production Readiness Review</li></ul><h4>Part 4 – Management</h4>1. 
Accelerating SREs to On-Call and Beyond <ul> <li>You've Hired Your Next SRE, Now What?</li><li>Initial Learning Experiences: The Case for Structure Over Chaos</li><li>Creating Stellar Reverse Engineers and Improvisational Thinkers</li><li>Reverse Engineering a Production Service</li><li>Five Practices for Aspiring On-Callers</li><li>On-Call and Beyond: Rites of Passage and Practicing Continuing Education</li></ul>2. Dealing with Interrupts <ul> <li>Managing Operational Load</li><li>Factors in Determining How Interrupts Are Handled</li><li>Imperfect Machines</li></ul>3. Embedding an SRE to Recover from Operational Overload <ul> <li>Phase 1: Learn the Service and Get Context</li><li>Phase 2: Sharing Context</li><li>Phase 3: Driving Change</li></ul>4. Communication and Collaboration in SRE <ul> <li>Communications: Production Meetings</li><li>Collaboration Within SRE</li><li>Case Study: Viceroy</li><li>Collaboration Outside SRE</li><li>Case Study: Migrating DFP to F1</li></ul>5. The Evolving SRE Engagement Model <ul> <li>SRE Engagement: What, How, and Why</li><li>The PRR Model</li><li>The SRE Engagement Model</li><li>Production Readiness Reviews: Simple PRR Model</li><li>Evolving the Simple PRR Model: Early Engagement</li><li>Evolving Services Development: Frameworks and SRE Platform</li></ul><h4>Part 5 – Conclusions</h4><ul> <li>1. Lessons Learned From Other Industries</li><li>2. Conclusion</li></ul>
3 days 2450.00 3295.00