[SREelite & Stability Assurance Lab (稳定性保障实验室)]: SRE Practice White Paper 1.0.3

SRE Practice White Paper 1.0.3


v1.0.3, June 2024, SRE-Elite.com

Revision History

1.0.3 revisions:
• Chapter 3, Section 4, "Change Management," was updated with about 40,000 characters of new material based on the Shanghai Bilibili (B站) salon of April 13, 2024, including 6 enterprise case studies of different types

1.0.2 revisions:
• Added a copyright notice: CC BY-ND 4.0
• Fixed the missing 3.1.1 entry in the table of contents
• Corrected the date in the page header
• Fixed some typos

Table of Contents

Chapter 1 Overview of SRE
  1.1 Preface
  1.2 History of SRE
  1.3 Goals of SRE
Chapter 2 SRE Organizational Structure
Chapter 3 SRE Functions
  1 Reliability Architecture Design
      1.1.1 Distributed Design
      1.1.2 Decoupling Design
      1.1.3 Redundancy Design
      1.1.4 Circuit Breaking Design
      1.1.5 Rate Limiting Design
      1.1.6 Degradation Design
      1.1.7 Observability Design
    1.2 Infrastructure Assurance
      1.2.1 Multi-Active Data Centers
      1.2.2 Network Disaster Recovery
    1.3 Data Disaster Recovery
      1.3.1 Data Backup
      1.3.2 Data Rollback
  2 R&D Assurance
      2.1.1 Code Defects
      2.1.2 Coding Standards
      2.1.3 Code Security
      2.1.4 Cyclomatic Complexity
      2.1.5 Code Duplication
      2.1.6 Code Comments and API Documentation
      2.1.7 Code Quality Red Lines
    2.2 Code Repository Reliability
      2.2.1 Repository Performance
      2.2.2 Repository Disaster Recovery
      2.2.3 Repository Security
      2.2.4 Repository Scalability
    2.3 Build Reliability
      2.3.1 Build Efficiency
      2.3.2 Build Success Rate
    2.4 Artifact Reliability
      2.4.1 Artifact Download Reliability
      2.4.2 Artifact Deployment Reliability
      2.4.3 Artifact Security Reliability
  3 Admission Control
    3.1 Runtime Environment Adaptation
      3.1.1 Operating Environment Design
      3.1.2 Container Cloud Adaptation
      3.1.3 Database and Storage Adaptation
      3.1.4 Xinchuang (信创, IT application innovation) Adaptation
    3.2 Runtime Environment Delivery
      3.2.1 Basic Resource Services
      3.2.2 Observability Strategy
      3.2.3 Automation Strategy
    3.3 Testing Strategy
      3.3.1 Connectivity Verification
      3.3.2 Functional Testing
      3.3.3 Performance Stress Testing
      3.3.4 Data Migration
    3.4 Change Review
      3.4.1 Stability Architecture Design Assessment
      3.4.2 Non-Functional Technical Assessment
      3.4.3 Change Assurance Preparation Assessment
      3.4.4 Launch Assurance Assessment for New Systems or New Business
  4 Change Management
    4.1 Relationship Between Release Management and Change Management
    4.2 Change System Design
      4.2.1 Change System Design Principles
      4.2.2 Change and Release Process Design
      4.2.3 Engineering System Design for Changes
    4.3 Change Management Case Studies
      4.3.1 Design and Practice of Change Prevention and Control at Bilibili
      4.3.2 Infrastructure Change Management Practice on the Ctrip Cloud Platform
      4.3.3 Change Management Design and Practice at a Bank
    4.4 Release Management Case Studies
      4.4.1 Building an Agile Release Platform at China Mobile Internet (中移互联网)
      4.4.2 Building an Integrated Change Platform at a Securities Firm
      4.4.3 GitOps Release Management Practice for Games
  5 Incident Response
      5.1.1 Detection via Monitoring
      5.1.2 Detection via Routine Inspection
      5.1.3 Manual Reporting (public sentiment, customer service, operations staff, etc.)
    5.2 Fault Diagnosis
      5.2.1 Emergency Coordination
      5.2.2 Fault Scoping
      5.2.3 Impact Assessment (number of users affected, scope, escalation level)
    5.3 Fault Recovery
      5.3.1 Architectural Self-Healing
      5.3.2 Emergency Plans (known plans)
      5.3.3 Emergency Maintenance (manual intervention, no known plan)
      5.3.4 Recovery Verification
    5.4 Incident Postmortem
      5.4.1 Organizing the Postmortem
      5.4.2 Root Cause Analysis
      5.4.3 Defining Improvements
      2. How to Do Incident Improvement Well
      3. Recording Improvement Measures
      3.5.4.4 Issue Tracking
  6 Continuous Optimization After Launch
    6.1 User Experience Optimization
      6.1.1 Direct User Experience Optimization on the User Side
      6.1.2 Indirect User Experience Optimization on the System Side
    6.2 Major Technical Assurance
      6.2.1 Overall Coordination Assurance
      6.2.2 Technical Solution Assurance
      6.2.3 Tool Reliability Assurance
      6.2.4 Emergency Event Assurance
      6.2.5 Example 1: Assurance for Suspending Game Services on a Day of Mourning
      6.2.6 Example 2: Core Assurance Process and Plan for Transaction Promotion Events
      6.2.