v1.0.4, September 2024, SRE-Elite.com

Revision History

1.0.4 revisions:
• Restructured Chapter 3, Section 2, "R&D Assurance", and added 2 case studies, including "Global R&D Assurance Practice for a Large-Scale Game", adding 27,000 characters.
• Restructured Chapter 3, Section 5, "Incident Response", updated it based on the June 22, 2024 salon at Xiaomi in Beijing, and added 5 case studies, including "Xiaomi's Experience with Incident Emergency Response", adding 45,000 characters.

1.0.3 revisions:
• Updated Chapter 3, Section 4, "Change Management", based on the April 13, 2024 salon at B站 (Bilibili) in Shanghai, adding about 40,000 characters, including 6 enterprise case studies of different types.

1.0.2 revisions:
• Added a copyright notice: CC BY-ND 4.0
• Fixed the table of contents missing entry 3.1.1
• Updated the date in the page header
• Corrected some typos

Table of Contents

Chapter 1 SRE Overview
  1.1 Preface
  1.2 History of SRE
  1.3 Goals of SRE
Chapter 2 SRE Organizational Structure
Chapter 3 SRE Functions
  1 Reliability Architecture Design
      1.1.1 Distributed design
      1.1.2 Decoupling design
      1.1.3 Redundancy design
      1.1.4 Circuit-breaker design
      1.1.5 Rate-limiting design
      1.1.6 Degradation design
      1.1.7 Observability design
    1.2 Infrastructure Assurance
      1.2.1 Multi-active data centers
      1.2.2 Network disaster recovery
    1.3 Data Disaster Recovery
      1.3.1 Data backup
      1.3.2 Data rollback
  2 R&D Assurance
      2.1.1 Code reliability
        2.1.1.1 Code defects
        2.1.1.2 Coding standards
        2.1.1.3 Code security
        2.1.1.4 Cyclomatic complexity
        2.1.1.5 Code duplication
        2.1.1.6 Code comments and API documentation
        2.1.1.7 Code quality red lines
      2.1.2 Code repository reliability
        2.1.2.1 Repository performance
        2.1.2.2 Repository disaster recovery
        2.1.2.3 Repository security
        2.1.2.4 Repository scalability
      2.1.3 Build reliability
        2.1.3.1 Build efficiency
        2.1.3.2 Build success rate
      2.1.4 Artifact reliability
        2.1.4.1 Artifact download reliability
        2.1.4.2 Artifact deployment reliability
        2.1.4.3 Artifact security
    2.2 R&D Assurance Engineering System Design
      2.2.1 Continuous integration pipeline for R&D assurance
      2.2.2 Observability design for R&D assurance
      2.2.3 Operations scheduling platform for R&D assurance
      2.2.4 ITSM platform for R&D assurance
      2.2.5 Container platform for R&D assurance
      2.2.6 Build acceleration platform for R&D assurance
    2.3 R&D Assurance Case Studies
      2.3.1 Tencent Games global R&D assurance practice
      2.3.2 R&D process assurance practice at a voice live-streaming company
        Why SRE Elite selected this case
  3 Admission Control
    3.1 Runtime Environment Adaptation
      3.1.1 Operating environment design
      3.1.2 Container cloud adaptation
      3.1.3 Database and storage adaptation
      3.1.4 Xinchuang (信创, domestic IT innovation) adaptation
    3.2 Runtime Environment Delivery
      3.2.1 Basic resource services
      3.2.2 Observability strategy
      3.2.3 Automation strategy
    3.3 Testing Strategy
      3.3.1 Connectivity verification
      3.3.2 Functional testing
      3.3.3 Performance stress testing
      3.3.4 Data migration
    3.4 Change Review
      3.4.1 Stability architecture design assessment
      3.4.2 Non-functional technical assessment
      3.4.3 Change assurance readiness assessment
      3.4.4 Launch assurance assessment for new systems or new services
  4 Change Management
    4.1 Relationship Between Release Management and Change Management
    4.2 Change System Design
      4.2.1 Change system design principles
      4.2.2 Change and release process design
      4.2.3 Engineering system design for changes
    4.3 Change Management Case Studies
      4.3.1 Design and practice of change risk prevention and control at B站 (Bilibili)
      4.3.2 Infrastructure change management practice on the Ctrip cloud platform
      4.3.3 Change management design and practice at a bank
    4.4 Release Management Case Studies
      4.4.1 Building an agile release platform at China Mobile Internet (中移互联网)
      4.4.2 Building an integrated change platform at a securities firm
      4.4.3 GitOps release management practice for games
  5 Incident Response
    5.1 Incident Response System Design
      5.1.1 Incident response system design principles
      5.1.2 Incident response process design
        5.1.2.1 Incident detection
          5.1.2.1.1 Detection by monitoring
          5.1.2.1.2 Detection by inspection
      5.1.3 Manual reporting (public sentiment, customer service, operations staff, etc.)
    5.2 Incident Diagnosis
      5.2.1 Emergency coordination
      5.2.2 Fault scoping
      5.2.3 Impact assessment (number of users affected, scope, escalation level)
    5.3 Incident Recovery
      5.3.1 Architecture self-healing
      5.3.2 Emergency plans (known plans)
      5.3.3 Emergency maintenance (manual intervention, no known plan)
      5.3.4 Recovery verification
    5.4 Incident Postmortem