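1705:36.7 KLM first officer completes the pre-takeoff checklist. KLM 4805 is at the end of the runway, ready to depart.
1705:41.5 KLM first officer (first challenge): "Wait a minute, we don't have takeoff clearance." At this point KLM was already preparing to advance the throttles.
1705:51.2 KLM captain: "No, I know that. Go ahead and ask."
Tower to KLM: "KLM4805, you are cleared to the Papa Beacon, climb to and maintain 9,000 feet, turn right after takeoff onto heading 040 until intercepting the 325 radial of the Las Palmas VOR."
1706:09.6 - 1706:17.8 KLM to tower: "Roger, cleared to the Papa Beacon, 9,000 feet, intercepting the 325 radial. We are at takeoff (or: "we are taking off")."
1706:20.3 Pan Am first officer to tower: "We are still taxiing down the runway, PA1736." (Struck through in the original: this transmission overlapped with the tower's transmission. The KLM crew heard only high-frequency noise, and the tower, which was transmitting at the time, did not know the communication had been interfered with.)
Tower to Pan Am: "Roger, PA1736, please report when you have left the main runway."
1706:29.6 - 1706:31.7 Pan Am first officer: "OK, we will report when we are clear."
1706:34.15 Pan Am captain (cockpit): "Let's get the hell out of here."
1706:34.7 Pan Am flight engineer (cockpit): "Yeah. He kept us waiting for more than half an hour, that [expletive], and now he's in a rush."
KLM captain (cockpit): "What did you say?"
KLM flight engineer (third challenge): "Is he not clear of the runway, that Pan Am flight?"
1706:35.7 KLM captain (cockpit): "OK."
1706:40.5 Pan Am captain sees the landing lights of the KLM aircraft about 700 meters away.
Pan Am flight engineer (cockpit): "Get off! Get off! Get off, quick!"
By then it was already too late.

What makes a team effective?

Psychological Safety

Psychological safety refers to an individual's perception of the consequences of taking an interpersonal risk, or a belief that a team is safe for risk taking in the face of being seen as ignorant, incompetent, negative, or disruptive. In a team with high psychological safety, teammates feel safe to take risks around their team members.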
They feel confident that no one on the team will embarrass or punish anyone else for admitting a mistake, asking a question, or offering a new idea. [Source: rework]

Meeting scenarios:
● Share information
● Making decisions
● Creating solutions
● Building relationships
● Retrospect
● Broadcasting information
● Sharing inspiration
● Negotiating

Company Meeting

We are our own image consultants and best image protectors. [Source: Amy Edmondson]

1. Employees don't ask many questions during meetings.
2. Employees don't feel comfortable owning up to mistakes, or place blame on others when mistakes are made.
3. The team avoids difficult conversations and hot-button topics; there are hardly any disagreements or conflicts.
4. Executives and team leaders tend to dominate meeting discussions.
5. Employees don't often venture outside of their job descriptions to support other teammates.
…

Post-mortem, scenario 1: handling an urgent product failure (blameless)

● Understand root cause
● Prevent recurrence

A blamelessly written post-mortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. Engineers can give a detailed account of the incident without fear of punishment or retribution. [Source: CodeCraft & Google SRE handbook]

SHAME/BLAME CYCLE [Source: codecraft]

1. Incident: Engineer takes action and contributes to a failure or incident.
2. Punishment: Engineer is punished, shamed, blamed, or retrained.
3. Trust Issue: Reduced trust between engineers on the ground and management looking for someone to scapegoat.
4. Silence: Engineers become silent on details about actions/situations/observations, resulting in "Cover-Your-Ass" engineering (from fear of punishment).
5. Management Unawareness: Management becomes less aware and informed on how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure due to the silence mentioned in #4.
6. More errors: Errors become more likely, and latent conditions can't be identified.

SHAME/BLAME CYCLE [Source: codecraft]

Incident: Engineer takes action and contributes to a failure or incident.
Punishment: Engineer is punished, shamed, blamed, or retrained.
Collaboration Issue: Other engineers become less likely to help the on-call engineer for fear of being held accountable.
Isolated engineer: Solves the problem with much less help.
Vulnerable System: As fewer engineers solve system breakage together, it takes longer to pin down the root cause.

First Story vs. Second Story

First Story: Human error is seen as the cause of failure.
Second Story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.

First Story: Saying what people should have done is a satisfying way to describe failure.
Second Story: Saying what people should have done doesn't explain why it made sense for them to do what they did.

First Story: Telling people to be more careful will make the problem go away.
Second Story: Only by constantly seeking out its vulnerabilities can organizations enhance safety.

Incident handling: blameless

Regular Post-Mortem: You didn't send the data on time, so I couldn't create the report.
Blameless Post-Mortem: I didn't get the data I needed to create the report.

Regular Post-Mortem: James should have ordered the packaging supplies sooner.
Blameless Post-Mortem: The packaging supplies weren't delivered on time.

Regular Post-Mortem: Bob didn't proofread the text, so all the documents had to be reprinted.
Blameless Post-Mortem: There was an error in the text, so all the documents had to be reprinted.

Regular Post-Mortem: The project manager didn't have contact information for everyone on the shared drive, so I wasn't able to get in touch with the IT contact when the server went down.
Blameless Post-Mortem: Contact information for everyone wasn't posted on the shared drive, so I wasn't able to get in touch with the IT contact when the server went down.

Examples 1 to 3:
● Production service with 99.99% availability.
● Long ticket queue with 1-day SLA.
● Non-automated, flaky launch process.

Hero: Every day (including weekends), they look at the monitoring, spot-check graphs, roll back bad canaries by hand, and spot memory use increases before alerts