"Stuff" happens. It's an imperfect world, especially when it comes to software. No amount of safeguards or preparation can prevent every possible incident, and sometimes simple human error can have significant consequences for a development project. You can't avoid every incident, but you can learn from each one and take steps to minimize the chances of a similar incident in the future. The question is how software organizations can best go about learning from mistakes with agile postmortems.
Retrospective or postmortem?
There are two methods of analyzing the past to learn and improve that are prevalent in business today: the retrospective and the postmortem. There's a fair amount of confusion when it comes to the terms "retrospective" and "postmortem." You can also add the phrase "after action review" to the list of terms that many people use interchangeably.
To muddy the waters further, the retrospective is an evolution of a traditional postmortem, and now a new deep postmortem has emerged as an evolution of the retrospective. Developers conducted postmortems before agile but only at the completion of a project, sort of the final step before closing the book and moving to the next project. The retrospective applies that concept throughout development, but there's still a need to do a deeper dive after the fact, which has led to the new and improved postmortem. In the age of DevOps and continuous development and deployment, though, a project may never really be "complete" and the opportunity for a postmortem may never occur.
Andrew Storms, VP of security services at New Context, describes the retrospective as a component of the agile development methodology. He says that the retrospective occurs at the end of an iteration or sprint within the project development lifecycle, an opportunity for the team to get together and talk about what went well and what could have been done better. A list of action items is created and the team is empowered to enact the changes necessary to improve.
A post on the Software Process and Measurement blog illustrates the difference between a retrospective and a classic postmortem: "The differences between agile retrospectives and postmortems boil down [to] differences in mindset. Agile retrospectives have a bias for action. Agile retrospectives are about making changes in how work is being done. The team does not need permission to make the changes demanded by an agile retrospective. Finally, agile retrospectives are done to help the team, rather than [as] a means of validating and maintaining a standard process."
According to TK Keanini, CTO of Lancope, the two are similar or interchangeable to an extent. He styles the postmortem as simply the final retrospective for a project that has concluded. "The key differentiator for me is in the term 'mortem,' whereby something is dead or the process finality has happened, and one individual or the team can look back at it in its entirety, whereas a retrospective can happen retroactively at every step of the way."
One way that postmortems or after action reviews (AARs) differ from retrospectives is that they're often a function of incident response after an event of some sort, like an outage. Storms says, "Unfortunately, few organizations hold postmortems after an event with a positive outcome. For this reason, postmortems have been associated with a negative connotation. It's important for organizations to remember they should spend a few minutes recapping an event as a group, regardless if the event's outcome was overall positive or negative. There is important information to be shared and learned regardless of any event's outcome."
Don't point fingers
DevOps is composed of three key components: the core operational principles, the processes and forms of work that reinforce those principles, and the application of those principles to daily work. Postmortems fit into the second component, reinforcing the core principles by learning from past mistakes and improving for the future. However, in order for the postmortem to be effective, it has to be done without casting blame.
Some hospitals conduct postmortems or medical review boards. Doctors and nurses are human and make mistakes, sometimes fatal mistakes. The discussion of those mistakes often occurs behind closed doors, though, so medical professionals can learn from each other and avoid repeating those mistakes without undermining public confidence in the hospital. There has to be a system that allows medical professionals to discuss mistakes honestly without fear of legal or professional repercussions. If blame were assigned and consequences had to be faced, doctors and nurses would be more likely to simply cover up mistakes, and nothing would be learned.
The same is true for a postmortem in IT. In order to encourage individuals to report issues and mistakes in the first place, and to foster an honest discussion about what led to the issue and how to avoid it in the future, there has to be some amnesty from blame.
A blog post by Joao Miranda describes the philosophy behind not placing blame. "Blameless postmortems assume that humans have good intentions in the general case. If this assumption is not held, the organisation will try to find someone to blame. In that case, the engineers involved will withheld [sic] information for fear of being punished and so will guarantee that the failure will happen again in the future."
Storms explains, "The blameless recap of any event is by far one of the hardest issues in a company to overcome. For many people, their natural instinct is to cast blame or defend themselves. If you consider that most postmortems are only occurring during events with negative outcomes, then the company only associates postmortems with unhealthy and negative cultural interactions. The keys to blameless meetings is first to hold them regularly, with a standard process, and second is to foster a culture where failure is acceptable and is used as a learning tool."
Best practices for both
Whether you're conducting retrospectives as a function of the ongoing development process or a postmortem at the end of a major project or as a part of incident response, there are some best practices you should follow.
1. No blame allowed: The Retrospective Prime Directive applies nicely, regardless of whether you're conducting a retrospective or postmortem:
Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.
2. Continuous improvement: A retrospective or postmortem in any practice is about learning from the historical record. Keanini stresses, "The most important thing to ask is if the telemetry and findings are complete and how we can prepare for the next retrospective or postmortem with better telemetry and visibility to the processes and domain of interest." Keanini also points out that you can't add monitoring for an event after it has already happened, so it's important to continuously improve the fidelity of your telemetry to ensure you're capturing the right metrics and information in the right way to provide maximum value.
3. Learn from history: Lucas Welch, director of communications at Chef, describes how Chef employs retrospectives and postmortems with an emphasis on using the lessons as a learning experience to prevent similar issues in the future. "We have a postmortem write-up, with the timeline, root cause, and corrective actions published in a private repository. Then, we schedule an internal postmortem meeting, where the incident leader for the problem discusses with others who were involved what happened, why, and how to prevent the same thing from happening. This is a learning experience for everyone, and these meetings are conducted in a blameless manner. For any production incident that could impact our users, we publish a public blog detailing the incident and resolution, as well as a forum for feedback."
4. Take action: Welch says that Chef conducts retrospectives or postmortems with two important guidelines. First, the company doesn't focus on past events from a perspective of "could've" or "should've." Second, all follow-up action items are assigned to a team or individual before the end of the meeting. "If the item cannot be a top priority leaving the meeting, we don't make it a follow-up item."
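The write-up and follow-up rules described in items 3 and 4 can be captured in a small data structure. The following is a minimal sketch in Python, not Chef's actual tooling: the class names (`ActionItem`, `PostmortemRecord`), the method names, and the sample incident data are all hypothetical and exist only for illustration. It assumes a record holds a timeline, a root cause, and corrective actions, and that a meeting should not end while any follow-up item lacks an owner.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ActionItem:
    """A follow-up item from the meeting (hypothetical structure)."""
    description: str
    owner: Optional[str] = None  # team or individual; None means unassigned


@dataclass
class PostmortemRecord:
    """Write-up with timeline, root cause, and corrective actions."""
    title: str
    timeline: List[str] = field(default_factory=list)
    root_cause: str = ""
    corrective_actions: List[ActionItem] = field(default_factory=list)

    def unassigned_actions(self) -> List[ActionItem]:
        # Items that nobody has taken ownership of yet.
        return [item for item in self.corrective_actions if not item.owner]

    def ready_to_close(self) -> bool:
        # Assumed rule: a root cause is recorded and every item has an owner.
        return bool(self.root_cause) and not self.unassigned_actions()


# Made-up sample data, for illustration only.
record = PostmortemRecord(
    title="Example outage",
    timeline=["Alert fired", "Rollback started", "Service restored"],
    root_cause="Configuration change deployed without review",
    corrective_actions=[
        ActionItem("Require review for configuration changes", owner="platform team"),
        ActionItem("Document rollback steps"),  # no owner yet
    ],
)

for item in record.unassigned_actions():
    print(f"Needs an owner before the meeting ends: {item.description}")
```

In this sketch, `ready_to_close()` stays false until the second item is either assigned or dropped from the list, which mirrors the guideline that an item that can't be prioritized shouldn't become a follow-up at all.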
Analyzing past performance and identifying the root causes of issues has always been valuable for organizations. In the fast-paced world of DevOps, with its continuous development and deployment, the retrospective and postmortem are more crucial than ever. Many projects have no clear end, and it's important for organizations to continuously learn from mistakes and adapt in real time to ensure the same problems aren't needlessly repeated.