If there were an award for biggest improvement in software planning in the 21st century, the concept of velocity would be a strong candidate. And what could be easier than figuring out how fast the team is going and how far away the goal is, and then dividing to calculate when the team will arrive?
Yet projects continue to be late, and companies continue to realize that the project will be late. Here's why velocity goes wrong so frequently, and how to fix it.
Where velocity comes from: Story points
When the team that invented Extreme Programming started work on the Chrysler Comprehensive Compensation (C3) project in the late 1990s, it had a scope for the entire three-year project, which the team broke down into stories. Each story was "pointed" as the number of perfect engineering half-days it would require to complete. Of course, those half-days failed to account for taking bathroom breaks, reading and responding to email, going to meetings, getting stuck, and debugging. Humans are always thrown off by these things, but they are usually off by about the same amount, so the team that set out to measure half-days still delivered at a predictable ratio to its estimates, and velocity was born.
In Extreme Programming Installed, Jeffries, Anderson, and Hendrickson suggested a ratio of 2:1 or 3:1 for actual performance compared to idealized performance. Knowing that communication would get confusing, they suggested dropping "perfect engineering days" for another unit of measure, proposing gummy bears as the artificial unit of software work. Eventually the industry settled on story points.
Measure—or goal?
Story points themselves are subjective, defined by the team. Adding up the story points accomplished in a two-week sprint yields the velocity, which is useful for planning.
Once the measure is published, it looks a lot like a report card on team performance. When I was at Socialtext, it was common for senior management to encourage us to have a "push" sprint, to "just get velocity a little higher." Instead of the 40 points we were averaging per sprint, we would be asked to get to 45. Of course, once we got to 45, 45 was the new normal.
My old professor Roger Ferguson used to say, "Be careful what you measure—because you're going to get it." Companies that ask the team for more velocity are going to get more points. However, the easiest way to get more points is to inflate the point value of each story.
Velocity needs to be a measure of team performance used for prediction. When it becomes a goal, it ceases to be a good measure, because people will corrupt the measurement to accomplish the new goal.
The average velocity fallacy
The simplest way to calculate velocity is to take a rolling average of the past few sprints. If a team has earned 35, 40, 30, and 42 points for the past four sprints, then its average is 36.75; for planning purposes, we can call it 37.
Assuming the team has 350 points of work left, it should be done in about 9.46 sprints, which rounds down to 9.
First, notice that rounding from 36.75 to 37 has affected the outcome—pure math would put the number of sprints at 9.52, which rounds up to 10. Second, consider that teams using averages for planning will be “late,” “behind,” or slow half the time—because that is how averages work.
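Here is a minimal sketch of that planning math in Python, using the hypothetical sprint history and remaining scope from the example above; the point is how much the rounding step moves the answer.

```python
# A minimal sketch, assuming the sprint history and remaining scope
# from the example above (hypothetical numbers).
recent_velocities = [35, 40, 30, 42]   # points completed in the last four sprints
remaining_points = 350

average = sum(recent_velocities) / len(recent_velocities)   # 36.75

# Dividing by the exact average versus the rounded average gives different answers.
sprints_exact = remaining_points / average           # ~9.52, which rounds up to 10
sprints_rounded = remaining_points / round(average)  # ~9.46, which "rounds down" to 9

print(f"average velocity: {average:.2f}")
print(f"sprints (exact average):   {sprints_exact:.2f}")
print(f"sprints (rounded average): {sprints_rounded:.2f}")
```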
That half-the-time lateness leads to the poor-velocity death spiral. The technical staff have a bad sprint or two. Management exhorts them to "go faster," "push," or "catch up." The staff then implement stories as quickly as humanly possible, earning a great many points, deferring plenty of known bugs, and leaving plenty more unfound. Over the next few sprints, the technical staff resort to copying and pasting code, take shortcuts, and skip testing, until the process begins to slow down. Eventually the software simply does not work and needs several sprints of remediation, it is rejected by the customer, or, at best, the team slows back down, ending up in worse shape than it was before the push.
One way to make more accurate predictions with velocity is to use confidence intervals.
Confidence intervals
Instead of the average, we could look to the median, the number in the middle. Or, better yet, to all of the numbers, in a sorted list. With a list of ten sprints, the median falls halfway between the fifth and sixth values; it is also where confidence is about 50%. The bottom number, the worst velocity on record, provides 90% confidence: things have only been that bad, or worse, 10% of the time.
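A rough sketch of reading confidence straight off the sprint history as described above; the ten velocities here are made up for illustration.

```python
# A rough sketch, assuming ten sprints of made-up history.
velocities = [28, 31, 33, 35, 36, 38, 40, 41, 42, 45]
remaining_points = 350

history = sorted(velocities)
n = len(history)

# Median: with ten sprints, halfway between the fifth and sixth values.
median = (history[n // 2 - 1] + history[n // 2]) / 2   # roughly 50% confidence

# The worst sprint on record: velocity has only been this bad, or worse,
# about 10% of the time, so it gives roughly 90% confidence.
worst = history[0]

print(f"about 50% confidence: {remaining_points / median:.1f} sprints")
print(f"about 90% confidence: {remaining_points / worst:.1f} sprints")
```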
Troy Magennis, the principal of Focused Objective and a former vice president of software engineering for Sabre Holdings, has introduced a set of methods, free tools, and downloads that run through how to do this kind of math to make better projections of velocity over time.
Extra and unplanned work
The classic model of estimating is to point all the stories, get a total, and divide that total by velocity to forecast a finish. The problem is that the total might change. Features estimated at the start of a project often turn out to be more complex, require additional, unplanned work, need escalated support, and so on. A team may do well to add a budget for emergent work or spikes, as SAFe does with its concept of an innovation and planning (IP) iteration. If things go well, the team can spend one sprint out of eight doing research and development and prototypes. If things are going badly, the team can use it to finish up the planned and unplanned work.
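To see what that buffer does to a forecast, here is a small sketch assuming a team that reserves one sprint in eight for emergent work; the velocity and remaining scope are hypothetical numbers, not a prescription.

```python
# A small sketch, assuming the team reserves one sprint in eight for
# emergent work or spikes; velocity and remaining scope are hypothetical.
velocity = 37           # points per normal delivery sprint
remaining_points = 350
reserve_ratio = 1 / 8   # one buffer/innovation sprint out of every eight

delivery_sprints = remaining_points / velocity
# Stretch the calendar to account for the reserved sprints.
calendar_sprints = delivery_sprints / (1 - reserve_ratio)

print(f"delivery sprints needed: {delivery_sprints:.1f}")
print(f"calendar sprints, with the buffer: {calendar_sprints:.1f}")
```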
Normalizing cross-team story points
One of the most tempting approaches to comparing teams is to normalize velocity. That is, a story rated 3 points by one team would come up as 3 points for all ten teams on the program. Perhaps pointing is done at a higher level, maybe the release train. That way any team can grab any feature. Best of all, team performance can be compared as easily as looking at a spreadsheet.
I do not recommend this.
It might be possible to do this for some teams. One recent client of mine had over a hundred teams, the vast majority of which were working on web applications with a similar architecture. At times the teams swapped features. This sort of thing is vanishingly rare—usually the technology stacks and business expertise are so different as to make this kind of comparison meaningless.
As Cem Kaner once pointed out at STPCon, this kind of competition discourages the sharing of information. Teams that have tips and good practices will keep them to themselves, or else risk losing their edge. The same problem can happen at the individual level with annual reviews; this can be even more damaging to morale.
Worst of all, unless the technology is nearly identical within teams, the comparison is meaningless.
So let a story that modifies the data warehouse be a 3 for most teams and a 5 for the application development team. The consequences of normalizing points are not worth the benefits.
How to fix it
Work on velocity accuracy at the individual team level, and gather and use historical data. This works at every level, from project (the number of points for similar projects) to sprint (generating confidence intervals) to story (points for similar-sized stories should represent a similar amount of complexity).
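At the story level, one way to use that historical data is to group completed stories by point value and see whether the actual effort clusters; the sketch below assumes invented (points, days) pairs purely for illustration.

```python
# A hedged sketch: group completed stories by point value and look at how
# consistent the actual effort was. The (points, days) pairs are invented.
from collections import defaultdict
from statistics import mean, pstdev

completed = [(1, 0.5), (1, 1.0), (2, 1.5), (2, 2.0), (3, 2.5), (3, 6.0), (3, 3.0)]

by_points = defaultdict(list)
for points, days in completed:
    by_points[points].append(days)

for points, days in sorted(by_points.items()):
    spread = pstdev(days) if len(days) > 1 else 0.0
    print(f"{points}-point stories: mean {mean(days):.1f} days, spread {spread:.1f}")

# A wide spread for a given point value is a hint that those stories were not
# really similar-sized, and the historical data is saying so.
```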
Velocity is the result of good decisions, not the cause. Instead of exhorting the team to increase velocity, try to do things well. That means improving the value of the stand-up meeting and making retrospectives actionable, with takeaways that are followed up on. It means watching the flow (or WTF) of stories across the board and making changes to improve cycle time or reduce backwash from bugs.
If a team can do that, velocity will take care of itself.