Managing and controlling the process of continuous integration (CI), especially with modern methodologies such as agile that emphasize fast delivery, is one of the biggest challenges software companies experience. Doing so successfully requires collaboration between many different stakeholders.
In order to manage the many interactions required, organizations usually create the role of CI owner. This person tracks the build, identifies who's responsible when a build or test fails, monitors any failures, identifies unstable tests, locks the codebase in the case of major errors, and conducts many other actions designed to keep the system green. The CI owner frequently brings statistics on metrics such as the reasons for test failures, compilation errors, environment problems, who most often breaks the build, and more. It's a challenging job, but the role includes many repetitive actions that you can and should automate.
Our approach: Hubot + Slack
We use Slack for collaboration, along with the Hubot automation framework to automate our CI owner role as much as possible. Using Hubot, we created a robot that's responsible for managing our CI flows. After a build break, the bot receives all relevant information about the failure, including build number, status, a list of committers, and lists of failing sub-jobs and failing tests. The bot captures this information, opens a new Slack channel, and invites all committers and relevant stakeholders.
It then provides all relevant information about the build failure and manages communication around it, including the ability to tag the root cause of the failure for future analysis. When the build is green-lighted again, the bot archives the channel.
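The channel-opening step can be sketched roughly as follows. This is a minimal Python illustration of the decision logic only, not the Hubot script itself: the `BuildFailure` fields, the channel naming scheme, and the `plan_failure_channel` helper are assumptions made for this example, and no Slack API is called.

```python
from dataclasses import dataclass, field


@dataclass
class BuildFailure:
    """Hypothetical payload the bot receives after a build break."""
    build_number: int
    status: str
    committers: list[str]
    failing_jobs: list[str] = field(default_factory=list)
    failing_tests: list[str] = field(default_factory=list)


def plan_failure_channel(failure: BuildFailure, stakeholders: list[str]) -> dict:
    """Return the channel name, invite list, and opening message for a failure."""
    # Assumed naming scheme: one channel per broken build.
    channel = f"build-{failure.build_number}-failure"
    # Committers first, then stakeholders, with duplicates removed in order.
    invitees = list(dict.fromkeys(failure.committers + stakeholders))
    summary = (
        f"Build #{failure.build_number} is {failure.status}. "
        f"Committers: {', '.join(failure.committers)}. "
        f"Failing sub-jobs: {', '.join(failure.failing_jobs) or 'none'}. "
        f"Failing tests: {', '.join(failure.failing_tests) or 'none'}."
    )
    return {"channel": channel, "invite": invitees, "message": summary}
```

Keeping the plan as plain data means the part that talks to the chat service can stay thin and the naming and invitation rules can be tested on their own.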
In each channel, we can see the build result, list of committers, links to our Jenkins CI server, and a list of sub-jobs that failed. Everyone can then chat about the failure and take ownership of the problem. We found collaboration in this kind of forum to be quite useful. For example, if several people committed to the same build and any of their changes could have caused the break, the first person to investigate the failures can report to the channel and save the other committers from spending time and effort investigating.
The bot, which is part of the channel’s membership, knows how to answer members who respond in the channel. For example, if a committer says, “Not me,” it lets the person leave the room. For committers, this is an easy way to declare that the issue is not related to their commit, so they can stop receiving communication about the failure. With email, by contrast, the committer remains a party to the thread, since people usually respond by clicking on "Reply All."
We also added logic to the bot such that, if there is a compilation error or an environment problem and tests can't run, committers who say, "Not me," aren't allowed to leave the room. They should not be able to do so because their tests haven't run and still might fail in the next build due to changes those committers made. And when the committer says, "Checking," the bot knows to wait for additional updates.
If the committer says, “On it,” the bot guides them on how to tag the issues they're handling. At any time, a user can join the channel and ask the bot about known issues that committers have tagged in the context of that channel. In this way, we can easily track all outstanding issues. For each issue, users can see the owner, creation time, and the failing tests or job being addressed.
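The reply rules described above could look something like this. It is a sketch in Python, not the bot's actual code: the `handle_reply` function and the action names it returns are hypothetical, and it matches only the three exact phrases mentioned in the text.

```python
def handle_reply(text: str, tests_ran: bool) -> str:
    """Map a committer's reply to a bot action (action names are hypothetical)."""
    # Normalize case, surrounding whitespace, and trailing punctuation.
    reply = text.strip().lower().rstrip(".!")
    if reply == "not me":
        # If tests never ran (compilation or environment failure), the commit
        # is still unverified, so the committer has to stay in the channel.
        return "release_from_channel" if tests_ran else "keep_in_channel"
    if reply == "checking":
        return "wait_for_update"
    if reply == "on it":
        return "prompt_to_tag_issue"
    return "ignore"
```

For instance, `handle_reply("Not me", tests_ran=False)` returns `"keep_in_channel"`, which mirrors the rule that committers cannot opt out while their tests have not run.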
When the committer pushes the fix and the build is green again, the committer closes the issue and classifies the fix. The fix classification can be an environment problem, a code fix, an unstable test, or a test fix. This classification is valuable because it enables us to measure the return on investment of tests and helps us decide where to focus our efforts.
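One way to model a tagged issue and its closing classification is shown below. This is an illustrative Python sketch under our own assumptions: the four classification values come from the text, while the `Issue` and `FixType` names, the field layout, and the rule that closing requires a classification are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class FixType(Enum):
    ENVIRONMENT = "environment problem"
    CODE_FIX = "code fix"
    UNSTABLE_TEST = "unstable test"
    TEST_FIX = "test fix"


@dataclass
class Issue:
    issue_id: int
    owner: str
    failing_items: list[str]  # failing tests or jobs being addressed
    created_at: datetime
    fix_type: Optional[FixType] = None

    @property
    def is_open(self) -> bool:
        return self.fix_type is None

    def close(self, fix_type: FixType) -> None:
        """Close the issue; a valid classification is required."""
        if not isinstance(fix_type, FixType):
            raise ValueError("a fix classification is required to close an issue")
        self.fix_type = fix_type
```

Restricting the classification to a fixed set of values keeps the data consistent enough to aggregate later.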
How it works in practice
If we discover that most of our tests are failing due to environment problems, then we know we need to spend more money on better resources and devices. Alternatively, if we find that there are many fixes in the tests themselves, we know that we need to hold off on developing more tests until we change the way we write the tests, since they are not sufficiently maintainable.
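The kind of tally behind these decisions can be sketched in a few lines of Python. The `classification_shares` helper is hypothetical, and it assumes the classifications of closed issues are available as plain strings.

```python
from collections import Counter


def classification_shares(fix_types: list[str]) -> dict[str, float]:
    """Return each classification's share of all closed issues, largest first."""
    counts = Counter(fix_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders classifications from most to least frequent.
    return {name: count / total for name, count in counts.most_common()}
```

The first entry in the returned dictionary is the dominant cause, which is the number that would drive a decision such as investing in the environment or reworking the tests.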
The classification data is also useful because we can apply machine-learning algorithms to it to automatically predict future failures, as well as causes and solutions.
Below is an example of a committer who took ownership of a test failure and tagged the issue. It's then assigned an ID for tracking.
In addition to dynamic channels, we created two static channels that the bot manages. The first channel is for freezing and unfreezing the code base. In this channel, the bot declares a freeze based on configuration. For example, the bot freezes every compilation failure automatically and then unfreezes it once the build is green. In addition, the bot provides every member of the channel with such information as current freeze status, reason for the freeze, the freeze owners, and an estimate as to how long before the freeze will be lifted.
When we use the freeze channel, we stop sending emails regarding code freeze, and each stakeholder knows in real time what the status is.
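A configuration-driven freeze rule like the one described could be sketched as follows. This Python example is an assumption-laden illustration: the `FREEZE_ON` setting, the `FreezeState` type, and the `update_freeze` function are hypothetical names, and the only rule encoded is the one from the text (freeze on compilation failure, unfreeze when the build is green).

```python
from dataclasses import dataclass
from typing import Optional

# Assumed configuration: failure kinds that trigger an automatic freeze.
FREEZE_ON = {"compilation"}


@dataclass
class FreezeState:
    frozen: bool = False
    reason: Optional[str] = None


def update_freeze(
    state: FreezeState, build_green: bool, failure_kind: Optional[str] = None
) -> FreezeState:
    """Return the freeze state that follows a build result."""
    if build_green:
        # A green build always lifts the freeze.
        return FreezeState(frozen=False)
    if failure_kind in FREEZE_ON:
        return FreezeState(frozen=True, reason=f"{failure_kind} failure")
    # Other failure kinds leave the current state untouched.
    return state
```

Because the freeze triggers live in one configuration set, adding another trigger would not require touching the state logic.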
The second channel manages machine deployments. The bot deploys machines, and members can see where to find the latest deployment. In addition, since all deployments are performed and recorded in this channel, we can easily track when each commit was first deployed on a given machine.
Committers especially like how the bot sends a private message to the committer indicating when the commit has been deployed on a machine and is ready for verification. They don't have to wonder whether or not a deployment contains a specific commit.
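Tracking the first deployment of each commit could be done with a small record like the one below. It is a Python sketch with hypothetical names (`DeploymentLog`, `record`, `first_deployed`); it assumes each deployment reports the machine name and the list of commits it contains.

```python
from datetime import datetime
from typing import Optional


class DeploymentLog:
    """Records the first time each commit was deployed to each machine."""

    def __init__(self) -> None:
        self._first_seen: dict[tuple[str, str], datetime] = {}

    def record(self, machine: str, commits: list[str], when: datetime) -> list[str]:
        """Record a deployment; return the commits that are new on this machine."""
        new_commits = []
        for commit in commits:
            key = (machine, commit)
            if key not in self._first_seen:
                self._first_seen[key] = when
                new_commits.append(commit)
        return new_commits

    def first_deployed(self, machine: str, commit: str) -> Optional[datetime]:
        """Return when the commit first reached the machine, or None if never."""
        return self._first_seen.get((machine, commit))
```

The list returned by `record` is exactly the set of commits whose authors would need a private "ready for verification" message, so nobody is notified twice for the same machine.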
Get your CI automated and get these results
By using these collaboration tools and processes, you can free up your team's capacity for higher-value work, reduce the volume of emails flying around, reduce your freeze time, and gather and disseminate valuable statistics about the build, all with minimal effort.
As for my team, our next step will be to enable our bot to support natural language, bringing an even better user experience to our team. It's a great way to do CI—and DevOps!
Note: Daniel Shmaya and Itay Ben yehuda also developed the solution described in this article.