Eliminating Flaky UI Tests to Stabilize Continuous Delivery
The software delivery process is constantly evolving to meet market and economic demands. New technologies allow developers to build software faster, which requires equal improvement in other areas of delivery, such as testing, to avoid creating gaps in quality or velocity.
Top-performing software teams meet this challenge by making testing part of short, iterative development cycles rather than treating software quality as a separate phase of the delivery chain. To fit into these shorter timelines, testing has to be fast, resilient to change and reliable in execution. As a key component of their continuous delivery pipeline, organizations increasingly adopt automated testing earlier in their development process to minimize the downstream impact of defects and their multiplied cost in later stages of delivery.
The Goal: No Flaky Tests, Ever!
Flaky automated tests throw a monkey wrench into the software delivery process. Tests that fail inconsistently block the integration of changes into a build that is always “green”, and ultimately keep software teams from deploying on an as-needed basis. In this way, test flakiness is a tangible barrier between development and downstream delivery activities.
Test flakiness occurs for a few common reasons:
1. Test execution environment is inconsistently managed or unreliable
- Resources used by UI testing are not reset or are in an inconsistent state
- External dependencies propagate unreliable behavior into tests
- New requirements reserve testing resources, blocking existing requirements
2. Code changes are not reflected in test design
- Modifications to workflows aren’t communicated between team members
- Development and QA release activities do not operate on the same schedule
3. Rigid UI element locator strategies
- XPath and other selector patterns are rigid or based on dynamic data
- Tests duplicate the steps required to select an element or navigate
- Locator patterns are incompatible between platforms and various device models
4. Test data is out of sync with environment under test
- Restriction on data due to security or privacy inhibits timely testing
- Inability to obtain the latest data leaves test plan gaps open to real-world defects
- Distribution of responsibility for managing test data doesn’t match team needs
Particularly in rapid development cycles, a feedback mechanism that provides inconsistent or incorrect information is bad news. Not only do you have to pause development, but you also have to determine whether the failure was caused by the code under test or by the test itself. The more often a test fails erroneously, the less the team trusts the validation system.
The problem is exacerbated when development teams either flag and disable flaky tests or simply fall back to manual testing to maintain team velocity. Both of these anti-patterns invite risk in production by ignoring early warning signs of a larger quality problem. Furthermore, a rise in defects undermines developer productivity, requiring teams to spend more time troubleshooting and less time on planned work.
Treat Testing as Part of Development
To reduce test flakiness, coding and testing activities need to be synchronized through specific practices. The smaller the gap between coding and testing activities, the more ownership of quality shows in the work being delivered.
A Stable Environment Starts with Empathy
Developers who are familiar with how code is tested, and with how closely the test environment matches production, have an easier time writing code that conforms to real-world expectations. Similarly, with test code, no one likes it when their test fails because of a resource left in an improper state by a prior script execution or by a co-worker.
Make sure that tests practice proper set-up and tear-down patterns that leave shared resources in a healthy state. Deleting temporary files, resetting sensors, rolling back incomplete database transactions and closing network connections whenever possible go a long way toward maintaining a stable testing environment.
The shared tenancy of a test environment can also introduce unanticipated problems into test cycles, which is why many organizations are switching to automated management of all environments through container, deployment and artifact management technologies. Even so, changes to network configuration and domain administration can still disrupt test execution, so it is important to manage these changes carefully and shield your development process from them until they can be adequately addressed.
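As a minimal sketch of the set-up and tear-down pattern described above, the following uses Python's standard unittest module. The test class, table and file names are hypothetical; only the resource handling is the point.

```python
import os
import shutil
import sqlite3
import tempfile
import unittest


class ProfileScreenTest(unittest.TestCase):
    """Hypothetical test; it stands in for any test that touches shared resources."""

    def setUp(self):
        # Each test gets its own scratch directory, removed afterwards.
        self.workdir = tempfile.mkdtemp(prefix="ui-test-")
        self.addCleanup(shutil.rmtree, self.workdir, ignore_errors=True)

        # Cleanups run in reverse order of registration:
        # rollback first, then close, then directory removal.
        self.db = sqlite3.connect(os.path.join(self.workdir, "fixture.db"))
        self.addCleanup(self.db.close)
        self.addCleanup(self.db.rollback)

        self.db.execute("CREATE TABLE users (name TEXT)")
        self.db.commit()

    def test_profile_shows_saved_name(self):
        # Left uncommitted on purpose: the rollback cleanup discards it.
        self.db.execute("INSERT INTO users VALUES ('Ada')")
        rows = self.db.execute("SELECT name FROM users").fetchall()
        self.assertEqual(rows, [("Ada",)])
```

Registering each cleanup immediately after acquiring the resource means it still runs if a later set-up step or the test itself fails, so the next test starts from a known state.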
Code and Tests: Two Sides of the “Quality” Coin
Automation has driven testing to become code-oriented to meet the complexity and pace requirements of app development. Fortunately, this means that you can apply a number of well-known coding practices to improve the stability of your tests.
Treat your tests as first-class citizens alongside code. This allows your testing strategy to inherit the positive aspects of principal project assets, such as versioning, traceability and ease of distribution. Using a system like Git to maintain your test scripts places them on a level playing field with the rest of the app - a tangible step toward including quality in development, simply by making code and tests neighbors.
Reduce technical gaps between app source code and test code. Writing tests in the same language as the app and importing app resource dictionaries into your test scripts encourages healthy coupling between these assets. Test failures at compile time are much easier to understand than failures that surface after code check-in or in later-stage regression cycles.
A practical example of code and test coupling is the Espresso test framework for Android apps, which uses the same resource dictionary to identify objects in both app and test code; this is facilitated by Java as a common language between both. Statically maintained XPath expressions or lists of object identifiers have no direct link to the source code and lead to brittleness. Using resource dictionaries such as “R.id...” references from app source code to drive object repositories and mapping in tests ties each locator to the code it targets, so a developer's change surfaces immediately as a broken reference rather than later as an unexplained test failure.
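The same idea can be sketched outside Android. The example below is an illustration in Python, not Espresso code: `ViewId`, `by_id` and `LoginPage` are hypothetical names, and the enum stands in for identifiers that would be generated from the app's own resources. Python has no compile step, so the closest analog to a compile-time failure is an error raised when the test module is imported.

```python
from enum import Enum
from typing import Tuple


class ViewId(Enum):
    """Hypothetical stand-in for identifiers shared between app and tests."""
    LOGIN_BUTTON = "login_button"
    USERNAME_FIELD = "username_field"


def by_id(view_id: ViewId) -> Tuple[str, str]:
    """Build a locator from a shared identifier instead of a hand-written XPath."""
    return ("id", view_id.value)


class LoginPage:
    # Resolved when the class is defined, so a renamed or removed identifier
    # raises AttributeError at import time rather than midway through a run.
    LOGIN = by_id(ViewId.LOGIN_BUTTON)
    USERNAME = by_id(ViewId.USERNAME_FIELD)
```

If a developer renames `LOGIN_BUTTON`, every test that references it fails immediately and points at the missing name, whereas a stale XPath string would only fail when the element lookup times out.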
Be as hermetic as possible, but not overly so. The more elaborate or resource-dependent the test, the flakier it can be. Unlike code-level unit tests, validation at the UI level often requires an interactive session or simulated environment. While this is execution overhead from the perspective of unit and some integration testing, a real-time reflection of how the app works exposes many defects that code and service level testing cannot. Therefore, make sure that your UI tests are executed in realistic environments but do not rely on third-party or other external services that have already proven flaky.
Soak test new test suites in isolation before adding them to the adopted set of tests that others run. This helps you “put your tests to the test” by exercising them without causing an uproar on your team and slowing down progress. New tests that are included in code commits can be run to notify only the contributor at first, then tagged as ready for wider use once they have run successfully through a graduation period. How this works in practice is unique to each team, but isolated tests should meet a minimum accepted level of reliability that development, testing, operations and product ownership teams all agree on.
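One common way to keep an unreliable external service out of a UI test is to swap it for an in-process fake. The sketch below assumes a hypothetical exchange-rate service and a hypothetical `CheckoutScreen` that receives its client as a constructor argument; none of these names come from a real library.

```python
class FakeExchangeRates:
    """In-process stand-in for a hypothetical third-party rates service."""

    def __init__(self, rates):
        self._rates = dict(rates)

    def rate(self, currency):
        # Always answers instantly with the value the test configured.
        return self._rates[currency]


class CheckoutScreen:
    """Hypothetical screen model that depends on a rates client."""

    def __init__(self, rates_client):
        self._rates = rates_client

    def total_label(self, amount_usd, currency):
        converted = amount_usd * self._rates.rate(currency)
        return f"{converted:.2f} {currency}"


# Usage: the test controls the rate, so the expected label never drifts.
screen = CheckoutScreen(FakeExchangeRates({"EUR": 0.5}))
label = screen.total_label(10, "EUR")
```

The rest of the environment can stay realistic; only the dependency with a history of flakiness is replaced.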
Real Data is Part of the Real World
For automated testing to cover important conditions, representative data must be integrated into the test plan. The easiest way to overcome restrictions on this data is to create a representative data set and regularly incorporate new data from defects as they are uncovered. Vigilance is the key here, since defects exposed by data are notoriously hard to catch early in the development process.
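One way to keep such a data set alive is to store it next to the tests and append a row whenever a defect reveals a new case. The sketch below is illustrative only: the `display_name` function, the rows and their `source` labels are hypothetical placeholders for your own app logic and data.

```python
import unittest

# Hypothetical representative data set. Rows marked "defect" stand for
# cases that would be added after a data-related bug is found.
DISPLAY_NAME_CASES = [
    {"raw": "Ada Lovelace", "expected": "Ada Lovelace", "source": "baseline"},
    {"raw": "  Grace Hopper ", "expected": "Grace Hopper", "source": "defect: whitespace"},
    {"raw": "", "expected": "Anonymous", "source": "defect: empty name"},
]


def display_name(raw):
    """Hypothetical app logic under test."""
    cleaned = raw.strip()
    return cleaned or "Anonymous"


class DisplayNameDataTest(unittest.TestCase):
    def test_all_cases(self):
        for case in DISPLAY_NAME_CASES:
            # subTest reports each failing row separately instead of
            # stopping at the first one.
            with self.subTest(source=case["source"], raw=case["raw"]):
                self.assertEqual(display_name(case["raw"]), case["expected"])
```

Because the data lives in version control with the test, every new row is reviewed and rolled out the same way as a code change.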
To make sure that all testing incorporates the correct data, each test should be considered a candidate for an existing or new test data set when it is first checked in. Because this responsibility falls to whoever checks in the test, it becomes a shared activity across the entire development team. This further spreads the habit of thinking about quality up front into the team's culture and, in turn, into the work the team produces.
Stable tests allow teams to more confidently validate and improve the quality of their work. The faster developers receive complete feedback on their current work, the less time they spend fixing bugs later down the line. This translates to greater productivity and bandwidth to ship better code and improve the delivery process as a whole.