How to Reduce Flaky Tests in CI

A reusable checklist for diagnosing and fixing flaky tests in CI, from selectors and timing to retries, data isolation, and environment drift.

Flaky tests in CI are rarely caused by one dramatic problem. More often, they come from small mismatches between the test, the app, and the environment: a selector that is too brittle, a background request that finishes later in CI than it does locally, a shared account that is mutated by parallel jobs, or a retry policy that hides the real issue. This checklist is designed to help teams reduce flaky tests in CI with a repeatable troubleshooting process. Use it when failure rates spike, when you add new end-to-end coverage, or when you migrate tooling. The goal is not just to fix one unstable test, but to build more reliable automated tests and a calmer CI/CD testing workflow.

Overview

If your tests pass locally and fail unpredictably in CI, treat that as a systems problem, not just a test authoring mistake. Flakiness usually sits at the boundary between product behavior, test design, infrastructure, and timing. The fastest path to improvement is to narrow down the type of flake before you change code.

Start with a simple rule: do not guess. Before editing the test, collect enough evidence to classify the failure. For each flaky test, capture at least the following:

The exact step that failed
Whether the failure happens only in CI or also on developer machines
Whether it appears only in parallel runs or only under full-suite load
Whether retries make it pass
Screenshots, videos, traces, console logs, and network logs if your framework supports them
The commit, browser, OS image, and runtime version used in the failing job
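
One way to keep this evidence consistent is to record it in a fixed structure for every flaky failure. The sketch below is illustrative only: the FlakeReport class, its field names, and the sample values are assumptions made for this example, not part of any test framework.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class FlakeReport:
    """Evidence captured for one flaky failure (hypothetical field names)."""
    test_name: str
    failed_step: str
    ci_only: bool            # fails in CI but not on developer machines
    parallel_only: bool      # fails only in parallel or full-suite runs
    passes_on_retry: bool
    commit: str
    browser: str
    os_image: str
    runtime_version: str
    artifacts: List[str] = field(default_factory=list)  # screenshots, traces, logs

    def to_json(self) -> str:
        # Sorted keys keep reports easy to diff across runs.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values, for illustration only.
report = FlakeReport(
    test_name="checkout applies coupon",
    failed_step="click Apply",
    ci_only=True,
    parallel_only=False,
    passes_on_retry=True,
    commit="abc1234",
    browser="chromium",
    os_image="ubuntu-22.04",
    runtime_version="node-20",
    artifacts=["trace.zip", "failure.png"],
)
print(report.to_json())
```

Storing the same fields for every failure makes it much easier to compare flakes later than relying on free-form notes.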

That initial classification matters because the fix for a timing issue is very different from the fix for environment drift or test data collision. In practice, teams often jump straight to adding waits or retries. That may reduce noise for a week, but it usually increases maintenance later.

Use this quick triage model:

If the failure point moves around, suspect environment instability, leaked state, or resource contention.
If the failure point is consistent, suspect selector quality, timing, or an unmet readiness condition.
If it fails only in CI, suspect slower execution, different browser dependencies, network behavior, secrets, feature flags, or container image drift.
If it fails only in parallel, suspect shared data, reused accounts, order dependence, or hidden cross-test coupling.
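
The triage model above can be written down as a small lookup so that everyone on the team classifies failures the same way. This is a minimal sketch; the triage function and its three boolean inputs are assumptions chosen for illustration, and the suspect lists simply restate the rules above.

```python
from typing import List


def triage(moves_around: bool, ci_only: bool, parallel_only: bool) -> List[str]:
    """Return the suspect categories suggested by the quick triage model."""
    suspects: List[str] = []
    if moves_around:
        # The failure point changes from run to run.
        suspects += ["environment instability", "leaked state", "resource contention"]
    else:
        # The failure point is consistent.
        suspects += ["selector quality", "timing", "unmet readiness condition"]
    if ci_only:
        suspects += ["slower execution", "browser dependencies", "network behavior",
                     "secrets", "feature flags", "container image drift"]
    if parallel_only:
        suspects += ["shared data", "reused accounts", "order dependence",
                     "hidden cross-test coupling"]
    return suspects


print(triage(moves_around=False, ci_only=False, parallel_only=True))
```

The output is a list of places to look first, not a verdict. A single flake can match more than one rule.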

If you are still choosing tooling, framework behavior can influence how easy flaky test fixes become over time. For that comparison work, see Playwright vs Cypress vs WebdriverIO: Best End-to-End Testing Framework in 2026 and Cross-Browser Testing Tools Compared: Playwright, Selenium, Cypress, and Cloud Grids.

Checklist by scenario

This section gives you a reusable test flakiness checklist organized by symptom. Pick the scenario that looks most familiar and work through the checks in order.

Scenario 1: The test fails on element lookup or interaction

This is one of the most common patterns behind flaky tests in CI. The test cannot find an element, clicks the wrong thing, or interacts before the UI is actually ready.

Prefer stable, intentional selectors such as roles, labels, test IDs, or other explicit hooks over deep CSS chains.
Check whether the element is present but not visible, visible but not enabled, or rendered but covered by another layer.
Verify that your test waits for the condition that actually matters. Waiting for page load is not the same as waiting for data-bound UI to settle.
Remove fixed sleeps where possible. A hard-coded delay may pass on one runner and fail on another.
Make sure animations, transitions, and lazy-loaded components are not changing click timing.
Confirm that there is only one matching element. Ambiguous selectors often pass until UI structure shifts slightly.

A reliable fix usually means aligning the test with user-observable readiness. For example, wait for a specific heading, accessible role, response completion, or save confirmation, rather than a generic timeout.
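
To make the difference between a fixed sleep and a readiness wait concrete, here is a minimal, framework-agnostic sketch. The wait_until helper is a hypothetical name written for this example. Many frameworks ship built-in waiting or retrying assertions, and those are usually the better choice when they exist.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 5.0,
               interval: float = 0.05) -> None:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


# Simulated UI that becomes ready after 0.2 seconds.
ready_at = time.monotonic() + 0.2
wait_until(lambda: time.monotonic() >= ready_at, timeout=2.0)
print("ready")
```

Unlike a hard-coded delay, this returns as soon as the condition holds and fails loudly when it never does, so a slow runner costs time rather than correctness.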

Scenario 2: The test stalls on navigation, loading states, or API-driven content

If the test stalls on navigation, loading states, or API-driven content, the issue may be asynchronous behavior that completes faster locally than it does in CI.

Check whether the page depends on multiple background requests before becoming usable.
Inspect failed network calls, especially authentication refreshes, feature flag fetches, and third-party scripts.
Confirm that test environments expose the same base URLs, secrets, and route configuration as local runs.
Watch for service-worker caching, stale assets, or API schema mismatches between branches and environments.
Use network-aware waits carefully. Wait for the request or response that unlocks the UI, not every request on the page.
Consider stubbing unstable third-party integrations when the goal of the test is your app behavior, not the vendor dependency.
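
As an illustration of the last point, the sketch below stubs a vendor dependency so the test exercises only the app's own logic. The checkout_total function and the tax client are hypothetical and exist only for this example. Browser-level frameworks typically offer request interception that serves the same purpose at the network layer.

```python
from unittest.mock import Mock


def checkout_total(cart_cents: int, tax_client) -> int:
    """App logic under test: adds the tax quoted by an external vendor."""
    return cart_cents + tax_client.quote(cart_cents)


# Stub the vendor so the test depends on app behavior, not vendor uptime.
tax_stub = Mock()
tax_stub.quote.return_value = 80

total = checkout_total(1000, tax_stub)

# The stub also records how it was called, which keeps the contract visible.
tax_stub.quote.assert_called_once_with(1000)
print(total)
```

Keep at least one separate, clearly labeled test against the real integration if the vendor contract itself is something you need to monitor.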

For teams running Playwright in CI, environment setup and browser dependencies are frequent sources of hidden instability. A practical setup reference is How to Run Playwright in GitHub Actions: Updated CI Setup Guide.

Scenario 3: The test passes alone but fails in the full suite

This usually points to state leakage, resource contention, or test order dependence.

Run the test in isolation and then in a small batch with neighboring tests to detect interference.
Check whether accounts, organizations, carts, projects, or feature settings are shared across tests.
Make test data unique per run when possible. Namespace records with a job ID, worker ID, or timestamp.
Reset server-side state explicitly instead of assuming cleanup happened.
Review browser storage usage. Cookies, local storage, and session storage can bleed across tests if isolation is incomplete.
Audit database fixtures and seed scripts for assumptions about execution order.

If parallel test execution increases failure rates, that is a signal worth keeping. Do not disable all parallelism immediately. First identify which tests are unsafe to parallelize and why.
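
Namespacing test data per run, as suggested in the checklist above, can be as small as one helper. This is a sketch under stated assumptions: the environment variable names CI_JOB_ID and TEST_WORKER_INDEX are placeholders, so substitute whatever your CI system and test runner actually expose.

```python
import os
import uuid


def unique_name(prefix: str) -> str:
    """Build a record name that cannot collide across jobs or workers."""
    job = os.environ.get("CI_JOB_ID", "local")          # assumed variable name
    worker = os.environ.get("TEST_WORKER_INDEX", "0")   # assumed variable name
    # The random suffix protects against collisions within one worker.
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}-{job}-w{worker}-{suffix}"


print(unique_name("project"))
print(unique_name("project"))
```

Including the job and worker in the name also makes leftover records traceable to the run that created them.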

Scenario 4: The test fails only on one browser or platform

Cross-browser issues can look flaky when they appear sporadically under mixed matrix runs.

Verify that selectors do not rely on browser-specific DOM quirks.
Check viewport assumptions, scrolling behavior, and sticky headers that may cover targets differently.
Review file upload, permissions, clipboard, and media interactions, which often vary across environments.
Compare font rendering and timing-sensitive visual assertions if visual regression testing tools are involved.
Make sure browser versions and launch flags are pinned or at least controlled consistently in CI.

If you need broader guidance on browser testing tools and matrix strategy, revisit Cross-Browser Testing Tools Compared: Playwright, Selenium, Cypress, and Cloud Grids.

Scenario 5: The test fails in setup, teardown, or authentication

Setup and teardown code often receives less scrutiny than test steps, but it is a common flake source.

Check token expiry windows and clock skew between services.
Ensure setup APIs return only after data is truly ready for UI use.
Make teardown idempotent. Cleanup that fails on the second attempt can poison later runs.
Do not rely on UI login for every test if a more stable session bootstrap exists.
Separate environment provisioning failures from real product regressions in reporting.
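
Idempotent teardown means that running cleanup twice is as safe as running it once. The sketch below uses an in-memory FakeApi as a stand-in for a real test-data service. Both the class and the cleanup_user helper are hypothetical names for this example.

```python
class FakeApi:
    """In-memory stand-in for a test-data API (hypothetical)."""

    def __init__(self) -> None:
        self.users = {"u1": {"name": "temp user"}}

    def delete_user(self, user_id: str) -> None:
        if user_id not in self.users:
            raise KeyError(user_id)  # mimics a "not found" error from a real service
        del self.users[user_id]


def cleanup_user(api: FakeApi, user_id: str) -> bool:
    """Delete a user and treat 'already gone' as success so reruns are safe."""
    try:
        api.delete_user(user_id)
        return True       # something was actually deleted
    except KeyError:
        return False      # nothing to do, which is not an error


api = FakeApi()
first = cleanup_user(api, "u1")
second = cleanup_user(api, "u1")  # second attempt must not raise
print(first, second)
```

Only the "already gone" case is swallowed here. Other errors should still surface, because they usually indicate a provisioning problem rather than a finished cleanup.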

Setup fragility is especially costly in GitHub Actions and GitLab CI pipelines because one unstable preparation step can invalidate many downstream jobs. For pipeline structure ideas, see GitLab CI for Automated Testing: Pipeline Stages, Caching, and Parallel Jobs and Jenkins vs GitHub Actions vs GitLab CI for Test Automation.

Scenario 6: Retries make the problem disappear

Retries have a role, but they should help classify instability rather than hide it.

Tag tests that pass only after retry and report them separately.
Track retry-only passes by test name and component area.
Use retries for temporary containment while a root-cause fix is in progress.
Do not count retry passes as a clean signal of suite health.
Review whether your framework retries the whole test, the assertion, or the job. The scope matters.

A healthy policy is: retries may protect developer flow, but they should increase visibility, not reduce it.
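
A retry wrapper that increases visibility might look like the following sketch. It retries at the level of the whole test and records every pass that needed more than one attempt. The run_with_retries function and the shape of the log entries are assumptions for illustration, not the behavior of any specific framework.

```python
from typing import Callable, Dict, List


def run_with_retries(name: str,
                     test: Callable[[], None],
                     max_retries: int,
                     flaky_log: List[Dict]) -> bool:
    """Run `test`, retrying on assertion failure, and log retry-only passes."""
    for attempt in range(max_retries + 1):
        try:
            test()
        except AssertionError:
            continue  # try again if attempts remain
        if attempt > 0:
            # Passed, but only after a retry: report it separately.
            flaky_log.append({"test": name, "passed_on_attempt": attempt + 1})
        return True
    return False


calls = {"n": 0}


def sometimes_fails() -> None:
    """Simulated flaky test: fails on the first call, passes afterwards."""
    calls["n"] += 1
    assert calls["n"] >= 2


flaky: List[Dict] = []
ok = run_with_retries("saves draft", sometimes_fails, max_retries=2, flaky_log=flaky)
print(ok, flaky)
```

The pipeline still goes green, but the flaky list gives you a separate signal to track by test name and component area.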

What to double-check

Once you have narrowed the failure to a likely scenario, go through this secondary checklist before you merge a fix. This is where many teams catch the real cause.

Environment parity

Match Node, browser, package manager, and OS image versions between local and CI where practical.
Check whether dependency installation, browser binaries, or system libraries are being cached inconsistently.
Review feature flags and configuration drift across branches, preview environments, and CI jobs.
Confirm that the test environment has enough CPU and memory for the current level of parallelism.
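
A quick way to spot version drift is to compare a local version manifest against the one recorded by the CI job. The sketch below assumes both sides can export a simple name-to-version mapping. The diff_versions helper and all version values are placeholders for illustration.

```python
from typing import Dict, Optional, Tuple


def diff_versions(local: Dict[str, str],
                  ci: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (local_version, ci_version)} for every mismatch."""
    names = sorted(set(local) | set(ci))
    return {
        name: (local.get(name), ci.get(name))
        for name in names
        if local.get(name) != ci.get(name)
    }


# Placeholder manifests, for illustration only.
local_env = {"node": "20.11.0", "chromium": "121.0", "os": "macos-14"}
ci_env = {"node": "20.9.0", "chromium": "121.0", "os": "ubuntu-22.04"}

drift = diff_versions(local_env, ci_env)
print(drift)
```

A tool that is missing on one side shows up as None, which is often the most useful finding.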

Observability

Enable screenshots on failure at minimum.
Add traces, console logs, and network logs for intermittently failing suites.
Record the worker index, shard, browser, and retry number in test output.
Store artifacts long enough for investigation, not just for the latest run.

Better test reporting tools do not solve flakiness by themselves, but they shorten the time from symptom to evidence.

Data isolation

Create independent test users or workspaces per test or per worker.
Avoid shared email inboxes and mutable seed records unless absolutely necessary.
Ensure IDs generated in one test cannot collide with another parallel run.
Clean up asynchronously created resources such as background jobs, uploads, and webhooks.

Assertion quality

Assert outcomes that matter to users, not incidental implementation details.
Prefer eventual assertions where supported instead of checking state too early.
Do not over-assert. A single test that checks too many unrelated outcomes is more likely to flake.
Separate UI correctness, API correctness, and analytics/event checks unless the journey genuinely depends on all three.

Test ownership

Assign an owner for flaky tests, even if ownership rotates by component.
Define when a test should be quarantined, rewritten, or deleted.
Track repeat offenders. If the same test flakes every sprint, treat it as product debt or framework debt, not random bad luck.

Common mistakes

Teams trying to fix unstable tests often repeat a few patterns that make the suite look healthier while the underlying reliability gets worse.

Adding arbitrary sleeps: This can reduce failures temporarily, but it slows the suite and fails again when timing shifts.
Masking everything with retries: Retries can be useful, but they are not a substitute for understanding the failure mode.
Testing too much through the UI: Not every precondition belongs in an end-to-end flow. Move setup to APIs or fixtures where stable.
Using fragile selectors tied to layout: A style refactor should not break a business-critical checkout test.
Ignoring CI resource pressure: A suite that was stable at two workers may degrade at eight if the environment is undersized.
Keeping known-bad tests in the main signal path: If a test is untrusted, quarantine it visibly until fixed. Silently tolerating it lowers developers' confidence in the entire automated suite.
Blaming the framework first: Tooling matters, but most flaky tests come from synchronization, data, or environment issues that exist regardless of framework.

If you are evaluating whether a framework change would improve stability, compare tradeoffs carefully rather than assuming a switch will remove flake. Framework choice affects debugging ergonomics, auto-waiting behavior, and reporting, but good test design still matters. For broader comparison, see Playwright vs Cypress vs WebdriverIO.

When to revisit

This checklist is most useful when it becomes part of your normal QA and CI/CD maintenance rhythm instead of a one-time cleanup. Revisit it at specific moments:

Before seasonal planning cycles, when teams often increase release volume or add more regression coverage
When workflows or tools change, such as a move to new runners, browser versions, cloud grids, or test frameworks
After adding parallelism, sharding, or new browser matrix combinations
When a product area adds heavy client-side rendering, real-time updates, or third-party integrations
When retry-only passes start climbing, even if pipeline success still looks acceptable

A practical team habit is to run a short flake review every two to four weeks. Keep it simple:

List the top flaky tests by retry count or non-deterministic failures.
Group them by failure mode: selector, timing, data, network, environment, or shared state.
Choose one or two root causes to eliminate, not ten surface symptoms to patch.
Update fixtures, selectors, wait strategy, and pipeline config together where needed.
Document the fix so future contributors do not reintroduce the same instability.
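
The first two steps of that review can be automated from whatever failure records you already collect. This is a minimal sketch: the record shape, the summarize helper, and the sample data are assumptions for illustration, and the mode labels must come from your own triage.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Placeholder failure records, one per flaky failure or retry-only pass.
failures = [
    {"test": "checkout applies coupon", "mode": "timing"},
    {"test": "checkout applies coupon", "mode": "timing"},
    {"test": "checkout applies coupon", "mode": "timing"},
    {"test": "profile photo upload", "mode": "data"},
    {"test": "profile photo upload", "mode": "data"},
    {"test": "login redirect", "mode": "environment"},
]


def summarize(records: List[Dict[str, str]],
              top_n: int = 3) -> Tuple[List[Tuple[str, int]], Dict[str, List[str]]]:
    """Return the top flaky tests by count and the tests grouped by failure mode."""
    by_test = Counter(r["test"] for r in records)
    by_mode = defaultdict(set)
    for r in records:
        by_mode[r["mode"]].add(r["test"])
    grouped = {mode: sorted(tests) for mode, tests in by_mode.items()}
    return by_test.most_common(top_n), grouped


top, grouped = summarize(failures, top_n=2)
print(top)
print(grouped)
```

Grouping by failure mode is what turns a list of annoying tests into one or two root causes worth fixing.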

If you want a durable outcome, define what “reliable” means for your team. For example: no known flaky tests in merge-blocking smoke coverage, retries reported separately, and all new end-to-end tests reviewed for selector quality and data isolation. That kind of rule is easier to sustain than an abstract goal of zero flakiness.

The final action item is straightforward: pick one unstable test today, classify its failure mode, and fix the root cause instead of softening the symptom. Repeating that discipline across your suite is how you reduce flaky tests over time and build a CI test pipeline that developers can trust.