Identifying Flaky Tests
Flaky tests—tests that sometimes pass and sometimes fail without code changes—are one of the most frustrating problems in software testing. They erode trust in your test suite, waste developer time, and can mask real issues. Xping helps you identify and understand flaky tests through intelligent detection and confidence scoring.
What Makes a Test Flaky?
Flaky tests fail intermittently for reasons unrelated to code correctness. Common causes include:
- Timing Issues: Race conditions, timeouts, or slow external services
- Environment Dependencies: File system state, network conditions, or resource availability
- Non-Deterministic Logic: Random data, date/time dependencies, or unordered collections
- Test Isolation Problems: Shared state between tests or improper cleanup
- External Service Flakiness: Third-party APIs, databases, or infrastructure instability
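To make the first and third causes concrete, here is a minimal pytest-style sketch; the function and test names are illustrative and not part of Xping:

```python
import random
import threading
import time

results = []  # shared module state: also a test-isolation hazard

def slow_worker():
    """Background task that usually finishes in ~30 ms, slower under load."""
    time.sleep(0.03)
    results.append("done")

def test_worker_finishes():
    # Timing issue: waits a fixed 40 ms and assumes the worker is done.
    # Passes on a quiet machine, fails intermittently on a loaded CI agent.
    threading.Thread(target=slow_worker).start()
    time.sleep(0.04)
    assert results == ["done"]

def test_discount_total_is_positive():
    # Non-deterministic logic: a random quantity occasionally hits the
    # zero edge case, so the test fails on roughly 1 in 11 runs.
    quantity = random.randint(0, 10)
    assert quantity * 9.99 > 0
```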
How Xping Detects Flakiness
Xping doesn't just track pass/fail results—it analyzes multiple dimensions of test behavior to identify flakiness patterns. By collecting rich telemetry data across test runs, Xping builds a comprehensive understanding of each test's reliability.
The confidence score combines six key factors to give you a complete picture of test reliability. Tests with low confidence scores (below 0.60) exhibit patterns that indicate flakiness rather than consistent failures.
Related: For details on how confidence scores are calculated, see Understanding Confidence Scores.
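As a rough illustration of the idea only (the weights and threshold below are assumptions, not Xping's actual formula, which is documented in Understanding Confidence Scores), an overall score can be built as a weighted combination of per-factor scores:

```python
# Illustrative sketch: six per-factor scores in [0, 1] are combined into
# one overall confidence score. Equal weights are assumed here purely
# for demonstration.
FACTOR_WEIGHTS = {
    "historical_pass_rate": 1 / 6,
    "execution_stability": 1 / 6,
    "retry_behavior": 1 / 6,
    "environment_consistency": 1 / 6,
    "failure_pattern": 1 / 6,
    "dependency_impact": 1 / 6,
}

def overall_confidence(factor_scores: dict[str, float]) -> float:
    """Combine per-factor scores (each 0.0-1.0) into one overall score."""
    return sum(FACTOR_WEIGHTS[name] * score
               for name, score in factor_scores.items())

scores = {
    "historical_pass_rate": 0.55,
    "execution_stability": 0.40,
    "retry_behavior": 0.30,  # low: fails first, passes on retry
    "environment_consistency": 0.80,
    "failure_pattern": 0.50,
    "dependency_impact": 0.90,
}
assert overall_confidence(scores) < 0.60  # below threshold: flagged as flaky
```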
Using the Flaky Tests Tab
The easiest way to find flaky tests is through the Flaky Tests tab in the dashboard:
- Navigate to the Xping Dashboard
- Click on the Flaky Tests tab
- View tests filtered to show only those detected as flaky (with confidence scores below 0.60)
- Sort by confidence score (ascending) to see the most problematic tests first
This filtered view helps you prioritize reliability improvements by showing only tests that need attention.
Related: For navigation details, see Navigating the Dashboard.
Investigating a Flaky Test
When you identify a test with a low confidence score, follow this investigation process:
1. Review the Test Detail Page
Navigate to the test detail page to understand:
- Which factors are pulling the score down? Look at the factor breakdown
- What's the confidence level? If it's Low or Medium, you may need more data
- What's the trend? Is it getting worse or has it always been flaky?
2. Check the Confidence Score Breakdown
The breakdown shows individual scores for each of the six factors:
- Historical Pass Rate: Is the test failing frequently?
- Execution Stability: Are execution times varying wildly?
- Retry Behavior: Is it failing then passing on retry?
- Environment Consistency: Does it behave differently in different environments?
- Failure Pattern Analysis: Are failures random or predictable?
- Dependency Impact: Does it fail when other tests fail?
A low score in a specific factor points to the type of flakiness involved.
3. Examine Environment Analysis
If "Environment Consistency" has a low score:
- Compare pass rates across Local vs. CI
- Look for environment-specific failures
- Common causes:
  - File path differences (Windows backslashes vs. Unix forward slashes)
  - Environment variables not set in CI
  - Network connectivity differences
  - Timing differences (CI machines may be slower)
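For example, the first helper below fails only in CI for two of these reasons: hardcoded Windows separators and an environment variable that is set only on developer machines. The function names and the PROJECT_ROOT variable are hypothetical:

```python
import os
from pathlib import Path

def load_fixture_fragile():
    # Fails on Unix CI: backslash separators break the path, and
    # PROJECT_ROOT raises KeyError when it is not set in CI.
    root = os.environ["PROJECT_ROOT"]
    return open(root + "\\fixtures\\data.json")

def load_fixture_portable():
    # Portable: defaults the root and lets pathlib pick the separator.
    root = Path(os.environ.get("PROJECT_ROOT", "."))
    return (root / "fixtures" / "data.json").open()
```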
4. Review Execution History
Look for patterns in recent executions:
- Do failures happen at specific times of day?
- Do they occur during parallel execution?
- Do they correlate with high system load?
- Are certain environments more problematic?
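If you can export execution history, a small script can surface these correlations. The sketch below assumes a hypothetical CSV export with timestamp, environment, and outcome columns; it is not a documented Xping format:

```python
import csv
from collections import Counter

# Tally failures by hour of day and by environment from an exported log.
failures_by_hour = Counter()
failures_by_env = Counter()

with open("execution_history.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["outcome"] == "failed":
            failures_by_hour[row["timestamp"][11:13]] += 1  # "HH" from ISO time
            failures_by_env[row["environment"]] += 1

print("Failures by hour:", failures_by_hour.most_common(3))
print("Failures by environment:", failures_by_env.most_common())
```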
5. Check the Flaky Detection Analysis
If the test is flagged as flaky, the Flaky Detection Analysis section provides:
- Flakiness Category: Type of flakiness detected (Timing, Environment, Concurrency, etc.)
- Root Cause Analysis: Specific indicators found
- Recommendations: Actionable suggestions for fixing the issue
- Confidence: How certain the system is about the categorization
Common Flakiness Indicators
Retry Behavior (Low Score)
- Fails on the first attempt, passes on retry
- Classic sign of timing issues or race conditions
- Example: Timeout on first attempt, success on second attempt
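A minimal self-contained sketch of this pattern, with a stand-in fetch_status simulating a service that needs a moment to warm up:

```python
import time

_calls = {"n": 0}

def fetch_status():
    """Stand-in for a service that returns 'starting' until it warms up."""
    _calls["n"] += 1
    return "ready" if _calls["n"] > 2 else "starting"

def test_status_fragile():
    # Flaky pattern: a single immediate check fails while the service is
    # still warming up, then passes when the runner retries the test.
    assert fetch_status() == "ready"

def test_status_with_polling():
    # More robust: poll with an explicit deadline instead of relying on
    # test-level retries to paper over the race.
    deadline = time.monotonic() + 5.0
    while time.monotonic() < deadline:
        if fetch_status() == "ready":
            return
        time.sleep(0.2)
    raise AssertionError("service did not become ready within 5 s")
```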
Environment Consistency (Low Score)
- Works locally, fails in CI (or vice versa)
- Different pass rates across operating systems
- Example: 100% pass on macOS, 60% pass on Linux
Execution Stability (Low Score)
- Wide variance in execution times
- Occasional performance spikes
- Example: Usually 120ms, occasionally 2400ms
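One way such variance can be quantified is the coefficient of variation (standard deviation divided by mean); the threshold below is illustrative, not Xping's actual stability scoring:

```python
import statistics

# Durations matching the example above: mostly ~120 ms with rare spikes.
durations_ms = [118, 121, 124, 119, 2400, 120, 122, 117, 2380, 123]

mean = statistics.mean(durations_ms)
cv = statistics.stdev(durations_ms) / mean  # coefficient of variation

print(f"mean={mean:.0f} ms, cv={cv:.2f}")
if cv > 0.5:  # arbitrary illustrative threshold
    print("Execution times are unstable; suspect timeouts or contention.")
```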
Failure Pattern (Low Score)
- Random, unpredictable failures
- No correlation with code changes
- Example: Fails 3 times, passes 47 times with no pattern
Dependency Impact (Low Score)
- Fails when other tests fail
- Shared resource contention
- Example: 80% of failures occur when another specific test also fails
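A classic shape of this problem: two tests sharing one file pass individually but can fail when run in parallel. The fix sketched at the end uses pytest's built-in tmp_path fixture to isolate them:

```python
import os
import tempfile

# Shared-resource contention: both tests write the same file, so each
# passes alone but can fail under parallel execution (e.g. pytest-xdist).
SHARED_PATH = os.path.join(tempfile.gettempdir(), "report.txt")

def test_writes_summary():
    with open(SHARED_PATH, "w") as f:
        f.write("summary")
    with open(SHARED_PATH) as f:
        assert f.read() == "summary"  # fails if the other test wrote in between

def test_writes_details():
    with open(SHARED_PATH, "w") as f:
        f.write("details")
    with open(SHARED_PATH) as f:
        assert f.read() == "details"

# Fix: pytest's tmp_path fixture gives each test its own directory.
def test_writes_summary_isolated(tmp_path):
    path = tmp_path / "report.txt"
    path.write_text("summary")
    assert path.read_text() == "summary"
```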
False Positives: When Low Scores Don't Mean Flakiness
A low confidence score doesn't automatically mean flakiness. Consider these scenarios:
Consistently Broken Test:
- Test fails 100% of the time due to a real bug
- Low score because it's failing
- Not flaky - just broken
Recently Introduced Test:
- Fewer than 25 executions
- Confidence level is "Low"
- May not be flaky - just needs more data
Environment-Specific Bug:
- Fails consistently in one environment
- Passes consistently in another
- May not be flaky - could be environment-specific bug
Xping's Flaky Detection Analysis helps distinguish between these cases and true flakiness.
See Also
- Common Flaky Patterns - Catalog of flakiness patterns and detection signatures
- Fixing Flaky Tests - Strategies for addressing flaky tests
- Understanding Confidence Scores - How scores are calculated
- Navigating the Dashboard - Using the dashboard to find tests