Best Practices
This guide consolidates best practices for maintaining test reliability using Xping. Follow these guidelines to build and maintain a trustworthy test suite.
Making Confidence Scores Visible
Include in PR Reviews
Make reliability part of your code review process:
## PR Checklist
- [ ] All new tests have run at least 20 times
- [ ] New tests have confidence scores > 0.60
- [ ] No existing tests show declining trends due to these changes
- [ ] Link to Xping dashboard showing test results
Set Minimum Thresholds
Define quality gates based on test importance:
Recommended Thresholds:
- Critical path tests: 0.85+ (Highly Reliable)
- Integration tests: 0.75+ (Reliable)
- All tests: 0.60+ (above Flaky threshold)
Enforcement:
- Don't merge code that introduces low-confidence tests
- Block PRs if existing tests degrade below thresholds
- Require investigation before merging tests with Medium or Low confidence levels
Display in Team Areas
Keep reliability top-of-mind:
- Show Xping dashboard during standups
- Display flaky test count on team information radiators
- Include reliability metrics in sprint reviews
- Share weekly summaries in team channels
Don't Ignore Low Scores
Why Flaky Tests Are Expensive
Flaky tests cost more than they save:
- Time Waste: Team investigates failures that aren't real bugs
- Eroded Trust: Developers start ignoring all test failures
- Hidden Bugs: Real issues get dismissed as "probably just flaky"
- CI/CD Delays: Pipelines get stuck or require manual overrides
- Productivity Loss: Context switching to investigate intermittent failures
Take Action Immediately
When you see low scores:
- Acknowledge: Don't pretend it's not a problem
- Investigate: Use Xping's telemetry to understand why
- Fix or Disable: Either fix the root cause or temporarily disable the test
- Track: Create tickets to ensure disabled tests get re-enabled
Related: For fixing strategies, see Fixing Flaky Tests.
Fix at the Source
Don't Mask Flakiness with Retries
Problematic Approach:
[Test]
[Retry(3)] // Hiding the problem!
public void FlakyTest()
{
    // Test with race condition
}
Better Approach:
[Test]
public async Task ReliableTest()
{
    // Fixed the race condition with proper synchronization
    await WaitForConditionAsync(...);
}
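WaitForConditionAsync is not part of NUnit; it stands in for whatever synchronization helper your project provides. If you don't have one, a minimal sketch (the name, default timeout, and polling interval are illustrative) could look like this:
// Hypothetical helper (requires System and System.Threading.Tasks):
// polls a condition until it becomes true or a timeout expires.
private static async Task WaitForConditionAsync(
    Func<bool> condition,
    TimeSpan? timeout = null,
    TimeSpan? pollInterval = null)
{
    var deadline = DateTime.UtcNow + (timeout ?? TimeSpan.FromSeconds(10));
    var interval = pollInterval ?? TimeSpan.FromMilliseconds(100);

    while (!condition())
    {
        if (DateTime.UtcNow >= deadline)
        {
            throw new TimeoutException("Condition was not met before the timeout.");
        }
        await Task.Delay(interval);
    }
}
The test then waits only as long as needed and fails with a clear timeout instead of failing intermittently.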
Identify Root Causes
Use Xping's factor breakdown to identify the real issue:
- Low Retry Behavior: Fix timing/synchronization issues
- Low Environment Consistency: Fix environment-specific dependencies
- Low Execution Stability: Fix resource contention or performance issues
Use Retries Appropriately
Retries are acceptable for:
- Temporary measure: While investigating a fix
- External services: When testing real external APIs (with appropriate tags; see the sketch after this list)
- Infrastructure flakiness: Network hiccups beyond your control
Retries are NOT acceptable for:
- Hiding test design problems: Race conditions, shared state
- Permanent solution: If you need retries long-term, the test is broken
- All tests: Blanket retry policies mask real issues
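When a retry is justified for an external dependency, make it visible and filterable by tagging the test. A sketch with NUnit attributes; the category name and endpoint are illustrative, and note that NUnit's [Retry] only re-runs assertion failures, not unhandled exceptions:
[Test]
[Retry(3)]                       // Acceptable here: real network calls can fail transiently
[Category("external-service")]   // Makes the external dependency explicit and filterable
public async Task PaymentGateway_HealthCheck_Responds()
{
    using var client = new HttpClient();
    HttpResponseMessage? response = null;

    // Convert transient network exceptions into assertion failures so [Retry] applies.
    try
    {
        response = await client.GetAsync("https://payments.example.com/health");
    }
    catch (HttpRequestException ex)
    {
        Assert.Fail($"Health endpoint unreachable: {ex.Message}");
    }

    Assert.That(response!.IsSuccessStatusCode, Is.True);
}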
Document Learnings
Build Institutional Knowledge
When you fix a flaky test:
Record What You Learned:
## Flaky Test Fix: UserLoginTest
**Problem**: Test failed intermittently with "element not found" error
**Root Cause**: Race condition - clicked login before page fully loaded
**Solution**: Added explicit wait for login button to be clickable
**Pattern**: Timing/Race Condition (see Common Flaky Patterns guide)
**Prevention**: Always wait for explicit conditions, never use fixed delays
**Xping Indicators**:
- Execution Stability: 0.42 (high variance)
- Retry Behavior: 0.38 (often passed on retry)
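The fix in this example corresponds to a standard explicit-wait pattern. A sketch assuming Selenium WebDriver for .NET (OpenQA.Selenium and OpenQA.Selenium.Support.UI); the driver instance, element ID, and timeout are illustrative:
// Bad: a fixed delay is either too short (flaky) or too long (slow)
// Thread.Sleep(2000);

// Better: wait explicitly until the login button is visible and enabled
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
wait.IgnoreExceptionTypes(typeof(NoSuchElementException));

var loginButton = wait.Until(d =>
{
    var element = d.FindElement(By.Id("login-button"));
    return element.Displayed && element.Enabled ? element : null;
});

loginButton.Click();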
Share with the Team
- Add to team wiki or documentation
- Present in engineering meetings
- Create runbooks for common patterns
- Mentor new team members on reliability
Build a Pattern Library
Create a team reference:
- Common flakiness patterns you've seen
- Solutions that worked
- Anti-patterns to avoid
- Code snippets for reliable test patterns
Set Quality Standards
Define Minimum Scores by Test Type
Critical Path Tests (e.g., login, checkout, payment):
- Minimum confidence: 0.85 (Highly Reliable)
- Rationale: These must be rock-solid; failures indicate serious bugs
- Action: Investigate immediately if score drops below threshold
Integration Tests (e.g., API calls, database operations):
- Minimum confidence: 0.75 (Reliable)
- Rationale: Some environmental variance acceptable, but should be stable
- Action: Monitor and fix if degrading
Unit Tests:
- Minimum confidence: 0.75 (Reliable)
- Rationale: Should be completely deterministic
- Action: Very low scores indicate test design problems
End-to-End Tests:
- Minimum confidence: 0.60 (Moderately Reliable)
- Rationale: External dependencies may introduce some flakiness
- Action: Strive for higher but accept some environmental variance
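One way to make these tiers actionable is to tag tests by type, so confidence thresholds can be reviewed or filtered per category. A sketch using NUnit categories; the category names are a suggested convention, not something Xping prescribes:
[Test]
[Category("critical-path")]   // Review against the 0.85 threshold
public void Checkout_CompletesOrder() { /* ... */ }

[Test]
[Category("integration")]     // Review against the 0.75 threshold
public void OrderRepository_PersistsOrder() { /* ... */ }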
Enforce Standards
In CI/CD:
- Block PRs introducing tests below thresholds
- Fail builds if critical tests degrade
- Require manual approval for low-confidence tests
In Code Review:
- Check new tests in Xping before approving
- Require explanations for tests with low scores
- Suggest improvements based on Xping's analysis
Test Design Principles
Write Deterministic Tests
Good:
// Fixed seed for reproducibility
var random = new Random(12345);
var testData = random.Next(1, 100);
Bad:
// Non-deterministic
var random = new Random();
var testData = random.Next(1, 100);
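If a test genuinely benefits from varied data, a middle ground is to choose a seed per run but log it, so any failure can be reproduced exactly. A minimal sketch using NUnit's TestContext:
// Random but reproducible: re-run with the logged seed to reproduce a failure
var seed = Environment.TickCount;
TestContext.WriteLine($"Random seed: {seed}");

var random = new Random(seed);
var testData = random.Next(1, 100);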
Ensure Proper Isolation
Good:
private Database _database;

[SetUp]
public void SetUp()
{
    _database = new Database(); // Fresh instance per test
}

[TearDown]
public void TearDown()
{
    _database.Dispose(); // Clean up
}
Bad:
private static Database _database = new Database(); // Shared state!
Make Tests Environment-Agnostic
Good:
[Test]
public void ProcessFile_Succeeds()
{
    var testFile = Path.Combine(TestContext.CurrentContext.TestDirectory, "test.txt");
    // Works on Windows, Linux, macOS
}
Bad:
[Test]
public void ProcessFile_Succeeds()
{
    var testFile = "C:\\temp\\test.txt"; // Windows-only!
}
CI/CD Integration Strategies
Run Tests Multiple Times
Before merging, ensure new tests are stable:
- name: Stability Check for New Tests
  run: |
    # Run new tests 20 times to gather confidence data
    for i in {1..20}; do
      dotnet test --filter "Category=new"
    done
Use Xping Data in Pipelines
Check confidence scores before merging:
- name: Check Test Reliability
  run: |
    # Query Xping API for test confidence scores
    # Fail if any test has confidence < 0.60
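What this step looks like depends entirely on your setup. As an illustration only, here is a sketch of a small gate written as a .NET 6+ top-level C# program; the endpoint URL and JSON shape are hypothetical, not a documented Xping API:
// Hypothetical gate: exits non-zero if any test's confidence score is below 0.60.
// The endpoint and response shape are assumptions, not a documented Xping API.
using System.Net.Http.Json;

var http = new HttpClient();
var scores = await http.GetFromJsonAsync<Dictionary<string, double>>(
    "https://xping.example.com/api/tests/confidence-scores");

var failing = scores!.Where(kvp => kvp.Value < 0.60).ToList();
foreach (var (testName, score) in failing)
{
    Console.Error.WriteLine($"Low confidence: {testName} = {score:F2}");
}

return failing.Count == 0 ? 0 : 1;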
Separate Flaky Test Runs
Run known-flaky tests separately:
- name: Reliable Tests
  run: dotnet test --filter "Category!=flaky"

- name: Flaky Tests (Informational)
  run: dotnet test --filter "Category=flaky"
  continue-on-error: true
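These filters assume flaky tests are explicitly tagged. With NUnit, that's a category attribute on the quarantined test, for example:
[Test]
[Category("flaky")] // Quarantined: tracked in a ticket until the root cause is fixed
public void ReportExport_GeneratesPdf()
{
    // ...
}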
Team Culture and Shared Responsibility
Make Reliability Everyone's Job
- Developers: Write reliable tests from the start
- Reviewers: Check Xping scores during PR review
- QA: Monitor dashboard and report degradations
- Team Leads: Prioritize reliability in sprint planning
Celebrate Improvements
Recognize reliability wins:
- Tests fixed and maintaining high scores
- Decrease in flaky test count
- Improvement in team's average confidence score
- Quick turnaround on fixing degradations
Regular Reviews
Include reliability in regular ceremonies:
- Daily Standup: Mention any critical flaky tests
- Sprint Planning: Allocate time for fixing flaky tests
- Sprint Review: Show reliability metrics trends
- Retrospectives: Discuss what caused flakiness and how to prevent it
Common Questions
Q: Why is my passing test showing a low confidence score?
A: Confidence isn't just about pass/fail. A test that passes 100% of the time but requires retries, has wildly varying execution times, or only passes in certain environments will have a lower confidence score. Xping is telling you the test is unreliable even if it technically passes eventually.
Q: How long does it take to get a confidence score?
A: You need at least 10 test executions before a confidence score is calculated. After 10 runs, you'll see a score at the "Low Confidence" level. The score becomes more reliable as you accumulate more executions:
- 25+ runs: Medium Confidence
- 50+ runs: High Confidence
- 100+ runs: Very High Confidence
Q: My test has a low score but I know why it's flaky. Can I override it?
A: No, confidence scores are calculated objectively based on actual behavior. However, you can:
- Add tags/categories to provide context (e.g., [Category("external-service")])
- Use filters to exclude these tests from your main view
- Create separate test suites for inherently unstable tests (like external service health checks)
- Focus on improving the test rather than overriding the score
Q: What's the difference between Success Rate and Confidence Score?
A:
- Success Rate: Simple percentage of passed executions (e.g., 85 passed out of 100 = 85%)
- Confidence Score: Holistic reliability assessment that considers pass rate, execution consistency, retry behavior, environment consistency, and more
A test with 100% success rate can have a low confidence score if it frequently needs retries or shows inconsistent behavior.
Q: Should I disable all tests below 0.60?
A: Not necessarily. First, understand why the score is low:
- Check the confidence level - is it based on enough data?
- Review the factor breakdown - which specific factors are problematic?
- Look at the trend - is it improving?
Then decide:
- If it's providing value and failures are legitimate → Fix the flakiness
- If it's flaky and nobody trusts it → Fix or disable
- If it's testing something inherently unstable → Reconsider test design or move to a separate suite
Q: How do I share a specific test's data with my team?
A: Each test has a unique URL (e.g., /tests/{test-id}). Copy and share this URL in your issue tracker, pull request, or team chat. Anyone with access to the dashboard can view the full test details.
Q: Can I get notifications when a test's confidence drops?
A: The current implementation calculates scores and flags low-confidence tests in the Flaky Tests tab. Check your dashboard regularly, or consider implementing your own alerting using the Xping API. Future versions may include built-in alerting capabilities.
Q: Why does a test show "Insufficient Data"?
A: The test hasn't reached the minimum 10 executions required for confidence score calculation. Keep running your tests, and once it hits 10 executions, you'll see a confidence score appear.
Q: The confidence score seems wrong. What should I do?
A:
- Check the Confidence Level - if it's Low, the score may change significantly as more data is collected
- Review the Factor Breakdown to see which factors are affecting the score
- Verify the Sample Size - small samples can be less accurate
- Look at the Execution History to verify the data matches reality
If something still seems incorrect, this may indicate an edge case worth investigating further.
See Also
- Identifying Flaky Tests - Find unreliable tests
- Fixing Flaky Tests - Strategies for addressing issues
- Monitoring Test Health - Regular review workflows
- Common Flaky Patterns - Pattern catalog