Best Practices

This guide consolidates best practices for keeping your tests reliable with Xping. Follow these guidelines to build and maintain a trustworthy test suite.


Making Confidence Scores Visible

Include in PR Reviews

Make reliability part of your code review process:

## PR Checklist
- [ ] All new tests have run at least 20 times
- [ ] New tests have confidence scores > 0.60
- [ ] No existing tests show declining trends due to these changes
- [ ] Link to Xping dashboard showing test results

Set Minimum Thresholds

Define quality gates based on test importance:

Recommended Thresholds:

  • Critical path tests: 0.85+ (Highly Reliable)
  • Integration tests: 0.75+ (Reliable)
  • All tests: 0.60+ (above the Flaky threshold)

Enforcement:

  • Don't merge code that introduces low-confidence tests
  • Block PRs if existing tests degrade below thresholds
  • Require investigation before merging tests with Medium or Low confidence levels

Display in Team Areas

Keep reliability top-of-mind:

  • Show Xping dashboard during standups
  • Display flaky test count on team information radiators
  • Include reliability metrics in sprint reviews
  • Share weekly summaries in team channels

Don't Ignore Low Scores

Why Flaky Tests Are Expensive

Flaky tests cost more than they save:

  • Time Waste: Team investigates failures that aren't real bugs
  • Eroded Trust: Developers start ignoring all test failures
  • Hidden Bugs: Real issues get dismissed as "probably just flaky"
  • CI/CD Delays: Pipelines get stuck or require manual overrides
  • Productivity Loss: Context switching to investigate intermittent failures

Take Action Immediately

When you see low scores:

  1. Acknowledge: Don't pretend it's not a problem
  2. Investigate: Use Xping's telemetry to understand why
  3. Fix or Disable: Either fix the root cause or temporarily disable the test (see the sketch below)
  4. Track: Create tickets to ensure disabled tests get re-enabled

Related: For fixing strategies, see Fixing Flaky Tests.
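
If you do disable a test, leave a paper trail so it gets re-enabled. A minimal sketch using NUnit's [Ignore] attribute; the test name and ticket reference are illustrative:

[Test]
[Ignore("Flaky: race condition under investigation, tracked in TEAM-1234")] // Hypothetical ticket ID
public void CheckoutTotals_AreCorrect()
{
    // Body unchanged; re-enable once the root cause is fixed
}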


Fix at the Source

Don't Mask Flakiness with Retries

Problem Approach:

[Test]
[Retry(3)] // Hiding the problem!
public void FlakyTest()
{
    // Test with race condition
}

Better Approach:

[Test]
public async Task ReliableTest()
{
    // Fixed the race condition with proper synchronization
    await WaitForConditionAsync(...);
}
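
The WaitForConditionAsync helper above is not defined in this guide; a minimal polling implementation might look like the following sketch (the name, timeout, and polling interval are assumptions, not part of Xping's API):

// Requires: using System; using System.Threading.Tasks;
// Polls a condition until it returns true or the timeout elapses.
private static async Task WaitForConditionAsync(
    Func<Task<bool>> condition,
    TimeSpan? timeout = null,
    TimeSpan? pollInterval = null)
{
    var deadline = DateTime.UtcNow + (timeout ?? TimeSpan.FromSeconds(10));
    var interval = pollInterval ?? TimeSpan.FromMilliseconds(200);

    while (DateTime.UtcNow < deadline)
    {
        if (await condition())
            return; // Condition met - let the test continue
        await Task.Delay(interval);
    }

    throw new TimeoutException("Condition was not met within the allotted time.");
}

Bounded polling against an explicit condition keeps the wait deterministic: the test proceeds as soon as the condition holds, or fails with a clear timeout instead of hanging or relying on a fixed delay.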

Identify Root Causes

Use Xping's factor breakdown to identify the real issue:

  • Low Retry Behavior: Fix timing/synchronization
  • Low Environment Consistency: Fix environment-specific dependencies
  • Low Execution Stability: Fix resource contention or performance issues

Use Retries Appropriately

Retries are acceptable for:

  • Temporary measure: While investigating a fix
  • External services: When testing real external APIs, with appropriate tags (see the sketch after these lists)
  • Infrastructure flakiness: Network hiccups beyond your control

Retries are NOT acceptable for:

  • Hiding test design problems: Race conditions, shared state
  • Permanent solution: If you need retries long-term, the test is broken
  • All tests: Blanket retry policies mask real issues
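
For the external-services case above, keep any retry narrowly scoped and clearly tagged. A hedged NUnit sketch; the category name, retry count, and test name are illustrative:

[Test]
[Category("external-service")] // Tagged so these runs can be filtered and reported separately
[Retry(2)]                     // Scoped to real network calls only - not a blanket policy
public void PaymentGateway_HealthCheck()
{
    // Calls a real external API; occasional failures here may be infrastructure noise
}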

Document Learnings

Build Institutional Knowledge

When you fix a flaky test:

Record What You Learned:

## Flaky Test Fix: UserLoginTest

**Problem**: Test failed intermittently with "element not found" error

**Root Cause**: Race condition - clicked login before page fully loaded

**Solution**: Added explicit wait for login button to be clickable

**Pattern**: Timing/Race Condition (see Common Flaky Patterns guide)

**Prevention**: Always wait for explicit conditions, never use fixed delays

**Xping Indicators**:
- Execution Stability: 0.42 (high variance)
- Retry Behavior: 0.38 (often passed on retry)
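
The fix recorded above is the explicit-wait pattern. A minimal sketch of such a wait, assuming a Selenium-based UI test (the locator, timeout, and driver setup are illustrative):

// Requires: OpenQA.Selenium and OpenQA.Selenium.Support.UI (Selenium.Support package)
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

// Wait for an explicit condition instead of a fixed delay (no Thread.Sleep)
var loginButton = wait.Until(d =>
{
    var element = d.FindElement(By.Id("login")); // Illustrative locator
    return element.Displayed && element.Enabled ? element : null; // null means "keep waiting"
});

loginButton.Click();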

Share with the Team

  • Add to team wiki or documentation
  • Present in engineering meetings
  • Create runbooks for common patterns
  • Mentor new team members on reliability

Build a Pattern Library

Create a team reference:

  • Common flakiness patterns you've seen
  • Solutions that worked
  • Anti-patterns to avoid
  • Code snippets for reliable test patterns

Set Quality Standards

Define Minimum Scores by Test Type

Critical Path Tests (e.g., login, checkout, payment):

  • Minimum confidence: 0.85 (Highly Reliable)
  • Rationale: These must be rock-solid; failures indicate serious bugs
  • Action: Investigate immediately if score drops below threshold

Integration Tests (e.g., API calls, database operations):

  • Minimum confidence: 0.75 (Reliable)
  • Rationale: Some environmental variance acceptable, but should be stable
  • Action: Monitor and fix if degrading

Unit Tests:

  • Minimum confidence: 0.75 (Reliable)
  • Rationale: Should be completely deterministic
  • Action: Very low scores indicate test design problems

End-to-End Tests:

  • Minimum confidence: 0.60 (Moderately Reliable)
  • Rationale: External dependencies may introduce some flakiness
  • Action: Strive for higher but accept some environmental variance
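
These thresholds are easier to enforce when they live in code rather than in a wiki. A minimal sketch of encoding them for a reliability gate; the category names and helper are hypothetical, not an Xping feature:

// Requires: using System.Collections.Generic;
private static readonly Dictionary<string, double> MinimumConfidence = new()
{
    ["critical-path"] = 0.85,
    ["integration"]   = 0.75,
    ["unit"]          = 0.75,
    ["e2e"]           = 0.60,
};

public static bool MeetsStandard(string category, double confidenceScore) =>
    confidenceScore >= MinimumConfidence.GetValueOrDefault(category, 0.60); // Unknown categories fall back to the Flaky threshold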

Enforce Standards

In CI/CD:

  • Block PRs introducing tests below thresholds
  • Fail builds if critical tests degrade
  • Require manual approval for low-confidence tests

In Code Review:

  • Check new tests in Xping before approving
  • Require explanations for tests with low scores
  • Suggest improvements based on Xping's analysis

Test Design Principles

Write Deterministic Tests

Good:

// Fixed seed for reproducibility
var random = new Random(12345);
var testData = random.Next(1, 100);

Bad:

// Non-deterministic
var random = new Random();
var testData = random.Next(1, 100);

Ensure Proper Isolation

Good:

private Database _database;

[SetUp]
public void SetUp()
{
    _database = new Database(); // Fresh instance per test
}

[TearDown]
public void TearDown()
{
    _database.Dispose(); // Clean up
}

Bad:

private static Database _database = new Database(); // Shared state!

Make Tests Environment-Agnostic

Good:

[Test]
public void ProcessFile_Succeeds()
{
    var testFile = Path.Combine(TestContext.CurrentContext.TestDirectory, "test.txt");
    // Works on Windows, Linux, macOS
}

Bad:

[Test]
public void ProcessFile_Succeeds()
{
    var testFile = "C:\\temp\\test.txt"; // Windows-only!
}

CI/CD Integration Strategies

Run Tests Multiple Times

Before merging, ensure new tests are stable:

- name: Stability Check for New Tests
  run: |
    # Run new tests 20 times to gather confidence data
    for i in {1..20}; do
      dotnet test --filter "TestCategory=new" || true  # Keep looping even if a run fails
    done

Use Xping Data in Pipelines

Check confidence scores before merging:

- name: Check Test Reliability
  run: |
    # Query Xping API for test confidence scores
    # Fail if any test has confidence < 0.60
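
A hedged sketch of such a gate as a small C# step; the endpoint URL, response shape, and environment variable are hypothetical placeholders for whatever API access your Xping setup provides:

// Hypothetical reliability gate: fetch confidence scores and fail the step below 0.60.
// Assumes .NET 6+ implicit usings; the endpoint and JSON shape are placeholders.
using System.Linq;
using System.Net.Http.Json;

var http = new HttpClient();
http.DefaultRequestHeaders.Add(
    "Authorization", $"Bearer {Environment.GetEnvironmentVariable("XPING_API_TOKEN")}");

var tests = await http.GetFromJsonAsync<List<TestScore>>(
    "https://example.invalid/api/tests/confidence") ?? new List<TestScore>();

var belowThreshold = tests.Where(t => t.ConfidenceScore < 0.60).ToList();
if (belowThreshold.Count > 0)
{
    foreach (var t in belowThreshold)
        Console.Error.WriteLine($"Low confidence: {t.Name} = {t.ConfidenceScore:F2}");
    Environment.Exit(1); // Fail the pipeline step
}

record TestScore(string Name, double ConfidenceScore);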

Separate Flaky Test Runs

Run known-flaky tests separately:

- name: Reliable Tests
  run: dotnet test --filter "Category!=flaky"

- name: Flaky Tests (Informational)
  run: dotnet test --filter "Category=flaky"
  continue-on-error: true

Team Culture and Shared Responsibility

Make Reliability Everyone's Job

  • Developers: Write reliable tests from the start
  • Reviewers: Check Xping scores during PR review
  • QA: Monitor dashboard and report degradations
  • Team Leads: Prioritize reliability in sprint planning

Celebrate Improvements

Recognize reliability wins:

  • Tests fixed and maintaining high scores
  • Decrease in flaky test count
  • Improvement in team's average confidence score
  • Quick turnaround on fixing degradations

Regular Reviews

Include reliability in regular ceremonies:

  • Daily Standup: Mention any critical flaky tests
  • Sprint Planning: Allocate time for fixing flaky tests
  • Sprint Review: Show reliability metrics trends
  • Retrospectives: Discuss what caused flakiness and how to prevent it

Common Questions

Q: Why is my passing test showing a low confidence score?

A: Confidence isn't just about pass/fail. A test that passes 100% of the time but requires retries, has wildly varying execution times, or only passes in certain environments will have a lower confidence score. Xping is telling you the test is unreliable even if it technically passes eventually.

Q: How long does it take to get a confidence score?

A: You need at least 10 test executions before a confidence score is calculated. After 10 runs, you'll see a score at the "Low Confidence" level. The score becomes more reliable as you accumulate more executions:

  • 25+ runs: Medium Confidence
  • 50+ runs: High Confidence
  • 100+ runs: Very High Confidence

Q: My test has a low score but I know why it's flaky. Can I override it?

A: No, confidence scores are calculated objectively based on actual behavior. However, you can:

  • Add tags/categories to provide context (e.g., [Category("external-service")])
  • Use filters to exclude these tests from your main view
  • Create separate test suites for inherently unstable tests (like external service health checks)
  • Focus on improving the test rather than overriding the score

Q: What's the difference between Success Rate and Confidence Score?

A:

  • Success Rate: Simple percentage of passed executions (e.g., 85 passed out of 100 = 85%)
  • Confidence Score: Holistic reliability assessment that considers pass rate, execution consistency, retry behavior, environment consistency, and more

A test with 100% success rate can have a low confidence score if it frequently needs retries or shows inconsistent behavior.

Q: Should I disable all tests below 0.60?

A: Not necessarily. First, understand why the score is low:

  • Check the confidence level - is it based on enough data?
  • Review the factor breakdown - which specific factors are problematic?
  • Look at the trend - is it improving?

Then decide:

  • If it's providing value and failures are legitimate → Fix the flakiness
  • If it's flaky and nobody trusts it → Fix or disable
  • If it's testing something inherently unstable → Reconsider test design or move to a separate suite

Q: How do I share a specific test's data with my team?

A: Each test has a unique URL (e.g., /tests/{test-id}). Copy and share this URL in your issue tracker, pull request, or team chat. Anyone with access to the dashboard can view the full test details.

Q: Can I get notifications when a test's confidence drops?

A: The current implementation calculates scores and flags low-confidence tests in the Flaky Tests tab. Check your dashboard regularly, or consider implementing your own alerting using the Xping API. Future versions may include built-in alerting capabilities.

Q: Why does a test show "Insufficient Data"?

A: The test hasn't reached the minimum 10 executions required for confidence score calculation. Keep running your tests, and once it hits 10 executions, you'll see a confidence score appear.

Q: The confidence score seems wrong. What should I do?

A:

  1. Check the Confidence Level - if it's Low, the score may change significantly as more data is collected
  2. Review the Factor Breakdown to see which factors are affecting the score
  3. Verify the Sample Size - small samples can be less accurate
  4. Look at the Execution History to verify the data matches reality

If something still seems incorrect, this may indicate an edge case worth investigating further.


See Also