Testing Best Practices

Best practices for testing AI agents with FluxLoop.

Core Principles

1. Test Early, Test Often

Start testing from day one:

# Don't wait until "ready"
fluxloop test

Benefits:

Catch issues early
Build confidence incrementally
Establish baseline performance

2. Use Realistic Inputs

Test with realistic, diverse inputs:

✅ fluxloop inputs generate --persona novice_user,expert_user
❌ Manual inputs: ["test1", "test2", "test3"]

3. Automate Everything

Integrate testing into your workflow:

# CI/CD pipeline
fluxloop test --yes --no-skip-upload

4. Track Changes Over Time

Compare results across versions:

fluxloop results compare --baseline production

Input Design

Diverse Personas

Cover different user types:

personas:
  - novice_user       # 40%
  - intermediate_user # 35%
  - expert_user       # 25%
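
A persona mix like the one above still has to be turned into concrete input counts. The following is a minimal sketch of one way to do that, not a FluxLoop feature: the helper name `allocate_by_persona` is hypothetical, and the integer weights simply restate the percentages from the comments above.

```python
# Hypothetical helper: split a fixed input budget across personas.
PERSONA_WEIGHTS = {"novice_user": 40, "intermediate_user": 35, "expert_user": 25}


def allocate_by_persona(total: int, weights: dict[str, int]) -> dict[str, int]:
    """Split `total` inputs proportionally, using largest-remainder rounding."""
    if total < 0:
        raise ValueError("total must be non-negative")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")

    # Exact (fractional) share per persona, then the floor of each share.
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    counts = {name: int(share) for name, share in exact.items()}

    # Hand the leftover slots to the personas with the largest fractional parts.
    leftover = total - sum(counts.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts
```

Largest-remainder rounding keeps the total exact, so a budget of 7 inputs still produces exactly 7 inputs.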

Edge Cases

Include unusual scenarios:

inputs:
  - ""                          # Empty input
  - "<1000 x 'A'>"              # Very long input (placeholder: YAML cannot evaluate "A" * 1000, so generate the string programmatically)
  - "!@#$%^&*()"               # Special characters
  - "混合语言 mixed language"  # Mixed languages
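
Some of these edge cases are easier to build in code than to type by hand, the very long input in particular. A minimal sketch, assuming you assemble the list yourself before handing it to your test setup; `build_edge_case_inputs` is a hypothetical name, not a FluxLoop API.

```python
def build_edge_case_inputs(long_length: int = 1000) -> list[str]:
    """Return the edge-case inputs listed above as plain Python strings."""
    return [
        "",                        # Empty input
        "A" * long_length,         # Very long input
        "!@#$%^&*()",              # Special characters
        "混合语言 mixed language",   # Mixed languages
    ]
```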

Realistic Distribution

Match production traffic:

# Analyze production logs (analyze_logs.py stands in for your own script)
python analyze_logs.py --output distribution.yaml

# Generate matching test inputs
fluxloop inputs generate --distribution distribution.yaml
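
The contents of `analyze_logs.py` are not shown here. As a rough sketch of the core step such a script might perform, the function below turns a list of labels (for example, one intent or persona label per logged request) into relative frequencies. The function name and the idea of one label per request are assumptions for illustration.

```python
from collections import Counter


def traffic_distribution(labels: list[str]) -> dict[str, float]:
    """Map each label to its share of the observed traffic."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders labels from most to least frequent.
    return {label: count / total for label, count in counts.most_common()}
```

For example, `traffic_distribution(["billing", "billing", "auth", "billing"])` assigns three quarters of the traffic to `billing` and one quarter to `auth`.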

Test Coverage

Feature Coverage

Test all major features:

scenarios:
  - authentication
  - user_management
  - billing
  - api_integration
  - error_handling

Behavioral Coverage

Test different behaviors:

behaviors:
  - happy_path          # Normal usage
  - error_conditions    # Failures
  - edge_cases          # Unusual inputs
  - multi_turn          # Conversations
  - time_sensitive      # Timeouts, etc.

Persona Coverage

Test all user types:

personas:
  - novice_user
  - expert_user
  - frustrated_user
  - mobile_user
  - non_native_english

Test Organization

Scenario Structure

scenarios/
├── smoke/            # Quick sanity checks
├── regression/       # Core functionality
├── performance/      # Speed tests
└── edge-cases/       # Unusual scenarios

Naming Conventions

✅ auth_password_reset_flow
✅ billing_subscription_upgrade
✅ api_rate_limit_exceeded

❌ test1
❌ scenario_final_v2
❌ my_test
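
A naming convention is easier to keep when it is checked automatically. The sketch below is one possible heuristic, not an official rule: it assumes snake_case names with at least three parts and rejects placeholder words. The pattern lists and the name `is_descriptive_name` are hypothetical.

```python
import re

# Lowercase snake_case: letters and digits separated by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")
# Parts that carry no information about what the scenario covers (assumed list).
PLACEHOLDER = re.compile(r"(?:test\d*|final|my|tmp|scenario|v\d+)")


def is_descriptive_name(name: str, min_parts: int = 3) -> bool:
    """Return True if `name` looks like area_feature_case rather than a placeholder."""
    if not SNAKE_CASE.fullmatch(name):
        return False
    parts = name.split("_")
    if len(parts) < min_parts:
        return False
    return not any(PLACEHOLDER.fullmatch(part) for part in parts)
```

With these assumptions, the three names marked ✅ above pass and the three marked ❌ fail.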

Iteration Strategy

Multiple Iterations

Run each input multiple times:

fluxloop test --iterations 10

Why:

Measure consistency
Account for LLM randomness
Collect enough samples for statistically meaningful results
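
To see why more iterations give more confidence, it helps to put an interval around the observed pass rate. The sketch below uses the Wilson score interval, a standard choice for proportions from small samples. It is an illustration only and says nothing about how FluxLoop itself reports results.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval (z = 1.96) for the true pass rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passes <= runs:
        raise ValueError("passes must be between 0 and runs")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

With 8 passes out of 10 runs the interval is wide (roughly 0.49 to 0.94). With 80 out of 100 it narrows considerably, which is the practical argument for running more iterations on critical paths.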

Recommended Iterations

Test Type      Iterations  Reason
Smoke          1           Quick feedback
Regression     3-5         Balance speed/confidence
Performance    10-20       Statistical validity
Critical Path  10-20       High confidence

Evaluation Criteria

Define Clear Criteria

criteria:
  - id: accuracy
    description: "Provides factually correct information"
    weight: 0.4

  - id: helpfulness
    description: "Helps user achieve their goal"
    weight: 0.3

  - id: safety
    description: "No harmful or inappropriate content"
    weight: 0.3
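
The weights above suggest that per-criterion scores are combined into one overall score. How FluxLoop combines them is not stated here. The sketch below assumes a plain weighted average of scores in the range 0 to 1, and `weighted_score` is a hypothetical helper.

```python
CRITERIA_WEIGHTS = {"accuracy": 0.4, "helpfulness": 0.3, "safety": 0.3}


def weighted_score(scores: dict[str, float],
                   weights: dict[str, float] = CRITERIA_WEIGHTS) -> float:
    """Weighted average of per-criterion scores (assumed to lie in [0, 1])."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Dividing by the total keeps the result in range even if weights do not sum to 1.
    return sum(scores[name] * w for name, w in weights.items()) / total_weight
```

For example, scores of 1.0 for accuracy, 0.5 for helpfulness, and 1.0 for safety give 0.4 + 0.15 + 0.3 = 0.85.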

Set Appropriate Thresholds

evaluation:
  pass_threshold: 0.75  # Adjust based on use case

  # Critical features: Higher threshold
  critical:
    pass_threshold: 0.90

  # Experimental features: Lower threshold
  experimental:
    pass_threshold: 0.60
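
The effect of tiered thresholds is that the same score can pass in one tier and fail in another. A minimal sketch, assuming a score passes when it is greater than or equal to the threshold; the dictionary layout and the `passes` helper are illustrative, not FluxLoop's evaluation logic.

```python
# Thresholds copied from the configuration above.
PASS_THRESHOLDS = {"default": 0.75, "critical": 0.90, "experimental": 0.60}


def passes(score: float, tier: str = "default") -> bool:
    """Return True if `score` meets the threshold for `tier` (unknown tiers use the default)."""
    threshold = PASS_THRESHOLDS.get(tier, PASS_THRESHOLDS["default"])
    return score >= threshold
```

A score of 0.8 passes as a default feature but fails as a critical one.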

CI/CD Integration

Pre-Deployment Testing

# .github/workflows/deploy.yml
- name: Run Tests
  run: fluxloop test --scenario production --yes

- name: Check Pass Rate
  run: fluxloop results check --fail-on-regression

- name: Deploy
  if: success()
  run: ./deploy.sh
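
The exact rule behind `--fail-on-regression` is not described here. To make the idea concrete, the sketch below shows one plausible gate: fail the pipeline when the pass rate drops below the baseline by more than a tolerance. The function name, the tolerance value, and the use of pass rate as the compared metric are all assumptions.

```python
def regression_exit_code(current_pass_rate: float,
                         baseline_pass_rate: float,
                         tolerance: float = 0.02) -> int:
    """Return 1 (fail the CI step) if the pass rate dropped by more than `tolerance`, else 0."""
    drop = baseline_pass_rate - current_pass_rate
    return 1 if drop > tolerance else 0
```

In a CI script the return value would typically be passed to `sys.exit`, since a non-zero exit code stops the following steps from running.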

Continuous Testing

# Daily regression tests (GitHub Actions cron schedules run in UTC)
on:
  schedule:
    - cron: '0 2 * * *'

Branch Testing

# Test on every PR
on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - run: fluxloop test --scenario regression --no-skip-upload

Performance Testing

Response Time

Monitor latency:

criteria:
  - id: response_time
    thresholds:            # Presumably milliseconds; confirm the unit your configuration expects
      excellent: 500
      good: 1000
      acceptable: 2000
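
Threshold bands like these map each measured latency to a label. A minimal sketch, assuming the values are milliseconds and each band includes its upper limit; the `too_slow` label for anything beyond the last band is hypothetical.

```python
# (upper limit in ms, label), ordered from fastest to slowest band.
LATENCY_BANDS_MS = [(500, "excellent"), (1000, "good"), (2000, "acceptable")]


def classify_latency(latency_ms: float) -> str:
    """Return the first band whose upper limit is not exceeded."""
    for limit, label in LATENCY_BANDS_MS:
        if latency_ms <= limit:
            return label
    return "too_slow"  # Hypothetical label for responses slower than every band
```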

Throughput

Test concurrent requests:

fluxloop test --parallel 10 --iterations 100

Resource Usage

Monitor memory and CPU:

runtime:
  limits:
    max_memory: "1GB"
    max_cpu: "2"

Debugging Failures

Review Failed Tests

# List failures
fluxloop results list --status failed

# View specific failure
fluxloop results show input_001

Reproduce Locally

# Reproduce specific failure
fluxloop test --input input_001 --iterations 1

Analyze Patterns

# Group failures by type
fluxloop results analyze --group-by error_type
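
If you export results and want to group failures yourself, the logic is straightforward. The sketch below assumes each result is a dictionary with `status`, `error_type`, and `input_id` keys. Those field names are hypothetical and may not match FluxLoop's actual output format.

```python
from collections import defaultdict


def group_failures(results: list[dict]) -> dict[str, list[str]]:
    """Group failed input IDs by error type, largest group first."""
    groups: dict[str, list[str]] = defaultdict(list)
    for result in results:
        if result["status"] == "failed":
            groups[result.get("error_type", "unknown")].append(result["input_id"])
    # Sort so the most common failure type appears first.
    return dict(sorted(groups.items(), key=lambda item: len(item[1]), reverse=True))
```

Sorting by group size puts the most frequent failure type at the top, which is usually the best place to start investigating.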

Common Pitfalls

❌ Testing Too Late

Don't wait until "done":

❌ Build agent → Deploy → Test
✅ Test → Build → Test → Deploy → Test

❌ Insufficient Coverage

Don't test only happy paths:

❌ Only novice_user persona
✅ Multiple personas + edge cases

❌ Ignoring Failures

Don't skip failures:

❌ "It's just one failure, ship it"
✅ Investigate all failures

❌ Manual Testing Only

Don't rely on manual testing:

❌ Manually test 10 cases
✅ Automatically test 1,000 cases

❌ No Baseline

Don't test without comparison:

❌ Test once, no context
✅ Compare to baseline, track trends

Optimization

Parallel Execution

Speed up tests:

fluxloop test --parallel 4

Selective Testing

Test relevant scenarios:

# Changed authentication code?
fluxloop test --scenario auth --no-skip-upload

# Changed everything?
fluxloop test --all-scenarios --no-skip-upload

Caching

Cache unchanged results:

runtime:
  cache:
    enabled: true
    ttl: 3600  # 1 hour

Monitoring

Set Up Alerts

# Alert on failures
alerts:
  - type: failure_rate
    threshold: 0.10  # Alert if > 10% fail
    channel: slack

  - type: score_drop
    threshold: 0.05  # Alert if score drops > 5%
    channel: email
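
The alert rules above reduce to a simple comparison: fire when an observed value exceeds its threshold. The sketch below mirrors that configuration in Python. It assumes `score_drop` means an absolute drop in average score and that alerts fire only when the value is strictly greater than the threshold. `triggered_alerts` is a hypothetical helper, not a FluxLoop API.

```python
ALERT_RULES = [
    {"type": "failure_rate", "threshold": 0.10, "channel": "slack"},
    {"type": "score_drop", "threshold": 0.05, "channel": "email"},
]


def triggered_alerts(observed: dict[str, float],
                     rules: list[dict] = ALERT_RULES) -> list[dict]:
    """Return the rules whose observed value strictly exceeds the threshold."""
    # A metric missing from `observed` is treated as 0.0 and never fires.
    return [rule for rule in rules
            if observed.get(rule["type"], 0.0) > rule["threshold"]]
```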

Track Metrics

Monitor key metrics:

metrics:
  - pass_rate
  - avg_score
  - response_time_p95
  - failure_count
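
To make these four metrics concrete, the sketch below computes them from a list of per-test records. The record fields (`passed`, `score`, `response_time_ms`) are assumed for illustration, and the 95th percentile uses the nearest-rank method, which is one of several common definitions.

```python
def summarize_run(records: list[dict]) -> dict[str, float]:
    """Compute pass_rate, avg_score, response_time_p95, and failure_count."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    failures = sum(1 for r in records if not r["passed"])
    latencies = sorted(r["response_time_ms"] for r in records)
    # Nearest-rank p95: rank = ceil(0.95 * n), computed with integer arithmetic.
    rank = (95 * n + 99) // 100
    return {
        "pass_rate": (n - failures) / n,
        "avg_score": sum(r["score"] for r in records) / n,
        "response_time_p95": latencies[rank - 1],
        "failure_count": failures,
    }
```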

Review Regularly

Schedule reviews:

Weekly: Review test results
Monthly: Update test scenarios
Quarterly: Revise evaluation criteria

Team Practices

Shared Scenarios

Use team scenarios:

# Pull team scenarios
fluxloop sync pull

# Test locally
fluxloop test --scenario team-regression

Code Reviews

Include test results in PRs:

## Test Results

- Scenario: auth-flows
- Pass Rate: 98% (49/50)
- Avg Score: 0.89
- Link: https://results.fluxloop.ai/run/run_123

Documentation

Document test strategy:

# Testing Strategy

## Scenarios
- smoke: Quick health checks
- regression: Core functionality
- performance: Speed tests

## Frequency
- smoke: Every commit
- regression: Every PR
- performance: Daily

## Criteria
- accuracy: 0.4 weight
- helpfulness: 0.3 weight
- safety: 0.3 weight

Related

Basic Workflow - End-to-end testing
Evaluation Guide - Understanding results
CI/CD Integration - Automation

Examples

See FluxLoop Examples for complete testing workflows.
