Evaluation Guide

Understanding and interpreting FluxLoop test evaluation results.

Overview

FluxLoop evaluates agent test results using:

  • Automated criteria scoring
  • LLM-based evaluation
  • Performance metrics
  • Behavioral analysis

Evaluation Process

1. Run Tests

fluxloop test --scenario production --no-skip-upload

2. Automatic Evaluation

Results are automatically evaluated against the criteria you configure:

# Evaluation criteria
criteria:
  - accuracy: "Factual correctness"
  - helpfulness: "Usefulness of the response to the user"
  - safety: "No harmful content"

3. View Results

# View locally
fluxloop results show

# View on Web Platform
# https://results.fluxloop.ai/run/run_123

Evaluation Criteria

LLM Judge

An LLM judge scores subjective qualities against a rubric:

criteria:
  - id: accuracy
    type: "llm_judge"
    weight: 0.4
    rubric: |
      1.0: Completely accurate
      0.7: Mostly accurate, minor errors
      0.4: Significant inaccuracies
      0.0: Incorrect information
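
Under the hood, the judge model is shown the rubric alongside the input and the agent's response, and asked to return a single score. FluxLoop handles this internally; the sketch below only illustrates the idea, and names like build_judge_prompt and parse_score are hypothetical:

# Hypothetical sketch of rubric-based judging; FluxLoop's actual prompt
# and parsing are internal and may differ.
def build_judge_prompt(rubric: str, user_input: str, response: str) -> str:
    # The judge sees the rubric and is asked to reply with one number.
    return (
        "Score the response against this rubric. Reply with one number.\n"
        f"Rubric:\n{rubric}\n"
        f"User input: {user_input}\n"
        f"Agent response: {response}\n"
        "Score:"
    )

def parse_score(judge_reply: str) -> float:
    # Clamp to the 0.0-1.0 range used by all criteria.
    return min(1.0, max(0.0, float(judge_reply.strip())))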

Metrics

Automated performance measurements:

criteria:
  - id: response_time
    type: "metric"
    weight: 0.2
    thresholds:
      excellent: 500    # < 500ms
      good: 1000        # < 1s
      acceptable: 2000  # < 2s
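
How a raw measurement maps to a 0.0-1.0 score is not spelled out here; one plausible mapping (an assumption for illustration, not FluxLoop's documented behavior) is a step function over the threshold bands:

# Assumed step mapping from measured latency (ms) to a score; the
# per-band values (1.0 / 0.75 / 0.5 / 0.0) are illustrative only.
def latency_score(elapsed_ms: float, thresholds: dict) -> float:
    if elapsed_ms < thresholds["excellent"]:
        return 1.0
    if elapsed_ms < thresholds["good"]:
        return 0.75
    if elapsed_ms < thresholds["acceptable"]:
        return 0.5
    return 0.0

latency_score(800, {"excellent": 500, "good": 1000, "acceptable": 2000})  # 0.75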

Classifiers

Binary pass/fail checks:

criteria:
  - id: safety
    type: "classifier"
    weight: 0.4
    required: true  # Must pass
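
A classifier yields only 1.0 (pass) or 0.0 (fail). As a deliberately naive illustration of that shape (real safety classifiers are typically model-based; this blocklist is hypothetical):

# Naive stand-in for a safety classifier: the outcome is strictly binary.
BLOCKED_PHRASES = ("violent threat", "self-harm instructions")

def safety_score(response: str) -> float:
    text = response.lower()
    return 0.0 if any(p in text for p in BLOCKED_PHRASES) else 1.0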

Scoring

Individual Scores

Each criterion is scored from 0.0 to 1.0:

{
  "input_id": "input_001",
  "scores": {
    "accuracy": 0.9,
    "helpfulness": 0.8,
    "safety": 1.0
  }
}

Overall Score

The overall score is the weighted average of the per-criterion scores:

overall_score = (accuracy * 0.4) + (helpfulness * 0.4) + (safety * 0.2)
              = (0.9 * 0.4) + (0.8 * 0.4) + (1.0 * 0.2)
              = 0.88
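
The same computation in code, mirroring the formula above:

scores  = {"accuracy": 0.9, "helpfulness": 0.8, "safety": 1.0}
weights = {"accuracy": 0.4, "helpfulness": 0.4, "safety": 0.2}

overall_score = sum(scores[c] * weights[c] for c in scores)  # 0.88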

Pass/Fail

A run passes or fails based on the configured threshold:

evaluation:
  pass_threshold: 0.75

# Score 0.88 >= 0.75 → PASS ✅
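
Putting the pieces together: a run passes when the overall score clears the threshold and every criterion marked required: true passes on its own. A minimal sketch of that logic (the function name and signature are assumptions):

def run_passes(overall_score: float, scores: dict, required: list,
               pass_threshold: float = 0.75) -> bool:
    # A required criterion (e.g. safety) must pass on its own, no matter
    # how high the weighted average is.
    if any(scores[c] < 1.0 for c in required):
        return False
    return overall_score >= pass_threshold

run_passes(0.88, {"accuracy": 0.9, "helpfulness": 0.8, "safety": 1.0},
           required=["safety"])  # True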

Interpreting Results

High Score (0.9+)

✅ Excellent performance

  • Agent meets or exceeds expectations
  • Ready for production

Good Score (0.75-0.9)

✅ Good performance

  • Acceptable for production
  • Minor improvements possible

Marginal Score (0.6-0.75)

⚠️ Needs improvement

  • Some issues present
  • Review failures
  • Iterate on prompts

Poor Score (< 0.6)

❌ Significant issues

  • Not ready for production
  • Major revisions needed

Common Patterns

High Accuracy, Low Helpfulness

accuracy: 0.9
helpfulness: 0.5

Agent provides correct but unhelpful responses.

Fix: Improve response formatting, add examples.

High Helpfulness, Low Accuracy

accuracy: 0.6
helpfulness: 0.9

Agent is friendly but provides wrong information.

Fix: Improve knowledge base, add fact-checking.

Inconsistent Scores

input_001: 0.9
input_002: 0.4
input_003: 0.9

Agent handles some cases well, others poorly.

Fix: Identify failure patterns (see the sketch below) and add targeted fixes for those cases.
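
One way to surface the pattern is to sort per-input scores and inspect everything below the pass threshold; a quick sketch (the results dict stands in for however you load your run data):

# Flag inputs below the pass threshold, then look for what they share
# (topic, length, phrasing) to guide targeted fixes.
results = {"input_001": 0.9, "input_002": 0.4, "input_003": 0.9}

for input_id, score in sorted(results.items(), key=lambda kv: kv[1]):
    if score < 0.75:
        print(f"REVIEW {input_id}: {score}")  # REVIEW input_002: 0.4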

Best Practices

1. Define Clear Criteria

✅ Good:
- id: accuracy
  description: "Provides factually correct information"

❌ Vague:
- id: quality
  description: "Good response"

2. Use Multiple Criteria

Don't rely on a single metric:

criteria:
  - accuracy
  - helpfulness
  - safety
  - response_time

3. Regular Evaluation

Evaluate frequently:

# Daily in CI
fluxloop test --scenario regression --no-skip-upload

Compare results over time:

fluxloop results compare --baseline last-week

Next Steps

  • Define evaluation criteria for your agent
  • Run tests and review scores
  • Iterate based on results
  • Set up continuous evaluation