fluxloop criteria

Manage evaluation criteria.

Synopsis

fluxloop criteria [command] [options]

Description

The criteria command manages evaluation criteria for test scenarios. Criteria define what makes a test pass or fail.

Commands

`fluxloop criteria list`

List all evaluation criteria.

Usage:

fluxloop criteria list [options]

Options:

--scenario <name>: List criteria for a specific scenario
--type <type>: Filter by criteria type
--json: Output in JSON format

Examples:

# List all criteria
fluxloop criteria list

# List criteria for a specific scenario
fluxloop criteria list --scenario password-reset

# List criteria of a specific type
fluxloop criteria list --type contains

Output:

Evaluation Criteria

Scenario: password-reset
┌──────────────────────────────────────────────────────────────────┐
│ contains-reset-link                                              │
│ Type: contains                                                   │
│ Field: response                                                  │
│ Value: "reset link"                                              │
│ Required: Yes | Weight: 0.3                                      │
│ Pass rate: 95% (last 30 days)                                   │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│ response-time                                                    │
│ Type: response_time                                              │
│ Threshold: < 3000ms                                              │
│ Required: Yes | Weight: 0.2                                      │
│ Pass rate: 82% (last 30 days)                                   │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│ empathy-check                                                    │
│ Type: sentiment                                                  │
│ Min score: 0.6                                                   │
│ Required: No | Weight: 0.3                                       │
│ Pass rate: 78% (last 30 days)                                   │
└──────────────────────────────────────────────────────────────────┘

Total: 4 criteria

`fluxloop criteria create`

Create a new evaluation criterion.

Usage:

fluxloop criteria create [options]

Options:

--scenario <name>: Scenario to add criterion to (required)
--type <type>: Criterion type (required)
--name <name>: Criterion name/ID
--required: Mark as required (default: true)
--weight <weight>: Weight for scoring (0.0-1.0, default: 1.0)

Type-specific options:

For contains type:

--field <field>: Field to check (response, metadata, etc.)
--value <value>: Value to search for

For response_time type:

--threshold <ms>: Max response time in milliseconds

For sentiment type:

--min-score <score>: Minimum sentiment score (0.0-1.0)

For regex type:

--pattern <regex>: Regular expression pattern

Examples:

# Create "contains" criterion
fluxloop criteria create \
  --scenario password-reset \
  --type contains \
  --name "mentions-email" \
  --field response \
  --value "email"

# Create response time criterion
fluxloop criteria create \
  --scenario password-reset \
  --type response_time \
  --threshold 3000 \
  --required

# Create sentiment criterion
fluxloop criteria create \
  --scenario password-reset \
  --type sentiment \
  --min-score 0.6 \
  --weight 0.3

# Create regex criterion
fluxloop criteria create \
  --scenario order-tracking \
  --type regex \
  --name "order-number-format" \
  --pattern '^#[0-9]{6}$'

Interactive Flow:

$ fluxloop criteria create --scenario password-reset

Create Evaluation Criterion

Scenario: password-reset

Criterion types:
  1. contains - Response contains specific text
  2. not_contains - Response doesn't contain text
  3. response_time - Response time under threshold
  4. sentiment - Sentiment analysis score
  5. regex - Regular expression match
  6. json_schema - JSON schema validation
  7. custom - Custom evaluation function

Select type (1-7): 1

Field to check:
  1. response - Agent response text
  2. metadata - Response metadata
  3. tool_calls - Tool invocations
  4. all - All fields

Select field (1-4): 1

Value to search for: reset link

Criterion name (optional): contains-reset-link

Required? (y/n): y

Weight (0.0-1.0, default 1.0): 0.3

✅ Criterion created: contains-reset-link

Criterion added to: scenarios/password-reset.yaml

Next steps:
• Run test to validate: fluxloop test --scenario password-reset
• View criteria: fluxloop criteria list --scenario password-reset

`fluxloop criteria show`

Display details for a specific criterion.

Usage:

fluxloop criteria show <criterion-id> [options]

Options:

--scenario <name>: Scenario containing the criterion (required)
--json: Output in JSON format

Examples:

# Show criterion details
fluxloop criteria show contains-reset-link \
  --scenario password-reset

Output:

Criterion: contains-reset-link

Scenario: password-reset
Type: contains
Field: response
Value: "reset link"

Configuration:
  Required: Yes
  Weight: 0.3
  Case sensitive: No

Performance (Last 30 days):
  Tests run: 156
  Passed: 148 (95%)
  Failed: 8 (5%)

  Avg check time: 12ms

Recent failures:
  • Jan 15 14:30 - "password reset" (missing "link")
  • Jan 14 11:20 - "sent reset email" (missing "link")
  • Jan 13 16:45 - "check your inbox" (missing "reset link")

Suggestions:
  • Consider accepting variations: "reset email", "reset message"
  • Current strict matching may be too rigid
  • 5% failure rate is within acceptable range

`fluxloop criteria update`

Update an existing criterion.

Usage:

fluxloop criteria update <criterion-id> [options]

Options:

--scenario <name>: Scenario containing the criterion (required)
--required <bool>: Update required flag
--weight <weight>: Update weight
--value <value>: Update value (for contains/regex types)
--threshold <ms>: Update threshold (for response_time type)

Examples:

# Update criterion weight
fluxloop criteria update contains-reset-link \
  --scenario password-reset \
  --weight 0.5

# Make criterion optional
fluxloop criteria update empathy-check \
  --scenario password-reset \
  --required false

# Update threshold
fluxloop criteria update response-time \
  --scenario password-reset \
  --threshold 5000

`fluxloop criteria delete`

Delete a criterion.

Usage:

fluxloop criteria delete <criterion-id> [options]

Options:

--scenario <name>: Scenario containing the criterion (required)
--force: Skip confirmation prompt

Examples:

# Delete criterion (with confirmation)
fluxloop criteria delete old-criterion \
  --scenario password-reset

# Delete without confirmation
fluxloop criteria delete old-criterion \
  --scenario password-reset \
  --force

`fluxloop criteria pull`

Pull criteria from the cloud.

Usage:

fluxloop criteria pull [options]

Options:

--scenario <name>: Pull criteria for specific scenario
--all: Pull criteria for all scenarios (default)

Examples:

# Pull all criteria
fluxloop criteria pull

# Pull criteria for specific scenario
fluxloop criteria pull --scenario password-reset

`fluxloop criteria push`

Push criteria to the cloud.

Usage:

fluxloop criteria push [options]

Options:

--scenario <name>: Push criteria for specific scenario
--all: Push criteria for all scenarios (default)

Examples:

# Push all criteria
fluxloop criteria push

# Push criteria for specific scenario
fluxloop criteria push --scenario password-reset

Criterion Types

1. Contains

Check whether the response contains specific text:

- id: contains-reset-link
  type: contains
  field: response
  value: "reset link"
  required: true
  weight: 0.3
  case_sensitive: false
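
To make the matching rule concrete, here is a minimal Python sketch of a contains check. It is not fluxloop's implementation. The helper name `contains` is hypothetical, and treating a list value as "match if any item is present" is an assumption based on the list-valued examples elsewhere on this page.

```python
from typing import Sequence, Union


def contains(text: str, value: Union[str, Sequence[str]],
             case_sensitive: bool = False) -> bool:
    """Return True if `text` contains `value`, or any item of `value` if it is a list."""
    # Normalize a single string into a one-item list so both shapes share one path.
    candidates = [value] if isinstance(value, str) else list(value)
    if not case_sensitive:
        # Mirrors `case_sensitive: false`: compare lowercased copies.
        text = text.lower()
        candidates = [c.lower() for c in candidates]
    # Assumption: a list value passes when at least one item is found.
    return any(c in text for c in candidates)


print(contains("We emailed you a Reset Link.", "reset link"))                       # True
print(contains("We emailed you a Reset Link.", "reset link", case_sensitive=True))  # False
print(contains("Please contact support.", ["help", "contact"]))                     # True
```

With `case_sensitive: false`, a response that capitalizes "Reset Link" still passes, which is usually what you want for free-form agent text.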

2. Not Contains

Check if response doesn't contain specific text:

- id: no-error-messages
  type: not_contains
  field: response
  value: ["error", "failed", "cannot"]
  required: true
  weight: 0.2

3. Response Time

Check that the response time is under the threshold, given in milliseconds:

- id: response-time
  type: response_time
  threshold_ms: 3000
  required: true
  weight: 0.2
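
As an illustration of what a threshold check measures, the sketch below times a call and compares the elapsed time with `threshold_ms`. The names `within_threshold` and `fake_agent` are hypothetical stand-ins, and how fluxloop itself measures response time is not specified here.

```python
import time
from typing import Callable, Tuple


def within_threshold(call: Callable[[], object],
                     threshold_ms: float) -> Tuple[bool, float]:
    """Run `call` once and report whether it finished in under `threshold_ms`."""
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    call()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms < threshold_ms, elapsed_ms


def fake_agent() -> str:
    """Hypothetical stand-in for an agent call; sleeps about 50 ms."""
    time.sleep(0.05)
    return "I've sent you a reset link."


passed, elapsed_ms = within_threshold(fake_agent, threshold_ms=3000)
print(passed, f"{elapsed_ms:.0f} ms")
```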

4. Sentiment

Check that the sentiment score of the response falls within a range:

- id: empathy-check
  type: sentiment
  min_score: 0.6
  max_score: 1.0
  required: false
  weight: 0.3

5. Regex

Match the response against a regular expression:

- id: order-number-format
  type: regex
  field: response
  pattern: "#[0-9]{6}"
  required: true
  weight: 0.3
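
The pattern above is unanchored, while the `criteria create` example on this page uses the anchored form `^#[0-9]{6}$`. The two behave differently on a full-sentence response. The Python sketch below shows the difference using `re.search`. Whether fluxloop applies search or full-match semantics is an assumption to verify against your own runs.

```python
import re

UNANCHORED = re.compile(r"#[0-9]{6}")   # pattern from the YAML definition
ANCHORED = re.compile(r"^#[0-9]{6}$")   # pattern from the CLI example

response = "Your order #123456 has shipped."

# Unanchored: finds the order number anywhere in the sentence.
print(bool(UNANCHORED.search(response)))      # True
# Anchored: requires the whole response to be just the order number.
print(bool(ANCHORED.search(response)))        # False
print(bool(ANCHORED.search("#123456")))       # True
# Unanchored patterns also accept longer digit runs (first six digits match).
print(bool(UNANCHORED.search("#1234567")))    # True
```

If the agent replies in full sentences, the unanchored form is usually the one you want. Add a word boundary such as `\b` after the digits if longer numbers must be rejected.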

6. JSON Schema

Validate JSON response structure:

- id: api-response-schema
  type: json_schema
  schema:
    type: object
    required: ["status", "data"]
    properties:
      status:
        type: string
        enum: ["success", "error"]
      data:
        type: object
  required: true
  weight: 0.4
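
To show what this schema accepts and rejects, here is a hand-rolled Python check for exactly these constraints, using only the standard library. It is a sketch for this one schema, not a general JSON Schema validator, and the function name `check_api_response` is hypothetical. For real validation a dedicated JSON Schema library is the usual choice.

```python
import json
from typing import Tuple


def check_api_response(raw: str) -> Tuple[bool, str]:
    """Check `raw` against the constraints of the api-response-schema example."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    # type: object
    if not isinstance(doc, dict):
        return False, "top level is not an object"
    # required: ["status", "data"]
    for key in ("status", "data"):
        if key not in doc:
            return False, f"missing required key: {key}"
    # status: enum ["success", "error"]
    if doc["status"] not in ("success", "error"):
        return False, "status must be 'success' or 'error'"
    # data: type object
    if not isinstance(doc["data"], dict):
        return False, "data is not an object"
    return True, "ok"


print(check_api_response('{"status": "success", "data": {"id": 1}}'))
print(check_api_response('{"status": "pending", "data": {}}'))
print(check_api_response('{"status": "success"}'))
```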

7. Custom Function

Define custom evaluation logic as a Python function that returns a pass/fail flag and a reason:

- id: custom-validation
  type: custom
  function: |
    def evaluate(response, context):
        # Custom validation logic
        if "password" in response.lower():
            if "reset" in response.lower():
                return True, "Mentions password reset"
            else:
                return False, "Mentions password but not reset"
        return False, "Doesn't mention password"
  required: true
  weight: 0.3
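
Before embedding a custom function in a scenario file, it can help to run it locally against a few sample responses. The sketch below copies the `evaluate` function from the example above and calls it directly. The shape of `context` is not documented here, so passing an empty dict is an assumption.

```python
def evaluate(response, context):
    """Same logic as the custom-validation example, lowercasing the text once."""
    text = response.lower()
    if "password" in text:
        if "reset" in text:
            return True, "Mentions password reset"
        return False, "Mentions password but not reset"
    return False, "Doesn't mention password"


samples = [
    "I've sent a password reset email",
    "Your password is safe with us",
    "Check your inbox for reset instructions",
]

for sample in samples:
    # Assumption: an empty context is enough for this function, which ignores it.
    passed, reason = evaluate(sample, context={})
    print(f"{'PASS' if passed else 'FAIL'}: {reason} <- {sample!r}")
```

Note that the third sample fails even though it mentions a reset, because the function checks for "password" first. Running samples like these locally surfaces that kind of ordering surprise early.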

Criteria File Format

Criteria are typically embedded in scenario files:

# scenarios/password-reset.yaml
name: password-reset
description: Test password reset flow

personas:
  - frustrated-user
  - tech-savvy-user

inputs:
  - "I can't login"
  - "Forgot my password"

criteria:
  - id: contains-reset-link
    type: contains
    field: response
    value: "reset link"
    required: true
    weight: 0.3
    description: "Agent mentions sending a reset link"

  - id: response-time
    type: response_time
    threshold_ms: 3000
    required: true
    weight: 0.2
    description: "Response within 3 seconds"

  - id: empathy-check
    type: sentiment
    min_score: 0.6
    required: false
    weight: 0.3
    description: "Empathetic and helpful tone"

  - id: follow-up-offered
    type: contains
    field: response
    value: ["help", "contact", "support", "assist"]
    required: false
    weight: 0.2
    description: "Offers additional help"
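
This page does not spell out how `required` and `weight` combine into a test result. As a purely hypothetical model for reasoning about the file above, the Python sketch below assumes that the score is the weight of passed criteria divided by the total weight, and that any failed required criterion fails the scenario. Check fluxloop's actual scoring before relying on these numbers.

```python
from typing import Dict, List, Tuple

# Criteria ids, flags, and weights copied from scenarios/password-reset.yaml above.
CRITERIA: List[dict] = [
    {"id": "contains-reset-link", "required": True, "weight": 0.3},
    {"id": "response-time", "required": True, "weight": 0.2},
    {"id": "empathy-check", "required": False, "weight": 0.3},
    {"id": "follow-up-offered", "required": False, "weight": 0.2},
]


def score(criteria: List[dict], results: Dict[str, bool]) -> Tuple[bool, float]:
    """Hypothetical scoring: (all required passed, weighted fraction passed)."""
    total = sum(c["weight"] for c in criteria)
    earned = sum(c["weight"] for c in criteria if results.get(c["id"], False))
    required_ok = all(results.get(c["id"], False)
                      for c in criteria if c["required"])
    return required_ok, (earned / total if total else 0.0)


# Only the optional empathy check fails: required criteria still pass.
ok, value = score(CRITERIA, {
    "contains-reset-link": True,
    "response-time": True,
    "empathy-check": False,
    "follow-up-offered": True,
})
print(ok, round(value, 2))  # True 0.7
```

Under this model, weights that sum to 1.0 (as they do in the file above) make the score read directly as a fraction of the maximum.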

Best Practices

1. Balance Required and Optional

criteria:
  # Required criteria (must pass)
  - id: core-functionality
    required: true
    weight: 0.5

  # Optional criteria (nice to have)
  - id: extra-feature
    required: false
    weight: 0.2

2. Use Appropriate Weights

criteria:
  # Critical (high weight)
  - id: security-check
    weight: 0.4

  # Important (medium weight)
  - id: functionality-check
    weight: 0.3

  # Nice-to-have (low weight)
  - id: tone-check
    weight: 0.1

3. Provide Clear Descriptions

- id: mentions-timeline
  type: contains
  value: ["minutes", "hours", "shortly", "soon"]
  description: |
    Agent provides a timeline for resolution.
    Helps set user expectations.
    Pass examples: "within 10 minutes", "shortly"
    Fail examples: no timeline mentioned

4. Test Criteria Independently

# Test single criterion
fluxloop test \
  --scenario password-reset \
  --criterion contains-reset-link

5. Monitor Criterion Performance

# View criterion statistics
fluxloop criteria show contains-reset-link \
  --scenario password-reset

Troubleshooting

Criterion Always Fails

⚠️  Warning: Criterion 'contains-reset-link' has 0% pass rate

Scenario: password-reset
Tests run: 10
Passed: 0 (0%)

Recent responses:
  • "I've sent a password reset email"
  • "Check your inbox for reset instructions"
  • "You'll receive a reset message shortly"

Suggestion:
  • Value "reset link" is too specific
  • Consider accepting variations: ["reset", "password reset"]
  • Or use regex: "reset.*(link|email|message)"
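
Before applying either suggestion, you can try it against the recent responses. The Python sketch below runs the suggested regex and the suggested value list over the three responses shown above. Case-insensitive matching and "any item matches" semantics for the list are assumptions.

```python
import re

RESPONSES = [
    "I've sent a password reset email",
    "Check your inbox for reset instructions",
    "You'll receive a reset message shortly",
]

# Suggested regex from the warning output; IGNORECASE is an assumption.
PATTERN = re.compile(r"reset.*(link|email|message)", re.IGNORECASE)
# Suggested value list from the warning output.
VARIATIONS = ["reset", "password reset"]

for response in RESPONSES:
    regex_ok = bool(PATTERN.search(response))
    list_ok = any(v in response.lower() for v in VARIATIONS)
    print(f"regex={regex_ok!s:5} list={list_ok!s:5} {response}")
```

The regex still rejects "Check your inbox for reset instructions", because "instructions" is not among its alternatives, while the value list accepts all three responses. Which behavior is right depends on how strict the criterion should be.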

Criterion Too Lenient

⚠️  Warning: Criterion 'mentions-help' has 100% pass rate

This criterion might be too lenient and not providing value.

Suggestion:
  • Review criterion requirements
  • Make it more specific
  • Or remove if not needed

Related Commands

fluxloop test: Run tests with criteria
fluxloop scenarios: Manage scenarios
fluxloop sync: Sync criteria with cloud
