# Ticket: Test fresh-context routing behavior

## Metadata
- Type: Ticket
- Status: Planned
- Project: Nimrod / Agent Operations
- Created: 2026-06-06
- Updated: 2026-06-07
- Priority: High

## Goal

Test whether fresh-context agents use the revised `AGENTS.md` as intended: lightweight startup, progressive disclosure, ticket/spec workflow, QRSPI, DevOps controls, locks, registry checks, and context hygiene.

## Why

The rewritten `AGENTS.md` should be validated against realistic tasks before we rely on it. We need to know whether fresh agents route requests correctly without overloading context or skipping safety and process requirements.

## Scope

Included:
- Define test prompts across software, DevOps, hybrid, personal-assistant, and recovery tasks
- Capture initial agent responses only, without allowing uncontrolled implementation
- Evaluate responses against expected behavior
- Feed failures back into `AGENTS.md` or linked docs

Not included:
- Letting test agents perform live infrastructure mutations
- Testing every possible Nimrod role in the first pass

## Acceptance Criteria

This ticket is done when:
- [x] A fresh-context test matrix exists — `tickets/artifacts/2026-06-06-test-fresh-context-routing/test-matrix.md` with 6 prompts covering software, DevOps, hybrid, personal assistant, recovery, and triage scenarios
- [ ] At least 5 test prompts are run — needs fresh Pi session per test
- [ ] Initial responses are reviewed against expectations
- [ ] Failures are classified
- [ ] Required improvements are made to `AGENTS.md` or linked docs

## Progress

2026-06-07:
- Created test matrix at `tickets/artifacts/2026-06-06-test-fresh-context-routing/test-matrix.md` with 6 prompts and scoring template.
- Earlier test results from 2026-06-06 already exist in `tickets/artifacts/2026-06-06-test-fresh-context-routing/`.
- Many gaps identified in those earlier tests were addressed by subsequent `AGENTS.md` revisions and newly created policy docs.

## Next Actions

1. Run test prompts 1–6 in fresh Pi sessions, each with a dry-run instruction.
2. Score and classify failures.
3. Patch docs/processes based on findings.
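
The scoring and classification step could be recorded in a small structure like the sketch below. This is a sketch under assumptions, not the scoring template in `test-matrix.md`: the behavior names come from the Goal section, while the three failure classes (`context`, `process`, `safety`), the `PromptResult` type, and the `summarize` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical grouping of the behaviors listed in the Goal section.
# The class names are assumptions, not part of the real test matrix.
FAILURE_CLASSES = {
    "context": {"lightweight startup", "progressive disclosure", "context hygiene"},
    "process": {"ticket/spec workflow", "QRSPI"},
    "safety": {"DevOps controls", "locks", "registry checks"},
}


@dataclass
class PromptResult:
    prompt_id: int
    scenario: str
    # Behavior name -> True if the agent's initial response showed it.
    checks: dict[str, bool] = field(default_factory=dict)

    def failed_behaviors(self) -> list[str]:
        """Return the behaviors the response missed, sorted by name."""
        return sorted(name for name, ok in self.checks.items() if not ok)

    def failure_classes(self) -> list[str]:
        """Map missed behaviors onto the hypothetical failure classes."""
        failed = set(self.failed_behaviors())
        return sorted(cls for cls, names in FAILURE_CLASSES.items() if names & failed)


def summarize(results: list[PromptResult]) -> dict[str, int]:
    """Count how many prompts hit each failure class at least once."""
    counts = {cls: 0 for cls in FAILURE_CLASSES}
    for result in results:
        for cls in result.failure_classes():
            counts[cls] += 1
    return counts


if __name__ == "__main__":
    # Placeholder values only; these are not real test results.
    example = PromptResult(
        prompt_id=2,
        scenario="DevOps",
        checks={"locks": False, "QRSPI": True, "context hygiene": False},
    )
    print(example.failed_behaviors())
    print(summarize([example]))
```

Counting per class rather than per behavior keeps the summary short enough to decide which doc to patch first. The grouping itself should be adjusted to whatever classification the review actually uses.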
