Evaluation
Retrieval ablation and answer-quality metrics over the golden set. The browser renders the committed artifact generated by eval/run_eval.py.
Real evaluation over 13 indexed videos and 145 hand-labeled queries. Metrics are regression signals, not universal benchmark claims.
- Golden set: eval/golden/expanded.jsonl
- Generated: Jun 3, 2026, 6:30 PM ET
- Judge: haiku
- Primary config: production
Retrieval ablation · 145 golden queries · top-10
| Config | MRR | Timestamp@5s | No-answer F1 |
|---|---|---|---|
| Dense only | 88% | 82% | 40% |
| Dense + loose gate | 88% | 82% | 12% |
| Dense + strict gate | 82% | 77% | 63% |
| Hybrid BM25 | 86% | 78% | 22% |
| Hybrid + rerank | 89% | 81% | 22% |
| Hybrid + rewrite | 86% | 78% | 22% |
| Production | 86% | 79% | 84% |
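For readers unfamiliar with the retrieval columns, here is a minimal sketch of how such metrics are commonly computed. It assumes the conventional definition of MRR (mean of 1/rank of the first relevant result within the top-k cutoff) and assumes that Timestamp@5s counts a query as a hit when the predicted timestamp falls within 5 seconds of the labeled one. Both are assumptions: eval/run_eval.py may define them differently, and the function names and toy data below are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str, k: int = 10) -> float:
    """Return 1/rank of the relevant item within the top k, or 0.0 if it is absent."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], str]], k: int = 10) -> float:
    """Average the reciprocal rank over (ranked_ids, relevant_id) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, gold, k) for ranked, gold in runs) / len(runs)


def timestamp_hit(predicted_s: Optional[float], gold_s: float, tolerance_s: float = 5.0) -> bool:
    """Assumed Timestamp@5s rule: hit if the prediction is within tolerance_s of the label."""
    return predicted_s is not None and abs(predicted_s - gold_s) <= tolerance_s


# Toy data (hypothetical, not taken from the golden set).
toy_runs = [
    (["a", "b", "c"], "a"),  # relevant at rank 1 -> 1.0
    (["x", "y", "z"], "y"),  # relevant at rank 2 -> 0.5
    (["p", "q"], "r"),       # relevant missing   -> 0.0
]
mrr = mean_reciprocal_rank(toy_runs, k=10)
near = timestamp_hit(12.0, 15.5)  # 3.5 s off the label
far = timestamp_hit(30.0, 15.5)   # 14.5 s off the label
```

On the toy data the three reciprocal ranks are 1.0, 0.5, and 0.0, so `mrr` is 0.5; `near` is a hit and `far` is a miss under the assumed 5-second tolerance.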
Answer quality — Production
- Quality: 85%
- Grounded: 96%
- Correct: 80%
- Useful: 94%
123 answerable queries judged by haiku
No-answer gate
| Outcome | Count |
|---|---|
| True refuse | 16 |
| Missed refuse | 0 |
| Over-refuse | 6 |
| True answer | 123 |
Refusal precision 73% · recall 100%
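The refusal figures follow from the gate counts above by the standard precision, recall, and F1 formulas, treating "refuse" as the positive class. That choice of positive class is an assumption, though it reproduces the reported numbers. A minimal sketch (the function name is hypothetical, not taken from eval/run_eval.py):

```python
from typing import Tuple


def refusal_metrics(true_refuse: int, missed_refuse: int, over_refuse: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with "refuse" as the positive class."""
    predicted_refusals = true_refuse + over_refuse  # every query the gate refused
    actual_refusals = true_refuse + missed_refuse   # every query that should be refused
    precision = true_refuse / predicted_refusals if predicted_refusals else 0.0
    recall = true_refuse / actual_refusals if actual_refusals else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts from the no-answer gate: 16 true refuse, 0 missed refuse, 6 over-refuse.
precision, recall, f1 = refusal_metrics(true_refuse=16, missed_refuse=0, over_refuse=6)
summary = f"precision {precision:.0%} · recall {recall:.0%} · F1 {f1:.0%}"
print(summary)  # precision 73% · recall 100% · F1 84%
```

Precision is 16 / (16 + 6) ≈ 72.7% and recall is 16 / (16 + 0) = 100%, giving F1 ≈ 84.2%, which agrees with the 84% No-answer F1 in the Production row of the ablation table.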