VideoRAG

Evaluation

Retrieval ablation and answer-quality metrics over the golden set. The browser renders the committed artifact generated by eval/run_eval.py.

Real evaluation over 13 indexed videos and 145 hand-labeled queries. Metrics are regression signals, not universal benchmark claims.

Golden set: eval/golden/expanded.jsonl
Generated: Jun 3, 2026, 6:30 PM ET
Judge: haiku
Primary config: production

Retrieval ablation · 145 golden queries · top-10

Config                MRR   Timestamp@5s   No-answer F1
Dense only            88%   82%            40%
Dense + loose gate    88%   82%            12%
Dense + strict gate   82%   77%            63%
Hybrid BM25           86%   78%            22%
Hybrid + rerank       89%   81%            22%
Hybrid + rewrite      86%   78%            22%
Production            86%   79%            84%
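
For readers unfamiliar with the retrieval columns, here is a minimal sketch of how MRR and a timestamp-tolerance hit rate could be computed over top-10 results. It is not taken from eval/run_eval.py. The function names, the (video_id, seconds) result shape, and the reading of Timestamp@5s as "the top result lands in the labeled video within 5 seconds of the labeled time" are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

Hit = Tuple[str, float]  # (video_id, start time in seconds); hypothetical shape


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str, k: int = 10) -> float:
    """Return 1/rank of the first relevant item within the top k, else 0."""
    for rank, item_id in enumerate(ranked_ids[:k], start=1):
        if item_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], str]], k: int = 10) -> float:
    """Average the reciprocal rank over (ranked_ids, relevant_id) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ids, rel, k) for ids, rel in runs) / len(runs)


def timestamp_hit(predicted: Hit, gold: Hit, tolerance_s: float = 5.0) -> bool:
    """Assumed definition: same video and start time within the tolerance."""
    same_video = predicted[0] == gold[0]
    return same_video and abs(predicted[1] - gold[1]) <= tolerance_s


# Toy data, not drawn from the golden set.
runs = [
    (["a", "b", "c"], "a"),  # relevant at rank 1 -> 1.0
    (["x", "y", "z"], "y"),  # relevant at rank 2 -> 0.5
    (["p", "q"], "r"),       # relevant not retrieved -> 0.0
]
mrr = mean_reciprocal_rank(runs)
near = timestamp_hit(("v1", 62.0), ("v1", 60.0))
far = timestamp_hit(("v1", 66.5), ("v1", 60.0))
```

On the toy data the three reciprocal ranks are 1.0, 0.5, and 0.0, so the mean is 0.5. The first timestamp check passes (2 seconds off) and the second fails (6.5 seconds off).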

Answer quality — Production

Quality: 85%
Grounded: 96%
Correct: 80%
Useful: 94%
123 answerable queries judged by haiku

No-answer gate

True refuse: 16
Missed refuse: 0
Over-refuse: 6
True answer: 123

Refusal precision 73% · recall 100%
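
The refusal figures follow directly from the four gate counts. The sketch below reproduces the arithmetic, treating "refuse" as the positive class. The class and function names are illustrative and are not taken from eval/run_eval.py. Reading No-answer F1 as the harmonic mean of refusal precision and recall is an assumption, though it reproduces the 84% reported for Production in the ablation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GateCounts:
    true_refuse: int    # unanswerable query, system refused
    missed_refuse: int  # unanswerable query, system answered anyway
    over_refuse: int    # answerable query, system refused
    true_answer: int    # answerable query, system answered


def refusal_metrics(c: GateCounts) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with 'refuse' as the positive class."""
    predicted_refusals = c.true_refuse + c.over_refuse
    actual_unanswerable = c.true_refuse + c.missed_refuse
    precision = c.true_refuse / predicted_refusals if predicted_refusals else 0.0
    recall = c.true_refuse / actual_unanswerable if actual_unanswerable else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts copied from the no-answer gate panel above.
counts = GateCounts(true_refuse=16, missed_refuse=0, over_refuse=6, true_answer=123)
total = counts.true_refuse + counts.missed_refuse + counts.over_refuse + counts.true_answer
precision, recall, f1 = refusal_metrics(counts)
print(f"queries={total} precision={precision:.0%} recall={recall:.0%} f1={f1:.0%}")
```

Precision is 16 / (16 + 6) ≈ 73%, recall is 16 / (16 + 0) = 100%, and the four counts sum to the 145 golden queries.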