Evaluation & Harness Engineer
US · Remote · Senior (4+ years)
Evaluation infrastructure is how we know whether research decisions improve the product for real customers. It is not a QA afterthought.
What you will work on
Retrieval evaluation
Precision, recall, ranking quality, latency, and confidence calibration across search paths.
Entity Coherence evaluation
Identity resolution accuracy and coherence drift detection.
Agent reasoning evaluation
Multi-hop failure modes: premature termination, over-retrieval, confident wrong answers.
Regression detection and CI/CD
Continuous evaluation on every deployment.
What we are looking for
Required
- 4+ years building evaluation infrastructure for AI systems
- Experience applying information retrieval metrics in production
- Experience building ground truth datasets for tasks where the correct answer is ambiguous
- Python and ML experimentation infrastructure
Strongly preferred
- RAG or agent reasoning evaluation in production
- CI/CD for ML evaluation
Not a fit if
- Your experience is limited to unit/integration testing, with no AI evaluation methodology