Evaluation & Harness Engineer

US · Remote · Senior (4+ years)

Evaluation infrastructure is how we know whether research decisions improve the product for real customers. It is not a QA afterthought.

What you will work on

Retrieval evaluation

Precision, recall, ranking quality, latency, and confidence calibration across search paths.

Entity Coherence evaluation

Identity resolution accuracy and coherence drift detection.

Agent reasoning evaluation

Multi-hop failure modes: premature termination, over-retrieval, confident wrong answers.

Regression detection and CI/CD

Continuous evaluation on every deployment.

What we are looking for

Required

4+ years building evaluation infrastructure for AI systems
Information retrieval metrics in production
Ground truth datasets for questions whose correct answer is ambiguous
Python and ML experimentation infrastructure

Strongly preferred

RAG or agent reasoning evaluation in production
CI/CD for ML evaluation

Not a fit if

Your experience covers only unit/integration testing, with no AI evaluation methodology