Evaluation & Harness Engineer

US · Remote · Senior (4+ years)

Evaluation infrastructure is how we know whether research decisions improve the product for real customers. It is not a QA afterthought.

What you will work on

Retrieval evaluation

Precision, recall, ranking quality, latency, and confidence calibration across search paths.

Entity Coherence evaluation

Identity resolution accuracy and coherence drift detection.

Agent reasoning evaluation

Multi-hop failure modes: premature termination, over-retrieval, confident wrong answers.

Regression detection and CI/CD

Continuous evaluation on every deployment.

What we are looking for

Required

  • 4+ years building evaluation infrastructure for AI systems
  • Information retrieval metrics in production
  • Ground truth datasets for questions where the correct answer is ambiguous
  • Python and ML experimentation infrastructure

Strongly preferred

  • RAG or agent reasoning evaluation in production
  • CI/CD for ML evaluation

Not a fit if

  • Your experience is limited to unit/integration testing, without AI evaluation methodology