Protege

Research Scientist, Benchmarks & Evaluations

Posted 25 Days Ago
Remote
Hiring Remotely in USA
Entry level
The Research Scientist will design and evaluate benchmarks for AI models, validating their effectiveness and translating findings into products for use across various domains. Responsibilities include rigorous evaluation, publishing research, and collaborating on evaluation data with annotators.
The summary above was generated by AI

Company Overview:

We are building Protege to solve the biggest unmet need in AI: getting access to the right training data. The process today is time-intensive, incredibly expensive, and often ends in failure. The Protege platform facilitates the secure, efficient, and privacy-centric exchange of AI training data.

Solving AI’s data problem is a generational opportunity. We’re backed by world-class investors and already powering partnerships with some of the most ambitious teams in AI. The company that succeeds will be one of the largest in AI — and in tech.

We’re a lean, fast-moving, high-trust team of builders who are obsessed with velocity and impact. Our culture is built for people who thrive on ambiguity, own outcomes, and want to shape the future of data and AI.

DataLab is Protege’s research arm — a team of research scientists committed to tackling the fundamental challenges and open questions regarding data for AI. We bridge the gap between research theory and data deployment to push the frontier forward, publishing on the questions that matter: what agentic AI should actually be trained to do, how to quality-control large-scale corpora, and how to build evaluation datasets that reflect the real world rather than the leaderboard.

We’re a lean, fast-moving, high-trust team of builders who deeply care about scientific rigor and impact. Our culture is built for people who thrive on ambiguity, own outcomes, and want to shape the future of data and AI.

The Role

Benchmarks decide what AI gets built. Today, most evals don’t measure what we actually care about: they’re contaminated, gameable, or synthetic, or they measure capabilities that don’t transfer to the real tasks frontier models are deployed against. We’re hiring a Research Scientist to lead the design of benchmarks and evaluations that frontier labs, enterprises, and policymakers can actually trust.

You’ll own the science of evaluation across DataLab — designing tasks that meaningfully separate models, validating those tasks against human baselines, and pressure-testing them for contamination, elicitation gaps, and statistical noise. You’ll publish, and your work will directly shape the eval datasets Protege delivers to the most ambitious teams in AI.

What you’ll do

  • Design tasks and benchmarks that distinguish capability levels across frontier models — including agentic, reasoning-heavy, and domain-specific (healthcare, finance, scientific) settings.

  • Validate evaluations rigorously: run human baselines, analyze inter-rater reliability, study how elicitation and scaffolding shift results, and quantify what’s signal versus noise.

  • Develop the “science of evals” at Protege — including item response theory, contamination analysis, predictive validity studies, and statistical frameworks for comparing models with appropriate uncertainty.

  • Run evaluations on current frontier models, sometimes in collaboration with partners at AI labs, enterprises, and government.

  • Publish research that establishes Protege as the standard-setter for evaluation data, and contribute to the broader AI community’s understanding of what good evals look like.

  • Translate findings into product, working closely with the data and engineering teams to turn research into evaluation datasets customers can deploy.

  • Partner with outsourced annotation vendors. Evaluation data is only as good as the people producing it. A meaningful share of this role is owning the statistical machinery that determines which annotators we trust, on which tasks, and by how much, and translating that into trustworthiness scores Protege’s customers can rely on.

What we’re looking for

  • Advanced degree (PhD preferred, or MS/BS plus equivalent industry experience) in a quantitative field — applied econometrics with AI experience, quantitative finance, computer science, engineering, statistics/mathematics or any applied research discipline.

  • Hands-on experience evaluating LLMs, agents, or other ML systems — including prompting, scaffolding, and fluency with the tooling researchers use to run evals at scale.

  • Experience with annotator quality and inter-rater reliability — designing labeling protocols, computing agreement statistics, and reasoning about annotator bias and calibration.

  • Excellent scientific writing and communication — you can synthesize technical findings into narratives that frontier labs, enterprise customers, and policymakers can act on.

  • A bias toward velocity. You know which pipelines need to be production-grade and which can be scrappy, and you get reliable results fast.
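
As a rough illustration of the "agreement statistics" mentioned above, here is a minimal sketch of one common choice, Cohen's kappa for two annotators. The function name and the toy labels are hypothetical and chosen only for illustration; nothing here is meant to describe Protege's actual tooling or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items both annotators labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels, not real annotation data.
annotator_1 = ["pass", "pass", "fail", "fail", "pass", "fail"]
annotator_2 = ["pass", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 0 means the annotators agree no more often than chance would predict, while 1 means perfect agreement.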

Bonus

  • Experience with RL evaluation techniques — reward modeling, off-policy evaluation, evals for RLHF/RLAIF or agentic RL pipelines.

  • Ability to navigate new customer architectures, data systems, and requirements quickly.

  • Experience with latent-variable models of annotator skill (Dawid-Skene, MACE, IRT-style approaches) or with running large expert-annotator panels in regulated domains.

  • Track record of published benchmarks or evaluation papers the field has adopted.

Protege Values

Pass the Loved Ones’ Test
We act with integrity and do the right thing — especially when it’s hard and no one is watching.
Always Find a Way
We are resourceful, resilient builders who solve hard problems and push through obstacles.
Go Fast and Grow Fast
Velocity matters. We move with urgency, learn quickly, and continuously improve as individuals and as a company.
Practice Kindness and Candor
We communicate directly and respectfully, building trust through honest feedback and genuine care for one another.
Deliver Together
We win as one team. Collaboration, accountability, and shared ownership drive our success.
Own the Outcome. Hone the Craft.
We take pride in our work, sweat the details, and continuously raise the bar for excellence.
