P-1 AI

AI Evals Lead

Posted 25 Days Ago
Remote
Hiring Remotely in United States
170K-200K Annually
Mid level
The AI Evals Lead will develop and manage eval benchmarks for AI systems, ensuring effective performance measurement and quality assurance through collaboration with engineering teams and industry partners.

About P-1 AI:

We are building an engineering AGI. We founded P-1 AI with the conviction that the greatest impact of artificial intelligence will be on the built world. Our first product is Archie, an AI engineer capable of quantitative intuition over physical product domains and engineering tool use. Archie initially performs at the level of an entry-level design engineer but rapidly gets smarter and more capable. We aim to put an Archie on every engineering team at every industrial company on earth.

Our founding team includes top minds in deep learning, model-based engineering, and our customers' industries. We closed a $23 million seed round led by Radical Ventures that includes a number of other AI and industrial luminaries (from OpenAI, DeepMind, etc.).


About the Role:

In this role, you’ll be responsible for the evals that we use to ensure that Archie is learning and retaining the skills needed to successfully perform its engineering work, and to benchmark it against industry skill expectations. Working within a small, tightly knit team of high performers, you’ll be principally responsible for clearly defining, implementing, and validating these evals, incorporating input from our engineering experts and industrial partners. You’ll also be responsible for translating these evals into multiple formats for use with different types of AI and non-AI systems and agents.

This role can be either remote (based in the US or Canada, with existing work authorization) or based in our San Mateo (Bay Area) office. If you are remote, you should plan to spend one week per quarter co-working with the rest of the company in our San Mateo office, with the occasional team travel workshop in between. We will support relocation for candidates interested in moving to the Bay Area.


What you’ll do:

  • Implement and operate the system for organizing, transforming, running, grading, and reporting on eval benchmarks.

  • Design and execute the process by which we develop and QA our evals, incorporating contributions from our own engineering team, industrial partners, and subject-matter experts.

  • Ensure that evals run effectively within our CI/CD system, continuously benchmarking our evolving AI platform and the experiments we’re performing around it.

  • Create methods for detecting and testing for common quality challenges of AI, including hallucinations, undesirable stochasticity, and regressions.

  • Be a technical leader in the consistent implementation and organization of automated tests across other areas of our technology stack.

Who you are:

  • Experience in constructing comprehensive test suites for software and/or AI systems, including coordinating the contributions of others.

  • Experience designing metrics to evaluate systems and visualize their performance, including differences across successive generations.

  • Experience in developing, managing, and running evals against LLM-based systems is a strong plus.

  • Good communication skills with a variety of stakeholders (AI researchers, domain experts, application developers).

  • Proficiency in Python, including complex modules, and in modern software development tools and practices (Git, CI/CD, etc.).

  • Ability to thrive in a fast-paced, dynamic startup environment.

Our Values:

Mission obsession & urgency: We are obsessed with building engineering AGI as quickly as possible. We also recognize that as a startup, speed is our most precious competitive advantage. We are constantly asking ourselves what we can do to go faster. We make tradeoffs and sacrifices (personally and in the workplace) in exchange for speed.

Intellectual excellence & curiosity: We ask “what if?” and experiment liberally. We always look for better ways of doing something. We read voraciously. We challenge each other to be better. We surround ourselves with A players and we actively and unapologetically reject B players (and even B+ players, because they tend to surround themselves with C players).

Shipping discipline: We treat production with respect. We test and demo our product constantly. We listen attentively to our customers, users, and stakeholders, and we respect our commitments to them. We also respect our commitments to each other and will go the extra mile (or ten or one hundred) to honor them.

Ownership: We all have significant ownership stakes in the company and operate in founder mode. We believe in hierarchical requirements but not in hierarchical information flows. If we see that something is broken or can be done better, we flag it and we fix it. We encourage each other to play with and fix anything and everything... but there’s a clear owner for everything.

Interview Process:

  • Initial screening call (30 mins)

  • Biographical/behavioural interview (45 mins)

  • Technical interview (75 mins)

  • CEO interview (30 mins)

Compensation:

Salary: $170k - $200k.

This role includes a significant equity component. We are an early-stage startup, so we favor equity over cash in our current compensation philosophy. This role is best suited for candidates who value long-term ownership and impact over short-term cash optimization. Our benefits include healthcare, dental, and vision insurance, a 401(k) with employer matching, and unlimited PTO.
