adaption

Distributed Systems Engineer, Data & Inference Platform

Reposted Yesterday
Remote or Hybrid
Hiring Remotely in CA
Senior level
Design and operate distributed inference systems for LLMs, build large-scale data pipelines, and debug production issues while collaborating with researchers and ML engineers.
The summary above was generated by AI
The Role

You'll build and operate the systems that turn raw compute into useful intelligence — the inference services that serve LLMs at scale and the data pipelines that feed them. One week you're hunting a tail-latency regression in a production inference service handling millions of requests; the next you're redesigning a Ray Data pipeline so it stops melting down at petabyte scale. The work spans architecture, implementation, and the on-call pager that keeps you honest about both. Researchers and ML engineers will hand you workloads that barely run; you'll hand them back systems that run reliably, efficiently, and cheaply enough to matter.

Responsibilities
  • Serve Models at Scale: Design and operate distributed inference systems for LLMs, optimizing throughput, latency, and cost across heterogeneous GPU fleets. Batching, scheduling, KV cache management, autoscaling — you own the levers that make inference economical.

  • Move the Data: Build large-scale data pipelines (Ray Data, Spark, or equivalents) that ingest, transform, and curate the datasets behind training and evaluation. The bottleneck is rarely where people think it is, and you find it.

  • Debug the Undebuggable: Chase down the failure modes that only emerge under real production traffic — stragglers, head-of-line blocking, silent data corruption, GPU memory fragmentation — and write the postmortems that prevent the next ten. Define SLOs, build the observability to measure them, and own the on-call rotation that defends them.

  • Partner Across the Stack: Work directly with researchers and ML engineers to take experimental workloads from "runs on one node" to "runs in production." You're a systems partner, not a ticket queue.

Qualifications
  • 5+ years building and operating distributed systems in production.

  • Deep experience with at least one large-scale data or compute framework (Ray, Spark, Flink, Beam, Dask).

  • Strong fluency in Python and at least one systems language (Go, Rust, C++).

  • Working knowledge of the GPU/accelerator stack: CUDA fundamentals, NCCL, mixed precision, memory layout. You don't need to write kernels, but you should know why a workload is bound by what it's bound by.

  • Experience operating Kubernetes-based infrastructure, including custom operators or schedulers.

  • A track record of owning hard production incidents end-to-end — diagnosis, mitigation, and the durable fix.

  • Bonus: hands-on experience with LLM inference engines (vLLM, SGLang, TensorRT-LLM, TGI), modern lakehouse formats (Iceberg, Delta, Hudi), or open-source contributions to relevant projects.

Above all, we're looking for great teammates who make work feel lighter and aren't afraid to go out on a limb with bold ideas. You don't need to be perfect, but you do need to be adaptable. We encourage you to apply, even if you don't check every box.

About Us

Most AI is frozen in place; it doesn't adapt to the world. We think that's backwards. Our mandate is to build efficient intelligence that evolves in real time. Our vision is AI systems that are flexible, personalized, and accessible to everyone. We believe efficiency is what makes this possible: it's how we expand access and ensure innovation benefits the many, not the few. We believe in talent density: bringing together the best and most driven individuals to push the boundaries of continual adaptation. We're looking for builders and creative thinkers ready to shape the next era of intelligence.

 
Benefits
  • Flexible work: In-person collaboration in the Bay Area, a distributed global-first team, and team offsites.

  • Adaption Passport: Annual travel stipend to explore a country you've never visited. We're building intelligence that evolves alongside you, so we encourage you to keep expanding your horizons.

  • Lunch Stipend: Weekly meal allowance for take-out or grocery delivery.

  • Well-Being: Comprehensive medical benefits and generous paid time off.


