Senior Site Reliability Engineer

Luma AI

Reposted 2 Days Ago

In-Office or Remote

2 Locations

170K-290K Annually

Expert/Leader

As a Software Engineer in Reliability, you'll architect and manage multi-cloud GPU infrastructure, ensuring performance, security, and scale while debugging complex hardware/software issues.

The summary above was generated by AI

About Luma AI

Luma’s mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, responsible for the thousands of NVIDIA and AMD GPUs across multiple providers that power our work.

Where You Come In

We are looking for a hands-on, first-principles engineer who is fluent in Linux, comfortable operating close to the metal, and capable of architecting systems for the next generation of AI infrastructure.

You will build, maintain, and scale Luma’s infrastructure across on-prem and multi-vendor clouds (AWS & OCI), serving as the bridge between hardware vendors, cloud providers, and our research teams.

What You’ll Do

Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next-generation infrastructure operates.
Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast-moving AI startup environment.
Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level.
Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.

Who You Are

5+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment.
Deep Linux Mastery: You possess deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance.
Expert in Technologies: You have working experience with Terraform, Airflow, and Ray.
Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI.
Tenacious Troubleshooter: You thrive on solving complex, low-level problems where hardware and software intersect.
Startup DNA: You are energetic and thrive in a less structured, fast-paced environment.
Security-Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
Expert in High-Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs.

What Sets You Apart (Bonus Points)

Deep expertise with GPU tooling for NVIDIA and AMD GPUs, such as DCGM or ROCm.
Experience managing large-scale GPU clusters for AI/ML workloads (training or inference).
Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.
Deep expertise in data pipelines and infrastructure.

Compensation

The base pay range for this role is $170,000 – $290,000 per year.

About Luma

Luma’s mission is to build unified general intelligence that can generate, understand, and operate in the physical world.

We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable and useful systems, the next step function change will come from vision. So, we are working on training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.
