NVIDIA

Principal Software Engineer, Rack-Scale System Software — CSP Engagements

Posted 13 Hours Ago

In-Office or Remote

3 Locations

272K-431K Annually

Expert/Leader

Lead technical engagements with cloud service providers on rack-scale system software and firmware. Drive architecture alignment, integration readiness, error handling and recovery, health telemetry, firmware orchestration, and serviceability. Capture CSP feedback, influence NVIDIA system software design, improve tooling and tests, and mitigate execution risks through early collaboration.

The summary above was generated by AI

We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for rack-scale system SW/FW, working with CSP engineering teams to ensure they can deploy, monitor, and operate these systems reliably at fleet scale. In this role, you will collaborate with NVIDIA's cross-functional rack-scale system SW/FW engineering teams and provide dedicated CSP-facing technical leadership. Your focus is on the system-level software that manages, monitors, and recovers the rack as a whole: fabric management, GPU/NVSwitch error handling and recovery, health telemetry APIs, firmware update orchestration, and SW-driven serviceability. You will drive work streams with CSP engineering teams to build shared understanding of the architecture, incorporate their operational feedback, and ensure integration readiness.

What you'll be doing:

Drive rack-scale SW/FW architecture alignment across CSP engagements — including fabric management software, link health monitoring, GPU/NVSwitch error handling, SW/FW serviceability features (e.g., hot-plug support, component isolation, firmware-driven recovery), and multi-component firmware orchestration
Drive technical work streams with CSP engineering teams on rack-scale system software, ensuring they deeply understand fabric management, NVSwitch behavior, error handling and recovery policies, health telemetry APIs, and SW/FW-controlled recovery operations
Capture and synthesize CSP engineering feedback on rack-scale system software (health monitoring APIs, SW-driven serviceability workflows, firmware update orchestration, and error recovery behavior), and champion that feedback into NVIDIA's architecture decisions
Collaborate with multi-functional teams to ensure customer operational requirements are reflected in system software and firmware development
Identify cross-CSP patterns in rack-scale SW/FW issues, error handling behavior, and system configuration practices, and drive documentation, tooling, and test strategy improvements as a result
Collaborate with execution teams on left-shift strategy — ensuring customer-side SW/FW integration work is identified early and completed ahead of hardware availability
Make critical technical decisions on rack-scale system SW/FW tradeoffs and mitigate execution risks through early engagement with CSP engineering teams

What we need to see:

15+ years of experience in system software, platform firmware, or large-scale distributed systems engineering. BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience)
Deep understanding of rack-scale system software challenges: multi-component coordination, error propagation, health monitoring, and serviceability / reliability
Experience with fabric management software, cluster management, or system-level orchestration frameworks. Familiarity with firmware architectures and update lifecycle management (multi-component update sequencing, rollback, recovery)
Understanding of error handling and recovery design patterns in distributed systems — fault isolation, retry policies, graceful degradation
Experience with health monitoring and telemetry systems: health scoring, event correlation, API design for fleet-level observability
Understanding of GPU or accelerator system software (drivers, device management, power management) is a strong plus
Customer obsession — genuine passion for understanding how CSPs operate sophisticated systems at fleet scale and simplifying their experience
Proven success providing technical leadership across organizational boundaries and influencing system software design without direct authority. Strong communication — ability to translate complex system software architecture into actionable mentorship for customer engineering teams

Ways to stand out from the crowd:

Experience with NVIDIA NVSwitch, NVOS, or GPU fabric management software
Background in system software for large-scale clusters at a hyperscaler (cluster management, fleet orchestration, health platforms)
Experience crafting error handling and recovery frameworks for multi-component systems (hundreds or thousands of coordinating devices)
Familiarity with GPU or accelerator fleet operations — driver lifecycle, firmware rollout strategies, health-based scheduling
Understanding of how system software decisions impact serviceability, availability, and operational cost at fleet scale

NVIDIA’s invention of the GPU in 1999 fueled the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI (the next era of computing), with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company.” We're looking to grow our company and establish teams with the most thoughtful people in the world. Are you ready to change the next generation of computing? Join us at the forefront of technological advancement.

NVIDIA data center systems, such as DGX and HGX, have become core to NVIDIA's rapidly growing enterprise and cloud provider businesses. These platforms bring together the full power of NVIDIA GPUs, NVIDIA NVLink, NVIDIA InfiniBand networking, NVIDIA Grace CPUs, and a fully optimized NVIDIA AI and HPC software stack.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 30, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
