Graphcore

Distinguished Engineer - Inference Serving Network and Storage

Posted 2 Days Ago
Hybrid
Austin, TX, USA
Expert/Leader
Lead the networking and storage architecture for inference serving, defining strategy and technical direction for large-scale AI services.
The summary above was generated by AI
About us

Graphcore is a globally recognized leader in Artificial Intelligence computing systems. The company designs advanced semiconductors and data center hardware that provide the specialized processing power needed to drive AI innovation, while delivering the efficiency required to support its broader adoption.

As part of the SoftBank Group, Graphcore is a member of an elite family of companies responsible for some of the world’s most transformative technologies.

Job Summary

We are seeking a Distinguished Engineer to lead the networking and storage architecture for a new inference serving initiative. This is a chief technologist role for the serving fabric and data path, responsible for defining and driving the end-to-end strategy for networking, storage, observability, provisioning, and automation in support of large-scale AI inference services.

You will shape core technical decisions that directly influence product capability, service differentiation, and competitive advantage. On the networking side, you will lead the design of the serving fabric, inter-partition latency path, management network, QoS and transport tuning, segmentation, observability, and automation. On the storage side, you will define the architecture for model artifact storage, checkpoint distribution, KV and session tiering and restore, telemetry and log storage, and backup and disaster recovery.

Storage is expected to be a critical component of inference serving at scale, particularly for KV cache management, state movement, and service resiliency. You will therefore set technical direction across both networking and storage domains as first-class pillars of the platform.

This is a Grade 7 role for a recognized expert and thought leader who can convert strategic thinking into tangible group-level impact, lead a small team, and influence other functions and external partners.

The Team

You will be in the System Engineering group and work across organizational boundaries with ML software, applied AI, hardware and systems, inference service teams, and other platform and infrastructure groups. You will also engage closely with external partners responsible for key elements of the inference service stack, as well as business counterparts who depend on differentiated service capabilities, reliability, and scale.

This role requires strong technical leadership without relying solely on formal authority. You will be expected to align stakeholders, make architectural trade-offs clear, and drive execution across multiple teams while raising the technical bar for the broader organization.

Responsibilities and Duties
  • Define and coordinate the networking architecture for inference serving, including serving fabric build, inter-partition latency path optimization, and management network architecture.  
  • Lead the strategy for QoS, transport tuning, traffic isolation, segmentation, and service differentiation to support multiple inference SLAs and workload classes.
  • Drive the build of monitoring, resource prioritization, and automated management frameworks for network and storage systems at production scale.  
  • Define the storage architecture for model artifact repositories, checkpoint distribution, session state, telemetry and log storage, backup, and disaster recovery.
  • Lead the design of KV cache storage, tiering, restore, and movement mechanisms as a core platform capability for large-scale inference serving.
  • Optimize network and storage subsystems for demanding AI and HPC workloads, balancing throughput, latency, resiliency, cost, and operational simplicity.
  • Work with ML software and inference service teams to develop infrastructure that supports current methods for deploying large language models, including disaggregated prefill/decode paths, continuous batching, and large-model scaling techniques.
  • Guide architecture for scaling models that use tensor, pipeline, expert, and other parallelism strategies, ensuring the serving infrastructure supports efficient execution and state movement.
  • Establish performance models, benchmarks, and tuning methodologies for end-to-end serving behavior, including tail latency, throughput stability, and recovery characteristics.
  • Lead a small multi-functional team while providing technical direction and architectural oversight across a wider matrixed organization.
  • Influence roadmap, standards, and implementation choices across internal teams and external partners.
  • Act as the senior technical authority for this domain, identifying risks early, resolving complex trade-offs, and ensuring the platform evolves in line with business and product needs.
Candidate Profile

Essentials


  • MS or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent practical experience.
  • Significant industry experience, typically 15+ years, in large-scale systems, distributed infrastructure, or platform architecture.
  • Deep expertise in networking and storage software at scale, including architecture, implementation, configuration, and performance optimization.
  • Proven experience designing and operating networking and storage systems for demanding applications in AI, HPC, or large-scale cloud environments.
  • Strong understanding of high-performance transport, congestion and flow control, QoS, segmentation, telemetry, and production observability.
  • Strong understanding of distributed storage architectures, artifact distribution, checkpointing, caching, replication, backup, disaster recovery, and operational resilience.
  • Demonstrated ability to architect low-latency, high-throughput systems where network and storage behavior materially affect application performance.
  • Experience leading highly ambiguous, cross-functional technical initiatives with impact across multiple teams or product areas.
  • Strong communication and influencing skills, with the ability to align senior technical and business stakeholders.
  • Track record as a recognized expert who drives strategy, shapes technical direction, and delivers solutions beyond existing precedents.

Desirable

  • Familiarity with innovative LLM serving techniques and infrastructure requirements.
  • Experience with prefill/decode disaggregated inference, continuous batching, and differentiated inference services with multiple SLA and QoS tiers.
  • Understanding of model scaling and serving approaches involving tensor, pipeline, expert, and related parallelism techniques.
  • Experience with KV cache management, tiering, restore, and memory/storage trade-offs in inference systems.
  • Knowledge of modern inference serving algorithms, schedulers, and system-level optimization techniques.
  • Experience working with external technology partners, suppliers, or ecosystem collaborators in the delivery of complex infrastructure platforms.
  • Background in production-grade automation and provisioning systems for large infrastructure estates.


Benefits
  
In addition to a competitive salary, Graphcore offers a competitive benefits package. We welcome people of different backgrounds and experiences; we’re committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. We can provide a flexible approach to interview and encourage you to chat to us if you require any reasonable adjustments.   

Top Skills

AI
Automation Systems
Distributed Infrastructure
HPC
LLM Serving Techniques
Monitoring Frameworks
Networking
Storage Systems
