OpsMill

Product Reliability Engineer | US

Posted 6 Hours Ago
Remote
2 Locations
Mid level
Owner of on-prem reliability and escalations: reproduce and resolve L2/L3 issues across heterogeneous Kubernetes environments, build diagnostics and automation, improve CI and e2e test stability, establish performance baselines, harden install/upgrade flows, and write tooling in Python/Go/Rust to reduce repeat incidents.
The summary above was generated by AI

Shipping infrastructure software is only half the job. The other half is making it work in environments you don’t control—across messy reality, strict security constraints, and endless platform variations. The difference between a good product and a trusted one is how quickly you can diagnose issues and how effectively you prevent them from happening again.

At OpsMill, we're building Infrahub, a schema-driven infrastructure source of truth that helps teams unify data and scale automation reliably. Our customers deploy Infrahub on-prem, which means reliability is a product feature, not just an operational concern. When something breaks in the field, it's not just a support ticket—it's a signal about what we need to fix, test, or instrument better.

Why This Role Exists

We need someone who can operate in both worlds: diving deep on gnarly customer escalations while systematically eliminating entire classes of problems. You'll be the crucial bridge between "customer is blocked right now" and "this type of issue can't happen again." You'll build the diagnostics, tests, and automation that turn on-prem deployment chaos into predictable, debuggable, fixable reliability.

What You'll Be Doing
  • Partner directly with customers and with our Solution Architecture/Customer Success teams on L2/L3 escalations: communicating findings, driving root-cause analysis, and resolving complex packaging, deployment, upgrade, and runtime issues across heterogeneous Kubernetes environments

  • Drive issues to resolution by reproducing problems locally, isolating root causes, and coordinating fixes with engineering—then documenting learnings in crisp RCAs that become actionable improvements

  • Build and maintain diagnostics tooling including support bundles, health checks, environment validators, and "what changed?" helpers that make future troubleshooting 10x faster

  • Own the test automation infrastructure roadmap, improving CI stability, reducing flaky tests, and creating reproducible integration/e2e environments that catch issues before customers do

  • Establish and maintain performance baselines and regression tests that serve as actionable gates, helping teams catch scale and latency issues early

  • Improve installation and upgrade robustness by identifying recurring failure modes and eliminating them through product changes, automation, and guardrails

  • Write production-quality code in Python, Go, or Rust for internal tooling and product improvements that directly enhance reliability

  • Close the reliability feedback loop by systematically turning field issues into better tests, observability, documentation, and product defaults—measuring success through reduced time-to-resolution and fewer repeat incidents

What You Bring
  • 4-7 years of experience in production engineering, SRE, platform engineering, or similar roles where you've owned reliability and customer escalations

  • Strong software engineering fundamentals including design, debugging, testing, code review, and a focus on maintainable, production-quality code

  • Practical Kubernetes expertise sufficient to debug real deployments: troubleshooting resources, networking, storage, RBAC, and platform-specific quirks across different distributions

  • Deep troubleshooting instincts and observability experience using logs, metrics, and traces to diagnose issues quickly in complex, distributed systems

  • Experience with at least one of Python, Go, or Rust for building tooling and contributing to product code (you don't need to be an expert in all three)

  • Excellent problem decomposition and communication skills—you can break down messy, ambiguous issues and clearly explain your findings and recommendations

  • Self-directed remote work capability with strong async communication skills and the ability to operate independently in a fast-moving environment where priorities shift based on customer needs

  • Collaborative mindset with experience partnering across product, engineering, and customer-facing teams to drive systematic improvements

Nice-to-Haves
  • Experience with packaging and distribution systems (containers, Helm charts, installers) and managing upgrade/migration flows

  • Background running CI/CD at scale including test parallelization, hermetic environments, and artifact management

  • Familiarity with performance tooling such as profiling, load generation, and benchmark harnesses

  • Previous experience in customer-facing technical roles like escalation engineering, support engineering, or solutions engineering

  • Contributions to open source projects, especially in infrastructure, observability, or reliability tooling

Why OpsMill?
  • The people: Work alongside world-class engineers who've built and scaled automation platforms in production. Daily technical challenges with smart colleagues who push you to grow.

  • The product: Shape Infrahub based on real customer needs. Your input directly influences features, integrations, and roadmap priorities.

  • The mission: We're making enterprise-grade infrastructure automation accessible to any organization. Open-source at the core, production-ready out of the box. This is a multi-year journey, not a quarterly sprint.

  • The impact: You'll work with teams managing some of the world's most complex infrastructure deployments, solving problems that ripple across entire organizations.

Our Commitment to Diversity and Inclusion

OpsMill is committed to building a diverse and inclusive team. We believe different perspectives make us stronger and more innovative. We encourage applications from candidates of all backgrounds and experiences, and we're committed to providing an inclusive environment where everyone can do their best work.
