CyrusOne

Senior Reliability Engineer

Posted 2 Days Ago

Remote

Hiring Remotely in USA

140K-170K Annually

Expert/Leader

The Senior Reliability Engineer leads the reliability strategy for mission-critical data centers, overseeing risk management, predictive analytics, and continuous improvement initiatives.

The Senior Reliability Engineer serves as a subject-matter expert and strategic technical authority for infrastructure reliability across a portfolio of mission-critical data center sites. This role leads the design, governance, and continuous improvement of reliability strategies for power, cooling, and control systems, applying advanced engineering judgment, analytics, and risk-based decision-making.
The Senior Reliability Engineer independently evaluates complex reliability risks, prioritizes initiatives under uncertainty, and influences operational, maintenance, and capital decisions that materially impact uptime, safety, and lifecycle cost. This role operates with minimal oversight and is expected to shape standards, mentor others, and elevate reliability capability across the organization.

Responsibilities:

Enterprise Reliability Strategy & Asset Care

Architect and govern portfolio-level, risk-based asset strategies for mission-critical power and cooling infrastructure.
Apply advanced RCM principles to define maintenance and inspection strategies aligned to failure risk, system criticality, and redundancy posture.
Evaluate and balance tradeoffs between maintenance investment, operational risk, spares coverage, redundancy, and capital replacement.
Establish and maintain enterprise PM quality standards, including audits, task effectiveness reviews, and elimination of low-value maintenance.

Operational Governance & Change Risk Management

Serve as a final technical authority for high-risk SOPs, MOPs, EOPs, and operational change packages.
Perform system-level risk assessments for planned work, incidents, and abnormal operating conditions.
Guide site teams in CMMS data integrity, work management maturity, and adherence to approved operating procedures.
Lead or oversee complex reliability investigations involving multiple systems, teams, or contributing factors.

Advanced Analytics & Condition Monitoring

Design and mature predictive condition-monitoring programs across the portfolio (oil analysis, thermography, vibration, battery monitoring, controls analytics).
Develop and interpret leading reliability indicators and degradation trends to anticipate failures before impact.
Apply statistical analysis, reliability modeling, and engineering judgment to evaluate failure likelihood and consequence.
Translate analytical insights into strategic maintenance, operational mitigations, or capital recommendations.

Critical Spares & Lifecycle Strategy

Define and govern enterprise critical spares strategies, accounting for supplier risk, lead times, and system exposure.
Identify systemic spares gaps and drive remediation plans in partnership with Supply Chain and Operations.
Lead lifecycle asset assessments to guide long-range capital planning and replacement prioritization.
Provide data-driven input to business cases supporting capital investments and infrastructure upgrades.

Incident Leadership, RCA & Continuous Improvement

Lead high-impact post-incident RCAs and FMEAs, ensuring depth of analysis beyond proximate causes.
Identify and address latent design, procedural, and organizational contributors to reliability events.
Ensure lessons learned result in durable changes to standards, procedures, maintenance strategies, or training.
Champion continuous improvement initiatives that measurably reduce risk and failure recurrence across sites.

Technical Leadership & Capability Development

Act as a mentor and technical escalation point for Reliability Engineers, site engineers, and CE leaders.
Coach teams on reliability methods, risk-based decision-making, and interpretation of condition-monitoring data.
Influence and evolve enterprise reliability standards, playbooks, and operating philosophies.
Partner with leadership to strengthen operator certification, training rigor, and operational discipline.

Qualifications:

10+ years of experience in reliability engineering, maintenance engineering, or facilities engineering within mission-critical environments.
Demonstrated leadership of complex, multi-system reliability programs with measurable business impact.
Expert-level knowledge of RCM, FMEA, RCA, and maintenance optimization methodologies.
Deep technical understanding of mission-critical infrastructure, including UPS, generators, switchgear, chillers, cooling towers, CRAH/CRAC, and BMS/EPMS.
Proven experience governing SOP/MOP/EOP programs and assessing operational change risk in live environments.
Advanced ability to analyze condition-monitoring, CMMS, and operational datasets and convert insights into strategic actions.
Proficiency in data analysis and visualization tools (Excel, Power BI, or similar).
Ability to apply statistical techniques or reliability modeling to support risk-informed decision-making under uncertainty.
Strong executive-level communication skills; able to influence senior leaders and defend technical positions.

Preferred Experience:

Experience designing and scaling enterprise critical spares and lifecycle asset management programs.
Hands-on experience with predictive analytics, failure modeling, or reliability simulations.
Proficiency with Python, R, or similar tools for advanced reliability analytics.
Working knowledge of SQL or other data query languages.
Strong familiarity with NFPA, IEEE, ASHRAE, and other relevant codes and standards.
Experience presenting reliability risk, capital tradeoffs, and investment recommendations to executive audiences.

Education & Certifications:

Bachelor’s degree in Mechanical, Electrical, or Industrial Engineering (or equivalent experience).
Preferred: CMRP, CRE, or similar advanced reliability or maintenance certification.

Work Conditions:

Supports 24×7 mission-critical operations; participates in an on-call rotation and may support after-hours events.
Ability to work safely in energized environments in compliance with LOTO and NFPA 70E.
Travel to supported sites approximately 25%.

Salary range: $140,000-$170,000

CyrusOne is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, sexual orientation, gender identity, religion, national origin, disability, veteran status, or other legally protected status.

CyrusOne provides reasonable accommodation for qualified individuals with disabilities in accordance with the Americans with Disabilities Act (ADA) and any other state or local laws. We will respond to requests for reasonable accommodations to assist you in applying for positions at CyrusOne or in submitting a resume.

Top Skills: BMS, Chillers, Cooling Towers, CRAC, CRAH, EPMS, Excel, FMEA, Generators, Power BI, Predictive Analytics, Python, RCA, RCM, SQL, Switchgear, UPS
