Senior Site Reliability Engineer
Remote | US
(EST Preferred)
About Climavision
At Climavision, we’re rebuilding climate technology from the ground up and changing the way we see weather. We merge the power of a proprietary, high-resolution weather radar and satellite network with advanced weather prediction modelling and decades of industry expertise to reduce existing coverage gaps and drastically improve forecasting ability. Our revolutionary new approach to climate technology weather solutions is poised to help reduce the economic risks of climate change on companies, governments, and societies alike. We are backed by The Rise Fund, the world’s largest global impact platform committed to achieving measurable, positive social and environmental outcomes alongside competitive financial returns. Climavision is headquartered in Louisville, KY, with research and development operations in Raleigh, NC.
The Work
Are you an experienced Site Reliability Engineer who thrives at the intersection of software engineering and production operations? Do you take pride in keeping mission-critical customer systems reliable under real-world operational pressure? Are you looking for an opportunity to own production reliability for a modern hybrid infrastructure platform spanning cloud, colocation, and edge environments?
If so, we have an exceptional opportunity for you.
Climavision is seeking a Senior Site Reliability Engineer to contribute towards reliability, operational excellence, and production resilience for our customer-facing platform and weather data services. This role is focused on ensuring our systems consistently meet demanding customer SLAs, including a 99.5% availability commitment for radar-derived data services. A central focus of this role is establishing multi-replica and multi-cluster high availability across our .NET services, including hands-on refactoring of C# code to make services safe to run as multiple instances and across clusters.
This is a hands-on engineering role for someone who is equally comfortable debugging production .NET services, troubleshooting Kubernetes clusters, leading incident response, and improving operational maturity across the organization. The successful candidate will combine strong software engineering experience in C# / .NET with deep production operations expertise and a disciplined approach to reliability engineering.
Climavision operates a hybrid infrastructure footprint spanning Microsoft Azure, colocation data centers, and edge Kubernetes clusters deployed alongside weather radar systems. This role will drive production reliability across all three environments.
35% Production Reliability Engineering
30% Application Reliability & .NET Service Architecture
20% Kubernetes Platform Reliability/Operations
15% Observability, Automation, and Operational Excellence
Primary Responsibilities:
- Own production reliability for Climavision’s customer-facing platform and radar-derived weather data services across Azure, colocation, and edge Kubernetes environments.
- Contribute to the definition and improvement of SLIs, SLOs, alerting standards, and operational metrics used to measure platform reliability.
- Support and coordinate production incident response efforts, including troubleshooting, mitigation, communication, and postmortem analysis.
- Diagnose and resolve complex production issues across application services, Kubernetes infrastructure, storage, and distributed systems.
- Drive multi-replica and multi-cluster high availability across Climavision’s .NET services. This includes working directly in the C# codebase to refactor services that are not currently safe to run as multiple replicas, addressing in-process state, sticky scheduling assumptions, non-idempotent operations, race conditions, and other patterns that prevent safe horizontal scaling, so that services can be deployed with multiple replicas, across multiple clusters, for high availability.
- Contribute to the multi-cluster high-availability strategy across Climavision’s hybrid fleet, including active-active and active-passive failover behavior, traffic routing, data replication considerations, and graceful degradation when a cluster becomes unavailable.
- Operate and improve Climavision’s self-managed Kubernetes platform spanning cloud-hosted, colocation, and edge clusters, with a focus on availability, resiliency, recovery, and operational performance.
- Ensure Kubernetes platform lifecycle activities, including upgrades, patching, cluster health, node management, and production change management, are executed in a manner that preserves service availability and minimizes customer-facing risk.
- Improve reliability and operational maturity of production platform services, including observability, autoscaling, ingress, and distributed storage. Partner with the teams responsible for the underlying networking and security primitives rather than owning those areas directly.
- Design and validate Kubernetes workloads for resiliency, scalability, and operational efficiency, including autoscaling behavior, workload placement, resource management, and graceful degradation strategies.
- Read, debug, and contribute production-quality C# / .NET code focused on reliability improvements, multi-replica safety, instrumentation, operational tooling, and performance optimization.
- Partner with software engineering teams to improve production readiness, resiliency patterns, deployment safety, and operational visibility before services reach production. Champion multi-replica-safe design patterns as new services are built.
- Maintain and improve deployment pipelines, Helm charts, Kubernetes manifests, and infrastructure automation supporting safe and repeatable production releases.
- Support and evolve Climavision’s observability platform, including metrics, logging, distributed tracing, dashboarding, and alerting.
- Conduct performance engineering and capacity-planning efforts for customer-facing services during peak weather-event demand.
- Help facilitate blameless postmortem reviews and drive operational follow-up items through completion.
- Improve disaster recovery, failover, and business continuity capabilities across cloud, colocation, and edge environments.
- Drive operational excellence initiatives, including automation, reduction of operational toil, game days, production readiness reviews, and reliability best practices.
- Contribute as a senior technical resource and mentor on reliability engineering and production operations practices.
On-Call Expectation:
Climavision operates customer-facing production systems under contractual SLAs that do not pause outside business hours. The Senior Site Reliability Engineer will participate in a primary on-call rotation, taking one full week of primary on-call duty at a time. During the on-call week, the engineer is expected to be reachable and able to actively respond to production incidents and pages 24 hours a day, 7 days a week, including nights, weekends, and holidays. This includes:
- Acknowledging pages and incidents posted to the DevOps Support channel within the established response-time SLO, regardless of the hour the page is received.
- Driving the incident to mitigation or resolution before stepping away, including engaging additional engineers when appropriate.
- Maintaining reliable connectivity (laptop, network, paging device) and personal availability for the full duration of the rotation week.
- Planning personal time, travel, and other commitments around the published on-call rotation, and arranging documented coverage swaps in advance when conflicts are unavoidable.
- Owning written incident handoffs at the end of the rotation and authoring postmortems for incidents that occurred during the week.
Candidates who are not able or willing to meet this on-call standard should not apply.
Qualifications
- A bachelor’s degree in computer science, software engineering, or a related field; equivalent professional experience considered.
- Minimum of 7 years of experience in Site Reliability Engineering, DevOps, Production Engineering, Platform Engineering, or a related infrastructure-focused role, with at least 4 years in a role formally titled Site Reliability Engineer or carrying explicit SLO / error-budget accountability.
- Strong, hands-on software engineering experience with a minimum of 3 years of experience supporting and modifying C# / .NET applications in production environments. Candidates without production C# / .NET development experience will not be considered for this role; this is a non-negotiable requirement driven by the technology stack of Climavision’s products.
- Demonstrated experience refactoring production application code (preferably C# / .NET) to make services horizontally scalable across multiple replicas: removing in-process state, ensuring idempotency, handling concurrent execution safely, and enabling safe deployment of multiple instances of a service.
- Experience designing or operating multi-cluster high-availability architectures, including failover behavior, traffic routing, and cross-cluster service deployment.
- Experience supporting customer-facing production systems with uptime, reliability, and incident-response responsibilities.
- Strong hands-on experience operating production workloads in self-managed or highly customized Kubernetes environments.
- Experience diagnosing and resolving production incidents across application, platform and Kubernetes infrastructure layers, including workload scheduling, storage, ingress, and cluster-level failures.
- Experience operating Kubernetes outside of strictly managed cloud environments, including bare-metal, colocation, edge, or hybrid infrastructure.
- Experience with Kubernetes operational tooling and ecosystem technologies such as Rancher, Helm, autoscaling frameworks, observability stacks, or distributed storage systems.
- Strong understanding of infrastructure automation and Infrastructure as Code concepts using tools such as Terraform and Ansible.
- Experience supporting CI/CD and production deployment pipelines. Experience with Octopus Deploy is strongly preferred.
- Experience with monitoring, logging, and observability platforms such as DataDog, Prometheus, Grafana, Loki, OpenTelemetry, or comparable technologies.
- Experience operating distributed systems and microservice-based architectures in production environments.
- Working knowledge of Microsoft Azure infrastructure.
- Strong troubleshooting skills across infrastructure, application, and platform layers.
- Demonstrated experience participating in a structured production on-call rotation supporting business-critical systems.
- Strong written and verbal communication skills, including incident documentation and postmortem authoring.
- Experience working in start-up, scale-up, or other fast-moving engineering environments.
Preferred experience:
- Experience operating Kubernetes platforms using RKE2 and Rancher.
- Experience supporting hybrid cloud and colocation infrastructure environments.
- Experience with service mesh technologies such as Istio.
- Experience with Kubernetes-native storage platforms such as Longhorn.
- Experience operating PostgreSQL or PostGIS in Kubernetes environments.
- Experience with distributed messaging systems such as RabbitMQ or NATS.
- Experience supporting GPU-enabled workloads in Kubernetes.
- Familiarity with reliability engineering practices, including SLIs, SLOs, error budgets, and operational maturity metrics.
Physical Demands & Work Environment:
- This is a full-time, exempt position.
- Fully remote within the United States; candidates in the Eastern time zone are preferred.
- This job requires frequent use of a computer to complete tasks, attend meetings, and communicate via Microsoft Teams.
Once you land this position, you’ll get to enjoy:
- Benefits of a dynamic and growing organization
- A challenging, hands-on role that will have real impact on the business
- Competitive compensation
- Comprehensive benefits package
- 401(k) Savings Plan
- Medical/Dental/Vision Benefits
- Health Savings Account (HSA) and Flexible Spending Account (FSA)
- Unlimited Paid Time-off
- 11 Paid Holidays
- Paid Parental Leave
- Company Paid Short-term Disability (STD)
- Company Paid Long-term Disability (LTD)
- Company Paid Life Insurance
The salary range for this position is $135-170k annually; however, Climavision considers several factors when extending an offer of employment, including but not limited to the applicant’s education, experience, training, knowledge, skills, and abilities; the responsibilities of the role; and internal equity and alignment with market data. Any offer of employment is contingent on completion of a background check to company standard. Please note this job description is not designed to cover or contain a comprehensive listing of activities, duties, or responsibilities that are required of the employee for this job. Duties, responsibilities, and activities may change at any time with or without notice.
Climavision is an equal opportunity employer. All aspects of employment including the decision to hire, promote, discipline, or discharge, will be based on merit, competence, performance, and business needs. We do not discriminate on the basis of race, color, religion, marital status, age, national origin, ancestry, physical or mental disability, medical condition, pregnancy, genetic information, gender, sexual orientation, gender identity or expression, veteran status, or any other status protected under federal, state, or local law.


