i4DM

Senior Site Reliability Engineer

Posted Yesterday
Remote
Hiring Remotely in USA
Senior level
Drive SRE practices for VA enterprise healthcare platforms: automate infrastructure and CI/CD, define SLIs/SLOs, improve observability and reliability, support incident response, and ensure cloud-native, secure, compliant operations in AWS and containerized environments.
The summary above was generated by AI
Description

About Our Team 

Our employees thrive in a culture that is fast-paced, collaborative, and ego-free, where innovation and teamwork are encouraged at every level. We provide Federal agencies with immediate access to highly skilled professionals who understand complex mission challenges and deliver efficient, scalable solutions. By continuously investing in talent, technology, and specialized capabilities, we maintain expert teams prepared to support evolving Federal missions through tailored technical solutions and modern service delivery approaches. 

We value diverse perspectives and strive to attract talent from all backgrounds. We are seeking professionals who are passionate about technology, mission success, and solving complex operational challenges with creativity and purpose. If you enjoy expanding your technical expertise while supporting impactful Federal initiatives, you will thrive within our organization. Veterans and military spouses are strongly encouraged to apply and bring their valuable experience to our team. 


About the Role 

We are seeking an experienced and highly motivated Senior Site Reliability Engineer to serve as a key technical contributor supporting the Technical Director in advancing site reliability engineering, cloud operations, automation, and resilient service delivery for VA enterprise healthcare platforms and applications. 

In this role, you will partner closely with the Technical Director, Program Manager, Maintenance Technical Director, Monitoring & Incident Management teams, and VA stakeholders to improve availability, performance, scalability, and operational excellence across mission-critical, 24x7 enterprise environments. 

The Senior Site Reliability Engineer will apply software engineering principles to operations by automating infrastructure and workflows, defining and measuring reliability targets, strengthening observability, supporting incident response, and continuously improving system resiliency while aligning with Federal security and governance requirements. 


RESPONSIBILITIES 

Site Reliability Engineering & Service Ownership 

  • Partner with the Technical Director to implement and mature Site Reliability Engineering (SRE) practices across platform services and hosted applications. 
  • Improve the full service lifecycle from design and deployment through operation and continuous refinement, with a focus on availability, latency, performance, efficiency, and capacity. 
  • Define, track, and report service level indicators (SLIs), service level objectives (SLOs), and error budgets to guide engineering decisions and service improvements. 

Automation, CI/CD & Infrastructure as Code 

  • Build, enhance, and maintain CI/CD pipelines that enable secure, automated, and repeatable application and infrastructure delivery. 
  • Develop and support Infrastructure as Code (IaC) and configuration automation using tools such as Terraform and Ansible to improve consistency, speed, and auditability. 
  • Integrate automated testing, validation, and security checks into delivery workflows to improve release quality and reduce change-related risk. 

Observability, Reliability & Performance Engineering 

  • Design and improve monitoring, logging, tracing, alerting, and dashboards to strengthen observability and accelerate issue detection and response. 
  • Analyze system behavior and performance trends to improve reliability, scalability, and operational efficiency across distributed and cloud-native environments. 
  • Reduce operational toil by automating repetitive tasks, improving runbooks, and engineering sustainable solutions for recurring operational issues. 

Cloud Engineering & Modernization 

  • Support cloud infrastructure and platform services in AWS and containerized environments such as Kubernetes, ensuring systems are resilient, scalable, and secure. 
  • Contribute to platform modernization efforts by improving deployment patterns, environment consistency, and operational readiness for cloud-native services. 
  • Assist with capacity planning, reliability reviews, and architectural improvements to support growth, resilience, and mission continuity. 

Security & Compliance Integration 

  • Implement reliability engineering practices that align with Federal security requirements, including secure configuration, least privilege, vulnerability remediation, and policy-based controls. 
  • Partner with cybersecurity and engineering teams to support secure-by-design infrastructure and application delivery practices. 
  • Help ensure operational processes and automation align with compliance expectations for Federal and VA environments. 

Cross-Functional Collaboration 

  • Collaborate with development, platform, operations, monitoring, incident management, and architecture teams to improve service reliability and deployment outcomes. 
  • Work closely with the Technical Director and team leads to translate technical direction into actionable engineering improvements and operational standards. 
  • Support Agile and SAFe delivery practices by helping teams adopt reliable release processes, operational readiness checks, and continuous improvement measures. 

Incident Support & Continuous Improvement 

  • Participate in incident response, service restoration, root cause analysis, and post-incident reviews for critical systems and services. 
  • Identify recurring issues, reliability gaps, and failure patterns, and drive corrective actions through automation, architectural improvements, and process refinement. 
  • Contribute to on-call readiness, operational documentation, and blameless continuous improvement practices that improve resilience and reduce mean time to recovery. 

Requirements

 QUALIFICATIONS 

  • Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field, or equivalent practical experience. 
  • 5+ years of experience in Site Reliability Engineering, DevOps, platform engineering, cloud operations, or related roles supporting enterprise or mission-critical environments. 
  • Hands-on experience supporting cloud platforms (AWS preferred), Linux-based environments, and distributed systems at scale. 
  • Strong experience with Infrastructure as Code and automation tools such as Terraform, Ansible, or comparable technologies. 
  • Experience with containers and orchestration platforms such as Kubernetes, EKS, ECS, or Docker in production environments. 
  • Experience building or maintaining CI/CD pipelines and deployment automation in support of secure, reliable software delivery. 
  • Strong understanding of monitoring, observability, incident response, root cause analysis, and performance optimization principles. 
  • Proficiency with one or more scripting or programming languages such as Python, Go, Bash, or PowerShell. 
  • Demonstrated ability to troubleshoot complex systems, automate operational tasks, and collaborate effectively across engineering and operations teams. 
  • Candidates must be eligible to obtain and maintain a Public Trust clearance. 

PREFERRED QUALIFICATIONS 

  • Experience supporting VA, Federal Government, or other regulated environments with strong security and compliance requirements. 
  • Experience defining and operationalizing SLIs, SLOs, error budgets, and service health metrics for production systems. 
  • Familiarity with observability platforms and tools such as Prometheus, Grafana, CloudWatch, ELK, Splunk, or OpenTelemetry. 
  • Experience with FedRAMP, NIST, Zero Trust, or other Federal security frameworks relevant to cloud and platform operations. 
  • Experience supporting healthcare platforms, high-availability enterprise services, or large-scale modernization initiatives. 
  • Relevant certifications such as AWS Certified DevOps Engineer, AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), HashiCorp Terraform Associate, or SRE/DevOps certifications. 
