i4DM

Monitoring & Incident Management Manager

Posted 19 Days Ago
Remote
Hiring Remotely in USA
Senior level
Lead enterprise monitoring operations and incident management for mission-critical platforms, ensuring system reliability, service restoration, and operational coordination within the Department of Veterans Affairs.
The summary above was generated by AI
Description

About Our Team 

Our employees thrive in a culture that is fast-paced, collaborative, and ego-free, where innovation and teamwork are encouraged at every level. We provide Federal agencies with immediate access to highly skilled professionals who understand complex mission challenges and deliver efficient, scalable solutions. By continuously investing in talent, technology, and specialized capabilities, we maintain expert teams prepared to support evolving Federal missions through tailored technical solutions and modern service delivery approaches. 

We value diverse perspectives and strive to attract talent from all backgrounds. We are seeking professionals who are passionate about technology, mission success, and solving complex operational challenges with creativity and purpose. If you enjoy expanding your technical expertise while supporting impactful Federal initiatives, you will thrive within our organization. Veterans and military spouses are strongly encouraged to apply and bring their valuable experience to our team. 


About the Role 

We are seeking an experienced and highly motivated Monitoring & Incident Management Manager to lead enterprise monitoring operations, incident detection, response coordination, and operational situational awareness supporting a mission-critical platform within the Department of Veterans Affairs (VA) environment. 

In this role, you will serve as the Contractor’s lead responsible for ensuring monitoring and incident management processes effectively support system reliability, operational continuity, and rapid restoration of services across a large-scale, 24x7 enterprise healthcare platform. 

You will work closely with the Program Manager, Technical Directors, DevSecOps & SRE teams, and VA stakeholders to ensure incidents are proactively identified, escalated, communicated, and resolved in alignment with strict service-level expectations and operational standards. 


RESPONSIBILITIES 

Monitoring & Operational Oversight 

  • Lead all monitoring operations supporting enterprise platform services and hosted healthcare applications. 
  • Oversee system health, performance, availability, and reliability across cloud-based and platform environments. 
  • Ensure proactive detection of issues through effective monitoring, alerting, and observability practices rather than relying on user-reported incidents. 
  • Drive improvements in monitoring coverage, alert accuracy, and operational visibility across all platform services.

Incident Management & Response 

  • Lead incident management processes, ensuring timely identification, triage, escalation, tracking, and resolution of incidents impacting mission-critical services. 
  • Coordinate and support major incident response activities, including outage management, stakeholder communication, and service restoration. 
  • Ensure incidents are managed in accordance with defined severity levels, response timelines, and escalation procedures. 
  • Oversee root cause analysis, post-incident reviews, and implementation of corrective and preventive actions. 

Operational Coordination & Stakeholder Engagement 

  • Serve as the primary coordination lead during operational events, ensuring alignment across VA stakeholders, technical leadership, and delivery teams. 
  • Communicate incident status, service impacts, and recovery progress clearly and consistently to stakeholders. 
  • Coordinate rapid response actions during critical incidents to minimize disruption to healthcare services. 
  • Maintain strong collaboration across Program Management, SRE, DevSecOps, and engineering teams. 

Observability & Continuous Improvement 

  • Partner with DevSecOps, SRE, and engineering teams to enhance observability capabilities, including monitoring, logging, and alerting solutions. 
  • Identify recurring issues, operational trends, and system weaknesses, driving continuous service improvement initiatives. 
  • Support adoption of modern monitoring practices, including automation, event correlation, and AIOps capabilities where applicable. 
  • Improve mean time to detect (MTTD) and mean time to resolve (MTTR) across platform services. 

Reporting & Operational Readiness 

  • Maintain operational reporting, including incident metrics, system performance trends, and SLA adherence. 
  • Provide regular updates and dashboards to VA stakeholders on operational health and incident trends. 
  • Ensure readiness of incident response procedures, escalation paths, and communication protocols. 
  • Support operational processes aligned with Agile and SAFe delivery environments. 

Requirements

QUALIFICATIONS 

  • Bachelor’s degree in Information Technology, Computer Science, Engineering, Cybersecurity, or a related field. 
  • 5+ years of experience supporting enterprise monitoring, incident management, or operational environments for mission-critical systems. 
  • Strong expertise in ITIL-based incident management processes, escalation procedures, and service restoration practices. 
  • Experience with modern observability and monitoring tools (e.g., logging, metrics, tracing platforms). 
  • Experience supporting cloud-based or hybrid environments and enterprise-scale application platforms. 
  • Strong communication and coordination skills, with the ability to manage high-pressure operational events across technical and business stakeholders. 
  • Ability to operate in 24x7, SLA-driven environments with strict performance and response requirements. 
  • Candidates must be eligible to obtain and maintain a Public Trust clearance. 

PREFERRED QUALIFICATIONS 

  • Experience supporting VA or Federal Government environments, including familiarity with incident management frameworks and operational procedures. 
  • Experience with AIOps concepts and automation tools to enhance monitoring and incident detection. 
  • Familiarity with platforms such as AWS, Kubernetes, and enterprise monitoring tools (e.g., Splunk, Dynatrace, or similar). 
  • Exposure to SAFe Agile, DevSecOps, and Site Reliability Engineering (SRE) practices. 
  • ITIL, SAFe, or related certifications. 
