Fullbay

Observability & Operations Engineer

Posted 7 Days Ago
In-Office or Remote
Hiring Remotely in Phoenix, AZ
Senior level
The Observability & Operations Engineer designs observability strategies, implements monitoring and AI tools, and manages incident lifecycles while enhancing cloud operations and developer platforms.
The summary above was generated by AI

Observability & Operations Engineer  

About Us:

At Fullbay, our mission is simple: to create safer roads for our families and yours. As leaders in the heavy-duty repair industry, we power shops with technology that helps them run smarter and more efficiently. As an AI-First company, we invite artificial intelligence to eliminate friction, spark innovation, and drive efficiencies in every conversation, for our teams and our customers.

Position Overview:

The Observability & Operations Engineer is a key technical contributor who brings an AI-first mindset to maintaining, monitoring, and operating our AWS cloud environment and internal Developer Platform. In this role, you won’t just react to incidents — you’ll leverage AI-powered tooling, intelligent alerting, and automation to get ahead of problems before they impact users. You’ll work deeply across AWS and its PaaS ecosystem, building repeatable, code-first pipelines that treat infrastructure and observability configuration as first-class software. From using AI coding assistants to accelerate runbook development, to applying ML-based anomaly detection across logs and metrics, you’ll be expected to ask “how can AI help here?” as a first instinct. Working within a dedicated platform team, you’ll build the observability foundations that keep our systems fast, reliable, and self-healing.

Primary Duties & Responsibilities:

  • Design and implement a comprehensive observability strategy (logging, metrics, tracing, alerting) across all AWS environments, leveraging AI-powered tools to detect anomalies and surface insights automatically
  • Build and manage monitoring platforms such as Datadog, Grafana, Prometheus, and AWS CloudWatch — actively exploring AI-native features within these tools to reduce alert fatigue and improve signal quality
  • Use AI coding assistants (e.g. GitHub Copilot, Claude) to accelerate development of dashboards, runbooks, and automation scripts
  • Own the incident management lifecycle — on-call rotations, post-mortems, root cause analysis — and apply AI-assisted log analysis to speed up diagnosis and resolution
  • Instrument Java, Kotlin, and Node.js-based cloud-native applications to emit structured logs, distributed traces, and metrics; identify opportunities to use ML-based anomaly detection in place of static thresholds
  • Build repeatable, code-first observability pipelines that treat dashboards, alerts, and runbooks as first-class software — versioned, tested, and deployed through Harness
  • Leverage AWS PaaS services (Lambda, API Gateway, ECS, RDS, SQS, SNS, and others) to build scalable, automated operational tooling
  • Collaborate with development teams to embed observability and AI-assisted quality checks into CI/CD pipelines via Harness
  • Own the FinOps function for our AWS environment — tracking cloud spend, building cost dashboards, identifying waste, and using AI-powered cost analysis tools to surface optimization opportunities and drive accountability across engineering teams
  • Monitor AWS infrastructure for performance, availability, and cost — partnering with finance and engineering to enforce spend governance
  • Develop and maintain Infrastructure as Code with Terraform, using AI pair programming to improve quality and consistency
  • Contribute to architectural decisions with a focus on resilience, automation, and reducing toil through intelligent systems
  • Adhere to all confidentiality and compliance regulations
  • Perform other duties as assigned

Minimum Education & Work Experience:

  • 7–10 years of experience in Software Engineering, Cloud Operations, or Site Reliability Engineering
  • 5+ years of hands-on experience with AWS infrastructure and AWS PaaS services; certifications are a plus
  • Demonstrated experience building repeatable, code-first pipelines and treating operational configuration as first-class software
  • Experience working with polyglot environments including Java, Kotlin, and Node.js
  • Demonstrated experience using AI tools (coding assistants, AI-powered observability platforms, or similar) in a professional setting — we’re an AI-first company and expect this to be part of how you work, not something you’re just exploring

Key Skills and Qualifications:

  • Deep experience with enterprise observability platforms, including AWS-native tooling such as CloudWatch and X-Ray, the vendor-neutral OpenTelemetry standard, or comparable platforms such as Datadog, Grafana, or Prometheus
  • Proficiency with distributed tracing frameworks and log management platforms (e.g. ELK Stack, Splunk, Fluent Bit); experience mapping these patterns to AWS-native tooling is a strong plus
  • Strong understanding of SRE principles including SLOs, SLAs, error budgets, and chaos engineering
  • Hands-on FinOps experience — cloud cost allocation, chargeback modeling, rightsizing, and savings plans optimization across AWS
  • Strong working knowledge of AWS PaaS services including Lambda, API Gateway, ECS, RDS, SQS, SNS, and IAM — and how to leverage them to build scalable operational tooling
  • Experience instrumenting polyglot applications (Java, Kotlin, Node.js) and cloud-native microservices for observability
  • Proven ability to build repeatable, code-first pipelines — treating dashboards, alerts, runbooks, and infrastructure configuration as versioned, testable software
  • Experience with CI/CD tooling, specifically Harness
  • Solid understanding of Infrastructure as Code using Terraform
  • Fluency with AI tools in day-to-day work — whether that’s AI coding assistants, AI-powered monitoring features, or using LLMs to accelerate problem solving; you default to asking “can AI help here?” before doing things the hard way
  • Ability to lead incident response, facilitate blameless post-mortems, and drive long-term reliability improvements
  • Strong collaboration skills for working across platform and product engineering teams
  • Knowledge of containerization technologies and microservices architecture

Physical Demands and Work Environment:

The physical demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.

  • Regularly required to sit at a desk in front of a computer; use hands to finger, handle, or feel objects, tools, or controls (including a computer keyboard and a telephone); and lift and/or move up to 10 pounds.
  • Frequently requires the use of hands and arms for reaching, as well as the ability to walk and communicate effectively through speaking and listening.
  • Specific vision abilities required by this position include close vision, color vision, and the ability to adjust focus.   
  • Noise level in the work environment is usually moderate.
  • Type on a computer keyboard, look at a computer monitor, and operate a cell phone or a computer-based phone.


Top Skills

AI
AWS
AWS CloudWatch
CI/CD
Datadog
Grafana
Harness
Java
Kotlin
Node.js
Prometheus
Terraform
