Cloud - SRE - Reliability

Elastic

Sorry, this job was removed at 2:33 p.m. (CST) on Thursday, September 3, 2020

Elastic is a search company with a simple goal: to solve the world's data problems with products that delight and inspire. As the creators of the Elastic Stack, we help thousands of organizations including Cisco, eBay, Goldman Sachs, Microsoft, The Mayo Clinic, NASA, The New York Times, Wikipedia, Verizon, and many more use Elastic to power mission-critical systems. From stock quotes to Twitter streams, Apache logs to WordPress blogs, our products are extending what's possible with data, delivering on the promise that good things come from connecting the dots. We have a distributed team of Elasticians across 30+ countries (and counting), and our diverse open source community spans over 100 countries. Learn more at elastic.co.

Thanks to our ongoing expansion, we have the opportunity to grow our Cloud SRE - Reliability team, the front-line owners of Incident Management, Investigation, and Response for the Elastic Cloud platform. We take a Site Reliability Engineering approach to addressing stability concerns, so we’re looking for people who are just as passionate about resolving distributed system issues as they are about coding and collaborating with others. In this role you’ll be responsible for the health of thousands of Elasticsearch clusters spread across all major cloud providers.

Who you are:

You have outstanding interpersonal skills, and can effectively coordinate incident response across globally distributed teams in a dynamic, growing environment
You are a software engineer at heart, with a compulsion to automate yourself out of a job
You have production-grade experience operating Linux systems, with the ability to methodically diagnose system, network, and application issues
Experience with GovCloud is welcome

What you’ll do:

In this role you will:

participate in a weekly on-call rotation, using a follow-the-sun model; on-call shifts are aligned with local business hours
provide low-latency response to incidents and service instability, coordinating with internal and external teams as needed
contribute to tooling, automation, and system engineering efforts, freeing yourself and others from day-to-day toil
lead blameless post-mortems, ensuring preventative actions are prioritised appropriately
be an advocate for Elastic Cloud customers, sharing your deep insight into our production systems with other engineering teams

What you’ve done:

You don't need to have all of these items, but these represent the types of work you will do at Elastic Cloud

You have operated a SaaS product in a public cloud (AWS, GCP, Azure, or SoftLayer preferred), and have some stories to share
You are adept at writing software to automate orchestration tasks at scale; we commonly use Python, Go, and Shell scripting
You can use metrics systems (e.g. Elastic, Graphite, Prometheus, Influx) effectively to diagnose issues and quantify impacts
You have worked with cloud infrastructure-as-code tooling such as Terraform, CloudFormation, or others
You've diagnosed and resolved Elastic Stack cluster issues
You are familiar with containerisation and container orchestration concepts
