Cloud - SRE Manager - Infrastructure at Elastic
Thanks to our ongoing expansion, we have the opportunity to grow our Site Reliability team. We're part of the Elastic Cloud engineering team, focused on solving Cloud operations problems and keeping the SaaS online, and we aren’t afraid to get our hands dirty. We are the first line of consumers for Elastic's products, and our experience helps influence the direction of the stack. While most organizations may have one or a handful of Elastic Stack deployments, here you’ll be responsible for identifying, troubleshooting and reporting platform problems to product engineers (or fixing the code yourself) to ensure that the thousands of Elasticsearch clusters we manage provide a stable and reliable service. We’re looking for people who are just as passionate about troubleshooting issues with distributed systems as they are about automating, coding and collaborating to solve problems.
What You Will Be Doing:
- Manage, mentor and lead a globally distributed team that is passionate about Elastic Cloud infrastructure services, and collaborate with product engineers on issues
- Stay technically focused and participate in software engineering, writing code that continually reduces human intervention in operational tasks and automates processes
- Monitor the Elastic Cloud platform and Cloud infrastructure, responding to incidents, correcting and improving systems to prevent incidents and planning capacity
- Manage Cloud provider infrastructure, system deployments and product releases
- Demonstrate and promote standard methodologies for teams using Cloud platforms
- Participate in 24x365 on-call schedules
- Foster a culture of mutual respect, collaboration and consensus-based decision-making
- Promote collaboration and knowledge sharing with the team, the organisation and the community
- Support development and training through regular mentorship and performance management
- Recruit exceptional candidates to support team growth
- Increase reliability of the Elasticsearch services
- You are a seasoned SRE located in either the US or APJ.
- Experience leading teams of engineers (people manager, tech lead)
- You are comfortable writing software to automate API-driven tasks at scale. SREs use Python and Go regularly but are also encouraged to contribute to the product codebase in Java, Scala, and Python.
- At least three years of experience using a public Cloud: AWS, GCP, Azure, SoftLayer or OpenStack
- You have used Ansible, Puppet, Chef or another config management suite, know where it's broken, and are open to trying new alternatives
- You have experience with project and roadmap planning
- A healthy knowledge of Linux (have compiled your own kernel at some point, know how to trace syscalls, understand TCP, care about the difference between sysvinit/runit/systemd, etc.)
- Relentless desire to automate and build software tools
- Desire to represent work in git, driven by a GitHub workflow through issues and pull requests
- Love open source development, and have contributed to some project somewhere (doesn't have to be ours), whether through mailing lists, patches, documentation, etc.
- Enjoy working remotely and the communication it requires
- Love a diverse environment, working with men and women all over the world
We're looking to hire team members invested in realizing the goal of making real-time data exploration easy and available to anyone. As a distributed company, we believe that diversity drives our vibe! Whether you're looking to launch a new career or grow an existing one, Elastic is the type of company where you can balance great work with great life.
- Competitive pay based on the work you do here and not your previous salary
- Global minimum of 16 weeks of parental leave (moms & dads)
- Generous vacation time and one week of volunteer time off
- Your age is only a number. It doesn't matter if you're just out of college or your children are; we need you for what you can do.