Senior Site Reliability Engineer
BigCommerce, named a “Best Place to Work” in Australia, a “Best and Brightest” place to work in San Francisco, and a “Best Place to Work” in Austin, is looking for a full-time Senior Site Reliability Engineer in our San Francisco or Downtown Austin office.
Our SRE team is made up of talented and enthusiastic individuals with extensive experience running, managing and scaling large-scale web operations and systems administration. The team works closely with the rest of our Engineering organization to ensure that the platform powering BigCommerce remains reliable, performant and secure, 24x7.
We’re looking for an experienced candidate who brings solid systems administration skills that are backed up by an innate understanding of software engineering, with a mindset skewed towards performance analysis, scalability and high availability.
Day to day you’ll find us with our noses in the terminal, using Terraform and Puppet to manage our Debian hosts in a heterogeneous environment of Docker containers, VMs and bare-metal servers. We rely heavily on Logstash and Grafana for the data we need to focus our attention on diagnosing and resolving performance issues across a variety of software built in PHP, Ruby, Scala (JVM) and, on occasion, Go.
We are always working to empower the BigCommerce Engineering teams and to deliver a faster and more robust platform.
What makes you tick:
- A software engineer with a curiosity for operations, or an operations engineer who wants to work closely with software engineers to help improve response times, scalability and availability.
- Someone who loves to code and enjoys working with multiple programming languages. We primarily work with PHP, Ruby and Python. Puppet manages all of our configuration.
- A good communicator who works well with geographically distributed teams such as ours. We are split between Sydney, Austin, and San Francisco.
- You're obsessive compulsive, in a good way. Your systems and scripts are clean, well-documented and comprehensible.
- You hate doing the same thing twice; you’d rather spend the time automating a problem away than spend time on it again.
- You have a passion for learning when it comes to working with new technologies or languages.
- You live and breathe scalable web architectures.
- You’re cool in a crisis and can align with others to ensure complex problems meet a timely and effective resolution.
- While work is a big part of your life, you strive to maintain a good balance between the office and home. The pager is an important part of your day, but you don’t let it rule your life.
Who you are:
Our ideal candidate possesses some or all of the following skills:
- 3-5 years’ experience operating, building software for, or supporting large Linux-based web application environments.
- Experience with Linux systems administration, including solid scripting skills (PHP, Python, Ruby, Perl) and Bash.
- Knowledge of configuration management systems such as Puppet or Chef (we use Puppet).
- Experience with popular tools for monitoring web applications (we use New Relic, Nagios, Graphite, and StatsD to name a few).
- A wealth of knowledge in physical server and cloud environments.
Curious what we’ve been up to?
Want to know what our Site Reliability Engineering team has been working on, or which upcoming roadmap items you could be involved in?
- We ensured 100% uptime during North America’s busiest shopping period, from Thanksgiving through Cyber Monday (known around here as Cyber 5).
- We’ve designed and built software (in Bash) to automatically upgrade system OS packages. Systems using this code now install OS upgrades with zero touch, so we can concentrate on more automation and on serving our customers’ needs.
- Developing an intelligent incident management and response process, with automation in the form of a chatbot, that gives anyone on the team the ability to comfortably handle anything thrown at them.
- Creating automation to identify and remove spam from our systems. This keeps our mail reputation high and ensures we can deliver order emails effectively.
- Deploying and scaling our Integration environment out into a second datacenter, giving the software engineering teams additional resources so they can release and test faster, and improving parity with our Production environment.