Senior Site Reliability Engineer, Observability
Who We Are
The name ThousandEyes was born from two big ideas: the power to see things not ordinarily possible and the ability to collect insights from a multitude of vantage points. As organizations rely more on cloud services and the Internet, the network has become a black box they can't understand. Our Internet and cloud intelligence platform delivers the only collectively powered view of the Internet, cloud and SaaS platforms, helping enterprises and service providers work together to identify problems before they impact revenue, damage brand reputation, or halt employee productivity.
In August 2020, Cisco Systems completed the acquisition of ThousandEyes, which now forms the ThousandEyes Business Unit within Cisco’s Network Services Business Group, and is a foundational component of Cisco’s growing Observability business.
About The Role
The Site Reliability Engineering team focused on Observability is responsible for providing the tools, services, and infrastructure to monitor and observe the ThousandEyes platform. Leveraging cloud-native tools like Prometheus, Grafana, Kibana, and even ThousandEyes itself, we enable our developers to instrument, analyze, and monitor their applications. The Senior Site Reliability Engineer in this role will work with the team to own our logging pipeline and monitoring stack, and with developers to continuously improve our view of the platform.
Responsibilities
- Design and implement visibility into our platform as we grow to multi-region scale.
- Design, deploy, and maintain cloud-native monitoring services in AWS and GCP that are elastic and resilient to failure.
- Provide standards and best practices for instrumentation of container-based services and cloud-managed services.
- Maintain our alerting pipeline so that we are notified of the right things, at the right time, in the right places.
- Drive automation wherever possible, enabling our monitoring platforms to scale effortlessly. Think self-service.
- Participate in and help improve our 24x7 incident response and on-call rotation.
Required Skills
- Strong Infrastructure as Code skills, ideally with Terraform and Kubernetes.
- Strong knowledge of modern logging tool sets, including Logstash or Fluentd.
- Understanding of Prometheus and its ecosystem, including Alertmanager.
- Good knowledge of Application Performance Monitoring tools and crash reporting tools, such as Sentry.
- Good knowledge of cloud provider managed services, and how they can be leveraged in our context.
- Ability to write high-quality code in Python, Go, or equivalent languages.
We Are Cisco
#WeAreCisco, where each person is unique, but we bring our talents to work as a team and make a difference. Here’s how we do it.
We embrace digital, and help our customers implement change in their digital businesses. Some may think we’re “old” (30 years strong!) and only about hardware, but we’re also a software company. And a security company. An AI/Machine Learning company. We even invented an intuitive network that adapts, predicts, learns and protects. No other company can do what we do – you can’t put us in a box!
But “Digital Transformation” is an empty buzz phrase without a culture that allows for innovation, creativity, and yes, even failure (if you learn from it).
Day to day, we focus on the give and take. We give our best, we give our egos a break and we give of ourselves (because giving back is built into our DNA). We take accountability, we take bold steps, and we take difference to heart. Because without diversity of thought and a commitment to equality for all, there is no moving forward.
So, you have colorful hair? Don’t care. Tattoos? Show off your ink. Like polka dots? That’s cool.