Site Reliability Engineer

Sorry, this job was removed at 9:18 a.m. (CST) on Wednesday, June 30, 2021

Description

 

SparkCognition is an AI leader that offers business-critical solutions for customers in energy, oil and gas, manufacturing, finance, aerospace, defense, and security. A highly awarded company recognized for cutting-edge technology, SparkCognition develops AI-powered, cyber-physical software for the safety, security, reliability, and optimization of IT, OT, and the Industrial IoT.

SparkCognition is looking for a Site Reliability Engineer who can help drive SparkCognition’s production operations initiatives. The ideal candidate has experience in monitoring and maintaining production systems, issue resolution, automation, and continuous improvement. The position offers opportunities for building and designing a modern, automated platform in the cloud, spanning multiple regions around the globe. This is a high-visibility role where the candidate will work across multiple teams to ensure the stability of advanced machine-learning solutions.

Responsibilities

  • Suggest improvements to monitoring processes and tools as needed
  • Deploy production infrastructure and product releases, and maintain systems.
  • Document system requirements, configurations, procedures, changes, incidents, and problem resolution.
  • Work with DevOps to understand product deployments, configurations, and promotion of code from QA to staging and production environments
  • Perform Root Cause Analysis (RCA) of outages and performance issues, and provide feedback to appropriate teams to prevent recurrence.
  • Participate in on-call rotation with the ability to respond to the needs of a 24x7 environment.
  • Incorporate monitoring of systems & applications to prevent system disruptions and ensure that required Service Level Objectives (SLOs) and Service Level Agreements (SLAs) are met.
  • Notify appropriate teams of performance issues and trends.
  • Research technologies & perform other related duties as assigned

Qualifications

  • 3+ years of experience with the Linux CLI, typically in systems administration and/or engineering
  • 1+ year experience with at least one cloud provider platform (AWS, GCP, or Azure)
  • Proficiency in deploying and managing Kubernetes clusters and container-based microservices
  • Proficiency in managing and maintaining both the infrastructure and the key networking components (DNS, Kubernetes ingress, load balancing) of a successful application running on Kubernetes
  • Experience monitoring logs and key metrics with Prometheus/Loki/Grafana, and integrating and forwarding alerts to on-call rotations in Slack/OpsGenie
  • Ability to implement automation and industry best practices to run our large-scale, rapidly growing infrastructure as reliably and securely as possible.
  • Ability to architect system components to be highly available, with ample disaster recovery strategies in place.

Nice to Have

  • Familiarity with DevOps workflows (Git, Jenkins, Spinnaker, Helm)
  • Proficiency with Terraform (or similar infrastructure-as-code deployment templating tool)
  • Experience with Ansible (or similar configuration management tool: Salt, Chef, Puppet)
  • Experience with PostgreSQL or MySQL databases, particularly the ability to run SQL queries to gather information about and debug the application stack
  • Programming experience in one or more languages such as Python, Ruby, Go
  • Experience with Agile methodologies using Jira to track tasks
  • Experience with event streaming and messaging technologies (Apache Kafka, GCP Pub/Sub, or a similar message queue such as AWS SQS or Azure Service Bus)

 

Location

Large office space, renovated in 2022, located near the Arboretum in Austin, TX, including fully stocked beverage and snack areas, along with community spaces that include games and activities.
