Site Reliability Engineer (Mumbai, India)

This is a hands-on technical position as a member of the Site Reliability Engineering Group. The primary mission of our SREs is to ensure that ZineOne Cloud runs smoothly and efficiently and scales reliably. The ideal candidate has 3 to 6 years of experience managing a cloud-based Big Data stack and strives to solve operations problems through automation and software tools. You must have a high standard of excellence, solid written and verbal communication skills, a strong customer focus, and technical depth in operating systems, application performance, databases, load balancers, networks, and storage systems.

Location: Mumbai, India

Key Responsibilities: As a ZineOne SRE, you will:

  • Build solutions to problems that impact availability, performance, and stability in our systems, services, and products
  • Develop and maintain cloud infrastructure as code and provision AWS environments for our customers, automating as much ops-related work as possible, such as building, deploying, installing, starting/stopping, restarting, and monitoring multiple machines on cloud providers such as AWS or on-premises
  • Develop and deploy new automated solutions using frameworks that will enable the core platform and features to automatically scale in the cloud
  • Work closely with other members of the group to enable DevOps automation, continuous integration, automated test execution, and continuous delivery of the ZineOne platform and its new features
  • Perform data plumbing/engineering tasks to ensure clean and correct data is ingested into our platform or sent to receiving systems
  • Be proficient in software engineering, shell scripting, Python, Java, and other programming languages, and be willing to research and learn new technologies and frameworks to use when creating automation solutions
  • Develop and implement instrumentation for monitoring the health and availability of services including fault detection, alerting, and recovery
  • Be accountable for backup and business continuity/disaster recovery procedures
  • Develop and maintain documentation for operational practices and procedures, and help drive operational cost reduction

Skills in the following areas, or on similar cloud platforms, will be a plus:

  • Working experience with AWS using both the AWS Management Console and the AWS Command Line Interface (AWS CLI)
  • Strong experience building and maintaining production systems on AWS using EC2, S3, ELB, CloudFront, Elastic Beanstalk, etc.
  • Experience performing technical and system administration to support development, deployment, and delivery, including deep experience administering Linux systems and Node.js, Python, Flask, and Bash shell script modules
  • Experience with real-time big data platforms, including HDFS/HBase architecture, ZooKeeper, and Kafka clusters