Job description:
- Design and implement scalable infrastructure to support growing systems, ensuring high availability and performance;
- Enhance system reliability and security by identifying potential risks and proactively implementing solutions;
- Automate manual processes and repetitive tasks to improve operational efficiency and reduce human error;
- Lead incident response efforts, including troubleshooting, root cause analysis, and post-incident reviews;
- Maintain high uptime and availability of services through monitoring, alerting, and swift recovery from outages;
- Optimize system performance by identifying bottlenecks and fine-tuning infrastructure for scalability and efficiency;
- Drive Continuous Integration and Continuous Deployment (CI/CD) processes to enable rapid, safe, and automated code releases;
- Implement and manage robust monitoring and alerting systems to proactively detect issues before they impact end users;
- Collaborate closely with development teams to ensure smooth application deployment and operational excellence;
- Champion DevOps best practices, fostering a culture of collaboration, automation, and continuous improvement across teams.
Requirements:
Technical Expertise:
- Programming and Scripting: Strong proficiency in scripting languages (e.g., Python, Bash) for automation and orchestration, and in general-purpose programming languages (e.g., Go, Java) for building and maintaining systems;
- Containerization and Orchestration: Experience with Docker, Docker Swarm, and Kubernetes for container management, deployment, and orchestration; knowledge of Helm for managing Kubernetes applications;
- Cloud Platforms: Familiarity with cloud services (AWS, GCP, Azure) and cloud-native design patterns, including serverless architecture, cloud storage, and network design in cloud environments;
- Networking Fundamentals: Comprehensive understanding of network protocols (TCP/IP, HTTP/S), load balancing, DNS, VPN, and firewall management to ensure secure, high-performance network operations;
Systems Architecture:
- Infrastructure Design: Deep understanding of scalable, reliable, and cost-effective infrastructure design, including experience with microservices architecture and distributed systems;
- Operating Systems: Strong expertise in Linux (various distributions) and Windows, with a focus on system performance tuning, security hardening, and troubleshooting;
- Resilience and High Availability: Experience in designing fault-tolerant systems with high availability configurations (e.g., clustering, replication, failover), ensuring minimal downtime;
Networking and Security:
- Security Best Practices: Understanding of security protocols (SSL/TLS, SSH, VPN) and IAM policies; experience with implementing zero-trust architecture and robust access controls;
- Vulnerability Management: Experience conducting vulnerability assessments, identifying security gaps, and deploying patches or mitigations to strengthen security posture;
- Network Security: Ability to configure network firewalls, intrusion detection/prevention systems (IDS/IPS), and DDoS protection;
Communication:
- Collaboration: Strong interpersonal skills for collaborating effectively across cross-functional teams, including product, engineering, and leadership;
- Technical Documentation: Ability to articulate complex technical topics through clear documentation, diagrams, and presentations for both technical and non-technical stakeholders;
Problem-solving and Troubleshooting:
- Incident Management: Ability to analyze and resolve incidents quickly and effectively under pressure, identify root causes, and propose proactive solutions to prevent recurrence;
- Diagnostic Skills: Advanced diagnostic skills for analyzing logs, metrics, and traces to troubleshoot complex distributed systems and optimize their performance;
DevOps Practices:
- CI/CD Pipelines: Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) to automate testing, integration, and deployment; understanding of blue-green, canary, and rolling deployment strategies;
- Infrastructure as Code (IaC): Proficiency with IaC tools like Terraform, Ansible, or CloudFormation to automate infrastructure provisioning, configuration, and management;
Experience:
- Managing Large-Scale Systems: Proven track record in managing large-scale distributed systems, with a focus on scalability, reliability, and performance optimization;
- Infrastructure Automation: Ability to design, implement, and improve infrastructure automation, configuration management, and self-healing systems;
- Monitoring and Observability: Extensive experience with monitoring, alerting, and logging tools (e.g., Prometheus, VictoriaMetrics, Grafana, ELK Stack, Datadog, Dynatrace, New Relic); ability to define and monitor SLOs and SLAs;
- Incident Response and RCA: Experience leading incident response, conducting root cause analysis (RCA), and implementing corrective actions to reduce future incidents and increase system resilience;
- Performance Optimization: Experience regularly analyzing and optimizing system and application performance to ensure efficient resource usage and an improved end-user experience;
- Disaster Recovery and Business Continuity: Experience developing and executing disaster recovery strategies, including backups, failover procedures, and regular testing, to ensure data integrity and business continuity;
- Security Compliance: Experience implementing and enforcing security policies and standards, conducting periodic audits and vulnerability scans, and ensuring compliance with industry regulations and standards (e.g., GDPR, HIPAA, PCI DSS);
- Documentation and Knowledge Sharing: Experience creating and maintaining runbooks, architecture diagrams, and training materials, and providing guidance and mentorship on SRE best practices within the organization;
End-to-End SDLC Expertise:
- Full Lifecycle Experience: Expertise across all SDLC phases, including requirements analysis, system design, development, testing, deployment, monitoring, feedback, and continuous optimization.