Site Reliability Engineer

Job description:

Design and implement scalable infrastructure to support growing systems, ensuring high availability and performance;
Enhance system reliability and security by identifying potential risks and proactively implementing solutions;
Automate manual processes and repetitive tasks to improve operational efficiency and reduce human error;
Lead incident response efforts, including troubleshooting, root cause analysis, and post-incident reviews;
Maintain high uptime and availability of services, including monitoring, alerting, and ensuring swift recovery from outages;
Optimize system performance by identifying bottlenecks and fine-tuning infrastructure for scalability and efficiency;
Drive Continuous Integration and Continuous Deployment (CI/CD) processes to enable rapid, safe, and automated code releases;
Implement and manage robust monitoring and alerting systems to proactively detect issues before they impact end users;
Collaborate closely with development teams to ensure smooth application deployment and operational excellence;
Champion DevOps best practices, fostering a culture of collaboration, automation, and continuous improvement across teams.

Requirements:

Technical Expertise:

Programming and Scripting: Strong proficiency in scripting languages (e.g., Python, Bash) for automation and orchestration, and in general-purpose programming languages (e.g., Go, Java) for building and maintaining systems;
Containerization and Orchestration: Experience with Docker, Docker Swarm, and Kubernetes for container management, deployment, and orchestration; knowledge of Helm for managing Kubernetes applications;
Cloud Platforms: Familiarity with cloud services (AWS, GCP, Azure) and cloud-native design patterns, including serverless architecture, cloud storage, and network design in cloud environments;
Networking Fundamentals: Comprehensive understanding of network protocols (TCP/IP, HTTP/S), load balancing, DNS, VPN, and firewall management to ensure secure, high-performance network operations;

Systems Architecture:

Infrastructure Design: Deep understanding of scalable, reliable, and cost-effective infrastructure design, including experience with microservices architecture and distributed systems;
Operating Systems: Strong expertise in Linux (various distributions) and Windows, with a focus on system performance tuning, security hardening, and troubleshooting;
Resilience and High Availability: Experience in designing fault-tolerant systems with high availability configurations (e.g., clustering, replication, failover), ensuring minimal downtime;

Networking and Security:

Security Best Practices: Understanding of security protocols and mechanisms (SSL/TLS, SSH, VPN) and IAM policies; experience with implementing zero-trust architecture and robust access controls;
Vulnerability Management: Experience conducting vulnerability assessments, identifying security gaps, and deploying patches or mitigations to enhance security posture;
Network Security: Ability to configure network firewalls, intrusion detection/prevention systems (IDS/IPS), and DDoS protection;

Communication:

Collaboration: Strong interpersonal skills for collaborating effectively across cross-functional teams, including product, engineering, and leadership;
Technical Documentation: Ability to articulate complex technical topics through clear documentation, diagrams, and presentations for both technical and non-technical stakeholders;

Problem-solving and Troubleshooting:

Incident Management: Ability to analyze and resolve incidents quickly and effectively under pressure, and to provide insights into root causes and proactive solutions that prevent recurrences;
Diagnostic Skills: Advanced diagnostic skills for analyzing logs, metrics, and traces to troubleshoot complex distributed systems and optimize their performance;

DevOps Practices:

CI/CD Pipelines: Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) to automate testing, integration, and deployment; understanding of blue-green, canary, and rolling deployment strategies;
Infrastructure as Code (IaC): Proficiency with IaC tools like Terraform, Ansible, or CloudFormation to automate infrastructure provisioning, configuration, and management;

Experience:

Managing Large-Scale Systems: Proven track record in managing large-scale distributed systems, with a focus on scalability, reliability, and performance optimization;
Infrastructure Automation: Ability to design, implement, and improve infrastructure automation, configuration management, and self-healing systems;
Monitoring and Observability: Extensive experience with monitoring, alerting, and logging tools (e.g., Prometheus, VictoriaMetrics, Grafana, ELK Stack, Datadog, Dynatrace, New Relic); ability to define and monitor SLOs and SLAs;
Incident Response and RCA: Experience leading incident response, conducting root cause analysis (RCA), and implementing corrective actions to reduce future incidents and increase system resilience;
Performance Optimization: Experience regularly analyzing and optimizing system and application performance, ensuring efficient resource usage and improved end-user experience;
Disaster Recovery and Business Continuity: Experience developing and executing disaster recovery strategies, including backups, failover procedures, and regular testing to ensure data integrity and business continuity;
Security Compliance: Experience implementing and enforcing security policies and standards, conducting periodic audits and vulnerability scans, and ensuring compliance with industry regulations and standards (e.g., GDPR, HIPAA, PCI DSS);
Documentation and Knowledge Sharing: Experience creating and maintaining runbooks, architecture diagrams, and training materials, and providing guidance and mentorship on SRE best practices within the organization;

End-to-End SDLC Expertise:

Full Lifecycle Experience: Expertise across all SDLC phases, including requirements analysis, system design, development, testing, deployment, monitoring, feedback, and continuous optimization.

Innovation and Digital Development Agency (İnnovasiya və Rəqəmsal İnkişaf Agentliyi)