İnnovasiya və Rəqəmsal İnkişaf Agentliyi

Government

Site Reliability Engineer

  • Deadline: 15 December 2024

Job description:

  • Design and implement scalable infrastructure to support growing systems, ensuring high availability and performance;
  • Enhance system reliability and security by identifying potential risks and proactively implementing solutions;
  • Automate manual processes and repetitive tasks to improve operational efficiency and reduce human error;
  • Lead incident response efforts, including troubleshooting, root cause analysis, and post-incident reviews;
  • Maintain high uptime and availability of services, including monitoring, alerting, and ensuring swift recovery from outages;
  • Optimize system performance by identifying bottlenecks and fine-tuning infrastructure for scalability and efficiency;
  • Drive Continuous Integration and Continuous Deployment (CI/CD) processes to enable rapid, safe, and automated code releases;
  • Implement and manage robust monitoring and alerting systems to proactively detect issues before they impact end users;
  • Collaborate closely with development teams to ensure smooth application deployment and operational excellence;
  • Champion DevOps best practices, fostering a culture of collaboration, automation, and continuous improvement across teams.

Requirements:

 Technical Expertise: 

  • Programming and Scripting: Strong proficiency in scripting languages (e.g., Python, Bash) for automation and orchestration, and coding languages (e.g., Go, Java) for building and maintaining systems;
  • Containerization and Orchestration: Experience with Docker, Docker Swarm, and Kubernetes for container management, deployment, and orchestration; knowledge of Helm for managing Kubernetes applications;
  • Cloud Platforms: Familiarity with cloud services (AWS, GCP, Azure) and cloud-native design patterns, including serverless architecture, cloud storage, and network design in cloud environments;
  • Networking Fundamentals: Comprehensive understanding of network protocols (TCP/IP, HTTP/S), load balancing, DNS, VPN, and firewall management to ensure secure, high-performance network operations;

 Systems Architecture: 

  • Infrastructure Design: Deep understanding of scalable, reliable, and cost-effective infrastructure design, including experience with microservices architecture and distributed systems;
  • Operating Systems: Strong expertise in Linux (various distributions) and Windows, with a focus on system performance tuning, security hardening, and troubleshooting;
  • Resilience and High Availability: Experience in designing fault-tolerant systems with high availability configurations (e.g., clustering, replication, failover), ensuring minimal downtime;

Networking and Security: 

  • Security Best Practices: Understanding of security protocols, SSL/TLS, SSH, VPN, and IAM policies; experience with implementing zero-trust architecture and robust access controls;
  • Vulnerability Management: Experience conducting vulnerability assessments, identifying security gaps, and deploying patches or mitigations to enhance security posture;
  • Network Security: Ability to configure network firewalls, intrusion detection/prevention systems (IDS/IPS), and DDoS protection;

 Communication: 

  • Collaboration: Strong interpersonal skills for collaborating effectively across cross-functional teams, including product, engineering, and leadership;
  • Technical Documentation: Ability to articulate complex technical topics through clear documentation, diagrams, and presentations for both technical and non-technical stakeholders;

 Problem-solving and Troubleshooting: 

  • Incident Management: Ability to analyze and resolve incidents quickly and effectively under pressure, and to provide insights into root causes and proactive solutions that prevent recurrence;
  • Diagnostic Skills: Advanced diagnostic skills for analyzing logs, metrics, and traces to troubleshoot complex distributed systems and optimize their performance;

 DevOps Practices: 

  • CI/CD Pipelines: Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) to automate testing, integration, and deployment; understanding of blue-green, canary, and rolling deployment strategies;
  • Infrastructure as Code (IaC): Proficiency with IaC tools like Terraform, Ansible, or CloudFormation to automate infrastructure provisioning, configuration, and management;

 Experience: 

  • Managing Large-Scale Systems: Proven track record in managing large-scale distributed systems, with a focus on scalability, reliability, and performance optimization;
  • Infrastructure Automation: Ability to design, implement, and improve infrastructure automation, configuration management, and self-healing systems; 
  • Monitoring and Observability: Extensive experience with monitoring, alerting, and logging tools (e.g., Prometheus, VictoriaMetrics, Grafana, ELK Stack, Datadog, Dynatrace, New Relic); ability to define and monitor SLOs and SLAs;
  • Incident Response and RCA: Lead incident response, conduct root cause analysis (RCA), and implement corrective actions to reduce future incidents and increase system resilience;
  • Performance Optimization: Regularly analyze and optimize system and application performance, ensuring efficient resource usage and improved end-user experience;
  • Disaster Recovery and Business Continuity: Develop and execute disaster recovery strategies, including backups, failover procedures, and regular testing to ensure data integrity and business continuity;
  • Security Compliance: Implement and enforce security policies and standards, conduct periodic audits and vulnerability scans, and ensure compliance with industry regulations (e.g., GDPR, HIPAA, PCI-DSS);
  • Documentation and Knowledge Sharing: Create and maintain runbooks, architecture diagrams, and training materials; provide guidance and mentorship on SRE best practices within the organization;

 End-to-End SDLC Expertise: 

  • Full Lifecycle Experience: Expertise across all SDLC phases, including requirements analysis, system design, development, testing, deployment, monitoring, feedback, and continuous optimization.