About the Role:
- We are seeking an experienced MLOps Engineer to design, build, and maintain the infrastructure and tools that enable our data science and machine learning teams to develop, deploy, and monitor production ML systems at scale. You will bridge the gap between data science and operations, ensuring reliable, efficient, and reproducible ML workflows.
Responsibilities:
Infrastructure & Platform Development
- Design and implement scalable ML infrastructure across on-premises and cloud platforms
- Build and maintain ML experimentation and production environments
- Develop and manage container orchestration systems for ML workloads
- Implement GPU resource management and optimization strategies
- Design storage solutions for datasets, models, and artifacts
ML Pipeline & Automation
- Create CI/CD pipelines for ML model training, validation, and deployment
- Implement automated model retraining and versioning systems
- Build orchestration workflows for data processing and model training
- Develop automated testing frameworks for ML models and pipelines
- Design and implement feature stores for feature engineering and reuse
Monitoring & Operations
- Implement model monitoring systems for performance, drift, and data quality
- Set up logging, alerting, and observability for ML systems
- Establish model governance and compliance tracking
- Create dashboards for model performance and infrastructure metrics
- Develop incident response procedures for production ML systems
Collaboration & Best Practices
- Partner with data scientists and AI engineers to productionize ML models
- Establish MLOps best practices and standards across teams
- Provide technical guidance on deployment architectures
- Document processes, systems, and runbooks
- Mentor junior engineers and data scientists on MLOps practices
Requirements:
Education
- Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience)
- Master's degree preferred, but not required for candidates with sufficient practical experience
Experience
- 2+ years of experience as an ML, Software, or DevOps Engineer
- Proven track record of building production ML systems at scale
- Experience supporting data science teams in enterprise environments
Technical Skills
- Strong proficiency in Python and some experience with at least one systems programming language (C/C++, Go, Rust)
- Deep understanding of containerization and container orchestration (Docker, Kubernetes)
- Hands-on experience with CI/CD tools (Jenkins, GitLab CI, GitHub Actions, etc.)
- Knowledge of ML frameworks (TensorFlow, PyTorch, scikit-learn)
- Experience with workflow orchestration (Airflow, Kubeflow, Prefect, etc.)
- Hands-on experience with experiment tracking tools (MLflow, ClearML)
Core Competencies
- Solid understanding of the ML lifecycle and model development processes
- Strong Linux/Unix systems administration skills
- Experience with version control systems (Git) and branching strategies
- Knowledge of networking, security, and compliance in cloud and on-prem environments
- Understanding of distributed computing and parallel processing
- Knowledge of microservices architecture and API design
Soft Skills:
- Strong problem-solving and debugging abilities
- Excellent communication skills with both technical and non-technical stakeholders
- Ability to work independently and manage multiple priorities
- Collaborative mindset with emphasis on enabling others
- Adaptability to a rapidly changing technology landscape
- Pragmatic approach to balancing innovation with reliability
Preferred Qualifications:
If you have at least three of the skills listed in the sections below, please apply.
Technical Skills:
- Experience with cloud platforms (Azure ML, AWS SageMaker, or GCP Vertex AI)
- Experience with GitOps practices and tools (ArgoCD, Flux, GitLab with GitOps) for declarative infrastructure and ML pipeline management
- Experience with feature stores (Feast, Tecton, Hopsworks, or similar)
- Experience with model monitoring solutions (Evidently, WhyLabs, whylogs, Fiddler, Arize)
- Experience with ML explainability tools (SHAP, LIME, Captum, Alibi, InterpretML)
- Hands-on experience with hyperparameter optimization tools (Optuna, Ray Tune, Hyperopt, Katib)
- Experience with distributed training frameworks (Ray Train, Horovod, DeepSpeed, PyTorch DDP, Megatron)
- Experience with model serving frameworks (TensorFlow Serving, TorchServe, Triton, MLServer, or similar)
- Experience with data versioning tools (DVC, Pachyderm, lakeFS)
- Experience with GPU optimization (CUDA, TensorRT, ONNX Runtime, flash-attention)
- Knowledge of GPU allocation, sharing, management, and profiling
LLM Ops:
- Experience with LLM inference frameworks (vLLM, TGI, TensorRT-LLM)
- Familiarity with agent orchestration frameworks (LangChain, LangGraph, LlamaIndex)
- Experience with LLM optimization: quantization, KV cache management, continuous batching
- Experience with prompt engineering and versioning tools (LangSmith, PromptLayer, Weights & Biases Prompts, Helicone)
We offer:
- 5/2 work schedule, 09:00-18:00;
- Meal allowance;
- Annual performance bonuses;
- Corporate health program: VIP voluntary insurance and special discounts for gyms;
- Access to Digital Learning Platforms.
Interested candidates can apply by clicking the "Apply" button.