MLOps Engineer

About the Role:

We are seeking an experienced MLOps Engineer to design, build, and maintain the infrastructure and tools that enable our data science and machine learning teams to develop, deploy, and monitor production ML systems at scale. You will bridge the gap between data science and operations, ensuring reliable, efficient, and reproducible ML workflows.

Responsibilities:

Infrastructure & Platform Development

ML Pipeline & Automation

Monitoring & Operations

Collaboration & Best Practices

Requirements:

Education

Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience)
Master's degree preferred, but not required for candidates with sufficient practical experience

Experience

Technical Skills

Strong proficiency in Python and some experience with at least one low-level programming language (C/C++, Go, Rust)
Deep understanding of containerization (Docker, Kubernetes)
Hands-on experience with CI/CD tools (Jenkins, GitLab CI, GitHub Actions, etc.)
Knowledge of ML frameworks (TensorFlow, PyTorch, scikit-learn)
Experience with workflow orchestration (Airflow, Kubeflow, Prefect, etc.)
Hands-on experience with experiment tracking tools (MLflow, ClearML)

Core Competencies

Solid understanding of the ML lifecycle and model development processes
Strong Linux/Unix systems administration skills
Experience with version control systems (Git) and branching strategies
Knowledge of networking, security, and compliance in cloud and on-prem environments
Understanding of distributed computing and parallel processing
Knowledge of microservices architecture and API design

Soft Skills:

Strong problem-solving and debugging abilities
Excellent communication skills with both technical and non-technical stakeholders
Ability to work independently and manage multiple priorities
Collaborative mindset with emphasis on enabling others
Adaptability to a rapidly changing technology landscape
Pragmatic approach to balancing innovation with reliability

Preferred Qualifications:

If you have at least 3 of the skills from the sections below, please apply.

Technical skills:

Experience with cloud platforms (Azure ML, AWS SageMaker, or GCP Vertex AI)
Experience with GitOps practices and tools (ArgoCD, Flux, GitLab with GitOps) for declarative infrastructure and ML pipeline management
Experience with feature stores (Feast, Tecton, Hopsworks, or similar)
Experience with model monitoring solutions (Evidently, WhyLabs, Fiddler, Arize, whylogs)
Experience with ML explainability tools (SHAP, LIME, Captum, Alibi, InterpretML)
Hands-on experience with hyperparameter optimization tools (Optuna, Ray Tune, Hyperopt, Katib)
Experience with distributed training frameworks (Ray Train, Horovod, DeepSpeed, PyTorch DDP, Megatron)
Experience with model serving frameworks (TensorFlow Serving, TorchServe, Triton, MLServer, or similar)
Experience with data versioning tools (DVC, Pachyderm, lakeFS)
Experience with GPU optimization (CUDA, TensorRT, ONNX Runtime, flash-attention)
Knowledge of GPU allocation, sharing, management and profiling

LLM Ops:

Experience with LLM inference frameworks (vLLM, TGI, TensorRT-LLM)
Familiarity with agent orchestration frameworks (LangChain, LangGraph, LlamaIndex)
Experience with LLM optimization: quantization, KV cache management, continuous batching
Experience with prompt engineering and versioning tools (LangSmith, PromptLayer, Weights & Biases Prompts, Helicone)

We offer:

Work schedule: 5/2 (five working days, two days off), 09:00-18:00;
Meal allowance;
Annual performance bonuses;
Corporate health program: VIP voluntary insurance and special discounts for gyms;
Access to Digital Learning Platforms.

Interested candidates can apply by clicking the "Apply" button.

Caspian Innovation Center