MLOps Lead, Central Technology
About the position

Responsibilities
• Provide technical MLOps leadership for a team of MLOps Engineers, managing and leading the team in operating AI training and inference systems.
• Drive the application of MLOps and DevOps principles across multiple platforms, ensuring peak operational efficiency.
• Define an end-to-end metrics program, including full proactive monitoring and alerting systems, for the MLOps team.
• Facilitate model training through collaboration with AI Researchers to ensure best practices in machine learning and deep learning.
• Optimize the Kubernetes-based AI Lifecycle platform through IaC practices and integrate it with on-prem HPC systems.
• Collaborate on data systems for AI model training with the Data Infrastructure Eng team and Science data teams.
• Lead the MLOps team in supporting the on-call rotation, with a focus on automation and proactive alerting.

Requirements
• BS, MS, or PhD degree in Computer Science or a related technical discipline, or equivalent experience.
• 7+ years of relevant coding and systems experience.
• 5+ years of systems architecture and design experience, with a broad range of MLOps experience.
• Proven technical leadership in SRE and MLOps.
• Strong experience scaling containerized applications on Kubernetes or Mesos.
• Cloud platform proficiency with AWS, GCP, or Azure.
• MLOps experience working with medium- to large-scale GPU clusters in Kubernetes.
• Working knowledge of NVIDIA CUDA and AI/ML custom libraries.
• Knowledge of Linux systems optimization and administration.
• Solid coding experience with a systems language such as Rust, C/C++, C#, Go, Java, or Scala.
• Expertise with a scripting language such as Python, PHP, or Ruby.
• Experience in integrating data with the AI Lifecycle.
• AI/ML platform operations experience in an environment with demanding data and systems platform challenges.
• Large-scale streaming data systems integration experience.
• Experience with Hadoop, Spark, and/or Kafka deployments.
• Experience with workflow scheduling tools such as Apache Airflow, Dagster, or Apache Beam.
• Understanding of Data Engineering, Data Governance, Data Infrastructure, and AI/ML execution platforms.

Nice-to-haves
• Experience with PyTorch, Keras, or TensorFlow.
• Experience with HPC and Slurm.

Benefits
• Generous employer match on employee 401(k) contributions.
• Annual benefit for employees that can be used for housing, student loan repayment, childcare, commuter costs, or other life needs.
• CZI Life of Service Gifts awarded to employees to support causes closest to them.
• Paid time off to volunteer at an organization of your choice.
• Funding for select family-forming benefits.
• Relocation support for employees moving to the Bay Area.