RamiKrispin / ai-dev-2024-ml-workshop

Materials for the AI Dev 2024 conference workshop "Deploy and Monitor ML Pipelines with Python, Open Source, and Free Applications"



Repository Overview (README excerpt)


Deploy and Monitor ML Pipelines with Open Source and Free Applications

Materials for the **Deploy and Monitor ML Pipelines with Open Source and Free Applications** workshop at the AI_dev 2024 conference in Paris, France.

When 📆: Wednesday, June 19th, 13:50 CEST

The workshop is based on the LinkedIn Learning course Data Pipeline Automation with GitHub Actions, and the course code is available online.

The workshop will focus on different deployment designs of machine learning pipelines using open-source applications and free-tier tools. We will use the US hourly demand for electricity data from the EIA API to demonstrate the deployment of a pipeline with GitHub Actions and Docker that fully automates the data refresh process and generates a forecast on a regular basis. This includes the use of open-source tools such as MLflow and YData Profiling to monitor the health of the data and the model's success. Last but not least, we will use a Quarto doc to set up the monitoring dashboard and deploy it on GitHub Pages.

The Seine River, Paris (created with Midjourney)

Table of Contents
• Milestones
• Scope
• Set a Development Environment
• Data Pipeline
• Forecasting Models
• Metadata
• Dashboard
• Deployment
• Resources
• License

Milestones

To organize and track the project requirements, we will set up a GitHub Project, create general milestones, and use issues to define sub-milestones.
For setting up a data/ML pipeline, we will define the following milestones:
• Define scope and requirements:
  • Pipeline scope
  • Forecasting scope
  • General tools and requirements
• Set a development environment:
  • Set a Docker image
  • Update the Dev Containers settings
• Data pipeline prototype:
  • Create pipeline schema/draft
  • Build a prototype
  • Test deployment on GitHub Actions
• Set forecasting models:
  • Create an MLflow experiment
  • Set backtesting function
  • Define forecasting models
  • Test and evaluate the models' performance
  • Select the best model for deployment
• Set a Quarto dashboard:
  • Create a Quarto dashboard
  • Track the data and forecast
  • Monitor performance
• Productionize the pipeline:
  • Clean the code
  • Define unit tests
• Deploy the pipeline and dashboard to GitHub Actions and GitHub Pages:
  • Create a GitHub Actions workflow
  • Refresh the data and forecast
  • Update the dashboard

The milestones are available in the repository issues section, and you can track them on the project tracker.

The project tracker

Scope

**Goal:** Forecast the hourly demand for electricity in the California Independent System Operator subregion (CISO). This includes the following four providers:
• Pacific Gas and Electric (PGAE)
• Southern California Edison (SCE)
• San Diego Gas and Electric (SDGE)
• Valley Electric Association (VEA)

**Forecast Horizon:** 24 hours

**Refresh:** Every 24 hours

Data

The data is available on the EIA API. The API dashboard provides the GET settings needed to pull the above series.

The GET request details from the EIA API dashboard

General Requirements

• The following functions:
  • Data backfill function
  • Data refresh function
  • Forecast function
  • Metadata function
• Docker image
• EIA API key

Set a Development Environment

To make the deployment to GitHub Actions seamless, we will use Docker. In addition, we will set up a development environment using VS Code and the Dev Containers extension.
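Inside this environment, the functions listed under General Requirements will need the EIA API key and the GET settings from the Scope section. The following is a rough sketch of how such a request could be assembled in Python. The route, the facet names (`parent`, `subba`), the timestamp format, and the `EIA_API_KEY` environment variable name are assumptions that should be checked against the GET settings in the EIA API dashboard, and `build_eia_request` is a hypothetical helper, not a function from the workshop code.

```python
import os
from urllib.parse import urlencode

# Assumed EIA API v2 route for hourly demand by subregion; verify it against
# the GET settings shown in the EIA API dashboard.
API_URL = "https://api.eia.gov/v2/electricity/rto/region-sub-ba-data/data/"

# The four CISO providers listed in the scope.
SUBBA = ["PGAE", "SCE", "SDGE", "VEA"]


def build_eia_request(api_key, subba, start, end, parent="CISO"):
    """Return the full GET URL for one provider's hourly demand series.

    Hypothetical helper: parameter and facet names are assumptions.
    """
    params = [
        ("api_key", api_key),
        ("frequency", "hourly"),
        ("data[0]", "value"),
        ("facets[parent][]", parent),
        ("facets[subba][]", subba),
        ("start", start),
        ("end", end),
    ]
    # urlencode percent-encodes the square brackets in the parameter names.
    return API_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Read the key from the environment instead of hard-coding it.
    key = os.environ.get("EIA_API_KEY", "missing-key")
    for provider in SUBBA:
        print(build_eia_request(key, provider, "2024-06-01T00", "2024-06-02T00"))
```

Keeping the key in an environment variable means the same code can run in the development container and on GitHub Actions without committing the key to the repository.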
Docker Settings

Docker enables us to ship our code to GitHub Actions using the same environment we used to develop and test it. We will use a Dockerfile to set the environment.

To make the image size as minimal as possible, we will use the Python slim image as our baseline along with a Multi-Stage build approach. More details about the Multi-Stage build are available in the Docker documentation and this tutorial.

We will use a Bash script to build the image and push it to Docker Hub. The Dockerfile and its supporting files are under the docker folder.

**Note:** GitHub Actions, by default, does not support ARM64 processors, only AMD64 (e.g., Intel). Therefore, if you are using Apple Silicon (M1/M2/M3) or any other ARM64-based machine, you will have to use Docker BuildX or a similar tool to build the image for the AMD64 architecture.

VS Code Dev Containers Setting

We will use the Dev Containers settings file to set the development environment. If you want to learn more about setting up a dockerized development environment with the Dev Containers extension, please check the Python and R tutorials.

Pipeline Design

Once we have a clear scope, we can start designing the pipeline. The pipeline has the following two main components:
• Data refresh
• Forecasting model

I typically start by drafting the process using paper and pencil (or the electronic version using an iPad and Apple Pencil 😎). This helps me better understand what functions I need to build:

The data pipeline draft

Drawing the pipeline components and sub-components helps us plan the required functions that we need to build.
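To make the planning step concrete, here is a minimal sketch of the decisions the two components have to make on each run: whether new data is available, and whether a new forecast is due. The function names and the rule used to trigger a forecast are hypothetical, chosen only to illustrate the kind of functions the draft helps identify. They are not taken from the workshop code.

```python
from datetime import datetime, timedelta

HORIZON_HOURS = 24  # forecast horizon from the scope


def refresh_window(last_local, last_api):
    """Return (start, end) of the hours to pull, or None if the data is current."""
    if last_api <= last_local:
        return None
    return last_local + timedelta(hours=1), last_api


def forecast_is_due(last_forecast_end, last_local):
    """Assumed rule: forecast again once actuals cover the previous horizon."""
    return last_local >= last_forecast_end


def next_forecast_range(last_local, horizon_hours=HORIZON_HOURS):
    """Return the first and last hour of the next forecast."""
    start = last_local + timedelta(hours=1)
    return start, start + timedelta(hours=horizon_hours - 1)


if __name__ == "__main__":
    last_local = datetime(2024, 6, 18, 23)  # last hour stored locally
    last_api = datetime(2024, 6, 19, 23)    # last hour available on the API

    window = refresh_window(last_local, last_api)
    if window is not None:
        print("Pull data from", window[0], "to", window[1])
        last_local = window[1]

    if forecast_is_due(datetime(2024, 6, 19, 23), last_local):
        print("Forecast range:", next_forecast_range(last_local))
```

Separating the "is there anything to do" checks from the data pull and the model call keeps each scheduled run cheap when nothing has changed.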
Once the pipeline is ready, I usually create a design blueprint to document the process:

The pipeline final design

The pipeline will have the following two components:
• Data refresh function to keep the data up-to-date
• Forecast refresh to keep the forecast up-to-date

In addition, we will use the following two functions locally to prepare the data and models:
• Backfill function to initiate (or reset) the data pipeline
• Backtesting function to train, test, and evaluate time series models

We will set the pipeline to render and deploy a dashboard on GitHub Pages whenever we refresh the data or the forecast. We will use a JSON file to define the pipeline settings. Thi…