Deploying Models into Production with Vetiver

Myles Mitchell @ Jumping Rivers

Before we start…

Who am I?

  • Data Scientist @ Jumping Rivers:

    • Python & R support for clients.

    • Teach courses in programming, SQL, ML.

  • Enjoy the outdoors & travel.

  • Organise North East & Leeds data science meetups.

Talk plan

  • Beginner's guide to MLOps

  • Using free / open source software (as much as possible)

  • Walk through the steps of building and deploying a model

  • MLOps tips & tricks

Some context…

Jumping Rivers

↗ jumpingrivers.com   𝕏 @jumping_uk

  • Machine learning
  • Dashboard development
  • R packages and APIs
  • Data pipelines
  • Code review
     

Posit

  • Formerly RStudio

  • JR is an official partner and assists clients with:

    • Posit Connect - host apps and APIs
    • Posit Workbench - running code in the cloud
    • Posit Package Manager - R / Python package management

Posit frameworks

  • Posit maintains free and open source frameworks including:

    • Quarto - automated reporting
    • Shiny - interactive web apps
    • Vetiver - MLOps framework
  • Compatible with R, Python and more!

What is MLOps?

Typical data science workflow

  • Data is imported and tidied.
  • Cycle of data transformation, visualisation and modelling.
  • Results are communicated to an external audience.

From Classical Stats to Machine Learning

  • Classical statistical modelling prioritises understanding the system behind the data.
  • By contrast, machine learning tends to prioritise prediction.
  • As data grows we retrain our ML models to optimise predictive power.
  • A goal of MLOps is to streamline this cycle.

MLOps: Machine Learning Operations

  • Framework to continuously build, deploy and maintain ML models.
  • Encapsulates the “full stack” from data acquisition to model deployment.
  • Includes versioning, deployment and monitoring.

MLOps frameworks

  • Amazon SageMaker
  • Google Cloud Platform
  • Kubeflow (ML toolkit for Kubernetes)
  • Vetiver by Posit (free to install, nice for beginners)
  • And the list goes on…

     

Vetiver

  • Integrates with popular ML libraries in R and Python.
  • Fluent tooling to version, deploy and monitor a trained model.
  • Deploy to a cloud service or to localhost.

Let’s build an MLOps stack!

Data

  • Palmer Penguins dataset:

    library("palmerpenguins")
    
    names(penguins)
    [1] "species"           "island"            "bill_length_mm"   
    [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
    [7] "sex"               "year"             
  • Let’s predict species using flipper length, body mass and island!

[Figure: Palmer Penguins dataset — scatter plot showing a positive relationship between penguin flipper length and body mass. Points are coloured by species and shaped by island; Gentoo penguins tend to have higher body mass and flipper length than Adelie and Chinstrap.]

Data tidying

  • Using {tidyr} and {rsample}:

    # Drop missing data
    penguins_data = tidyr::drop_na(penguins)
    
    # Split into train and test sets
    penguins_split = rsample::initial_split(
      penguins_data, prop = 0.8
    )
    train_data = rsample::training(penguins_split)
    test_data = rsample::testing(penguins_split)

Modelling

  • Let’s set up the model recipe in {tidymodels}:

    library("tidymodels")
    
    model = recipe(
      species ~ island + flipper_length_mm + body_mass_g,
      data = train_data
    ) |>
      workflow(nearest_neighbor(mode = "classification")) |>
      fit(train_data)

Model testing

  • Our model object can now be used to predict species:

    model_pred = predict(model, test_data)
    
    # Accuracy for unseen test data
    mean(
      model_pred$.pred_class == as.character(
        test_data$species
      )
    )
    [1] 0.8656716

Enter Vetiver!

  • Convert our {tidymodels} model to a {vetiver} model:

    v_model = vetiver::vetiver_model(
      model,
      model_name = "k-nn",
      description = "penguin-species"
    )
    v_model
    
    ── k-nn ─ <bundled_workflow> model for deployment 
    penguin-species using 3 features
  • Contains all the info needed to version, store and deploy our model!

Model versioning

  • Use {pins} to store R or Python objects for reuse later.

  • Store pins using “boards” including Posit Connect, Amazon S3 or even Google Drive!

  • Storing in a temporary directory:

    model_board = pins::board_temp(
      versioned = TRUE
    )
    model_board |>
      vetiver::vetiver_pin_write(v_model)

Retrieving a model

  • Retrieve a model

    model_board |> vetiver::vetiver_pin_read("k-nn")
    
    ── k-nn ─ <bundled_workflow> model for deployment 
    penguin-species using 3 features
  • Inspect the stored versions

    model_board |> pins::pin_versions("k-nn")
    # A tibble: 1 × 3
      version                created             hash 
      <chr>                  <dttm>              <chr>
    1 20241017T093607Z-4381f 2024-10-17 10:36:07 4381f

Model deployment

  • We deploy models as APIs which take input data and send back model predictions.

  • APIs can be hosted at public endpoints on the web.

  • We can run them on localhost during testing / development.

  • {vetiver} uses {plumber} to create a model API.

Deploying locally

  • {vetiver} and {plumber} support local deployment:

    plumber::pr() |>
      vetiver::vetiver_api(v_model) |>
      plumber::pr_run()
  • Query the API via a simple dashboard or the command line.

  • Great for beginners to MLOps and APIs!
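  • As a sketch, we can query the running API from a second R session with vetiver_endpoint() (this assumes the API is listening on port 8000 — {plumber} prints the actual address on startup):

    ```r
    # Point at the locally running model API
    # (port 8000 is an assumption; check the plumber startup message)
    endpoint = vetiver::vetiver_endpoint(
      "http://127.0.0.1:8000/predict"
    )
    
    # A hypothetical new penguin observation with the three model features
    new_penguin = data.frame(
      island = "Biscoe",
      flipper_length_mm = 214,
      body_mass_g = 5000
    )
    
    # Send the data to the API and receive a predicted species
    predict(endpoint, new_penguin)
    ```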

Deploying to Connect

  • Vetiver integrates nicely with Posit Connect:

    vetiver::vetiver_deploy_rsconnect(
      board = model_board, "k-nn"
    )
  • Easier / quicker if the pinned model is stored on Connect.

  • We can also publish to Amazon SageMaker using vetiver_deploy_sagemaker().

Deploying to other cloud platforms

  • We start by preparing a Docker container:

    vetiver::vetiver_prepare_docker(
      model_board,
      "k-nn"
    )
  • This command:

    • Lists R dependencies with {renv}

    • Stores the {plumber} API code in plumber.R

    • Generates a Dockerfile

Dockerfiles

  • Our Dockerfile contains a series of commands to:

    • Install the system libraries (Windows|Mac|Linux).

    • Set the R version and install the required R packages.

    • Run the API in the deployment environment.

  • Use automated CI/CD to build the API in the cloud environment.
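  • For illustration, a generated Dockerfile looks roughly like this (an illustrative sketch only — the exact file, base image and paths vary by vetiver version and model):

    ```dockerfile
    # Pin an R version via a rocker base image
    FROM rocker/r-ver:4.3.1
    
    # Install system libraries needed by the R packages
    RUN apt-get update && apt-get install -y libcurl4-openssl-dev
    
    # Restore the R packages recorded by {renv}
    COPY vetiver_renv.lock renv.lock
    RUN Rscript -e "install.packages('renv'); renv::restore()"
    
    # Copy the generated API code and serve it
    COPY plumber.R /opt/ml/plumber.R
    EXPOSE 8000
    ENTRYPOINT ["Rscript", "-e", "plumber::plumb('/opt/ml/plumber.R')$run(host = '0.0.0.0', port = 8000)"]
    ```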

Model monitoring

  • As our data grows, run regular checks of model performance.

  • Monitor key model metrics over time (requires a date column):

    metrics =
      augment(v_model,
              new_data = new_penguins) |>
      vetiver_compute_metrics(date,
                              "year",
                              species,
                              .pred)

Model drift

  • Store our metrics in our {pins} board:

    model_board |>
      vetiver_pin_metrics(metrics, "k-nn_metrics")
  • Plot the metrics:

    vetiver::vetiver_plot_metrics(metrics)
  • Over time we may notice a drop in performance…

Model drift

  • Model performance may drift as the data evolves…
    • Data drift: the statistical distribution of the input features changes.
    • Concept drift: the relationship between the target and input variables changes.
  • The context in which your model was trained matters!
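  • One simple way to flag data drift (not a vetiver feature — just base R) is a two-sample Kolmogorov–Smirnov test comparing a feature's training distribution against new data; the values below are simulated stand-ins:

    ```r
    # Simulated stand-ins for training-time and newly collected flipper lengths
    set.seed(42)
    train_flipper = rnorm(200, mean = 201, sd = 14)
    new_flipper   = rnorm(200, mean = 215, sd = 14)
    
    # Two-sample KS test: a small p-value suggests the input
    # distribution has shifted since training
    result = stats::ks.test(train_flipper, new_flipper)
    result$p.value
    ```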

Aside: What about Python?

  • Vetiver is available for both Python and R!

  • In Python you would use Python ML libraries rather than {tidymodels}:

    • scikit-learn
    • PyTorch
    • XGBoost
    • statsmodels
  • Vetiver documentation: vetiver.posit.co

MLOps tips & tricks

Data

  • Move from large CSVs to more efficient formats like Parquet and Arrow.
  • Tools like Apache Spark can speed up data processing.
  • Add a data validation step.
  • Version your data.
  • Your preferred ML platform probably has built-in tools for data wrangling.
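  • For example, a CSV-to-Parquet switch with {arrow} is a one-liner (assuming the {arrow} package is installed):

    ```r
    library("arrow")
    library("palmerpenguins")
    
    # Write the penguins data to a Parquet file in a temporary location
    path = tempfile(fileext = ".parquet")
    arrow::write_parquet(penguins, path)
    
    # Reading Parquet back is typically much faster than re-parsing a CSV
    penguins_back = arrow::read_parquet(path)
    ```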

Modelling

  • Consider creating an R package to encourage proper documentation, testing and dependency management.
  • Consider auto-ML tools like H2O.ai and SageMaker Autopilot for model selection.
  • Version and store your models for reuse later.

Deployment

  • Try deploying locally to check that your model API works as expected.
  • Use environment managers like {renv} to store model dependencies.
  • Use containers like Docker to bundle model source code with dependencies.
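  • With {renv}, recording dependencies takes two calls:

    ```r
    # Set up a project-local package library (run once per project)
    renv::init()
    
    # Write the exact package versions in use to renv.lock,
    # which the Dockerfile later restores from
    renv::snapshot()
    ```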

Cost considerations

  • Some cloud platforms offer free trials (e.g., SageMaker 2-month trial).
  • May be cheaper if you’re already invested in a particular cloud platform:
    • Data services
    • App deployment
  • Costs can rise depending on computational resources consumed.
  • Model building and deployment use different environments.

Benefits of an MLOps workflow

  • Retraining and redeployment can happen at the click of a button.

  • Encourages good practices like model versioning.

  • Reduces human error.

  • Well defined and reproducible.

  • Consider whether it is worth the cost/effort before starting.

Thanks for listening!