Deploying Models into Production with Vetiver

Myles Mitchell @ Jumping Rivers

Before we start…

Who am I?

  • Data Scientist @ Jumping Rivers:

    • Python & R support for clients.

    • Teach courses in programming, SQL, ML.

  • Enjoy the outdoors & travel.

  • Organise North East & Leeds data science meetups.

Talk plan

  • Beginner's guide to MLOps

  • Using free / open source software (as much as possible)

  • Walk through the steps of building and deploying a model

  • MLOps tips & tricks

Some context…

Jumping Rivers

↗ jumpingrivers.com   𝕏 @jumping_uk

  • Machine learning
  • Dashboard development
  • R packages and APIs
  • Data pipelines
  • Code review
     

Posit

  • Formerly RStudio

  • JR is an official partner and assists clients with:

    • Posit Connect - host apps and APIs
    • Posit Workbench - running code in the cloud
    • Posit Package Manager - R / Python package management

Posit frameworks

  • Posit maintains free and open source frameworks including:

    • Quarto - automated reporting
    • Shiny - interactive web apps
    • Vetiver - MLOps framework
  • Compatible with R, Python and more!

What is MLOps?

Typical data science workflow

  • Data is imported and tidied.
  • Cycle of data transformation, visualisation and modelling.
  • Results are communicated to an external audience.

From Classical Stats to Machine Learning

  • Classical statistical modelling prioritises understanding the system behind the data.
  • By contrast, machine learning tends to prioritise prediction.
  • As data grows we retrain our ML models to optimise predictive power.
  • A goal of MLOps is to streamline this cycle.

MLOps: Machine Learning Operations

  • Framework to continuously build, deploy and maintain ML models.
  • Encapsulates the “full stack” from data acquisition to model deployment.
  • Includes versioning, deployment and monitoring.

MLOps frameworks

  • Amazon SageMaker
  • Google Cloud Platform
  • Kubeflow (ML toolkit for Kubernetes)
  • Vetiver by Posit (free to install, nice for beginners)
  • And the list goes on…

     

Vetiver

  • Integrates with popular ML libraries in R and Python.
  • Fluent tooling to version, deploy and monitor a trained model.
  • Deploy to a cloud service or to localhost.

Let’s build an MLOps stack!

Data

  • Palmer Penguins dataset:

    library("palmerpenguins")
    
    names(penguins)
    [1] "species"           "island"            "bill_length_mm"   
    [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
    [7] "sex"               "year"             
  • Let’s predict species using flipper length, body mass and island!

[Figure: Palmer Penguins dataset — scatter plot showing a positive relationship between penguin flipper length and body mass. Points are coloured by species and shaped by island; Gentoo penguins tend to have higher body mass and flipper length than Adelie and Chinstrap.]

Data tidying

  • Using {tidyr} and {rsample}:

    # Drop missing data
    penguins_data = tidyr::drop_na(penguins)
    
    # Split into train and test sets
    penguins_split = rsample::initial_split(
      penguins_data, prop = 0.8
    )
    train_data = rsample::training(penguins_split)
    test_data = rsample::testing(penguins_split)

Modelling

  • Let’s set up the model recipe in {tidymodels}:

    library("tidymodels")
    
    model = recipe(
      species ~ island + flipper_length_mm + body_mass_g,
      data = train_data
    ) |>
      workflow(nearest_neighbor(mode = "classification")) |>
      fit(train_data)

Model testing

  • Our model object can now be used to predict species:

    model_pred = predict(model, test_data)
    
    # Accuracy for unseen test data
    mean(
      model_pred$.pred_class == as.character(
        test_data$species
      )
    )
    [1] 0.8656716

Enter Vetiver!

  • Convert our {tidymodels} model to a {vetiver} model:

    v_model = vetiver::vetiver_model(
      model,
      model_name = "k-nn",
      description = "penguin-species"
    )
    v_model
    
    ── k-nn ─ <bundled_workflow> model for deployment 
    penguin-species using 3 features
  • Contains all the info needed to version, store and deploy our model!

Model versioning

  • Use {pins} to store R or Python objects for reuse later.

  • Store pins using “boards” including Posit Connect, Amazon S3 or even Google Drive!

  • Storing in a temporary directory:

    model_board = pins::board_temp(
      versioned = TRUE
    )
    model_board |>
      vetiver::vetiver_pin_write(v_model)

Retrieving a model

  • Retrieve a model

    model_board |> vetiver::vetiver_pin_read("k-nn")
    
    ── k-nn ─ <bundled_workflow> model for deployment 
    penguin-species using 3 features
  • Inspect the stored versions

    model_board |> pins::pin_versions("k-nn")
    # A tibble: 1 × 3
      version                created             hash 
      <chr>                  <dttm>              <chr>
    1 20241017T093607Z-4381f 2024-10-17 10:36:07 4381f

Model deployment

  • We deploy models as APIs which take input data and send back model predictions.

  • APIs can be hosted at public endpoints on the web.

  • We can run them on localhost during testing / development.

  • {vetiver} uses {plumber} to create a model API.

Deploying locally

  • {vetiver} and {plumber} support local deployment:

    plumber::pr() |>
      vetiver::vetiver_api(v_model) |>
      plumber::pr_run()
  • Query the API via a simple dashboard or the command line.

  • Great for beginners to MLOps and APIs!
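  • As a sketch, we can query the running API from a second R session with vetiver_endpoint() (this assumes the API is listening on port 8000 — {plumber} prints the actual address on startup):

    ```r
    # Point at the locally running model API
    # (port 8000 is an assumption; check the plumber startup message)
    endpoint = vetiver::vetiver_endpoint(
      "http://127.0.0.1:8000/predict"
    )
    
    # A hypothetical new penguin observation with the three model features
    new_penguin = data.frame(
      island = "Biscoe",
      flipper_length_mm = 214,
      body_mass_g = 5000
    )
    
    # Send the data to the API and receive a predicted species
    predict(endpoint, new_penguin)
    ```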

Deploying to Connect

  • Vetiver integrates nicely with Posit Connect:

    vetiver::vetiver_deploy_rsconnect(
      board = model_board, "k-nn"
    )
  • Easier / quicker if the pinned model is stored on Connect.

  • We can also publish to Amazon SageMaker using vetiver_deploy_sagemaker().

Deploying to other cloud platforms

  • We start by preparing a Docker container:

    vetiver::vetiver_prepare_docker(
      model_board,
      "k-nn"
    )
  • This command:

    • Lists R dependencies with {renv}

    • Stores the {plumber} API code in plumber.R

    • Generates a Dockerfile

Dockerfiles

  • Our Dockerfile contains a series of commands to:

    • Install the system libraries (Windows|Mac|Linux).

    • Set the R version and install the required R packages.

    • Run the API in the deployment environment.

  • Use automated CI/CD to build the API in the cloud environment.
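  • For illustration, a generated Dockerfile looks roughly like this (an illustrative sketch only — the exact file, base image and paths vary by vetiver version and model):

    ```dockerfile
    # Pin an R version via a rocker base image
    FROM rocker/r-ver:4.3.1
    
    # Install system libraries needed by the R packages
    RUN apt-get update && apt-get install -y libcurl4-openssl-dev
    
    # Restore the R packages recorded by {renv}
    COPY vetiver_renv.lock renv.lock
    RUN Rscript -e "install.packages('renv'); renv::restore()"
    
    # Copy the generated API code and serve it
    COPY plumber.R /opt/ml/plumber.R
    EXPOSE 8000
    ENTRYPOINT ["Rscript", "-e", "plumber::plumb('/opt/ml/plumber.R')$run(host = '0.0.0.0', port = 8000)"]
    ```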

Model monitoring

  • As our data grows, run regular checks of model performance.

  • Monitor key model metrics over time (requires a date column):

    metrics =
      augment(v_model,
              new_data = new_penguins) |>
      vetiver_compute_metrics(date,
                              "year",
                              species,
                              .pred)

Model drift

  • Store our metrics in our {pins} board:

    model_board |>
      vetiver_pin_metrics(metrics, "k-nn_metrics")
  • Plot the metrics:

    vetiver::vetiver_plot_metrics(metrics)
  • Over time we may notice a drop in performance…

Model drift

  • Model performance may drift as the data evolves…
    • Data drift: the statistical distribution of the input features changes.
    • Concept drift: the relationship between the target and input variables changes.
  • The context in which your model was trained matters!
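  • One simple way to flag data drift (not a vetiver feature — just base R) is a two-sample Kolmogorov–Smirnov test comparing a feature's training distribution against new data; the values below are simulated stand-ins:

    ```r
    # Simulated stand-ins for training-time and newly collected flipper lengths
    set.seed(42)
    train_flipper = rnorm(200, mean = 201, sd = 14)
    new_flipper   = rnorm(200, mean = 215, sd = 14)
    
    # Two-sample KS test: a small p-value suggests the input
    # distribution has shifted since training
    result = stats::ks.test(train_flipper, new_flipper)
    result$p.value
    ```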

Aside: What about Python?

  • Vetiver is available for both Python and R!

  • In Python you would use Python ML libraries rather than {tidymodels}:

    • scikit-learn
    • PyTorch
    • XGBoost
    • statsmodels
  • Vetiver documentation: vetiver.posit.co

MLOps tips & tricks

Data

  • Move from large CSVs to more efficient formats like Parquet and Arrow.
  • Tools like Apache Spark can speed up data processing.
  • Add a data validation step.
  • Version your data.
  • Your preferred ML platform probably has built-in tools for data wrangling.
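  • For example, a CSV-to-Parquet switch with {arrow} is a one-liner (assuming the {arrow} package is installed):

    ```r
    library("arrow")
    library("palmerpenguins")
    
    # Write the penguins data to a Parquet file in a temporary location
    path = tempfile(fileext = ".parquet")
    arrow::write_parquet(penguins, path)
    
    # Reading Parquet back is typically much faster than re-parsing a CSV
    penguins_back = arrow::read_parquet(path)
    ```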

Modelling

  • Consider creating an R package to encourage proper documentation, testing and dependency management.
  • Consider auto-ML tools like H2O.ai and SageMaker Autopilot for model selection.
  • Version and store your models for reuse later.

Deployment

  • Try deploying locally to check that your model API works as expected.
  • Use environment managers like {renv} to store model dependencies.
  • Use containers like Docker to bundle model source code with dependencies.
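  • With {renv}, recording dependencies takes two calls:

    ```r
    # Set up a project-local package library (run once per project)
    renv::init()
    
    # Write the exact package versions in use to renv.lock,
    # which the Dockerfile later restores from
    renv::snapshot()
    ```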

Cost considerations

  • Some cloud platforms offer free trials (e.g., SageMaker 2-month trial).
  • May be cheaper if you’re already invested in a particular cloud platform:
    • Data services
    • App deployment
  • Costs can rise depending on computational resources consumed.
  • Model building and deployment use different environments.

Benefits of an MLOps workflow

  • Retraining and redeployment can happen at the click of a button.

  • Encourages good practices like model versioning.

  • Reduces human error.

  • Well defined and reproducible.

  • Consider whether it is worth the cost/effort before starting.

Thanks for listening!