How to use CI/CD for your ML Projects?

Posted on August 13, 2020 by Nagdev Amruthnath in Data science

This article was first published on python – Hi! I am Nagdev, and kindly contributed to python-bloggers.

The term CI/CD stands for Continuous Integration and Continuous Delivery/Deployment. Before we jump into how these work, let’s take a step back and walk through the ML process. Most data scientists do their analysis on their own laptops. Every data analytics project involves several steps, and the most common ones are as follows:
1. Data collection
2. Feature extraction
3. Data cleaning and pre-processing
4. Data validation
5. Model building
6. Model testing
7. Model Deployment

In most cases, each of these steps is performed by a different team member. A change to any step can affect the entire process flow (sometimes referred to as the pipeline). If one part of the pipeline gets clogged, the stability of the whole project deteriorates. If you are working for an end customer, the results could be disastrous. This is where CI/CD comes in as a savior for data science projects.

Most data scientists today focus on the data and less on integration and deployment. Too often, data science delivers notebook reports instead of actionable results. This is one of the reasons many companies have turned to hiring ML deployment engineers who work with software engineers to deploy models more gracefully. As a data scientist myself, I am guilty of changing an ML model by retraining it and adding new features to improve its accuracy. When I do this, it affects the entire pipeline. At that point, we need someone to check whether my new model breaks the data pipeline or runs smoothly without affecting anyone. Ahh! That’s exactly what CI/CD does. Before we get to CI/CD, we will talk a little bit about Git.

What is Git?

As per Wikipedia, “Git is a distributed version-control system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files. Its goals include speed, data integrity, and support for distributed, non-linear workflows.”

If you ask me why Git is important for ML, my answer is: “Do you remember the time you trained an ML model with 98% accuracy and, a couple of days later, tried to find the parameters but had not saved them? You wished you could go back and check the code for that.”

If you had used Git, you could look at the changes or even check out that commit and rebuild the model with those parameters. This is especially useful when multiple team members work on the same code.

What is CI/CD?

Continuous Integration

Continuous integration is a development practice in which you integrate your changes into a Git repository by committing and pushing frequently. In ML, when you want to retrain your model, you first create a branch, train the model and commit the changes to that branch. If you have CI set up, an automated process then builds your code and runs the tests. Successful CI means new code changes to an app are regularly built and tested.

Continuous Delivery

In continuous delivery, if your changes from CI are successfully built and tested, then CD would deliver the code to the shared repository. The shared repository would have the new code/model that the rest of the team members could access. The goal of continuous delivery is to have a codebase that is always ready for deployment to a production environment.

Continuous Deployment

The final stage of a mature CI/CD pipeline is continuous deployment. Continuous deployment refers to automatically releasing a developer’s changes from the repository to production, where they are usable by customers.

How can we do CI/CD for ML projects?

To demonstrate how to do CI/CD, I will be using GitLab, because CI/CD is integrated into its free platform. According to Wikipedia, “GitLab is a web-based DevOps lifecycle tool that provides a Git-repository manager providing wiki, issue-tracking and continuous integration/continuous deployment pipeline features, using an open-source license, developed by GitLab Inc.”

According to Tutorials Point, some of the features of GitLab are:
1. GitLab hosts your (private) software projects for free.
2. GitLab is a platform for managing Git repositories.
3. GitLab offers free public and private repositories, issue-tracking and wikis.
4. GitLab is a user friendly web interface layer on top of Git, which increases the speed of working with Git.
5. GitLab provides its own Continuous Integration (CI) system for managing the projects and provides user interface along with other features of GitLab.

I have created a repository called CICD-in-R on GitLab. https://gitlab.com/nagdevAmruthnath1/CICD-in-R

This repository has two R scripts: one for training and one for scoring. In our CI/CD we want to do continuous testing to see if our model gets built and scored without any errors. You will also see a file called .gitlab-ci.yml. This is the file that we will use to set up the CI tasks. We will get to it in a little bit.

trainingScript.R

# load data
data("mtcars")

# view data
head(mtcars)

# predict mileage using linear regression
mpg_model = lm(mpg~., data = mtcars[1:25,])
summary(mpg_model)

# predict data
pds = predict(mpg_model, mtcars[26:32, ])

# Function for Root Mean Squared Error
RMSE = function(m, o){
  sqrt(mean((m - o)^2))
}

# calculate RMSE
RMSE(pds, mtcars$mpg[26:32] )

# view actual vs predicted
data.frame(Actual = mtcars$mpg[26:32], prediction = pds )

# save the model
saveRDS(mpg_model, "mpg_model.RDS")
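
Tying this back to the Git discussion above: an .RDS file is binary, so a commit that contains only the saved model does not show which parameters changed. As a minimal sketch (the file name mpg_model_params.csv is an assumption and not part of the original repository), the fitted coefficients could also be written to a small CSV file that Git can diff:

```r
# Sketch only: store the fitted coefficients in a plain-text file next to the model.
# "mpg_model_params.csv" is a hypothetical file name, not part of the original repo.
data("mtcars")

# Same model as in trainingScript.R
mpg_model <- lm(mpg ~ ., data = mtcars[1:25, ])

# One row per model term with its estimated coefficient
params <- data.frame(
  term = names(coef(mpg_model)),
  estimate = unname(coef(mpg_model))
)

# Plain text, so changes between commits are visible in `git diff`
params_path <- "mpg_model_params.csv"
write.csv(params, params_path, row.names = FALSE)
```

Committing this file together with the training script would let you see, for any past commit, which coefficients the model had at that point.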

scoringScript.R

# read data
data("mtcars")

# load saved model
mpg_model = readRDS("mpg_model.RDS")

# score some new data
data.frame(predicted = predict(mpg_model, mtcars), actual = mtcars$mpg)
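
The two scripts above make a CI job fail only if R itself throws an error. If you also want the job to fail when the model gets worse, one option is a small validation script that stops with an error when a check does not hold, because Rscript then exits with a non-zero status. The following is a sketch under assumptions: the function validate_model and the environment variable MAX_RMSE are hypothetical names, not part of the original repository.

```r
# Sketch of a hypothetical validation script for a CI job.

# Compute the hold-out RMSE and stop with an error if any check fails.
validate_model <- function(model, holdout, target = "mpg", max_rmse = Inf) {
  preds <- predict(model, holdout)
  rmse <- sqrt(mean((preds - holdout[[target]])^2))
  stopifnot(
    length(preds) == nrow(holdout),  # one prediction per hold-out row
    all(is.finite(preds)),           # no NA/NaN/Inf predictions
    rmse <= max_rmse                 # accuracy gate (inactive when max_rmse is Inf)
  )
  rmse
}

data("mtcars")

# In the pipeline this would be readRDS("mpg_model.RDS"); the model is refit
# here only so that the sketch runs on its own.
mpg_model <- lm(mpg ~ ., data = mtcars[1:25, ])

# MAX_RMSE is a hypothetical environment variable; if it is unset,
# only the sanity checks on the predictions are enforced.
max_rmse <- as.numeric(Sys.getenv("MAX_RMSE", unset = "Inf"))

rmse <- validate_model(mpg_model, mtcars[26:32, ], max_rmse = max_rmse)
cat("Hold-out RMSE:", rmse, "\n")
```

The threshold is deliberately left to the caller, since a sensible value depends on the baseline of your own model.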

Before we jump into CI, we need to make sure CI is set up in GitLab. All our builds and tests run on what are called “runners”. A GitLab runner is a build instance that runs the jobs, possibly across multiple machines, and sends the results back to GitLab. Runners can be placed on separate users, servers, or your local machine.

According to Tutorials Point, you can serve your jobs by using either specific or shared runners.

Shared Runners
These runners are useful for jobs from multiple projects that have similar requirements. Instead of using separate runners for many projects, you can use a single runner or a small number of runners to handle multiple projects, which is easier to maintain and update.

Specific Runners
These runners are useful for a single project whose jobs have particular requirements or demands. Specific runners process jobs in a FIFO (First In, First Out) queue, that is, on a first-come, first-served basis. Please refer to Tutorials Point if you want to set up a specific runner.

Once our runners are set up, we can dive into the .gitlab-ci.yml file to see how the tests are defined, as shown below. First we need to define our stages. Here, we have two stages: build and test. In the build stage, we install R in our Docker container, train an R model using trainingScript.R and print a message if the model was built successfully. In the test stage, we use scoringScript.R to score the model on some new data. If everything goes well, we should have a successful CI run. In the script below you will also notice something called tags. A tag specifies which runner the job should use. You can use the tags of any of the available shared runners.

stages:
   - build
   - test
   
# build pipeline  
build container:
  stage: build
  script:
    - apt-get update && apt-get install -y r-base
    - Rscript trainingScript.R
    - echo model trained successfully
  tags:
    - docker

# test pipeline  
test container:
  stage: test
  script:
    - apt-get update && apt-get install -y r-base
    - Rscript scoringScript.R
    - echo model scored successfully
  tags:
    - docker
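
Whether a file written in the build job is still present in the test job depends on how the runner is set up, since each job may start from a fresh checkout. One way to make the hand-off explicit is GitLab’s artifacts keyword. The variant below is a sketch, not the configuration used in the original repository; it assumes trainingScript.R writes mpg_model.RDS to the project directory, and the one-week retention period is only illustrative.

```yaml
stages:
  - build
  - test

# build job: train the model and keep the saved file as an artifact
build container:
  stage: build
  script:
    - apt-get update && apt-get install -y r-base
    - Rscript trainingScript.R
  artifacts:
    paths:
      - mpg_model.RDS    # file written by trainingScript.R
    expire_in: 1 week    # illustrative retention period
  tags:
    - docker

# test job: artifacts from earlier stages are downloaded before the script runs
test container:
  stage: test
  script:
    - apt-get update && apt-get install -y r-base
    - Rscript scoringScript.R
  tags:
    - docker
```

With this in place, scoringScript.R reads the model produced by the same pipeline run rather than a copy that happens to be in the working directory.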

So, every time you make changes and push commits to the repository, CI should automatically trigger the pipeline, which starts a Docker container and runs each job inside it. If the pipeline completes successfully, you will see a passed icon.

Under stages you can also see that both stages completed successfully. You can click each of those stages to see the execution log and a job successful message. This part is very useful when you are trying to troubleshoot failed stages.

Now that the CI pipeline has executed successfully, we can confidently say that the code in the repository is production ready. You might also ask, “What about those fancy icons on the README page? How do I add those?” Go to Settings > CI/CD > Pipeline status, select the code and paste it into the README. That should give you the icon.

If your repository depends on data or on an R/Python package and you want to make sure it still works, say, once a day or once a week, you can schedule cron jobs right in GitLab by selecting CI/CD > Schedules > New Schedule. A schedule set for every Wednesday at 6 PM, for example, would initiate the pipeline weekly and make sure that your repository is up to date and production ready.
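
GitLab schedules accept standard cron syntax, in which every Wednesday at 6 PM is written as 0 18 * * 3 (minute, hour, day of month, month, day of week). If only some jobs should run on a schedule, a job can be restricted to scheduled pipelines with the rules keyword. The job below is a sketch; the job name scheduled retrain is hypothetical and not part of the original repository.

```yaml
# Sketch: this job runs only when the pipeline was started by a schedule
scheduled retrain:
  stage: build
  script:
    - apt-get update && apt-get install -y r-base
    - Rscript trainingScript.R
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  tags:
    - docker
```

Jobs without a rules section, such as the build and test jobs shown earlier, continue to run in every pipeline, scheduled or not.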

Conclusion

In this article we discussed what Git and CI/CD are and how they can be used in ML workflows to produce production-ready models and scripts. We have only scratched the surface of CI here. As projects get more complicated, the CD part comes into play more, with ML models deployed into production through Kubernetes clusters. CI/CD is not only for models and scripts; it can also be extended to notebook reports such as Jupyter or R Markdown. In the last couple of years, I have seen Git and CI/CD practices moving into data science, and I expect companies to look for these skills in their prospective data scientists in the future. In the big picture, Git and CI/CD enable data scientists to collaborate closely with development, operations and deployment teams.

Hope you enjoyed this article, and do check out my other articles.

The post How to use CI/CD for your ML Projects? appeared first on Hi! I am Nagdev.
