Reproducible Machine Learning with Jupyter and Quilt

Aneesh Karve | 2017-12-19 | 5 min read

Jupyter Notebook documents the interaction of code and data. Code dependencies are simple to express:

import numpy as np
import pandas as pd

Data dependencies, on the other hand, are messier: custom scripts acquire files from the network, parse files in a variety of formats, populate data structures, and wrangle data. As a result, reproducing data dependencies across machines, across collaborators, and over time can be a challenge. Domino's Reproducibility Engine meets this challenge by assembling code, data, and models into a unified hub.

We can think of reproducible machine learning as an equation in three variables:

code + data + model = reproducible machine learning

The open source community has produced strong support for reproducing the first variable, code. Tools like git, pip, and Docker ensure that code is versioned and uniformly executable. Data, however, poses entirely different challenges: it is larger than code, comes in a variety of formats, and needs to be written to disk and loaded into memory efficiently. In this article, we'll explore an open source data router, Quilt, that versions and marshals data. Quilt does for data what pip does for code: it packages data into reusable, versioned building blocks that are accessible in Python.

In the next section, we'll set up Quilt to work with Jupyter. Then we'll work through an example that reproduces a random forest classifier.

Launch a Jupyter notebook with Quilt

In order to access Quilt, Domino cloud users can select the "Default 2017-02 + Quilt" Compute environment in Project settings. Alternatively, add the following lines to requirements.txt under Files:

quilt==2.8.0
scikit-learn==0.19.1

Next, launch a Jupyter Workspace and open a Jupyter notebook with Python.

Quilt packages for machine learning

Let's build a machine learning model with data from Wes McKinney's Python for Data Analysis, 2nd Edition. The old way of accessing this data was to clone Wes' git repository, navigate folders, inspect files, determine formats, parse files, and then load the parsed data into Python.

With Quilt the process is simpler:

import quilt
quilt.install("akarve/pydata_book/titanic", tag="features",
              force=True)
# Python versions prior to 2.7.9 will display an SNIMissingWarning

The above code materializes the data from the "titanic" folder of the akarve/pydata_book package. We use the "features" tag to fetch a specific version of the package where a collaborator has done some feature engineering. Each Quilt package has a catalog entry for documentation, a unique hash, and a historical log ($ quilt log akarve/pydata_book).
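If you need an exact, immutable snapshot rather than a moving tag, you can pin the install to a specific hash. Below is a minimal sketch, assuming the Python API mirrors the quilt log CLI command and that quilt.install accepts a hash argument; the hash value is a placeholder you would copy from the log output:

import quilt

# List the package's history of hashes and tags
quilt.log("akarve/pydata_book")

# Pin the install to one immutable snapshot (replace the placeholder
# with a real hash copied from the log output above)
quilt.install("akarve/pydata_book/titanic", hash="<hash-from-log>", force=True)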

We can import data from Wes' book as follows:

from quilt.data.akarve import pydata_book as pb

If we evaluate pb.titanic in Jupyter, we'll see that it's a GroupNode that contains DataNodes:

<GroupNode>
features
genderclassmodel
gendermodel
model_pkl
test
train

We can access the data in pb.titanic as follows:

features = pb.titanic.features()
train = pb.titanic.train()
trainsub = train[features.values[0]]

Note the parentheses in the code sample above: they instruct Quilt to load data from disk into memory. Quilt loads tabular data, such as features, as a pandas DataFrame.
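As a quick sanity check (a small sketch, not Quilt-specific), you can confirm that the called node is an ordinary DataFrame:

# features() returned an in-memory pandas object
print(type(features))   # expected: <class 'pandas.core.frame.DataFrame'>
features.head()         # preview the engineered feature columns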

Let's convert our training data into numpy arrays that are usable in scikit-learn:

trainvecs = trainsub.values
trainlabels = train['Survived'].values

Now let's train a random forest classifier on our data, followed by five-fold cross-validation to measure its accuracy:

from sklearn.model_selection import cross_val_score as cvs
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=4, random_state=0)
rfc.fit(trainvecs, trainlabels)
scores = cvs(rfc, trainvecs, trainlabels, cv=5)
scores.mean()

The model scores 81% mean accuracy. Let's serialize the model.

from sklearn.externals import joblib
joblib.dump(rfc, 'model.pkl')

We can now add the serialized model to a Quilt package so that collaborators can replicate our experiment with both the training data and the trained model (a sketch of publishing such a package follows the verification step below). For simplicity, the titanic sub-package already contains our trained random forest model. You can load the model as follows:

from sklearn.externals import joblib
model = joblib.load(pb.titanic.model_pkl2())
# requires scikit-learn version 0.19.1

To verify that it's the same model we trained above, repeat the cross-validation:

scores = cvs(model, trainvecs, trainlabels, cv=5)
scores.mean()
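If you would rather publish your own package containing the freshly trained model, instead of relying on the copy already stored in akarve/pydata_book, a minimal sketch with Quilt's build and push commands might look like the following. The package name yourname/titanic_model is hypothetical, and the exact signatures may vary slightly between Quilt versions:

import quilt

# Build a package node directly from the serialized model file
# ("yourname/titanic_model" is a placeholder for your own namespace)
quilt.build("yourname/titanic_model/model_pkl", "model.pkl")

# Push the package to the registry so collaborators can install it
quilt.push("yourname/titanic_model", is_public=True)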

Expressing data dependencies

Oftentimes a single Jupyter notebook depends on multiple data packages. We can express data dependencies in a quilt.yml as follows:

packages:
  - uciml/iris
  - asah/mnist
  - akarve/pydata_book/titanic:tag:features

In spirit, quilt.yml is like requirements.txt, but for data. Because the data lives in Quilt packages rather than in version control, your code repository remains small and fast. quilt.yml accompanies your Jupyter notebook files so that anyone who wants to reproduce your notebooks can run quilt install in a terminal and get to work.
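The same dependencies can also be installed programmatically from a notebook. A small sketch, equivalent to the quilt.yml above:

import quilt

# Programmatic equivalents of the quilt.yml entries
quilt.install("uciml/iris")
quilt.install("asah/mnist")
quilt.install("akarve/pydata_book/titanic", tag="features")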

Conclusion

We demonstrated how Quilt works in conjunction with Domino's Reproducibility Engine to make Jupyter notebooks portable and reproducible for machine learning. Quilt's Community Edition is powered by an open source core. Code contributors are welcome.

Aneesh Karve is the Co-founder and CTO of Quilt. Aneesh has shipped products to millions of users around the globe. He has worked as a product manager and lead designer at companies like Microsoft, NVIDIA, and Matterport.