How Itau Unibanco built a CI/CD pipeline for ML using Kubeflow
AI & Machine Learning | August 19, 2019
Itau Unibanco is the largest private sector bank in Brazil, with a mission to put its customers at the center of everything they do as a key driver of success. As a result, one of its projects is AVI (Itau Virtual Assistant), a digital customer service tool that uses natural language processing, built with machine learning, to understand customer questions and respond in real time.
AVI helps about a million customers per month. It answers all but 2% of customer questions, and answers those questions correctly 85% of the time. In instances where AVI is not best suited to help a customer, it transitions to a live agent transparently, and in the same channel.
To help continually improve and evolve AVI, as well as Itau’s other projects that use machine learning, they needed an efficient strategy for machine learning model deployment. However, they quickly found that building a robust tool that helps their data scientists deploy, manage and govern ML models in production proved challenging. As a result, the team began working with Google Cloud to create a CI/CD pipeline based on the open source project Kubeflow, for online machine learning training and deployment. Here’s how they did it.
How Itau built their pipeline
A machine learning project lifecycle mainly comprises four major stages, executed iteratively.
Once a data scientist has a set of well-performing machine learning models, they need to operationalize them for other applications to consume. Depending on the business requirements, predictions are produced either in real time or on a batch basis. For the AVI project, two business requirements were essential: (1) the ability to have multiple models in production (whether using different techniques or models trained using distinct data), and (2) the ability to retrain the production model with new data.
Although the data science and machine learning features are well cared for by the AVI multidisciplinary team, both model training and deployment are still not fully automated at Itau. Additionally, internal change management procedures can mean it takes up to one week to retrain and deploy new models. This has made ML initiatives hard to scale for Itau. Once the CI/CD pipeline is integrated with the AVI platform, the bank hopes that training and deployment will take hours instead of days, or even faster by using GPU or TPU hardware.
Some of the main requirements for this deployment pipeline and serving infrastructure include:
- The Itau team may work with several ML model architectures in parallel. Each of these models is called a “technique” in the team’s internal jargon.
- Promoting a new technique to production should be an automated process, triggered by commits to specific branches.
- It should be possible to re-train each model on new data in the production environment, triggered by the front-end used by agent managers.
- Several versions of the same or different models could be served simultaneously, for A/B test purposes or to serve different channels.
Architecture
Itau has a hybrid and multi-cloud IT strategy based on open source software and open standards to guarantee maximum portability and flexibility. This created a natural alignment with Google Cloud, which is also committed to open source and hybrid/multi-cloud. Therefore, the architecture was planned around open source platforms, tools and protocols, including Kubeflow, Kubernetes, Seldon Core, Docker, and Git. The goal was to have a single overall solution that could be deployed on GCP or on-premises, according to the needs and restrictions of each team inside the company.
This is the high-level, conceptual view of the architecture:
Models start their lives as code in the source repository, and data in object storage. A build is triggered in the CI server, producing new container images with the model code packaged for training. The CI process also compiles and uploads a pipeline definition to the training platform, and triggers a new training run with the latest data. At the end of the training pipeline, if everything runs well, a new trained model is written to object storage, and a new serving endpoint is started. The front-end server of the customer service application will use these API endpoints to obtain model predictions from a given input. Service administrators use the same application to manage training example data and classes. These users can trigger the training of a new model version with a new dataset. This is accomplished by triggering a new run of the training pipeline, with no need to reload or re-compile source code.
For this project, the concrete architecture was instantiated with the following components:
Itau’s centralized infrastructure teams have selected Jenkins and GitLab as their standard tools for integration and source control, respectively, so these tools were used to build the integrated pipeline. For the container registry and object storage, the cloud-native solutions Container Registry and Cloud Storage were used, since they should be easy to replace with on-premises equivalents without many changes. The core of the system is Kubeflow, the open source platform for ML training and serving that runs on Kubernetes, the industry standard open source container orchestrator. Itau tested the platform with two flavors of Kubernetes: Origin, the open source version of Red Hat OpenShift, used by Itau in its private cloud, and Google Kubernetes Engine (GKE), for easier integration and faster development. Kubeflow runs well on both.
The centerpiece of the pipeline is Kubeflow Pipelines (KFP), which provides an optimized environment to run ML-centric pipelines, with a graphical user interface to manage and analyze experiments. Kubeflow Pipelines are used to coordinate the training and deployment of all ML models.
Implementation
In the simplest case, each pipeline should train a model and deploy an endpoint for prediction. In Kubeflow Pipelines, this takes the shape of a two-step graph: a training step whose output feeds a deployment step.
Since this platform will potentially manage several ML models, Itau agreed on a convention of repository structure that must be followed for each model:
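An illustrative layout following that convention (the build script’s file name is an assumption; everything else comes from the description below):

<model_name>/
├── Dockerfile        # builds the training image
├── build_push.sh     # optional: issues docker build and push
├── src/
│   └── trainer.sh    # entry point that starts training, plus model code
└── pipeline/
    └── pipeline.py   # Kubeflow Pipeline definition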
The root of each model’s directory should contain a Dockerfile, used to build the image that will train the model, and an optional shell script to issue the docker build and push commands. The src subdirectory contains all source code, including a script called trainer.sh that initiates the training process. This script receives three parameters, in the following order: the path to the training data set, the path to the evaluation data set, and the output path where the trained model should be stored.
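To make that contract concrete, here is a minimal sketch of a training entry point that trainer.sh could invoke. It is illustrative only: the CSV column names, the stand-in scikit-learn pipeline, and a local output path are assumptions, not Itau’s actual model code (which, as shown later, also exposes a custom predict_ranking method used at serving time).

import sys

import dill
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def main():
    # The three positional arguments mandated by the trainer.sh contract
    train_data_path, eval_data_path, output_path = sys.argv[1:4]

    train_df = pd.read_csv(train_data_path)  # assumed columns: text, label
    eval_df = pd.read_csv(eval_data_path)

    # Stand-in classifier; the real project pairs scikit-learn with spaCy
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_df['text'], train_df['label'])
    print('eval accuracy:', model.score(eval_df['text'], eval_df['label']))

    # Serialize with dill so a generic PKL server can load it back
    with open(output_path, 'wb') as f:
        dill.dump(model, f)

if __name__ == '__main__':
    main()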
The pipeline directory contains pipeline.py, the definition of the Kubeflow Pipeline that will perform the training and deployment of the model. We’ll take a better look at this definition later.
Container Images
Each step in a KFP pipeline is implemented as a container image. For the minimum viable product (MVP), Itau created three container images:
- Model trainer (sklearn_spacy_text_trainer)
- Model deployment script (serving_deployer)
- Model serving with Seldon Core (pkl_server)
The model trainer image is built from the model source code tree, with the Dockerfile shown in the file structure above. The other two images are more generic, and can be reused for multiple models, receiving the specific model code as runtime parameters.
The model trainer and deployer containers are built by simple shell scripts from their respective Dockerfiles. The model serving container is built with the s2i utility, which automatically assembles a container from the source code tree, using the Seldon Python 3.6 base image. The shell script below shows how that’s accomplished:
#!/bin/bash
# Tag images with the current commit hash for traceability
VERSION=$(git log -1 --pretty=%H)
REPO=gcr.io/ml-cicd
IMAGE=pkl_server
SRC_DIR=src

# Assemble the serving container from source with s2i, on the Seldon base image
s2i build $SRC_DIR seldonio/seldon-core-s2i-python36:0.7 ${IMAGE}

# Tag and push the image to the container registry
docker tag $IMAGE ${REPO}/${IMAGE}:${VERSION}
echo "Pushing image to ${REPO}/${IMAGE}:${VERSION}"
docker push ${REPO}/${IMAGE}:${VERSION}
Pipeline definition
A pipeline in Kubeflow Pipelines is defined with a Python-based domain-specific language (DSL), which is then compiled into a YAML configuration file. There are two main sections to a pipeline definition: (1) definition of operators and (2) instantiation and sequencing of those operators.
For this sample pipeline, an operator was defined for the trainer container and one for the deployer. They are parameterized to receive relevant dynamic values such as input data path and model endpoint name:
import kfp
from kfp import components
from kfp import dsl
from kfp import gcp

TRAINER_CONTAINER_VERSION = '75a29c29a6d7b9f27ab3dc45ee47eb2c99751dff'
DEPLOYER_CONTAINER_VERSION = '75a29c29a6d7b9f27ab3dc45ee47eb2c99751dff'

def train_op(train_data, eval_data, output):
    return dsl.ContainerOp(
        name='train_sklearn_spacy_text_model',
        image='gcr.io/ml-cicd/sklearn_spacy_text_trainer:{}'.format(
            TRAINER_CONTAINER_VERSION),
        arguments=[
            train_data,
            eval_data,
            output,
        ],
        file_outputs={
            'model_path': '/output.txt',
        })

def deploy_op(model_name, model_path, model_version):
    return dsl.ContainerOp(
        name='deploy_sklearn_spacy_text',
        image='gcr.io/ml-cicd/serving_deployer:{}'.format(
            DEPLOYER_CONTAINER_VERSION),
        arguments=[
            model_name,
            model_path,
            model_version,
        ],
        file_outputs={
            'output': '/output.txt',
        })
The pipeline itself declares the parameters that will be customizable by the user in the KFP UI, then instantiates the operations with relevant parameters. Note that there is no explicit dependency between the train and deploy operations, but since the deploy operation relies on the output of the training as an input parameter, the DSL compiler is able to infer that dependency.
@dsl.pipeline(
    name='SKLearn Model Pipeline',
    description='A pipeline that trains and deploys an SKLearn model.')
def train_deploy_pipeline(
        model_path='gs://ml-cicd/models/sklearn_spacy_text',
        train_data='gs://ml-cicd/data/sklearn_spacy_text/train/train.csv',
        eval_data='gs://ml-cicd/data/sklearn_spacy_text/train/train.csv',
        model_version='000'):
    model_name = 'sklearnspacytext'
    output_path = "{}/{}/{}".format(model_path, model_name, model_version)
    train_operation = train_op(train_data, eval_data, output_path).apply(
        gcp.use_gcp_secret('user-gcp-sa'))
    deploy_operation = deploy_op(model_name,
                                 train_operation.outputs["model_path"],
                                 model_version)

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(train_deploy_pipeline, __file__ + '.zip')
Pipeline build and deploy
A commit to the main branch will trigger a build in Jenkins. The build script will execute the following steps:
1. Build the containers
2. Compile the KFP pipeline definition
3. Upload the new pipeline to KFP
4. Trigger a run of the new pipeline to train the model (this step is optional, depending on what makes sense for each model and the team’s workflow)
The sample script below executes steps 2 and 3, receiving a descriptive pipeline name as an argument:
# URL-encode spaces in the human-readable pipeline name
PIPELINE_NAME=$(echo "$1" | sed 's/ /%20/g')

# Step 2: compile the pipeline DSL into a deployable package
dsl-compile --py pipeline.py --output pipeline.tar.gz

# Step 3: upload the package through the KFP REST API
curl -F "uploadfile=@pipeline.tar.gz" "http://localhost:8080/pipeline/apis/v1beta1/pipelines/upload?name=$PIPELINE_NAME"
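Alternatively, the same upload-and-run flow can be scripted with the KFP Python SDK instead of raw REST calls. This is a minimal sketch, not part of Itau’s pipeline: the host URL matches the one used above, while the experiment name, job name, and parameter override are made-up values for illustration, and it assumes an SDK version whose run_pipeline accepts a pipeline_id.

import kfp

# Connect to the KFP API server (same endpoint as the curl examples)
client = kfp.Client(host='http://localhost:8080/pipeline')

# Steps 2-3: upload the compiled package under a human-readable name
pipeline = client.upload_pipeline('pipeline.tar.gz',
                                  pipeline_name='sklearn-spacy-text')

# Step 4: start a run of the uploaded pipeline in a dedicated experiment
experiment = client.create_experiment('avi-models')  # hypothetical name
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name='train-and-deploy',
    pipeline_id=pipeline.id,
    params={'model-version': '001'})  # hypothetical override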
Pipeline run
Whenever the training dataset is changed, a user can trigger a model training from the administration UI. Training a model is simply a matter of placing the new data in the right location and starting a new run of the pipeline that is deployed to Kubeflow. If successful, the pipeline will train the model and start a new serving endpoint to be called by the front-end.
curl 'http://localhost:8080/pipeline/apis/v1beta1/runs' \
  -H 'Content-Type: application/json' \
  --data-binary '{
    "name": "REST run 1",
    "pipeline_spec": {
      "parameters": [
        {"name": "model-path",
         "value": "gs://ml-cicd/models/sklearn_spacy_text"},
        {"name": "train-data",
         "value": "gs://ml-cicd/data/sklearn_spacy_text/train/train.csv"},
        {"name": "eval-data",
         "value": "gs://ml-cicd/data/sklearn_spacy_text/train/train.csv"},
        {"name": "model-version",
         "value": "000"}
      ],
      "pipeline_id": "9457b44b-b9c7-46b5-9f62-ffb227a2f4c9"
    }
  }'
This REST call returns a run ID, which the UI back end can use to poll for the run status and notify the user when the run completes or fails.
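The sketch below shows one way the back end might poll for completion, using the same v1beta1 runs endpoint; the host, the 30-second interval, and the set of terminal states are assumptions for illustration:

import time

import requests

def wait_for_run(run_id, host='http://localhost:8080'):
    # Poll the KFP REST API until the run reaches a terminal state
    while True:
        resp = requests.get(
            '{}/pipeline/apis/v1beta1/runs/{}'.format(host, run_id))
        resp.raise_for_status()
        status = resp.json()['run'].get('status')
        if status in ('Succeeded', 'Failed', 'Error'):
            return status
        time.sleep(30)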
Model prediction serving
The final step of the pipeline is, of course, serving model predictions. Since most of Itau’s models are created with scikit-learn, the team leveraged Seldon Core, a component bundled with Kubeflow, to implement the serving endpoints. Seldon Core lets you implement just a simple predict method and takes care of all the plumbing for exposing a REST endpoint, with optional advanced orchestration features.
Since the serving API tends to change infrequently, the team opted to implement a generic class that can serve any model serialized to a PKL file. The deployment definition parameterizes a storage location containing the PKL file and bundled model source code, which the container unpacks and uses for serving. The Python code that achieves this is listed below:
import dill
import os
import tarfile

from storage_util import download_blob

PKL_FILE_NAME = 'model.pkl'
CODE_FILE_NAME = 'model.tar.gz'

class PredictAPI(object):

    def __init__(self, pkl_path=None):
        print("pkl_path={}".format(pkl_path))
        # Fetch and unpack the bundled model source code
        download_blob(os.path.join(pkl_path, CODE_FILE_NAME), CODE_FILE_NAME)
        with tarfile.open(CODE_FILE_NAME, "r:gz") as tar:
            tar.extractall()
        # Fetch the serialized model and deserialize it with dill
        download_blob(os.path.join(pkl_path, PKL_FILE_NAME), PKL_FILE_NAME)
        with open(PKL_FILE_NAME, 'rb') as model_file:
            self.model = dill.load(model_file)

    def predict(self, X, feature_names, names=None, meta=None):
        # Delegate to the model's own ranking method
        retval = self.model.predict_ranking(X)
        return retval
This serving code is deployed for each endpoint by a shell script in the deployer container. The script takes in the location of the trained model and the name and version for the endpoint, generates the necessary configuration, and deploys it to Kubernetes:
# Gather arguments
MODEL_NAME=$1
MODEL_GCS_PATH=$2
MODEL_VERSION=$3
# Names of Seldon entities
DEPLOYMENT_NAME=$MODEL_NAME-$MODEL_VERSION
SPEC_NAME=$MODEL_NAME-spec-$MODEL_VERSION
PREDICTOR_NAME=$MODEL_NAME-predictor-$MODEL_VERSION
# PKL Server container image
IMAGE=pkl_server
VERSION=75a29c29a6d7b9f27ab3dc45ee47eb2c99751dff
REPO=gcr.io/ml-cicd
# Generate config file for deployment
SELDON_CONFIG_FILE=./seldon_config.yaml
cat >$SELDON_CONFIG_FILE <<EOF
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: $DEPLOYMENT_NAME
spec:
  name: $SPEC_NAME
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: ${REPO}/${IMAGE}:${VERSION}
          name: classifier
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
      parameters:
      - type: STRING
        name: pkl_path
        value: $MODEL_GCS_PATH
    name: $PREDICTOR_NAME
    replicas: 1
EOF
# Apply configuration to cluster
kubectl apply -n kubeflow -f $SELDON_CONFIG_FILE
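Once the SeldonDeployment is running, the front-end can request predictions over REST. The exact URL depends on how the cluster exposes Seldon; the sketch below assumes an Ambassador gateway (a common setup with Seldon Core 0.x), and the gateway host and input text are placeholders:

import requests

GATEWAY = 'http://<ambassador-host>'  # placeholder: cluster ingress address
DEPLOYMENT = 'sklearnspacytext-000'   # $MODEL_NAME-$MODEL_VERSION from the script above

# Seldon Core's standard REST payload: a 2-D ndarray of inputs
payload = {'data': {'ndarray': [['how do I reset my password?']]}}

resp = requests.post(
    '{}/seldon/{}/api/v0.1/predictions'.format(GATEWAY, DEPLOYMENT),
    json=payload)
print(resp.json())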
Conclusion
With this relatively simple architecture and very little custom development, Itau was able to build a CI/CD pipeline for machine learning that can accelerate the pace of innovation while simplifying production maintenance for AVI and other teams. Thanks to the openness and flexibility of tools like Kubeflow and Kubeflow Pipelines, other organizations should find it fairly easy to replicate and adapt this approach to their own requirements.
Acknowledgments
This work was created by a joint team between Google Cloud and Itau Unibanco:
- Cristiano Breuel (Strategic Cloud Engineer, Google Cloud)
- Eduardo Marreto (Cloud Consultant, Google Cloud)
- Rogers Cristo (Data Scientist, Itau Unibanco)
- Vinicius Carida (Advanced Analytics Manager, Itau Unibanco)