February 20, 2020
Machine learning technology continues to improve at a rapid pace, with increasingly accurate models being used to solve more complex problems. However, with this increased accuracy comes greater complexity. This complexity makes debugging models more challenging. To help with this, last November Google Cloud introduced Explainable AI, a tool designed to help data scientists improve their models and provide insights to make them more accessible to end users.
We think that understanding how models work is crucial to both effective and responsible use of AI. With that in mind, over the next few months, we’ll share a series of blog posts that covers how to use AI Explanations with different data modalities, like tabular, image, and text data.
In today’s post, we’ll take a detailed look at how you can use Explainable AI with tabular data, both with AutoML Tables and on Cloud AI Platform.
What is Explainable AI?
Explainable AI is a set of techniques that provides insights into your model’s predictions. For model builders, this means Explainable AI can help you debug your model while also letting you provide more transparency to model stakeholders so they can better understand why they received a particular prediction from your model.
AI Explanations works by returning feature attribution values for each test example you send to your model. These attribution values tell you how much a particular feature affected the prediction relative to the prediction for a model’s baseline example. A typical baseline is the average value of all the features in the training dataset, and the attributions tell how much a certain feature affected a prediction relative to the average individual.
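To make the idea of attribution relative to a baseline concrete, here is a minimal sketch using a hypothetical linear model. The weights, bias, and data are invented for illustration and do not come from the bikes dataset or from AI Explanations itself. For a linear model, each feature's attribution works out to its weight times its distance from the baseline value, and the attributions add up to the gap between the two predictions.

```python
import numpy as np

# Hypothetical linear model; weights, bias, and data are made up for illustration.
weights = np.array([2.0, -1.0, 0.5])
bias = 3.0

def predict(x):
    """Return the model's prediction for a single example."""
    return float(weights @ x + bias)

# Toy "training set": the baseline is the per-feature average.
train = np.array([
    [1.0, 2.0, 0.0],
    [3.0, 0.0, 4.0],
    [2.0, 4.0, 2.0],
])
baseline = train.mean(axis=0)

# The example we want to explain.
example = np.array([4.0, 1.0, 2.0])

# For a linear model, a feature's attribution is weight * (value - baseline value).
attributions = weights * (example - baseline)

# The attributions account for the full gap between the two predictions.
gap = predict(example) - predict(baseline)

print("attributions:", attributions)
print("sum of attributions:", attributions.sum(), "prediction gap:", gap)
```

A feature that sits exactly at its baseline value (the third one here) receives zero attribution, which matches the reading of attributions as "relative to the average individual."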
AI Explanations offers two approximation methods: Integrated Gradients and Sampled Shapley. Both options are available in AI Platform, while AutoML Tables uses Sampled Shapley. Integrated Gradients, as the name suggests, uses the gradients (which show how a prediction is changing at each point) in its approximation. It requires a differentiable model implemented in TensorFlow, and is the natural choice for such models, for example neural networks. Sampled Shapley approximates the discrete Shapley value through sampling. While it doesn’t scale as well in the number of features, Sampled Shapley does work on non-differentiable models, like tree ensembles. Both methods assess how much each feature contributed to a model prediction by comparing the prediction against a baseline. You can learn more about them in our whitepaper.
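The sampling idea behind Sampled Shapley can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used by AI Platform or AutoML Tables: the function name `sampled_shapley`, the `num_paths` parameter, and the toy rule-based model are all assumptions made for this example. Each sampled path starts at the baseline, switches features to the example's values in a random order, and credits each feature with the change in prediction it caused.

```python
import random

def sampled_shapley(predict, example, baseline, num_paths=200, seed=0):
    """Approximate Shapley values by averaging marginal contributions
    over randomly sampled feature orderings (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(example)
    totals = [0.0] * n
    for _ in range(num_paths):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)       # start every path at the baseline
        prev = predict(current)
        for i in order:
            current[i] = example[i]    # switch feature i to the example's value
            new = predict(current)
            totals[i] += new - prev    # marginal contribution of feature i
            prev = new
    return [t / num_paths for t in totals]

def toy_model(x):
    """A non-differentiable, tree-like rule; the third feature is ignored."""
    score = 10.0
    if x[0] > 5:
        score += 4.0
        if x[1] > 0:
            score += 2.0
    return score

example = [8.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = sampled_shapley(toy_model, example, baseline)
gap = toy_model(example) - toy_model(baseline)

print("attributions:", attributions)
print("sum:", sum(attributions), "prediction gap:", gap)
```

Because every path ends at the full example, the attributions always sum to the difference between the example's prediction and the baseline's prediction. The number of model calls grows with both the number of paths and the number of features, which is why the method scales less well on wide datasets.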
About our dataset and scenario
The Cloud Public Datasets Program makes available public datasets that are useful for experimenting with machine learning. For our examples, we’ll use data that is essentially a join of two public datasets stored in BigQuery: London Bike rentals and NOAA weather data, with some additional processing to clean up outliers and derive additional GIS and day-of-week fields.
Using this dataset, we’ll build a regression model to predict the duration of a bike rental based on information about the start and end stations, the day of the week, the weather on that day, and other data. If we were running a bike rental company, we could use these predictions–and their explanations–to help us anticipate demand and even plan how to stock each location.
While we’re using bike and weather data here, you can use AI Explanations for a wide variety of tabular models, taking on tasks as varied as asset valuations, fraud detection, credit risk analysis, customer retention prediction, analyzing item layouts in stores, and many more.
AutoML Tables lets you automatically build, analyze, and deploy state-of-the-art machine learning models using your own structured data. Once your custom model is trained, you can view its evaluation metrics, inspect its structure, deploy the model in the cloud, or export it so that you can serve it anywhere a container runs.
Of course, AutoML Tables can also explain your custom model’s prediction results. This is what we’ll look at in our example below. To do this, we’ll use the “bikes and weather” dataset that we described above, which we’ll ingest directly from a BigQuery table. This post walks through the data ingestion–which is made easy by AutoML–and training process using that dataset in the Cloud Console UI.
Global feature importance
AutoML Tables automatically computes global feature importance for the trained model. This shows, across the evaluation set, the average absolute attribution each feature receives. Higher values mean the feature generally has greater influence on the model’s predictions.
This information is extremely useful for debugging and improving your model. If a feature’s contribution is negligible (that is, its importance value is low), you can simplify the model by excluding it from future training. Based on the diagram below, for our example, we might try training a model without including bike_id.
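As a rough illustration of the computation described above, the following sketch averages absolute per-example attributions to get a global importance score and then flags features with a negligible share. The attribution numbers are invented, the column names other than bike_id are hypothetical, and the 1% threshold is an arbitrary choice for this example; AutoML Tables computes and displays these values for you.

```python
import numpy as np

# Hypothetical feature names and made-up attribution values for illustration.
# Each row holds the per-feature attributions for one evaluation example.
feature_names = ["start_station_id", "day_of_week", "bike_id"]
attributions = np.array([
    [ 120.0, -30.0,  0.5],
    [-200.0,  45.0, -0.2],
    [  80.0, -15.0,  0.1],
    [-100.0,  30.0, -0.4],
])

# Global importance: mean absolute attribution per feature across examples.
global_importance = np.abs(attributions).mean(axis=0)

# Rank features from most to least influential.
ranked = sorted(zip(feature_names, global_importance),
                key=lambda pair: pair[1], reverse=True)

# Flag features whose share of total importance falls below an arbitrary threshold.
share = global_importance / global_importance.sum()
threshold = 0.01
drop_candidates = [name for name, s in zip(feature_names, share) if s < threshold]

for name, value in ranked:
    print(f"{name:>18}: {value:.2f}")
print("candidates to drop:", drop_candidates)
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise cancel out and look unimportant.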