Introduction to Explainable AI (ML Tech Talks)

Overview: This lecture introduces explainable AI, which concerns the degree to which humans can understand and trust an ML model's predictions. It covers the distinction between interpretability and explainability, a taxonomy of interpretability techniques, and the implementation of one technique called integrated gradients.

Interpretability vs. Explainability

  • Interpretability focuses on model understanding techniques, while explainability more broadly focuses on model explanations and the interface for translating those explanations into human-understandable terms for different stakeholders.
  • These terms are often used interchangeably, so we'll tease apart their distinction.

Taxonomy of Interpretability Techniques

  • We will cover a taxonomy of interpretability techniques before going deeper into the implementation of one technique called integrated gradients.

Integrated Gradients

  • To build an intuition for model interpretability, consider the image of a fireboat. Integrated gradients can highlight the pixels that were important to that decision, giving insight into what the model sees.
  • The purple pixels comprising the boat's water cannons and jets of water are highlighted as more important than the boat itself to the model's prediction.
  • As the modeler, you might ask yourself how the model will generalize to new fireboats, or to fireboats without water jets.

Augmenting Human Performance

  • Leverage these techniques to explain the model to stakeholders as well. Incorporating model confidence scores and feature attribution heat maps improves the accuracy of physicians grading images by highlighting frequently missed features and complementing human judgment on hard-to-predict, lower-quality images.

Explainable AI

  • Explainable AI is a research field on ML interpretability techniques that aims to understand machine learning model predictions and explain them in human-understandable terms to build trust with stakeholders.
  • Since these methods bridge machine learning and human systems, the field draws from computer science and mathematics, as well as economics, behavioral psychology, and human computer interaction.

Stakeholder Groups

  • There are three stakeholder groups for machine learning systems, and they drive the dual aims of model understanding and providing human-interpretable explanations in order to build trust.
  • For engineers building models, the focus is more on interpretable ML techniques for model understanding and improving model performance.
  • Trust is key for model consumers who are looking to understand the impact of model predictions more than the model internals. Interpretable explanations build trust with these end users that model decisions are reliable, equitable, and can be influenced to achieve better outcomes.

Future Research Directions

  • We will conclude at the frontiers of interpretable ML by discussing challenges and a few future research directions.

Model Understanding and Interpretable Explanations

Overview: This section covers the importance of model understanding and interpretable explanations in machine learning systems, as well as common use cases for these techniques. It also provides an overview of the mathematical definition of interpretability and some evaluation criteria for selecting the right interpretable ML method.

Stakeholder Interests and Model Understanding

  • Model understanding is critical to many tasks involved in building and operating machine learning systems.
  • Common use cases include explaining predictions to inform and support human decision-making processes, debugging model performance to inform corrective actions, refining modeling and data collection processes, verifying model behavior is acceptable, and presenting the model's predictions to stakeholders.
  • Interpretable explanations also provide auditable metadata that regulators can use to trace unexpected predictions back to their inputs and inform corrective actions.

Interpretability Definition and Evaluation Criteria

  • Interpretability is about the extent to which a cause (a change in a feature value, a group of features, or model parameters) can affect your model's predictions.
  • Visualizations such as decision trees and neural networks can provide insight into model internals.
  • However, these explanations alone are not meaningful for all stakeholders, nor do all stakeholders necessarily need to understand model internals.
  • It is important to remember that these techniques fall short of explainability of your ML model on their own.
  • When picking a method, it's important to always consider your model's decisions within the context of the requirements of the broader system it operates in.
  • General properties of explanations include completeness, accuracy, consistency, and trustworthiness.

Explanations Must Be Accurate

Overview: This section covers the criteria for interpretability methods and introduces a taxonomy to organize them.

Intrinsic vs Post-Hoc Interpretability

  • Intrinsic interpretability refers to machine learning models that are considered interpretable due to their simple structure, such as decision trees or linear models.
  • Post-hoc interpretability refers to achieving interpretability after model training, such as with permutation feature importance (a sketch follows this list).
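
As a concrete illustration of a post-hoc method, here is a minimal sketch of permutation feature importance using scikit-learn; the synthetic dataset and random forest model are placeholders, not examples from the talk.

```python
# Post-hoc interpretability sketch: permutation feature importance.
# The dataset and model below are placeholder choices for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature column and measure the drop in score; a large drop
# means the model relied heavily on that feature.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean,
                                    result.importances_std)):
    print(f"feature {i}: {mean:.4f} +/- {std:.4f}")
```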

Local vs Global Interpretability

  • Local interpretability refers to whether the scope of interpretability is limited to individual predictions and/or a small part of the model's prediction space.
  • Global interpretability refers to whether the interpretability scope is the entire model's prediction space, typically accomplished with aggregated, ranked contributions of input variables.

Model Specific vs Agnostic Interpretability

  • Model specific interpretability refers to interpretability methods that are restricted to specific model classes by definition.
  • Model agnostic interpretability methods do not rely on model internals, instead relying upon changes in input features or their values to understand how they influence the outputs of your model.

Explanation Output

  • Many methods return feature statistics, measuring a feature's proportional contribution to the prediction, while others extract concepts, decision rules, feature summary visualizations, counterfactual data points, or even simpler approximate models.

Integrated Gradients

Overview: This section covers integrated gradients, a post-hoc explanation method that works with any differentiable model.

Integrated Gradients

  • Integrated gradients aims to explain a model's predictions in terms of its features.
  • It has many use cases, including understanding feature importances, identifying data skew, and debugging model performance.
  • Integrated gradients is a post-hoc explanation method that works with any differentiable model.

Introduction to Integrated Gradients

Overview: This section provides an introduction to integrated gradients, part of a group of methods that use gradients as a measure of importance in the feature space. It explains how integrated gradients can be used to assign feature importance scores and why it is better at identifying edges of objects than standard model gradients.

Understanding Model Functions

  • Early interpretability methods for neural networks assign feature importance scores using gradients, which tell you which features have the steepest local slope relative to the model's prediction at any given point along the model's prediction function.
  • However, gradients only describe local changes in the model's prediction function and do not fully describe the entire model.
  • As the model learns the relationship between the range of individual pixel values and the correct class, the gradients of important features will saturate, meaning they become increasingly small or even go to zero, despite those features being critical to the prediction.
  • Integrated gradients overcomes this by taking small, evenly spaced steps in the feature space and accumulating each pixel's local gradients to create a global score for how much it adds or subtracts to the model's overall output probability.

Visualizing Integrated Gradients

  • The middle image contains gradients from the last model layer with respect to the input. The gradients are muted, and it's difficult to make out the edges that actually distinguish the correctly classified camera object.
  • In comparison, notice on the right how integrated gradients is much better at identifying the edges of the camera object, in particular, highlighting the pixels around the lens as being important. It captures a better representation of the camera object that is more human interpretable.

Applying Integrated Gradients

  • To reinforce this intuition, let's walk through applying integrated gradients to an example fireboat image.
  • Consider the simplified model function represented in the two graphs on the screen, which plot your model's predicted probability for the correct class on the y-axis and feature values along the x-axis.
  • Integrated gradients can be applied to any differentiable model. In the spirit of the original paper, I used a pre-trained version of the same model, Inception V1, which you can download from TensorFlow Hub.
  • At this point, there are two paths forward in this presentation. You can use the implementation of integrated gradients on tensorflow.org without really understanding the details of how it works.
  • The next half of this talk dives much deeper into how integrated gradients works under the hood. For interested parties, we'll cover a mix of code and the formulas it implements.
  • Here's the original integrated gradients equation (reproduced below), which we'll translate from mathematical notation to code in five steps, each explained in greater detail on subsequent slides.
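
For reference, here is that equation as given in the original paper ("Axiomatic Attribution for Deep Networks", Sundararajan et al., 2017), where F is the model, x the input, x' the baseline, and i indexes a feature dimension:

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1}
    \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha
```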

Calculating Gradients

Overview: This section covers how to calculate gradients in order to measure the relationship between changes to input features and changes in a model's predictions. It also discusses how to visualize gradients and how to use the Riemann trapezoidal method to approximate an integral for integrated gradients.

Calculating Gradients

  • The alpha constant consistently increases each interpolated image's intensity or brightness; these ghostly images represent small changes in the feature space.
  • TensorFlow 2's GradientTape object records the gradients between the predicted probabilities and each interpolated image.
  • To connect theory to practice, visualize the gradients from the previous step (a sketch of this computation follows the list).
  • The left plot shows how the model's confidence in the fireboat class varies across alphas, while the right plot shows the average gradient magnitudes over alpha a bit more directly.
  • The equation expresses this using the partial derivative symbol (∂).
  • The model learns the most from gradients at lower values of alpha. Intuitively, once the model has learned the pixels needed to make the correct prediction, those pixel gradients go to 0.
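
A minimal sketch of this computation, assuming a differentiable Keras `model`, float32 `image` and `baseline` tensors of shape (H, W, 3), and a `target_class_idx` for the fireboat class; these names are placeholders, not the talk's exact code:

```python
import tensorflow as tf
import matplotlib.pyplot as plt

# Interpolate along the straight-line path from baseline to input.
alphas = tf.linspace(0.0, 1.0, 51)
interpolated = baseline[tf.newaxis] + \
    alphas[:, tf.newaxis, tf.newaxis, tf.newaxis] * (image - baseline)[tf.newaxis]

# GradientTape records the gradients of the target class probability with
# respect to each interpolated image.
with tf.GradientTape() as tape:
    tape.watch(interpolated)
    probs = model(interpolated)[:, target_class_idx]
grads = tape.gradient(probs, interpolated)

# Left: model confidence across alphas. Right: average gradient magnitude,
# which saturates (approaches zero) at higher alphas.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(alphas, probs)
ax1.set(xlabel="alpha", ylabel="target class probability")
ax2.plot(alphas, tf.reduce_mean(tf.abs(grads), axis=[1, 2, 3]))
ax2.set(xlabel="alpha", ylabel="average gradient magnitude")
plt.show()
```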

Riemann Trapezoidal Method

  • Riemann sums are a popular class of methods for calculating a numerical approximation of the integral in integrated gradients.
  • The code block (sketched below) contains an implementation of the Riemann trapezoidal method.
  • It uses two points to calculate the area of a trapezoid and an average to approximate the integral, or area underneath the model's prediction function.
  • The plot shows how the gradient values sharply approach, and even briefly dip below, zero after an alpha of about 0.6.
  • The trapezoidal Riemann sum provides a more accurate approximation and converges more quickly over m steps.
  • The full function needs some input parameters: a TensorFlow model, an input image, a baseline image, an m_steps parameter, and a batch size parameter.
  • The tf.function decorator compiles the function into a high-performance callable TensorFlow graph.
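
A minimal sketch of the trapezoidal approximation, following the approach in the tensorflow.org tutorial; `gradients` is assumed to be the stacked gradient tensor of shape (m_steps + 1, H, W, 3) from the gradient calculation step:

```python
import tensorflow as tf

def integral_approximation(gradients):
    # Riemann trapezoidal rule: average each pair of adjacent gradient
    # slices, then average across all steps to approximate the integral.
    grads = (gradients[:-1] + gradients[1:]) / tf.constant(2.0)
    return tf.math.reduce_mean(grads, axis=0)
```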

Implementing Integrated Gradients

Overview: This section covers the implementation of integrated gradients for object detection models.

Generating a Vector Range

  • Generate a vector range of evenly spaced m_steps + 1 values between 0 and 1 (see the snippet below).
  • Note that this is m_steps + 1 to ensure inclusion of the outer endpoints of your function.
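
Step 1 as code (a minimal sketch; the `m_steps` value is an arbitrary choice):

```python
import tensorflow as tf

m_steps = 50
# m_steps + 1 evenly spaced interpolation constants, including both endpoints.
alphas = tf.linspace(start=0.0, stop=1.0, num=m_steps + 1)
```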

Batching

  • Implement batching to control the gradient computation bottleneck and scale your function.
  • Create a tensor of interpolated images between your baseline and input image (sketched below).
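
A sketch of the interpolation step (step 2), assuming `baseline` and `image` are float tensors of the same shape:

```python
import tensorflow as tf

def interpolate_images(baseline, image, alphas):
    # Expand dims so a whole batch of interpolated images is built at once.
    alphas_x = alphas[:, tf.newaxis, tf.newaxis, tf.newaxis]
    baseline_x = tf.expand_dims(baseline, axis=0)
    input_x = tf.expand_dims(image, axis=0)
    delta = input_x - baseline_x
    # Straight-line path from baseline to input: baseline + alpha * delta.
    return baseline_x + alphas_x * delta
```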

Calculating Gradients

  • Calculate gradients between the model's output class probabilities and each interpolated image in the batch.
  • Use the TensorArray's stack method to stack your batch tensors row-wise at the end of the batching loop (both steps are sketched below).
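
A sketch of the gradient calculation and batching (step 3), assuming the `interpolate_images` helper from the previous step plus `model`, `baseline`, `image`, `alphas`, `m_steps`, `batch_size`, and `target_class_idx` are already defined:

```python
import tensorflow as tf

def compute_gradients(model, images, target_class_idx):
    # Watch the interpolated images so the target class probability can be
    # differentiated with respect to each input pixel.
    with tf.GradientTape() as tape:
        tape.watch(images)
        probs = model(images)[:, target_class_idx]
    return tape.gradient(probs, images)

# Accumulate per-batch gradients in a TensorArray, then stack them row-wise
# once the batching loop finishes.
gradient_batches = tf.TensorArray(tf.float32, size=m_steps + 1)
for alpha in tf.range(0, len(alphas), batch_size):
    from_ = alpha
    to = tf.minimum(from_ + batch_size, len(alphas))
    alpha_batch = alphas[from_:to]
    interpolated_batch = interpolate_images(baseline, image, alpha_batch)
    gradient_batch = compute_gradients(model, interpolated_batch,
                                       target_class_idx)
    gradient_batches = gradient_batches.scatter(tf.range(from_, to),
                                                gradient_batch)
total_gradients = gradient_batches.stack()
```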

Averaging Gradients

  • Average the total gradients to approximate the integral over your model's prediction function.
  • Scale or normalize your average gradients with respect to your input image (both steps are sketched below).
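
Steps 4 and 5 as code, reusing the `integral_approximation` helper from the Riemann trapezoidal section:

```python
# Step 4: average the accumulated gradients using the trapezoidal rule.
avg_gradients = integral_approximation(gradients=total_gradients)

# Step 5: scale the averaged gradients by the input's offset from the
# baseline so attributions are expressed in terms of the input features.
integrated_gradients = (image - baseline) * avg_gradients
```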

Visualizing Integrated Gradients

  • Sum the absolute values of IG attributions across the image color channels to return a grayscale attribution mask for standalone visualization or overlay on the original image (sketched below).
  • Preserve the direction of the gradient sign, either a plus or a minus, for visualizing on different channels.
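
A minimal visualization sketch, assuming the `integrated_gradients` and `image` tensors from the previous steps, with image values scaled to [0, 1]:

```python
import matplotlib.pyplot as plt
import tensorflow as tf

# Collapse the color channels into a single grayscale attribution mask by
# summing the absolute attribution values per pixel.
attribution_mask = tf.reduce_sum(tf.math.abs(integrated_gradients), axis=-1)

fig, ax = plt.subplots()
ax.imshow(attribution_mask, cmap="inferno")  # standalone attribution mask
ax.imshow(image, alpha=0.4)                  # overlay the original image
ax.axis("off")
plt.show()
```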

Comparing Integrated Gradients

  • IG attributions are complete: the sum of IG attributions for all features equals the difference between your model's output for the input features and your model's output for the baseline (expressed formally after this list).
  • IG attributions are sensitive: any input feature that differs between the input and baseline and results in a different prediction receives a non-zero attribution.
  • IG attributions are consistent, with attributions being the same for functionally equivalent models.
  • IG attributions preserve linear relationships, which are generally more interpretable.
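
The completeness property can be stated formally; with model F, input x, and baseline x':

```latex
\sum_{i} \mathrm{IG}_i(x) = F(x) - F(x')
```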

Understanding Integrated Gradients

Overview: This section provides an overview of the limitations of Integrated Gradients (IG) as an interpretable ML technique, and how it can be used for different production ML use cases.

Limitations of IG

  • IG provides local, not global interpretability. IG attributions can be aggregated using averages or quantiles to describe global properties of a model's predictions, but research is still inconclusive on whether this accurately captures the entirety of global model performance.
  • IG Interpretability is in relation to individual features, not feature interactions and combinations. This limits the expressiveness of explanations to stakeholders to describe complex modeling processes.
  • IG only works with differentiable ML models because it needs gradients as input; tree-based models are not supported.
  • It is hard to select good baselines for integrated gradients. Typically, black images of zero-value pixels or a zero-embedding vector for text are good defaults (see the sketch after this list), but they do have some pitfalls.
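
For concreteness, the default baselines mentioned above might look like this (the shapes are illustrative assumptions, not values from the talk):

```python
import tensorflow as tf

# A black image of zero-value pixels as the baseline for a vision model.
image_baseline = tf.zeros(shape=(224, 224, 3))

# A zero-embedding vector as the baseline for a text model.
embedding_dim = 128
text_baseline = tf.zeros(shape=(embedding_dim,))
```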

Using IG for Production ML Use Cases

  • Feature Importances: IG can be used to differentiate between two dog breeds without any prior understanding. Visual inspection of IG attributions can provide insight into the underlying causal structure between distinguishing features.
  • Debugging Data Skew: Tracking IG feature importances across time and data splits allows for meaningful monitoring of training/serving skew and drift (see the sketch after this list).
  • Examples: IG correctly highlights pixels around the shirt collar and tie, military insignia on the jacket and hat, and various pixels around the face of Rear Admiral and computer scientist Grace Hopper. Conversely, IG does not identify a military uniform in an image taken 120 years earlier of United States General Ulysses Grant.
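
One way to operationalize the skew-monitoring idea is to compare aggregate attribution magnitudes across data splits; a minimal sketch with placeholder data and an arbitrary alerting threshold:

```python
import numpy as np

# Hypothetical inputs: per-example IG attributions for tabular features,
# shape (num_examples, num_features), computed on each data split.
train_attr = np.abs(np.random.randn(1000, 10))    # placeholder data
serving_attr = np.abs(np.random.randn(1000, 10))  # placeholder data

# Compare mean absolute attribution per feature across splits; a large
# relative shift flags potential training/serving skew or drift.
train_mean = train_attr.mean(axis=0)
serving_mean = serving_attr.mean(axis=0)
relative_shift = np.abs(serving_mean - train_mean) / (train_mean + 1e-9)

for i, shift in enumerate(relative_shift):
    if shift > 0.25:  # the alerting threshold here is an arbitrary choice
        print(f"feature {i}: attribution shifted by {shift:.0%}")
```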

Debugging Model Performance with IG Feature Attributions

Overview: This section discusses how to use integrated gradients feature attributions to debug model performance, using the military uniform example, and surveys related tooling and baseline research.

Integrated Gradients Feature Attributions

  • Using IG feature attributions can help identify important features, how well the model generalizes, and if the model is learning incorrect or spurious features.
  • Data sets need to have sufficient representation and density of military uniforms in order for the model to accurately predict them.
  • IG feature attributions provide a useful debugging complement to data set statistics and model evaluation metrics to better inform corrective actions for model performance.
  • Google Cloud's Explainable AI service supports interpretability methods, including integrated gradients, sampled Shapley, and XRAI.
  • Google Cloud's AutoML Tables product also incorporates explainability as a first-class citizen in its predictions, with permutation feature importance for global interpretability and sampled Shapley for feature importances on individual predictions.
  • Baseline selection is an active area of research with various proposals being put forward since the original paper, including averaging across multiple random baselines, blurring inputs, or several other methods as well.

Introduction to XRAI

Overview: This section introduces XRAI, a novel region-based attribution method from Google Research that overcomes the baseline selection problem in Integrated Gradients.

Baseline Selection Problem

  • XRAI builds upon Integrated Gradients by using a novel region-based attribution method to overcome the baseline selection problem.
  • Instead of identifying individually important pixels, XRAI highlights regions of the original image as being important to the prediction.

Implementing XRAI

  • XRAI can be broken down into four steps: dividing an image into many small overlapping regions, computing IG attributions for both black and white baseline images within those small regions, rank-ordering the regions based on which have the most positive attributions, and outputting the most important regions for the prediction of a given class (a rough sketch of the region-ranking idea follows this list).
  • XRAI outperforms existing image-based attribution techniques on several standard industry benchmarks with human labeling.
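
A rough sketch of the region-ranking idea (steps 3 and 4), substituting scikit-image's Felzenszwalb segmentation for XRAI's actual overlapping-segmentation scheme; `image` and `attribution_mask` are assumed NumPy arrays, and the segmentation parameters are arbitrary:

```python
import numpy as np
from skimage.segmentation import felzenszwalb

def rank_regions(image, attribution_mask):
    """Rank image regions by total IG attribution (illustrative only;
    real XRAI uses many overlapping segmentations and two baselines)."""
    segments = felzenszwalb(image, scale=100, sigma=0.5, min_size=50)
    region_scores = []
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        # Sum the attributions within the region to score its importance.
        region_scores.append((seg_id, attribution_mask[mask].sum()))
    # Most important regions first.
    return sorted(region_scores, key=lambda pair: pair[1], reverse=True)
```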

Future Directions for Explainable AI

Overview: This section outlines five predictions for future directions for explainable AI over the next few years.

Predictions

  • Explainable AI will become a key standardized component of automated production ML pipelines and operational monitoring.
  • Model-agnostic interpretability methods will continue to be the focus of explainability research.
  • Explainable AI will converge with causal inference to improve machine learning reliability and generalization.
  • Explanations from ML interpretability methods will look to add uncertainty estimates as a means of further improving human interpretation and building trust in when explanations are reliable and when they may not be.
  • Explainable interfaces will improve to better translate model understanding to human understanding and reach more stakeholders.

Video description

This talk introduces the field of Explainable AI, outlines a taxonomy of ML interpretability methods, walks through an implementation deep dive of Integrated Gradients, and concludes with a discussion of picking attribution baselines and future research directions.

Chapters:
00:00 - Intro
2:31 - What is Explainable AI?
8:40 - Interpretable ML methods
14:52 - Deepdive: Integrated Gradients (IG)
39:13 - Picking baselines and future research directions

Resources:
Integrated gradients → https://goo.gle/2PxfRtq
Vertex AI → https://goo.gle/3ifu7S5
What-if-tool → https://goo.gle/3ehZWbZ
Catch more ML Tech Talks → http://goo.gle/ml-tech-talks
Subscribe to TensorFlow → https://goo.gle/TensorFlow
