Literature Survey: A Benchmark for Interpretability Methods in Deep Neural Networks

With the ever-increasing adoption of complex and opaque AI systems such as deep learning, explainability and interpretability in AI are becoming increasingly relevant. A lack of understanding of the decision-making mechanisms of these techniques has hindered their adoption in areas where incorrect decisions are catastrophic and in heavily audited areas (e.g. medicine and finance). Methods that shed light on how these systems function are paramount to improving their interpretability, reliability, and auditability.

One of the most common interpretability tasks is estimating the influence of input features on the output of the model. A good understanding of feature importance is extremely helpful in understanding model predictions, isolating undesirable behavior, and demonstrating model reliability. A recent publication by Google Brain submitted to NeurIPS 2019 evaluates and compares many recent interpretability methods for estimating feature importance in large-scale image classification. The authors find that many popular interpretability methods do not produce good estimates of feature importance; in fact, many of these estimates are no better than a random baseline under the evaluation framework they propose.


The authors propose an evaluation mechanism for feature importance methods which measures how the performance of a retrained model degrades as features with high estimated importance are removed. This method is named ROAR, an abbreviation of RemOve And Retrain. The procedure for a given feature importance estimator $e$ is as follows:

  1. Rank the estimated feature importances generated by $e$ into an ordered set.
  2. For different degradation levels $t = [0, 10, …, 100]$ (in percent), generate new datasets where the features corresponding to the top $t$ percent of this ordered set are replaced with the per-channel mean of the original dataset.
  3. Re-train and re-evaluate a new model from random initialization on these new datasets, repeating 5 times for each $t$ to reduce the variance of the reported accuracy.
  4. Compare the accuracies of the original model and the new model.
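
Step 2 of the procedure above can be sketched in NumPy. This is a minimal illustration and not the authors' code: the function name `roar_degrade`, the array layout (N, H, W, C), the use of one importance score per pixel, and `t` expressed as a fraction in [0, 1] are all assumptions made for the example.

```python
import numpy as np

def roar_degrade(images, importances, t, channel_mean):
    """Replace the top-t fraction of pixels (by estimated importance)
    with the per-channel mean. Hypothetical helper for illustration.

    images:       float array of shape (N, H, W, C)
    importances:  float array of shape (N, H, W), one score per pixel
    t:            fraction in [0, 1] of pixels to replace in each image
    channel_mean: float array of shape (C,), computed on the original data
    """
    n, h, w, c = images.shape
    k = int(round(t * h * w))             # pixels to replace per image
    out = images.copy()                   # leave the original data untouched
    if k == 0:
        return out
    flat_scores = importances.reshape(n, h * w)
    # Indices of the k highest-scoring pixels of every image.
    top_idx = np.argsort(-flat_scores, axis=1, kind="stable")[:, :k]
    flat_out = out.reshape(n, h * w, c)   # view into `out`
    rows = np.arange(n)[:, None]
    flat_out[rows, top_idx] = channel_mean
    return out
```

Under these assumptions, the same function would be applied to both the training and test images for each level of $t$, and a fresh model would then be trained on the degraded training set.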

Why do we need to retrain?

Retraining is very computationally expensive, but the authors argue it is necessary because models typically assume that the train and test data come from similar distributions. Without retraining, it is unclear whether accuracy degradation is due to the introduction of artifacts outside the original training distribution or due to the actual removal of information.

Estimators under consideration

Below are the feature importance estimators used in the following experiments. See the Related Work section for details and references to additional publications.

Base estimators:

  • Gradients or Sensitivity Heatmaps
  • Guided Backprop
  • Integrated Gradients

Ensembling methods:

  • Classic SmoothGrad
  • SmoothGrad$^2$
  • VarGrad

Control variants:

  • Random
    • Random binary importance sampled from $Bernoulli(1-t)$ where $(1-t)$ is the probability of a positive outcome.
  • Sobel Edge Filter
    • Convolution of a hard-coded, separable, integer filter over an image which produces a ranking that assigns high scores to edge areas.
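
The two control variants can be sketched as follows. This is an illustrative approximation and not the paper's implementation: the function names, the choice of a 3x3 Sobel kernel, edge padding, and the use of gradient magnitude as the score are assumptions.

```python
import numpy as np

def random_importance(shape, t, rng):
    """Binary importance drawn from Bernoulli(1 - t): each entry is 1
    with probability (1 - t) and 0 otherwise."""
    return (rng.random(shape) < (1.0 - t)).astype(np.float64)

def sobel_importance(gray):
    """Edge-magnitude score from 3x3 Sobel filters on a 2-D grayscale image."""
    smooth = np.array([1.0, 2.0, 1.0])   # separable Sobel: smoothing part
    diff = np.array([-1.0, 0.0, 1.0])    # separable Sobel: derivative part
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")

    def filter_separable(img, col_kernel, row_kernel):
        # Apply col_kernel along axis 0, then row_kernel along axis 1,
        # keeping only the valid region. This is a correlation; the sign
        # difference from a true convolution vanishes in the magnitude.
        h, w = img.shape
        tmp = sum(col_kernel[i] * img[i:h - 2 + i, :] for i in range(3))
        return sum(row_kernel[j] * tmp[:, j:w - 2 + j] for j in range(3))

    gx = filter_separable(padded, smooth, diff)   # horizontal gradient
    gy = filter_separable(padded, diff, smooth)   # vertical gradient
    return np.hypot(gx, gy)
```

Neither control looks at the trained model, which is what makes them useful reference points: an estimator that cannot beat them is not extracting model-specific information about importance.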


In experimentation, the authors used the ResNet-50 architecture for both generating feature importance estimates and retraining on modified datasets. The datasets used were ImageNet, Birdsnap, and Food 101. For each dataset and estimator, new train and test sets are generated with $t = [0, 10, 30, 50, 70, 90]$.

In total, 18 estimators were evaluated (3 base estimators, 3 ensemble methods wrapped around each of these, and squared estimates for each configuration). Overall, 540 modified datasets were generated (180 new datasets for each original dataset).

For each modified dataset, 5 randomly initialized ResNet-50 models are trained on the training set, and the average accuracy of these 5 models on the test set is reported.

Below are the model performances on unmodified datasets:

  • ImageNet: 76.68%
  • Birdsnap: 66.65%
  • Food 101: 84.54%


Evaluating the Random Ranking

The authors find that even when a large portion of the inputs is replaced with the uninformative value, the retrained model still performs very well. This suggests many pixels are likely redundant. It also supports the need for retraining: a traditional deletion metric (i.e. re-evaluation on the same model) causes a much sharper decrease in performance under a similar level of input degradation. This indicates that without retraining the model, it is impossible to separate the performance of the ranking from the degradation caused by the input modification.

Evaluating Base Estimators

Every base estimator was found to perform consistently worse than the random baseline across all datasets and all thresholds, and the gap widens at larger thresholds. Additionally, across all datasets and thresholds, the base estimators perform comparably to or worse than the Sobel baseline. Under a traditional deletion metric these estimators appear to work, but under ROAR they do not outperform the baselines. This further supports the need for retraining.

Evaluating Ensemble Approaches

Classic SmoothGrad is comparable to or worse than the base estimator it wraps. SmoothGrad-Squared and VarGrad, however, greatly improve the quality of the importance ranking. These methods outperform both control variants, as well as the base estimators they wrap, by a large margin. The overall ranking of estimator performance varies by dataset, which indicates that the best choice of underlying estimator may vary by task.

Why these estimators perform best is an open research question at the time of writing. The authors observe that both methods appear to remove whole objects within concentrated areas of an image, whereas other methods remove less concentrated inputs over a wider area. Additional research is needed to understand why this happens and why it produces better results.


  • ROAR (i.e. removal of input features and retraining) is proposed as a feature importance comparison framework.
  • The Gradients, Integrated Gradients, and Guided Backprop base estimators, as well as Classic SmoothGrad ensembles, are comparable to or worse than random assignment under ROAR.
  • SmoothGrad-Squared and VarGrad ensembles provide good estimates of feature importance under ROAR.

Related Work

Gradients and Sensitivity Heatmaps

This paper introduces gradients or sensitivity heatmaps as a form of feature importance in nonlinear classifiers. Under this framework, estimated explanations are local gradients that capture how a feature must be moved to influence the model outcome. When gradients cannot be computed explicitly, a probabilistic approximation can be used to generate estimated gradients instead.
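
As a toy illustration of a local gradient explanation, the sketch below uses a logistic model whose input gradient is available in closed form. The model, the names `sigmoid` and `gradient_saliency`, and the parameters `w` and `b` are assumptions made for the example; they are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_saliency(x, w, b):
    """Gradient of f(x) = sigmoid(w . x + b) with respect to the input x.

    By the chain rule, df/dx = f(x) * (1 - f(x)) * w, so a feature with a
    zero weight receives zero importance at every input.
    """
    p = sigmoid(np.dot(w, x) + b)
    return p * (1.0 - p) * w
```

For a deep network the same quantity would typically be obtained by automatic differentiation, since no closed form is available.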

Guided Backpropagation

This paper proposes the guided backpropagation method which aims to visualize input patterns that cause neuron activation in higher layers. This is named guided backpropagation because it adds an additional guidance signal from higher layers to usual backpropagation. Under this framework, gradient computation is done under a modified backpropagation setup that stops the backward flow of negative gradients, corresponding to the neurons which decrease the activation of the higher layer unit being visualized. This is a combination of normal backpropagation and ‘deconvnet’ approaches.
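
The difference between the three backward rules at a ReLU can be sketched in a few lines. The function names are hypothetical and the sketch covers only the ReLU step, not a full network.

```python
import numpy as np

def relu_backward(pre_activation, upstream_grad):
    """Normal backpropagation: pass the gradient where the forward input was positive."""
    return upstream_grad * (pre_activation > 0)

def deconvnet_relu_backward(upstream_grad):
    """'Deconvnet' rule: pass only positive gradients, ignoring the forward input."""
    return upstream_grad * (upstream_grad > 0)

def guided_relu_backward(pre_activation, upstream_grad):
    """Guided backpropagation: both conditions must hold, so negative
    gradients are stopped even where the forward input was positive."""
    return upstream_grad * (pre_activation > 0) * (upstream_grad > 0)
```

Seen this way, the guided rule is the elementwise intersection of the masks used by the other two.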

Integrated Gradients

This paper proposes an attribution method which assigns importance by decomposing the output activation into contributions from the individual input features. In this particular method, this is done by interpolating a set of gradient estimates for values between a non-informative reference point (in this study, a black image) and the actual input. This interpolation involves an integral over gradients that can be approximated by summing over gradients at points occurring at sufficiently small intervals along the path from the reference point to the input (Riemann approximation).
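
The Riemann approximation described above can be sketched as follows, assuming a straight-line path and a caller-supplied gradient function. The name `integrated_gradients`, the `grad_fn` argument, the midpoint rule, and the default step count are illustrative assumptions.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann approximation of integrated gradients.

    grad_fn:  callable returning the gradient of the model output with
              respect to its input (hypothetical, supplied by the caller)
    x:        the actual input
    baseline: the non-informative reference point (e.g. an all-zero image)
    """
    alphas = (np.arange(steps) + 0.5) / steps      # midpoints in (0, 1)
    total = np.zeros_like(x, dtype=np.float64)
    for a in alphas:
        # Gradient at a point on the straight path from baseline to x.
        total += grad_fn(baseline + a * (x - baseline))
    # Average gradient along the path, scaled by the input difference.
    return (x - baseline) * total / steps
```

A useful sanity check is that the attributions should sum approximately to the difference between the model output at the input and at the baseline; for $f(x) = \sum_i x_i^2$ with a zero baseline, each attribution comes out as $x_i^2$.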


SmoothGrad

This paper proposes SmoothGrad as an ensemble wrapper method for estimating feature importance. In essence, SmoothGrad smooths the underlying gradient method by simply averaging the vanilla sensitivity maps of $n$ noisy images generated from the original image. SmoothGrad$^2$ is identical to SmoothGrad, but each estimate is squared before averaging. Both of these methods can be used to augment any gradient-based method, as seen in this study.
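
A minimal sketch of both wrappers is shown below. It assumes additive Gaussian noise and a caller-supplied gradient function; the function names and the default values of `n_samples` and `sigma` are placeholders rather than settings taken from the paper.

```python
import numpy as np

def noisy_gradients(grad_fn, x, n_samples, sigma, rng):
    """Stack the sensitivity maps of n_samples noisy copies of x."""
    return np.stack([
        grad_fn(x + rng.normal(0.0, sigma, size=x.shape))
        for _ in range(n_samples)
    ])

def smoothgrad(grad_fn, x, n_samples=15, sigma=0.15, rng=None):
    """Average of the noisy sensitivity maps."""
    rng = np.random.default_rng() if rng is None else rng
    return noisy_gradients(grad_fn, x, n_samples, sigma, rng).mean(axis=0)

def smoothgrad_squared(grad_fn, x, n_samples=15, sigma=0.15, rng=None):
    """Same as smoothgrad, but each map is squared before averaging."""
    rng = np.random.default_rng() if rng is None else rng
    return (noisy_gradients(grad_fn, x, n_samples, sigma, rng) ** 2).mean(axis=0)
```

Because `grad_fn` is an argument, the same wrappers apply unchanged to any gradient-based base estimator.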


VarGrad

This paper includes VarGrad as an ensemble method for estimating feature importance. This method is very similar to SmoothGrad: the only difference is that the averaging over vanilla sensitivity maps is replaced with the variance.
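
A self-contained sketch of this variance-based wrapper follows, again assuming additive Gaussian noise and a caller-supplied gradient function. The name `vargrad` and the default parameter values are illustrative assumptions.

```python
import numpy as np

def vargrad(grad_fn, x, n_samples=15, sigma=0.15, rng=None):
    """Elementwise variance of the sensitivity maps of noisy copies of x."""
    rng = np.random.default_rng() if rng is None else rng
    grads = np.stack([
        grad_fn(x + rng.normal(0.0, sigma, size=x.shape))
        for _ in range(n_samples)
    ])
    return grads.var(axis=0)
```

One consequence of using the variance is that an input whose gradient does not change under noise receives an importance of zero, regardless of how large that gradient is.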

Evan Czyzycki
Computer Science PhD Student