In this article, we apply interpretability techniques to a reinforcement learning (RL) model trained to play the video game CoinRun. Our main findings are as follows:
- Dissecting failure. We perform a step-by-step analysis of the agent’s behavior in cases where it failed to achieve the maximum reward, allowing us to understand what went wrong, and why. For example, one case of failure was caused by an obstacle being temporarily obscured from view.
- Hallucinations. We find situations when the model “hallucinated” a feature not present in the observation, thereby explaining inaccuracies in the model’s value function. These were brief enough that they did not affect the agent’s behavior.
- Model editing. We hand-edit the weights of the model to blind the agent to certain hazards, without otherwise changing the agent’s behavior. We verify the effects of these edits by checking which hazards cause the new agents to fail. Such editing is only made possible by our previous analysis, and thus provides a quantitative validation of this analysis.
Our results depend on levels in CoinRun being procedurally-generated, leading us to formulate a diversity hypothesis for interpretability. If it is correct, then we can expect RL models to become more interpretable as the environments they are trained on become more diverse. We provide evidence for our hypothesis by measuring the relationship between interpretability and generalization.
Finally, we provide a thorough investigation of several interpretability techniques in the context of RL vision, and pose a number of questions for further research.
Our CoinRun model
CoinRun is a side-scrolling platformer in which the agent must dodge enemies and other traps and collect the coin at the end of the level.
CoinRun is procedurally-generated, meaning that each new level encountered by the agent is randomly generated from scratch. This incentivizes the model to learn how to spot the different kinds of objects in the game, since it cannot get away with simply memorizing a small number of specific trajectories.
Here are some examples of the objects used, along with walls and floors, to generate CoinRun levels.
There are 9 actions available to the agent in CoinRun: move left (←), move right (→), down (↓), jump (↑), jump left (↖), jump right (↗), and three further actions labelled A, B and C.
We trained a convolutional neural network on CoinRun for around 2 billion timesteps, using PPO. The network takes the current 64×64 RGB observation as input and outputs both a policy over the 9 actions and a value function (see Appendix C for the architecture).
Since the only available reward is a fixed bonus for collecting the coin, the value function estimates the time-discounted probability that the agent will go on to collect the coin.
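To make this concrete, here is one way to write down the quantity being estimated (a sketch in our own notation, with the coin bonus normalized to 1, $T$ the timestep at which the coin is collected, and $\gamma$ the discount factor):

```latex
% Value of state s_t when the only reward is a fixed coin bonus (normalized to 1):
% the expected discount factor raised to the time remaining, and zero if the coin
% is never collected.
V(s_t) \;=\; \mathbb{E}\left[\, \gamma^{\,T-t}\,\mathbf{1}\{\text{coin collected}\} \;\middle|\; s_t \right]
```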
Model analysis
Having trained a strong RL agent, we were curious to see what it had learned. Following prior work on interpretability interfaces, we combined dataset example-based feature visualization, attribution and dimensionality reduction into an interface for inspecting the network.
Here is our interface for a typical trajectory, with the value function as the network output. It reveals the model using obstacles, coins, enemies and more to compute the value function.
Dissecting failure
Our fully-trained model fails to complete around 1 in every 200 levels. We explored a few of these failures using our interface, and found that we were usually able to understand why they occurred.
These failures often boil down to the fact that the model has no memory, and must therefore choose its action based only on the current observation. It is also common for some unlucky sampling of actions from the agent’s policy to be partly responsible.
Here are some cherry-picked examples of failures, carefully analyzed step-by-step.
Hallucinations
We searched for errors in the model using generalized advantage estimation (GAE), which let us surface the timesteps at which events deviated most from the value function’s predictions.
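As a rough illustration (not the article's code; the hyperparameter values and the use of advantage magnitude as a search criterion are our assumptions), GAE advantages can be computed from a trajectory's rewards and value estimates and then ranked by magnitude to surface surprising frames:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.999, lam=0.95):
    """Generalized advantage estimation (GAE) for a single trajectory.

    rewards: array of shape [T], the reward received after each step.
    values:  array of shape [T + 1], value estimates for each state,
             including the state reached after the final step.
    Returns an array of shape [T] of advantage estimates.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error: how wrong the value estimate was at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Frames where the value function was most "surprised" have the
# largest-magnitude advantages and are good candidates for inspection:
# surprising = np.argsort(-np.abs(gae_advantages(rewards, values)))[:20]
```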
Using our interface, we found a couple of cases in which the model “hallucinated” a feature not present in the observation, causing the value function to spike.
Model editing
Our analysis so far has been mostly qualitative. To quantitatively validate our analysis, we hand-edited the model to make the agent blind to certain features identified by our interface: buzzsaw obstacles in one case, and left-moving enemies in another. Our method for this, described in Appendix A, can be thought of as a primitive form of circuit-editing.
We evaluated each edit by measuring the percentage of levels that the new agent failed to complete, broken down by the object that the agent collided with to cause the failure. Our results show that our edits were successful and targeted, with no statistically measurable effects on the agent’s other abilities.
Percentage of levels failed, broken down by the object that the agent collided with (each model was tested on 10,000 levels):

| Model | Buzzsaw obstacle | Enemy moving left | Enemy moving right | Multiple or other |
|---|---|---|---|---|
| Original model | 0.37% | 0.16% | 0.12% | 0.08% |
| Buzzsaw obstacle blindness | 12.76% | 0.16% | 0.08% | 0.05% |
| Enemy moving left blindness | 0.36% | 4.69% | 0.97% | 0.07% |
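As a sanity check on the claim that the edits had no statistically measurable side effects, one can run a simple two-proportion z-test on the failure counts implied by the table (our own check, not from the article; counts are approximate, based on 10,000 levels per model):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z-statistic for the difference between two binomial proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Failures due to enemies moving right: 0.12% (original) vs 0.08% (buzzsaw-edited),
# i.e. roughly 12 vs 8 failures out of 10,000 levels each.
print(two_proportion_z(12, 10_000, 8, 10_000))  # ~0.9, well below any usual threshold
```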
We did not manage to achieve complete blindness, however: the buzzsaw-edited model still performed significantly better than the original model did when we made the buzzsaws completely invisible.
For comparison, here is how the original model fares when buzzsaws are made completely invisible (tested on 10,000 levels):

| Model | Buzzsaw obstacle | Enemy moving left | Enemy moving right | Multiple or other |
|---|---|---|---|---|
| Original model, invisible buzzsaws | 32.20% | 0.05% | 0.05% | 0.05% |
We experimented briefly with iterating the editing procedure, but were not able to achieve more than around 50% buzzsaw blindness by this metric without affecting the model’s other abilities.
Here are the original and edited models playing some cherry-picked levels.
The diversity hypothesis
All of the above analysis uses the same hidden layer of our network, the third of five convolutional layers, since it was much harder to find interpretable features at other layers. Interestingly, the level of abstraction at which this layer operates – finding the locations of various in-game objects – is exactly the level at which CoinRun levels are randomized using procedural generation. Furthermore, we found that training on many randomized levels was essential for us to be able to find any interpretable features at all.
This led us to suspect that the diversity introduced by CoinRun’s randomization is linked to the formation of interpretable features. We call this the diversity hypothesis:
Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).
Our explanation for this hypothesis is as follows. For the forward implication (“only if”), we only expect features to be interpretable if they are general enough, and when the training distribution is not diverse enough, models have no incentive to develop features that generalize instead of overfitting. For the reverse implication (“if”), we do not expect it to hold in a strict sense: diversity on its own is not enough to guarantee the development of interpretable features, since they must also be relevant to the task. Rather, our intention with the reverse implication is to hypothesize that it holds very often in practice, as a result of generalization being bottlenecked by diversity.
In CoinRun, procedural generation is used to incentivize the model to learn skills that generalize to unseen levels.
Interpretability and generalization
To test our hypothesis, we made the training distribution less diverse by training the agent on a fixed set of 100 levels. This dramatically reduced our ability to interpret the model’s features. Here we display an interface for the new model, generated in the same way as the one above. The smoothly increasing value function suggests that the model has memorized the number of timesteps until the end of the level, and the features it uses for this focus on irrelevant background objects. Similar overfitting occurs for other video games with a limited number of levels.
We attempted to quantify this effect by varying the number of levels used to train the agent, and evaluating the 8 features identified by our interface on how interpretable they were.
| Number of training levels | 100 | 300 | 1,000 | 3,000 | 10,000 | 30,000 | 100,000 |
|---|---|---|---|---|---|---|---|
| Levels completed (train, run 1) | 99.96% | 99.82% | 99.67% | 99.65% | 99.47% | 99.55% | 99.57% |
| Levels completed (train, run 2) | 99.97% | 99.86% | 99.70% | 99.46% | 99.39% | 99.50% | 99.37% |
| Levels completed (test, run 1) | 61.81% | 66.95% | 74.93% | 89.87% | 97.53% | 98.66% | 99.25% |
| Levels completed (test, run 2) | 64.13% | 67.64% | 73.46% | 90.36% | 97.44% | 98.89% | 99.35% |
| Features interpretable (researcher 1, run 1) | 52.5% | 22.5% | 11.25% | 45% | 90% | 75% | 91.25% |
| Features interpretable (researcher 2, run 1) | 8.75% | 8.75% | 10% | 26.25% | 56.25% | 90% | 70% |
| Features interpretable (researcher 1, run 2) | 15% | 13.75% | 15% | 23.75% | 53.75% | 90% | 96.25% |
| Features interpretable (researcher 2, run 2) | 3.75% | 6.25% | 21.25% | 45% | 72.5% | 83.75% | 77.5% |

Percentages of levels completed are estimated by sampling 10,000 levels with replacement.
Our results illustrate how diversity may lead to interpretable features via generalization, lending support to the diversity hypothesis. Nevertheless, we still consider the hypothesis far from proven.
Feature visualization
Gradient-based feature visualization has previously been shown to struggle with RL models trained on Atari games, and we found it similarly difficult to apply to our CoinRun model. We tried a number of modifications in an attempt to make it work:
- Transformation robustness. This is the method of stochastically jittering, rotating and scaling the image between optimization steps, to search for examples that are robust to these transformations. We tried both increasing and decreasing the size of the jittering. Rotating and scaling are less appropriate for CoinRun, since the observations themselves are not invariant to these transformations.
- Penalizing extremal colors. By an “extremal” color we mean one of the 8 colors with maximal or minimal RGB values (black, white, red, green, blue, yellow, cyan and magenta). Noticing that our visualizations tended to use extremal colors towards the middle, we tried including in the visualization objective an L2 penalty of various strengths on the activations of the first layer. This successfully reduced the size of the extremally-colored region, but did not otherwise help.
- Alternative objectives. We tried using alternative optimization objectives, such as the caricature objective: maximizing the dot product between the activations of the input image and the activations of a reference image. Caricatures are often an especially easy type of feature visualization to make work, and helpful for getting a first glance into what features a model has; they are demonstrated in this notebook, and a more detailed manuscript by their authors is forthcoming. We also tried using dimensionality reduction, as described below, to choose non-axis-aligned directions in activation space to maximize.
- Low-level visual diversity. In an attempt to broaden the distribution of images seen by the model, we retrained it on a version of the game with procedurally-generated sprites. We additionally tried adding noise to the images, both independent per-pixel noise and spatially-correlated noise. Finally, we experimented briefly with adversarial training, though we did not pursue this line of inquiry very far.
As shown below, we were able to use dataset examples to identify a number of channels that pick out human-interpretable features. It is therefore striking how resistant gradient-based methods were to our efforts. We believe that this is because solving CoinRun does not ultimately require much visual ability. Even with our modifications, it is possible to solve the game using simple visual shortcuts, such as picking out certain small configurations of pixels. These shortcuts work well on the narrow distribution of images on which the model is trained, but behave unpredictably in the full space of images in which gradient-based optimization takes place.
Our analysis here provides further insight into the diversity hypothesis. In support of the hypothesis, we have examples of features that are hard to interpret in the absence of diversity. But there is also evidence that the hypothesis may need to be refined. Firstly, it seems to be a lack of diversity at a low level of abstraction that harms our ability to interpret features at all levels of abstraction, which could be due to the fact that gradient-based feature visualization needs to back-propagate through earlier layers. Secondly, the failure of our efforts to increase low-level visual diversity suggests that diversity may need to be assessed in the context of the requirements of the task.
Dataset example-based feature visualization
As an alternative to gradient-based feature visualization, we use dataset examples. This idea has a long history, and can be thought of as a heavily-regularized form of feature visualization. The human-interpretable features we identified in this way include the following:
- Short left-facing wall
- Velocity info or left edge of screen
- Long left-facing wall
- Left end of platform
- Right end of platform
- Buzzsaw obstacle or platform
- Coin
- Top/right edge of screen
- Left end of platform
- Step
- Agent or enemy moving right
- Left edge of box
- Right end of platform
- Buzzsaw obstacle
- Top left corner of box
- Left end of platform or bottom/right of screen?
Unlike gradient-based feature visualization, this method finds some meaning to the different directions in activation space. However, it may still fail to provide a complete picture for each direction, since it only shows a limited number of dataset examples, and with limited context.
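A minimal sketch of this procedure (our own illustration; it assumes a get_activations helper that runs the network up to the chosen layer and returns activations of shape [batch, height, width, channels]):

```python
import numpy as np

def top_dataset_examples(observations, get_activations, channel, k=8):
    """Return the k observations whose peak response for `channel` is strongest.

    observations:    array of shape [N, 64, 64, 3].
    get_activations: function mapping observations to hidden-layer activations
                     of shape [N, H, W, C].
    """
    acts = get_activations(observations)            # [N, H, W, C]
    peak = acts[..., channel].max(axis=(1, 2))      # strongest response per observation
    order = np.argsort(-peak)[:k]
    return observations[order], peak[order]
```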
Spatially-aware feature visualization
CoinRun observations differ from natural images in that they are much less spatially invariant. For example, the agent always appears in the center, and the agent’s velocity is always encoded in the top left. As a result, some features detect unrelated things at different spatial positions, such as reading the agent’s velocity in the top left while detecting an unrelated object elsewhere. To account for this, we developed a spatially-aware version of dataset example-based feature visualization, in which we fix each spatial position in turn, and choose the observation with the strongest activation at that position (with a limited number of reuses of the same observation, for diversity). This creates a spatial correspondence between visualizations and observations.
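A sketch of the spatially-aware variant, under the same assumptions as above and with a hypothetical max_reuse cap standing in for the article's limit on reusing the same observation:

```python
import numpy as np

def spatially_aware_examples(observations, get_activations, channel, max_reuse=3):
    """For each spatial position, pick the observation that activates `channel`
    most strongly at that position, reusing any single observation at most
    `max_reuse` times."""
    acts = get_activations(observations)[..., channel]   # [N, H, W]
    N, H, W = acts.shape
    uses = np.zeros(N, dtype=int)
    chosen = np.zeros((H, W), dtype=int)
    for i in range(H):
        for j in range(W):
            for n in np.argsort(-acts[:, i, j]):          # best observations first
                if uses[n] < max_reuse:
                    chosen[i, j] = n
                    uses[n] += 1
                    break
    return chosen   # index of the chosen observation for each spatial position
```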
Here is such a visualization for a feature that responds strongly to coins. The white squares in the top left show that the feature also responds strongly to the horizontal velocity info when it is white, corresponding to the agent moving right at full speed.
Attribution
Dimensionality reduction for attribution
We showed above that a dimensionality reduction method known as non-negative matrix factorization (NMF) could be applied to the channels of activations to produce meaningful directions in activation space.
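As a rough sketch of how such directions can be obtained (our own illustration using scikit-learn; the article uses an attribution-based variant of NMF, which this plain version does not capture):

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_directions(activations, n_directions=8):
    """Factorize hidden-layer activations into a small set of channel directions.

    activations: array of shape [N, H, W, C] of non-negative (post-ReLU) activations.
    Returns an array of shape [n_directions, C]; each row is a direction in
    channel space corresponding to one reduced feature.
    """
    N, H, W, C = activations.shape
    flat = activations.reshape(N * H * W, C)   # treat every spatial position as a sample
    model = NMF(n_components=n_directions, init="nndsvd", max_iter=500)
    model.fit(flat)
    return model.components_                   # [n_directions, C]
```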
Following the approach described in Appendix B, we attribute the value function to this layer using integrated gradients, and express the attribution in terms of these NMF directions, as in the figure below.
[Figure: a single observation alongside its positive attribution (“good news”) and negative attribution (“bad news”), with features such as “agent or enemy moving right” labeled.]
For the full version of our interface, we simply repeat this for an entire trajectory of the agent playing the game. We also incorporate video controls and a timeline view of compressed observations.
Attribution discussion
Attributions for our CoinRun model have some interesting properties that would be unusual for an ImageNet model.
- Sparsity. Attribution tends to be concentrated in a very small number of spatial positions and (post-NMF) channels. For example, in the figure above, the top 10 position–channel pairs account for more than 80% of the total absolute attribution. This might be explained by our earlier hypothesis that the model identifies objects by picking out certain small configurations of pixels. Because of this sparsity, we smooth out attribution over nearby spatial positions for the full version of our interface, so that the amount of visual space taken up can be used to judge attribution strength. This trades off some spatial precision for more precision with magnitudes.
- Unexpected sign. Value function attribution usually has the sign one would expect: positive for coins, negative for enemies, and so on. However, this is sometimes not the case. For example, in the figure above, the red channel that detects buzzsaw obstacles has both positive and negative attribution in two neighboring spatial positions towards the left. Our best guess is that this phenomenon is a result of statistical collinearity, caused by certain correlations in the procedural level generation together with the agent’s behavior. These could be visual, such as correlations between nearby pixels, or more abstract, such as both coins and long walls appearing at the end of every level. As a toy example, supposing the value function ought to increase by 2% when the end of the level becomes visible, the model could either increase the value function by 1% for coins and 1% for long walls, or by 3% for coins and −1% for long walls, and the effect would be similar.
- Outlier frames. When an unusual event causes the network to output extreme values, attribution can behave especially strangely. For example, in the buzzsaw hallucination frame, most features have a significant amount of both positive and negative attribution. We do not have a good explanation for this, but perhaps features are interacting in more complicated ways than usual. Moreover, in these cases there is often a significant component of the attribution lying outside the space spanned by the NMF directions, which we display as an additional “residual” feature. This could be because each frame is weighted equally when computing NMF, so outlier frames have little influence over the NMF directions.
These considerations suggest that some care may be required when interpreting attributions.
Questions for further research
The diversity hypothesis
- Validity. Does the diversity hypothesis hold in other contexts, both within and outside of reinforcement learning?
- Relationship to generalization. What is the three-way relationship between diversity, interpretable features and generalization? Do non-interpretable features indicate that a model will fail to generalize in certain ways? Generalization refers implicitly to an underlying distribution – how should this distribution be chosen? For example, to measure generalization for CoinRun models trained on a limited number of levels, we used the distribution over all possible procedurally-generated levels. However, to formalize the sense in which CoinRun is not diverse in its visual patterns or dynamics rules, one would need a distribution over levels from a wider class of games.
- Caveats. How are interpretable features affected by other factors, such as the choice of task or algorithm, and how do these interact with diversity? Speculatively, do big enough models obtain interpretable features via the double descent phenomenon, even in the absence of diversity?
- Quantification. Can we quantitatively predict how much diversity is needed for interpretable features, perhaps using generalization metrics? Can we be precise about what is meant by an “interpretable feature” and a “level of abstraction”?
Interpretability in the absence of diversity
- Pervasiveness of non-diverse features. Do “non-diverse features”, by which we mean the hard-to-interpret features that tend to arise in the absence of diversity, remain when diversity is present? Is there a connection between these non-diverse features and the “non-robust features” that have been posited to explain adversarial examples?
- Coping with non-diverse levels of abstraction. Are there levels of abstraction at which even broad distributions like ImageNet remain non-diverse, and how can we best interpret models at these levels of abstraction?
- Gradient-based feature visualization. Why does gradient-based feature visualization break down in the absence of diversity, and can it be made to work using transformation robustness, regularization, data augmentation, adversarial training, or other techniques? What property of the optimization leads to the clouds of extremal colors?
- Trustworthiness of dataset examples and attribution. How reliable and trustworthy can we make very heavily-regularized versions of feature visualization, such as those based on dataset examples? (Heavily-regularized feature visualization may be untrustworthy by failing to separate the things causing certain behavior from the things that merely correlate with those causes.) What explains the strange behavior of attribution, and how trustworthy is it?
Interpretability in the RL framework
- Non-visual and abstract features. What are the best methods for interpreting models with non-visual inputs? Even vision models may also have interpretable abstract features, such as relationships between objects or anticipated events: will any method of generating examples be enough to understand these, or do we need an entirely new approach? For models with memory, how can we interpret their hidden states?
- Improving reliability. How can we best identify, understand and correct rare failures and other errors in RL models? Can we actually improve models by model editing, rather than merely degrading them?
- Modifying training. In what ways can we train RL models to make them more interpretable without a significant performance cost, such as by altering architectures or adding auxiliary predictive losses?
- Leveraging the environment. How can we enrich interfaces using RL-specific data, such as trajectories of agent–environment interaction, state distributions, and advantage estimates? What are the benefits of incorporating user–environment interaction, such as for exploring counterfactuals?
What we would like to see from further research and why
We are motivated to study interpretability for RL for two reasons.
- To be able to interpret RL models. RL can be applied to an enormous variety of tasks, and seems likely to be a part of increasingly influential AI systems. It is therefore important to be able to scrutinize RL models and to understand how they might fail. This may also benefit RL research through an improved understanding of the pitfalls of different algorithms and environments.
- As a testbed for interpretability techniques. RL models pose a number of distinctive challenges for interpretability techniques. In particular, environments like CoinRun straddle the boundary between memorization and generalization, making them useful for studying the diversity hypothesis and related ideas.
We think that large neural networks are currently the most likely type of model to be used in highly capable and influential AI systems in the future. Contrary to the traditional perception of neural networks as black boxes, we think that there is a fighting chance that we will be able to clearly and thoroughly understand the behavior even of very large networks. We are therefore most excited by neural network interpretability research that scores highly according to the following criteria.
- Scalability. The takeaways of the research should have some chance of scaling to harder problems and larger networks. If the techniques themselves do not scale, they should at least reveal some relevant insight that might.
- Trustworthiness. Explanations should be faithful to the model. Even if they do not tell the full story, they should at least not be biased in some fatal way (such as by using an approval-based objective that leads to bad explanations that sound good, or by depending on another model that badly distorts information).
- Exhaustiveness. This may turn out to be impossible at scale, but we should strive for techniques that explain every essential feature of our models. If there are theoretical limits to exhaustiveness, we should try to understand these.
- Low cost. Our techniques should not be significantly more computationally expensive than training the model. We hope that we will not need to train models differently for them to be interpretable, but if we do, we should try to minimize both the computational expense and any performance cost, so that interpretable models are not disincentivized from being used in practice.
Our proposed questions reflect this perspective. One of the reasons we emphasize diversity relates to exhaustiveness. If “non-diverse features” remain when diversity is present, then our current techniques are not exhaustive and could end up missing important features of more capable models. Developing tools to understand non-diverse features may shed light on whether this is likely to be a problem.
We think there may be significant mileage in simply applying existing interpretability techniques, with attention to detail, to more models. Indeed, this was the mindset with which we initially approached this project. If the diversity hypothesis is correct, then this may become easier as we train our models to perform more complex tasks. Like early biologists encountering a new species, there may be a lot we can glean from taking a magnifying glass to the creatures in front of us.
Supplementary material
- Code. Utilities for computing feature visualization, attribution and dimensionality reduction for our models can be found in lucid.scratch.rl_util, a submodule of Lucid. We demonstrate these in a notebook.
- Model weights. The weights of our model are available for download, along with those of a number of other models, including the models trained on different numbers of levels, the edited models, and models trained on all 16 of the Procgen Benchmark games. These are indexed here.
- More interfaces. We generated an expanded version of our interface for every convolutional layer in our model, which can be found here. We also generated similar interfaces for each of our other models, which are indexed here.
- Interface code. The code used to generate the expanded version of our interface can be found here.
Appendix A: Model editing method
Here we explain our method for editing the model to make the agent blind to certain features.
The features in our interface correspond to directions in activation space obtained by applying attribution-based NMF to layer 2b of our model. To blind the agent to a feature, we edit the weights to make them project out the corresponding NMF direction.
More precisely, let $u$ be the NMF direction corresponding to the feature we wish to blind the model to. This is a vector of length $C$, the number of channels in activation space. Using this we construct the orthogonal projection matrix $P = I - \frac{u u^{\top}}{u^{\top} u}$, which projects out the direction of $u$ from activation vectors. We then take the convolutional kernel $W$ of the following layer, which has shape $h \times w \times C \times C'$, where $C'$ is the number of output channels. Broadcasting across the height and width dimensions, we left-multiply each $C \times C'$ matrix in the kernel by $P$. The effect of the new kernel is to project out the direction of $u$ from activations before applying the original kernel.
As it turned out, the NMF directions were close to one-hot, so this procedure is approximately equivalent to zeroing out the slice of the kernel corresponding to a particular in-channel.
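A minimal sketch of this edit in NumPy (our own illustration; kernel is the [height, width, in_channels, out_channels] weight tensor of the layer following the edited activations, and direction is the NMF direction):

```python
import numpy as np

def blind_to_direction(kernel, direction):
    """Project an NMF direction out of a convolutional layer's input channels
    by editing the layer's kernel.

    kernel:    array of shape [h, w, C, C_out] (in-channels third).
    direction: array of shape [C], the channel-space direction to remove.
    Returns an edited kernel of the same shape.
    """
    u = direction / np.linalg.norm(direction)
    P = np.eye(len(u)) - np.outer(u, u)   # orthogonal projection matrix
    # Left-multiply each C x C_out slice of the kernel by P, broadcasting over the
    # spatial dimensions: equivalent to projecting the activations before the conv.
    return np.einsum("cd,hwde->hwce", P, kernel)
```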
Appendix B: Integrated gradients for a hidden layer
Here we explain the application of integrated gradients to a hidden layer of our network, which is how we compute the attributions used in our interface.
Let $V \colon \mathbb{R}^{64 \times 64 \times 3} \to \mathbb{R}$ be the value function computed by our network, which accepts a 64×64 RGB observation. Given any layer in the network, we may write $V = V_{\mathrm{post}} \circ V_{\mathrm{pre}}$, where $V_{\mathrm{pre}}$ computes the layer's activations. Given an observation $x$, a simple method of attribution is to compute $\nabla V_{\mathrm{post}}(h) \odot h$, where $h = V_{\mathrm{pre}}(x)$ and $\odot$ denotes the pointwise product. This tells us the sensitivity of the value function to each activation, multiplied by the strength of that activation. However, it uses the sensitivity of the value function at the activation itself, which does not account for the fact that this sensitivity may change as the activation is increased from zero.
To account for this, the integrated gradients method instead chooses a path $\gamma$ in activation space from some starting point $h_0$ to the ending point $h = V_{\mathrm{pre}}(x)$. We then compute the integrated gradient of $V_{\mathrm{post}}$ along $\gamma$, which is defined as the path integral
$$\int_{\gamma} \nabla V_{\mathrm{post}}(h') \odot \mathrm{d}h'.$$
Note the use of the pointwise product rather than the usual dot product here, which makes the integral vector-valued. By the fundamental theorem of calculus for line integrals, when the components of the vector produced by this integral are summed, the result depends only on the endpoints $h_0$ and $h$, equaling $V_{\mathrm{post}}(h) - V_{\mathrm{post}}(h_0)$. Thus the components of this vector provide a true decomposition of this difference, “attributing” it across the activations.
For our purposes, we take $\gamma$ to be the straight line from $h_0 = 0$ to $h$.
The resulting vector has the same dimensions as $h$, and its components sum to $V(x) - V_{\mathrm{post}}(0)$. So for a convolutional layer, this method allows us to attribute the value function (in excess of the baseline $V_{\mathrm{post}}(0)$) across the horizontal, vertical and channel dimensions of activation space. Positive value function attribution can be thought of as “good news”: components that cause the agent to think it is more likely to collect the coin at the end of the level. Similarly, negative value function attribution can be thought of as “bad news”.
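A sketch of this computation using a Riemann-sum approximation of the path integral (our own illustration; grad_fn stands in for the gradient of $V_{\mathrm{post}}$ and would be computed with an autodiff framework):

```python
import numpy as np

def integrated_gradients(h, grad_fn, steps=50):
    """Attribute V_post(h) - V_post(0) across the activations h.

    h:       activations at the chosen layer, array of shape [H, W, C].
    grad_fn: function returning the gradient of V_post at a given activation
             tensor, with the same shape as its input.
    """
    total = np.zeros_like(h)
    for k in range(1, steps + 1):
        # Points along the straight line from 0 to h.
        total += grad_fn(h * k / steps)
    # Pointwise product with h, averaged over the path: a Riemann-sum
    # approximation of the integrated gradient.
    return total * h / steps
```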
Appendix C: Architecture
Our architecture consists of the following layers in the order given, together with ReLU activations for all except the final layer.
- 7×7 convolutional layer with 16 channels (layer 1a)
- 2×2 L2 pooling layer
- 5×5 convolutional layer with 32 channels (layer 2a)
- 5×5 convolutional layer with 32 channels (layer 2b)
- 2×2 L2 pooling layer
- 5×5 convolutional layer with 32 channels (layer 3a)
- 2×2 L2 pooling layer
- 5×5 convolutional layer with 32 channels (layer 4a)
- 2×2 L2 pooling layer
- 256-unit dense layer
- 512-unit dense layer
- 10-unit dense layer (1 unit for the value function, 9 units for the policy logits)
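For concreteness, here is a rough PyTorch sketch of this architecture (our own reconstruction, not the authors' code; padding choices and the epsilon inside the L2 pooling are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Pool(nn.Module):
    """2x2 L2 pooling with stride equal to the pool size."""
    def forward(self, x):
        return torch.sqrt(F.avg_pool2d(x * x, kernel_size=2, stride=2) + 1e-8)

class CoinRunNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1a = nn.Conv2d(3, 16, kernel_size=7, padding=3)
        self.conv2a = nn.Conv2d(16, 32, kernel_size=5, padding=2)
        self.conv2b = nn.Conv2d(32, 32, kernel_size=5, padding=2)
        self.conv3a = nn.Conv2d(32, 32, kernel_size=5, padding=2)
        self.conv4a = nn.Conv2d(32, 32, kernel_size=5, padding=2)
        self.pool = L2Pool()
        self.dense1 = nn.Linear(32 * 4 * 4, 256)
        self.dense2 = nn.Linear(256, 512)
        self.head = nn.Linear(512, 10)       # 1 value unit + 9 policy logits

    def forward(self, obs):                  # obs: [batch, 3, 64, 64]
        x = self.pool(F.relu(self.conv1a(obs)))   # 64x64 -> 32x32
        x = F.relu(self.conv2a(x))
        x = self.pool(F.relu(self.conv2b(x)))     # -> 16x16
        x = self.pool(F.relu(self.conv3a(x)))     # -> 8x8
        x = self.pool(F.relu(self.conv4a(x)))     # -> 4x4
        x = F.relu(self.dense1(x.flatten(start_dim=1)))
        x = F.relu(self.dense2(x))
        out = self.head(x)                        # no ReLU on the final layer
        return out[:, 0], out[:, 1:]              # value, policy logits
```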
We designed this architecture by starting with the architecture from IMPALA and making the following changes.
- We used fewer convolutional layers and more dense layers, to allow for more non-visual processing.
- We removed the residual connections, so that the flow of information passes through every layer.
- We made the pool size equal to the pool stride, to avoid gradient gridding.
- We used L2 pooling instead of max pooling, for more continuous gradients.
The choice that seemed to make the most difference was using 5 rather than 12 convolutional layers, resulting in the object-identifying features (which were the most interpretable, as discussed above) being concentrated in a single layer (layer 2b), rather than being spread over multiple layers and mixed in with less interpretable features.
Acknowledgments
We would like to thank our reviewers Jonathan Uesato, Joel Lehman and one anonymous reviewer for their detailed and thoughtful feedback. We would also like to thank Karl Cobbe, Daniel Filan, Sam Greydanus, Christopher Hesse, Jacob Jackson, Michael Littman, Ben Millwood, Konstantinos Mitsopoulos, Mira Murati, Jorge Orbay, Alex Ray, Ludwig Schubert, John Schulman, Ilya Sutskever, Nevan Wichers, Liang Zhang and Daniel Ziegler for research discussions, feedback, follow-up work, help and support that have greatly benefited this project.
Author Contributions
Jacob Hilton was the primary contributor.
Nick Cammarata developed the model editing methodology and suggested applying it to CoinRun models.
Shan Carter (while working at OpenAI) advised on interface design throughout the project, and worked on many of the diagrams in the article.
Gabriel Goh provided evaluations of feature interpretability for the section Interpretability and generalization.
Chris Olah guided the direction of the project, performing initial exploratory research on the models, coming up with many of the research ideas, and helping to construct the article’s narrative.
Discussion and Review
Review 1 – Anonymous
Review 2 – Jonathan Uesato
Review 3 – Joel Lehman
References
- Quantifying generalization in reinforcement learning. Cobbe, K., Klimov, O., Hesse, C., Kim, T. and Schulman, J., 2018. arXiv preprint arXiv:1812.02341.
- Deep inside convolutional networks: Visualising image classification models and saliency maps. Simonyan, K., Vedaldi, A. and Zisserman, A., 2013. arXiv preprint arXiv:1312.6034.
- Visualizing and understanding convolutional networks. Zeiler, M.D. and Fergus, R., 2014. European Conference on Computer Vision, pp. 818–833.
- Striving for simplicity: The all convolutional net. Springenberg, J.T., Dosovitskiy, A., Brox, T. and Riedmiller, M., 2014. arXiv preprint arXiv:1412.6806.
- Grad-CAM: Visual explanations from deep networks via gradient-based localization. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. and Batra, D., 2017. Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
- Interpretable explanations of black boxes by meaningful perturbation. Fong, R.C. and Vedaldi, A., 2017. Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437.
- PatternNet and PatternLRP – Improving the interpretability of neural networks. Kindermans, P., Schutt, K.T., Alber, M., Muller, K. and Dahne, S., 2017. stat, Vol 1050, pp. 16.
- The (un)reliability of saliency methods. Kindermans, P., Hooker, S., Adebayo, J., Alber, M., Schutt, K.T., Dahne, S., Erhan, D. and Kim, B., 2019. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 267–280. Springer.
- Axiomatic attribution for deep networks. Sundararajan, M., Taly, A. and Yan, Q., 2017. Proceedings of the 34th International Conference on Machine Learning, Vol 70, pp. 3319–3328.
- The Building Blocks of Interpretability. Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K. and Mordvintsev, A., 2018. Distill. DOI: 10.23915/distill.00010
- Leveraging Procedural Generation to Benchmark Reinforcement Learning. Cobbe, K., Hesse, C., Hilton, J. and Schulman, J., 2019.
- Proximal policy optimization algorithms. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. arXiv preprint arXiv:1707.06347.
- High-dimensional continuous control using generalized advantage estimation. Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. arXiv preprint arXiv:1506.02438.
- Thread: Circuits. Cammarata, N., Carter, S., Goh, G., Olah, C., Petrov, M. and Schubert, L., 2020. Distill. DOI: 10.23915/distill.00024
- General Video Game AI: A multi-track framework for evaluating agents, games and content generation algorithms. Perez-Liebana, D., Liu, J., Khalifa, A., Gaina, R.D., Togelius, J. and Lucas, S.M., 2018. arXiv preprint arXiv:1802.10363.
- Obstacle Tower: A Generalization Challenge in Vision, Control, and Planning. Juliani, A., Khalifa, A., Berges, V., Harper, J., Henry, H., Crespi, A., Togelius, J. and Lange, D., 2019. arXiv preprint arXiv:1902.01378.
- Observational Overfitting in Reinforcement Learning. Song, X., Jiang, Y., Du, Y. and Neyshabur, B., 2019. arXiv preprint arXiv:1912.02975.
- Feature Visualization. Olah, C., Mordvintsev, A. and Schubert, L., 2017. Distill. DOI: 10.23915/distill.00007
- Visualizing higher-layer features of a deep network. Erhan, D., Bengio, Y., Courville, A. and Vincent, P., 2009. University of Montreal, Vol 1341(3), pp. 1.
- Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. Nguyen, A., Yosinski, J. and Clune, J., 2015. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436.
- Inceptionism: Going deeper into neural networks. Mordvintsev, A., Olah, C. and Tyka, M., 2015. Google Research Blog.
- Plug & play generative networks: Conditional iterative generation of images in latent space. Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A. and Yosinski, J., 2017. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4467–4477.
- ImageNet: A large-scale hierarchical image database. Deng, J., Dong, W., Socher, R., Li, L., Li, K. and Fei-Fei, L., 2009. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pp. 248–255. DOI: 10.1109/cvprw.2009.5206848
- Going deeper with convolutions. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- An Atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning agents. Such, F.P., Madhavan, V., Liu, R., Wang, R., Castro, P.S., Li, Y., Schubert, L., Bellemare, M., Clune, J. and Lehman, J., 2018. arXiv preprint arXiv:1812.07069.
- Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents. Rupprecht, C., Ibrahim, C. and Pal, C.J., 2019. arXiv preprint arXiv:1904.01318.
- Caricatures. Cammarata, N., Olah, C. and Satyanarayan, A., unpublished. Distill draft; author list not yet finalized.
- Towards deep learning models resistant to adversarial attacks. Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A., 2017. arXiv preprint arXiv:1706.06083.
- Intriguing properties of neural networks. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. arXiv preprint arXiv:1312.6199.
- Visualizing and understanding Atari agents. Greydanus, S., Koul, A., Dodge, J. and Fern, A., 2017. arXiv preprint arXiv:1711.00138.
- Explain Your Move: Understanding Agent Actions Using Specific and Relevant Feature Attribution. Puri, N., Verma, S., Gupta, P., Kayastha, D., Deshmukh, S., Krishnamurthy, B. and Singh, S., 2019. International Conference on Learning Representations.
- Video Interface: Assuming Multiple Perspectives on a Video Exposes Hidden Structure. Ochshorn, R.M., 2017.
- Reconciling modern machine-learning practice and the classical bias–variance trade-off. Belkin, M., Hsu, D., Ma, S. and Mandal, S., 2019. Proceedings of the National Academy of Sciences, Vol 116(32), pp. 15849–15854.
- Adversarial examples are not bugs, they are features. Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B. and Madry, A., 2019. arXiv preprint arXiv:1905.02175.
- A Discussion of ‘Adversarial Examples Are Not Bugs, They Are Features’. Engstrom, L., Gilmer, J., Goh, G., Hendrycks, D., Ilyas, A., Madry, A., Nakano, R., Nakkiran, P., Santurkar, S., Tran, B., Tsipras, D. and Wallace, E., 2019. Distill. DOI: 10.23915/distill.00019
- Human-level performance in 3D multiplayer games with population-based reinforcement learning. Jaderberg, M., Czarnecki, W.M., Dunning, I., Marris, L., Lever, G., Castaneda, A.G., Beattie, C., Rabinowitz, N.C., Morcos, A.S., Ruderman, A. and others, 2019. Science, Vol 364(6443), pp. 859–865.
- Solving Rubik’s Cube with a Robot Hand. Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R. and others, 2019. arXiv preprint arXiv:1910.07113.
- Dota 2 with Large Scale Deep Reinforcement Learning. Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C. and others, 2019. arXiv preprint arXiv:1912.06680.
- Does Attribution Make Sense? Olah, C. and Satyanarayan, A., unpublished. Distill draft; author list not yet finalized.
- IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I. and others, 2018. arXiv preprint arXiv:1802.01561.
Updates and Corrections
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Reuse
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
Citation
For attribution in academic contexts, please cite this work as
Hilton, et al., "Understanding RL Vision", Distill, 2020.
BibTeX citation
@article{hilton2020understanding,
  author = {Hilton, Jacob and Cammarata, Nick and Carter, Shan and Goh, Gabriel and Olah, Chris},
  title = {Understanding RL Vision},
  journal = {Distill},
  year = {2020},
  note = {https://distill.pub/2020/understanding-rl-vision},
  doi = {10.23915/distill.00029}
}