Friday, 26 September 2014

Philosophy of Mind and Psychology Reading Group -- The Predictive Mind, Chapter 8

Alex Kiefer
Welcome to the Philosophy of Mind and Psychology Reading Group hosted by the Philosophy@Birmingham blog. This month, Alex Kiefer (CUNY) introduces chapter 8 of Jakob Hohwy's The Predictive Mind (OUP, 2013).

Chapter 8 - Surprise and Misrepresentation
Presented by Alex Kiefer

Chapter 8 of The Predictive Mind explores longstanding and unresolved debates about the nature of mental representation and representational content from the point of view of the Prediction Error Minimization framework. The chapter is concerned primarily with perceptual representation, in keeping with the emphasis on perception throughout the book.

Jakob offers novel perspectives on a wide range of topics in the theory of content and the philosophy of mind more generally. In this post I'll focus on the two topics that I take to be most crucial for characterizing the account of representational content that best fits with the PEM framework: misrepresentation, and causal vs. descriptive theories of content. The positions sketched in the chapter with respect to these topics can be summarized in the following two claims:

Misrepresentation: Misrepresentation is perceptual inference that minimizes short-term prediction error while undermining long-term prediction error minimization.

Causal vs. descriptive theories: Cognitive systems that minimize prediction error represent the world by maintaining causally guided descriptions (modes of presentation) of states of affairs in the world.

In what follows I'll discuss these claims and Jakob's arguments for them in more detail, then consider challenges for each position as well as connections between the two topics.

Misrepresentation and the disjunction problem

A central goal of the chapter is to consider how a reductive theory of content might (a) explain the difference between accurate perception and misperception, (b) in terms of the PEM framework. As Jakob notes, such a theory must overcome what Fodor (1990) and others call “the disjunction problem”: suppose that a dog in a field at night is misperceived as a sheep. What makes it the case that the perceptual representation tokened on this occasion is an inaccurate representation of a dog as a sheep, rather than an accurate representation of a sheep-or-dog-at-night as such?

Jakob relies on the statistical theories of content offered by Eliasmith (2000) and Usher (2001) as points of departure for his own account. According to these theories, the content of a representation is (roughly) whatever in the world enjoys the highest average statistical dependence (specifically, mutual information (MI), definable in terms of either joint or conditional probabilities and marginal probabilities) with that representation under all stimulus conditions. Misrepresentation occurs on those occasions in which the content of a representation, so defined, differs from whatever has the highest MI with that representation under the then current stimulus conditions.
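
To make the contrast between "under all stimulus conditions" and "under the then current stimulus conditions" concrete, here is a minimal toy sketch. It is not taken from Eliasmith, Usher, or the book: the world states, the firing probabilities, the weighting of day against night, and the helper names are all invented for illustration, and pooling the conditions into a single joint distribution is only one of several ways one might cash out "on average".

```python
from math import log2

def mutual_information(joint):
    """Mutual information (in bits) of a joint distribution given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Cells with p == 0 are skipped (0 * log 0 is taken to be 0).
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def vehicle_target_joint(p_world, p_fire, target):
    """Joint over (vehicle tokened?, target present?) under one stimulus condition."""
    joint = {}
    for world_state, pw in p_world.items():
        present = world_state == target
        for fired, pf in ((True, p_fire[world_state]),
                          (False, 1 - p_fire[world_state])):
            joint[(fired, present)] = joint.get((fired, present), 0.0) + pw * pf
    return joint

def mixture(weighted_joints):
    """Pool several conditions, weighting each by how often it is assumed to occur."""
    mixed = {}
    for weight, joint in weighted_joints:
        for key, p in joint.items():
            mixed[key] = mixed.get(key, 0.0) + weight * p
    return mixed

# Hypothetical numbers, invented for illustration only.
# p_world: what is in the field; p_fire: chance the "sheep" vehicle is tokened.
DAY = dict(p_world={"sheep": 0.4, "dog": 0.4, "nothing": 0.2},
           p_fire={"sheep": 0.95, "dog": 0.05, "nothing": 0.0})
NIGHT = dict(p_world={"sheep": 0.0, "dog": 0.5, "nothing": 0.5},
             p_fire={"sheep": 0.95, "dog": 0.8, "nothing": 0.0})
WEIGHTS = {"day": 0.9, "night": 0.1}

mi_night, mi_average = {}, {}
for target in ("sheep", "dog"):
    day = vehicle_target_joint(target=target, **DAY)
    night = vehicle_target_joint(target=target, **NIGHT)
    mi_night[target] = mutual_information(night)
    mi_average[target] = mutual_information(
        mixture([(WEIGHTS["day"], day), (WEIGHTS["night"], night)]))

print("night only:     ", mi_night)
print("all conditions: ", mi_average)
```

With these made-up numbers the vehicle carries more information about dogs than about sheep in the night-only condition, but more about sheep than about dogs once conditions are pooled, which is the pattern the statistical theories use to classify the night-time tokening as a misrepresentation of a dog as a sheep.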

As Jakob discusses in chapter 2 of the book, a cognitive system that minimizes prediction error will exhibit high MI between the parameters of its internal model and states of the world. Given this, the contrast between short-term and on-average MI relations may map onto a contrast between short-term and on-average prediction error minimization. The hypothesis selected to explain away the sensory input caused by the dog in the example (i.e. that that thing is a sheep) will best minimize prediction error in the short term (otherwise it wouldn't have been selected), and the representational vehicle with that content will presumably enjoy higher MI with the presence of the dog than with any other external object under those perceptual circumstances. But that same vehicle carries the most information about the presence of sheep on average, so its tokening in the present situation will undermine prediction error minimization in the long term, which would best be served by representing the dog-at-night as a dog. Misperceptions can then be characterized as perceptual inferences that minimize prediction error in the short term but “undermine average, long-term prediction error minimization” (p. 176).

Causal and descriptive theories of content

As Jakob puts it, “philosophers typically are divided in two camps on representation: either aboutness [i.e. a representation's having content] is said to be a causal relation between things in the world and states of the mind, or it is said to be a relation that comes about when the state of the mind uniquely fits or describes the way the world is” (p. 173). One aim of Chapter 8 is to assess the extent to which the PEM framework favors one or the other of these theories. It seems that there are considerations that pull in both directions.

On the one hand, the appeal to mutual information to explain the relation between the parameters of the generative model and states of affairs in the world suggests a causal covariance theory, of which Usher's (2001) theory, for example, can be taken to be a probabilistic generalization. As Jakob puts it, “the aim of the perceptual system is to fine-tune the causal relation between states of the internal model and states of affairs in the world so that they tend to predict each others' occurrence” (p. 182).

On the other hand, the hierarchical structure of the generative model and the functional importance of interrelations between hypotheses at various levels about properties of the environment at different spatiotemporal scales suggest a description theory. “For a given object or event in the world, the perceptual hierarchy builds up a highly structured definite description, which will attempt to pick out just that object or event” (p. 182). Though the account in this chapter is only a sketch, the picture Jakob suggests is of a statistical network in which the content of each variable is determined by its probabilistic relations to all the others (its internal causal or inferential role).

The conclusion is that the theory of content that fits best with PEM incorporates both causal and descriptive factors: representation of the world is “not just a matter of causal relations but rather a matter of causally guided modes of representation maintained in the brain” (p. 183).

Prediction error and mutual information

I turn now to a critical discussion of the positions sketched so far. One issue is that the contrast between short-term and long-term prediction error minimization that Jakob relies on to confront the disjunction problem seems to be at least conceptually independent of that between mutual information relations under current conditions vs. on average. Even if the two approaches classify the same inferences as misperceptions (which is not immediately clear), it's not mutual information but the inferential consequences of erroneous perceptual inference that do the explanatory work in the story about why long-term PEM is undermined.

Jakob's argument, I take it, is something like the following: Selection of a false hypothesis at one level of the hierarchy in order to minimize local prediction error is bound to impair representation at other levels, given the interdependence of the hypotheses, and misperception occurs to the extent to which the total revisions to the model due to an inference tend to raise long-term prediction error (p. 176). The case in which I infer that the dog-at-night is a sheep may, for example, lead to inaccurate beliefs about the “whereabouts of sheep and dogs” (p. 175). The case can also be made without appeal to inferences distinct from the perceptual inference in question: when I adopt the hypothesis that the dog-at-night is a sheep, I alter (however subtly) the priors that apply to perceptually similar situations, so that I become more likely to draw the same faulty inference in similar situations in the future.
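
The last point, that adopting a hypothesis shifts the priors for similar situations, can be sketched with a deliberately crude toy model. Everything here is an assumption made for illustration rather than a mechanism described in the book: the pseudo-count prior, the rule that the selected hypothesis gains one count, and the likelihood values for an ambiguous night-time input are all hypothetical.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite hypothesis set: normalise prior * likelihood."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnorm.values())
    return {h: v / total for h, v in unnorm.items()}

def prior_from_counts(counts):
    """Turn pseudo-counts into a prior probability for each hypothesis."""
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}

# Hypothetical numbers: pseudo-counts behind the prior, and the likelihood of an
# ambiguous "woolly shape in the dark" input under each hypothesis.
counts = {"sheep": 6.0, "dog": 4.0}
ambiguous_input = {"sheep": 0.5, "dog": 0.45}

sheep_posteriors = []
for _ in range(5):  # five similar encounters, each one actually caused by a dog
    post = posterior(prior_from_counts(counts), ambiguous_input)
    sheep_posteriors.append(post["sheep"])
    winner = max(post, key=post.get)  # hypothesis selected to explain away the input
    counts[winner] += 1.0             # assumed learning rule: selection nudges the prior

print([round(p, 3) for p in sheep_posteriors])
```

In this sketch each selection of the "sheep" hypothesis makes the same selection more probable on the next similar occasion, even though the local prediction error is minimized every time. How closely any real PEM system resembles this count-based update is an open question.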

A concern about this account is that, while drawing a certain type of perceptual inference regularly may predictably lead to an increase in prediction error, there may be single instances of perceptual inference that would intuitively count as misrepresentations but that never in fact lead to prediction errors. It seems that Jakob could meet this challenge by identifying cases of misperception as those that increase the risk (rather than the actual incidence) of future prediction error, given the way the world actually is. On this account, what marks off some perceptual inferences as cases of misperception is the fact that they result in an overall worse model of the world.

Troubles with statistical causal theories

The considerations of the previous section weaken the case for a causal component to content determination within the PEM framework, because the account of content in terms of mutual information seems to carry no explanatory weight with regard to misrepresentation. There are additional challenges to the view that mutual information relations can be used to isolate the causes relevant to determining content. I mention here only one such challenge that is particularly salient in the context of hierarchical PEM.

The challenge, considered by Eliasmith (2000, pp. 59-60), is that any physical state will carry more information about its immediate causes than about its more distant causes, since each link in the causal chain introduces noise. Given this, we should expect representational vehicles to covary more reliably with the intermediate links in the causal chains connecting them to distal stimuli than with the distal stimuli themselves, and most reliably with other internal vehicles, for example the states of sensory transducers (see also Fodor 1990, pp. 108-111). Eliasmith replies by ruling out dependencies that can be fully explained by computational links within the cognitive system as content-determining, but this seems too strong as it rules out the possibility of a system representing its own states in a way that's unmediated by exteroception.

The latter is potentially problematic for hierarchical PEM systems in particular, in which the causes of sensory input are represented in virtue of the system's ability to predict its own states. If “prediction” is univocal and each prediction corresponds to a hypothesis with representational content, then it should be the case that PEM systems represent states of the environment by virtue of many layers of higher-order representation of their own states. This is how Hinton et al (1995, p.2), for example, characterize the representational properties of a Helmholtz machine that includes a multi-layer generative model whose parameters are fit to data by minimizing free energy.

This claim about the ubiquity of higher-order representation in PEM systems is contentious, but is supported by the fact (discussed in my reply to Zoe's post on chapter 4) that representations high in the perceptual hierarchy function both as parts of the overall hypothesis about the external world and as predictions about the properties (such as precision) of hypotheses at lower levels in the hierarchy (see again Hinton et al. (1995)). And if each vehicle plays multiple representational roles, as this consideration suggests, then the contents of a vehicle can't be limited to the unique thing with which it covaries most reliably.

How do generative models represent?

The prospects for a content-determining role for one-one causal relations between individual vehicles and environmental states thus don't seem promising within the PEM framework. In addition to the argument just discussed, there are considerations in favor of the view that inferential roles determine the contents of the states of a generative model in a way that can't be explained by reference to their individual external causes.

First, a consistent theme in Jakob's book is that the internal model represents the world by recapitulating its causal structure. This suggests that the relation between the model and the world that's of representational interest is isomorphism (or, more weakly, homomorphism): the statistical relations between parameters of the model define a structure that is similar (in the ideal case, identical) to the causal structure of the bit of the world that's currently being represented, which explains the ability of an organism possessing such a model to respond rationally to features of the environment and their relations to one another. The role of the world is only to provide the error signal used to update the model in response to selective sampling by the senses (p. 183).

Second, the idea (discussed in chapter 5) that binding is inference favors a descriptivist or inferentialist account of content. The proposal is that the error signal in a given case may be explained away in terms of a hypothesis involving one cause or one involving multiple causes, depending only on which has the highest posterior probability. This suggests that perception of objects in a scene is in fact a special case of perception of the scene as having some property (in this case, as containing some determinate number of objects).

Despite this, causal relations to the world clearly play an explanatory role, as Jakob says. That role is in explaining how model parameters are updated, and thus in explaining the etiology and maintenance of representations. Thus, it may be said that from the PEM perspective external causes play a diachronic (and genetic or etiological) role in fixing content, and inferential roles play a synchronic (and individuating) one.


Eliasmith, C. (2000). How Neurons Mean: A Neurocomputational Theory of Representational Content. Ph.D. thesis, Washington University in St. Louis.

Fodor, J. A. (1990). A Theory of Content and Other Essays. MIT Press.

Hinton, G. E., Dayan, P., Frey, B. J. and Neal, R. (1995). “The Wake-Sleep Algorithm for Unsupervised Neural Networks”. Science 268, 1158-1161.

Usher, M. (2001). “A Statistical Referential Theory of Content: Using Information Theory to Account for Misrepresentation”. Mind & Language 16(3): 311-34.


  1. Thanks for this excellent discussion of this chapter, Alex. I am very happy that you have read it in the spirit in which it is offered. In particular, I intended this chapter to open up a range of core discussions in philosophy of mind and language, rather than (impossibly) offering conclusive treatments. I view it very much as an invitation to work more on these types of issues – much like Alex suggests doing here.

    Misrepresentation. I very much like Alex’s rendition and improvements of the view here. It is correct that more work is needed to fully make the link between the mutual information account (MI) and the PEM account. The central idea from the MI account is what I use in the book, though it is important to see that PEM is not derived from information theory (PEM is meant to be more fundamental). The way I think about it is indeed that as prediction error is minimized long term, MI should increase and thereby allow appeal to the Eliasmith/Usher view of misrepresentation.

    Perhaps it is slightly uncomfortable that we won’t know what is a misrepresentation until we know the whole story about the individual in question. This is brought out in the kind of case raised by Eliasmith, where a purported misrepresentation is followed by immediate death, and less morbidly by the kind of case Alex raises, where a purported misrepresentation leaves no mark. It is tempting to appeal to counterfactuals or probabilities here, as Alex, and Eliasmith too, suggest. A misrepresentation occurs when it would (or may) impair PEM/MI.

    It is clear however that much work is needed to work out precisely whether it is possible to make clear and useful distinctions between short term and long term PEM, and between actual and counterfactual (or probable and improbable) PEM. As I briefly discuss in the chapter, it is not even obvious there is a clear distinction between representation and misrepresentation, since there is always some measure of prediction error.

    Behind all this lies the more thorny issue, which I briefly raise in the chapter, of how one might make the distinction between the normative and the non-normative, according to PEM. It is tempting to go a very short route: it seems that probability density functions (and so normativity) arise directly out of the non-normative story about organisms maintaining themselves in their expected states. In so far as a creature exists it must be minimizing free energy, and if it does that then it will necessarily be inferring the causes of its sensory input. In this story, inference, and thereby representation, becomes a kind of necessary consequence of the fact that the creature exists. These are the murky waters we find ourselves in by operating with PEM. Here, it seems we may have to reconceive the nature of representation itself.

  2. Causes and descriptions. Alex argues that the problem of robust statistical relations to proximal causes is especially pertinent for a hierarchical account, like PEM. I agree. And the points listed by Alex in favour of something like a description theory seem right. The last point about diachronic and synchronic aspects appears intriguing to me – perhaps it points to a new kind of conciliation of the causal and descriptive elements?

    Overall, I think the occurrence of this ‘depth’ problem makes sense in light of the point I made above, namely that representation is a side-product of existence. What’s crucial for existence is that the internal states operate to maintain the integrity of the organism. This means that from the perspective of the organism the relation between sensory organs, active states and internal states is paramount. So from the perspective of the organism, it is rather good news that the internal states are tightly coupled to changes in sensory organs. In this sense, we might say that PEM makes a virtue of necessity. Moreover, PEM explains how this tight process between internal and sensory states entails internal representation of the distal causes of sensory input, in terms of isomorphism or homomorphism with (or ‘recapitulation of’) the external world. Though much needs to be worked out in detail, I think there are the contours here of a new and interesting approach to representation.

    Of course, there appears to be a tension in the kind of position we speculate about here: how can there be isomorphism with distal causes if there is most MI with proximal causes? It seems isomorphism would be watered down as the causal depth increases. However, as Alex also suggests, we need to look at the hierarchy as a whole bounded by the sensory states. This hierarchy is said to recapitulate the causal structure of the world. Here it seems that a parameter at a deep level might covary best and most precisely with a well hidden cause in the world rather than with the more rapidly changing states of sensory organs or very proximate causes. This is what the hierarchy makes possible, since it models and filters out the noise arising from interacting causes in the world.

    This is really key to the PEM notion of representation: the model can predict the rapid flux in the sensory input because deeper levels hypothesize deeper causes modulating the flux. In this way, it seems to me, PEM can deal with the depth problem since it ‘triangulates’ on a cause at a particular depth, the modeling of which minimizes error best. Hypothesizing a more shallow or more deeply hidden cause would carry a prediction error cost in the long run. This story however can only be told by really accentuating the element of definite description in PEM rather than the causal element alone.

  3. Hi Jakob,

    Thanks so much for the comments. I think these large-scale theoretical issues about representation are difficult to tackle in blog-comment-sized chunks, but I wanted to follow up on a few of the points you made in your replies.

    First, on the claim from the chapter that "All perceptual inference is misperception to some degree": This is an interesting view that I'd wanted to cover in my initial post. It seems to follow straightforwardly from the hypothesis that the epistemic function of bottom-up connections is to propagate an error signal: given this, any representation token caused by the presence of a given distal stimulus must have been activated to explain away prediction error at a lower level in the hierarchy. If the prediction at the sensory periphery had been perfect, no downstream processing would have occurred at all. So (assuming there is a causal condition on a mental state's counting as perception, which there may be even if content can't be explained in causal terms) all perception involves at least some misrepresentation. Irreducible noise is another way to arrive at this conclusion, but the argument there seems to be a general one that would apply to any information-processing system, whether governed by PEM principles or not.

    But there are subtle issues here. For one thing, it's not controversial that perception usually involves finding out what was not already expected about the world. If any mismatch between what was already expected and what is seen counts as "misrepresentation" then we may be using the term too permissively.

    In particular, to the extent that top-down influence is used to resolve ambiguities, rather than to correct mistakes (thanks to Zoe Jenkin for calling my attention to the importance of this distinction in the present context), it seems what we have is incomplete representation rather than misrepresentation. In such cases, perhaps, the feedback carried by the "error signal" isn't about errors of commission but errors of omission--about what hasn't yet been settled perceptually. To describe the latter kind of filtering of information up the hierarchy as "misrepresentation", or to interpret the negative feedback signal as in every case a signal carrying information about error in the full-blooded epistemic sense, may be too quick.

    In the hierarchical Bayesian framework, however, this is a tricky line to draw. Ambiguity resolution amounts to the selection of one hypothesis among competing hypotheses. But the idea (if I understand PEM) is that the competing hypotheses at a given level of abstraction are all simultaneously in play and assigned relatively low probabilities until one wins out. "Falsified" hypotheses are just those that end up getting assigned probability near 0 after updating, "confirmed" hypotheses near 1. If we suppose that A, B, C, and D are mutually exclusive and jointly exhaustive hypotheses at a given level of abstraction, then an ideally ambiguous stimulus (with respect to that level) should assign each a probability of 0.25. If a higher-level representation (prior) then raises the probability of B to 0.75, there must be a compensating decrease in the probabilities of the others. This sounds like revision of what was represented previously, rather than epistemically blameless ambiguity resolution. This is an extremely simplified example, but the principle that hypotheses at a given level of abstraction compete for representational control is, I take it, fairly central to the framework.
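
    The arithmetic in this example can be checked with a short sketch. The modelling choices are assumptions made for illustration: the "ideally ambiguous stimulus" is treated as a likelihood that is equal across A-D, and the higher-level influence is treated as a prior that puts 0.75 on B and splits the remainder evenly.

```python
def normalise(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

def posterior(prior, likelihood):
    """Bayes' rule over mutually exclusive, jointly exhaustive hypotheses."""
    return normalise({h: prior[h] * likelihood[h] for h in prior})

hypotheses = ["A", "B", "C", "D"]

# An ideally ambiguous stimulus: the input is equally likely under every hypothesis.
ambiguous = {h: 1.0 for h in hypotheses}

# Before any top-down influence: flat prior, so each hypothesis sits at 0.25.
flat_prior = {h: 0.25 for h in hypotheses}
before = posterior(flat_prior, ambiguous)

# Assumed higher-level prior that favours B (remaining mass split evenly).
context_prior = {"A": 1 / 12, "B": 0.75, "C": 1 / 12, "D": 1 / 12}
after = posterior(context_prior, ambiguous)

# Change in probability for each hypothesis; the changes must cancel out overall.
shift = {h: after[h] - before[h] for h in hypotheses}
print(before)
print(after)
print(shift)
```

    Because the probabilities must sum to 1, the gain for B is paid for by A, C, and D, each of which drops from 0.25 to 1/12. That is the sense in which resolving the ambiguity looks like revising what was previously represented.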

    Still, it's unclear whether the above argument shows that perceptual content always involves misrepresentation of objects and properties in the environment. Ambiguity resolution is typically a very fast process, and it's arguable that stable perceptual contents of the kind that feature in subjective awareness and drive verbal reports normally only exist after such resolution takes place (binocular rivalry and other unusual cases aside).

    1. Just to pick up on the final point, conditions for binocular rivalry are considered by some to be relatively common (disparate images, inter-ocular competition etc), although there is some debate about the issue:

      In many cases ambiguity resolution does appear to be very fast, but for bistable images, or for instances of the Müller-Lyer illusion for example, the ambiguity can be quite slow to resolve, and often it is never fully resolved.

    2. Hi Bryan,

      Thanks for the references! There are a bunch of interesting issues to discuss here, but just to follow up on one point: I wasn't aware, as O'Shea mentions in his paper, that the blurry image from an occluded eye (for example, with closed eyelid) will come to rival a sharper image from the other given enough time. I tried this, and can confirm that in my case at least, the image of the closed eyelid does begin to dominate after several seconds, though only briefly and most clearly in the periphery.

      A possible PEM explanation is that default priors explain away temporary occlusions of one eye (which makes sense given their irrelevance in most cases to the representation of the overall scene, and is consistent, so far as I can tell, with Arnold's proposal). Then when the eye is closed for some time, very low-level priors are revised to minimize the error signal from the closed eye, which boosts the signal from this eye to the point where BR occurs.

  5. Second, on free-energy minimization: this is of course central to the PEM framework, and I don't understand statistical physics well enough to comment much. I do understand that there is a very tight link, demonstrated over the years by Hinton and fellow researchers in papers I've cited previously (but see especially p. 8), between the minimization of Helmholtz free energy, which happens at thermal equilibrium, and maximizing the efficiency of the encoding-decoding scheme employed to transmit information through a neural network. At equilibrium, in the appropriate type of network, the description length (in terms of Shannon information) of any new input to the system is minimized, because the probabilities of the network being in various internal states given its top-down generative connections and biases are the same as the posterior probabilities of those states given input (Hinton et al., p. 8).
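
    The link can be stated compactly with the standard variational free-energy identity. The notation is an assumption chosen for this sketch: d is an input, α ranges over internal states of the network, θ stands for the generative (top-down) weights and biases, and Q(α | d) is the distribution over internal states actually used to encode d.

```latex
\begin{aligned}
F(d;\theta,Q)
  &= \sum_{\alpha} Q(\alpha \mid d)\,E_{\alpha}
     \;+\; \sum_{\alpha} Q(\alpha \mid d)\,\log Q(\alpha \mid d),
  \qquad E_{\alpha} = -\log p(\alpha, d \mid \theta) \\
  &= -\log p(d \mid \theta)
     \;+\; \mathrm{KL}\!\left[\,Q(\alpha \mid d)\;\big\|\;p(\alpha \mid d,\theta)\,\right]
\end{aligned}
```

    Since the KL term is non-negative and vanishes exactly when Q(α | d) equals the posterior p(α | d, θ), the expected description length F reaches its minimum, -log p(d | θ), in just the case described above, where the two distributions coincide.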

    What seems to be controversial is whether this link between physics and information theory has anything to tell us about representation in what Tyler Burge has called the "distinctively psychological" sense. Burge's idea is that genuinely representational explanations in psychology, such as those in vision science, appeal to states that have accuracy conditions with respect to environmental features, such as states underlying perceptual constancies. This type of representation, according to Burge, is importantly different from the more generic sort that can be defined in terms of statistical covariation.

    Contemporary neural network models, however, seem to show how many features of genuinely psychological representation could in principle be explained by invoking several layers of representation in the generic, statistical covariation sense, among internal states and between peripheral states and proximal stimuli (I take it this is one of the promises of the PEM framework). This needn't commit us to the claim that individual retinal ganglion cells, for instance, or simple systems governed by negative feedback, represent the world in the same sense that human beings do, since many layers of representation (in Burge's "generic" sense) might be required before a system begins to exhibit anything like the properties associated with psychological representation.

  6. Finally (for now), on deeply hidden causes and "triangulation": Jakob, you suggest that a key part of representation within a hierarchical PEM system is that "deeper levels hypothesize deeper causes modulating the flux" (of sensory input). Thus we may expect these "deep" model parameters to covary more reliably with external-world properties than with the fleeting lower-level representations they modulate. I think this is of course right, but not in a way that solves the "depth problem" we've been discussing, as far as I can see.

    It's certainly true that a representation at a high level of abstraction will be a better indicator of a certain regularity in the environment than it is of any particular set of token representations at lower levels, since higher-level hypotheses must of course generalize over collections of lower-level ones to be of any use. In a perceptual case, for instance, the hypothesis that the thing I'm looking at is a cat is compatible with a range of lower-level representations of cat-features of various sorts. And thanks to the potential for nonlinear mappings from input to prediction built into the multiple layers of predictions, that range may be very wide.

    Still, it's not the case that anything goes. I can't perceive something as a cat, for example, if I don't perceive it as in at least some particular respect cat-like. So there will only be so many ways in which the lower-level perceptual parameters tokened on an occasion can vary, consistent with my higher-level cat-representation being activated on that occasion. (Things get trickier for purely conceptual representation, but perception is hard enough, so I'll stick with that.)

    So, it seems that we can define a class P of all possible settings of lower-level parameters consistent with the activation of the higher-level one, and the higher-level one can be expected to have higher MI with instantiation of some member of P or other than with a member of the class of external causes of Ps. This is so despite the fact that P is abstract enough to allow for all the variation in lower-level parameters that we want out of the hierarchical representation.

    One solution to this problem is to appeal, as Eliasmith does in a slightly different context (2000, p. 58), and as Usher does (2001, p. 318), to the idea that our best overall theories of the world posit certain classes of things and not others. Maybe it's legitimate for a theory of representation to restrict candidates for perceptual content to classes countenanced by some assumed background ontology, as against classes like P. I find this kind of "reference magnetism" philosophically unsatisfying as a solution to indeterminacy problems (see Wolfgang Schwarz's paper on this concept in David Lewis's work), particularly in the present context. The notion of mental representation as model fitting has barely begun to be explored and might yield interesting new ways of thinking about representation (as you suggest, Jakob) that are less hostage to ontological background assumptions.