A neglected problem in the computational theory of mind: Object Tracking and the Mind-World gap
Zenon Pylyshyn
Rutgers Center for Cognitive Science

Before I begin I would like you to see a ‘video game’ that will figure in the last part of my talk
• The demonstration shows a task called “Multiple Object Tracking”
• Track the initially distinct (flashing) items through the trial (here 10 secs) and indicate at the end which items are the “targets”
• After each example I’d like you to ask yourself, “How do I do it?”
• If you are like most of our subjects you will have no idea, or a false idea…

Keep track of the objects that flash

How do we do it? What properties of individual objects do we use?

Going behind occluding surfaces does not disrupt tracking
Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual objecthood. Cognitive Psychology, 38(2), 259-290.

Not all well-defined features can be tracked: track the endpoints of these lines
The endpoints move exactly as the squares did!

The basic problem of cognitive science
• What determines our behavior is not how the world is, but how we represent it as being
  - As Chomsky pointed out in his review of Skinner, if we describe behavior in relation to the objective properties of the world, we would have to conclude that behavior is essentially stimulus-independent
  - Every naturally occurring behavioral regularity is cognitively penetrable
    - Any information that changes beliefs can systematically and rationally change behavior

Representation and Mind
Why representations are essential
• Do representations only come into play in “higher level” mental activities, such as reasoning?
• Even at early stages of perception many of the states that must be postulated are representations (i.e., what they are about plays a role in explanations).

Examples from vision (1): Intrapercept constraints
Epstein, W. (1982). Percept-percept couplings. Perception, 11, 75-83.

Examples from vision (2): The Poggendorff illusion depends on perceived contours; they need not be physical edges

The rules of color mixing apply to perceived color
‘Red light and yellow light mix to produce orange light’
• This “law” holds regardless of how the red light and yellow light are produced
• The yellow may be light of 580 nanometer wavelength, or it may be a mixture of light of 530 nm and 650 nm wavelengths
• So long as one light looks yellow and the other looks red, the “law” will hold: the mixture will look orange.

Another example of a classical representation

Other forms of representation…
a) Lines FG, BC are parallel and equal.
b) Lines EH, AD are parallel and equal.
c) Lines FB, GC are parallel and equal.
d) Lines EA, HD are parallel and equal.
e) Vertices EF, HG, DC and AB are joined.
f) Part-Of{Cube, Top-Face(EFGH), Bottom-Face(ABCD), Front-Face(FGCB), Back-Face(EHDA)}
g) Part-Of{Top-Face, Front-Edge(FG), Back-Edge(EH), Left-Edge(EF), Right-Edge(HG)}, …

What’s wrong with this picture?
What’s wrong is that the Computational Theory of Mind (CTM) is incomplete: it does not address a number of fundamental questions
• It fails to specify how representations connect with what they represent. It’s not enough to use English words in the representation (that’s been a common confusion in AI) or to draw pictures (a common confusion in theories of mental imagery). English labels and pictures may help the theorist recall which objects are being referred to …
  - But what makes it the case that a particular mental symbol refers to one thing rather than another?
  - How are concepts grounded? (Symbol Grounding Problem)

Another way to look at what the Computational Theory of Mind lacks
• The missing function in the CTM is a mechanism that allows perception to refer to individual things in the visual field directly and nonconceptually:
  - Not as “whatever has properties P1, P2, P3, ...”, but as a singular term that refers directly to an individual and does not appeal to a representation of the individual’s properties.
  - Such a reference is like a proper name or a pointer in a computer data structure, or like a demonstrative term (like this or that) in natural language.
• Note that in a computer a pointer does not refer via a location, despite what the term “pointer” suggests
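The contrast between descriptive and direct reference can be illustrated with a minimal Python sketch. The `Item` class, the `pick_by_description` helper, and the toy scene are hypothetical names introduced only for this illustration; nothing here is taken from an actual FINST implementation.

```python
from dataclasses import dataclass

@dataclass(eq=False)  # eq=False: instances compare by identity, not by field values
class Item:
    color: str
    shape: str

def pick_by_description(scene, **props):
    """Descriptive reference: return every item whose properties fit the description."""
    return [it for it in scene
            if all(getattr(it, k) == v for k, v in props.items())]

scene = [Item("red", "square"), Item("red", "square"), Item("blue", "circle")]

# "Whatever is red and square" does not single out one individual here.
matches = pick_by_description(scene, color="red", shape="square")

# Direct reference: a handle bound to one individual; no properties are consulted.
this = scene[1]
this.color = "green"            # the individual changes its appearance...
still_same = this is scene[1]   # ...and the handle still refers to it
```

The description picks out two items before the change and only one after it, while the handle `this` never stops referring to the same individual.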

An example from personal history: Why we need to pick out individual things without referring to their properties
• We wanted to develop a computer system that would reason about geometry by actually drawing a diagram and noticing adventitious properties of the diagram from which it would conjecture lemmas to prove
• We wanted the system to be as psychologically realistic as possible, so we assumed that it had a narrow field of view and noticed only limited, spatially restricted information as it examined the drawing
• This immediately raised the problem of coordinating noticings and led us to the idea of visual indexes to keep track of previously encoded parts of the diagram.

Begin by drawing a line… L1

Now draw a second line… L2

And draw a third line… L3

Notice what you have so far… (noticings are local: you encode what you attend to)
[Figure labels: L1, V6, L2]
There is an intersection of two lines… But which of the lines you drew are they? There is no way to indicate which individual things are seen again without a way to refer to individual (token) things

Look around some more to see what is there…
[Figure labels: L5, L2, V12]
Here is another intersection of two lines… Is it the same intersection as the one seen earlier? Without a special way to keep track of individuals the only way to tell would be to encode unique properties of each of the lines. Which properties should you encode?

In examining a geometrical figure one only gets to see a sequence of local glimpses

The incremental construction of visual representations requires solving a correspondence problem over time
• We have to determine whether a particular individual element seen at time t is identical to another individual element seen at an earlier time. This is one manifestation of the correspondence problem.
• Solving the correspondence problem is equivalent to picking out and tracking the identity of token individuals as they change their appearance, their location, or the way they are encoded or conceptualized
• To do that we need the capacity to refer to token individuals (I will call them objects) without doing so by appealing to their properties. This requires a special form of demonstrative reference I call a Visual Index.
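A toy version of the correspondence problem, sketched in Python under stated assumptions: `Token`, `encode`, and `same_by_description` are hypothetical names, and a plain object reference stands in for an index. Two identical-looking tokens swap places between two glimpses.

```python
from dataclasses import dataclass

@dataclass(eq=False)  # tokens are individuated by identity, not by their fields
class Token:
    x: float
    y: float
    color: str

def encode(tok):
    """What a glimpse records: a snapshot of properties, detached from the token."""
    return (tok.x, tok.y, tok.color)

def same_by_description(desc_then, desc_now):
    """Property-based correspondence: 'same' means 'fits the stored description'."""
    return desc_then == desc_now

a = Token(0.0, 0.0, "red")
b = Token(5.0, 5.0, "red")

index = a             # stand-in for an index bound to the individual at the first glimpse
desc_t0 = encode(a)   # a description of the same individual at the first glimpse

# Between glimpses the two tokens swap places.
a.x, a.y = 5.0, 5.0
b.x, b.y = 0.0, 0.0

desc_says_a = same_by_description(desc_t0, encode(a))  # a no longer fits its old description
desc_says_b = same_by_description(desc_t0, encode(b))  # b now fits it
index_says_a = index is a                              # the index is unaffected by the change
```

In this sketch the stored description ends up matching the wrong token, whereas the reference to the individual survives the change in location.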

A note about the use of labels in this example
• There are two purposes for figure labels. One is to specify what type of individual it is (line, vertex, ...). The other is to specify which individual it is, so that it is individuated and thus can be selected or bound to the argument of a predicate.
• The second of these is what I am concerned with, because indicating which individual it is is essential in vision.
  - Many people (e.g., Marr, Yantis) have suggested that individuals may be marked by tags, but that won’t do, since one cannot literally place a tag on an object, and even if we could it would not obviate the need to individuate and index, just as labels don’t help.
• Labeling things in the world is not enough because to refer to the line labeled L1 you would have to be able to think “this is line L1”, and you could not think that unless you had a way of first picking out the referent of this.

The difference between a direct (demonstrative) and a descriptive way of picking something out has produced many “You are here” cartoons. It is also illustrated in this recent New Yorker cartoon…

The difference between descriptive and demonstrative ways of picking something out (illustrated in this New Yorker cartoon by Sipress)

‘Picking out’
• Picking out entails individuating, in the sense of separating something from a background (what Gestalt psychologists called a figure-ground distinction)
• This sort of picking out has been studied in psychology under the heading of focal or selective attention.
  - Focal attention appears to pick out and adhere to objects rather than places
• In addition to a unitary focal attention there is also evidence for a mechanism of multiple references (about 4 or 5) that I have called a visual index or a FINST
  - Indexes are different from focal attention in many ways that we have studied in our laboratory (I will mention a few later)
  - A visual index is like a pointer in a computer data structure: it allows access but does not itself tell you anything about what is being pointed to

The requirements for picking out and keeping track of several individual things reminded me of an early comic book character called Plastic Man

Imagine being able to place several of your fingers on things in the world without recognizing their properties while doing so. You could then refer to those things (e.g., ‘what finger #2 is touching’) and could move your attention to them. You would then be said to possess FINgers of INSTantiation (FINSTs)

FINST Theory postulates a limited number of pointers in early vision that are elicited by certain events in the visual field and that enable vision to refer to those things without doing so under a concept or a description

FINSTs and Object Files form the link between the world and its conceptualization
[Figure labels: Information (causal) link; FINST; Demonstrative reference link]
Object File contents are conceptual! The only nonconceptual contents in this picture are FINST indexes!

A note on terminology
• A FINST provides a reference to an individual visible ‘thing’. I sometimes call this referent a FING by analogy with FINST, and sometimes an object to conform with usage in psychology, but FINGs are nonconceptual, so they do not pick out something as an object, because OBJECT is a concept. Maybe “proto-object”?
• I have also called it a pointer, but that erroneously suggests that it “points to” the location of an object, as opposed to the object itself. In a computer, a pointer is the name of a stored datum.
• I have said that a FINST is a visual demonstrative like ‘this’ or ‘that’, but that too is misleading because the reference of a demonstrative depends on the intentions of the speaker
• I have also noted that a FINST is like a proper name, but that won’t do since a name can pick out something not in sensory contact, whereas a FINST can only refer to a visible item (or one that is briefly out of sight).

A quick tour of some evidence for FINSTs
• The correspondence problem
• The binding problem
• Evaluating multi-place visual predicates (recognizing multi-element patterns)
• Operating over several visual elements at once without having to search for them first
  - Subitizing
  - Subset search
• Multiple-Object Tracking
• Cognizing space without requiring a spatial display in the head

Individual objects and the binding problem
• We can distinguish scenes that differ by conjunctions of properties, so early vision must somehow keep track of how properties co-occur: conjunction must not be obscured. This is called the binding problem
• The most common proposal is that vision keeps track of properties according to their location and binds together co-located properties.
[Figure: example scenes 1 and 2]

The proposal of binding conjunctions by the location of conjuncts does not work when feature location is not punctate, and it becomes even more problematic if the features are co-located, e.g., if their relation is “inside”

Binding as object-based
• The proposal that properties are conjoined by virtue of their common location has many problems
  - In order to assign a location to a property you need to know its boundaries, which requires distinguishing the object that has those properties from its background (figure-ground individuation)
  - Properties are properties of objects, not of locations, which is why properties move when objects move. Empty locations have no causal properties.
• The alternative to conjoining-by-location is conjoining by object. According to this view, solving the binding problem requires first selecting individual objects and then keeping track of each object’s properties (in its object file)
  - If only properties of selected objects are encoded and if those properties are recorded in object files specific to each object, then all conjoined properties will be recorded in the same object file, thus solving the binding problem
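A minimal Python sketch of why per-object records preserve conjunctions. The scene encoding and the function names (`feature_bag`, `object_files`, `conjunctions`) are assumptions made for illustration, and the integer keys merely stand in for indexes.

```python
# Two scenes containing the same features, conjoined differently.
scene_1 = [{"color": "red", "shape": "square"}, {"color": "green", "shape": "circle"}]
scene_2 = [{"color": "red", "shape": "circle"}, {"color": "green", "shape": "square"}]

def feature_bag(scene):
    """Unbound encoding: which features are present, with conjunctions lost."""
    bag = set()
    for obj in scene:
        bag.update(obj.items())
    return bag

def object_files(scene):
    """Bound encoding: one file per selected object; the key stands in for an index."""
    return {index: dict(obj) for index, obj in enumerate(scene)}

def conjunctions(files):
    """Read the color-shape conjunctions back out of the object files."""
    return {(f["color"], f["shape"]) for f in files.values()}
```

The unbound feature bags of the two scenes are identical, so they cannot tell the scenes apart; the conjunctions read from the object files differ.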

Attention spreads over perceived objects
[Figure panels: “Spreads to B and not C”; “Spreads to C and not B”]
Using a priming method, Egly, Driver & Rafal (1994) showed that the effect of a prime spreads to other parts of the same visual object

Being able to pick out and refer to individual distal elements is essential for encoding patterns
• Encoding relational predicates, e.g., Collinear(x, y, z, ...); Inside(x, C); Above(x, y); Square(w, x, y, z), requires simultaneously binding the arguments of n-place predicates to n elements in the visual scene
  - Evaluating such visual predicates requires individuating and referring to the objects over which the predicate is evaluated: i.e., the arguments in the predicate must be bound to individual elements in the scene.
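To make the argument-binding point concrete, here is a minimal Python sketch in which predicates are evaluated only after their arguments have been bound to individual scene elements. The `Element` class, the coordinate convention (larger y is higher), and the tolerance are assumptions for illustration, not part of the theory.

```python
from dataclasses import dataclass

@dataclass(eq=False)  # elements are individuals; equality is identity
class Element:
    x: float
    y: float

def collinear(*elems, tol=1e-9):
    """Collinear(x, y, z, ...): all bound elements lie on the line through the first two.
    Assumes the first two elements are at distinct positions."""
    if len(elems) < 3:
        return True
    a, b = elems[0], elems[1]
    for c in elems[2:]:
        # Cross product of (b - a) and (c - a) is zero when c is on the line ab.
        cross = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x)
        if abs(cross) > tol:
            return False
    return True

def above(p, q):
    """Above(x, y), with larger y taken to mean higher in this sketch."""
    return p.y > q.y

scene = [Element(0, 0), Element(1, 1), Element(2, 2), Element(3, 0)]

# Binding comes first: each variable now refers to one individual element.
x, y, z, w = scene
```

Only once `x`, `y`, `z`, and `w` are bound can a call such as `collinear(x, y, z)` or `above(y, w)` be evaluated at all; the predicates themselves never search the scene.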

Several objects must be picked out at once in making relational judgments When we judge that certain objects are collinear, we must first pick out the relevant objects while ignoring their properties

Several objects must be picked out at once in making relational judgments
• The same is true for other relational judgments like inside or on-the-same-contour, etc. We must pick out the relevant individual objects first.
Are the dots inside the same contour? On the same contour?

More functions of FINSTs
Further experimental explorations using different paradigms
• Recognizing the cardinality of small sets of things: subitizing vs counting (Trick, 1994)
• Searching through subsets: selecting items to search through (Burkell, 1997)
  - Selecting subsets and maintaining the selection during a saccade (Currie, 2002)
• Application of FINST index theory to infant cardinality studies (Carey, Spelke, Leslie, Uller, etc.)
• Indexes explain how children are able to acquire words for objects by ostension without suffering Quine’s Gavagai problem.

Another example of MOT: With self occlusion

Self occlusion does not seriously impair tracking

Some findings of Multiple Object Tracking
• Basic finding: Most people can track at least 4 targets that move randomly among identical non-target objects (even 5-year-old children can track 3 objects)
• Object properties do not appear to be recorded during tracking, and tracking is not improved if all objects are visually distinct (no two objects have the same color, shape or size)
• How is it done?
  - We showed that it is unlikely that the tracking is done by keeping a record of the targets’ locations and updating them by serially visiting the objects (Pylyshyn & Storm, 1988)
  - Other strategies may be employed (e.g., tracking a single deforming pattern), but they do not explain tracking
  - Hypothesis: FINST Indexes get assigned to targets. At the end of the trial these pointers can be used to move attention to the targets and hence to select them
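The difference between keeping a record of locations and keeping an index can be shown with a deliberately small, one-dimensional Python toy. This is not the analysis reported in the cited study; the `Dot` class, the velocities, and the nearest-neighbour update rule are all assumptions chosen so that the location record fails in a predictable way. The sketch also gets the hard part for free: a Python reference cannot lose its object, whereas how a visual index stays attached is exactly what the theory has to explain.

```python
from dataclasses import dataclass

@dataclass(eq=False)  # dots are identical-looking individuals; equality is identity
class Dot:
    x: float
    velocity: float

def nearest(dots, location):
    """Return the dot currently closest to a remembered location."""
    return min(dots, key=lambda d: abs(d.x - location))

target = Dot(x=0.0, velocity=2.0)
distractor = Dot(x=3.0, velocity=0.0)
dots = [target, distractor]

stored_location = target.x   # location-record strategy: remember where the target was
index = target               # index strategy: stay bound to the object itself

for _ in range(3):
    for d in dots:
        d.x += d.velocity
    # Revisit the remembered location and adopt whatever is nearest to it now.
    stored_location = nearest(dots, stored_location).x

picked_by_location = nearest(dots, stored_location)
picked_by_index = index
```

On the second step the target has jumped past the stationary distractor, so the remembered location is closer to the distractor and the location record switches to it, while the index still picks out the target at the end of the trial.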

What role do visual properties play in MOT?
• Certain properties may have to be present in order for an object to be indexed, and certain properties (probably different properties) may be required in order for the index to keep track of the object, but this does not mean that such properties are encoded, stored, or used in tracking.
  - Compare this with Kripke’s distinction between properties that fix the referent of a proper name and the property that the name refers to. The former only plays a role at the name’s initial “baptism.”
• Is there something special about location? Do we record and track properties-at-locations?
  - Location in time & space may be essential for individuating objects, but locations need not be encoded or made cognitively available
  - The fact that an object is actually at some location or other does not mean that it is represented as such. Representing property ‘P’ (where P happens to be at location L) ≠ Representing property ‘P-is-at-L’.

A way of viewing what goes on in MOT
• According to Kahneman & Treisman’s Object File theory, the appearance of a new visual object causes a new Object File to be created. Each object file is associated with its respective object, presumably through a FINST Index.
• The object file may contain information about the object to which it is attached. But according to FINST Theory, keeping track of the object’s identity does not require the use of this information. The evidence suggests that in MOT, little or nothing is stored in the object file except maybe in special cases (e.g., when the object suddenly changes or disappears).
• What makes something the same object over time is that it remains connected to the same object file (by the same FINST). Thus, for vision to treat something as the same enduring individual does not require appeal to properties or concepts.

Why is this relevant to foundational questions in the philosophy of mind?
• According to Quine, Strawson, and most philosophers, you cannot pick out or track individuals without concepts (sortals)
• But you also cannot pick out individuals with only concepts
  - Sooner or later you have to pick out individuals using nonconceptual causal connections between thoughts and things
• The present proposal is that FINSTs provide the needed nonconceptual mechanism for individuating objects and for tracking their identity, which works most of the time in our kind of world. It relies on a natural constraint (Marr)
• FINST indexes provide the right sort of connection for predicating properties of the world by allowing the arguments of predicates to be bound to objects prior to the predicates being evaluated. They may thus be the basis for early vocabulary learning.

But there must be some properties that cause indexes to be grabbed!
• Of course there are properties that are causally responsible for indexes being grabbed, and also properties (probably different ones) that make it possible for objects to be tracked
• But these properties need not be represented (encoded) and used in tracking
• The distinction between object properties that cause indexes to be assigned and those that are represented (in Object Files) is similar to Kripke’s distinction between properties that are needed to pick out or name an object and those that constitute its meaning

Effect of target properties on MOT
• Changes of target properties are not reported nor even noticed during MOT
• Keeping all targets at different color, size, or shape does not improve tracking
• Observers do not use target speed or direction in tracking (e.g., by anticipating where the targets will be when they reappear after occlusion)

Some open questions
• We have arrived at the view that only properties of selected (indexed) objects enter into subsequent conceptualization and perception-based thought (i.e., only information in object files is made available to cognition). So what happens to the rest of the visual information?
• Visual information seems rich and fine-grained, while this theory only allows for the properties of 4 or 5 objects to be encoded!
  - The present view leaves no room for nonconceptual representations whose content corresponds to the content of conscious experience
  - According to the present view, the only content that

An intriguing possibility…
Maybe the theoretically relevant information we take in is less than (or at least different from) what we experience
  - This possibility has received attention recently with the discovery of various “blindnesses” (e.g., change blindness, inattentional blindness, blindsight…) as well as the discovery of independent vision systems (e.g., recognition and motor control)
  - The qualitative content of conscious experience may not play a role in explanations of cognitive processes
  - Even if unconceptualized information enters into causal processes (e.g., motor control) it may not be represented or made available to the cognitive mind, not even as a nonconceptual representation

Vision science has always been deeply ambivalent about the role of conscious experience
Isn’t how things appear one of the things that our theories must explain? Answer: There is no a priori ‘must explain’!
• The content of subjective experience is a major type of evidence. But it may turn out not to be the most reliable source for inferring the relevant functional states. It competes with other types of evidence.
• How things appear cannot be taken at face value: it carries substantive theoretical assumptions. It also draws on many levels of processing.
  - It was a serious obstacle to early theories of vision (Kepler)
  - It has been a poor guide in the case of theories of mental imagery (e.g., color mixing, image size, image distances). ‘Reading X off an

What next? This picture leaves many unanswered questions, but it does provide a mechanism for solving the binding problem and also explaining how mental representations could have a nonconceptual connection with objects in the world (something required if mental representations are to connect with actions)

• For a copy of these slides see: http://ruccs.rutgers.edu/faculty/pylyshyn/SelectionReference.ppt
• Or MIT Press Paperback

You are now here X But you are also here