Audio-visual source localization and tracking using a network of neural oscillators
Stuart Wrigley and Guy Brown
Speech and Hearing Research Group, Department of Computer Science, University of Sheffield
Introduction
Goal
• Localisation and tracking of static and moving sound sources using both audio and video cues
Objectives
• Audio-based localisation using binaural auditory models (data from manikin)
• Video-based localisation using simple frame-based analyses (data from the 3 fixed cameras)
• Integration of the AV localisation and tracking techniques
Example Application
• Video conferencing: the estimated speaker location could be used to determine which video stream to send to the remote participant. Alternatively, it could be used to drive a pan-tilt USB camera.
Video segmentation
• Objective: locate regions of the frame which contain faces and/or torsos.
• We have a relatively unchanging environment: cameras are stationary, lighting is consistent.
• Therefore, begin with extremely simple techniques for:
  – Object detection
  – Motion detection
  – Face detection
Video segmentation: object and motion detection
• Calculate the difference between the current frame and either a reference frame (object detection) or the previous frame (motion detection); a sketch follows below.
• Produce contiguous regions (greater than a minimum size).
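As an illustration of this step, a minimal frame-differencing sketch in Python; the greyscale input, the difference threshold and the minimum region size are illustrative assumptions rather than values from the slides.

    import numpy as np
    from scipy import ndimage

    def difference_regions(frame, reference, diff_thresh=25, min_size=200):
        """Binary mask of contiguous regions where `frame` differs from `reference`
        (a fixed background frame for object detection, or the previous frame for
        motion detection). Frames are greyscale 2-D arrays."""
        diff = np.abs(frame.astype(np.int16) - reference.astype(np.int16))
        changed = diff > diff_thresh                      # per-pixel change mask
        labels, n = ndimage.label(changed)                # contiguous regions
        keep = np.zeros_like(changed)
        for i in range(1, n + 1):
            region = labels == i
            if region.sum() >= min_size:                  # discard small regions
                keep |= region
        return keep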
Video segmentation: face detection
• An RGB pixel is classified as skin if (Solina et al., 2003; a code sketch follows below):
  R > 95 && G > 40 && B > 20 (ensures a fair complexion)
  && (max(R,G,B) - min(R,G,B)) > 15 (eliminates grey)
  && abs(R-G) > 15 && R > G && R > B (red component must be the largest)
• Areas are constrained to be oval and larger than a given number of pixels.
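The rule above translates directly into code; a minimal sketch assuming 8-bit RGB values.

    def is_skin(r, g, b):
        """Skin test for one 8-bit RGB pixel (rule after Solina et al., 2003)."""
        return (r > 95 and g > 40 and b > 20                  # sufficiently bright channels
                and (max(r, g, b) - min(r, g, b)) > 15        # enough saturation: not grey
                and abs(r - g) > 15 and r > g and r > b)      # red is the dominant component

Candidate skin regions would then be filtered by shape (roughly oval) and minimum size, as on the slide.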
Audio localisation
• Cochlear filtering by 128 gammatone filters, centre frequencies equally spaced on the ERB scale, 50 Hz - 8 kHz.
• Auditory nerve firing rate is approximated by half-wave rectifying and square-root compressing the output of each filter (a sketch of this peripheral stage follows below).
• Interaural time difference (ITD) is a major localisation cue used by the human auditory system.
• In the median plane, listeners can detect ITDs of 10-15 µs: roughly equivalent to 1°-5°.
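A sketch of this peripheral processing stage; the gammatone impulse response and ERB-rate spacing follow standard formulations, but the exact constants, filter order and impulse-response length are assumptions rather than details taken from the slides.

    import numpy as np

    def erb_rate(f):
        """ERB-rate (Glasberg & Moore) of frequency f in Hz."""
        return 21.4 * np.log10(4.37e-3 * f + 1.0)

    def erb_space(lo, hi, n):
        """n centre frequencies equally spaced on the ERB-rate scale between lo and hi (Hz)."""
        rates = np.linspace(erb_rate(lo), erb_rate(hi), n)
        return (10.0 ** (rates / 21.4) - 1.0) / 4.37e-3

    def gammatone_ir(cf, fs, duration=0.025, order=4):
        """Impulse response of a 4th-order gammatone filter centred on cf (Hz)."""
        t = np.arange(0.0, duration, 1.0 / fs)
        bw = 1.019 * 24.7 * (4.37e-3 * cf + 1.0)      # filter bandwidth from the ERB
        return t ** (order - 1) * np.exp(-2 * np.pi * bw * t) * np.cos(2 * np.pi * cf * t)

    def auditory_nerve_rate(signal, fs, n_channels=128, lo=50.0, hi=8000.0):
        """Approximate auditory-nerve firing rate per channel: gammatone filtering,
        half-wave rectification, then square-root compression."""
        out = []
        for cf in erb_space(lo, hi, n_channels):
            band = np.convolve(signal, gammatone_ir(cf, fs), mode="same")
            out.append(np.sqrt(np.maximum(band, 0.0)))
        return np.array(out)                           # shape: (n_channels, n_samples)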
Audio localisation
• The conventional technique for estimating the lateralisation of a signal is to calculate a cross-correlation function using the left and right channels (a sketch follows below).
• This technique can be considered equivalent to the neural coincidence model of Jeffress (1948).
• A precomputed ITD-to-azimuth mapping is used to calculate the signal's lateralisation in degrees.
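A minimal sketch of the cross-correlation step for one frequency channel; the ±1 ms lag range is an illustrative assumption, and the final ITD-to-azimuth conversion (done in the model via the precomputed mapping) is only indicated by a comment.

    import numpy as np

    def estimate_itd(left, right, fs, max_itd=1e-3):
        """Estimate the interaural time difference (seconds) of a channel by
        finding the lag that maximises the left/right cross-correlation --
        the signal-processing equivalent of the Jeffress coincidence model.
        A positive lag means the left signal lags the right (source to the right)."""
        max_lag = int(max_itd * fs)
        cc = np.correlate(left, right, mode="full")   # zero lag at index len(right) - 1
        centre = len(right) - 1
        window = cc[centre - max_lag: centre + max_lag + 1]
        lag = int(np.argmax(window)) - max_lag
        # the model then converts this ITD to azimuth via a precomputed lookup
        return lag / fs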
Audio localisation
• If a sound originates from the right side, it reaches the right ear earlier than the left.
• The signals from the right travel further along the delay line than those from the left before coincidence occurs.
(Figure: coincidence delay line with azimuth axis from -90° to 90°.)
Audio-Visual Model
(Diagram: motion, face and object regions feed the video segmentation network; audio azimuth estimates from -90° to 90° feed the audio segmentation network; together they give the audio-visual activity location.)
Oscillatory correlation framework
• The oscillatory correlation theory (Wang, 1996) suggests that neural oscillations are responsible for encoding the 'link' between features.
• A possible solution to the Binding Problem.
(Diagram: oscillator groups labelled person 1 speech, person 2 speech, person 1 face, person 3 face.)
Relaxation oscillators
• A reciprocally connected excitatory unit and inhibitory unit whose activities are represented by x and y (a sketch follows below).
(Diagram: external input drives x; x excites y and y inhibits x. Plot: x activity when stimulated.)
• Conceptually, an oscillator can represent
  – the mean activity of a population of neurons, or
  – the behaviour of a single neuron's membrane potential and ion channels.
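The x/y dynamics described above are commonly modelled as a Terman-Wang relaxation oscillator; a minimal sketch with Euler integration, where the parameter values and noise level are illustrative assumptions rather than the model's actual settings.

    import numpy as np

    def simulate_oscillator(I, steps=20000, dt=0.005, eps=0.02, gamma=6.0, beta=0.1):
        """Terman-Wang relaxation oscillator: excitatory unit x, inhibitory unit y.
        I is the external input; the unit only oscillates when stimulated (I > 0)."""
        x, y = -2.0, 0.0
        xs = np.empty(steps)
        for t in range(steps):
            noise = 0.01 * np.random.randn()
            dx = 3.0 * x - x ** 3 + 2.0 - y + I + noise          # fast (cubic) variable
            dy = eps * (gamma * (1.0 + np.tanh(x / beta)) - y)   # slow recovery variable
            x, y = x + dt * dx, y + dt * dy
            xs[t] = x
        return xs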
Neural networks
• Video network: a 72 x 58 grid of neural oscillators. Excitatory connections are placed between stimulated neighbouring nodes.
• Audio network: 181 neural oscillators. Each node corresponds to a particular audio azimuth from -90° to 90°.
• Each oscillator feeds excitatory input to the global inhibitor (GI). The GI, in turn, feeds inhibitory input back to each oscillator. This ensures that only one block of synchronised oscillators can be active at any one time (a sketch of the coupling and global inhibitor follows below).
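A sketch of the oscillator coupling and global inhibitor in the spirit of the standard LEGION formulation (Terman and Wang); the thresholds, weights and rate constants are illustrative assumptions.

    import numpy as np

    def coupling_term(x, weights, z, theta_x=-0.5, theta_z=0.1, w_z=1.5):
        """Net input to each oscillator: local excitation from active neighbours
        minus inhibition from the global inhibitor z.
        `x` is the vector of oscillator activities, `weights` the connection matrix."""
        active = (x > theta_x).astype(float)            # which neighbours are active
        return weights @ active - w_z * (z > theta_z)   # excitation minus global inhibition

    def update_global_inhibitor(z, x, dt=0.005, phi=3.0, theta_x=-0.5):
        """The global inhibitor relaxes towards 1 if any oscillator is active, else 0."""
        target = 1.0 if np.any(x > theta_x) else 0.0
        return z + dt * phi * (target - z)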
Video network: input
• Per pixel, the binary motion, face and object maps are combined: if at least two features are present, the binary video input is ON.
• Oscillator input: 0.2 when the video input is ON, -5.0 when OFF (see the sketch below).
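A sketch of the per-pixel input rule, assuming three binary feature maps the same size as the oscillator grid.

    import numpy as np

    def oscillator_input(motion, face, obj, on=0.2, off=-5.0):
        """Per pixel: the oscillator receives the ON input (0.2) if at least two of
        the three binary feature maps agree, and the OFF input (-5.0) otherwise."""
        votes = motion.astype(int) + face.astype(int) + obj.astype(int)
        return np.where(votes >= 2, on, off)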
Video network: connections
• Each node has a maximum total value for its incoming connection weights.
• This total is shared equally between the four nearest active neighbours (where possible); a sketch follows below.
• The distance of connections is only 1 unit.
• Hence this type of network is called a locally excitatory, globally inhibitory oscillator network (LEGION).
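A sketch of the local excitatory connection scheme; the total incoming weight `w_total` is an illustrative assumption.

    import numpy as np

    def local_excitatory_weights(stimulated, w_total=8.0):
        """Return a dict mapping (node, neighbour) -> weight for a LEGION grid.
        `stimulated` is a 2-D boolean array; connections only link stimulated
        distance-1 neighbours, and each node's total incoming weight is shared
        equally among its active neighbours."""
        rows, cols = stimulated.shape
        weights = {}
        for r in range(rows):
            for c in range(cols):
                if not stimulated[r, c]:
                    continue
                neighbours = [(r + dr, c + dc)
                              for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                              if 0 <= r + dr < rows and 0 <= c + dc < cols
                              and stimulated[r + dr, c + dc]]
                for nb in neighbours:
                    weights[((r, c), nb)] = w_total / len(neighbours)
        return weights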
Video network: behaviour
• All red nodes are automatically interconnected, as are blue nodes.
• Noise in the system and the global inhibitor mean that these two groups segregate.
Audio-Visual mapping
• The camera introduces image distortion and does not provide a 180° field of view.
• A Hebbian learning phase is used to learn a mapping between audio azimuth activity and activity in a particular range of video frame columns (see the sketch below).
• Training data consists of a subject speaking at 10° intervals around the manikin whilst video is recorded.
• The A-V mapping determines the connection weights between nodes in the video network and nodes in the audio network.
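A sketch of the Hebbian learning phase, assuming paired activity vectors from the 181 audio azimuth nodes and the 72 video columns; the learning rate and the row normalisation are illustrative assumptions.

    import numpy as np

    N_AZIMUTH, N_COLUMNS = 181, 72

    def learn_av_mapping(training_pairs, eta=0.05):
        """Hebbian learning of the audio-to-video mapping.
        `training_pairs` is an iterable of (audio_activity, video_column_activity)
        vectors recorded while a subject speaks at known azimuths; weights grow
        where audio and video activity co-occur."""
        W = np.zeros((N_AZIMUTH, N_COLUMNS))
        for audio, video in training_pairs:
            W += eta * np.outer(audio, video)          # Hebbian co-activation update
        # normalise each azimuth node's weights so they sum to 1 (illustrative choice)
        row_sums = W.sum(axis=1, keepdims=True)
        return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)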
Audio-Visual connections
• Each audio node is connected to all nodes in one or more video columns (subject to the A-V mapping); a sketch follows below.
(Diagram: connections between the audio network and columns of the video network.)
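A sketch of how a learned mapping `W` (as in the Hebbian sketch above) might be turned into audio-to-video-column connections; the threshold is hypothetical, not a value from the slides.

    def av_connections(W, threshold=0.05):
        """Return a list of (azimuth_node, video_column) pairs to connect.
        An audio node is linked to every node in a video column whose learned
        mapping weight exceeds the threshold."""
        return [(a, col) for a in range(W.shape[0])
                for col in range(W.shape[1]) if W[a, col] > threshold]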
Network output
• Active nodes exhibit oscillating output.
• Segregation takes a finite duration (in oscillator time) to occur.
• Nodes are said to be 'grouped' if their activities are temporally synchronised.
• Segregated groups will be active at different times.
(Plot: oscillator outputs for person 1 speech, person 2 speech, person 1 face and person 3 face.)
Segmentation results (trust me!)
• Tested on a special recording made in June 2004.
• The network successfully groups video and audio activity when at the same position, and segregates incongruous audio and video data.
(Figure, consistent A-V at oscillator time t: video activity and audio activity occur at the same time, so they are grouped.)
(Figure, inconsistent A-V at oscillator time t+n: video activity and audio activity occur at different times, so they are segregated.)
Future work
• The video feature of motion can be used to enhance the reliability of the audio azimuth estimates.
  – Initially, the amount of motion could simply be used to control the degree of smoothing applied to azimuth estimates over the n previous and subsequent time frames: high motion = low smoothing (a minimal sketch follows below).
  – Ultimately, the video motion information could be integrated into the azimuth estimation algorithm and used to determine the degree of temporal integration for particular azimuth ranges.
• Work is also concentrating on employing attentional processes within the oscillator networks, to investigate physiologically plausible tracking behaviour and competition between segregated sources.
• Looking at discriminating speaker distance from audio cues.
• A formal evaluation metric.
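Purely as a sketch of the first sub-bullet above (this is proposed future work, not an implemented part of the model): the averaging window over which azimuth estimates are smoothed shrinks as the amount of detected motion grows.

    import numpy as np

    def smoothed_azimuth(azimuths, motion_amount, max_window=15):
        """Smooth a sequence of per-frame azimuth estimates (degrees).
        `motion_amount` holds per-frame values in [0, 1]; high motion gives a
        short window (little smoothing), low motion a long window (heavy smoothing)."""
        azimuths = np.asarray(azimuths, dtype=float)
        out = np.empty_like(azimuths)
        for t in range(len(azimuths)):
            half = int(round((1.0 - motion_amount[t]) * max_window))   # window half-width
            lo, hi = max(0, t - half), min(len(azimuths), t + half + 1)
            out[t] = azimuths[lo:hi].mean()
        return out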