- Slides: 16
Image-Language Association: are we looking at the right features? Katerina Pastra Language Technology Applications, Institute for Language and Speech Processing, Athens, Greece
The pervasive digital video context
- IPTV, iTV, file-swapping networks (P2P) (video files & video blogs), video search engines: access to MM content
- Conversational robots, MM presentation systems...: generation of MM content
- Auto-analysis of image-language relations: equivalence, complementarity, independence
Overview
Focus on the semantic equivalence relation = Multimedia Integration = image-language association
- Brief review of state-of-the-art association mechanisms and the feature sets used
- The OntoVis feature set suggestion
- Using OntoVis in the VLEMA prototype
- Prospects for going from 3D to 2D
- Future plans and conclusions
Association Mechanisms in prototypes
Intelligent MM systems, from SHRDLU to the conversational robots of the new millennium (Pastra and Wilks 2004):
- Simulated or manually abstracted visual input is used, to avoid difficulties in image analysis
- Integration resources either hold a priori known associations (e.g. image X on screen is a “ball”) or allow simple inferences (e.g. matching an input image to an object model in the resource, which is in turn linked to a “concept/word”), to avoid difficulties in associating V-L
- Applications are restricted to blocksworlds/miniworlds: scaling issues
Association algorithms
To be embedded in prototypes:
- Probabilistic approaches for learning (e.g. Barnard et al. 2003): use word/phrase + image/image region (f-v vectors); require properly annotated corpora (IBM, Pascal etc.)
- Logic-based approaches (e.g. Dasiopoulou et al. 2004): use feature-augmented ontologies; match low-level image features + leaf nodes. Scaling?
- Use of both approaches reported too (Srikanth et al. 2005)
Feature set used: shape, colour, texture, position, size
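To make the probabilistic idea concrete, here is a minimal Python sketch that estimates p(word | region) from plain co-occurrence counts in an annotated corpus. It is a deliberately simplified stand-in, not the model of Barnard et al. (2003); the toy corpus, the region cluster ids, and the function name `word_given_region` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each image is (region cluster ids, caption words).
# Region ids stand in for clustered feature vectors (shape, colour, texture...).
corpus = [
    (["r_red_round", "r_green_flat"], ["ball", "grass"]),
    (["r_red_round", "r_brown_flat"], ["ball", "floor"]),
    (["r_green_flat", "r_blue_flat"], ["grass", "sky"]),
]


def word_given_region(images):
    """Estimate p(word | region) by normalising co-occurrence counts."""
    counts = defaultdict(Counter)
    for regions, words in images:
        for region in regions:
            for word in words:
                counts[region][word] += 1
    return {
        region: {w: c / sum(wc.values()) for w, c in wc.items()}
        for region, wc in counts.items()
    }


probs = word_given_region(corpus)
# The word most strongly associated with the red round region.
best = max(probs["r_red_round"], key=probs["r_red_round"].get)
```

The sketch also shows why such approaches need properly annotated corpora: the association is only as good as the region/word pairs that co-occur in the training data.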
The quest for the appropriate f-set
Cognitive thesis: No feature set is fully representative of the characteristics of an object, but one may be more or less successful in fixing the reference of the corresponding concept (word).
Constraints in defining an f-set:
- Features must be distinctive of object classes (at the basic level)
- Feature values must be detectable by image analysis modules
The OntoVis suggestion
A domain model: Ontology + KBase for static indoor scenes (sitting rooms in 3D; XI KR language)
Feature set suggested:
- physical structure: the number of parts into which an object is expected to be decomposed in different dimensions
- visually verifiable functionality: visual characteristics an object may have which are related to its function
- interrelations: relative location of objects, relative size
The OntoVis suggestion (figure; axis labels: x, y, z)
OntoVis: KB examples
props(sofa(X), [has_xclusters_moreThan(X, 1)]).   % armchairs?
props(sofa(X), [has_yclusters_equalMoreThan(X, 2)]).
props(sofa(X), [has_yclusters_equalLessThan(X, 4)]).
props(sofa(X), [has_zclusters_equalMoreThan(X, 2)]).   % stools?
props(sofa(X), [has_zclusters_equalLessThan(X, 3)]).
props(sofa(X), [on_floor(X, yes)]).
props(sofa(X), [has_surface(X, yes)]).
props(sofa(X), [size(X, XCLUSTERS)]).
props(chair(X), [has_xclusters(X, 1)]).
props(chair(X), [has_yclusters_equalMoreThan(X, 2)]).
props(chair(X), [has_yclusters_equalLessThan(X, 4)]).
props(chair(X), [has_zclusters_equalMoreThan(X, 2)]).
props(chair(X), [has_zclusters_equalLessThan(X, 3)]).
props(chair(X), [on_floor(X, yes)]).
props(chair(X), [has_surface(X, yes)]).
props(chair(X), [size(X, XCLUSTER_YValue, TableYDIM_UpperConstraint)]).
OntoVis: KB examples
props(table(X), [has_xclusters(X, 1)]).
props(table(X), [has_yclusters(X, 2)]).
props(table(X), [has_zclusters(X, 1)]).
props(table(X), [on_floor(X, yes)]).
props(table(X), [has_surface(X, yes)]).
props(table(X), [size(X, YDIM, XDIM, Relative_to_Room_YXDIM)]).
props(heater(X), [has_xclusters(X, 1)]).
props(heater(X), [has_yclusters(X, 1)]).
props(heater(X), [has_zclusters(X, 1)]).
props(heater(X), [on_wall(X, yes)]).
props(heater(X), [on_floor(X, no)]).
props(heater(X), [has_surface(X, yes)]).
props(heater(X), [size(X, XDIM, YDIM, Relative_to_Wall_YXDIM)]).
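The KB facts above can be read as per-class constraints on cluster counts and support relations. The following Python sketch re-encodes them as ranges and checks a segmented object against each class. It is an illustrative stand-in, not the XI/Prolog implementation: the names `Observed`, `KB`, `satisfies`, and `candidate_names` are hypothetical, and the `size(...)` constraints are left out because the slides do not give their definitions.

```python
from dataclasses import dataclass

INF = float("inf")


@dataclass(frozen=True)
class Observed:
    """Features assumed to be delivered by object segmentation (hypothetical)."""
    xclusters: int
    yclusters: int
    zclusters: int
    on_floor: bool
    on_wall: bool = False
    has_surface: bool = True


# Ranges are inclusive (low, high); booleans must match exactly.
# size(...) constraints from the KB are omitted in this sketch.
KB = {
    "sofa": {"xclusters": (2, INF), "yclusters": (2, 4), "zclusters": (2, 3),
             "on_floor": True, "has_surface": True},
    "chair": {"xclusters": (1, 1), "yclusters": (2, 4), "zclusters": (2, 3),
              "on_floor": True, "has_surface": True},
    "table": {"xclusters": (1, 1), "yclusters": (2, 2), "zclusters": (1, 1),
              "on_floor": True, "has_surface": True},
    "heater": {"xclusters": (1, 1), "yclusters": (1, 1), "zclusters": (1, 1),
               "on_wall": True, "on_floor": False, "has_surface": True},
}


def satisfies(obs, constraints):
    """True if every constraint of one object class holds for the observation."""
    for feature, expected in constraints.items():
        value = getattr(obs, feature)
        if isinstance(expected, tuple):
            low, high = expected
            if not (low <= value <= high):
                return False
        elif value != expected:
            return False
    return True


def candidate_names(obs):
    """All object classes whose constraints the observation satisfies."""
    return [name for name, constraints in KB.items() if satisfies(obs, constraints)]


three_seater = Observed(xclusters=3, yclusters=2, zclusters=2, on_floor=True)
radiator = Observed(xclusters=1, yclusters=1, zclusters=1, on_floor=False, on_wall=True)
```

With these ranges, `candidate_names(three_seater)` yields only the sofa class and `candidate_names(radiator)` only the heater class. The number of x-clusters is what separates a sofa from a chair here.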
OntoVis f-set advantages
- It generalizes over differences in visual appearance (e.g. different styles of sofas)
- It goes beyond viewpoint (view angle + distance) differences
- It can be used to reason on object identity by analogy (e.g. to describe “sofa-like” objects when the identification is not certain)
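One way to picture reasoning by analogy is graded matching: count how many of a class's constraints an object satisfies and fall back to an “X-like” description when the match is only partial. The Python sketch below does this for the sofa and chair constraints from the KB examples. The scoring by fraction of satisfied constraints, the 0.6 threshold, and the names `similarity` and `describe` are assumptions made for illustration, not the mechanism used in VLEMA.

```python
# Constraints per class as predicates over a dict of observed features.
# Taken from the sofa/chair KB examples; size(...) constraints are omitted.
KB = {
    "sofa": [
        lambda o: o["xclusters"] > 1,
        lambda o: 2 <= o["yclusters"] <= 4,
        lambda o: 2 <= o["zclusters"] <= 3,
        lambda o: o["on_floor"],
        lambda o: o["has_surface"],
    ],
    "chair": [
        lambda o: o["xclusters"] == 1,
        lambda o: 2 <= o["yclusters"] <= 4,
        lambda o: 2 <= o["zclusters"] <= 3,
        lambda o: o["on_floor"],
        lambda o: o["has_surface"],
    ],
}


def similarity(obs, constraints):
    """Fraction of a class's constraints that the observation satisfies."""
    return sum(1 for check in constraints if check(obs)) / len(constraints)


def describe(obs, threshold=0.6):
    """Name the best-matching class, hedging when the match is partial."""
    name, score = max(
        ((n, similarity(obs, cs)) for n, cs in KB.items()),
        key=lambda pair: pair[1],
    )
    if score == 1.0:
        return f"a {name}"
    if score >= threshold:
        return f"a {name}-like object"
    return "an unidentified object"


# A three-part object on the floor whose seating surface was not detected.
partial = {"xclusters": 3, "yclusters": 2, "zclusters": 2,
           "on_floor": True, "has_surface": False}
full = dict(partial, has_surface=True)
```

Here `describe(partial)` gives "a sofa-like object" (4 of 5 sofa constraints hold) and `describe(full)` gives "a sofa".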
Using OntoVis
VLEMA: A Vision-Language intEgration MechAnism
- Input: automatically reconstructed static scenes in 3D (VRML format) from RESOLV (robot-surveyor)
- Integration task: medium translation from images (3D sitting rooms) to text (what and where, in English)
- Domain: estates surveillance
- Horizontal prototype
- Implemented in shell programming and Prolog
System Architecture
Diagram components: OntoVis + KB, Data Transformations, Object Segmentation, Object Naming, Description
Example output: “…a heater … and a sofa with 3 seats…”
The Output
Wed Jul 7 13:22 GMTDT 2004
VLEMA V1.0
Katerina [email protected] of Sheffield
Description of the automatically constructed VRML file “development-scene.wrl”
This is a general view of a room. We can see the front wall, the left-side wall, the floor, A heater on the lower part of the front-wall and a sofa with 3 seats. The heater is shorter in length than the sofa. It is on the right of the sofa.
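The last two sentences of the output express interrelations (relative size and relative location) between named objects. A minimal Python sketch of how such sentences could be derived from object extents is given below. The `NamedObject` type, the coordinates, and the convention that larger x values lie further to the viewer's right are all assumptions; the slides do not show how VLEMA computes these relations.

```python
from dataclasses import dataclass


@dataclass
class NamedObject:
    """A named object with its extent along the viewer's left-to-right axis (assumed)."""
    name: str
    x_min: float
    x_max: float

    @property
    def length(self):
        return self.x_max - self.x_min


def compare(a, b):
    """Describe object a relative to object b: relative length, then relative location."""
    sentences = []
    if a.length < b.length:
        sentences.append(f"The {a.name} is shorter in length than the {b.name}.")
    elif a.length > b.length:
        sentences.append(f"The {a.name} is longer in length than the {b.name}.")

    # Use a pronoun only if a previous sentence already introduced the subject.
    subject = "It" if sentences else f"The {a.name}"
    if a.x_min >= b.x_max:
        sentences.append(f"{subject} is on the right of the {b.name}.")
    elif a.x_max <= b.x_min:
        sentences.append(f"{subject} is on the left of the {b.name}.")
    return " ".join(sentences)


# Hypothetical extents, chosen to mirror the scene described on the slide.
sofa = NamedObject("sofa", x_min=0.5, x_max=2.6)
heater = NamedObject("heater", x_min=3.0, x_max=3.8)
text = compare(heater, sofa)
```

With these made-up extents, `text` reproduces the two relational sentences from the slide. Overlapping extents produce no location sentence, since neither "left of" nor "right of" clearly applies.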
Future Plans & Conclusions
- Extension of OntoVis and testing in VRML worlds
- Modular description of clusters/parts (not relying just on their number in each dimension)
- Exploration of the portability of the f-set to 2D images. Initial signs of feasibility: cf. research on detecting spatial relations in 2D, structure identification in 2D, and algorithms for 3D reconstruction from photographs
Complementary or alternative to current approaches? To what extent scalable even in 3D?
There are indications of OntoVis scalability & feasibility that are worth further exploration