The TREC Interactive Video Track and Content-based Retrieval from Digital Video
Rong Yan
http://www.cs.cmu.edu/~christel/MM2002/syllabus.htm
State-of-the-art Multimedia Search Engines
• Recalling Homework 1 and Homework 2
• Better for simple concepts, e.g. two people kissing, a picture of a giraffe
• Don't work for complex queries, e.g. a picture of a brick home with black shutters and white pillars, with a pickup truck in front of it (image)
© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann, Carnegie Mellon
Examples
• Find pictures of a giraffe
  • Keyword: giraffe
  • http://images.google.com/images?hl=en&lr=lang_zh-CN%7Clang_en&ie=UTF-8&oe=UTF-8&q=giraffe+
• A picture of a brick home with black shutters and white pillars, with a pickup truck in front of it (image)
  • Keywords: brick home shutters
  • http://images.google.com/images?hl=en&lr=lang_zh-CN%7Clang_en&ie=UTF-8&oe=UTF-8&q=brick+home+shutters+
Why does this happen?
• Most of these search engines are keyword-based
  • "False" multimedia search engines
  • You have to represent your idea in keywords
  • These keywords are expected to appear in the filename or the corresponding webpage
• Therefore…
  • Unable to handle the semantic meaning of images
  • Unable to handle visual position
  • Unable to handle time information
  • Unable to use images as the query
  • …
Solution
• Excerpted from your homework:
  • "…I found that the Google Image Search was not as good as expected. Altavista was the more useful multimedia search engine. However, most of them just did a search based on the filename or the matching keywords within the site it was located. I think it would be great to have a multimedia search engine intelligent enough to associate its own keywords based on what's in the image."
Solution
• Our solution: Content-based Information Retrieval (CBIR)
• Mainly video retrieval in this lecture
Content-based Video Retrieval
• Application
• Implementation
• Effort on the TREC 02 video track
  • Feature Extraction Task (high-level semantic features)
  • Manual Retrieval Task (one-run retrieval)
  • Interactive Retrieval Task (multiple runs with feedback)
  • Results & demo
• Conclusion
Application
• Increasing demand for visual information retrieval
  • Retrieve useful information from databases
  • Share and distribute video data through computer networks
• Example: BBC
  • The BBC archive handles 500k+ queries and 1M new items per year
  • Sample queries from the BBC:
    • Police car with blue light flashing
    • Government plan to improve reading standards
    • Two-shot of Kenneth Clarke and William Hague
Application (cont.)
• Video surveillance
  • Find where else a person appears
• Experience-on-demand
  • Help to remember previous events
  • Provide useful information while traveling
  • Equipment on cars retrieves useful multimedia information according to your location/preferences
• …
• Video content is plentiful, it is now available digitally, and we can work on it directly
Typical Retrieval Framework
• User: provides query information that represents their information need
• Database: stores a large collection of video data
• Goal: find the most relevant shots in the database
• Shot: a "paragraph" of video, typically 20-40 seconds, which is the basic unit of video retrieval
Sample Query
• Text: Find pictures of George Washington
• Image: (example image)
• Video: (example video clip)
Bridging the Gap
(Diagram: Video Database ↔ User, producing a Result)
Automatically Structuring Video Data
• The first step for video retrieval: video "programmes" are structured into logical scenes and physical shots
• If dealing with text, the structure is obvious: paragraph, section, topic, page, etc.
• All text-based indexing, retrieval, linking, etc. builds upon this structure
• Automatic shot boundary detection and selection of representative keyframes is usually the first step
Typical automatic structuring of video
(Diagram: a video document → a set of shots → a keyframe browser combined with transcript- or object-based search)
Bridging the Gap
(Diagram: the user's Information Need is matched against the Video Database via its Video Structure, producing a Result)
Ideal Solution
(Diagram: the system understands the semantic meaning of both the user's information need and the video database, and retrieves accordingly)
Ideal Solution
However:
1. It is hard to represent the query in natural language, and hard for the computer to understand it
2. Computers have no experience of the world
3. Other representation restrictions, such as position and time
Alternative Solution
(Diagram: the user provides evidence of relevant information — text, image, audio — which the system matches against the video database and combines into a result)
Evidence-based Retrieval System
• General framework for current video retrieval systems
• Video retrieval based on evidence from both users and the database, including:
  • Text information
  • Image information
  • Motion information
  • Audio information
• Return a relevance score for each piece of evidence
• Combine the scores
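The "combine the scores" step can be sketched as a simple weighted linear fusion; the modality names, shot ids, and weights below are illustrative stand-ins, not values from any actual system:

```python
def combine_scores(modality_scores, weights):
    """Linearly combine per-shot relevance scores from several modalities.

    modality_scores: {modality: {shot_id: score}}
    weights:         {modality: weight}
    A shot missing from some modality contributes 0 for that modality.
    """
    combined = {}
    for modality, scores in modality_scores.items():
        w = weights.get(modality, 0.0)
        for shot, score in scores.items():
            combined[shot] = combined.get(shot, 0.0) + w * score
    # Rank shots by combined score, best first
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores for two shots from two modalities
ranked = combine_scores(
    {"text": {"shot1": 0.9, "shot2": 0.2}, "image": {"shot2": 0.8}},
    {"text": 0.5, "image": 0.5},
)
```

In practice the weights would be tuned on held-out topics; linear fusion is only one of many possible combination strategies.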
Keyword-based System
(Diagram: keywords come from automatic annotation, including the filename, video title, caption, and related web pages)
Keyword-based System
(Diagram: keywords can also come from manual annotation)
Manual Annotation
• Manually creating annotations/keywords for image and video data
• Example: Gettyimages.com (image retrieval)
• Pros:
  • Represents the semantic meaning of the video
• Cons:
  • Time-consuming and labor-intensive
  • Keywords are not enough to represent the information need
Speech and OCR Transcription
(Diagram: in addition to annotation keywords, text evidence comes from speech transcription and OCR transcription)
Query Using Speech/OCR Information
• Query: Find pictures of Harry Hertz, Director of the National Quality Program, NIST
• Speech: "We're looking for people that have a broad range of expertise, that have business knowledge, that have knowledge on quality management, on quality improvement, and in particular …"
• OCR (noisy): "H, arry Hertz a Director aro 7 wa, i, , ty Program , Harry Hertz a Director"
What Do We Lack?
(Diagram: text evidence alone — annotation keywords, speech transcription, OCR transcription — misses the image information)
Image-based Retrieval
(Diagram: alongside text information and keywords, image features are matched against the query images)
Global Low-level Image Features
• Color-based features
  • Color histogram
  • Color percentage
  • Color correlogram
  • Color moments
• Texture-based features
  • Gabor filters
  • Wavelets
• Shape/structure features
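The first of these, the color histogram, can be sketched in a few lines of NumPy; the 8-bins-per-channel quantization below is an illustrative choice, not a value from the lecture:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels, count pixels per
    (r, g, b) bin, and normalize to a probability distribution."""
    # image: H x W x 3 array of uint8 values in [0, 255]
    quantized = (image.astype(np.int64) * bins) // 256  # per-channel bin index
    # Fold the three channel indices into a single bin id per pixel
    bin_ids = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(bin_ids.ravel(), minlength=bins ** 3).astype(float)
    return hist / hist.sum()
```

The resulting 512-dimensional vector is one example of the kind of global feature vector used for similarity search.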
Regional Low-level Image Features
• Segment the image into objects
• Extract low-level features from each region
Image Search
• Feature representation
  • Image: represented as a series of real numbers, i.e. a vector of features (f1, …, fn)
• Distance function: the distance between two vectors, typically Euclidean distance
• We believe "nearest is relevant"
  • The nearest images in the database are relevant to the query image
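A minimal sketch of this "nearest is relevant" search, assuming the features have already been extracted into NumPy vectors:

```python
import numpy as np

def nearest_images(query, database, k=3):
    """Rank database feature vectors by Euclidean distance to the query;
    smaller distance = more relevant ('nearest is relevant')."""
    diffs = database - query                   # broadcast over all rows
    dists = np.sqrt((diffs ** 2).sum(axis=1))  # one distance per image
    order = np.argsort(dists)                  # ascending: nearest first
    return order[:k], dists[order[:k]]
```

Real systems replace this exhaustive scan with an index structure, but the ranking principle is the same.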
Finding Similar Images
(Example results shown)
But…
• Low-level features don't work in all cases
High-level Image Features
• Objects: persons, roads, cars, skies…
• Scenes: indoors, outdoors, cityscape, landscape, water, office, factory…
• Events: parade, explosion, picnic, playing soccer…
• Generated from low-level features
Image-based Retrieval
(Diagram: image features split into low-level and high-level features, matched against the query images alongside the text information)
More Evidence in Video Retrieval
(Diagram: the evidence now includes text information (keywords), image information (query images), motion information, and audio information)
Combination of Multi-modal Results
• Different modalities have different characteristics
  • Text-based information: better for middle- and high-level queries
    • e.g. find the video clip of dancing women wearing dresses
  • Image-based information: better for low- and middle-level queries
    • e.g. find the video clip of green trees
• Combine the multi-modal information
Other Useful Techniques
• Query expansion
• Cross-modal relations
• Relevance feedback
Recap
• Video retrieval bridges the gap between the user's information need and the video database
• Multi-modal evidence:
  • Text-based (most popular)
  • Image-based
  • Motion-based
  • Audio-based
• Combination of the evidence
Content-based Video Retrieval
• Application
• Implementation
• TREC video track
  • Feature Extraction Task (high-level semantic features)
  • Manual Retrieval Task (one-run retrieval)
  • Interactive Retrieval Task (multiple runs with feedback)
  • Results & demo
• Conclusion
Introduction to the TREC Video Retrieval Track
• Full name: Text REtrieval Conference (TREC)
• TREC Video Track web site: http://www-nlpir.nist.gov/projects/trecvid/
• The TREC series is sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies
• Its goal is to encourage research in information retrieval
Introduction to the TREC Video Retrieval Track
• The Video Retrieval Track started in 2001
• Its goal is to investigate content-based retrieval from digital video
• Focus on the shot as the unit of information retrieval, rather than the scene or story/segment/clip
• The current state-of-the-art video retrieval competition
• 17 active participants, including groups from CMU, IBM Research, Microsoft Research Asia, MediaMill, LIMSI, and Dublin City University
Main Tasks in TREC
• Shot boundary detection
• Semantic feature extraction task
• Video retrieval task
  • Manual retrieval: a human formulates a query, which is then run automatically against the collection
  • Interactive retrieval: full human access and feedback
Where Are They?
(Diagram: the retrieval task spans the whole pipeline; shot boundary detection produces the video structure, and feature extraction produces the low-level and high-level image features)
Video Data
• Difficult to get video data for use in TREC because of copyright
• Used mainly the Internet Archive
  • Advertising, educational, industrial, and amateur films, 1930-1970
  • Produced by corporations, non-profit organisations, trade groups, etc.
  • Noisy, strange color, but real archive data
• 73.3 hours, partitioned as follows:
Shot Boundary Detection
• A fundamental primitive of most (if not all) work in content-based video retrieval
• (Diagram: a video document → a set of video shots)
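A toy cut detector in this spirit compares grayscale histograms of consecutive frames and flags a boundary when they differ sharply; the bin count and threshold below are illustrative, not parameters from any TREC system:

```python
import numpy as np

def detect_cuts(frames, bins=16, threshold=0.5):
    """Flag a shot boundary between consecutive frames whose grayscale
    histograms differ by more than `threshold` (L1 distance between
    normalized histograms, so the value lies in [0, 2])."""
    cuts = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            cuts.append(i)  # boundary between frame i-1 and frame i
        prev_hist = hist
    return cuts
```

Production detectors also handle gradual transitions (fades, dissolves), which this hard-cut heuristic misses.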
Feature Extraction
• Extract high-level semantic features from video
• Assign a video clip to one or more of several categories
• Example high-level features: cityscape, lake, trees, water, sky
Feature Extraction
• Interesting in itself, but its importance increases when it serves to help video navigation and search
• Benefits:
  • Retrieval: find video from a particular class
  • Filtering: remove irrelevant and distracting categories from summaries and visualizations
The Features
• Face: clip contains at least one human face, with the nose, mouth, and both eyes visible. Pictures of a face meeting the above conditions count
• People: clip contains a group of two or more humans, each of whom is at least partially visible and recognizable as a human
• On-screen text: clip contains superimposed text large enough to be read
The Features
• Indoor: clip contains a recognizably indoor location, i.e. inside a building
• Outdoor: clip contains a recognizably outdoor location, i.e. one outside of buildings
• Cityscape: clip contains a recognizable city/urban/suburban setting
• Landscape: clip contains a predominantly natural inland setting, i.e. one with little or no evidence of development by humans. Scenes with bodies of water that are clearly inland may be included
Non-Video (Audio) Features
• Speech: a human voice uttering words is recognizable as such in this segment
• Instrumental: sound produced by one or more musical instruments is recognizable as such in this segment
• Monologue: segment contains an event in which a single person is at least partially visible and speaks for a long time without interruption by another speaker. Short pauses are OK
TREC 02 Results
(Chart: average precision for each of the ten features — Outdoors, Indoors, Face, People, Cityscape, Landscape, Text overlay, Speech, Instrumental, Monologue — for each submitted run: CMU_r1, A_CMU_r2, CLIPS-LIT_GEOD, CLIPS-LIT-LIMSI, DCUFE2002, Eurecom1, Fudan_FE_Sys2, IBM-1, IBM-2, MediaMill1, MediaMill2, MSRA, UnivO_MT1, UnivO_MT2, plus a random baseline)
Video Search Task
• The most important task and the final goal
• Manual & interactive search tasks
Queries for the 2002 TREC Video Track
• Specific item or person
  • Eddie Rickenbacker, James Chandler, George Washington, Golden Gate Bridge, Price Tower in Bartlesville, OK
• Specific fact
  • Arch in Washington Square Park in NYC, map of the continental US
• Instances of a category
  • Football players, overhead views of cities, one or more women standing in long dresses
• Instances of events/activities
  • People spending leisure time at the beach, one or more musicians with audible music, a crowd walking in an urban environment, a locomotive approaching the viewer
Sample Query
• XML representation:

<!DOCTYPE videoTopic SYSTEM "videoTopics.dtd">
<videoTopic num="077">
  <textDescription text="Find pictures of George Washington" />
  <imageExample src="http://www.cia.gov/csi/monograph/firstln/955pres2.gif" desc="face" />
  <videoExample src="01681.mpg" start="09m25.938s" stop="09m29.308s" desc="face" />
</videoTopic>
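A topic in this format can be read with a standard XML parser; the sketch below assumes the element and attribute names shown in the sample and omits the DTD declaration:

```python
import xml.etree.ElementTree as ET

# A topic in the TREC video-topic style from the sample above
topic_xml = """
<videoTopic num="077">
  <textDescription text="Find pictures of George Washington" />
  <videoExample src="01681.mpg" start="09m25.938s" stop="09m29.308s" desc="face" />
</videoTopic>
"""

root = ET.fromstring(topic_xml)
topic_num = root.get("num")
text_query = root.find("textDescription").get("text")
# Collect (source file, start time, stop time) for each video example
examples = [(e.get("src"), e.get("start"), e.get("stop"))
            for e in root.findall("videoExample")]
```

A retrieval system would feed `text_query` to its text agents and the example clips to its image/video agents.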
Sample Query
• Text: Find pictures of George Washington
• Image: (example image)
• Video: (example video clip)
Evaluation Metric
• Goal: maximize the mean average precision (MAP)
• Result set limited to 100 shots
• Precision = (# relevant shots retrieved) / (total # shots retrieved)
• Average precision: compute the precision after each retrieved relevant shot, then average these precisions over the total number of relevant shots in the collection for that topic
• Submitting the maximum number of shots per result set can never lower the average precision for that submission
• Mean average precision = the mean of the average precision measures over all topics
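The definition above can be made concrete in a few lines; this is a sketch of the standard TREC-style computation (dividing by the number of relevant shots in the collection, so that unretrieved relevant shots contribute zero), not NIST's evaluation code:

```python
def average_precision(ranked_shots, relevant, total_relevant=None):
    """Average precision for one topic: precision is computed at the rank
    of each retrieved relevant shot, and the sum is divided by the number
    of relevant shots in the collection."""
    if total_relevant is None:
        total_relevant = len(relevant)
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0

def mean_average_precision(runs):
    """Mean of per-topic average precisions. runs: list of (ranked, relevant)."""
    aps = [average_precision(ranked, rel) for ranked, rel in runs]
    return sum(aps) / len(aps) if aps else 0.0
```

Because each retrieved relevant shot only adds a non-negative term, appending more shots to a result set can never lower this measure, which matches the slide's claim.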
CMU Manual Retrieval System
(Diagram: the query — text, movie info, image — is sent to retrieval agents that produce a text score, an image score, and a PRF score, which are combined into a final score)
Snapshot of the system
Manual Search Results
(Chart: precision-recall curves for the top manual runs — Prous Science, IBM-2, CMU_MANUAL1, IBM-3, LL10_T, CLIPS+ASR, Fudan_Search_Sys4, CLIPS+ASR+X, ICMKM-2, UMDMqtrec)
CMU Interactive Search System
• New interface based on the Informedia system
• Multiple-document storyboards
• Query context plays a key role in filtering image sets to manageable sizes
• The TREC 2002 image feature set offers additional filtering capabilities for indoor, outdoor, faces, people, etc.
• Displaying filter counts and distributions guides their use in manipulating the storyboard views
Snapshot of the system
Filter Interface for using Image Features
Interactive runs: top 10 (of 13)
(Chart: precision-recall curves — Prous Science, CMUInfInt1, DCUTrec11B.1, IBM-2, DCUTrec11C.2, CMU_INTERACTIVE_2, CMU_MANUAL1, UnivO_MT5, IBM-4, DCUTrec11B.3, DCUTrec11C.4, UMDIqtrec, MSRA.Q-Video.2)
Mean Average Precision vs. Mean Elapsed Time
(Chart: mean average precision plotted against mean elapsed time in minutes for the interactive runs, including CMU, DCU, IBM, ICMKM, MSRA, UMD, and UnivO systems)
• Wide variation in elapsed time; elapsed time is not the dominant factor in effectiveness
Demo
• CMU Interactive Search System
• IBM Video Retrieval System: http://mp7.i2.ibm.com/
Conclusion
• The goal of content-based video retrieval is to build a more intelligent video retrieval engine via semantic meaning
• Many applications in daily life
• Combines evidence from different aspects
• A hot research topic, but few commercial systems
• State-of-the-art performance is still unacceptable for normal users; there is room to improve