MACHINE LEARNING FOR MULTIMEDIA INFORMATION RETRIEVAL Zengchang Qin

About Us ICMLL @ BUAA: icmll. buaa. edu. cn

Outline • Motivation & Introduction • Semantic Gaps • Text Modeling • Content-based Image

Motivation Massive explosion of “content” on the web. Content rich in multiple modalities —

Motivation Current retrieval systems are predominantly uni-modal. The query and retrieved results are from

Related Works Several multi-modal systems have been proposed [TRECVID, Image. CLEF, Iria’ 09, Wang’

Cross-modality Given query from modality A, retrieve results from modality B. The query and

Multimedia Information • Multimedia information is universal. The basic idea is how to convey

Semantic Gap • Population of the Capital of China. • Population, Capital, China

Road Map D(1) D(2) D(3) Query (Q) n Rock – (geology, music)? Context could

Latent Semantic Index (Analysis) n A is the original term-by-document matrix, S is a

Generative Models • Topics for generating observable words Ref: M. Steyvers, and T. Griffiths

Simplex Representation • Topic models --hierarchical Bayesian models for language modeling and document analysis.

Topic Model Ref: D. Blei, A. Ng, and M. Jordan (2003), Latent Dirichlet allocation,

LDA- Latent Dirichlet Allocation • Latent Dirichlet Allocation (LDA) is one of the most

Semantic Distance D(1) D(2) D(3) …… D(n-1) D(n) KL-Distance Query (Q) Query can be

Chinese Language • Its basic structural unit is character instead of word. • Chinese

Test Data • NEWS 1 W: • This corpus is a news archive for

Bilingual Experiments • Train LDA model based on Chinese character, Chinese word and English

Word-Character Relations • Former comparison experiments showed topic models based on Chinese character can

Word-Character Relation Model The document-topic (left) part of the model is exactly the same

Discussions • Some of the words relevant to sports, even not appeared in the

Bag-of-Features Ref: X. Yuan, J. Yu, Z. Qin and T. Wan, A SIFT-LBP image

SIFT-LBP Feature Our Framework Retrieval Results Ref: Jing Yu, Zengchang Qin, Tao Wan and

Basic Idea No natural correspondence between representations of different modalities. For example, we use

Semantic Space Learn mappings that maps different modalities into intermediate spaces. Given an image

System Overview Original Image Space Image Classifiers Semantic Concept 1 Bag-of-Features Semantic Concept V

Data Sample Geography and Places The population of Turkey stood at 71. 5 million

Data for Experiments Test the model on a dataset English Wikipedia built by UCSD

TVGraz , built by I. Khan et. al. using Google image search results, contains

Experiments on the English Datasets Experimental Results Mean Average Precision (MAP) of the Retrieval

Experiments on the English Datasets Experimental Results Comparison of retrieval MAP on all the

Experiments on the English Datasets Sample Results Visually Comparision of TCM for image query

Experiments on the English Datasets Sample Results Visually Comparision of TCM for text query

Study on the Chinese Case Chinese Experiments Most of the models proposed for cross-modal

Study in the Chinese Case Chinese Experiments Text representation: texts are represented as the

Chinese Experiments The curves show the performance with increasing topic numbers. The model trained

Sample Result in Chinese Visually Comparision of TCM for image query task on the

Sample Result in Chinese Visually Comparision of TCM for text query task on the

Future Work: Nonparametric Bayesian Model for Crowdsoucing Motivation Take the advantages of crowdsourcing properties

Slides: 59

Download presentation

MACHINE LEARNING FOR MULTIMEDIA INFORMATION RETRIEVAL Zengchang Qin (zengchang. qin@gmail. com) Intelligent Computing and Machine Learning Beihang University IHSMC, Hangzhou, Aug. 2013

About Us ICMLL @ BUAA: icmll. buaa. edu. cn

Outline • Motivation & Introduction • Semantic Gaps • Text Modeling • Content-based Image Retrieval • Cross-modal Information Retrieval • Future Work

Motivation Massive explosion of “content” on the web. Content rich in multiple modalities — Text, Images, Videos, Music etc. There is a need for retrieval systems that are robust to any modalities. Cross modal text query, eg. retrieval of images from photoblogs using textual query. Finding images to go along with a text article Finding music to enhance videos. Position an image in the text. Etc.

Motivation Current retrieval systems are predominantly uni-modal. The query and retrieved results are from the same modality Text Images Music Videos Is Google Image search cross-modal retrieval? No, text is matched to text metadata for the image The operation would fail, in absence of text modality for the retrieval set.

BBC’s Problem

Related Works Several multi-modal systems have been proposed [TRECVID, Image. CLEF, Iria’ 09, Wang’ 09, Escalante’ 08, Pham’ 07, Snoek’ 05, Westerveld’ 02, etc. ] Given a query consisting of multiple modalities, retrieve examples containing the same multiple modalities. Eg. Combining the modalities into a single modality, combining the outputs of multiple uni-modal systems. Text Images Music Videos • Annotations systems [TRECVID, Image. CLEF, Carneiro’ 07, Feng’ 04, Lavrenko’ 03, Barnard’ 03, etc] modality (say image), assign text labels. – Are true cross-modal systems. – However, text modality is constrained to a few keywords. – Given a query from a

Cross-modality Given query from modality A, retrieve results from modality B. The query and retrieved items are not required to share a common modality. Text Images Music Videos . . . Text Images Music Videos In this work – I restrict to text and image modalities Although similar ideas can be applied to other modalities. Thus, the retrieval of text in response to a query image. And, the retrieval of images in response to a query text.

SEMANTIC GAP

An Example – Search Engine

Conceptual Idea

Multimedia Information • Multimedia information is universal. The basic idea is how to convey the semantic concept hidden in different modalities.

Semantic Gap • Population of the Capital of China. • Population, Capital, China

TEXT MODELING

Road Map D(1) D(2) D(3) Query (Q) n Rock – (geology, music)? Context could help to solve the polysemy of words. n Similarity: n …… D(n-1) D(n)

Latent Semantic Index (Analysis) n A is the original term-by-document matrix, S is a diagonal matrix, T and D are orthonormal matrices Ref: K. Gimpel (2006), Modeling topics, CMU Technical Report

LSI Results

Generative Models • Topics for generating observable words Ref: M. Steyvers, and T. Griffiths (2004), Finding scientific topics, PNAS, 101 (sup. 1), 5228 -5235.

Simplex Representation • Topic models --hierarchical Bayesian models for language modeling and document analysis. • A document can be represented by a mixture of latent topics and each topic is a distribution over the vocabulary.

Topic Model Ref: D. Blei, A. Ng, and M. Jordan (2003), Latent Dirichlet allocation, Journal of Machine Learning Research, 3: 993 -1022 (cited over 3107, from Google scholar)

LDA- Latent Dirichlet Allocation • Latent Dirichlet Allocation (LDA) is one of the most popular and commonly used topic models Ref: D. Blei, A. Ng, and M. Jordan (2003), Latent Dirichlet allocation, Journal of Machine Learning Research, 3: 993 -1022 (cited over 3107, from Google scholar)

Semantic Distance D(1) D(2) D(3) …… D(n-1) D(n) KL-Distance Query (Q) Query can be extended using Word. Net to Improve the quality of topic modeling. Ref: Z. Qin, M. Thint and Z. Huang (2009), IEA/AIE, LNCS 5579, pp. 103 -112.

Chinese Language • Its basic structural unit is character instead of word. • Chinese words are written without spaces between them • “计算机会下象棋” • （Computer can play chess） • (Computing the chance of playing chess) • Character is expressive in meaning though word has more specified meaning. • “病” (illness) “人” (person) “病人”(patient) • If two words share the same character, they potentially have sort of semantic relations. • “比赛” (game) “锦标赛(tournament)” Ref: T. Xu (2001), Fundamental structural principles of Chinese semantic syntax in terms of Chinese characters, Applied Linguistics, 1, 3 -13.

Test Data • NEWS 1 W: • This corpus is a news archive for classification provided by the Sogou laboratory. • • We selected 10000 documents with 1000 documents per class from the corpus as our experiment data which is named as NEWS 1 W • BIL 3200: • A Chinese-English bilingual corpus collected from the Yeeyan. com. We select 3200 documents for experiments and named the corpus as BIL 3200

Chinese Characters and Words

Bilingual Experiments • Train LDA model based on Chinese character, Chinese word and English respectively. • Extracted topics on bilingual corpus BIL 3200 Top 20 terms of 3 topics extracted independently from 3 models. These 3 topics are semantically close which are talking about internet service or technology

Word-Character Relations • Former comparison experiments showed topic models based on Chinese character can also capture the semantic meaning of text. • Chinese words “学习”(study) and “学生”(student) are literally related by sharing the same character “学”. • And these two words are semantically related and may have a high probability to occur in the same context. • New words are created constantly while the characters constitute the new word probably already appeared in the history docs • We proposed Chinese character-word topic model (CWTM) to consider the character-word relation by placing an asymmetric prior on the topicword distribution of the standard LDA.

Word-Character Relation Model The document-topic (left) part of the model is exactly the same as LDA. Wd, n-- nth word in the dth document. -- kth topic over Chinese word -- kth topic over Chinese character R -- Character-Word relation Ref: Q. Zhao, Z. Qin and T. Wan (2011), ICONIP.

Discussions • Some of the words relevant to sports, even not appeared in the training data, were assigned relatively higher probabilities in topics about sports by CWTM. • For example, the word “主客场” (home and away), though never appeared in training documents, its component characters “主”, “ 客”, “场”are the components of those words appeared in training data. Therefore, it obtained a higher probability in CWTM. • On the other hand, not surprisingly, none of these words were ranked into top 1000 words in the corresponding topic extracted by standard LDA model.

Topics

Classification Results

Topic-based Browser

CONTENT-BASED IMAGE RETRIEVAL

Bag-of-Features Ref: X. Yuan, J. Yu, Z. Qin and T. Wan, A SIFT-LBP image retrieval model based on a bag of features, ICIP 2011

SIFT-LBP Feature Our Framework Retrieval Results Ref: Jing Yu, Zengchang Qin, Tao Wan and Xi Zhang, Feature Integration Analysis of Bag-of-Features Model for Image Retrieval , Neurocomputing. , 2013.

What Color is an Object?

Single Label and Multi-labels

CROSS-MODAL INFORMATION RETRIEVAL

Basic Idea No natural correspondence between representations of different modalities. For example, we use Bag-of-Features and Topic Model to represent images and texts, respectively. Images: vectors over visual textures ( Text: vectors of topics ( ) ) Image Space Text Space Like most of the UK, the Manchester area mobilised extensively during World War II. For example, casting and machining expertise at Beyer, Peacock and Company's locomotive works in Gorton was switched to bomb making; Dunlop's rubber works in Chorlton-on-Medlock made barrage balloons; ? Sky. Bomb Terrorist India Success Weather Prime President Navy Army Sun Books Music Food Poverty Iran America The population of Turkey stood at 71. 5 million with a growth rate of 1. 31% per annum, based on the 2008 Census. It has an average population density of 92 persons per km². The proportion of the population residing in urban areas is 70. 5%. People within the 15– 64 age group constitute 66. 5% of the total population, the 0– 14 age group corresponds 26. 4% of th In 1920, at the age of 20, Coward starred in his own play, the light comedy ''I'll Leave It to You''. After a tryout in Manchester, it opened in London at the How do we compute similarity? ? Martin Luther King's presence in Birmingham was not welcomed by all in the black community. A black attorney was quoted in ''Time'' magazine as saying, "The new administration should have been given a chance to confer with the various groups interested in change. … In 1920, at the age of 20, Coward starred in his own play, the light comedy ''I'll Leave It to You''. After a tryout in Manchester, it opened in London at the New Theatre (renamed the Noël Coward Theatre in 2006), his first fulllength play in the West End. Thaxter, John. British Theatre Guide, 2009 Neville Cardus's praise in ''The Manchester Guardian''

Semantic Space Learn mappings that maps different modalities into intermediate spaces. Given an image query in the Topics of Features Space, the cross-modal retrieval system finds the nearest neighbor of text in the Semantic Space. Similarly for image query. The task now is to design these mappings.

System Overview Original Image Space Image Classifiers Semantic Concept 1 Bag-of-Features Semantic Concept V Topic Model Original Text Space Retrieval Procedure Text Classifiers Semantic Concept 2 Correlated Semantic Space

Sample Results

Data Sample Geography and Places The population of Turkey stood at 71. 5 million with a growth rate of 1. 31% per annum, based on the 2008 Census. It has an average population density of 92 persons per km². The proportion of the population residing in urban areas is 70. 5%. People within the 15– 64 age group constitute 66. 5% of the total population, the 0– 14 age group corresponds 26. 4% of the population, while 65 years and higher of age correspond to 7. 1% of the total population. Life expectancy stands at 70. 67 years for men and 75. 73 years for women, with an overall average of 73. 14 years for the populace as a whole. Education is compulsory and free from ages 6 to 15. The literacy rate is 95. 3% for men and 79. 6% for women, with an overall average of 87. 4%. The low figures for women are mainly due to the traditional customs of the Arabs and Kurds who live in the southeastern provinces of the country. Article 66 of the Turkish Constitution defines a "Turk" as "anyone who is bound to the Turkish state through the bond of citizenship"; therefore, the legal use of the term "Turkish" as a citizen of Turkey is different from the ethnic definition. (…) A number of variants were built on the same chassis as the TAM tank. The original program called for the design of an infantry fighting vehicle, and in 1977 the program finished manufacturing the prototype of the ''Vehículo de Combate Transporte de Personal'' (Personnel Transport Combat Vehicle), or VCTP. The VCTP is able to transport a squad of 12 men, including the squad leader and nine riflemen. The squad leader is situated in the turret of the vehicle; one rifleman sits behind him and another six are seated in the chassis, the eighth manning the hull machine gun and the ninth situated in the turret with the gunner. All personnel can fire their weapons from inside the vehicle, and the VCTP's turret is armed with Rheinmetall's Rh-202 20 millimeter (. 79 in) autocannon. The VCTP holds 880 rounds for the autocannon, including subcaliber armor-piercing DM 63 rounds. It is also armed with a 7. 62 millimeter FN MAG 60 -20 mounted on the turret roof. Infantry can dismount through a door on the rear of the hull. (…) Culture and Society Warfare Despite agreeing on most issues regarding the protection of national parks, friction between the NPA and NPS was seemingly unavoidable. Mather and Yard disagreed on many issues; whereas Mather was not interested in the protection of wildlife and accepted the Biological Survey's efforts to exterminate predators within parks, Yard vehemently criticized the program as early as 1924 (Fox, p. 204). Yard was also highly critical of Mather's administration of the parks. Mather advocated plush accommodations, city comforts and various entertainments to encourage park visitation. These plans clashed with Yard's ideals, and he considered such urbanization of the nation's parks misguided. While visiting Yosemite National Park in 1926, he stated that the valley was "lost" after finding crowds, automobiles, jazz music and even a bear show (Sutter, p. 126). In 1924, the United States Forest Service initiated a program to set aside "primitive areas" in the national forests that protected wilderness while opening it to use. (…)

Data for Experiments Test the model on a dataset English Wikipedia built by UCSD SVCL using Wikipedia’s featured articles – 2700 articles, selected and reviewed by Wikipedia’s editors – – since 2009. The articles are accompanied by one or more pictures from the Wikimedia Commons Each article is split into sections that may or may not have an assigned image (sections without images were dropped) Each article is categorized into one of 29 categories (only the 10 most populated categories were chosen) Each ‘document’ in the proposed set is a ‘section of Wikipedia featured article’ and its ‘associated image’. Test the model on a dataset TVGraz built by I. Khan et. al. using Google image search results. – Containing 2596 image-text pairs coming from 10 categories.

TVGraz , built by I. Khan et. al. using Google image search results, contains 2596 image-text pairs coming from 10 categories.

Experiments on the English Datasets Experimental Results Mean Average Precision (MAP) of the Retrieval Performance [1] [1] N. Rasiwasia, J. Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy, and N. Vasconcelos. A new approach to cross-modal multimediaretrieval[A]. In ACM Multimedia[C], pp. 251 -260, 2010.

Experiments on the English Datasets Experimental Results Comparison of retrieval MAP on all the categories on the En-Wikipedia dataset. (Left) MAP of the query images; (Middle) MAP of the query texts; (Right) Average performance for both the query images an the query texts.

Experiments on the English Datasets Experimental Results Comparison of retrieval MAP on all the categories on the TVGraz dataset. (Left) MAP of the query images; (Middle) MAP of the query texts; (Right) Average performance for both the query images an the query texts.

Experiments on the English Datasets Sample Results Visually Comparision of TCM for image query task on the En-Wikipedia dataset.

Experiments on the English Datasets Sample Results Visually Comparision of TCM for text query task on the En-Wikipedia dataset.

Study on the Chinese Case Chinese Experiments Most of the models proposed for cross-modal information retrieval are focused on English corpus that can be easily extended to other alphabetic languages as their basic units are words ● ● However, the basic units in Chinese are not words, but characters. Both words and characters play indispensable roles in building the Chinese structure. ● 包（bag）钱（money）钱包（wallet）书（book）书包（schoolbag）皮（leather）皮包（briefcase） …… …… An example of compound words of using “包” as the center character in Chinese.

Study in the Chinese Case Chinese Experiments Text representation: texts are represented as the distributions over topics of words or topics of characters based on Latent Dirichlet Allocation (LDA). Semantic illustration of word-based and character-based Chinese corpus representations.

Chinese Experiments The curves show the performance with increasing topic numbers. The model trained by character-based topic model achieved better performance than word-based one. Topic Model Image Query Text Query Average Performance Word-Based 0. 241 0. 281 0. 261 Character. Based 0. 314 0. 298 0. 306

Sample Result in Chinese Visually Comparision of TCM for image query task on the Ch-Wikipedia dataset.

Sample Result in Chinese Visually Comparision of TCM for text query task on the Ch-Wikipedia dataset.

Future Work: Nonparametric Bayesian Model for Crowdsoucing Motivation Take the advantages of crowdsourcing properties to build the correlation between texts and images. Method We propose Dual-view Dirichlet Process Mixture Models for analyzing crossmodal data. [9] Renjie liao, Zengchang Qin, Dual-view Dirichlet Process Mixture Models for Cross-modal Data Analysis. Submitted to PAMI.

THANK YOU!