Multimedia Segmentation and Summarization
Dr. Jia-Ching Wang
Honorary Fellow, ECE Department, UW-Madison

Outline
- Introduction
- Speaker Segmentation
- Video Summarization
- Conclusion

What is Multimedia?
- Image
- Video
- Speech
- Audio
- Text

Multimedia Everywhere
- Fax machines: transmission of binary images
- Digital cameras: still images
- iPod / iPhone & MP3
- Digital camcorders: video sequences with audio
- Digital television broadcasting
- Compact disc (CD), digital video disc (DVD)
- Personal video recorder (PVR, TiVo)
- Images on the World Wide Web
- Video streaming, video conferencing
- Video on cell phones, PDAs
- High-definition televisions (HDTV)
- Medical imaging: X-ray, MRI, ultrasound
- Military imaging: multi-spectral, satellite, microwave

What is Multimedia Content?
- Multimedia content: the syntactic and semantic information inherent in digital material.
- Example: text document
  - Syntactic content: chapter, paragraph
  - Semantic content: key words, subject, type of text document, etc.
- Example: video document
  - Syntactic content: scene cuts, shots
  - Semantic content: motion, summary, index, caption, etc.

Why Do We Need to Know Multimedia Content?
- Information processing (archiving, indexing, delivering, accessing, and other processing) requires in-depth knowledge of the content to optimize performance.

How to Know Multimedia Content?
- Multimedia content analysis
  - The computerized understanding of the semantic/syntactic content of a multimedia document
- Multimedia content analysis usually involves
  - Segmentation: segmenting the multimedia document into units
  - Classification: classifying each unit into a predefined type
  - Annotation: annotating the multimedia document
  - Summarization: summarizing the multimedia document

Multimedia Segmentation and Summarization
- Multimedia segmentation
  - Syntactic content
- Multimedia summarization
  - Semantic/syntactic content
- The result of temporal segmentation can benefit video summarization

Multimedia Segmentation
- Image segmentation
- Video segmentation
  - Scene change, shot change
- Audio segmentation
  - Audio class change
- Speech segmentation
  - Speaker change detection
- Text segmentation
  - Word segmentation, sentence segmentation, topic change detection

Multimedia Summarization
- Image summarization
  - Region of interest
- Video summarization
  - Storyboard, highlight
- Audio summarization
  - Main theme in music, chorus in a song, event sound in an environmental sound stream
- Speech summarization
  - Speech abstract
- Text summarization
  - Abstract

What is Speaker Segmentation?
- It can also be called speaker change detection (SCD)
- Assumption: no two speaker streams overlap
[Figure: an audio stream divided into segments labeled speaker 1, speaker 2, and speaker 3]

Supervised vs. Unsupervised SCD
- Supervised manner: the acoustic data are made up of distinct speakers who are known a priori
  - Recognition-based solution
- Unsupervised manner: no prior knowledge about the number and identities of speakers
  - Metric-based criterion
  - Model selection-based criterion

Supervised Speaker Segmentation -- Gaussian Mixture Model
- Gaussian mixture modeling (GMM): $p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i)$
  - $x$ is a d-dimensional random vector.
  - $w_i$, $i = 1, \dots, M$, is the mixture weight.
  - $\mu_i$ is the mean vector.
  - $\Sigma_i$ is the covariance matrix.
- The incoming audio stream is classified into one of D classes in a maximum-likelihood manner at time t
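The maximum-likelihood decision on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: it assumes diagonal covariances, and the speaker names, weights, means, and variances in `models` are made-up values.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | lambda) = log sum_i w_i N(x; mu_i, Sigma_i), via log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def classify(x, models):
    """Return the class whose GMM gives the feature vector x the highest likelihood."""
    return max(models, key=lambda name: gmm_log_likelihood(x, *models[name]))

# Hypothetical two-component models for two known speakers (D = 2).
models = {
    "speaker_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "speaker_b": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
}

print(classify([5.5, 5.5], models))  # speaker_b
print(classify([0.2, 0.1], models))  # speaker_a
```

A speaker change would then be hypothesized wherever the winning class differs between consecutive frames or windows.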

Supervised Speaker Segmentation -- Hidden Markov Model

Unsupervised Speaker Segmentation -- Sliding Window Strategy & Detection Criterion
- Metric-based criterion (the dissimilarities between the acoustic feature vectors are measured)
  - Kullback-Leibler distance
  - Mahalanobis distance
  - Bhattacharyya distance
- Model selection-based criterion
  - Bayesian information criterion (BIC)

Bayesian Information Criterion
- Model selection
  - Choose one among a set of candidate models $M_i$, $i = 1, 2, \dots, m$, and the corresponding model parameters to represent a given data set $D = (D_1, D_2, \dots, D_N)$.
- Model posterior probability
- Bayesian information criterion
  - Maximized log data likelihood for the given model, with a model complexity penalty
  - Bayesian information criterion of model $M_i$, in its usual form: $\mathrm{BIC}(M_i) = \log P(D \mid \hat{\theta}_i, M_i) - \tfrac{1}{2} d_i \log N$
  - where $d_i$ is the number of independent parameters in the model parameter set
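As a small illustration of BIC-based model selection, the sketch below scores two candidate models with the usual form of the criterion (maximized log-likelihood minus half the parameter count times log N) and keeps the higher score. The candidate names, log-likelihoods, and parameter counts are made-up numbers, not results from the talk.

```python
import math

def bic(max_log_likelihood, num_params, num_samples):
    """Maximized log-likelihood penalized by model complexity."""
    return max_log_likelihood - 0.5 * num_params * math.log(num_samples)

# Hypothetical candidates: (maximized log-likelihood, independent parameters).
candidates = {"simple": (-120.0, 2), "complex": (-118.0, 10)}
N = 100  # number of data points

scores = {name: bic(ll, d, N) for name, (ll, d) in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # simple: the small likelihood gain does not pay for 8 extra parameters
```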

Unsupervised Segmentation Using Bayesian Information Criterion
- First model
- Second model
- Bayesian information criterion
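The equations for the two models did not survive on this slide. The sketch below uses the widely used Gaussian formulation as an assumption: the first model fits one Gaussian to the whole window, the second fits one Gaussian to each side of a candidate change point, and the difference of their BIC values decides whether a change is declared. It is reduced to one-dimensional features for brevity, and `penalty_weight` is a tuning parameter introduced here for illustration.

```python
import math

def variance(xs):
    """Maximum-likelihood variance estimate."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def delta_bic(window, split, penalty_weight=1.0):
    """BIC(two Gaussians) - BIC(one Gaussian) for 1-D data; > 0 suggests a change.

    Assumes each side has non-zero variance.
    """
    left, right = window[:split], window[split:]
    n, n1, n2 = len(window), len(left), len(right)
    gain = 0.5 * (n * math.log(variance(window))
                  - n1 * math.log(variance(left))
                  - n2 * math.log(variance(right)))
    extra_params = 2  # the second model has one extra mean and one extra variance
    penalty = penalty_weight * 0.5 * extra_params * math.log(n)
    return gain - penalty

changed = [0.0, 0.2, -0.2, 0.1, -0.1, 10.0, 10.2, 9.8, 10.1, 9.9]
same = [0.0, 0.2, -0.2, 0.1, -0.1, 0.0, 0.2, -0.2, 0.1, -0.1]
print(delta_bic(changed, 5) > 0)  # True: a change point at index 5 is supported
print(delta_bic(same, 5) > 0)     # False: one Gaussian explains the window
```

In a sliding-window detector this value would be evaluated at each candidate split inside the window, and the split with the largest positive value would be reported.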

Disadvantages of Conventional Unsupervised Speaker Change Detection
- For metric-based methods, it is not easy to choose a suitable threshold
- For BIC, it is not easy to detect speaker segments shorter than 2 seconds

Proposed Method -- Misclassification Error Rate
- Sliding window pairs
- Feature vector distribution
[Figure: feature vector distributions of a window pair for the same speaker and for different speakers]

Mathematical Analysis

Mathematical Analysis (continued)

Discussion
- Generative and discriminant classifiers are both applicable
- Key point: discriminant classifiers have the benefit of requiring less data
  - We can use a smaller scanning window
  - The ability to detect short speaker change segments increases

Speaker Segmentation Using Misclassification Error Rate
- Steps
  - Preprocessing
    - Framing, feature extraction
  - Hypothesized speaker change point selection
  - Forcing 2-class labels
  - Training a discriminant hyperplane
  - Inside data recognition & calculating the misclassification error rate
  - Accept/reject the hypothesized speaker change point
- Significance
  - The unsupervised speaker segmentation problem is solved by supervised classification
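The steps above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not name the classifier, so a nearest-mean hyperplane stands in for the discriminant hyperplane, the acceptance threshold of 0.1 is hypothetical, and the interpretation that a low error rate indicates different speakers is inferred from the feature-distribution slide. The feature vectors are toy values.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def misclassification_error_rate(left, right):
    """Force label 0 on the left window and 1 on the right window, fit a
    hyperplane, then re-classify the same data ("inside data recognition")."""
    m_left, m_right = mean(left), mean(right)
    # Stand-in classifier: hyperplane halfway between the two window means.
    w = [b - a for a, b in zip(m_left, m_right)]
    mid = [(a + b) / 2 for a, b in zip(m_left, m_right)]
    bias = -sum(wk * mk for wk, mk in zip(w, mid))

    def predict(x):
        return 1 if sum(wk * xk for wk, xk in zip(w, x)) + bias > 0 else 0

    errors = sum(predict(x) != 0 for x in left) + sum(predict(x) != 1 for x in right)
    return errors / (len(left) + len(right))

def accept_change_point(left, right, max_error=0.1):
    """Accept the hypothesized change point if the forced labels are separable.
    max_error is a hypothetical threshold."""
    return misclassification_error_rate(left, right) <= max_error

different = ([[0.0], [1.0], [2.0]], [[10.0], [11.0], [12.0]])
same = ([[0.0], [2.0], [4.0]], [[1.0], [3.0], [5.0]])
print(accept_change_point(*different))  # True: the two windows separate cleanly
print(accept_change_point(*same))       # False: the forced labels cannot be separated
```

When both windows come from one speaker, the forced labels are arbitrary, so any hyperplane misclassifies a large share of the frames; that is what makes the error rate usable as a change criterion.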

Experimental Results

Method     F-score  Precision  Recall
Proposed   71.8     70.2       81.3
BIC        63.3     54.4       75.7

Video Summarization
- Dynamic vs. static video summarization
  - Dynamic video summarization
    - Sports highlight, movie trailer
  - Static video summarization
    - Storyboard
      - Visual-based approach
      - Incorporation of semantic information

Static Video Summarization -- Visual-Based Approach
- Example
- Problem
  - Is the summarization ratio adjustable?
  - How to generate an effective storyboard under a given summarization ratio?

How to Generate an Effective Storyboard
- Question: assume there are n frames and the summarization ratio is r/n. How do we select the best r frames?
- Complexity
  - There are C(n, r) different choices

How to Generate an Effective Storyboard
- From a visual viewpoint
  - The most visually distinct frames should be extracted
  - Dissimilarity between two frames is measured by low-level visual features
- How to select the best r frames from n frames
  - Solution: maximize the overall pairwise dissimilarities
  - Complexity: C(n, r) x C(r, 2)
  - Infeasible: C(n, r) is usually huge
- Fact
  - Human beings usually browse a storyboard sequentially
- Optimal solution in a sequential sense
  - Maximize the sum of dissimilarities between sequentially adjacent images in a storyboard

How to Maximize the Dissimilarity Sum of the Extracted Images
- Lattice-based representative frame extraction approach
  - Extract key components from a temporal sequence
  - Dynamic programming can be applied
- Example: how to select the best 4 images from an 8-image sequence

How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images
- Original images: O(1), O(2), O(3), O(4), O(5), O(6), O(7), O(8)
- Extracted images: E(1), E(2), E(3), E(4)
- E(1) ← O(i); E(2) ← O(j); E(3) ← O(k); E(4) ← O(l); where i < j < k < l
- Each legal left-to-right path represents a way to extract images
- Each transition results in an adjacent dissimilarity
- In this example, the adjacent dissimilarity sum of the extracted images is D[ O(1), O(3) ] + D[ O(3), O(4) ] + D[ O(4), O(7) ]
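The left-to-right lattice search described on this slide can be sketched with dynamic programming. The sketch is an illustration, not the speaker's code: frames are stood in for by plain numbers, and the absolute difference is a placeholder for a dissimilarity computed from low-level visual features.

```python
def select_frames(frames, r, dissimilarity):
    """Pick r frames, in temporal order, maximizing the sum of dissimilarities
    between adjacent picks. Returns (indices, best_sum)."""
    n = len(frames)
    if not 1 <= r <= n:
        raise ValueError("r must be between 1 and the number of frames")
    neg_inf = float("-inf")
    # best[k][j]: best sum when the (k+1)-th extracted image is frames[j].
    best = [[neg_inf] * n for _ in range(r)]
    back = [[-1] * n for _ in range(r)]
    for j in range(n - r + 1):
        best[0][j] = 0.0
    for k in range(1, r):
        # Stage k can only use frames that leave room for the remaining picks.
        for j in range(k, n - r + k + 1):
            for i in range(k - 1, j):
                score = best[k - 1][i] + dissimilarity(frames[i], frames[j])
                if score > best[k][j]:
                    best[k][j], back[k][j] = score, i
    # Trace the best path back from the last stage.
    j = max(range(r - 1, n), key=lambda idx: best[r - 1][idx])
    total, path = best[r - 1][j], [j]
    for k in range(r - 1, 0, -1):
        j = back[k][j]
        path.append(j)
    path.reverse()
    return path, total

frames = [0, 10, 1, 9, 2, 8, 3, 7]  # toy 1-D "frames"
path, total = select_frames(frames, 4, lambda a, b: abs(a - b))
print(path, total)  # [0, 1, 2, 3] 27.0
```

Each cell of the lattice only needs the best score of the previous stage, which is why the search avoids enumerating all C(n, r) subsets.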

How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images (continued)

Complexity Comparison
- Select 4 images from an 8-image sequence
  - Lattice-based approach
    - 45 dissimilarity comparisons
  - Optimal approach
    - 420 dissimilarity comparisons
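Both counts on this slide can be reproduced with a short check. The exhaustive count follows the C(n, r) x C(r, 2) expression given earlier; the lattice count assumes one dissimilarity evaluation per legal transition between consecutive stages, which is an interpretation of the lattice rather than something the slides state explicitly.

```python
from math import comb

def exhaustive_comparisons(n, r):
    """Every r-subset, every pair inside it."""
    return comb(n, r) * comb(r, 2)

def lattice_comparisons(n, r):
    """One comparison per legal transition O(i) -> O(j), i < j, between
    consecutive extraction stages of the lattice."""
    count = 0
    for stage in range(1, r):
        for j in range(stage, n - r + stage + 1):
            count += j - (stage - 1)  # predecessors i = stage-1 .. j-1
    return count

print(exhaustive_comparisons(8, 4))  # 420
print(lattice_comparisons(8, 4))     # 45
```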

Segment-Based Solution

Experimental Results

Incorporation of Semantic Information
- Conventional
  - The static summary images are extracted according to low-level visual features
- Disadvantage
  - It is difficult to grasp the main story without the support of semantically significant information
- We present a semantic-based static video summarization
  - Each extracted image has an annotation
  - Related images are connected by edges
  - 'Who', 'what', 'where', and 'when' are used to list all extracted images

The Proposed Architecture
- Shot annotation: mapping visual content to text
- Concept expansion: provides an alternative view and dependency information when measuring the relation between two annotations
- Relational graph construction

Concept Tree Construction
- The concept tree denotes the dependency structure of the expanded words
- Meronym
  - 'Wheel' is a meronym of 'automobile'
- Holonym
  - 'Tree' is a holonym of 'bark', of 'trunk', and of 'limb'
- Pencil: used for: Draw
- Salesperson: location of: Store
- Motorist: capable of: Drive
- Eat breakfast: effect of: Full stomach

Concept Tree Reorganization
- Who: names of people, subset of "person" in WordNet
- Where: "social group," "building," and "location" in WordNet
- What: all the other words that do not belong to "who" or "where"
- When: found by searching for time-period phrases

Relational Graph Construction -- Relation of Two Concept Trees
- The relation of the two concept trees
- The relation of the two roots
- The relation of the two children

Relational Graph Construction -- Remove Unimportant Vertices and Edges
- Remove edges with smaller weights, i.e., lower relation
- Remove vertices with a smaller term frequency-inverse document frequency (TF-IDF)
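The pruning step can be illustrated with a small sketch. Everything concrete here is an assumption: the shot annotations, the edge weights, both thresholds, and the choice of the common tf x log(N / df) form of TF-IDF are made up for illustration, since the slides do not give the exact weighting used.

```python
import math

def tf_idf(term, doc, docs):
    """Common TF-IDF form; assumes the term occurs in at least one document."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

def prune_graph(vertex_scores, edges, min_score, min_weight):
    """Drop low-scoring vertices, then low-weight or dangling edges."""
    kept = {v for v, s in vertex_scores.items() if s >= min_score}
    kept_edges = {(a, b): w for (a, b), w in edges.items()
                  if w >= min_weight and a in kept and b in kept}
    return kept, kept_edges

# Hypothetical annotations, one word list per shot.
shot0 = ["player", "goal", "stadium", "referee"]
docs = [shot0, ["player", "coach"], ["player", "crowd", "stadium"]]
scores = {term: tf_idf(term, shot0, docs) for term in shot0}

# Hypothetical relation weights between concepts.
edges = {("goal", "referee"): 0.8, ("goal", "stadium"): 0.2, ("goal", "player"): 0.9}
kept_vertices, kept_edges = prune_graph(scores, edges, min_score=0.05, min_weight=0.5)
print(sorted(kept_vertices))  # ['goal', 'referee', 'stadium']
print(kept_edges)             # {('goal', 'referee'): 0.8}
```

A word such as "player" that appears in every shot gets a TF-IDF of zero, so its vertex and every edge touching it disappear from the final graph.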

The Final Relational Graph
- Comparison with a conventional storyboard

Conclusion
- A novel speaker segmentation criterion is proposed: misclassification error rate
  - The unsupervised speaker segmentation problem is solved by supervised classification with label forcing
  - A discriminant classifier allows the proposed approach to use a smaller scanning window
  - The ability to detect short speaker change segments increases
- Two new static video summarization approaches are proposed
  - Lattice-based representative frame extraction
    - Uses only low-level visual features
    - The summarization ratio is adjustable
    - Under a given summarization ratio, the dissimilarity sum of sequentially adjacent images is maximized
  - Concept-organized representative frame extraction
    - Incorporates semantic information
    - Mines four kinds of concept entities: who, what, where, and when
    - People can efficiently grasp the overall structure of the story and understand the main points of the content

Future Work
- Multimedia segmentation
  - Speech segmentation
  - Audio segmentation
  - Video segmentation
- Multimedia summarization
  - Video summarization
    - Static, dynamic
  - Speech summarization
  - Audio summarization

Thank you all for your attendance!