Carnegie Mellon Univ., Dept. of Computer Science
15-415 - Database Applications
C. Faloutsos

Multimedia Indexing

General Overview
• Relational model – SQL; db design
• Indexing; Q-opt; Transaction processing
• Advanced topics
  – Distributed Databases
  – RAID
  – Authorization / Stat. DB
  – Spatial Access Methods
  – Multimedia Indexing

Multimedia - Detailed outline
• multimedia
  – Motivation / problem definition
  – Main idea / time sequences
  – images
  – sub-pattern matching
  – automatic feature extraction / FastMap

Problem
Given a large collection of (multimedia) records (e.g., stocks),
allow fast similarity queries.

Applications
• time series: financial, marketing (clickstreams!), ECGs, sound
• images: medicine, digital libraries, education, art
• higher-d signals: scientific db (e.g., astrophysics), medicine (MRI scans), entertainment (video)

Sample queries
• Find medical cases similar to Smith's
• Find pairs of stocks that move in sync
• Find pairs of documents that are similar (plagiarism?)
• Find faces similar to ‘Tiger Woods’

Detailed problem defn.:
Problem:
• given a set of multimedia objects,
• find the ones similar to a desirable query object
• for example:

[Figure: two $price-vs-day plots (days 1 to 365)]
distance function: by expert (e.g., Euclidean distance)

Types of queries
• whole match vs sub-pattern match
• range query vs nearest neighbors
• all-pairs query

Design goals
• Fast (faster than seq. scan)
• ‘correct’ (i.e., no false alarms; no false dismissals)

Multimedia - Detailed outline
• multimedia
  – Motivation / problem definition
  – Main idea / time sequences
  – images
  – sub-pattern matching
  – automatic feature extraction / FastMap

Main idea
• E.g., time sequences, ‘whole matching’, range queries, Euclidean distance
[Figure: $price-vs-day plot (days 1 to 365)]

Main idea
• Seq. scanning works - how can we do it faster?

Idea: ‘GEMINI’ (GEneric Multimedia INdexIng)
Extract a few numerical features, for a ‘quick and dirty’ test

‘GEMINI’ - Pictorially
[Figure: sequences S1, ..., Sn (days 1 to 365) mapped to points F(S1), ..., F(Sn) in feature space; axes: e.g., avg and e.g., std]

GEMINI
Solution: ‘Quick-and-dirty’ filter:
• extract n features (numbers, e.g., avg., etc.)
• map into a point in n-d feature space
• organize points with off-the-shelf spatial access method (‘SAM’)
• discard false alarms

GEMINI
Important:
Q: how to guarantee no false dismissals?
A1: preserve distances (but: difficult/impossible)
A2: Lower-bounding lemma: if the mapping ‘makes things look closer’, then there are no false dismissals
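The filter-and-refine idea and the lower-bounding lemma can be sketched as follows. This is a minimal illustration, not the course's code: the names `feature` and `range_query` are hypothetical, the single feature (the average scaled by sqrt(n)) is chosen here only because it provably lower-bounds the Euclidean distance, and a plain list scan stands in for the SAM.

```python
import math

def euclidean(x, y):
    """True (expensive) distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def feature(x):
    """One-number feature: the average, scaled by sqrt(n) (an assumption).

    |feature(x) - feature(y)| = |sum(x - y)| / sqrt(n) <= euclidean(x, y)
    by the Cauchy-Schwarz inequality, so the mapping 'makes things look closer'.
    """
    return sum(x) / math.sqrt(len(x))

def range_query(db, query, eps):
    """Return (hits, number of candidates that passed the filter)."""
    fq = feature(query)
    # Filter step: cheap test in feature space (a SAM would answer this).
    candidates = [s for s in db if abs(feature(s) - fq) <= eps]
    # Refine step: discard false alarms using the true distance.
    hits = [s for s in candidates if euclidean(s, query) <= eps]
    return hits, len(candidates)

db = [[1, 2, 3, 4], [1, 2, 3, 5], [10, 10, 10, 10], [4, 3, 2, 1], [0, 0, 0, 0]]
query, eps = [1, 2, 3, 4], 1.5
hits, n_candidates = range_query(db, query, eps)
brute_force = [s for s in db if euclidean(s, query) <= eps]
print(hits, n_candidates)
```

Here [4, 3, 2, 1] has the same average as the query, so it passes the filter as a false alarm and is removed in the refine step; no true answer is ever filtered out.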

GEMINI
Important:
Q: how to extract features?
A: “if I have only one number to describe my object, what should this be?”

Time sequences
Q: what features?
A: Fourier coefficients (we’ll see them in detail soon)

Time sequences
[Figure: white noise and brown noise, each with its Fourier spectrum (in log-log)]

Time sequences
• E.g.: [figure]

Time sequences
• conclusion: colored noises are well approximated by their first few Fourier coefficients
• colored noises appear in nature:

Time sequences
• brown noise: stock prices (1/f^2 energy spectrum)
• pink noise: works of art (1/f spectrum)
• black noises: water reservoirs (1/f^b, b>2)
• (slope: related to the ‘Hurst exponent’, for self-similar traffic, e.g., Ethernet/web [Schroeder], [Leland+])
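A small experiment can illustrate why a few Fourier coefficients suffice for colored noise. The sketch below assumes brown noise is modelled as a random walk (the running sum of white noise) and uses a naive DFT; the function name, sequence length, and seed are arbitrary choices for illustration.

```python
import cmath
import math
import random

def low_freq_energy_fraction(x, k):
    """Fraction of the signal energy held by DFT coefficients 0..k-1."""
    n = len(x)
    total = sum(v * v for v in x)
    low = 0.0
    for f in range(k):
        coeff = sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        low += abs(coeff) ** 2 / n  # orthonormal scaling, so Parseval holds
    return low / total

rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(256)]

# Brown noise as a random walk: the running sum of the white-noise steps.
brown = []
level = 0.0
for step in white:
    level += step
    brown.append(level)

white_frac = low_freq_energy_fraction(white, 3)
brown_frac = low_freq_energy_fraction(brown, 3)
print(white_frac, brown_frac)
```

For white noise the energy is spread roughly evenly over all frequencies, so three coefficients capture only a small share; for the random walk the low frequencies carry a much larger share.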

Time sequences - results
• keep the first 2-3 Fourier coefficients
• faster than seq. scan
• NO false dismissals (see book)
[Plot: total time, cleanup time, and R-tree time vs. # of coefficients kept]
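The ‘no false dismissals’ claim rests on Parseval's theorem: with an orthonormal DFT, the distance over all coefficients equals the Euclidean distance in the time domain, so dropping coefficients can only shrink it. A minimal check, with hypothetical helper names and two made-up sequences:

```python
import cmath
import math

def dft_features(x, k):
    """First k coefficients of the orthonormal DFT of x (naive O(n*k))."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(k)
    ]

def feature_distance(fx, fy):
    """Euclidean distance between two vectors of complex coefficients."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fx, fy)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

n = 32
s1 = [10 + 0.5 * t + math.sin(t / 3.0) for t in range(n)]  # made-up data
s2 = [12 + 0.4 * t + math.cos(t / 5.0) for t in range(n)]  # made-up data

true_d = euclidean(s1, s2)
d3 = feature_distance(dft_features(s1, 3), dft_features(s2, 3))
d_all = feature_distance(dft_features(s1, n), dft_features(s2, n))
print(d3, d_all, true_d)
```

Keeping 3 coefficients gives a distance no larger than the true one (the lower bound), and keeping all n recovers the true distance up to rounding.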

Time sequences - improvements:
• improvements/variations: [Kanellakis+Goldin], [Mendelzon+Rafiei]
• could use Wavelets, or DCT
• could use segment averages [Yi+2000]

Multimedia - Detailed outline
• multimedia
  – Motivation / problem definition
  – Main idea / time sequences
  – images (color, shapes)
  – sub-pattern matching
  – automatic feature extraction / FastMap

Images - color
What is an image?
A: 2-d array

Images - color
Color histograms, and distance function
Mathematically, the distance function is: [formula image not recovered]

Images - color
Problem: ‘cross-talk’:
• Features are not orthogonal ->
• SAMs will not work properly
• Q: what to do?
• A: feature-extraction question
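To make ‘cross-talk’ concrete: in the cited QBIC work the color-histogram distance is a quadratic form, d(x, y) = sqrt((x - y)^T A (x - y)), where A[i][j] says how similar color bins i and j are. The sketch below assumes that form; the three bins and the 0.8 similarity value are made up for illustration.

```python
import math

def quadratic_form_distance(x, y, A):
    """sqrt((x - y)^T A (x - y)) for histograms x, y and similarity matrix A."""
    diff = [a - b for a, b in zip(x, y)]
    total = 0.0
    for i, di in enumerate(diff):
        for j, dj in enumerate(diff):
            total += di * A[i][j] * dj
    return math.sqrt(total)

# Bins: red, orange, blue. The off-diagonal 0.8 is a made-up red/orange similarity.
A = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
all_red = [1.0, 0.0, 0.0]
all_orange = [0.0, 1.0, 0.0]
all_blue = [0.0, 0.0, 1.0]

d_red_orange = quadratic_form_distance(all_red, all_orange, A)
d_red_blue = quadratic_form_distance(all_red, all_blue, A)
print(d_red_orange, d_red_blue)
```

Because of the off-diagonal terms, an all-red image is closer to an all-orange one than to an all-blue one, which a plain per-bin Euclidean distance would not capture. Those same terms couple the histogram dimensions, which is why a SAM cannot index the raw bins directly.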

Images - color
possible answers:
• avg red, avg green, avg blue
it turns out that this lower-bounds the histogram distance ->
• no cross-talk
• SAMs are applicable

Images - color
performance:
[Plot: time vs. selectivity, seq scan compared with avg RGB]

Multimedia - Detailed outline
• multimedia
  – Motivation / problem definition
  – Main idea / time sequences
  – images (color; shape)
  – sub-pattern matching
  – automatic feature extraction / FastMap

Images - shapes
• distance function: Euclidean, on the area, perimeter, and 20 ‘moments’
• Q: how to normalize them?
• A: divide by standard deviation
• Q: other ‘features’ / distance functions?
• A1: turning angle
• A2: dilations/erosions
• A3: ...
• Q: how to do dim. reduction?
• A: Karhunen-Loeve (= centered PCA/SVD)

Images - shapes
• Performance: ~10x faster
[Plot: log(# of I/Os) vs. # of features kept, compared with keeping all features]

Case study: Informedia
• Video database system, developed at CMU
• 2+ TB of video data (broadcast news)
• retrieval by text, image and face similarity
www.informedia.cs.cmu.edu/

Case study: Informedia
• next foils: visualization features
  – by space
  – by time
  – by concept

geo mapping
• automatic place recognition
• ambiguity resol. +
• lookup

time line

concept space

Multimedia - Detailed outline
• multimedia
  – Motivation / problem definition
  – Main idea / time sequences
  – images (color; shape)
  – sub-pattern matching
  – automatic feature extraction / FastMap

Sub-pattern matching
• Problem: find sub-sequences that match the given query pattern

[Figure: three $price-vs-day plots; day-axis ticks 1 to 400, 1 to 300, and 30 to 365]

Sub-pattern matching
• Q: how to proceed?
• Hint: try to turn it into a ‘whole-matching’ problem (how?)

Sub-pattern matching
• Assume that queries have minimum duration w (e.g., w=7 days)
• divide data sequences into windows of width w (overlapping, or not?)
• A: sliding, overlapping windows. Thus: trails
Pictorially: [figure]

Sub-pattern matching
sequences -> trails -> MBRs in feature space
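The pipeline sequences -> trails -> MBRs can be sketched as below. This is only an illustration under stated assumptions: the 2-d feature (average, standard deviation) is borrowed from the GEMINI picture rather than the Fourier coefficients used for time sequences, and trail points are grouped into MBRs in fixed-size runs, a simplification chosen here. All function names are hypothetical.

```python
import statistics

def window_features(window):
    """Map one window to a 2-d feature point: (average, standard deviation)."""
    return (statistics.mean(window), statistics.pstdev(window))

def trail(seq, w):
    """Feature points of all sliding (overlapping) windows of width w."""
    return [window_features(seq[i:i + w]) for i in range(len(seq) - w + 1)]

def mbrs(points, group_size):
    """Cover each run of `group_size` consecutive trail points with one MBR."""
    boxes = []
    for start in range(0, len(points), group_size):
        chunk = points[start:start + group_size]
        lows = tuple(min(p[d] for p in chunk) for d in range(2))
        highs = tuple(max(p[d] for p in chunk) for d in range(2))
        boxes.append((lows, highs))
    return boxes

seq = [float((7 * i) % 13) for i in range(30)]  # made-up data sequence
w = 7
points = trail(seq, w)           # one point per window offset
boxes = mbrs(points, group_size=8)
print(len(points), len(boxes))
```

A sequence of length 30 with w=7 yields 24 trail points; storing 3 MBRs instead of 24 points is what keeps the index small.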

Sub-pattern matching
Q: do we store all points? why not?
Q: how to do range queries of duration w?

Sub-pattern matching
(very recent improvement [Moon+2001])
• use non-overlapping windows, for data

Conclusions
• GEMINI works for any setting (time sequences, images, etc.)
• uses a ‘quick and dirty’ filter
• faster than seq. scan
• (but: how to extract features automatically?)

Multimedia - Detailed outline
• multimedia
  – Motivation / problem definition
  – Main idea / time sequences
  – images (color; shape)
  – sub-pattern matching
  – automatic feature extraction / FastMap

FastMap
Automatic feature extraction:
• Given a dissimilarity function of objects
• Quickly map the objects to a (k-d) ‘feature’ space.
• (goals: indexing and/or visualization)

FastMap
      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0
[Figure: the objects mapped to points; distance labels ~100 and ~1]

FastMap
• Multi-dimensional scaling (MDS) can do that, but in O(N**2) time

MDS
Multi-Dimensional Scaling

Main idea: projections
We want a linear algorithm: FastMap [SIGMOD 95]

FastMap - next iteration
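The projection step and the ‘next iteration’ can be sketched as follows. With pivots a and b, the cosine law gives the coordinate of object i on the line a-b as x_i = (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2 d(a,b)); the next iteration repeats this on the residual distance d'(i,j)^2 = d(i,j)^2 - (x_i - x_j)^2. The code is a simplified sketch, not the published implementation: the pivot choice (farthest from object 0, then farthest from that) is a deterministic stand-in for the paper's heuristic, and the 5-object matrix is in the spirit of the O1..O5 example, filled in symmetrically.

```python
import math

def fastmap(dist, n, k):
    """Map n objects to k-d points, given only a distance function dist(i, j)."""
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, col):
        # Squared distance left after projecting on the first `col` axes.
        value = dist(i, j) ** 2
        for c in range(col):
            value -= (coords[i][c] - coords[j][c]) ** 2
        return max(value, 0.0)  # clamp rounding noise / non-Euclidean inputs

    for col in range(k):
        # Simplified pivot heuristic (an assumption): two far-apart objects.
        a = max(range(n), key=lambda j: residual_sq(0, j, col))
        b = max(range(n), key=lambda j: residual_sq(a, j, col))
        dab_sq = residual_sq(a, b, col)
        if dab_sq == 0.0:
            break  # all residual distances are zero; remaining coordinates stay 0
        dab = math.sqrt(dab_sq)
        for i in range(n):
            # Cosine-law projection of object i on the line through a and b.
            coords[i][col] = (residual_sq(a, i, col) + dab_sq
                              - residual_sq(b, i, col)) / (2.0 * dab)
    return coords

D = [[0, 1, 1, 100, 100],
     [1, 0, 1, 100, 100],
     [1, 1, 0, 100, 100],
     [100, 100, 100, 0, 1],
     [100, 100, 100, 1, 0]]

coords = fastmap(lambda i, j: D[i][j], n=5, k=1)
print(coords)
```

Each axis costs a constant number of passes over the objects, so the mapping needs O(N k) distance computations rather than the O(N**2) of MDS. On this matrix the single coordinate already separates {O1, O2, O3} from {O4, O5} by about 100.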

Results
Documents / cosine similarity -> Euclidean distance (how?)
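One standard answer to the ‘(how?)’, assuming document vectors are scaled to unit length: for unit vectors, ||x - y||^2 = 2 - 2 cos(x, y), so the Euclidean distance is sqrt(2 (1 - cos)). The sketch below uses hypothetical helper names and made-up term vectors.

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def cosine_to_euclidean(cos_sim):
    """Euclidean distance between the two vectors after scaling to unit length."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos_sim)))  # clamp rounding noise

doc_a = [3.0, 4.0, 0.0]  # made-up term-weight vectors
doc_b = [0.0, 0.0, 2.0]

d_orthogonal = cosine_to_euclidean(cosine_similarity(doc_a, doc_b))
d_same = cosine_to_euclidean(cosine_similarity(doc_a, doc_a))
print(d_orthogonal, d_same)
```

Documents with no terms in common end up sqrt(2) apart, identical documents at distance 0, and the result is a true Euclidean distance that FastMap can take as input.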

Results
[Figure: FastMap plot of documents; cluster labels: bb reports, recipes]

Applications: time sequences
• given n co-evolving time sequences
• visualize them + find rules [ICDE 00]
[Plot: rate vs. time for GBP, JPY, HKD]

Applications - financial
• currency exchange rates [ICDE 00]
[Figure: plot with points labeled FRF, DEM, HKD, JPY, USD, GBP, USD(t), USD(t-5)]

VideoTrails [ACM MM 97]

Conclusions
• GEMINI works for multiple settings
• FastMap can extract ‘features’ automatically (-> indexing, visual d.m.)

References
• Faloutsos, C., R. Barber, et al. (July 1994). “Efficient and Effective Querying by Image Content.” J. of Intelligent Information Systems 3(3/4): 231-262.
• Faloutsos, C. and K.-I. D. Lin (May 1995). FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. Proc. of ACM-SIGMOD, San Jose, CA.
• Faloutsos, C., M. Ranganathan, et al. (May 25-27, 1994). Fast Subsequence Matching in Time-Series Databases. Proc. ACM SIGMOD, Minneapolis, MN.
• Flickner, M., H. Sawhney, et al. (Sept. 1995). “Query by Image and Video Content: The QBIC System.” IEEE Computer 28(9): 23-32.
• Goldin, D. Q. and P. C. Kanellakis (Sept. 19-22, 1995). On Similarity Queries for Time-Series Data: Constraint Specification and Implementation. Int. Conf. on Principles and Practice of Constraint Programming (CP 95), Cassis, France.
• Leland, W. E., M. S. Taqqu, et al. (Feb. 1994). “On the Self-Similar Nature of Ethernet Traffic.” IEEE Transactions on Networking 2(1): 1-15.
• Moon, Y.-S., K.-Y. Whang, et al. (2001). Duality-Based Subsequence Matching in Time-Series Databases. ICDE, Heidelberg, Germany.
• Rafiei, D. and A. O. Mendelzon (1997). Similarity-Based Queries for Time Series Data. SIGMOD Conference, Tucson, AZ.
• Schroeder, M. (1991). Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. New York, W. H. Freeman and Company.
• Yi, B.-K. and C. Faloutsos (2000). Fast Time Sequence Indexing for Arbitrary Lp Norms. VLDB, Cairo, Egypt.