Data Mining Concepts and Techniques Mining data streams

  • Slides: 17

Data Mining: Concepts and Techniques
Mining data streams
29 November 2020

Chapter 8. Mining Stream, Time-Series, and Sequence Data
  • Mining data streams
  • Mining time-series data
  • Mining sequence patterns in transactional databases
  • Mining sequence patterns in biological data

Mining Data Streams
  • What is stream data? Why stream data systems?
  • Stream data management systems: issues and solutions
  • Stream data cube and multidimensional OLAP analysis
  • Stream frequent pattern analysis
  • Stream classification
  • Stream cluster analysis
  • Research issues

Characteristics of Data Streams
  • Data streams
      - Data streams: continuous, ordered, changing, fast, huge in volume
      - Traditional DBMS: data stored in finite, persistent data sets
  • Characteristics
      - Huge volumes of continuous data, possibly infinite
      - Fast changing, requiring fast, real-time response
      - The data stream model fits today's data processing needs well
      - Random access is expensive: single-scan algorithms (can only have one look)
      - Store only a summary of the data seen thus far
      - Most stream data are at a fairly low level or multidimensional in nature, and need multilevel and multidimensional processing

Stream Data Applications
  • Telecommunication calling records
  • Business: credit card transaction flows
  • Network monitoring and traffic engineering
  • Financial market: stock exchange
  • Engineering & industrial processes: power supply & manufacturing
  • Sensor, monitoring & surveillance: video streams, RFIDs
  • Security monitoring
  • Web logs and Web page click streams
  • Massive data sets (even when saved, random access is too expensive)

DBMS versus DSMS

  DBMS                                   DSMS
  Persistent relations                   Transient streams
  One-time queries                       Continuous queries
  Random access                          Sequential access
  "Unbounded" disk store                 Bounded main memory
  Only current state matters             Historical data is important
  No real-time services                  Real-time requirements
  Relatively low update rate             Possibly multi-GB arrival rate
  Data at any granularity                Data at fine granularity
  Assume precise data                    Data stale/imprecise
  Access plan determined by query        Unpredictable/variable data arrival
    processor, physical DB design          and characteristics

Ack.: from Motwani's PODS tutorial slides

Architecture: Stream Query Processing
[Figure: SDMS (Stream Data Management System). Labeled components: multiple streams; stream query processor; scratch space (main memory and/or disk); user/application; continuous query; results.]

Challenges of Stream Data Processing
  • Multiple, continuous, rapid, time-varying, ordered streams
  • Main memory computations
  • Queries are often continuous
      - Evaluated continuously as stream data arrives
      - Answer updated over time
  • Queries are often complex
      - Beyond element-at-a-time processing
      - Beyond stream-at-a-time processing
      - Beyond relational queries (scientific, data mining, OLAP)
  • Multi-level/multi-dimensional processing and data mining
      - Most stream data are low-level or multi-dimensional in nature

Processing Stream Queries
  • Query types
      - One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
      - Predefined query vs. ad-hoc query (issued on-line)
  • Unbounded memory requirements
      - For real-time response, a main memory algorithm should be used
  • Approximate query answering
      - With bounded memory, it is not always possible to produce exact answers
      - High-quality approximate answers are desired
      - Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.

Methodologies for Stream Data Processing
  • Major challenges
      - Keep track of a large universe, e.g., pairs of IP addresses rather than ages
  • Methodology
      - Synopses (trade-off between accuracy and storage)
      - Use synopsis data structures, much smaller (O(log^k N) space) than their base data set (O(N) space)
      - Compute an approximate answer within a small error range (factor ε of the actual answer)
  • Major methods
      - Random sampling
      - Histograms
      - Sliding windows
      - Multi-resolution model
      - Sketches
      - Randomized algorithms

Stream Data Processing Methods (1)
  • Random sampling (without knowing the total length in advance)
      - Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability (s/N) of replacing an old element in the reservoir, where N is the number of elements seen so far.
  • Sliding windows
      - Make decisions based only on recent data within a sliding window of size w
      - An element arriving at time t expires at time t + w
  • Histograms
      - Approximate the frequency distribution of element values in a stream
      - Partition data into a set of contiguous buckets
      - Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)
  • Multi-resolution models
      - Popular models: balanced binary trees, micro-clusters, and wavelets
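
The reservoir sampling idea above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name reservoir_sample, its parameters, and the optional rng argument are assumptions made for the example. It fills the reservoir with the first s elements, then lets the N-th element replace a random slot with probability s/N.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Return a uniform random sample of up to s items from an iterable
    whose total length is not known in advance (hypothetical helper)."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            # The first s elements fill the reservoir directly.
            reservoir.append(item)
        else:
            # Draw j uniformly from [0, n); j < s happens with probability s/n,
            # in which case the new element replaces slot j.
            j = rng.randrange(n)
            if j < s:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 elements from a stream of 1000 integers in a single pass.
    print(reservoir_sample(range(1000), 5, random.Random(42)))
```

Only the s reservoir slots and a counter are kept in memory, which is what makes the method usable on an unbounded stream.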

Stream Data Processing Methods (2)
  • Sketches
      - Histograms and wavelets require multiple passes over the data, but sketches can operate in a single pass
      - Frequency moments of a stream A = {a_1, …, a_N}: $F_k = \sum_{i=1}^{v} m_i^k$, where v is the universe or domain size and m_i is the frequency of i in the sequence
      - Given N elements and v values, sketches can approximate F_0, F_1, F_2 in O(log v + log N) space
  • Randomized algorithms
      - Monte Carlo algorithm: bound on running time, but may not return the correct result
      - Chebyshev’s inequality: let X be a random variable with mean μ and standard deviation σ; then for any k > 0, $P(|X - \mu| > k) \le \sigma^2 / k^2$
      - Chernoff bound: let X be the sum of independent Poisson trials X_1, …, X_n, with μ = E[X] and δ in (0, 1]; then $P[X < (1 - \delta)\mu] \le e^{-\mu\delta^2/2}$
      - The probability decreases exponentially as we move away from the mean
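
To make the frequency moments concrete, the following small Python example computes them exactly for a toy stream. It assumes the usual definition, F_k as the sum over distinct values of m_i raised to the power k, and the function name frequency_moment is hypothetical. Note that this exact computation keeps one counter per distinct value, so it needs O(v) space; it is not a sketch, only a reference for what a sketch would approximate.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th frequency moment: sum of m_i ** k over the distinct values
    that occur in the stream (hypothetical helper, O(v) space)."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


if __name__ == "__main__":
    stream = ["a", "b", "a", "c", "a", "b"]  # m_a = 3, m_b = 2, m_c = 1
    print(frequency_moment(stream, 0))  # F_0: number of distinct values -> 3
    print(frequency_moment(stream, 1))  # F_1: length of the stream -> 6
    print(frequency_moment(stream, 2))  # F_2: 9 + 4 + 1 -> 14
```

F_0 counts distinct elements, F_1 is the stream length, and F_2 grows with how skewed the distribution is, which is why these three moments are the usual targets for sketches.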

Mining Time-Series Data
  • Time-series database
      - Consists of sequences of values or events changing with time
      - Data is recorded at regular intervals
      - Characteristic time-series components: trend, cycle, seasonal, irregular
  • Applications
      - Financial: stock price, inflation
      - Industry: power consumption
      - Scientific: experiment results
      - Meteorological: precipitation

Categories of Time-Series Movements
  • Categories of time-series movements
      - Long-term or trend movements (trend curve): general direction in which a time series is moving over a long interval of time
      - Cyclic movements or cycle variations: long-term oscillations about a trend line or curve, e.g., business cycles; may or may not be periodic
      - Seasonal movements or seasonal variations: almost identical patterns that a time series appears to follow during corresponding months of successive years
      - Irregular or random movements
  • Time-series analysis: decomposition of a time series into these four basic movements
      - Additive model: TS = T + C + S + I
      - Multiplicative model: TS = T × C × S × I

Estimation of Trend Curve
  • The freehand method
      - Fit the curve by looking at the graph
      - Costly and barely reliable for large-scale data mining
  • The least-squares method
      - Find the curve minimizing the sum of the squares of the deviations of points on the curve from the corresponding data points
  • The moving-average method
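
As an illustration of the least-squares method, the sketch below fits a straight trend line y = slope * t + intercept to a series observed at times t = 0, 1, 2, and so on. The choice of a straight line, the equally spaced time index, and the name fit_trend_line are assumptions made for this example; the slides do not fix the form of the curve.

```python
def fit_trend_line(values):
    """Least-squares straight line through (t, values[t]) for t = 0..n-1.
    Returns (slope, intercept). Hypothetical helper for illustration."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a line")
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    # Closed-form solution: slope = cov(t, y) / var(t).
    s_ty = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    slope = s_ty / s_tt
    intercept = y_mean - slope * t_mean
    return slope, intercept


if __name__ == "__main__":
    series = [1, 3, 5, 7, 9]  # lies exactly on y = 2t + 1
    print(fit_trend_line(series))
```

Subtracting the fitted line from the series leaves the cyclic, seasonal, and irregular movements, under the additive model.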

Moving Average
  • Moving average of order n
      - Smooths the data
      - Eliminates cyclic, seasonal and irregular movements
      - Loses the data at the beginning or end of a series
      - Sensitive to outliers (can be reduced by weighted moving average)
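
A moving average of order n can be sketched as follows, assuming the common definition in which each output value is the mean of n consecutive observations. The function name moving_average, the optional weights parameter, and the sample data are illustrative assumptions. The output has n - 1 fewer values than the input, which is the loss of data at the ends of the series mentioned above.

```python
def moving_average(values, n, weights=None):
    """Moving average of order n over a list of numbers. If weights is given
    (length n), compute a weighted moving average. Hypothetical helper."""
    if n < 1:
        raise ValueError("order n must be at least 1")
    if weights is None:
        weights = [1] * n
    if len(weights) != n:
        raise ValueError("weights must have exactly n entries")
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i:i + n])) / total
        for i in range(len(values) - n + 1)
    ]


if __name__ == "__main__":
    data = [3, 7, 2, 0, 4, 5, 9, 7, 2]
    print(moving_average(data, 3))
    # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
    print(moving_average(data, 3, weights=[1, 4, 1]))
    # [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]
```

With weights (1, 4, 1) the center of each window dominates, so a single extreme neighbor pulls the smoothed value less than it does in the unweighted average.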