Finding Aggregates from Streaming Data in Single Pass
- Slides: 17
Medha Atre. Course project for CS 631 (Autumn 2002) under Prof. Krithi Ramamritham (IITB).
Overview
- The need
- Type of solutions
- Choice of solution
- Problems addressed
- How does the wavelet transform work?
- Implementation
- Results
The need …
- Huge data streams are encountered at routers, telephone switches, stock exchanges, etc.
- This data must be analyzed for trend analysis and fraud detection.
- The analysis must be done as fast as possible for mission-critical tasks such as detecting fraud and security breaches.
- What are the possible ways of analysis?
Solutions …
- Offline processing: archive the whole data in real time and analyze it offline. (Too slow for the basic motives of the analysis, i.e. fraud detection and performance.)
- Real-time processing: analyze the data as it arrives.
  - In multiple passes: the easiest method, but slower and inefficient with respect to system load.
  - In a single pass: requires special implementation techniques, but is faster and more efficient.
Real-time processing of data in a single pass
- Methods used: wavelet transform, sampling techniques, MaxDiff algorithm.
- Why the wavelet transform?
  - It stores a fairly accurate approximate "sketch" of the data in a smaller space.
  - It answers simple point and range queries from the stored "sketch" with quite good approximation.
  - It is known to perform better than the other techniques and is easier to implement.
Note: Comparative analysis of these techniques is outside the scope of this project.
Block diagram of implementation technique
[Figure: data stream → find the wavelet transform coefficients → select the "m" best coefficients; the query is answered from these coefficients.]
Key aspects of implementation
- Single pass over the data.
- At any point while processing the data, only O(N) memory is used, where N is the number of data items being considered.
- Select the "m" best coefficients out of the N data items, such that they give the minimum error when retrieving the original value of a data key.
- Store these "m" coefficients instead of all N data items (m << N).
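How does the wavelet transform work?
[Figure: worked example turning the original data into wavelet coefficients. Legible in the extracted slide: wavelet coefficients [5 6 0 2] and node labels "D: 6, a: 5", "D: 0, a: 2", "D: 2, a: 8".]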
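The decomposition can be illustrated with a minimal Haar-style sketch. It assumes one common convention (pairwise average `(a + b) / 2` and detail `(a - b) / 2`), which may differ from the scaling used on the slide. The function name `haar_transform` and the input array are illustrative and not taken from the project.

```python
def haar_transform(data):
    """Return Haar coefficients: [overall average, coarsest detail, ..., finest details].

    Assumed convention: average = (a + b) / 2, detail = (a - b) / 2.
    The input length must be a power of two.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = [float(x) for x in data]
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Details of the current (finer) level go after those of coarser levels.
        details = [(a - b) / 2.0 for a, b in pairs] + details
        averages = [(a + b) / 2.0 for a, b in pairs]
    return averages + details


# Illustrative input, not the data shown on the slide.
coeffs = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)  # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Each pass halves the number of averages, so the N data items yield one overall average plus N - 1 detail coefficients.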
[Figure: coefficient tree with nodes S0 to S7. The tree itself is not stored.]
How queries are answered …
- Point queries: find the number of calls made by telephone number 2422 5074 (find the value of key 5).
  Value(24225074) = S(0) + ½*S(1) + ½*S(3) ½*S(6)
- Range queries: find the number of calls made from exchange 2422. The answer to this query is the root of the tree having the numbers from exchange 2422 as leaves, i.e. S1, S2, etc.
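A point query only needs the coefficients on the path from the root to the requested key. The sketch below assumes the coefficients are stored in the array layout `[average, S1, S2, ..., S(N-1)]` with children of node `i` at `2i` and `2i + 1`, and the convention left leaf = parent + detail, right leaf = parent - detail. This scaling differs from the ½ factors on the slide, and the names `point_query` and `range_query` are hypothetical.

```python
def point_query(coeffs, key):
    """Reconstruct the value at index `key` from Haar coefficients.

    Assumed layout: coeffs[0] is the overall average, coeffs[i] (i >= 1)
    is the detail at tree node i, whose children are nodes 2i and 2i + 1.
    """
    n = len(coeffs)
    value = coeffs[0]
    node, lo, hi = 1, 0, n
    while node < n:
        mid = (lo + hi) // 2
        if key < mid:          # key lies in the left half: add the detail
            value += coeffs[node]
            hi, node = mid, 2 * node
        else:                  # key lies in the right half: subtract it
            value -= coeffs[node]
            lo, node = mid, 2 * node + 1
    return value


def range_query(coeffs, first, last):
    """Sum of the reconstructed values for keys first..last (inclusive)."""
    return sum(point_query(coeffs, k) for k in range(first, last + 1))


# Coefficients of the illustrative array [2, 2, 0, 2, 3, 5, 4, 4].
coeffs = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(point_query(coeffs, 5))     # 5.0
print(range_query(coeffs, 4, 7))  # 16.0
```

Only about log2(N) coefficients are touched per point query. The range query here simply sums point queries; a version that reads the subtree averages directly would avoid the repeated walks.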
Brief about implementation
- Data is input from a file.
- The file is read sequentially to simulate a single pass over the data stream; previously read data is never accessed again.
- The coefficient tree is formed as a linked list.
- The "m" best coefficients are stored.
- A program computes point and range queries from the coefficients and returns the answers to the user.
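A single pass over the stream can be sketched as follows. This is not the project's linked-list implementation: it is an assumed variant that keeps one pending average per level plus a min-heap of the `m` largest details, so it uses about O(log N + m) memory. The function name `stream_haar_top_m`, the `(level, position)` keys (level 0 is the finest), and the detail convention `(left - right) / 2` are all illustrative.

```python
import heapq


def stream_haar_top_m(values, m):
    """One pass over `values`; return (overall average, {(level, pos): detail}).

    Keeps only the m details with the largest absolute value.
    Assumes the stream length is a power of two.
    """
    pending = []  # pending[l]: average at level l waiting for its right sibling
    counts = []   # counts[l]: number of details emitted so far at level l
    heap = []     # min-heap of (abs(detail), level, position, detail)
    for v in values:
        cur, level = float(v), 0
        while level < len(pending) and pending[level] is not None:
            left, pending[level] = pending[level], None
            detail = (left - cur) / 2.0
            item = (abs(detail), level, counts[level], detail)
            counts[level] += 1
            if len(heap) < m:
                heapq.heappush(heap, item)
            else:
                heapq.heappushpop(heap, item)  # drops the smallest of the m + 1
            cur = (left + cur) / 2.0
            level += 1
        if level == len(pending):
            pending.append(None)
            counts.append(0)
        pending[level] = cur
    if not pending or any(p is not None for p in pending[:-1]):
        raise ValueError("stream length must be a power of two")
    kept = {(lvl, pos): d for _, lvl, pos, d in heap}
    return pending[-1], kept


average, kept = stream_haar_top_m(iter([2, 2, 0, 2, 3, 5, 4, 4]), 3)
print(average, kept)
```

Passing an iterator makes the single-pass property explicit: each value is consumed once and never revisited.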
Few points to note …
- Very basic implementation: it cannot handle data fed in an arbitrary format.
- Assumptions:
  - Incoming data arrives as key-value pairs, e.g. the key is the telephone number 2422 5074 and the value is the number of calls made from it in the last hour: "24225074" => 6.
  - The incoming data stream is in ordered-aggregate form.
- The selection of the "m" best coefficients changes according to the type of data stream.
contd …
- Currently we sort the coefficients and keep the "m" highest, which is not the best approach.
- Multi-dimensional data streams are not considered (they are discussed in the research papers referred to for the implementation).
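One possible refinement of the selection step, sketched under assumptions: instead of ranking coefficients by raw magnitude, weight each one by the square root of the number of leaves it covers. With the unnormalized average/difference convention, this ranks coefficients by their contribution to the sum-squared reconstruction error. The function `keep_top_m`, the array layout `[average, S1, ..., S(N-1)]`, and the sample values are illustrative and not part of the project.

```python
import math


def keep_top_m(coeffs, m, weighted=True):
    """Zero out all but the m highest-ranked coefficients.

    weighted=False ranks by |c| (the approach described above);
    weighted=True ranks by |c| * sqrt(support), where support is the
    number of leaves under the coefficient's tree node.
    """
    n = len(coeffs)

    def score(i):
        if not weighted:
            return abs(coeffs[i])
        # Node 0 (overall average) and node 1 cover all n leaves;
        # each deeper level covers half as many.
        support = n if i == 0 else n >> (i.bit_length() - 1)
        return abs(coeffs[i]) * math.sqrt(support)

    keep = set(sorted(range(n), key=score, reverse=True)[:m])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


# Made-up coefficients where the two rankings disagree.
sample = [1.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9]
print(keep_top_m(sample, 2, weighted=False))  # keeps indices 0 and 7
print(keep_top_m(sample, 2, weighted=True))   # keeps indices 0 and 1
```

The coarse coefficient at index 1 affects all eight leaves, so it outranks the larger but purely local coefficient at index 7 once support is taken into account.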
Our results … N = 4096
*************************************
VALUE OF m = N
*************************************
RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 1011 IS 85631.0
RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 3067 IS 7505
THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 2015 IS 1480.0
RESULT OF RANGE QUERY VALUE FROM 1016 TO 1021 IS : 20562.0
*************************************
VALUE OF m = 50% of N
*************************************
RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 1011 IS 85630.0
THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 3067 IS 8296.0
THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 2015 IS 6323.0
RESULT OF QUERY VALUE FROM 1016 TO 1021 IS : 22416.0
*************************************
VALUE OF m = 25% of N
*************************************
THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 1011 IS 85630.0
THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 3067 IS 8297.0
THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 2015 IS 8338.0
RESULT OF QUERY VALUE FROM 1016 TO 1021 IS : 22415.938
References
- S. Muthukrishnan, A. C. Gilbert, Y. Kotidis, M. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. 2001.
- J. Vitter, Y. Matias, M. Wang. Wavelet-Based Histograms for Selectivity Estimation. 1998.
Thank you