Searching Technology For a Large Number Of Objects
Kurt Stockinger and John Wu, Lawrence Berkeley National Laboratory. SDM All-hands, October 2005
Outline
• Current work
  - FastBit: a compressed bitmap indexing package
  - Applications:
    • Grid Collector
    • DEX
    • TBitmapIndex
    • Network Flow Data Analysis
• Future Plans
  - Extending the searching technology
  - Integrating with other SDM center technologies
FastBit: A compressed bitmap indexing technology for efficient searching of read-only data. John Wu, Ekow Otoo, Arie Shoshani, Kurt Stockinger, Doron Rotem. http://sdm.lbl.gov/fastbit
FastBit Overview
• FastBit is designed to search multidimensional data
  - Conceptually in table format
    • rows: objects
    • columns: attributes
• FastBit uses vertical (column-oriented) organization for the data
  - Efficient for analysis of read-only data
• FastBit uses compressed bitmap indices to speed up searches
  - Proven in analysis to be optimal for single-attribute queries
  - Superior to other optimal indices because they are also efficient for multi-attribute queries
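As a rough illustration of the idea (not FastBit's implementation or API), the sketch below builds uncompressed equality-encoded bitmaps for two columns of a small table and answers a two-attribute range query with bitwise OR and AND. The column names, values, and helper functions are hypothetical, and the compression step that FastBit applies to its bitmaps is omitted.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)


def range_query(index, low, high):
    """OR together the bitmaps of all values v with low <= v <= high."""
    hits = 0
    for value, bitmap in index.items():
        if low <= value <= high:
            hits |= bitmap
    return hits


def rows_of(bitmap):
    """Return the row numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical table stored column by column (vertical organization).
energy = [5, 12, 7, 12, 3, 9]
n_tracks = [2, 4, 4, 1, 4, 3]

energy_index = build_bitmap_index(energy)
tracks_index = build_bitmap_index(n_tracks)

# Multi-attribute query: 6 <= energy <= 12 AND 3 <= n_tracks <= 4.
selected = range_query(energy_index, 6, 12) & range_query(tracks_index, 3, 4)
print(rows_of(selected))  # [1, 2, 5]
```

Each attribute is resolved to a bitmap independently, and the per-attribute results are combined with a single bitwise AND, which is why the same index serves both single-attribute and multi-attribute queries.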
Grid Collector: Put FastBit and SRM together to improve the efficiency of STAR analysis jobs. John Wu, Junmin Gu, Jerome Lauret, Arthur M. Poskanzer, Arie Shoshani, Alexander Sim, Wei-Ming Zhang. http://www.star.bnl.gov/
Grid Collector Features
• Key features of the Grid Collector:
  - Providing transparent object access
  - Selecting objects based on their attribute values
  - Improving analysis system’s throughput
  - Enabling interactive distributed data analysis
Grid Collector Speeds up Analyses
[Figure: speedup plot, annotated from "more selective" to "less selective"]
• Legend
  - Selectivity: fraction of events needed by the analysis
  - Speedup: ratio of the time to read events without GC to the time with GC
  - Speedup = 1: speed of the existing system (without GC)
• Results
  - When searching for rare events, say, selecting one event out of 1000 (selectiv
  - Even using GC to read 1/2 of events, speedup > 1.5
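The two quantities in the legend can be written down directly. The following is a minimal sketch; the function names and the timings are hypothetical and are not measurements from the Grid Collector.

```python
def selectivity(events_needed, events_total):
    """Fraction of events needed by the analysis."""
    return events_needed / events_total


def speedup(time_without_gc, time_with_gc):
    """Ratio of the time to read events without GC to the time with GC."""
    return time_without_gc / time_with_gc


# Selecting one event out of 1000, as in the rare-event case.
print(selectivity(1, 1000))   # 0.001

# Made-up timings in seconds, for illustration only.
print(speedup(120.0, 80.0))   # 1.5
```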
DEX: Using Efficient Bitmap Indices to Accelerate Scientific Visualization. Kurt Stockinger, John Shalf, Wes Bethel, John Wu. Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California
DEX: Dexterous Data Explorer
[Diagram: Query, Data, Visualization Toolkit (VTK)]
[Figure: 3D visualization of a supernova explosion]
Performance Results with Scientific Data
• VTK rendering time: 0.2 to 2 seconds.
• One of the simplest tasks DEX performs is to find an isosurface.
• DEX is on average a factor of three to four faster than the best isosurface algorithm of VTK.
Query-Driven Visualization of Combustion Data Set
a) Query: CH4 > 0.3
b) Query: temp < 3
c) Query: CH4 > 0.3 AND temp < 3
d) Query: CH4 > 0.3 AND temp < 4
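The four queries can be mimicked on a toy data set to show how a compound condition narrows the set of cells that has to be passed to the renderer. This is only a sketch: the per-cell values and the `select` helper are hypothetical, and plain Python sets stand in for the bitmap indices that DEX uses.

```python
# Hypothetical per-cell values; real data would come from the combustion simulation.
ch4 = [0.10, 0.35, 0.50, 0.31, 0.05, 0.40]
temp = [1.0, 2.5, 3.5, 4.2, 2.0, 3.9]


def select(values, predicate):
    """Return the set of cell indices whose value satisfies the predicate."""
    return {i for i, v in enumerate(values) if predicate(v)}


query_a = select(ch4, lambda v: v > 0.3)             # a) CH4 > 0.3
query_b = select(temp, lambda v: v < 3)              # b) temp < 3
query_c = query_a & query_b                          # c) CH4 > 0.3 AND temp < 3
query_d = query_a & select(temp, lambda v: v < 4)    # d) CH4 > 0.3 AND temp < 4

print(sorted(query_c))  # [1]
print(sorted(query_d))  # [1, 2, 5]
```

Relaxing the temperature bound from 3 to 4 enlarges the selected region, which mirrors the difference between panels c) and d).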
TBitmapIndex: An attempt to introduce FastBit to ROOT. Kurt Stockinger (1), John Wu (1), Rene Brun (2), Philippe Canal (3). (1) Berkeley Lab, Berkeley, USA; (2) CERN, Geneva, Switzerland; (3) Fermilab, Batavia, USA
Current Status
• Built a prototype wrapper on FastBit called TBitmapIndex
  - Read one variable at a time into memory to build index
  - Each index is currently stored in a binary file
• Integrated bitmap indices to support:
  - TTree::Draw
  - TTree::Chain
• Verified the performance advantage of FastBit vs. ROOT’s TTreeFormula
Experiments With BaBar Data
• Software/Hardware:
  - Bitmap index software is implemented in C++
  - Tests carried out on:
    • Linux CentOS
    • 2.8 GHz Intel Pentium 4 with 1 GB RAM
    • Hardware RAID with SCSI disk
• Data:
  - BaBar data set: 7.6 million records with ~100 attributes each
• Bitmap Indices (FastBit):
  - 10 out of ~100 attributes
  - 1000 equality-encoded bins
  - 100 range-encoded bins
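The difference between equality-encoded and range-encoded bins can be sketched on a toy column. The values, the four bin boundaries, and the function names below are hypothetical, the bitmaps are left uncompressed, and none of this reflects FastBit's actual API.

```python
import bisect


def bin_of(value, boundaries):
    """Index of the bin containing value; bin j covers [boundaries[j], boundaries[j+1])."""
    return bisect.bisect_right(boundaries, value) - 1


def equality_encoded(values, boundaries):
    """bitmaps[j] has bit i set when row i falls in bin j."""
    bitmaps = [0] * (len(boundaries) - 1)
    for row, v in enumerate(values):
        bitmaps[bin_of(v, boundaries)] |= 1 << row
    return bitmaps


def range_encoded(values, boundaries):
    """bitmaps[j] has bit i set when row i falls in any bin <= j."""
    out, running = [], 0
    for bitmap in equality_encoded(values, boundaries):
        running |= bitmap
        out.append(running)
    return out


values = [0.5, 3.2, 1.7, 2.9, 0.1, 3.9]
boundaries = [0, 1, 2, 3, 4]          # four bins: [0,1), [1,2), [2,3), [3,4)

ee = equality_encoded(values, boundaries)
re_ = range_encoded(values, boundaries)

# The query "value < 3" covers bins 0 through 2.
from_ee = ee[0] | ee[1] | ee[2]       # equality encoding: OR three bitmaps
from_re = re_[2]                      # range encoding: read a single bitmap
print(bin(from_ee), bin(from_re))     # 0b11101 0b11101
```

With range encoding a one-sided range query touches one bitmap regardless of how many bins it spans, at the cost of denser bitmaps that tend to compress less well.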
Size of Compressed Bitmap Indices
[Figure legend: EE-BMI: equality-encoded bitmap index; RE-BMI: range-encoded bitmap index]
Query Performance: TTreeFormula vs. Bitmap Indices
• Bitmap indices 10X faster than TTreeFormula
An Application of TBitmapIndex: Network Flow Data Analysis. Kurt Stockinger, John Wu, Scott Campbell, Stephen Lau, Mike Fisk, Eugene Gavrilov, Alex Kent, Christopher E. Davis, Rick Olinger, Rob Young, Jim Prewett, Paul Weber, Thomas P. Caudell, E. Wes Bethel, Steve Smith. LBNL, LANL, UNM
Chasing the Track of a Network Scan
• IDS log shows
  - Jul 28 17:19:56 AddressScan 221.207.14.164 has scanned 19 hosts (62320/tcp)
  - Jul 28 19:56 AddressScan 221.207.14.88 has scanned 19 hosts (62320/tcp)
• Using FastBit/ROOT to explore what else might be going on
• Queries prepared by Scott Campbell. More details at http://www.nersc.gov/~scottc/papers/ROOT/rootuse.prod.html
Are There More Scans?
• Query: select ts/(60*60*24)-12843, IPR_C, IPR_D where IPS_A=211 an
• More scans from the same subnet
Who Is Doing It?
• Query: select IPS_C, IPS_D where IPS_A==211 and IPS_B==207
• Picture: the histogram of IPS_C and IPS_D
• Five IP addresses started most of the scans!
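The query and its histogram can be imitated in a few lines. The sketch below assumes that IPS_A through IPS_D hold the four octets of the source address, which the slide does not state explicitly; the flow records are made up, and a plain Python scan stands in for the FastBit/ROOT machinery.

```python
from collections import Counter

# Hypothetical flow records; only the column names are taken from the query above.
flows = [
    {"IPS_A": 211, "IPS_B": 207, "IPS_C": 1, "IPS_D": 10},
    {"IPS_A": 211, "IPS_B": 207, "IPS_C": 1, "IPS_D": 10},
    {"IPS_A": 211, "IPS_B": 207, "IPS_C": 1, "IPS_D": 20},
    {"IPS_A": 10, "IPS_B": 0, "IPS_C": 0, "IPS_D": 1},
]

# select IPS_C, IPS_D where IPS_A==211 and IPS_B==207, then count each (C, D) pair.
histogram = Counter(
    (f["IPS_C"], f["IPS_D"])
    for f in flows
    if f["IPS_A"] == 211 and f["IPS_B"] == 207
)

# The most frequent pairs identify the hosts that started the most flows.
print(histogram.most_common(1))  # [((1, 10), 2)]
```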
Future Plans: Meet the challenges of searching in data-intensive sciences
Types of Searching Problems
• Not practical to work on many terabytes of data simultaneously; work on a subset instead
  - Analyze the data collected last month
  - Analyze the data collected by Joe
• Find the objects of interest
  - Find the flame front in combustion simulation
  - Find the top-talker in network communication
• Knowledge discovery
  - Association rules
  - Cliques/connection subgraphs
Searching Problems From SciDAC 2 Appendix
• B.1 Experimental Combustion Science: feature identification and tracking; 20 TB
• B.8 Empowering RHIC users with new analysis tools: analyze subsets; ~GB/s
• B.10 U.S. LHC Experiments: analyze subsets; ~GB/s
• B.13 The Solenoid Tracker at RHIC (STAR): analyze subsets; 1 GB/s
• B.2 Advanced Computing for LCLS: ?, classification; 200 MB/s
• B.3 An Earth Science Knowledge System: locating dataset of interest; PB
• B.5 Enabling Discovery in Experimental Biological Science: high-dimensional data search, data versioning, semantic graphs (ontology), multiple sources
Searching Problems From SciDAC 2 Appendix
• B.4 Remote operations of LHC, CMS and ITER: streaming data
• B.9 ARM/ACRF Program: instrument data streams
• B.6 Enhancing Material Science Beamline: ND data array, real-time processing
• B.7 Large-Scale Computation for ITER: data management
• B.11 Nanoscience: mining simulation data together with experimental data
• B.12 The Spallation Neutron Source: real-time image analysis, data comparison
• Data rates shown on the slide (row assignment unclear): 1 GB/h, ?, 20 MB/s
Features of These Search Problems
• Large: many datasets are petabytes in size, with billions of records
• Complex data: multi-dimensional arrays, user-defined data types, simulation data mixed with experimental data, regular data with attributes defined by ontologies (semantic networks)
• Complex searching: data versioning, provenance-based search, catalog matching
• Beyond searching: data mining and knowledge discovery
• Real-time response: instrument control, interactive design of experiments, computational steering
• Integrated: searching is only a part of the overall data analysis; need to improve the overall throughput
Improve Existing Searching Tools
• FastBit is efficient for range queries; need to support other types of queries, e.g., joins
• FastBit is efficient for read-only data; need to support updates
• FastBit supports up to 2^32 (4 billion) records; need to support at least 2^64 (about 18 quintillion) records
• FastBit allows the user to choose from many different types of indices; need to automatically decide on one for the user
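A quick check of the two record-count limits, computing 2^32 and 2^64 exactly. The variable names are illustrative only.

```python
limit_32 = 2 ** 32   # largest record count addressable with 32-bit row numbers
limit_64 = 2 ** 64   # largest record count addressable with 64-bit row numbers

print(f"{limit_32:,}")   # 4,294,967,296
print(f"{limit_64:,}")   # 18,446,744,073,709,551,616

# Moving from 32-bit to 64-bit row numbers multiplies the limit by 2^32.
print(limit_64 // limit_32 == limit_32)   # True
```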
Expand The Repertoire Of Searching Tools
• Support parallel index building and searching
• Support search of semantic networks, combining ontology with structured data
• Support data versioning (time stamps, provenance, …)
• Support robust recovery (a la POSTGRES)
• Support user-defined data types (ROOT)
• Support user-defined functions
• Support commonly used B-trees and R-trees
• Support combined searching of structured and semi-structured data, extend
Extend The Accessibility Of The Tools
• Extend the collaboration with ROOT to make FastBit seamlessly available to users
  - Implemented a prototype, need a more integrated way to read and write ROOT files
• Read data from other common file formats; write indices to the same file formats
  - netCDF, HDF (4/5)
• Extend the advantage of searching to other steps of analysis
  - Feature tracking; extending it to higher dimension; more general image analysis
• Make FastBit available in other forms
  - Web service, an actor in Kepler, …
Summary
• FastBit is efficient for range queries on read-only data
• Integration of FastBit with ROOT is getting under way
  - TBitmapIndex prototype
• Integration with other systems is possible
  - Need to develop a short list based on target application area
• Plan to extend FastBit
  - Integration with ROOT will bring up a list of requirements
  - Intend to target biological applications