Volume, Variety and Veracity of Big Data Analytics
Volume, Variety and Veracity of Big Data Analytics in NASA’s Giovanni Tool
C. Lynnes, M. Hegde, C. Smit, J. Pan, K. Bryant, C. Chidambaram, P. Zhao
12/6/13
GES DISC: Goddard Earth Sciences Data Information Services Center

The Language of “Big Data”
• Volume
• Velocity
• Variety
• Veracity

The Language of “Big Data”
• Volume
• Velocity (mostly an onboard challenge)
• Variety
• Veracity

Working with Remote Sensing Data
• Exploratory Data Analysis: Find -> Download -> Learn format -> WRITE CODE -> Read -> Subset -> Quality Filter -> Summarize / Analyze -> Visualize
• Main Analysis Phase: Select Data -> Analyze -> Derive Conclusions -> Publish

Why So Difficult?
• Find: 2000+ datasets (Variety)

Why So Difficult?
• Download: FTP data products up to 1 GB (Volume)

Why So Difficult?
• Learn format: HDF4, HDF5, HDF-EOS2, HDF-EOS5, netCDF, GRIB, binary (Variety)

Why So Difficult?
• Read, Subset: files with up to 800+ variables (Variety)

Why So Difficult?
• Quality Filter: more studying, coding (Veracity)

Why So Difficult?
• Summarize / Analyze: summarize data over long time periods, large areas (Volume)

Why So Difficult?
• Visualize: locate, read, use geolocation info (Variety)

Why So Difficult?
Exploratory Data Analysis (WRITE CODE):
• Find: 2000+ datasets (Variety)
• Download: FTP data products up to 1 GB (Volume)
• Learn format: HDF4, HDF5, HDF-EOS2, HDF-EOS5, netCDF, GRIB, binary (Variety)
• Read, Subset: files with up to 800+ variables (Variety)
• Quality Filter: more studying, coding (Veracity)
• Summarize / Analyze: summarize data over long time periods, large areas (Volume)
• Visualize: locate, read, use geolocation info (Variety)
Main Analysis Phase: Select Data -> Analyze -> Derive Conclusions -> Publish

Giovanni provides (relatively) rapid exploration of datasets
• Exploratory Data Analysis (Find, Download, Learn format, Read, Subset, Quality Filter, Summarize / Analyze, Visualize): Web-based Services (Giovanni) alongside WRITE CODE
• Giovanni provides Server-Side, Quick-Start Exploratory Data Analysis: Find -> Fetch -> Reformat -> Filter -> Regrid -> Summarize -> Visualize
  – no coding necessary
  – no downloads necessary
• Main Analysis Phase: Select Data -> Analyze -> Derive Conclusions -> Publish

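The Filter, Regrid and Summarize steps named on this slide can be illustrated with a minimal sketch. It is not Giovanni's implementation: the function names, the quality-flag convention (`max_flag`) and the block-mean regridding are assumptions chosen only to show the shape of such a pipeline on synthetic data.

```python
import numpy as np

def quality_filter(values, quality_flag, max_flag=1):
    """Replace cells whose (hypothetical) quality flag exceeds max_flag with NaN."""
    return np.where(quality_flag <= max_flag, values, np.nan)

def regrid_block_mean(field, factor):
    """Coarsen the last two axes (lat, lon) by averaging factor x factor blocks."""
    nlat, nlon = field.shape[-2:]
    if nlat % factor or nlon % factor:
        raise ValueError("grid size must be divisible by factor")
    blocks = field.reshape(*field.shape[:-2],
                           nlat // factor, factor, nlon // factor, factor)
    # Average within each block, ignoring cells removed by the quality filter.
    return np.nanmean(blocks, axis=(-3, -1))

def summarize_over_time(field):
    """Time-average a (time, lat, lon) array, ignoring NaN cells."""
    return np.nanmean(field, axis=0)

# Synthetic (time, lat, lon) data with one low-quality cell.
values = np.arange(16.0).reshape(1, 4, 4)
flags = np.zeros((1, 4, 4), dtype=int)
flags[0, 0, 0] = 3

result = summarize_over_time(regrid_block_mean(quality_filter(values, flags), 2))
```

Using NaN as the "filtered out" marker keeps the later averaging steps from silently counting rejected cells; a masked array would be an equally reasonable choice.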
Giovanni User Interface

Example: Exploring in Time and Space
• Time Averaged Map, 20-22 May

Tackling Volume
Legend: ✔ = Plot Axis; Σ = Aggregation (Summation, average, ...); [i] = Select Layer or Level

Plot                           Longitude  Latitude  Vertical  Time
2-D Color Slice
  Time-averaged map            ✔          ✔         [i]       Σ
  Longitude Cross-Section      ✔          Σ         ✔         Σ
  Latitude Cross-Section       Σ          ✔         ✔         Σ
  Hovmoller (Longitude)        ✔          Σ         [i]       ✔
  Hovmoller (Latitude)         Σ          ✔         [i]       ✔
  Vertical-Time Cross-Section  Σ          Σ         ✔         ✔
1-D Line Plots
  Meridional Mean              ✔          Σ         [i]       Σ
  Zonal Mean                   Σ          ✔         [i]       Σ
  Vertical Profile             Σ          Σ         ✔         Σ
  Area-averaged Time Series    Σ          Σ         [i]       ✔

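Each plot type on this slide amounts to keeping some dimensions as plot axes and aggregating or selecting the others. The sketch below shows that idea with plain NumPy reductions on a synthetic array; the dimension order (time, level, lat, lon), the variable names and the use of unweighted means are assumptions for illustration, and a real area average would normally weight cells by grid-cell area.

```python
import numpy as np

# Synthetic stand-in for one gridded variable, dimensions (time, level, lat, lon).
rng = np.random.default_rng(0)
data = rng.random((6, 3, 4, 8))

level = 0               # select one layer or level
layer = data[:, level]  # shape (time, lat, lon)

# 2-D color slices: two plot axes remain, the other dimensions are aggregated.
time_averaged_map = layer.mean(axis=0)   # (lat, lon)
hovmoller_lon = layer.mean(axis=1)       # (time, lon), averaged over latitude
hovmoller_lat = layer.mean(axis=2)       # (time, lat), averaged over longitude
vertical_time = data.mean(axis=(2, 3))   # (time, level), averaged over the area

# 1-D line plots: a single plot axis remains.
zonal_mean = layer.mean(axis=(0, 2))            # (lat,), averaged over time and longitude
meridional_mean = layer.mean(axis=(0, 1))       # (lon,), averaged over time and latitude
vertical_profile = data.mean(axis=(0, 2, 3))    # (level,)
area_avg_time_series = layer.mean(axis=(1, 2))  # (time,)
```

Every result is much smaller than the input array, which is the sense in which server-side aggregation addresses Volume: only the reduced product needs to reach the user.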
Tackling Variety
• Data Formats
• Internal Metadata
• Data Locations
• Catalogs

Giovanni-4 architecture leverages standards to tackle Variety in Data Sources
Architecture diagram labels:
• Web User Interface (YUI) and service manager: REST-ish URL; image / data
• Variable Database: query; info about variables
• Data Catalog: search (OpenSearch)
• Data: netCDF/CF via OPeNDAP
• Workflow Engine (run): search, fetch, subset, prep -> CF+, regrid

Variety: Shades of Gray (Aerosol Optical Depth)
AOD Variants, with the examples shown on the slide:
• Component (or Total): Total AOD; Component AOD (Black Carbon, Sea Salt, Dust, Sulfate, Particulate Organic Matter)
• Absorption / Extinction: Absorption AOD
• Instrument: MODIS, OMI, MISR, SeaWiFS
• Satellite: Aqua, Terra
• Algorithm: Deep Blue, Dark Target
• Wavelength(s) Used: 388 nm, 470 nm, 500 nm, 550 nm, 555 nm, 659 nm, 660 nm, 865 nm, 1240 nm, 1640 nm, 2130 nm
• Spatial Resolution: 0.5 degree, 1.0 degree
• Land/Ocean: Land, Ocean

Giovanni offers data comparison services to compare similar datasets

Plot Type                       Latitude  Longitude  Z    Time
Correlation Map                 ✔         ✔          [i]  Σ
Difference Map                  ✔         ✔          [i]  Σ
Difference Time Series          Σ         Σ          [i]  ✔
Scatterplot                     --        --         [i]  --
Interactive Scatterplot+Map     ✔         ✔          [i]  --
Time-averaged Scatterplot+Map   ✔         ✔          [i]  Σ

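As a rough illustration of what the Correlation Map, Difference Map and Difference Time Series compute, here is a minimal sketch on synthetic data. It assumes both datasets are already on the same (time, lat, lon) grid; the function names and the choice of Pearson correlation and unweighted means are assumptions for illustration, not a description of Giovanni's internals.

```python
import numpy as np

def correlation_map(a, b):
    """Pearson correlation over time at each grid cell; a and b are (time, lat, lon)."""
    a_anom = a - a.mean(axis=0)
    b_anom = b - b.mean(axis=0)
    cov = (a_anom * b_anom).sum(axis=0)
    denom = np.sqrt((a_anom ** 2).sum(axis=0) * (b_anom ** 2).sum(axis=0))
    # Cells with no variance in either dataset have an undefined correlation.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(denom > 0, cov / denom, np.nan)

def difference_map(a, b):
    """Difference of the two time-averaged fields, shape (lat, lon)."""
    return a.mean(axis=0) - b.mean(axis=0)

def difference_time_series(a, b):
    """Area-averaged difference at each time step, shape (time,)."""
    return (a - b).mean(axis=(1, 2))

# Synthetic stand-ins for two versions of the same product.
rng = np.random.default_rng(42)
first = rng.random((24, 5, 10))
second = 2.0 * first + 1.0  # perfectly linearly related to `first`

corr = correlation_map(first, second)
```

The synthetic pair correlates perfectly even though the values differ everywhere, which is the point behind "Good, but are only relative": agreement between two datasets says nothing about whether either is right.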
Comparing Datasets with Giovanni
• Correlation Map: TRMM 3B42 V6 vs. V7, 2010-2011

Veracity is the Achilles’ Heel of Big Data
• Failures in credibility can plunge it (deep) into the Trough of Disillusionment*
• Figure: Gartner Hype Cycle (expectations vs. time) with stages Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity; markers for 2011, 2012 and 2013, with “(You are here)” at the Peak of Inflated Expectations
*Gartner Hype Cycle

But ensuring veracity is hard...
• Quality information?
  – Not always available or usable
  – N.B.: More than just accuracy/uncertainty...
  – ...Selection bias pops up often
• Consistency checks?
  – Good, but are only relative...

Correlation Map: TRMM Rainfall Versions 6 vs. 7, 2010-2011

Correlation Map: TRMM Rainfall Versions 6 vs. 7, 1998-1999
• Additional relative uncertainty metrics coming soon...
• See Z. Liu et al., H42-F08, Thurs. 12:05-12:20 @ MW 3022

Actionable Provenance: Reproducible (REST-ish) URLs
http://giovanni.gsfc.nasa.gov/giovanni/#
service=INTERACTIVE_MAP&
starttime=2001-01-01T00:00Z&
endtime=2001-01-31T23:59Z&
bbox=-124.4531,24.2578,-68.2031,52.3828&
data=MOD08_D3_051_Optical_Depth_Land_And_Ocean_Mean&
variableFacets=dataFieldDiscipline%3AAerosols%3B

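Because the whole request lives in the URL fragment, the URL itself can serve as a provenance record. The sketch below shows one way to build and read back such a URL with the Python standard library. The helper names `build_url` and `parse_url` are hypothetical, the parameter names and values are copied from the slide, and nothing here claims the service still accepts this exact form. Colons are left unescaped for readability, whereas the slide's URL percent-encodes the colon in the facet value; both decode to the same string.

```python
from urllib.parse import quote, unquote

BASE = "http://giovanni.gsfc.nasa.gov/giovanni/#"

def build_url(params):
    """Join key=value pairs after '#', percent-encoding values except ',' and ':'."""
    parts = [f"{key}={quote(str(value), safe=',:')}" for key, value in params.items()]
    return BASE + "&".join(parts)

def parse_url(url):
    """Recover the parameter dictionary from the URL fragment."""
    fragment = url.partition("#")[2]
    pairs = (part.split("=", 1) for part in fragment.split("&") if part)
    return {key: unquote(value) for key, value in pairs}

# Parameter values as shown on the slide.
params = {
    "service": "INTERACTIVE_MAP",
    "starttime": "2001-01-01T00:00Z",
    "endtime": "2001-01-31T23:59Z",
    "bbox": "-124.4531,24.2578,-68.2031,52.3828",
    "data": "MOD08_D3_051_Optical_Depth_Land_And_Ocean_Mean",
    "variableFacets": "dataFieldDiscipline:Aerosols;",
}

url = build_url(params)
recovered = parse_url(url)
```

Round-tripping the parameters (`recovered == params`) is the property that makes the URL reproducible: anyone holding the link holds the full request.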
Can exploratory data analysis conquer the 3 V’s ACROSS our data systems?
• Distributed data access AND usage
  – Format standards++
  – Catalog query standards
  – Access at data variable level
• Actionable Provenance
  – Reproducible (REST-ish?) execution