DIGITAL HUMANITIES XSEDE PROJECTS + CLOWDER
Michael Simeone (Director, Library Data Science and Analytics, ASU)
Sandeep Puthanveetil Satheesan (Research Programmer, NCSA, University of Illinois)
Outline
• Introduction to Digital Humanities
• Clowder: Data + Metadata Management System
• XSEDE Projects
  • Video Analysis Tableau (VAT)
  • Decomposing Bodies (DEBOD for short)
  • Image Analysis of Rural Photography (IARP)
  • Fake News Shelf Life
• Q&A
Introduction (Digital Humanities)
• What is Digital Humanities (DH)?
• Questions of social, ethical, historical, and cultural import, and more
• Challenge: Identifying a computable problem
• Challenge: Theory of data is often different from that of science and engineering
• Challenge: Getting data that will behave
Clowder: Data + Metadata Management System
• Scalable research data management system for your own cloud
• Contains three major extension points: preprocessing, processing, and previewing of data
• Designed to support any data format and multiple research domains
• Supports both machine-created and user-created metadata
Video Analysis Tableau (Background)
• YouTube Estimates 2016
  • 400 hours of video uploaded to YouTube every minute
• Cisco Visual Networking Index Forecast 2014-2019
  • “Globally, consumer internet video traffic will be 80 percent of all consumer Internet traffic in 2019, up from 64 percent in 2014.”
• Cisco Visual Networking Index: Forecast and Trends, 2017–2022
  • “Globally, IP video traffic will be 82 percent of all IP traffic (both business and consumer) by 2022, up from 75 percent in 2017.”
• Sources of video
  • 130 years of cinema
  • YouTube | Vimeo | …
  • Netflix | Amazon Prime Video | Hulu …
  • Academic videos
  • Archives
  • Live Internet video (e.g. Facebook)
Video Analysis Tableau (VAT)
• Attempt at fusing software, arts, and humanities
• Tools aimed at enabling researchers to analyze and get new insights from movie / video data at scale
• More details: http://virginiakuhn.net/vat/
VAT: Data Flow
VAT: Architecture
VAT: Shot Detection (Cinemetrics Extractor)
VAT: Image Based Retrieval of Similar Shots
VAT: Movie Slice Extractor
VAT: Shot Indexing (LSVA Extractor)
• Obtains different feature vectors from the key frames
• Used in the image-based retrieval of similar shots
• Feature descriptors extracted
  • Grayscale histogram
  • Color layout
  • DCT (Discrete Cosine Transform)
  • Gabor energies
  • HSV / HSL color histogram
  • Edge histogram
• Feature distance scores between all the shots in the database are precomputed for faster retrieval
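The indexing idea above (one feature vector per key frame, with all pairwise distances computed ahead of time) can be sketched roughly as follows. This is a minimal illustration, not the LSVA extractor itself: the 8-bin grayscale histogram, the L1 distance, the shot names, and the pixel values are all assumptions chosen for brevity.

```python
from itertools import combinations


def gray_histogram(pixels, bins=8):
    """Normalized histogram of 0-255 grayscale values (bin count is an assumption)."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]


def l1_distance(a, b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def precompute_distances(features):
    """Compute the distance for every unordered pair of shots once, up front."""
    return {
        (a, b): l1_distance(features[a], features[b])
        for a, b in combinations(sorted(features), 2)
    }


def most_similar(shot_id, distances, k=1):
    """Look up the k nearest shots using only the precomputed table."""
    scored = []
    for (a, b), d in distances.items():
        if shot_id == a:
            scored.append((d, b))
        elif shot_id == b:
            scored.append((d, a))
    return [(other, d) for d, other in sorted(scored)[:k]]


# Hypothetical key frames, each reduced to a handful of grayscale pixels.
key_frames = {
    "shot_a": [10, 12, 15, 200],
    "shot_b": [11, 13, 14, 210],
    "shot_c": [240, 250, 245, 5],
}
features = {shot: gray_histogram(px) for shot, px in key_frames.items()}
distances = precompute_distances(features)
print(most_similar("shot_a", distances, k=2))
```

Precomputing trades storage (one score per pair of shots) for query speed: a retrieval request becomes a table lookup and a sort rather than a fresh comparison against every shot.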
VAT: XSEDE Gateway Architecture
VAT: Screenshots
VAT: Clowder Contributions/Feature Requests
• Main Contributions to Clowder
  • File sections
  • Feature-based file comparisons
  • Processing files as HPC jobs (Contribution to PyClowder)
• Feature Requests
  • Intelligent load balancing between small and big requests
  • Intelligent method to combine multiple files into a single HPC job
  • Improve image-based retrieval of similar shots
VAT: Team, Publication, Acknowledgements
• Team
  • PI: Virginia Kuhn (USC)
  • Co-PIs: Michael Simeone (ASU) and Alan Craig (XSEDE)
  • ECSS Consultants/Developers: Luigi Marini (NCSA), David Bock (NCSA), Mona Wong (SDSC), Ritu Arora (TACC), Sandeep Puthanveetil Satheesan (NCSA), Liana Diesendruck (NCSA - Alumnus)
• Publication
  • Kuhn, Virginia, Alan Craig, Michael Simeone, Sandeep Puthanveetil Satheesan, and Luigi Marini. "The VAT: enhanced video analysis." In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, p. 11. ACM, 2015.
• Acknowledgements
  • This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. We are grateful to XSEDE for providing us the vital resources required for the development of this project.
Image Analysis of Rural Photography (IARP)
• Goal: Analyze the Farm Security Administration/Office of War Information (FSA/OWI) Black-and-White Negatives and Color Photographs to better understand the collection from a humanities perspective
• Data:
  • FSA/OWI Black-and-White Negatives and Color Photographs
  • Extensive pictorial record of American life between 1935 and 1944. This U.S. government photography project was headed for most of its existence by Roy E. Stryker, who guided the effort in a succession of government agencies.
  • 176,000+ photographs in total
IARP: Key Processing Steps
• Metadata extraction and indexing
  • Stryker hole punch detection
  • Mean grayscale
  • Face detection
  • OCR
  • NLP - Photo captions
  • Metadata - geolocation, photographer, date, film medium
• Advanced search queries
  • Find all photographs from the FSA collection that were not Stryker killed, whose mean grayscale value is greater than 85, and that contain eyes
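The example query above combines three extracted metadata fields. A minimal sketch of that filter, assuming each photograph is represented as a dictionary of extracted metadata: the field names (`stryker_killed`, `mean_grayscale`, `detected`), the record IDs, and the values are hypothetical and are not Clowder's actual metadata schema or search API.

```python
def mean_grayscale(pixels):
    """Average of 0-255 grayscale pixel values for one photograph."""
    return sum(pixels) / len(pixels)


def advanced_search(photos, min_gray=85, required_feature="eyes"):
    """Return IDs of photos that were not hole-punched ("killed"),
    are brighter than min_gray on average, and contain the required feature."""
    return [
        p["id"]
        for p in photos
        if not p["stryker_killed"]
        and p["mean_grayscale"] > min_gray
        and required_feature in p["detected"]
    ]


# Hypothetical indexed records; the values are illustrative, not real results.
photos = [
    {"id": "fsa-001", "stryker_killed": False, "mean_grayscale": 120.5, "detected": {"face", "eyes"}},
    {"id": "fsa-002", "stryker_killed": True, "mean_grayscale": 140.0, "detected": {"face", "eyes"}},
    {"id": "fsa-003", "stryker_killed": False, "mean_grayscale": 60.2, "detected": {"eyes"}},
    {"id": "fsa-004", "stryker_killed": False, "mean_grayscale": 90.0, "detected": set()},
]
print(advanced_search(photos))
```

Because every criterion is an indexed metadata field, the query never has to touch the image files themselves at search time.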
IARP: Screenshots
IARP: Screenshots Unreleased
IARP: Clowder Contributions/Feature Requests
• Main contributions to Clowder
  • Promoted metadata fields in advanced search (unreleased)
  • Admin page to manage promoted metadata fields (unreleased)
• Feature Requests
  • Thumbnails for advanced search results
  • Grid view + list view for datasets in a collection
    • Especially for collections that contain a large number of datasets
  • Grid view + list view for advanced
  • Venn diagram visualizations to study data
IARP: Team, Publication, Acknowledgements
• Team:
  • PI: Elizabeth Wuerffel (Valparaiso University)
  • Co-PI: Jeffrey Will (Valparaiso University)
  • ECSS Consultants/Developers: Paul Rodriguez (SDSC), Sandeep Puthanveetil Satheesan (NCSA), Marcus Slavenas (NCSA), Alan Craig (XSEDE)
• Publication:
  • Rodriguez, Paul, Sandeep Puthanveetil Satheesan, Jeffrey Will, Elizabeth Wuerffel, and Alan Craig. "Extracting, Assimilating, and Sharing the Results of Image Analysis on the FSA/OWI Photography Collection." In Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, p. 42. ACM, 2017.
• Acknowledgements:
  • This material is based upon work supported by the National Science Foundation under grant numbers ACI-1053575 (XSEDE), ACI-1261582 (Brown Dog), and ACI-1341698 (Comet). The PI and Co-PIs thank Sandeep Puthanveetil Satheesan, Paul Rodriguez, Alan Craig, Kenton McHenry, and Marcus Slavenas for their assistance with image analysis, which was made possible through the XSEDE Extended Collaborative Support Service (ECSS) program.
Decomposing Bodies (DEBOD)
• Goal: Recognize handwritten text and decimal numbers from scanned Bertillon Cards from Ohio Penitentiaries from the late 19th and early 20th centuries to defamiliarize the process of Bertillonage as practiced in the United States.
• Data: Scanned Bertillon Cards
  • Prisoner identification cards
  • Contain anthropometric measurements and other details
• Bertillonage - first known scientific attempt at biometric data collection, storage, indexing, and searching for criminal identification
  • Developed by Alphonse Bertillon, a French criminologist, in 1879
DEBOD: Extractor Key Steps (Front Side)
• Preprocessing
  • RAW to PNG format conversion
  • Image dewarping
  • Brightness-contrast correction
• Segmentation
  • Top tabular region extraction
  • Rotation correction
  • Column image segmentation
  • Cell image extraction
  • Digit image extraction / isolation
• Digit classification
• Decimal number extraction
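To make the digit isolation step more concrete, here is a minimal sketch of one common way to split a binarized cell image into separate glyphs: a vertical projection profile. The slides do not say which method the DEBOD extractor uses, so this technique is a swapped-in assumption, and the tiny binary image is invented for illustration.

```python
def column_profile(image):
    """Count ink pixels (value 1) in each column of a binary image."""
    return [sum(row[x] for row in image) for x in range(len(image[0]))]


def split_on_gaps(image):
    """Return (start, end) column spans, end exclusive, separated by empty columns."""
    spans, start = [], None
    profile = column_profile(image)
    for x, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = x                      # a glyph begins
        elif ink == 0 and start is not None:
            spans.append((start, x))       # a glyph ends at the first empty column
            start = None
    if start is not None:                  # glyph touching the right edge
        spans.append((start, len(profile)))
    return spans


def crop(image, span):
    """Cut one glyph out of the cell image."""
    start, end = span
    return [row[start:end] for row in image]


# Hypothetical 3x8 binarized cell containing two separate glyphs.
cell = [
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
]
spans = split_on_gaps(cell)
glyphs = [crop(cell, s) for s in spans]
print(spans)
```

A profile-based split only works when glyphs do not touch; handwriting with connected or overlapping digits would need a more robust approach such as connected-component analysis.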
DEBOD: Screenshots
DEBOD: Clowder Contributions/Feature Requests
• Main contributions to Clowder
  • Basic extractor for processing Bertillon Cards using template matching, decimal number recognition, and text recognition
• Feature Requests
  • Improve the Bertillon cards extractor to increase accuracy of transcribed text and numbers
  • Explore better methods for extracting text and numbers from Bertillon cards
DEBOD: Team, Publication, Acknowledgements
• Team:
  • PI: Alison Langmead (University of Pittsburgh)
  • ECSS Consultants/Developers: Alan Craig (XSEDE), Paul Rodriguez (SDSC), Sandeep Puthanveetil Satheesan (NCSA)
• Publication:
  • Langmead, Alison, Paul Rodriguez, Sandeep Puthanveetil Satheesan, and Alan Craig. "Extracting meaningful data from decomposing bodies." In Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact, p. 41. ACM, 2017.
• Acknowledgements:
  • This material is based upon work supported by the National Science Foundation under grant numbers ACI-1053575 (XSEDE) and ACI-1261582 (Brown Dog).
THANK YOU Q&A