Data science for Materials Science & Engineering: Making data accessible, discoverable and useful

Data science for Materials Science & Engineering
Making data accessible, discoverable and useful

In this module
• Introduction to FAIR data principles
• Use of open resources to store and share data & models
• HW assignment: explore available data

Juan C. Verduzco and Alejandro Strachan
jverduzc@purdue.edu || strachan@purdue.edu
School of Materials Engineering & Network for Computational Nanotechnology
Purdue University, West Lafayette, Indiana, USA

Juan C. Verduzco & Ale Strachan - Data Science for Materials Science & Engineering

Learning objectives and prerequisites

After completing this lecture you will:
• Be aware of and able to adopt FAIR principles in your own work
• Know how to add metadata to your results
• Be able to contribute your own results to data repositories
• Be able to document and share models and scientific workflows

Pre-requisites:
• None

Data Science & Machine Learning in Science & Engineering
• Acquiring and handling data
• Learning from data
• Predictive models (supervised learning)
• Cyber-infrastructure
• Finding patterns (unsupervised learning)
• Design of experiments

Motivation
• Running experiments and simulations is costly and time-consuming
• Making all data generated findable and accessible accelerates innovation
• Sharing data, including provenance and metadata, helps with reusability, reproducibility and transparency
• Benefits of sharing data:
  – As a researcher you get credit for your work
  – Results are not trapped in static PDF files
  – Your colleagues can learn from your data and advance their fields
  – Data science tools can be used to learn from available data

FAIR Data Principles
• Findable: Data has a unique identifier and is indexed in a searchable resource
• Accessible: An access protocol that is open, free and universally implementable allows for retrieval of the data
• Interoperable: Data uses a formal, accessible and broadly applicable language for knowledge representation
• Reusable: Data is richly described with metadata for provenance and to meet domain-relevant community standards

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., ... & Bouwman, J. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 1-9.
Pundir, S. (2016). "Fair data principles" [Graphic]. Wikimedia Commons. https://commons.wikimedia.org/wiki/File:FAIR_data_principles.jpg

FAIR Data Principles
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., ... & Bouwman, J. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 1-9.
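One way to make the four principles concrete is to read them as a checklist applied to a dataset record. The sketch below is only illustrative: the field names (identifier, access_url, format, metadata), the list of open formats and the required metadata keys are all assumptions made for this example, not part of the FAIR specification or of any repository's schema.

```python
# Toy FAIR checklist. All field names and rules here are hypothetical.
OPEN_FORMATS = {"text/csv", "application/json"}      # assumed "interoperable" formats
REQUIRED_METADATA = {"authors", "method", "license"}  # assumed minimum for reuse

record = {
    "identifier": "doi:10.0000/example-dataset",  # placeholder, not a real DOI
    "access_url": "https://example.org/datasets/example-dataset",
    "format": "text/csv",
    "metadata": {
        "authors": ["A. Researcher"],
        "method": "placeholder description of how the data was produced",
        "license": "CC-BY-4.0",
    },
}


def fair_gaps(rec):
    """Return the names of the FAIR principles this record does not yet meet."""
    checks = {
        "Findable": bool(rec.get("identifier")),
        "Accessible": str(rec.get("access_url", "")).startswith(("http://", "https://")),
        "Interoperable": rec.get("format") in OPEN_FORMATS,
        "Reusable": REQUIRED_METADATA <= set(rec.get("metadata", {})),
    }
    return [name for name, ok in checks.items() if not ok]


gaps = fair_gaps(record)
```

A check like this cannot establish that data is FAIR, but it can flag obvious omissions before a dataset is uploaded.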

Metadata
Metadata aims to establish provenance for your data by listing details on:
- Why and how the data was produced (experimental or simulation details)
- How the data was processed and curated (data handling)
- Who generated the data, and when and where (authorship)
- Sources and references

Data from the Materials Project, accessed through Citrination:
A. Jain*, S. P. Ong*, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, K. A. Persson. The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials, 2013, 1(1), 011002. doi:10.1063/1.4812323
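The four kinds of detail listed on this slide map naturally onto a small metadata record stored next to the data file. The sketch below assumes a JSON "sidecar" layout with made-up key names and placeholder values; it is one possible arrangement, not a format required by any repository.

```python
import json
from datetime import date

# Hypothetical metadata record: key names and values are placeholders.
metadata = {
    "purpose_and_method": {
        "why": "placeholder: scientific question the data answers",
        "how": "placeholder: experimental or simulation details",
    },
    "data_handling": [
        "placeholder: filtering step",
        "placeholder: averaging or curation step",
    ],
    "authorship": {
        "who": "A. Researcher",
        "when": date(2020, 1, 15).isoformat(),  # made-up date
        "where": "Example University",
    },
    "sources_and_references": [
        "placeholder: citation or DOI of any data this work builds on",
    ],
}

REQUIRED_SECTIONS = [
    "purpose_and_method",
    "data_handling",
    "authorship",
    "sources_and_references",
]


def missing_sections(record):
    """List required metadata sections that are absent or empty."""
    return [key for key in REQUIRED_SECTIONS if not record.get(key)]


# Text that could be saved alongside the data, e.g. as results.metadata.json
sidecar_text = json.dumps(metadata, indent=2)
```

Keeping the record machine-readable means it can be checked automatically and converted later to whatever template a repository expects.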

Sharing Data
• Contributing data to the community is everyone’s responsibility
• NIST, following the directives of the Materials Genome Initiative (MGI), has published a registry of data repositories: https://materials.registry.nist.gov/

Example: sharing your data in Citrination
Citrination is an AI platform and repository for materials discovery.
• How to prepare data:
  – Step 1: Selecting a template
  – Step 2: Adding metadata following the template
  – Step 3: Adding keywords
• For an Excel or .CSV file:
  – Template (http://help.citrination.com/knowledgebase/articles/1188136-citrine-template-csv)
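As a rough illustration of the preparation steps above, the sketch below builds a CSV file whose header row carries the column keywords. The column names and row values are hypothetical placeholders; the actual keywords must come from the template linked on this slide, which this example does not reproduce.

```python
import csv
import io

# Hypothetical column keywords: replace with those defined in the chosen template.
COLUMNS = ["sample_id", "formula", "property: hardness (GPa)", "keywords"]

# Placeholder rows with made-up values, only to show the layout.
ROWS = [
    {"sample_id": "S001", "formula": "AxBy",
     "property: hardness (GPa)": 1.0, "keywords": "placeholder; example"},
    {"sample_id": "S002", "formula": "AxCz",
     "property: hardness (GPa)": 2.0, "keywords": "placeholder; example"},
]


def rows_to_csv(columns, rows):
    """Serialize rows to CSV text, refusing rows with missing or empty columns."""
    for index, row in enumerate(rows):
        missing = [c for c in columns if row.get(c) in (None, "")]
        if missing:
            raise ValueError(f"row {index} is missing values for {missing}")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(COLUMNS, ROWS)
```

Checking for empty cells before writing catches incomplete records locally, before the file is uploaded.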

Citrination
Once you have your data with keywords from the template:
https://citrination.com/datasets/184812/
• How to contribute data:
  – Step 1: Log into Citrination (https://citrination.com/)
  – Step 2: Go to Add Data
  – Step 3: Upload your file
  – Step 4: Select an ingester (matches with the templates)
  – Step 5: Done

Sharing computational and data workflows
• Computational and data research involves complex workflows
  – Acquiring input files, simulation setup, running multiple simulations, postprocessing
  – Acquiring data, filtering, creating descriptors, training models, analyzing results
• Research workflows are often not published
  – Reproducing published results often takes months, even for experts
  – This slows down progress
• Jupyter notebooks
  – Combine rich text, powerful visualization, and live code
  – Powerful tool to document and share workflows
• Share via git-like repositories
• Publish your notebook in nanoHUB
  – Anyone can run it and modify it from any standard web browser

Workflow: calculating Tg of a polymer
1. Use Polymer Modeler to create amorphous polymer system
2. Visualize structure
3. MD simulations using LAMMPS on HPC resources
4. Post process results and plot

Publish your workflow with a few clicks; nanoHUB containerizes it for reproducibility
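The slide does not say how Tg is extracted in the post-processing step. One common approach, assumed here, is to fit straight lines to density versus temperature in the glassy (low-temperature) and rubbery (high-temperature) regimes and take their intersection as Tg. The sketch below uses synthetic data with a known kink at 400 K; the function names and the choice of how many points belong to each regime are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_tg(temps, densities, n_low, n_high):
    """Intersect lines fitted to the n_low coldest and n_high hottest points."""
    pairs = sorted(zip(temps, densities))
    low_slope, low_icpt = fit_line(*zip(*pairs[:n_low]))
    high_slope, high_icpt = fit_line(*zip(*pairs[-n_high:]))
    return (high_icpt - low_icpt) / (low_slope - high_slope)


# Synthetic density (g/cm^3) vs temperature (K) with a slope change at 400 K.
temps = list(range(200, 601, 50))
densities = [
    1.10 - (2e-4 if t < 400 else 6e-4) * (t - 400)
    for t in temps
]

tg = estimate_tg(temps, densities, n_low=4, n_high=4)
```

With real MD output the result depends on which temperature ranges are assigned to each regime, so those ranges should be reported with the result.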

Workflow: active learning for materials discovery
1. Query materials data from online repositories
2. Create state-of-the-art machine learning models
3. Use active learning to inform which experiments to run next
4. Plot and visualize your results

https://nanohub.org/tools/citrinetools
https://nanohub.org/tools/htoxideprop
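The loop behind step 3 can be sketched in a few lines. This toy version is not what the nanoHUB tools above implement: it replaces the machine learning model with a nearest-neighbour predictor, uses distance to the nearest measured point as a crude uncertainty, and picks the next experiment with an upper-confidence-bound-style rule. The hidden objective and all names are hypothetical.

```python
def run_experiment(x):
    """Stand-in for a real measurement (hypothetical objective, best at x = 7)."""
    return -((x - 7) ** 2)


def acquisition(x, measured, kappa=10):
    """Nearest-neighbour prediction plus a distance-based exploration bonus."""
    nearest = min(measured, key=lambda m: abs(m - x))
    return measured[nearest] + kappa * abs(nearest - x)


def active_learning(candidates, initial, n_rounds):
    """Repeatedly measure the unmeasured candidate with the highest acquisition."""
    measured = {x: run_experiment(x) for x in initial}
    for _ in range(n_rounds):
        pool = [x for x in candidates if x not in measured]
        if not pool:
            break
        next_x = max(pool, key=lambda x: acquisition(x, measured))
        measured[next_x] = run_experiment(next_x)
    return measured


# Eleven candidate "materials" indexed 0..10, starting from the two end points.
result = active_learning(candidates=range(11), initial=[0, 10], n_rounds=5)
best_candidate = max(result, key=result.get)
```

The parameter kappa trades off exploiting candidates near good measurements against exploring regions far from any measurement; in practice the surrogate would be a model with calibrated uncertainty estimates.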

Jupyter workflows in nanoHUB
• Use the Jupyter tool in nanoHUB to develop your workflow
  – https://nanohub.org/tools/jupyter
  – Lots of pre-installed packages for physics-based & machine learning modeling
• Learn more about using Jupyter and developing models
  – https://nanohub.org/whypublish
  – https://nanohub.org/resources/34611
• SimTools for physics-based simulators
  – Declared and validated inputs and outputs (FAIR principles)
  – https://nanohub.org/tools/introtosimtools/
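The idea of declared and validated inputs can be illustrated without any nanoHUB software. The sketch below is plain Python and does not use or imitate the actual SimTools library API: the NumberInput class, the example input names and their ranges are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumberInput:
    """Declaration of one numeric input: name, units and allowed range."""
    name: str
    units: str
    minimum: float
    maximum: float

    def validate(self, value):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{self.name} must be a number")
        if not self.minimum <= value <= self.maximum:
            raise ValueError(
                f"{self.name}={value} outside [{self.minimum}, {self.maximum}] {self.units}"
            )
        return float(value)


# Hypothetical declarations for a simulation workflow.
DECLARED_INPUTS = [
    NumberInput("temperature", "K", 1.0, 2000.0),
    NumberInput("n_chains", "", 1, 100),
]


def validate_inputs(declared, supplied):
    """Check supplied values against the declarations before running anything."""
    names = {d.name for d in declared}
    unknown = set(supplied) - names
    missing = names - set(supplied)
    if unknown or missing:
        raise KeyError(f"unknown: {sorted(unknown)}, missing: {sorted(missing)}")
    return {d.name: d.validate(supplied[d.name]) for d in declared}


checked = validate_inputs(DECLARED_INPUTS, {"temperature": 300, "n_chains": 10})
```

Declaring inputs up front lets a workflow reject bad requests before spending compute time, and the declarations double as machine-readable metadata about what the workflow accepts.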

Summary
• Using the FAIR Data Principles can benefit the research community significantly
• Sharing data and workflows helps transparency and reproducibility both in experiments and simulations
• Published articles have shown that sharing data accelerates innovation (Jain et al., Cubuk et al., Ling et al.)
• Data availability can greatly improve techniques based on data science

Jain, A., Hautier, G., Moore, C. J., Ong, S. P., Fischer, C. C., Mueller, T., ... & Ceder, G. (2011). A high-throughput infrastructure for density functional theory calculations. Computational Materials Science, 50(8), 2295-2310.
Cubuk, E. D., Sendek, A. D., & Reed, E. J. (2019). Screening billions of candidates for solid lithium-ion conductors: A transfer learning approach for small data. The Journal of Chemical Physics, 150(21), 214701.
Ling, J., Hutchinson, M., Antono, E., Paradiso, S., & Meredig, B. (2017). High-dimensional materials and process optimization using data-driven experimental design with well-calibrated uncertainty estimates. Integrating Materials and Manufacturing Innovation, 6(3), 207-217.