Provenance and workflows in AiiDA
Sebastiaan P. Huber
Introduction
AiiDA Virtual Tutorial - July 2020
Sebastiaan Huber, post-doc @ EPFL and developer of:
* aiida-core
* aiida-quantumespresso
In this talk:
* How and when does AiiDA store provenance?
* Short introduction to writing workflows
* Overview of AiiDA’s provenance model
Automated provenance
Keeping data provenance is important... but imagine having to link everything up manually.
Automated provenance
So how exactly is provenance automatically stored?
Automated provenance: calculations
Imagine the following simple arithmetic problem: add two numbers and multiply the sum by a third.
Just apply the calcfunction decorator to the functions that do the arithmetic.
Automated provenance: calculations
Now we just call it like a normal function while passing storable types.
The provenance is automatically stored in the database as a Directed Acyclic Graph:
▪ Directed: inputs go in, outputs come out
▪ Acyclic: the causality principle forbids an output being its own input
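As a rough illustration of the idea on this slide, the sketch below records a small provenance graph with a plain-Python decorator. It is not AiiDA's implementation: `record_calculation`, `Node` and the in-memory `GRAPH` list are hypothetical stand-ins for the calcfunction decorator, the storable data types and the database.

```python
import functools
import itertools

_pk = itertools.count(1)  # toy primary-key generator (assumption, not AiiDA)
GRAPH = []                # edges stored as (source_pk, link_label, target_pk)


class Node:
    """Toy stand-in for a storable value: wraps a number and receives a pk."""

    def __init__(self, value):
        self.value = value
        self.pk = next(_pk)


def record_calculation(func):
    """Toy decorator: run func and record its input and output links."""

    @functools.wraps(func)
    def wrapper(*inputs):
        process_pk = next(_pk)  # the process itself is also a node in the graph
        for node in inputs:
            GRAPH.append((node.pk, "input", process_pk))
        result = Node(func(*(node.value for node in inputs)))
        GRAPH.append((process_pk, "create", result.pk))
        return result

    return wrapper


@record_calculation
def add(x, y):
    return x + y


@record_calculation
def multiply(x, y):
    return x * y


# Add two numbers and multiply the sum by a third: (1 + 2) * 3
x, y, z = Node(1), Node(2), Node(3)
total = add(x, y)
product = multiply(total, z)

print(product.value)
for edge in GRAPH:
    print(edge)
```

Every edge points from an input to a process or from a process to the node it created, so the recorded graph is directed, and because each result is a brand-new node it cannot feed back into the process that produced it.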
Automated provenance: external codes
But not all code is well-suited to being written in Python: what about running external codes through AiiDA?
The implementation is different, as it requires a plugin, but running them is very similar...
Automated provenance: external codes
The generated provenance is similar to that of a calculation function:
▪ Has an additional output node containing the retrieved output files
▪ Can be run on remote machines through a job scheduler
▪ Implementation is independent of the job scheduler
▪ To change machine, just change the ‘code’ input
▪ Many can be run in parallel by submitting to the daemon
Automated provenance: workflows
Let’s go back to the first add-multiply example.
The individual sequence of calculations is recorded... but not the ‘how’ or ‘why’.
Automated provenance: workflows
Work functions can be used to store logical provenance.
We can look at the logical provenance to ‘hide’ the complexity, or ignore it to retrieve the original data provenance.
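The two views of the same graph can be sketched with a plain list of edges. This is only an illustration: the node names and the link-type strings below are assumptions chosen for the add-multiply example, and nothing here queries a real AiiDA database.

```python
# Each edge is (source, link_type, target); all names are illustrative only.
EDGES = [
    ("x", "input_work", "add_multiply"),
    ("y", "input_work", "add_multiply"),
    ("z", "input_work", "add_multiply"),
    ("add_multiply", "call", "add"),
    ("add_multiply", "call", "multiply"),
    ("x", "input_calc", "add"),
    ("y", "input_calc", "add"),
    ("add", "create", "sum"),
    ("sum", "input_calc", "multiply"),
    ("z", "input_calc", "multiply"),
    ("multiply", "create", "product"),
    ("add_multiply", "return", "product"),
]


def select(edges, link_types):
    """Keep only the edges whose link type is in link_types."""
    return [edge for edge in edges if edge[1] in link_types]


# Data provenance: which calculation created which data from which inputs.
data_provenance = select(EDGES, {"input_calc", "create"})

# Logical provenance: what the workflow received, called and returned.
logical_provenance = select(EDGES, {"input_work", "call", "return"})

# 'Hiding' the complexity: only the workflow's inputs and returned outputs.
black_box = select(EDGES, {"input_work", "return"})

for edge in black_box:
    print(edge)
```

Filtering on link type leaves the underlying graph untouched, so either view can be recovered at any time.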
Automated provenance: workflows
Work chains achieve the same but save progress in between steps.
The only difference with the work function solution is the node type.
Automated provenance: workflows
Work chains provide many advantages over work functions:
▪ Many can be run in parallel by submitting to the daemon
▪ Progress is saved between steps in checkpoints
▪ Process specification gives a succinct but clear summary
▪ Captures the scientific knowledge and can be re-run
▪ Can be re-used as a building block in more complex workflows
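The checkpointing idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not how work chains are implemented: `STEPS`, `run_steps` and the JSON checkpoint format are all hypothetical.

```python
import json


def add(ctx):
    ctx["sum"] = ctx["x"] + ctx["y"]


def multiply(ctx):
    ctx["product"] = ctx["sum"] * ctx["z"]


STEPS = [add, multiply]  # the ordered steps of this toy workflow


def run_steps(checkpoint, stop_after=None):
    """Run the remaining steps, serializing the state after every step."""
    state = json.loads(checkpoint)
    while state["next_step"] < len(STEPS):
        STEPS[state["next_step"]](state["ctx"])
        state["next_step"] += 1
        checkpoint = json.dumps(state)  # progress is saved between steps
        if stop_after is not None and state["next_step"] >= stop_after:
            break  # simulate the process being interrupted here
    return checkpoint


start = json.dumps({"next_step": 0, "ctx": {"x": 1, "y": 2, "z": 3}})
interrupted = run_steps(start, stop_after=1)  # stops after the first step
finished = run_steps(interrupted)             # resumes from the checkpoint

print(json.loads(interrupted))
print(json.loads(finished)["ctx"]["product"])
```

Because the state after each step is serialized, the second call picks up at the multiplication without repeating the addition.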
Automated provenance: losing provenance
Workflows cannot create new data. Doing so anyway will cause loss of provenance.
Take the first example: add two numbers and multiply with a third... but now compute the product inside the work function.
Why a difference between calculations and workflows
Since workflows can return data, they can also return their own inputs.
This breaks the acyclicity and therefore the DAG.
Two clearly distinct types of processes:
CALCULATIONS
▪ Can create new data
WORKFLOWS
▪ Can call other processes
▪ Can return existing data
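The problem can be made concrete with a small cycle check on a toy graph. The sketch below is an illustration only: `has_cycle` is a generic depth-first search, and the node names are made up for this example.

```python
def has_cycle(edges):
    """Return True if the directed graph of (source, target) pairs has a cycle."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])

    unvisited, in_progress, done = 0, 1, 2
    state = dict.fromkeys(graph, unvisited)

    def visit(node):
        state[node] = in_progress
        for child in graph[node]:
            if state[child] == in_progress:
                return True  # reached a node that is still on the current path
            if state[child] == unvisited and visit(child):
                return True
        state[node] = done
        return False

    return any(state[node] == unvisited and visit(node) for node in graph)


# A calculation takes x as input and creates a new node: no cycle possible.
calculation = [("x", "add"), ("add", "sum")]

# A workflow takes x as input and hands x straight back. If that link were
# treated like a 'create' link, the graph would contain x -> workflow -> x.
workflow_returning_input = [("x", "workflow"), ("workflow", "x")]

print(has_cycle(calculation))
print(has_cycle(workflow_returning_input))
```

Keeping 'create' for calculations and 'return' for workflows means the data provenance, which follows only the former, always stays acyclic.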
Provenance design: an overview
Processes and nodes
The engine runs processes and stores a corresponding node in the database.
Launching processes
Four process launchers with an identical interface:
run            Run blockingly and return result
run_get_node   Run blockingly and return result + node
run_get_pk     Run blockingly and return result + pk
submit         Submit to daemon and return node
▪ Running will block the interpreter; submitting will not, as the daemon will take care of it
▪ run: run blockingly in the current interpreter
▪ run_get_node and run_get_pk: run variants to get the process node or pk in addition to the result
▪ The variants are available as attributes on the run launcher, requiring only a single import
▪ submit: submit to the daemon to immediately regain control of the interpreter
▪ A dictionary with keyword expansion can be used in case of many inputs
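The shape of this interface can be imitated in plain Python. The sketch below is a toy under stated assumptions, not AiiDA's engine: the launchers here simply call a function, `ProcessNode` is a hypothetical stand-in for the stored node, and `submit` is left out because it needs a daemon.

```python
import itertools

_pk = itertools.count(1)  # toy primary-key generator (assumption)


class ProcessNode:
    """Toy stand-in for the node stored for each process that is run."""

    def __init__(self):
        self.pk = next(_pk)


def run_get_node(process, **inputs):
    """Run blockingly and return the result together with the node."""
    node = ProcessNode()
    return process(**inputs), node


def run_get_pk(process, **inputs):
    """Run blockingly and return the result together with the pk."""
    result, node = run_get_node(process, **inputs)
    return result, node.pk


def run(process, **inputs):
    """Run blockingly and return only the result."""
    result, _ = run_get_node(process, **inputs)
    return result


# Expose the variants as attributes, so importing `run` alone is enough.
run.get_node = run_get_node
run.get_pk = run_get_pk


def add_multiply(x, y, z):
    return (x + y) * z


# With many inputs, collect them in a dictionary and expand it on the call.
inputs = {"x": 1, "y": 2, "z": 3}
result = run(add_multiply, **inputs)
result_with_node, node = run.get_node(add_multiply, **inputs)
result_with_pk, pk = run.get_pk(add_multiply, **inputs)

print(result, node.pk, pk)
```

All three calls take the process first and its inputs as keyword arguments, which is what lets one launcher be swapped for another without touching the inputs.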
Command line interaction: verdi process
Your one-stop shop for inspecting and interacting with processes:
▪ verdi process list: list active and terminated processes
▪ verdi process status: tree representation of the call stack
▪ verdi process report: complete report of log messages and scheduler stdout/stderr
▪ verdi process pause: pause an active process
▪ verdi process play: resume a paused process
▪ verdi process kill: kill an active process
▪ verdi process pause/play/kill: fails if the process is already terminated
The Materials Cloud and AiiDA teams
Developers and AiiDA contributors:
Carl Simon Adorf (EPFL), Francisco F. Ramirez (EPFL), Casper W. Andersen (EPFL), Leopold Talirz (EPFL), Marnik Bercx (EPFL), Aliaksandr Yakutovich (EPFL), Marco Borelli (EPFL), Chris Sewell (EPFL), Valeria Granata (EPFL), Sebastiaan P. Huber (EPFL), Giovanni Pizzi (EPFL), Berend Smit (EPFL), Leonid Kahle (EPFL), Joost VandeVondele (ETHZ, CSCS), Snehal P. Kumbhar (EPFL), Thomas Schulthess (ETHZ, CSCS), Elsa Passaro (EPFL), Nicola Marzari (EPFL)
- Slides: 29