Analysis Description Languages Workshop
ANALYSIS LANGUAGES: DISCUSSION AND IDEAS
G. Watts (UW/Seattle)
Analysis Languages Workshop, May 7, 2019

2 MOTIVATION
How can we do analysis:
• Correctly?
• Quickly?
• With a small team?
• With efficient use of resources?
From S. Sekmen’s talk

3 Think Big!

4 THE LHC

5 THE LHC
Analysis files will be ~10s of TB
• Laptops for anything other than development? What about running at 10,000 meters?
• Will need server/PROOF-like functionality
• Shared resources between groups or countries?
But you need local editing!!
• Editing a file at CERN from the USA is not acceptable.
Think about scaling as we design these languages!

6 ENVIRONMENT
How is a user going to start using this?
• Software requirements
  • Version of Python, gcc, etc.
• Metadata that may be experiment specific
• Frameworks that are used to access data
Personal opinion: the world is moving to a container-based sandbox method of application distribution, e.g. Docker
• Built into Linux
• Built into Windows (soon)
• macOS: VM
• No story for Chromebook
By Run 4, make this a zero-level requirement for local development?

7 WHERE BE THE DATA?
Data Format
• All LHC experiments write out ROOT data
• Many smaller ones are avoiding ROOT
• Most experiments have a custom ROOT format
• Can’t be read without the experiment’s software framework
• TTrees without objects are a common intermediate format
• Non-LHC experiments are moving away from ROOT
Where is the data stored?
• For Run 4: data lakes
• Large federated storage
• Distributed across the country
• Delivery by cache
• Perhaps basic transform by iDDS?
Tooling
• Most of our tools expect ROOT format
• Most tools outside HEP expect numpy or a similar format
• Increasing in popularity in the field
• pandas, HDF5, awkward array
Need bridges!

8 iDDS
• Deliver just the data you want
• Dynamic requests
• Transform by adding new ‘columns’, removing some, reformatting, etc.
• Reduce disk usage!
[Diagram labels: iDDS, Analysis System]

9 NOT MY DAD’S COMPUTER
Compute power is now in co-processors!
All the supercomputers announced for the start of Run 4 are GPU enhanced!
A21 will have a significant amount of Intel Optane memory
These co-processors like to crunch columns of data, not rows.
1. Rewrite benchmark analyses to use co-processors to prove it is faster (I suspect they are, judging from numba benchmarks)
2. Write out analysis languages so we can move between physics and computer representations
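To make "columns of data, not rows" concrete, here is a minimal sketch of turning per-event jet lists into one flat column plus an offsets array, the layout that columnar tools such as awkward array are built around. The toy data and the helper names (`to_columns`, `count_above`) are assumptions for illustration, not anything from the talk, and only the standard library is used.

```python
from array import array

# Hypothetical toy events: each event holds a variable-length list of jet pT values (GeV).
events = [[55.0, 130.3], [], [85.3, 42.0, 12.5]]


def to_columns(events):
    """Convert row-wise (per-event) jet lists into a columnar layout:
    one flat content array plus offsets marking the event boundaries."""
    content = array("d")
    offsets = array("q", [0])
    for jets in events:
        content.extend(jets)
        offsets.append(len(content))
    return content, offsets


def count_above(content, offsets, threshold):
    """Per-event count of jets above threshold, computed from the columns."""
    # The cut itself is a single pass over one contiguous column.
    passed = [pt > threshold for pt in content]
    # The offsets are only needed to regroup the result by event.
    return [sum(passed[offsets[i]:offsets[i + 1]]) for i in range(len(offsets) - 1)]


content, offsets = to_columns(events)
counts = count_above(content, offsets, 40.0)
```

The cut touches only the flat `content` column, which is the kind of loop a GPU or vector unit handles well; the event structure lives entirely in `offsets`.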

10 HOW WILL PEOPLE USE IT?
JupyterLab interface
• Tutorials
• Quick examinations
• Easy to present text, code, plots in one
• One level up from TTree::Draw
Would you write an analysis in a notebook?
Would you preserve an analysis in a notebook?
Chromebook with a big enough backend?

11 HOW WILL PEOPLE USE IT?
Command Line/CI
• Any automation
• Continuous Integration for testing of analyses
• Complex algorithms?
What would output look like?

12 HOW WILL PEOPLE USE IT?
Full-Fledged IDE/GUI
• IDEs now have language servers
  • Syntax checking, type checking
  • Compile errors as you type in your editor
  • Debugging in your editor
  • Automatic completion
  • Underused in the field, but huge productivity enhancers
• Custom GUI for the language
  • As long as text files remain the lingua franca
  • Hard to round-trip

13 WHAT WILL THEY DO WITH IT?
Preserve the Analysis
• This will only be done in the language that was used to write the analysis originally.
• Too expensive otherwise
• RECAST ‘preserves the mess’, for example
Explore
• Lots of ideas, lots of dead ends
• Plots and scripts…
• Lots of stuff thrown away
• A few nuggets kept
• Though we often do not remember to remove them from our code!
Quick Checks
• Explore a funny shape in one distribution
• Often need to reuse a complex selection
• But add-on code should be separate
• Otherwise it remains as dead code long after it is needed
• Key thing that leads to 5000-line-long C++ macros
Analysis
• Big Iron
• Systematics, control regions, fitting, etc.
• Carefully tracked and maintained

14 Can we use the same toolset and language to do all of this?

15 A few thoughts

16 99 LANGUAGES ON THE WALL… TAKE ONE DOWN, PARSE IT AROUND…
[Diagram labels: Data, Query, ADL, Hist etc., Limit Plot]
1 Data
• Structured, binary, etc.
• Scalars
• (non-event data)
2 Analysis Description Language
• Control and signal regions
• Fitting
• Systematics, ML control, etc.
3 Query Language
• Per-event language
• Declarative
• Select events, objects
• Calculate ML results
• Histograms or some other aggregate data back

17 QUERY LANGUAGE
Specifically designed to loop over structured data

18 NO EVENT (DATABASE) LEFT BEHIND
[Table: one block of rows per event (Run #10 Event #123, Run #10 Event #234, Run #11 Event #501). Sample rows: 55.0, 1.2, 2.34, Near Tracks 1, 2, 10; 130.3, 0.5, -0.7, Near Tracks 3, 5, 10; 55.0, 1.2, 2.34; 130.3, 0.5, 1.2; 85.3, -1.2, 0.78; …]
Physics: every collision is independent
This has strong effects on our compute approach
• Embarrassingly parallel
• Each event can be its own database
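Because every collision is independent, a per-event computation can be split into chunks, run anywhere, and merged afterwards. The sketch below illustrates this with a toy jet pT histogram. The event tuples, the 50 GeV bin width, and the helper names `fill` and `merge` are assumptions for illustration; a thread pool stands in for whatever cluster or GRID backend would really be used.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical events: (run, event, list of jet pT in GeV).
events = [
    (10, 123, [55.0, 130.3]),
    (10, 234, [55.0, 130.3, 85.3]),
    (11, 501, [20.0]),
]


def fill(chunk, bin_width=50.0):
    """Histogram jet pT for one chunk of events; touches no shared state."""
    hist = Counter()
    for _run, _event, jets in chunk:
        for pt in jets:
            hist[int(pt // bin_width)] += 1
    return hist


def merge(partials):
    """Combine partial histograms by adding bin contents."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# One chunk per event here; any chunking gives the same answer.
chunks = [events[i:i + 1] for i in range(len(events))]
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = merge(pool.map(fill, chunks))

serial = fill(events)
```

The parallel and serial histograms are identical, which is what lets the system, rather than the user, decide how to split the work.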

19 CS ALREADY KNOWS
The syntax isn’t awesome
But the set of operations is complete and unambiguous

20 EACH EVENT IS A DATABASE
    events.SelectMany(e => e.Jets).FuturePlot("jet_pt", "Jet p_T", 100, 0.0, 1000.0, j => j.pt).Save(hdir);
Run a query over each event, aggregate in a histogram.
    events.SelectMany(e => e.Jets).Where(j => j.pt > 40.0).Count()
Run a query over each event, aggregate in a single integer.
Though it is clear to us what is meant here, it is a bit tricky to code up crossing the event boundary.
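For readers who do not know LINQ, the two queries above can be mimicked in plain Python. This is only a sketch of the semantics, not the talk's implementation: the dict-based event model and the helpers `select_many` and `histogram` are assumptions, and the binning (100 bins from 0 to 1000) simply mirrors the numbers in the C# call.

```python
from itertools import chain

# Hypothetical event model: each event is a dict holding a list of jets.
events = [
    {"jets": [{"pt": 55.0}, {"pt": 130.3}]},
    {"jets": []},
    {"jets": [{"pt": 85.3}, {"pt": 12.5}]},
]


def select_many(events, selector):
    """Flatten the per-event sequences into one stream, like LINQ's SelectMany."""
    return chain.from_iterable(selector(e) for e in events)


def histogram(values, nbins, lo, hi):
    """Fixed-width histogram; values outside [lo, hi) are dropped."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts


# First query: all jets, aggregated into a histogram of pT.
jet_pts = [j["pt"] for j in select_many(events, lambda e: e["jets"])]
jet_pt_hist = histogram(jet_pts, 100, 0.0, 1000.0)

# Second query: all jets with pT > 40, aggregated into a single integer.
n_hard_jets = sum(1 for j in select_many(events, lambda e: e["jets"]) if j["pt"] > 40.0)
```

Both aggregates run over jets from all events at once, which is the "crossing the event boundary" step that flattening makes explicit.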

21 NO EVENT (DATABASE) LEFT BEHIND
• How to reason about nested data structures
• Flatten nested arrays, filter, sorting, matching, multi-object looping, etc.
• Terminals (Count, Aggregate, Max, Min, etc.)
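As an illustration of the operations listed above, the following sketch matches jets to nearby tracks inside a single event (a multi-object loop with a filter), sorts the jets, and applies a few terminals. The event contents, the 0.4 cone size, and the function names are assumptions for illustration only.

```python
import math

# Hypothetical event with two nested collections.
event = {
    "jets": [
        {"pt": 55.0, "eta": 1.2, "phi": 2.34},
        {"pt": 130.3, "eta": 0.5, "phi": -0.7},
    ],
    "tracks": [
        {"pt": 5.0, "eta": 1.25, "phi": 2.30},
        {"pt": 9.0, "eta": 0.45, "phi": -0.65},
        {"pt": 2.0, "eta": -2.0, "phi": 0.10},
        {"pt": 7.0, "eta": 0.55, "phi": -0.80},
    ],
}


def delta_r(a, b):
    """Angular distance, with the phi difference wrapped into [-pi, pi)."""
    dphi = (a["phi"] - b["phi"] + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(a["eta"] - b["eta"], dphi)


def near_tracks(event, cone=0.4):
    """For each jet, the indices of the tracks inside the cone."""
    return [
        [i for i, t in enumerate(event["tracks"]) if delta_r(j, t) < cone]
        for j in event["jets"]
    ]


matches = near_tracks(event)

# Sorting, then terminals (first, count, max).
leading = sorted(event["jets"], key=lambda j: j["pt"], reverse=True)[0]
n_matched = sum(len(m) for m in matches)
max_near = max(len(m) for m in matches)
```

The result `matches` is itself a nested structure (one list of track indices per jet), which is why a query language for this data has to be able to reason about nesting rather than flat tables.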

22 ANALYSIS LANGUAGE
I have always thought of the ADL as the wild west
• Totally wacky manipulations of query results
• Impossible to predict
You need a General Purpose Programming Language
HistFactory: a statistical package that combines the results of queries into limits, etc.
[Diagram labels: ADL, Query Language]

23 KEEP THEM SEPARATED?
[Diagram labels: ADL, Query Language, ?, Query Language]

24 WHY I CHOSE C# ORIGINALLY
It has a query language (SQL) embedded in the general-purpose language (GPL)
C# is well supported
• Tooling, debuggers, etc. all for free!
• Parser and AST built into the language standard
• I just had to implement a library back-end!

25 LEAKY ABSTRACTIONS
1. Carefully control where the abstraction leaks
2. Especially dangerous in the query language
• Automated optimization is much more difficult
• Limits the type of backend you can run on (GPU, CPU, etc.)

26
    pip install physics-tenpy
Installs a quantum many-body simulator
    pip install scikit-hep
Installs packages for doing HEP work in Python
Ecosystem
• Uniform interface for installing add-ons
• Exist for many programming languages
Why not one for an analysis language?
Reuse one that is out there if possible!!

27 LAZY EVALUATION
If the user makes the plot “jet pT”, do not calculate the delta R between jets and tracks that is used only for another, unrequested plot.
• e.g. an unused control region in the ADL file
System should optimize, not the user
This means dataflow! Start from the goal and work backwards
This is particularly useful in an analysis group
• Lots of people work on a ‘framework’
• Has lots of regions, algorithms, etc.
• User needs only a small portion of them.
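A minimal sketch of "start from the goal and work backwards": the analysis is a graph of named nodes, and evaluation pulls in only the nodes the requested plot depends on. The graph contents, the node names, and the `evaluate` helper are all hypothetical; the `calls` list exists only to show which nodes actually ran.

```python
calls = []  # records which nodes were actually executed


def node(name, fn):
    """Wrap a function so that each execution is recorded under its node name."""
    def wrapped(*args):
        calls.append(name)
        return fn(*args)
    return wrapped


# Hypothetical dataflow graph: node name -> (input node names, function).
graph = {
    "jets": ([], node("jets", lambda: [55.0, 130.3, 85.3])),
    "tracks": ([], node("tracks", lambda: [5.0, 9.0])),
    "jet_pt_plot": (["jets"], node("jet_pt_plot", lambda jets: sorted(jets))),
    "delta_r": (
        ["jets", "tracks"],
        node("delta_r", lambda jets, tracks: len(jets) * len(tracks)),
    ),
    "delta_r_plot": (["delta_r"], node("delta_r_plot", lambda n: [n])),
}


def evaluate(goal, graph, cache=None):
    """Compute the goal by recursing into its inputs; nothing else is touched."""
    cache = {} if cache is None else cache
    if goal not in cache:
        inputs, fn = graph[goal]
        cache[goal] = fn(*(evaluate(i, graph, cache) for i in inputs))
    return cache[goal]


result = evaluate("jet_pt_plot", graph)
```

Requesting only `jet_pt_plot` runs `jets` and `jet_pt_plot`; the `tracks`, `delta_r`, and `delta_r_plot` nodes stay in the shared "framework" graph but cost nothing.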

28 CAN WE WRITE IT ONCE?
Write the same ADL source file for CMS, ATLAS, etc.?
Superficially: yes. Usefully: no.
The ADL and the query language may be opinionated…
But they can’t be too opinionated or they will suffer adoption problems.

29 Moving from C# to Python

30 CURRENT WORK
Based on my LINQ work (first check-in was Dec 11, 2010; 2250 commits)
[Diagram labels: Query Language, AST, DAG, Backend]

31 CURRENT WORK
Backend
• Run on ATLAS xAODs
• Run on awkward arrays
• Run on flat TTrees with RDataFrame
Second axis:
• Run on the GRID
• Run on a local cluster
Second axis:
• Awkward arrays could run on a GPU or CPU

32 CURRENT WORK
The AST (or DAG) contains the complete information for a query
• How to manipulate the data
• How to filter the data
• What histogram to calculate
• How to weight the data
• Application of an ML weight
• etc.

33 CURRENT WORK
[Diagram labels: AST, Cache, DAG]
Insert a cache between these two
• The AST becomes the cache key
• Can request the same plot, and the second time it should take milliseconds to return it
  • Not 10 minutes with a compute cluster
• Can re-run the full analysis in a second or two
• Spend time making only the new plots, but have it all together
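One way "the AST becomes the cache key" could look, sketched with Python's standard `ast` and `hashlib` modules. This is an assumption about the mechanism, not the talk's actual code: `cache_key`, `QueryCache`, and `slow_backend` are hypothetical names, and the query strings are just LINQ-style Python expressions invented for the example.

```python
import ast
import hashlib


def cache_key(query_source):
    """Hash the parsed AST, so formatting differences do not change the key."""
    tree = ast.parse(query_source, mode="eval")
    # ast.dump omits line and column information by default.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


class QueryCache:
    """In-memory cache sitting in front of a backend callable."""

    def __init__(self, backend):
        self._backend = backend
        self._results = {}
        self.backend_calls = 0

    def run(self, query_source):
        key = cache_key(query_source)
        if key not in self._results:
            self.backend_calls += 1
            self._results[key] = self._backend(query_source)
        return self._results[key]


def slow_backend(query_source):
    # Stand-in for the minutes a real cluster job might take.
    return "result of " + " ".join(query_source.split())


cache = QueryCache(slow_backend)
first = cache.run("events.SelectMany(lambda e: e.jets).Where(lambda j: j.pt > 40.0).Count()")
second = cache.run("events.SelectMany( lambda e: e.jets ).Where(lambda j: j.pt>40.0).Count()")
```

The second request differs only in whitespace, parses to the same AST, and is served from the cache without calling the backend again. A real system would also need the dataset identity in the key; that is left out of this sketch.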

34 STATUS
Jupyter Examples
Query Language: LINQ in Python
• No effort to make it concise
• Text strings (ick!)
AST: the AST is Python’s
• Some minor extensions
• Can transform
  • To add convenience tuples, for example
• And can be put into an HTTP request
DAG, Backend: the DAG is part of the backend
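Since the AST is Python's own, the standard `ast` module can already parse, transform, and serialize a query. The sketch below shows one hypothetical transformation (renaming a user-facing attribute to a backend column name) and a text form of the tree wrapped in JSON, as it might travel in an HTTP request body. The `RenameAttribute` class, the `pt` to `jet_pt` mapping, and the payload layout are assumptions, not the talk's extensions; `ast.unparse` requires Python 3.9 or later.

```python
import ast
import json


class RenameAttribute(ast.NodeTransformer):
    """Hypothetical rewrite: map a user-facing attribute name to a backend column name."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Attribute(self, node):
        self.generic_visit(node)  # rewrite nested attributes first
        node.attr = self.mapping.get(node.attr, node.attr)
        return node


query = "lambda j: j.pt > 40.0"
tree = ast.parse(query, mode="eval")
tree = RenameAttribute({"pt": "jet_pt"}).visit(tree)

# Back to source text, to check what the transformation did.
rewritten = ast.unparse(tree)

# Text form of the AST inside a JSON document, ready to be sent to a service.
payload = json.dumps({"ast": ast.dump(tree)})
restored = json.loads(payload)["ast"]
```

Working on the tree rather than on the text string is what makes such rewrites safe: the transformer only ever sees attribute nodes, never a substring that happens to contain "pt".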

35 WHAT’S NEXT?
Query language
• Use Python with language parsing
• No (or almost no) text strings
• Types?
• Play to Python’s strengths
AST
• Convert to the simplified AST that Jim has discussed
DAG
• DAG is part of the backend
Backend
• Turn into a web service
• Create cache
• Already have a web service to load GRID data locally
[Diagram label: iDDS]

36 THE CONVERSATION
• There is a huge amount of activity around Analysis and Query Languages
• Join the conversation!
  • HSF Data Analysis Forum (home (email list), indico)
  • Topical Meetings @IRIS-HEP (sign up for one!)
  • CHEP and ACAT conferences
    • CHEP deadline is soon! Please submit an abstract!
  • IRIS-HEP/Slack channel
• Think Big
  • The context for Run 3 and Run 4 is much bigger than we are used to
  • Can we do a full analysis with a small team? Scalability?
  • An Analysis System, not just an ADL!