The Bookkeeping SQL API Tim Adye Rutherford Appleton
The Bookkeeping SQL API Tim Adye Rutherford Appleton Laboratory Bookkeeping / Data Distribution Parallel Ba. Bar Collaboration Meeting 8 th December 2004 Tim Adye 1
Talk Plan • The problem • “Why not just write SQL? ” • The Ba. Bar SQL API • • User view: Perl classes and command-line tool Behind the scenes Table schema configuration classes Summary of features • Could this be generalised to other applications? • Possible improvements • Comparison with other DBIx packages • Summary and references 8 th December 2004 Tim Adye 2
User Access • Users need to query database to find out what data to process • May also need other information • eg. luminosity, run numbers, file sizes • Mostly select by dataset, but may need to limit further • eg. only data available locally, taken at peak energy, excluding some problem datataking period • Cannot expect users to know which tables to use, how to join them, or even the SQL syntax • Even worse if the schema change • Cannot expect developers to code for all combinations of queries with all possible selections • Previous ad hoc tools (some mine!) tried to do this and it was a nightmare, even for a simpler table structure 8 th December 2004 Tim Adye 3
8 th December 2004 Tim Adye 4
Ba. Bar SQL API – user view • Each column that users might want to query or select on is given a unique logical name – regardless of which table it lives in • These names are used to specify query values $query->add. Values('collection', 'gbytes'); and selections Each of these happens to be in a different table $query->add. Selector('dataset', 'Dilepton-*'); $query->add. Selector('run', '10000 -19999'); • Different types of data allow for different selection syntax, eg. wildcards for names, or ranges for run numbers. • Can also use SQL expressions (in terms of logical names) $query->add. Values('SUM(lumi)/1000'); and sorting, row limits, etc 8 th December 2004 Tim Adye 5
SQL API – returning results • That’s enough to generate a valid SQL SELECT query. To return the results: my $sta = $query->execute(); while (my $row = $sta->fetch()) { print $row->gbytes(), $row->collection(), "n"; } • The $query object collects the user requests • $query->execute() returns a “statement accessor” (like a DBI statement handle). • $sta iterates over row objects, each of which has accessors for each query value, gbytes and collection. • That’s all there is to it! • After the usual DBI connect, and $query object instantiation (see later), these statements form a working program 8 th December 2004 Tim Adye 6
Command-Line Tools • Standard Ba. Bar tools use this API to create job configuration, create datasets, calculate luminosities, etc. • Standard tasks, but optionally allowing additional selections • Also provide an “expert” tool that allows access to full API functionality from the command line • This has proved very popular, with many “non-experts” making their own unique queries 8 th December 2004 Tim Adye 7
Examples $ Bbk. User --dataset=A 0 -Run 4 -On. Peak-R 14 --is_local=1 --file_status=0 dse_lumi events gbytes file --style=adye --display DSE_LUMI ==== 1250. 3 1348. 4. . . 156 rows EVENTS GBYTES FILE ========================== 526115 1. 6 /store/PRskims/R 14/14. 4. 0 d/A 0/02/A 0_0239. 01. root 526115 0. 8 /store/PRskims/R 14/14. 4. 0 d/A 0/02/A 0_0239. 02 HBCA. root 576239 1. 6 /store/PRskims/R 14/14. 4. 0 d/A 0/02/A 0_0240. 01. root 576239 1. 0 /store/PRskims/R 14/14. 4. 0 d/A 0/02/A 0_0240. 02 HBCA. root returned $ Bbk. User –-collection-file=coll. lis tot_gbytes collection 8 th December 2004 Tim Adye 8
What happens behind your back • The SQL API • translates the logical names to table columns • selects the required tables and joins • including otherwise unused tables required for the joins • generates and executes a valid SQL SELECT statement • creates a statement accessor object • dynamically generates a class for the row objects with accessors for each query value 8 th December 2004 Tim Adye 9
Bbk. User --dataset=A 0 -Run 4 -On. Peak-R 14 --is_local=1 --file_status=0 dse_lumi events gbytes file Our Example • That first Bbk. User command involved 5 tables • including one that provides the join between dataset and collection tables SELECT dse. lumi_sum AS "dse_lumi", dse. output_nev AS "events", file. bytes, dse. name AS "collection", file. suffix AS "file_suffix", ds. id AS "ds_id", dse. id AS "dse_id", dtd. id AS "dtd_id", dtd. link_status FROM bbk_dataset ds, bbk_dsentities dse, data_files dfile, bbk_files file, bbk_dstodse dtd WHERE ds. id=dtd. ds_id The SQL API can even AND dtd. dse_id=dse. id pretty-print it like this for you AND dse. id=file. dse_id (What’s shown here is somewhat AND file. id=dfile_id abbreviated: actual command AND ds. name='A 0 -Run 4 -On. Peak-R 14' includes full database and table AND dse. is_local='1' names in case of ambiguities) AND dfile. status='0'; 8 th December 2004 Tim Adye 10
8 th December 2004 Tim Adye 11
Table schema configuration classes • Mapping between logical names and table columns is defined in the configuration classes. • One class per table • Can also define special properties of each column (eg. whether to allow ranges (“ 100 -199”) for selection). • Possible joins between tables defined here too • Use logical column names for join conditions, so one table class does not need to know about column names in other classes. • In most cases it’s just a matter of listing logical vs. column names • with a little Perl syntactic sugar • Inheritence of config classes expresses commonalities • eg. common id and created, and modified columns 8 th December 2004 Tim Adye 12
Example Table Configuration sub table { return 'bbk_files' } sub table. Config { return { alias => 'file', columns => [ bytes uuid checksum file_suffix nfiles gbytes tot_gbytes file_dse_id file ], joins => [ dse_id file_id ], }} => => => 'bytes‘, 'uuid', 'checksum', 'suffix', 'COUNT(DISTINCT file_id)', '(bytes/1073741824)', 'SUM(bytes)/1073741824', { column => 'dse_id', selector. Type => 'range' }, { value. Action => 'add. Lfn. Value', selector. Action => 'lfn. Selector' }, => 'file_dse_id', => 'dfile_id', 8 th December 2004 Tim Adye 13
Putting it all together • Configuration classes must be registered with $query object my $query = new Bbk. Sql. Select($bbkconfig); $query->add. Modules(new My. Table. Class($bbkconfig)); but of course it is usually simpler to provide a $query object pre-registered with all the table configs as part of a specific API. 8 th December 2004 Tim Adye 14
Overriding and Synthetic Columns • A crutial advantage of this system is that it allows us to override the default behaviour • Allows us to hide complexities from user • Make even complex schema changes transparent to users • A logical name can refer to • • ordinary database column name SQL expression (in terms of database columns, or other logical names) Perl method to pre- or post-process selection or query value “synthetic” query value or selection • can return calculated value or alter behaviour • Global post-processing • Can be triggered by value, selection, or table inclusion • Allows global filtering of returned rows 8 th December 2004 Tim Adye 15
What happens behind your back 2 • We already used some of these features without noticing! 1. Dataset names can be found in the bbk_dataset or the bbk_aliases table • Requires a check and translation using the alias table 2. Datasets can evolve with time, with collection being added or removed • • Need to query dataset for any time in the past, or use tagged dataset alias (like a CVS tag) Implemented by automatically including date selection in query, and post-processing returned results to remove deleted collections 3. File names are made from a collection name + a suffix • $query->add. Selector('file') splits the file name for the query and the $row->file() accessor rejoins them 8 th December 2004 Tim Adye 16
Features • Supports Oracle 8 and My. SQL 3. 23 • Most queries that can be expressed in both these dialects can be expressed by users via the API – without breaking the paradigm of a flat namespace • • • aggregatation and grouping sorting and distinct My. SQL’s LIMIT emulated in Oracle inner and outer joins (generates Oracle or My. SQL syntax) Does not support UNION or subqueries • Could be added, but not in My. SQL 3. 23 • Convenience features • automatic Getopt specification • query results display formatting and summary table generation • Configuration class summary table generation 8 th December 2004 Tim Adye 17
Limitations • Assumes tables can be joined in a unique way • ie. the joins form an acyclic graph • can still select different joins with explicit switches • Each column must have its own unique logical name • This is usually a good thing • but if the same data is held in different columns, it would be more efficient to automatically select from tables that are already included 8 th December 2004 Tim Adye 18
A public version • Current version has a few Ba. Bar-specific pieces • Ba. Bar Connection/Configuration manager – can use DBI directly • Ba. Bar Options manager – can use Getopt directly • Ba. Bar base objects – borrow required methods • Ba. Bar table formatting class – publish this too … otherwise just uses standard Perl modules but with different table configs could be used elsewhere • Already do this in Ba. Bar – used for QA and TM databases • Maybe I’m making some other assumptions that are true of our database and requirements, but not more generally so. I can’t think of any. • Needs a better name! • This is really an SQL API creator 8 th December 2004 Tim Adye DBIx: : Sql. Abstractor ? ? ? 19
Possible improvements • Tidy up code! • User and config APIs are OK, but in between it’s pretty ugly • Separate functionality that can be used on its own • Already true of the DBI statement accessor class • More SQL dialects: Postgre. SQL, MS SQL? • New SQL syntax: subqueries, UNIONs, … • INSERT, UPDATE, etc • these don’t need joins, so hand-coding not such a problem • Automatic selection of different join possibilities • Automatic generation of default table classes from SQL schema • Could use “The SQL Fairy” • though not much work to do it by hand 8 th December 2004 Tim Adye 20
Why not use another package? • More than 100 DBIx and other SQL access packages in CPAN • Could not find any that do all (or even most) of • • • hide table structure from user allow multi-table queries, taking care of joins automatically do not impose their own conventions on table schema allow query values and selections to be overridden allow transparent post-processing of query results provide accessor functions for query results • I believe that taken together these features provide a clear and easy to use abstraction 8 th December 2004 Tim Adye 21
Feedback and Discussion • Would this be useful outside Ba. Bar? • Is it a good idea to make a public release? • eg. on CPAN • Does it need any improvements? • New features • Make it compatible with some other standards • eg. sit on top of another abstraction like DBIx: : Table • A better name! 8 th December 2004 Tim Adye 22
References • Ba. Bar Bookkeeping project http: //slac. stanford. edu/BFROOT/www/Computing/Distri buted/Bookkeeping/Documentation/ • Ba. Bar Bookkeeping presentation and paper http: //indico. cern. ch/contribution. Display. py? contrib Id=338&session. Id=7&conf. Id=0 D. A. Smith et al. , Ba. Bar Book Keeping project – a distributed meta-data catalog of the Ba. Bar event store, Proc. Computing in High Energy and Nuclear Physics 2004 (CHEP 04). • CPAN Database Interfaces (see particularly DBIx) http: //cpan. uwinnipeg. ca/chapter/Database_Interfaces 8 th December 2004 Tim Adye 23
- Slides: 23