
The (new) Table Browser

Talk Outline
• Table Browser History
• New Table Browser Features
• New Table Browser Implementation
  – all.joiner & .as files
  – Overall control and data flow
  – Joining and intersection modules
• Limits and future directions

Table Browser History
• Goal: annotations over a particular region of the genome in text rather than graphic format.
• Krish: did the first successful implementation, separated tables into positional and non-positional, merged chrN_ tables, split off hgFind.
• Angie: added sequence output, filters, intersections, and many help pages.
• These versions of the table browser were called hgText.

Why a New Table Browser
• hgText is powerful, but much of the power is not obvious on the first page.
• In hgText the association between tracks and tables was not clear.
• No way to join fields across related tables.

New Table Browser
• Flip to demoing the new table browser online.
  – Show overall controls.
  – Demo getting genome position, common name, and review status for refSeq on ENCODE.
  – Demo getting alt-splice variants with knownCanonical and knownIsoforms.
  – Demo custom track created from filtered cpgIslands (>= 500 bases, >= 0.9 Exp/Obs).
  – Intersect custom fat cpg track with most conserved, requiring 75% overlap, output as custom track.
  – Intersect conserved fat cpg with exoniphy, requiring <= 5% overlap, output as hyperlink (custom track output crashes!).

New Table Browser Implementation
• Built using:
  – AutoSql .as files to describe table fields
  – all.joiner file to describe table relationships
  – BED-based intersection and sequence output code from the old table browser
  – About 8000 lines of new C code in 19 .c files in src/hg/hgTables

Data Flow
• Each region (piece of a chromosome) is processed separately.
• Filter is turned into a SQL where clause.
• Field-oriented output, especially from selected tables, is handled by one branch of code:
  – SQL rows -> joining routines -> output
• GFF, Custom Track, Sequence, Hyperlink, and Summary Stats outputs are handled by a branch of code that turns things into BED format internally:
  – SQL rows -> BED -> intersecting -> output
• Ultimately need to merge fields & BEDs so that joining and intersecting can happen at the same time.
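To make the "filter becomes a SQL where clause" step concrete, here is a minimal Python sketch (the table browser itself is C; this is not its code). The function name, the BED-style column names chrom/chromStart/chromEnd, and the filter fields length and obsExp are all assumptions for illustration.

```python
def region_where_clause(chrom, start, end, filters=()):
    """Build a parameterized WHERE clause for one region plus user filters.

    Hypothetical helper: column names follow the BED convention, and
    `filters` is a sequence of (field, operator, value) triples.
    """
    allowed_ops = {"=", "!=", "<", "<=", ">", ">="}
    # Half-open interval overlap: the item starts before the region ends
    # and ends after the region starts.
    clauses = ["chrom = %s", "chromStart < %s", "chromEnd > %s"]
    params = [chrom, end, start]
    for field, op, value in filters:
        # Only plain identifiers and known operators are spliced into the SQL.
        if op not in allowed_ops or not field.isidentifier():
            raise ValueError(f"unsupported filter: {field} {op}")
        clauses.append(f"{field} {op} %s")
        params.append(value)
    return " AND ".join(clauses), params


# Example: one region, with a filter like the CpG island demo (field names assumed).
where, params = region_where_clause(
    "chr7", 1000, 2000, [("length", ">=", 500), ("obsExp", ">=", 0.9)]
)
print("WHERE " + where, params)
```

Values travel as query parameters rather than being pasted into the string, which is a choice of this sketch, not something the slides specify.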

Joining Code
• Use all.joiner to find the route from the primary table to the other tables in the join.
• Construct a SQL query for each table that applies table filters and region, and includes key fields even if they are not part of the final output.
• Construct a row object (array of lists) for each row returned on the primary table.
• Construct a hash keyed by the joining field of the primary table, with row objects as values.
• Execute the SQL query for the next table, and when keys match add info to the row object.
• Repeat with third and subsequent tables, if any.
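The hash-based join described above can be sketched in a few lines of Python. This is an illustration only, not the hgTables C code: rows are plain dicts standing in for SQL results, the row object is a dict of lists rather than an array of lists, and every table and field name in the example is made up.

```python
def join_tables(primary_rows, primary_key, other_rows, other_key, other_fields):
    """Attach fields from other_rows to primary rows that share a key value.

    Each row object maps field name -> list of values, so several matches
    in the second table extend a list instead of duplicating the row.
    """
    row_objects = []
    by_key = {}  # join-key value -> row objects carrying that key
    for row in primary_rows:
        obj = {name: [value] for name, value in row.items()}
        for field in other_fields:
            obj[field] = []
        row_objects.append(obj)
        by_key.setdefault(row[primary_key], []).append(obj)
    # One pass over the second table; matching keys add info to row objects.
    for row in other_rows:
        for obj in by_key.get(row[other_key], []):
            for field in other_fields:
                obj[field].append(row[field])
    return row_objects


# Made-up data: one primary row matches two rows of the second table.
genes = [{"name": "geneA", "chrom": "chr1"}, {"name": "geneB", "chrom": "chr2"}]
notes = [
    {"gene": "geneA", "status": "Reviewed"},
    {"gene": "geneA", "status": "Validated"},
]
for obj in join_tables(genes, "name", notes, "gene", ["status"]):
    print(obj)
```

A third table would be handled by building a new hash on whichever field links to it and repeating the second loop.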

Limits/Features of Joining Code
• Unless a filter is applied, non-positional tables will be scanned completely. This takes 3 minutes for gbCdnaInfo. (Hint: add filter type=mRNA.)
• Joining code is only applied to field-oriented output.
• Will handle joins across split tables.
• Can chop off prefixes and suffixes on a key field before joining if specified in all.joiner. (Needed for chopping off the version number in some Ensembl tables, for instance.)
• Avoids combinatorial explosion of output rows by allowing fields to contain lists.
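As a rough illustration of key chopping before a join, the Python sketch below strips an optional prefix and a trailing version suffix from a key. The function and its parameters are hypothetical; the actual all.joiner syntax for requesting this is not shown here, and the identifiers in the example are made up in the Ensembl style.

```python
def chop_key(key, prefix="", suffix_start=None):
    """Remove a leading prefix and everything from the last suffix_start on.

    Hypothetical helper: both sides of a join would normalize their keys
    this way before the keys are compared.
    """
    if prefix and key.startswith(prefix):
        key = key[len(prefix):]
    if suffix_start:
        head, sep, _tail = key.rpartition(suffix_start)
        if sep:  # chop only when the separator actually occurs
            key = head
    return key


# A versioned and an unversioned identifier meet on the same key.
versioned = "ENST00000000001.2"
plain = "ENST00000000001"
print(chop_key(versioned, suffix_start=".") == chop_key(plain, suffix_start="."))
```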

Intersecting Code
• Primarily inherited from hgText.
• Uses hTableInfo (call in hg/lib/hdb.c), which reports which fields in the database store chromosome, start, end, etc.
• Analyses hTableInfo to figure out how many fields are in the corresponding BED structure, and how to query the database and massage output to get a BED.
• Converts the second table in the intersection into a bitmap.
• Counts up the number of bases in the bitmap that intersect each BED item in the first table.
• (For pure bitwise operations, converts the first table to a bitmap too.)
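The bitmap approach above can be sketched as follows in Python. This is a simplified illustration, not the inherited hgText code: it assumes a single chromosome, half-open (start, end) intervals, and one byte per base instead of a packed bit array; the function names are hypothetical.

```python
def overlap_counts(first_items, second_items, chrom_size):
    """Count, for each first-table item, the bases covered by the second table."""
    bitmap = bytearray(chrom_size)  # 0 = uncovered, 1 = covered
    for start, end in second_items:
        # Clamp so slice assignment never resizes the bytearray.
        start, end = max(start, 0), min(end, chrom_size)
        if start < end:
            bitmap[start:end] = b"\x01" * (end - start)
    return [sum(bitmap[max(s, 0):min(e, chrom_size)]) for s, e in first_items]


def filter_by_overlap(first_items, second_items, chrom_size, min_fraction):
    """Keep first-table items whose covered fraction is at least min_fraction."""
    counts = overlap_counts(first_items, second_items, chrom_size)
    return [
        (s, e)
        for (s, e), covered in zip(first_items, counts)
        if e > s and covered / (e - s) >= min_fraction
    ]


# Made-up intervals; keep items with at least 75% of their bases covered.
first = [(0, 10), (20, 30)]
second = [(5, 10), (8, 28)]
print(overlap_counts(first, second, 40))
print(filter_by_overlap(first, second, 40, 0.75))
```

Overlapping second-table items are counted once because they set the same bitmap positions, which is the main reason to use a bitmap rather than summing pairwise overlaps.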

Limits and Features of Intersections
• Not applied to field or MAF output.
• Information is lost in converting to BED.
• Does allow intersection code for sequence, GFF, custom track, BED, statistics, and hyperlink output to go through the same path.

Future Directions
• Make a combined BED/Row structure to bring together intersections and joining.
• Polish sequence output in some places.
• Get .as file info for all tables.
• Encourage people to pay a little more attention to database concerns as well as genome browser concerns when designing tables.
• See if we can phase out split tables by tuning MySQL aggressively.