Ch IPseq Data Analysis Ch IPseq overview DNA

Ch. IP-seq overview DNA + bound protein Fragment DNA Sequence Map sequence tags to

Ch. IP-seq big picture • Combine high-throughput sequencing with Chromatin Immuno-precipitation to identify specific

Ch. IP-seq Workflow Confirm Ch. IP Prepare library Submit for Sequencing Get Raw sequence

Initial quality control measures Fast. QC as for the RNASeq data

A note on using duplication levels to estimate your library size (complexity) Assuming you

Read mapping - Bowtie Algorithm (Burrows-Wheeler Transformation) Provides an identifier to any sequence, allowing

How do peak-finders map binding sites? • Fragments contain the TF binding site at

IGV visualization of MACS results using the Fox. A 1 data set

IGV visualization of MACS results using the University of Washington H 3 K 4

IGV visualization of MACS results using the Broad Institute H 3 K 36 me

MACS’ "shiftsize” model You can find info on the estimated parameters in your. macsinfo

MACS model file This a representative result from running MACS #2 Build Peak Model.

Peaks & negative peaks Keep scrolling down your. macsinfo file until you find… …

Troubleshooting MACs Briefly… MACS can’t build a model: - Adjust the mfold values (the

Toubleshooting MACs… Be on the lookout for MACS building a model from short-separation noise

Downstream analysis • Find the nearest gene to each peak • Check distribution relative

Overlaps in Galaxy Operate On Genomic Intervals-> Intersect This lets you create a new.

Comparing peaks from multiple samples (Bardet et al. 2011)

During genome-wide peak calling, only the best peaks pass the stringent thresholds required for

How can we tell whether overlaps are significantly greater than chance? Assigning a p-value?

Resampling statistic or binomial tests for overlaps Let’s say we have peak regions from

Ch. IP-seq for histone modifications Method: • Mouse ES cells vs ES-derived primary neural

The ENCODE Project Dozens of labs did Ch. IP-seq, under rigorous quality guidelines, for

Example of Ch. IP-seq data tracks on UCSC browser Peak calls: regions of significant

Downloading data from UCSC browser Try zooming in (you can go all the way

Introduction to Galaxy Tools Galaxy is a web platform providing a lot of basic

What about data that’s not on the UCSC Browser? The ENCODE project was UNUSUALLY

What if the data I’m interested in isn’t in GEO? Authors are almost always

How many reads do I need? Minimum for Ch. IP-seq of a transcription factor

Slides: 35

Download presentation

Ch. IP-seq Data Analysis

Ch. IP-seq overview DNA + bound protein Fragment DNA Sequence Map sequence tags to genome & identify peaks Adapted from slide set by: Stuart M. Brown, Ph. D. , Center for Health Informatics & Bioinformatics, NYU School of Medicine Prepare sequencing library Immunoprecipitate Release DNA

Ch. IP-seq big picture • Combine high-throughput sequencing with Chromatin Immuno-precipitation to identify specific protein-DNA interactions genome-wide, including those of: – – – Transcription factors Histones (various types and modifications) RNA Polymerase (survey of transcription) DNA polymerase (investigate DNA replication) DNA repair enzymes • … or fragments of DNA that are modified (e. g. methylated)

Ch. IP-seq Workflow Confirm Ch. IP Prepare library Submit for Sequencing Get Raw sequence data & do QC Map sequence reads to genome Identify Ch. IP peaks over input background Downstream bioinformatic analyses

Initial quality control measures Fast. QC as for the RNASeq data

A note on using duplication levels to estimate your library size (complexity) Assuming you have 100 initial fragments in your library (before amplification) & which fragment gets read is random: # reads : 25 50 75 100 150 200 # unique reads: 23 37 52 63 78 87 % duplicated: 8% 26% 31% 43% 55% 69% x-more left in lib: 4. 3 2. 7 1. 9 1. 6 1. 3 1. 15 x-more than prev: 1. 6 1. 4 1. 24 1. 11 Given 9% duplicates, an additional sequencing run of the same size (from the same library) will give you 1. 6 x more unique reads. Two additional runs will give you 2. 2 x more (1. 6*1. 4). …but if you have a high % duplicates (e. g. 43%) adding one more lane will only give you 1. 37 x more unique reads than you had initially. Depending on sequencing depth, this could indicate that your library has low complexity – either because too few fragments from your Ch. IP survived to the library amplification step, or because the protein binds few sites. Alternatively: sub-sample your data and check saturation of peak calling

Read mapping - Bowtie Algorithm (Burrows-Wheeler Transformation) Provides an identifier to any sequence, allowing fast lookup of all its genomic positions in an indexed genome file (ebw file). Avoid having to search genome for matches each time (like blast would do).

How do peak-finders map binding sites? • Fragments contain the TF binding site at a (mostly) random position within them. • Reads are (randomly) from left or right edges (sense or antisense) of fragments. • Thus peak for sense tags will be 1/2 the fragment length upstream… • Binding site position = mid-way between sense tag peak & antisense tag peak. • To get binding site peak, shift sense downstream by ½ fragsize & antisense upstream by ½ fragsize. • Adapted from slide set by: Stuart M. Brown, Ph. D. , Center for Health Informatics & Bioinformatics, NYU School of Medicine & from Jothi, et al. Genome-wide identification of in vivo protein–DNA binding sites from Ch. IP-Seq data. NAR (2008), 36: 5221 -31

MACS procedure

IGV visualization of MACS results using the Fox. A 1 data set

IGV visualization of MACS results using the University of Washington H 3 K 4 me 3 data set.

IGV visualization of MACS results using the Broad Institute H 3 K 36 me 3 data set.

MACS’ "shiftsize” model You can find info on the estimated parameters in your. macsinfo file… INFO @ Sun, 10 Feb 2013 21: 27: 51: #2 Build Peak Model. . . INFO @ Sun, 10 Feb 2013 21: 27: 51: #2 number of paired peaks: 0 WARNING @ Sun, 10 Feb 2013 21: 27: 51: Too few paired peaks (0) so I can not build the model! Broader your MFOLD range parameter may erase this error. If it still can't build the model, please use --nomodel and --shiftsize 100 instead. WARNING @ Sun, 10 Feb 2013 21: 27: 51: Process for pairing-model is terminated! WARNING @ Sun, 10 Feb 2013 21: 27: 51: #2 Skipped. . . WARNING @ Sun, 10 Feb 2013 21: 27: 51: #2 Use 100 as shiftsize, 200 as fragment length Here MACs tried to estimate the “shift size” for moving sense & antisense reads to get a final peak position, by identifying sets of strong + & - strand peaks at a certain distance from each other. There was not enough info on chromosome 19 to do this, so it made a guess that the fragment size was 200 & shiftsize was 100. 200 is close enough to the actual fragment size of ~150 bp that we can go with this.

MACS model file This a representative result from running MACS #2 Build Peak Model. . . #2 number of paired peaks: 683 Fewer paired peaks (683) than 1000! Model may not be build well! Lower your MFOLD parameter may erase this warning. Now I will use 683 pairs to build model! finished! predicted fragment length is 125 bps Generate R script for model : Li. E_IP_v_INPUT_11_2012_dup 1_model. r Call peaks. . . use control data to filter peak candidates. . . Finally, 9504 peaks are called! find negative peaks by swapping treat and control Finally, 337 peaks are called! To generate this file you will need to go into R, and enter: Source(“MACS_output_file. r”), which will generate a. pdf

Peaks & negative peaks Keep scrolling down your. macsinfo file until you find… … INFO @ Sun, 10 Feb 2013 21: 36: 47: #3 Finally, 364 peaks are called! INFO @ Sun, 10 Feb 2013 21: 36: 47: #3 find negative peaks by swapping treat and control INFO @ Sun, 10 Feb 2013 21: 36: 52: #3 Finally, 364 peaks are called! INFO @ Sun, 10 Feb 2013 21: 36: 52: #4 Write output… This is the pay-off, where MACS identifies your ER alpha peak locations! 364 peaks on chromosome 19 (which is ~1/50 th of the genome) suggests ~20, 000 peaks for the whole genome, which is not bad! Equally critical, MACS now swaps treat & control (pretending your INPUT data is your IP & your Ch. IP data is your input) and looks again for peaks. The number of “negative” peaks found in this way should be far less than the positive peaks, and the 10: 1 ratio here is fine.

Troubleshooting MACs Briefly… MACS can’t build a model: - Adjust the mfold values (the fold over background ranges MACs considers for paired peaks) - Tell MACs to not build a model, but instead use the shiftsize you specify. Peaks/Negative Peaks ratio is poor or too few peaks are detected: - Adjust model settings to see if you can improve both. Otherwise, you may have to conclude that 1) your library was no good or 2) the factor just doesn’t bind to many places in the genome.

Toubleshooting MACs… Be on the lookout for MACS building a model from short-separation noise peaks (that may arise from sonication sensitive breakpoints or other things unrelated to your protein binding). To avoid this, you can decrease the maximum “mfold” so that these strong irrelevant peaks are ignored when the model is built.

Downstream analysis

Downstream analysis • Find the nearest gene to each peak • Check distribution relative to gene features (start site, exon, intron, upstream/downstream) • Find overrepresented motifs in peak region (TFBS binding sites of our factor + possible co-binders) – kmers/logos • Check if peaks are clustered or co-occur with other binding events • Sequence conservation (or conservation of binding event, if data is available) • Gene set functional analysis • …

Overlaps in Galaxy Operate On Genomic Intervals-> Intersect This lets you create a new. bed file which has only the regions that intersect between two datasets. Overlapping Intervals: (saves complete intervals from file 1 that overlap anything in file 2) Overlapping Pieces of Intervals: (saves only the regions shared between 1 & 2)

Additional notes on overlaps

Comparing peaks from multiple samples (Bardet et al. 2011)

During genome-wide peak calling, only the best peaks pass the stringent thresholds required for low false discovery rates (FDRs) due to the correction for multiple testing (orange lines). Regions may show substantial tag enrichment, yet are not called as peaks (green line). When comparing peaks across conditions, we advocate using 'significant enrichment' (not multiple testing–corrected) as the measure to assess whether a peak is shared across conditions or is truly condition-specific. Merely intersecting peaks called at each condition would miss conserved peaks (e. g. , middle examples).

How can we tell whether overlaps are significantly greater than chance? Assigning a p-value? We have 3 pieces of count frequency information: The number of overlaps, the number of regions compared, and we can generate the expected background frequency through shuffling. This type of data is like coin tosses & is ideally suited for a binomial test, which uses “number of matches”, “number of tests” and “expected background frequency” to calculate p. values. If you flip a coin, say 10 times and it comes up heads 6 out of 10 (frequency 0. 6 vs. expected 0. 5), that would not seem unlikely – and a binomial test would tell you this. p. =. 74 1 5 6 10 p. =2 e-10 However if you flip a coin 1000 times & get heads 600 out of 1000, that would seem a bit odd, and the binomial test would indicate this by saying that the probability of the null hypothesis (that the frequency of heads is 0. 5) is low. 100 500 600 1000

Resampling statistic or binomial tests for overlaps Let’s say we have peak regions from two samples (7000 and 8260), and the number of overlaps is 1653. We can estimate the background chance by randomly placing the regions in the first dataset in new locations, and then count the number of overlaps. We would repeat this procedure 100 -1000 times. Then, we can either ask: how many times, out of the 1000 repeated procedures, was the number of overlaps greater than that observed for the real dataset? This is our p-value (3/1000 -> p=0. 003. 0 -> p<0. 001) Alternatively, we can take the mean of the number of overlaps, and use this in a binomial test. For example, let’s say the mean number of overlaps in the shuffled sets was 95. 11 Run a binomial test in R by typing: binom. test(1653, 8260, 95. 11/8260) The p. value is <2 e-16. Very low. So, yes, the binding sites in the two samples overlapped more than expected by chance… but binding events are still to ~80% different places between these two samples. .

Data in the UCSC browser

Ch. IP-seq for histone modifications Method: • Mouse ES cells vs ES-derived primary neural progenitor cells (NPCs). • Prepare chromatin & Ch. IP with antibodies to specific histone modifications. • H 3 K 4 methylation marks active genes, H 3 K 27 marks repressed genes, both marks together in ES cells mark “poised” genes that will become activated in certain developmental lineages. Meissner et al. 2008, Nature 454: 766.

The ENCODE Project Dozens of labs did Ch. IP-seq, under rigorous quality guidelines, for over 100 transcription factors and histone modifications, plus related assays for DNA methylation, chromatin accessibility etc. Major paper (many others provide additional details): Encode Project Consortium (over 100 authors) An integrated encyclopedia of DNA elements in the human genome. Nature. 2012 Sep 6; 489(7414): 57 -74. doi: 10. 1038/nature 11247. Some ways to access this data: Nature. com/encode (Nature’s summary & links to all related papers) factorbook. org (a way to explore the data in a wiki format) UCSC genome browser (hands on examples next) sra (short read archive, repository for raw data, more on this later!)

Sample of Encode Data

Example of Ch. IP-seq data tracks on UCSC browser Peak calls: regions of significant enrichment over background Processed read density, read as # of reads overlapping a given BP position, data (used to make peak calls) H 3 K 4 me 3 & H 3 K 27 acet are marks of active promoters (e. g. MTRF 1 L) H 3 K 27 me 2 is a mark of repressed promoters Genes in ESCs that are required for differentiation are often “poised” and bear K 4 me 3 & K 27 me 3 “bivalent mark” (e. g. SYNE 1) Differentiation resolves bivalent mark to all activating marks (Osteoblasts) or all repressive marks (Hep. G 2)

Downloading data from UCSC browser Try zooming in (you can go all the way to base pair resolution) Want to learn more about a gene? Control click on it’s ideogram & select “open details page in new window” What if you want to use this data somewhere else? Select Tools->Table Browser Select Group: Regulation, Track: Broad Histone Table: H 1 -h. ESC H 3 K 4 me 3 … Pk (for the peaks data, the signal file will be huge) In output format, select “all fields from selected table” --Note that you could have selected “sequence”… if you had you’d get the actual DNA sequence for each one of these peaks. We’ll use this later. Check “Galaxy” next to send output to: --Note that you could have selected send to file, we’ll use this later as well. Click “Get output” & then click “send query to galaxy”

Introduction to Galaxy Tools Galaxy is a web platform providing a lot of basic tools for manipulating genomics data. On the right are input & analysis options. On the left is your history of uploaded files & analyses. You’ll have one item in process, which will finish soon & turn green. Click on the title to get a sample of what the data looks like. Click on the eye to see the data in the central panel. Each entry has a chromosome#, BP for start & BP for end & some other values (signal value=enrichment over background, p. value=-log(base 10)of p. value, so, for p=. 0001 this would be 4) Click on the pencil to look at and edit the name & other attributes of any item. We’ll look more at Galaxy tools later…

What about data that’s not on the UCSC Browser? The ENCODE project was UNUSUALLY considerate when compared to most other researchers who generate genomics data. Even though ENCODE is huge, it’s probably <10% of published NGS data. To publish, researchers must make their data accessible, but they will very rarely provide a link to a UCSC browser track. If you’re lucky, they will have put processed data up somewhere: generally on GEO… The GENE EXPRESSION OMNIBUS (GEO) Key repository for microarray and genomics data. Open a new browser tab (ctrl-T) & go to: http: //www. ncbi. nlm. nih. gov/geo/ Search for “encode h 3 k 4 me 3 h 1 -hesc”. You’ll see several entries. The first few are larger datasets that include this specific data. The one at the bottom is just the data for this track. First note the “accession number GSM 733657” - often publications will give this accession number, providing an easy way to get directly to the right place. ---> Now, click on the title for this entry “Bernstein…” Scroll down the next page: There’s lots of info about this experiment with links for more information. At the bottom are the processed data files: They’ve been nice & offer us a “BROADPEAK” file (the same as what we just uploaded to Galaxy), a “BAM” file for each experimental replicate (which has the genomic coordinates for each read), and a “BIGWIG” file (the filetype for the “signal” track on the UCSC browser)

What if the data I’m interested in isn’t in GEO? Authors are almost always required to make their NGS data accessible in order to publish…. …but they’re often not required to make it easy! Many times the only thing that’s available is the raw data, stored in… The Sequence Read (SRA) - Repository for Open a new browser tab (ctrl-T) & go to: raw NGS data. http: //www. ncbi. nlm. nih. gov/sra/ Search again for “encode h 3 k 4 me 3 h 1 -hesc”. A single record will be called up… Clicking on link 1 or 2 under “run” will give you information about that particular biological replicate sample. Clicking on the link 1 or 2 under ‘size’, will take you to a page with a single linked file with a “. sra” extension. Note how big this file is… just a few of these would rapidly fill up a PC hard drive! So what is a. sra file & what can you do with it? Don’t even try to download & open it… besides being huge, it’s not even normal text. In the next few lectures we’ll find out how to handle this sort of raw data.

How many reads do I need? Minimum for Ch. IP-seq of a transcription factor with < ~30, 000 binding sites in a mammalian genome: • 2 replicates per condition • 20+ million reads per sample (>40 M per condition, proportionately less for smaller genomes & fewer binding peaks) • One Hi. Seq lane gives ~150 million reads …& can multiplex ~4 samples (2 exp + 2 input / lane) • Single end 50 bp reads almost always good (unlike RNA-seq where longer and/or paired end reads are required for many downstream questions). For some applications need many more reads (e. g. mapping nucleosome positions need >400 M). Make your best estimate. If you have too few you can re-sequence the samples or additional samples. Reads from all runs can be pooled in the end.