De novo Motif Finding using ChIP-Seq

De novo Motif Finding using ChIP-Seq
Presenter: Zhizhuo Zhang
Supervisor: Wing-Kin Sung

Outline
• Introduction to ChIP-Seq data
• The impact of ChIP-Seq's properties on motif finding
• Our proposed algorithm (Pomoda)
• Experimental results
• Exploring the center distribution
1/15/2022 2 Copyright 2009 @ Zhang ZhiZhuo

ChIP-Seq Technique

Comparison with ChIP-chip

What does ChIP-Seq mean to us?
[Figure: Sequences -> Motif Finding Tools -> Motif models]
• More data: good news for data mining, but necessary for de novo motif finding
• Higher resolution: the job becomes easier; localization

How large is the data?
The definition of "large data" keeps changing!
• 10 years ago: tens of sequences (promoter sequences: MEME, AlignACE)
• 5 years ago: hundreds of sequences (ChIP-chip: Weeder)
• 2 years ago: thousands of sequences (higher-throughput ChIP-chip: Trawler, Amadeus)
• Now: tens of thousands of sequences (ChIP-Seq: ?)

What does higher resolution mean?
• Finding the main motif (that of the TF targeted by the antibody) becomes an easy job!
• The main motif is strongly over-represented.
• The peak range is only about 50 bp, so simply aligning all the peak regions yields a good motif.
• Our focus may therefore shift from the main TF to the TFs that work with it.

Localization =? Over-Representation
[Figure: two histograms, AR and GATA; x-axis: location bins (1 to 61); y-axis: frequency]

Peak Oriented Motif Discovery
What information about a peak can be helpful?
• Peak intensity
• Peak location
Our targets: not only the main motif, but also the co-motifs sitting around the main motif.

POMODA
• Peak Oriented Motif Discovery Algorithm
[Figure labels: "Centered on ChIP-seq peak of"; "The main motif"; "A co-motif"; "Should be noise as it does not exhibit distance preference to the main motif"]

Motif Modeling
• String motif: smaller search space; enables fast string-matching algorithms
• PWM motif: a more precise approximation of the real motif; statistically sound
(PWM: Position Weight Matrix)

Background Modeling
• Organism-specific background: hard to capture the negative information in the background
• Position-specific background: reveals the biological context and makes the negative information easier to capture

Position-Specific Background
Given the peak position in ChIP-Seq, we can identify not only the active position (center) of the master TF, but also the active region of its co-motif.
[Figure labels: Peak in ChIP-Seq; Background Region; Active Region]

Center Enrichment Score
We do not know the exact size of the active region, and it may vary between motifs. Hence, we define an odds-ratio score based on a dynamic window size.
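
The slide does not give the formula, so the following is only a minimal sketch of one way an odds-ratio score over a dynamic window could look. It assumes motif sites are given as signed offsets from the peak center, compares the observed inside/outside odds with the odds expected if sites were spread uniformly, and tries several window sizes. The function name, the pseudocount and the uniform expectation are all assumptions, not Pomoda's actual definition.

```python
def center_enrichment(offsets, half_span, windows, pseudo=1.0):
    """Return (best_score, best_window) for motif-site offsets from the peak center.

    offsets:   signed distances (bp) of motif sites from the peak center
    half_span: sequences cover [-half_span, +half_span]
    windows:   candidate half-widths of the central "active" window
    pseudo:    pseudocount (an assumption of this sketch) to avoid division by zero
    """
    best_score, best_window = float("-inf"), None
    for w in windows:
        if not 0 < w < half_span:
            continue  # the window must leave some background region
        inside = sum(1 for d in offsets if abs(d) <= w)
        outside = len(offsets) - inside
        # Odds of a site falling inside the window: observed vs. expected
        # if sites were spread uniformly over the whole span.
        observed_odds = (inside + pseudo) / (outside + pseudo)
        expected_odds = w / (half_span - w)
        score = observed_odds / expected_odds
        if score > best_score:
            best_score, best_window = score, w
    return best_score, best_window


# Toy data: most sites sit close to the peak center.
toy_offsets = [0, 5, -10, 20, -15, 900, -1200, 30, 2500, -40]
score, window = center_enrichment(toy_offsets, 5000, [50, 100, 500, 1000])
```

In this toy case the narrowest window wins because every near-center site already falls within 50 bp; a motif with a broader active region would select a larger window.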

Algorithm Overview
1. Seed finding
2. PWM extending & refinement
3. Redundant motif filtering

Seed Finding
Enumerate all length-6 patterns:
GGTCAC CGGTCA GGGTCA AGGTCA … AACTTG … ATGACC CAGGTCG CGTGAC CTGACC
[Table: PWM built from a seed; columns are positions 1 to 6, rows are A, C, G, T; entries are 0.97 or 0.01]
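
As a rough illustration of this step, the sketch below enumerates all length-6 patterns and turns one seed into a PWM using the two values visible on the slide (0.97 and 0.01). The function names are hypothetical, and how seeds are scored and ranked is not shown on the slide, so it is left out.

```python
from itertools import product

BASES = "ACGT"


def all_kmers(k=6):
    """Enumerate every string of length k over A, C, G, T (4**k patterns)."""
    return ("".join(p) for p in product(BASES, repeat=k))


def seed_to_pwm(seed, hit=0.97, miss=0.01):
    """One column per seed position: `hit` for the seed base, `miss` otherwise."""
    return [{b: (hit if b == s else miss) for b in BASES} for s in seed]


# PWM for one of the seeds listed on the slide.
pwm = seed_to_pwm("GGTCAC")
```

Each column sums to 1 (0.97 + 3 × 0.01), so the result is a valid probability matrix.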

PWM Extending & Refinement
Encapsulate the core PWM into a wide PWM. For example, we implant the length-6 PWM into a length-26 PWM, as follows:
[Table: 26-column PWM; columns are positions 1 to 26, rows are A, C, G, T; the core PWM (entries 0.97 or 0.01) occupies positions 11 to 16, and every flanking position is 0.25 for all four bases]
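
The embedding described above can be sketched as follows, assuming the core is centered and the flanks are uniform, which matches the length-6 into length-26 example (10 flanking columns on each side). The function name `embed_core` is hypothetical.

```python
BASES = "ACGT"


def embed_core(core, width=26):
    """Place `core` (a list of per-position base -> probability dicts) in the
    middle of a `width`-column PWM whose flanking columns are uniform (0.25)."""
    flank_total = width - len(core)
    if flank_total < 0 or flank_total % 2:
        raise ValueError("width minus core length must be a non-negative even number")
    flank = flank_total // 2
    uniform = {b: 0.25 for b in BASES}
    return ([dict(uniform) for _ in range(flank)]
            + [dict(col) for col in core]
            + [dict(uniform) for _ in range(flank)])


# Core PWM for the seed GGTCAC, using the 0.97 / 0.01 values from the slides.
core = [{b: (0.97 if b == s else 0.01) for b in BASES} for s in "GGTCAC"]
wide = embed_core(core)
```

With zero-based indexing, the core lands in columns 10 to 15, that is, positions 11 to 16 of the wide PWM.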

PWM Extending & Refinement
Background instances:
A…A…GGTCA…C…C
T…G…GGTCA…A…G
G…A…GGTCA…T…T
T…G…GGTCA…G…G
……
C…T…GGTCA…T…A
Center instances:
A…A…GGTCA…C…C
T…G…GGTCA…C…G
……
C…T…GGTCA…C…A
Select the best column to update based on the center PWM and the background PWM.
-> GGTCANNNNC
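
The slide does not state how the "best" column is chosen. The sketch below assumes one plausible criterion: among the columns not yet fixed, pick the one whose base frequencies in the center instances differ most from those in the background instances, measured by KL divergence. Both the criterion and the function names are assumptions, not Pomoda's documented rule.

```python
import math

BASES = "ACGT"


def column_freqs(instances, col, pseudo=1.0):
    """Base frequencies at one column of aligned instances, with pseudocounts."""
    counts = {b: pseudo for b in BASES}
    for seq in instances:
        if seq[col] in counts:
            counts[seq[col]] += 1
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def best_column(center, background, fixed_cols):
    """Pick the not-yet-fixed column where center and background differ most.

    The difference is measured by KL divergence, which is an assumption of
    this sketch and not necessarily the criterion used by Pomoda.
    """
    best_col, best_kl = None, -1.0
    for col in range(len(center[0])):
        if col in fixed_cols:
            continue
        p = column_freqs(center, col)
        q = column_freqs(background, col)
        kl = sum(p[b] * math.log(p[b] / q[b]) for b in BASES)
        if kl > best_kl:
            best_col, best_kl = col, kl
    return best_col, best_kl


# Toy aligned instances: core GGTCA in columns 0-4; only the center
# instances share a conserved C in the last column (as in GGTCANNNNC).
center = ["GGTCAACGTC", "GGTCATGCAC", "GGTCACATGC", "GGTCAGTACC"]
background = ["GGTCAACGTA", "GGTCATGCAG", "GGTCACATGT", "GGTCAGTACC"]
col, kl = best_column(center, background, fixed_cols={0, 1, 2, 3, 4})
```

In the toy data the gap columns look the same in both sets, so the last column is the one selected for updating.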

Redundant Motif Filtering
1. Positions overlap by more than 5%
2. PWM divergence less than 0.18
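
A minimal sketch of such a filter is given below, using the two thresholds from the slide. Several details are assumptions: that both conditions must hold for a motif to count as redundant, that overlap is measured on sets of (sequence id, position) sites relative to the smaller set, and that divergence is the mean per-column Euclidean distance between equal-length PWMs. The slides do not define these details.

```python
import math


def pwm_distance(pwm_a, pwm_b):
    """Mean per-column Euclidean distance between two equal-length PWMs
    (a stand-in for "PWM divergence"; the exact definition is an assumption)."""
    if len(pwm_a) != len(pwm_b):
        raise ValueError("this sketch only compares PWMs of equal length")
    total = 0.0
    for col_a, col_b in zip(pwm_a, pwm_b):
        total += math.sqrt(sum((col_a[b] - col_b[b]) ** 2 for b in "ACGT"))
    return total / len(pwm_a)


def is_redundant(sites_a, sites_b, pwm_a, pwm_b,
                 min_overlap=0.05, max_divergence=0.18):
    """sites_* are sets of (sequence_id, position) pairs for each motif."""
    if not sites_a or not sites_b:
        return False
    shared = len(sites_a & sites_b)
    overlap = shared / min(len(sites_a), len(sites_b))
    return overlap > min_overlap and pwm_distance(pwm_a, pwm_b) < max_divergence


def simple_pwm(seed):
    """Helper for the example: 0.97 for the seed base, 0.01 otherwise."""
    return [{b: (0.97 if b == s else 0.01) for b in "ACGT"} for s in seed]


sites_1 = {("seq1", 100), ("seq2", 250), ("seq3", 40)}
sites_2 = {("seq1", 100), ("seq4", 75)}
same = is_redundant(sites_1, sites_2, simple_pwm("GGTCAC"), simple_pwm("GGTCAC"))
different = is_redundant(sites_1, sites_2, simple_pwm("GGTCAC"), simple_pwm("AACTTG"))
```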

Results – Comparison
1. Datasets:
   1. MCF7 dataset (ER), 4361 sequences
   2. LNCaP dataset (AR), 10000 sequences
2. Evaluation: "PWM divergence" against TRANSFAC motifs, as in Harbison et al. (2004) and Amadeus (2008)
3. Input: +/- 5000 bases from the peak for Pomoda, and +/- 200 bases from the peak for the other algorithms
4. Each motif finder reports its top 20 results

[Table: motifs found per cell line by Pomoda, Amadeus, Trawler and Weeder; columns: Cell, TF, Pomoda, Amadeus, Trawler, Weeder; legend: <0.12, <0.18, <0.24]
MCF7 (ER): HNF3, GATA, AP1, SP1, BACH1, E2F, OCT1, AP4
LNCaP (AR): HNF3, NF1, GATA, OCT, ETS

Comparison
(Columns, in order: Pomoda | Amadeus | Trawler | Weeder. A single value may span several adjacent columns.)
Background model: Position Specified | Organism Specified
Motif model: PWM (k-mer exact match) | PWM (k-mer with mismatches) | PWM (IUPAC string in initial scan) | k-mer with mismatches
Algorithm: Exhaustive search + PWM column updating | Add mismatches, Merge (recursively), EM | Exhaustive search + clustering | Exhaustive search
Motif length: Various length | Fixed length | Semi-various length
Gap detection: Supported | Not supported
Localization: Center window size | Over-represented bins | Not supported
Sequence weighting: Supported | Not supported
Average running time: 30 min | 93 min | >4 hours | >4 hours

Center Distribution
[Figure: Foxa1 histogram; x-axis from -1900 to 1900; y-axis from 0 to 1600]
Mixture Model:

Thank You!