Detecting short indels and complex structural variations from next-gen sequence data
Kai Ye, Leiden University Medical Center
k.ye@lumc.nl

Start Pindel
• Copy Pindel.tar.gz to your local folder
• tar xvfz Pindel.tar.gz
• ./run

Types of sequencing
• Without a reference genome: new species
• With a reference genome: re-sequencing
  ✓ Common variants in populations: SNPs, indels and structural variants
  ✓ All (somatic) variants for subjects with extreme phenotypes, such as cancer
    a) Normal/tumor paired genomes from the same patient
    b) Exome sequencing of a group of cancer patients

Sequence for common variants in populations
Nature, 2010; Nature, 2011

Precision of breakpoint coordinates for large deletions (> 50 bp)
Figure labels: LUMC, Sanger Institute, Yale Univ.

Normal/tumor paired genome sequencing
• Design:
  – Tumor: malignant melanoma, 40x, 75 nt PE
  – Normal: lymphoblastoid, 32x, 75 nt PE
• Result:
  – 33,345 somatic base substitutions
  – 66 somatic indels
  – 37 SVs
Nature, 2010

The catalog of somatic mutations

Exome sequencing of a group of cancer patients
• PBRM1 carried truncating mutations in 41% (92/227) of cases.
• My tool, Pindel, was used to detect somatic indels.
Nature, 2011

From sequence data to variants
• Samples
• Sequencing, processing data and variant calling
• Biological discovery

Paired-end reads in next-generation sequencing
Figure label: ~ insert size

Substitution
ATCCGTATCACGGTCACCAGATCAGTC CAGT
ATCCGTATCACGGTCAGCAGATCAGTC CAGT

Gapped alignment
ATCCGTATCACGGTCACCAGATCAGTC CAGT
ATCCGTATCACGGTCACAGATCAGTCCAGT
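A minimal sketch of what a gapped alignment has to recover, assuming the two sequences on this slide are a reference and a read, and that the stray space inside the first one is a layout artifact. The function name `single_gap` is hypothetical; it matches the read against the reference from both ends and reports whatever reference bases are left uncovered in the middle.

```python
def single_gap(ref: str, read: str):
    """Return (position, deleted_bases) if read equals ref minus one
    contiguous block of bases, otherwise None."""
    if len(read) >= len(ref):
        return None
    # Extend the match from the left end as far as possible.
    prefix = 0
    while prefix < len(read) and ref[prefix] == read[prefix]:
        prefix += 1
    # Extend from the right end, without re-using bases claimed by the prefix.
    max_suffix = len(read) - prefix
    suffix = 0
    while suffix < max_suffix and ref[-1 - suffix] == read[-1 - suffix]:
        suffix += 1
    if prefix + suffix != len(read):
        return None  # more than one difference; not a single clean gap
    return prefix, ref[prefix:len(ref) - suffix]


# Sequences copied from the slide, with the space removed (assumption).
ref = "ATCCGTATCACGGTCACCAGATCAGTCCAGT"
read = "ATCCGTATCACGGTCACAGATCAGTCCAGT"
print(single_gap(ref, read))
```

Because the deleted base sits in a run of two identical bases, the reported position is only one of two equivalent placements; this sketch reports the rightmost one.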

CNV (gain)

CNV (loss)

Read-depth

Read-pair approach
Figure panels, each showing sample vs. reference: no indel, deletion, insertion
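The three panels can be summarised in a few lines of code. This is a sketch under stated assumptions, not Pindel's method: the function name, the use of a mean and standard deviation for the library insert size, and the 3-SD cutoff are all illustrative choices.

```python
def classify_pair(mapped_distance: float, mean_insert: float,
                  sd_insert: float, n_sd: float = 3.0) -> str:
    """Classify one read pair by how far apart its ends map on the reference.

    Ends mapping further apart than expected suggest the sample lacks
    sequence that the reference has (deletion); ends mapping closer
    together suggest extra sequence in the sample (insertion).
    """
    if mapped_distance > mean_insert + n_sd * sd_insert:
        return "deletion"
    if mapped_distance < mean_insert - n_sd * sd_insert:
        return "insertion"
    return "no indel"


for distance in (200, 500, 120):
    print(distance, classify_pair(distance, mean_insert=200, sd_insert=10))
```

Note that this signal only says an event lies somewhere between the two reads; it does not give the breakpoint to base-pair resolution.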

Detecting large indels with pattern growth (Illumina 36 bp paired-end reads)
The exact break points on the reference
• Deletions
• Short insertions
• Long insertions
The fragment removed or inserted

Mapping paired-end reads
• SNPs
• Small INDELs
• SV
Figure label: SNP or small indel

Deletions (1 base - 1 million bases)
Figure labels: test (read from the sample), ref, Anchor
• Anchor: the mapped read of a pair, aligned to ref
• First search window for the unmapped (test) read: 2 x average distance from the anchor
• Second search window: expected maximum deletion size + read length (36)
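The two search windows can be sketched as a split-read search. This is a simplification: it uses plain prefix extension and `str.find` rather than Pindel's pattern growth, and `find_deletion`, `common_prefix_len`, `min_len` and the toy sequences are all hypothetical names and values chosen for illustration.

```python
def common_prefix_len(read: str, ref: str, pos: int) -> int:
    """Number of leading read bases that match ref starting at pos."""
    n = 0
    while n < len(read) and pos + n < len(ref) and read[n] == ref[pos + n]:
        n += 1
    return n


def find_deletion(ref, read, anchor_end, avg_distance, max_del, min_len=5):
    """Return (bp_left, bp_right) so that ref[bp_left:bp_right] is deleted."""
    # Window 1: 2 x average distance downstream of the anchor.
    window_end = min(len(ref), anchor_end + 2 * avg_distance)
    for pos in range(anchor_end, window_end):
        left = common_prefix_len(read, ref, pos)
        if left == len(read):
            return None  # the read maps here without a gap
        if left < min_len or len(read) - left < min_len:
            continue
        bp_left = pos + left
        # Window 2: expected maximum deletion size + read length.
        search_end = min(len(ref), bp_left + max_del + len(read))
        bp_right = ref.find(read[left:], bp_left + 1, search_end)
        if bp_right != -1:
            return bp_left, bp_right
    return None


# Toy reference: left flank + deleted block + right flank.
ref = "ACGTTGCAAGGCTTAC" + "GATTACAGATTACA" + "CCTAGGTCATCGATGC"
read = "CAAGGCTTAC" + "CCTAGGTCAT"  # spans the deletion
bp = find_deletion(ref, read, anchor_end=0, avg_distance=20, max_del=100)
print(bp, ref[bp[0]:bp[1]])
```

Restricting both searches to small windows near the anchor is what keeps the search cheap compared with aligning the split read against the whole genome.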

Short insertions
Figure labels: test (read from the sample), ref, Anchor
• Anchor: the mapped read of a pair, aligned to ref
• First search window for the unmapped (test) read: 2 x average distance from the anchor
• Second search window: read length - 1
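A comparable sketch for short insertions, again a simplification rather than Pindel's pattern-growth implementation. Here the two matching parts of the read must sit next to each other on the reference, so the second match always lies within read length - 1 bases of the first. The names `find_short_insertion`, `common_prefix_len` and `min_len`, and the toy sequences, are hypothetical.

```python
def common_prefix_len(read: str, ref: str, pos: int) -> int:
    """Number of leading read bases that match ref starting at pos."""
    n = 0
    while n < len(read) and pos + n < len(ref) and read[n] == ref[pos + n]:
        n += 1
    return n


def find_short_insertion(ref, read, window_start, window_end, min_len=5):
    """Return (breakpoint, inserted_bases) or None.

    The read is split as prefix + inserted + suffix, where the prefix ends
    at the breakpoint on ref and the suffix starts right after it.
    """
    for pos in range(window_start, min(window_end, len(ref))):
        longest = common_prefix_len(read, ref, pos)
        if longest == len(read):
            return None  # the read maps here without a gap
        # Try shorter prefixes too, in case inserted bases happen to
        # match the reference and over-extended the prefix.
        for left in range(longest, min_len - 1, -1):
            bp = pos + left
            for s in range(len(read) - left - 1, min_len - 1, -1):
                if ref.startswith(read[len(read) - s:], bp):
                    return bp, read[left:len(read) - s]
    return None


ref = "ACGTTGCAAGGCTTAC" + "CCTAGGTCATCGATGC"
read = "CAAGGCTTAC" + "GTG" + "CCTAGGTCAT"  # 3 inserted bases
print(find_short_insertion(ref, read, window_start=0, window_end=40))
```

Because the inserted bases must fit inside one read together with both flanks, only insertions shorter than the read can be reconstructed this way.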

Input format of Pindel
@9113
TGGGGACCGGTGGAATGCTTCCACTGGGGGGC
+ chr2 41149518 50 Tumor
Figure labels: ref, Anchor
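A small parser for records shaped like the one on this slide. The slide does not label the fields, so the field names below are assumptions: the record is read as three lines (read name, sequence of the unmapped read, then anchor strand, chromosome, position, mapping quality and sample label). `PindelRead` and `parse_pindel_reads` are hypothetical names, not part of Pindel.

```python
from typing import Iterable, Iterator, NamedTuple


class PindelRead(NamedTuple):
    name: str
    sequence: str
    anchor_strand: str   # assumed: strand of the mapped mate
    anchor_chrom: str
    anchor_pos: int
    anchor_mapq: int     # assumed: mapping quality of the anchor
    sample: str


def parse_pindel_reads(lines: Iterable[str]) -> Iterator[PindelRead]:
    """Yield one PindelRead per three-line record, skipping blank lines."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    for i in range(0, len(rows) - 2, 3):
        name, seq, anchor = rows[i], rows[i + 1], rows[i + 2]
        if not name.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got {name!r}")
        strand, chrom, pos, mapq, sample = anchor.split()
        yield PindelRead(name[1:], seq, strand, chrom, int(pos), int(mapq), sample)


record = """@9113
TGGGGACCGGTGGAATGCTTCCACTGGGGGGC
+ chr2 41149518 50 Tumor
"""
for read in parse_pindel_reads(record.splitlines()):
    print(read)
```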

Output format 1 base - 1 million bases

Breakpoint shift
ATCCGTATCACGGTCACCAGATCAGTC CAGT
ATCCGTATCACGGTCAGCAGATCAGTC CAGT
ATCCGTATCACGGAAATGCTGTCAGTC CAGT
ATCCGTATCACGGAATGCTGTCAGTCCAGT
ATCCGTATCACGGATATGCAAGCAGTC CAGT
Animation steps: ATCCGTATCACGG-- , ATCCGTATCACGGA-- , ATCCGTATCACGGAT--
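The ambiguity shown on this slide can be computed directly: when a deletion falls inside a repeat, sliding it left or right gives the same sample sequence. The sketch below assumes the sequences on the slide are references with the stray space removed; `breakpoint_range` is a hypothetical helper, not a Pindel function.

```python
def breakpoint_range(ref: str, start: int, length: int):
    """Return (leftmost, rightmost) start positions at which deleting
    `length` bases from ref yields the same sequence as deleting
    ref[start:start + length]."""
    left = start
    # Shifting left by one is equivalent if the base entering the deleted
    # block equals the base leaving it.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right


# One A deleted from the AAA run (third and fourth sequences on the slide).
homopolymer = "ATCCGTATCACGGAAATGCTGTCAGTCCAGT"
print(breakpoint_range(homopolymer, start=14, length=1))

# One AT unit deleted from the ATAT repeat (fifth sequence on the slide).
dinucleotide = "ATCCGTATCACGGATATGCAAGCAGTCCAGT"
print(breakpoint_range(dinucleotide, start=13, length=2))
```

Reporting the whole range, or always normalising to one end of it, lets calls from different reads or different tools be compared as the same event.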

Real data on NA18507
• Input:
  – Bentley et al., 2008
  – 2 x 36 bp paired-end reads
  – 135 Gb of sequence
  – ~4 billion paired 35-base reads
• Preprocessing:
  – ~56 M pairs of one-end mapped reads
• Output:
  – 4.5 hours on a single CPU, with a maximum memory usage of 1.5 GB
  – 146,843 deletions (1 bp-10 kb) and 142,908 short insertions (1-16 bp)
  – 91.2% of 1-16 bp deletions and 87.2% of 1-16 bp insertions overlap with calls from Illumina

Deletions in NA18507

Paired (cancer) genomes
• COLO-829 cells
• Normal: ~30x paired-end 75 nt reads
• Tumor: ~40x paired-end 75 nt reads
Search for somatic (tumor-specific) indels
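In its simplest form, the search for tumor-specific events is a set difference between the calls made in the two samples. The sketch below assumes each call has already been reduced to a dictionary with `type`, `chrom`, `start` and `end` keys; that layout, the function name and the example coordinates are all invented for illustration.

```python
def somatic_calls(tumor_calls, normal_calls):
    """Keep tumor calls that have no identical call in the matched normal."""
    def key(call):
        return (call["type"], call["chrom"], call["start"], call["end"])

    seen_in_normal = {key(c) for c in normal_calls}
    return [c for c in tumor_calls if key(c) not in seen_in_normal]


# Made-up example calls.
tumor = [
    {"type": "D", "chrom": "1", "start": 1000, "end": 1005},
    {"type": "I", "chrom": "2", "start": 5000, "end": 5001},
]
normal = [
    {"type": "D", "chrom": "1", "start": 1000, "end": 1005},
]
print(somatic_calls(tumor, normal))
```

Exact coordinate matching only works if both samples report breakpoints in the same normalised position, which is why breakpoint shifts in repeats matter here.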

Inversions sample ref

Tandem duplications sample ref

Non-template insertions ref sample

Non-template insertions
D 4 I 2 ChrID 3 BP 156978978 156978983 Supports 12 + 0 - 12 S1 13 SUM_MS 627 NumSupSamples 1 HCC1599a 12
CATGGCTGACTTATAAATCCCTACAGATATGTGGTTACTTCTCTACTTTCCCTTTGGCTTGGGCAACTGCCACGTTGATGCACTGGAGCCATTCTTCTGCATTCTTCTCATCCTTGGCCTTAAAGACATAGGTTTTATTGTC
TTATAAATCCCTACAGATATGTGGTTACTTCTCTACTTTCCCTTTGCCTTGGGCAACTGCCAAA GATGCACT
ATGTGGTTACTTCTCTACTTTCCCTTTGGCTTGGGCAACTGCCAAA GATGCACTGGAGCCATTCTTCTGCAT
CTCTACTTTCCCTTTGGCTTGGGCAACTGCCA AA GATGCACTGGAGCCATTCTTCTGCATTCTTCTCATCCT
AGATATGTGGTTACTTCTCTACTTTCCCTTTGGCTTGGGCAACTGCCAAA GATGCACTGGAGCCATTCTTCT
TTTCCCTTTGGCTTGGGCAACTGCCA AA GATGCACTGGAGCCATTCTTCTGCATTCTTCTCATCCTTGGCCTT
TTACTTCTCTACTTTCCCTTTGGCTTGGGCAACTGCCA AA GATGCACTGGAGCCATTCTTCTGCATTCTTCTCATCCTTGGCCTTAAAGACATAGGTTT
CTACAGATATGTGGTTACTTCTCTACTTTCCCTTTGGCTTGGGCAACTGCCAAA GATGCACTGGAGCCATTC
AAATCCCTACAGATATGTGGTTACTTCTCTACTTTCCCTTTGGCTTGGGCAACTGCCAAA GATGCACTGGAG
CTTGGGCAACTGCCA AA GATGCACTGGAGCCATTCTTCTGCATTCTTCTCATCCTTGGCCTTAAAGACATAGGTTT
TTCCCTTTGGCTTGGGCAACTGCCA AA GATGCACTGGAGCCATTCTTCTGCATTCTTCTCATCCTTGGCCTT
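The event header on this slide can be turned into a record with a few lookups. This sketch handles the abbreviated header exactly as it appears on the slides, which may differ from the header in an actual Pindel output file; `parse_event_header` and the dictionary keys are hypothetical names, and the spacing inside tokens such as ChrID is assumed to be an extraction artifact.

```python
def parse_event_header(line: str) -> dict:
    """Parse a slide-style header such as
    'D 4 I 2 ChrID 3 BP 156978978 156978983 Supports 12 ...'."""
    tokens = line.split()
    chrom_at = tokens.index("ChrID")
    # Everything before ChrID is read as (event letter, size) pairs.
    sizes = dict(zip(tokens[0:chrom_at:2], map(int, tokens[1:chrom_at:2])))
    bp_at = tokens.index("BP")
    supports = None
    if "Supports" in tokens:
        supports = int(tokens[tokens.index("Supports") + 1])
    return {
        "deleted": sizes.get("D", 0),    # bases removed from the reference
        "inserted": sizes.get("I", 0),   # non-template bases added
        "chrom": tokens[chrom_at + 1],
        "bp_start": int(tokens[bp_at + 1]),
        "bp_end": int(tokens[bp_at + 2]),
        "supports": supports,
    }


header = ("D 4 I 2 ChrID 3 BP 156978978 156978983 Supports 12 + 0 - 12 "
          "S1 13 SUM_MS 627 NumSupSamples 1 HCC1599a 12")
print(parse_event_header(header))
```

In this example the header describes 4 deleted bases replaced by 2 non-template bases, which corresponds to the AA that appears at the breakpoint in the supporting reads.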

Large insertions

Allow mismatches
D 10 ChrID 13 BP 32913041 32913052
AAATCAACTAGTGACCTTCCAGGGACAACCCGAACGTGATGAAAAGATCAaagaacctacTCTATTGGGTTTTCATACAGCTAGCGGGAAAAAAGTTAAAATTGCAAAGGAATCTTTGGACAAAGT
GATGAAAAGATCA TCTGTTGGGTTTTCATACAGCTAGCGGGAAAAAAGTTAAAATTGCAAAGGAATCTTTGGACAA
CAACCCGAACGTGATGAAAAGATCA TCTGTTGGGTTTTCATACAGCTAGCGGGAAAAAAGTTAAAATTGCAAAGGAATCTTTGGA
TGAAAAGATCA TCTGTTGGGTTTTCATACAGCTAGCGGGAAAAAAGTTAAAATTGCAAAGGAATCTTTGGACAAAG
GTGATGAAAAGATCA TCTGTTGGGTTTTCATACAGCTAGCGGGAAAAAAGTTAAAATTGCAAAGGAATCTTTGGAC
TAGTGACCTTCCAGGGACAACCCGAACGTGATGAAAAGATCA TCTGTTGGGTTTTCATACAGCTAGCGGGAAAAAAGTTAAA
ACAACCCGAACGTGATGAAAAGATCA TCTGTTGGGTTTTCATACAGCTAGCGGGAAAAAAGTTAAAATTGCAAAGGAATCTT
CCCGAACGTGATGAAAAGATCA TCTGTTGGGTTTTCATACAGCTAGCGGGAAAAAAGTTAAAATTGCAAAGGAATC
AACCCGAACGTGATGAAAAGATCA TCTGTTGGGTTTTCATACAGCTAGCGGGAAAAAAGTTAAAATTGCAAAGGAATCTTTGGACA
ACCTTCCAGGGACAACCCGAACGTGATGAAAAGATCA TCTGTTGGGTTTTCATACAGCTAGCGGGAAAAAAGTTAAAATTGCAAAGGAATCTTTGGACAA
AACCCGAACGTGATGAAAAGATCA TCTGTTGGGTTTTCATACAGCTAGCGGGAAAAAAGTTAAAATTGCAAAGGAATCTTT
The first line is the reference, with the 10 deleted bases in lowercase. Every supporting read differs from it at one base after the breakpoint (TCTGTTGGG in the reads, TCTATTGGG in the reference) and is still reported.

Variant types
• Short indels
• Large deletions
• Tandem duplications
• Inversions
• Breakpoints of large insertions
- With/without non-template sequence

Where to get Pindel
• www.ebi.ac.uk/~kye/pindel
• https://trac.nbic.nl/pindel
• ftp.sanger.ac.uk/pub/zn1/pindel
• k.ye@lumc.nl
• kye@ebi.ac.uk

Practical
• chr20-p, MZ twin data of Dorret
• Qs:
  – How many reads are in the file twin_reads.txt?
  – Open the output file for deletions:
    • less twin_output/twin_reads.txt_20_D
  – Do you see micro-satellite and homopolymer events?
  – Can you find potentially discordant events?
https://trac.nbic.nl/pindel/

• grep ChrID twin_output/twin_reads.txt_20_D | awk '{if ($31 == 1 && $16 >= 10) print}'
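For readers who prefer a script, the same filter can be written in Python. This is a sketch that mirrors the command numerically: it keeps header lines containing ChrID whose 31st whitespace-separated field equals 1 and whose 16th field is at least 10. The slides do not say what those two columns hold, so the code makes no claim about their meaning; `filter_events` is a hypothetical name.

```python
def filter_events(lines, min_value=10):
    """Python equivalent of:
    grep ChrID ... | awk '{if ($31 == 1 && $16 >= 10) print}'"""
    kept = []
    for line in lines:
        if "ChrID" not in line:      # the grep step
            continue
        fields = line.split()        # awk splits on whitespace by default
        if len(fields) < 31:
            continue
        try:
            col16 = float(fields[15])  # awk's $16 (fields are 1-based in awk)
            col31 = float(fields[30])  # awk's $31
        except ValueError:
            continue                   # non-numeric field: skip the line
        if col31 == 1 and col16 >= min_value:
            kept.append(line)
    return kept


# Usage on the practical's file (path taken from the slide):
# with open("twin_output/twin_reads.txt_20_D") as handle:
#     for line in filter_events(handle):
#         print(line, end="")
```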