The 1000 Genomes Project: The Phase 1 Variant Set and Future Developments
Laura Clarke
16th October 2012

Glossary
• Pilot: The 1000 Genomes Project ran a pilot study between 2008 and 2010
• Phase 1: The initial round of exome and low coverage sequencing of 1000 individuals
• Phase 2: Expanded sequencing of 1700 individuals and method improvement
• Phase 3: Sequencing of 2500 individuals and a new variation catalogue
• SAM/BAM: Sequence Alignment/Map Format, an alignment format
• VCF: Variant Call Format, a variant format

The 1000 Genomes Project: Overview
• International project to construct a foundational data set for human genetics
  – Discover virtually all common human variations by investigating many genomes at the base pair level
  – Consortium with multiple centers, platforms, funders
• Aims
  – Discover population level human genetic variations of all types (95% of variation > 1% frequency)
  – Define haplotype structure in the human genome
  – Develop sequence analysis methods, tools, and other reagents that can be transferred to other sequencing projects

Phase 1 populations

Phase 2/3 populations
• Barbados, Pakistan, Ghana, Bangladesh, Peru, Nigeria, India, Vietnam, Sierra Leone, Sri Lanka, USA

HapMap, the Pilot Project and the Main Project
• HapMap
  – Started in 2002
  – Last release contained ~3 million SNPs
  – 1400 individuals
  – 11 populations
  – High throughput genotyping chips
• 1000 Genomes Pilot project
  – Started in 2008
  – Paper release contained ~14 million SNPs
  – 179 individuals
  – 4 populations
  – Low coverage next generation sequencing
• 1000 Genomes Phase 1
  – Started in 2009
  – Phase 1 release has 36.6 million SNPs, 3.8 million indels and 14K deletions
  – 1094 individuals
  – 14 populations
  – Low coverage and exome next generation sequencing
• 1000 Genomes Phase 2
  – Started in 2011
  – 1722 individuals
  – 19 populations
  – Low coverage and exome next generation sequencing

Timeline
• September 2007: 1000 Genomes project formally proposed in Cambridge, UK
• April 2008: First submission of data to the Short Read Archive
• May 2008: First public data release
• October 2008: SAM/BAM format defined
• December 2008: First high coverage variants released
• December 2008: First 1000 Genomes browser released
• May 2009: First indel calls released
• July 2009: VCF format defined
• August 2009: First large scale deletions released
• December 2009: First main project sequence data released
• March 2010: Low coverage pilot variant release made
• July 2010: Phased genotypes for 159 individuals released
• October 2010: A Map of Human Variation from population scale sequencing is published in Nature
• January 2011: Final Phase 1 low coverage alignments are released
• May 2011: @1000genomes appears on Twitter
• May 2011: First variant release made on more than 1000 individuals
• October 2011: Phase 1 integrated variant release made
• March 2012: Phase 2 alignment release
• November 2012: An integrated map of genetic variation from 1,092 human genomes in Nature

Fraction of variant sites present in an individual that are NOT already represented in dbSNP

Date             Fraction not in dbSNP
February 2000    98%
February 2001    80%
April 2008       10%
February 2011    2%
Now              <1%

Ryan Poplin, David Altshuler

Sequencing Data Evolution
• The Project contains data from 3 different providers and multiple platforms

Platform                     Min Read Length (bp)   Max Read Length (bp)
454 Roche GS FLX Titanium    70                     400
Illumina GA                  30                     81
Illumina GA II               26                     160
Illumina HiSeq               50                     102
ABI SOLiD System 2.0         25                     35
ABI SOLiD System 2.5         50                     50
ABI SOLiD System 3.0         50                     50

Pipelines for data processing and variant calling
• Tens of analysis groups have contributed
• Individual pipelines and component tools vary
• Typical main steps:
  – Read mapping
  – Duplicate filtering
  – Base quality value recalibration
  – INDEL realignment
  – Variant site discovery
  – Individual genotype assignment (sometimes part of site discovery)
  – Variant filtering / call set refinement
  – Variant reporting

3 pilot coverage strategies

Phase 1 analysis goal: an integrated view of human variations
• Integrated haplotype map of 1,092 human genomes
  – 14 populations from Europe, Africa, East Asia, and the Americas
  – 38 million SNPs, 2 million INDELs, 14 thousand larger deletions
  – ~99% of per-individual variants, >95% of 1% frequency SNPs

Alignment Data
• The project has made more than 10 releases of alignment data
• Pilot Project
  – Aligned to NCBI36
  – Maq and Corona
  – Base Quality Recalibration done
• Phase 1
  – Aligned to GRCh37
  – BWA and BFAST
  – Indel Realignment
• Phase 2
  – Aligned to extended GRCh37
  – Improvements to Base Quality Recalibration

Short Variant Calling
• Early call sets used a single variant caller
• Intersect approach developed during pilot
• Variant Quality Score Recalibration (VQSR) developed for Phase 1
• Genotype Likelihoods assigned to help with genotype calling
• Integrated genotype calling based on individual variant call sets
• Phase 2 looks to improve site discovery and improve integration
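
The idea behind an intersect approach can be illustrated with a minimal sketch. It assumes each caller's output has already been reduced to a set of (chromosome, position, ref, alt) tuples. The function name `intersect_call_sets`, the `min_callers` threshold and the example sites are hypothetical and are not taken from the project's pipelines.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# A site is identified by (chromosome, position, ref allele, alt allele).
Site = Tuple[str, int, str, str]


def intersect_call_sets(call_sets: Iterable[Set[Site]], min_callers: int) -> Set[Site]:
    """Keep sites reported by at least `min_callers` of the input call sets."""
    counts: Counter = Counter()
    for sites in call_sets:
        # Each call set is a set, so it contributes at most one vote per site.
        counts.update(sites)
    return {site for site, n in counts.items() if n >= min_callers}


# Invented example sites from three hypothetical callers.
caller_a = {("1", 1000, "A", "G"), ("1", 2000, "C", "T")}
caller_b = {("1", 1000, "A", "G"), ("1", 3000, "G", "A")}
caller_c = {("1", 1000, "A", "G"), ("1", 2000, "C", "T")}

strict = intersect_call_sets([caller_a, caller_b, caller_c], min_callers=3)
relaxed = intersect_call_sets([caller_a, caller_b, caller_c], min_callers=2)
print(sorted(strict))
print(sorted(relaxed))
```

Requiring all callers to agree favours specificity, while a lower threshold keeps more candidate sites for later filtering.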

Summary of Variants

                               Autosomes   Chromosome X          GENCODE regions
Sequence Data
  Mean mapped depth (x)        5.1         3.9                   80.3
  Total raw bases (Gb)         19049       804                   327
SNPs
  No. sites overall            36.7 M      1.3 M                 498 K
  Novelty rate                 58%         77%                   50%
  No. Syn/NonSyn/Nonsense      NA          4.7 / 6.5 / 0.097 K   199 / 293 / 6.3 K
  Avg. no. SNPs per sample     3.60 M      105 K                 24.0 K
Indels
  No. sites overall            1.38 M      59 K                  1867
  Novelty rate                 62%         73%                   54%
  No. in-frame/frameshift      NA          19 / 14               719 / 1066
  Avg. no. indels per sample   344 K       13 K                  440
Large Deletions
  No. sites overall            13.8 K      432                   847
  Novelty rate                 54%         54%                   50%
  Avg. variants per sample     717         26                    39

Power to Detect
• Site Detection
• Genotype Concordance

Quality of the Phase 1 Integrated Genotypes
• Region: Genome, Exome, Genome
• Type: SNP, SNP, INDEL, SV
• Evaluation set: Omni 2.5M, CGI, CGI, Conrad
• # Evaluation samples (overlaps): 1,092; 34; 765; 34; 34; 248
• # Markers: 2.1 M; 13 M; 50 K; 13 M; 820 K; 1.1 K
• HET concordance: 99.09%; 98.63%; 99.73%; 99.52%; 95.64%; 99.01%
• Overall concordance: 99.65%; 99.60%; 99.95%; 99.83%; 98.01%; 99.82%
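
How a concordance figure of this kind might be computed can be sketched as follows. The slide does not define its metrics, so this is an assumption: genotypes are encoded as alternate allele counts (0, 1 or 2), overall concordance is the fraction of shared markers with identical genotypes, and HET concordance is the same fraction restricted to markers where the evaluation genotype is heterozygous. The function name and the marker names are hypothetical.

```python
from typing import Dict, Tuple


def concordance(calls: Dict[str, int], evaluation: Dict[str, int]) -> Tuple[float, float]:
    """Return (overall, het) concordance over markers present in both inputs.

    Genotypes are alternate allele counts: 0 = hom ref, 1 = het, 2 = hom alt.
    """
    shared = calls.keys() & evaluation.keys()
    if not shared:
        raise ValueError("no overlapping markers to compare")
    overall = sum(calls[m] == evaluation[m] for m in shared) / len(shared)

    # Assumed definition: restrict to markers that are het in the evaluation set.
    het_markers = [m for m in shared if evaluation[m] == 1]
    if het_markers:
        het = sum(calls[m] == 1 for m in het_markers) / len(het_markers)
    else:
        het = float("nan")
    return overall, het


# Invented genotypes for four markers in one sample.
evaluation_set = {"m1": 0, "m2": 1, "m3": 1, "m4": 2}
called = {"m1": 0, "m2": 1, "m3": 0, "m4": 2}
overall, het = concordance(called, evaluation_set)
print(overall, het)
```

Heterozygous sites are usually the hardest to call from low coverage data, which is one reason to report them separately from the overall figure.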

Imputation Accuracy

Ancestry Deconvolution in Admixed Populations
• Blue = European, Grey = African, Red = Native American, Black = Unassigned

New for Phase 2 and 3
• Empirical indel error modeling
• De novo assembly variant calling
• Haplotype based variant calling
• Multi-allelic sites
• MNP and complex substitution calling
• Integrated genotyping for structural variation

Data Availability
• FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
  – Raw Data Files
• AWS Amazon Cloud: http://aws.amazon.com/1000genomes/
  – FTP mirror
• Web site: http://www.1000genomes.org
  – Release Announcements
  – Documentation
• Ensembl Style Browser: http://browser.1000genomes.org
  – Browse 1000 Genomes variants in Genomic Context
  – Variant Effect Predictor
  – Data Slicer
  – Other Tools

FTP Site
• Two mirrored FTP sites
  – ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp
  – ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp
• NCBI site is a direct mirror of the EBI site
  – Can be up to 24 hours out of date
• Both also accessible using Aspera: http://asperasoft.com/
• EBI site has an HTTP mirror
  – http://ftp.1000genomes.ebi.ac.uk/vol1/ftp
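
Because the mirrors share one directory layout, a path relative to the `ftp` root can be turned into a URL on each of them. The sketch below assumes that the same relative path is valid on all three base URLs listed above. The helper name `mirror_urls` is hypothetical, and the example path is the release file quoted elsewhere in these slides.

```python
from typing import Dict

# Base URLs as listed on the slide.
EBI_FTP = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp"
NCBI_FTP = "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp"
EBI_HTTP = "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp"


def mirror_urls(relative_path: str) -> Dict[str, str]:
    """Build the URL of one file on each mirror from a path relative to the ftp root."""
    rel = relative_path.lstrip("/")  # tolerate a leading slash
    return {
        "ebi_ftp": f"{EBI_FTP}/{rel}",
        "ncbi_ftp": f"{NCBI_FTP}/{rel}",
        "ebi_http": f"{EBI_HTTP}/{rel}",
    }


urls = mirror_urls(
    "release/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz"
)
for name, url in urls.items():
    print(name, url)
```

Since the NCBI copy can lag by up to 24 hours, a newly announced file may be present on the EBI site first.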

ftp://ftp.1000genomes.ebi.ac.uk
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp
• Documentation
• Raw Data
• Phase 1 Data
• Pilot Data
• Release Data
• Technical Data

The FTP Site: Data
• Sample Level Files
  – sequence_read
  – alignment

FTP Site: Technical
• Alternative Alignments
• Reference Data Sets
• Experimental Data

FTP Site: Release
• Older Release Dirs
• Date Format YYYYMMDD
• Sequence Index Dates
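
Directory names in the YYYYMMDD format can be parsed into dates and compared. The following is a minimal sketch: `parse_release_dir` is a hypothetical helper, and apart from 20110521 (quoted elsewhere in these slides) the directory names in the list are illustrative only.

```python
from datetime import date, datetime


def parse_release_dir(name: str) -> date:
    """Parse a release directory name in YYYYMMDD format into a date."""
    # strptime alone accepts some shorter strings, so check the shape first.
    if len(name) != 8 or not name.isdigit():
        raise ValueError(f"not a YYYYMMDD directory name: {name!r}")
    return datetime.strptime(name, "%Y%m%d").date()


# Illustrative directory names; only 20110521 appears in the slides.
release_dirs = ["20100804", "20110521", "20101123"]
latest = max(release_dirs, key=parse_release_dir)
print(latest, parse_release_dir(latest).isoformat())
```

A useful property of this format is that plain string sorting already gives chronological order, so parsing is mainly a validation step.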

FTP Site: Phase 1 Data
• Analysis Results

Finding Data
• FTP search: http://www.1000genomes.org/ftpsearch
  – Search on the current.tree file
  – Provides full FTP paths and md5 checksums
• Every page also has a website search box
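
A published md5 checksum can be used to confirm that a download completed intact. The sketch below uses only the Python standard library. The helper names `md5sum` and `matches_checksum` are hypothetical, and a small temporary file stands in for a downloaded data file.

```python
import hashlib
import os
import tempfile


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's md5 with an expected value such as one from FTP search."""
    return md5sum(path) == expected_md5.strip().lower()


# Demonstration with a temporary file standing in for a download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
    demo_path = tmp.name

expected = hashlib.md5(b"example payload").hexdigest()
ok = matches_checksum(demo_path, expected)
bad = matches_checksum(demo_path, "0" * 32)
os.remove(demo_path)
print(ok, bad)
```

Reading in chunks keeps memory use flat, which matters for alignment files that can be many gigabytes in size.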

http://www.1000genomes.org

http://browser.1000genomes.org

Genes and SNPs
• Coding, UTR, Intron
• Line indicates number of SNPs
• Each line is one SNP

Region in Detail

Turning on Tracks

File upload to view with 1000 Genomes data
Manage your data
• Supports popular file types:
  – BAM, BED, bedGraph, BigWig, GBrowse, Generic, GFF, GTF, PSL, VCF*, WIG
* VCF must be indexed

Uploaded VCF
• Example: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz
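
A sites VCF such as this one carries one variant per line in eight tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The sketch below parses one such data line. The `SiteRecord` type and `parse_sites_line` function are hypothetical names, and the example record is invented for illustration rather than copied from the release file.

```python
from typing import Dict, List, NamedTuple


class SiteRecord(NamedTuple):
    chrom: str
    pos: int
    id: str
    ref: str
    alts: List[str]
    info: Dict[str, str]


def parse_sites_line(line: str) -> SiteRecord:
    """Parse one data line (not a '#' header line) of a sites-only VCF."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated VCF columns")
    chrom, pos, variant_id, ref, alt, _qual, _filter, info = fields[:8]

    info_map: Dict[str, str] = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue  # empty INFO column
        key, _, value = entry.partition("=")
        info_map[key] = value  # flag entries without '=' map to ""
    return SiteRecord(chrom, int(pos), variant_id, ref, alt.split(","), info_map)


# Invented record, for illustration only.
example = "1\t12345\trs0000001\tG\tA\t100\tPASS\tAF=0.25;VT=SNP\n"
record = parse_sites_line(example)
print(record.chrom, record.pos, record.ref, record.alts, record.info)
```

For whole files, a dedicated VCF library and the tabix index are a better fit than line-by-line parsing. This sketch only shows the column layout.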

Gene View
• Click a gene, then ‘Variation Table’ or ‘Variation Image’
• Gene Tab
• Download as CSV
• Get in VCF format

Variation Image
• Gene variation zoom

Transcript Tab: Variations
Effect on Protein:
• SIFT
• PolyPhen

Variation Pages

http://browser.1000genomes.org

Tools page

Data Slicing

Variant Effect Predictor
• Predicts functional consequences of variants
• Both web front end and API script
• Can provide
  – SIFT/PolyPhen/Condel consequences
  – RefSeq gene names
  – HGVS output
• Can run from a cache as well as a database
• Convert from one input format to another
• Script available for download from: ftp://ftp.ensembl.org/pub/misc-scripts/Variant_effect_predictor/
• http://browser.1000genomes.org/Homo_sapiens/UserData/UploadVariations

Variation Pattern Finder
• Remote or local tabix indexed VCF input
• Discovers patterns of shared inheritance
• Variants with functional consequences considered by default
• Web output with CSV and Excel downloads
• http://browser.1000genomes.org/Homo_sapiens/UserData/VariationsMapVCF

VCF to PED
• LD visualization tools like Haploview require PED files
• VCF to PED converts VCF to PED
• Will divide a file by individual or population
• http://browser.1000genomes.org/Homo_sapiens/UserData/Haploview
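
The core of such a conversion is turning each VCF genotype into a pair of allele letters and writing them after the six leading PED columns. This is a minimal sketch and not the tool's actual implementation. It assumes diploid SNP genotypes, writes "0" for unknown parents, sex, phenotype and missing alleles, and uses hypothetical function and sample names.

```python
from typing import List, Sequence, Tuple


def genotype_to_ped_alleles(gt: str, ref: str, alts: Sequence[str]) -> Tuple[str, str]:
    """Convert a diploid VCF GT string such as '0|1' into two allele letters."""
    alleles = [ref] + list(alts)
    calls = gt.replace("|", "/").split("/")  # phasing is not kept in PED
    if len(calls) != 2:
        raise ValueError(f"expected a diploid genotype, got {gt!r}")
    first, second = ("0" if c == "." else alleles[int(c)] for c in calls)
    return first, second


def ped_line(family_id: str, individual_id: str,
             genotypes: List[Tuple[str, str]]) -> str:
    """Build one PED row: six leading columns, then one allele pair per marker."""
    # Paternal ID, maternal ID, sex and phenotype are all written as unknown.
    leading = [family_id, individual_id, "0", "0", "0", "0"]
    return "\t".join(leading + [f"{a} {b}" for a, b in genotypes])


# Invented sample with two markers.
pairs = [
    genotype_to_ped_alleles("0|1", "A", ["G"]),
    genotype_to_ped_alleles("./.", "C", ["T"]),
]
row = ped_line("FAM1", "IND1", pairs)
print(row)
```

Marker positions are not stored in the PED file itself, so a companion marker information file is written alongside it in practice.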

Haploview
• http://www.broadinstitute.org/scientific-community/science/programs/medical-and-population-genetics/haploview

Access to backend Ensembl databases
• Public MySQL database at mysql-db.1000genomes.org, port 4272
• Full programmatic access with the Ensembl API
• The 1000 Genomes Pilot uses Ensembl v60 databases and the NCBI36 assembly (this is frozen)
• The 1000 Genomes main project currently uses Ensembl v68 databases
• http://jul2012.archive.ensembl.org/info/docs/api/variation/index.html
• http://www.ensembl.org/info/docs/api/variation/index.html
• http://www.1000genomes.org/node/517

More Information
• http://www.1000genomes.org/using-1000-genomes-data
• Please email info@1000genomes.org with any questions

Announcements
• http://1000genomes.org
• 1000announce@1000genomes.org
• http://www.1000genomes.org/1000-genomes-annoucement-mailing-list
• http://www.1000genomes.org/announcements/rss.xml
• http://twitter.com/#!/1000genomes

Thanks
• The 1000 Genomes Project Consortium
• Paul Flicek
• Richard Smith
• Holly Zheng Bradley
• Ian Streeter