DTL Focus meeting: Using GRCh38 in NGS data analysis

Time slot      Speaker                     Subject
12:45-13:00                                Coffee/tea
13:00-13:20    Ies Nijman (UMCU)           Welcome & Introduction to GRCh38 (hg20)
13:20-13:40    Pieter Neerinx (UMCG)       Migration of tools, pipelines to support GRCh38
13:40-14:00    Pjotr Prins                 BWA handling of ALT contigs
14:00-14:10                                Tea break
14:10-14:30    Zuotian Tatum (LUMC)        New insights on Differential Gene Expression using GRCh38
14:30-14:50    Wibowo Arindrarto (LUMC)    Comparison of hg19 and GRCh38 in the study of DUX4 gene
14:50-15:30    Ies Nijman (UMCU)           Wrap-up and open discussions

GRCh38 / hg20

Human genome build hg20
• Basic new assembly released Dec 24th, 2013; now GRCh38.p2 (Dec 8th, 2014)
• 5-7 megabases of added sequence to primary reference
• Many corrected regions (patches) to hg19
• 261 alternative loci: chromosomal regions with high variability (~66 Mb)
• 128 large unplaced sequence regions
• Human herpesvirus (EBV) mapping decoy (171 kb)
• Centromere sequences: gaps are replaced by sequence models of the centromere repeats
• New mitochondrial sequence: Revised Cambridge Reference Sequence (rCRS) from MITOMAP
• 4 PAR regions
• This means that coordinates change! Lift-over strategies will not completely solve it.

Human genome build hg20
• New genebuild now available (20,364 coding genes; 2,101 in alternative loci)
• Only a few calling/annotation tools support hg20 yet (VEP, for instance)
• Ensembl default genome is hg20! The latest hg19 site is being maintained through an archive link.
• dbSNP locations available for hg20
• 1000G data will be remapped and recalled (est. Q1/Q2 2015)

Human genome build hg20 - Challenges and opportunities
• How to use these alternative loci? In hg19 only a few were present, and they were mostly blissfully ignored.
• Challenge I: mapping strategy and tools need to be changed
  • In prep: iBWA, srprism
  • BWA 0.7.12 (29 Dec 2014) supports ALTs in a two-step approach
• Challenge II: variant callers need to be aware of alternative references (and context)
• Challenge III: how to display this data in genome browsers etc., while maintaining context?
• Challenge IV: nomenclature
• The primary assembly contains all patches and fixes to hg19 and is still a good starting point.

What are these ALT loci?
• Scaffolds that provide an alternate representation of a locus found in the primary reference.
• Long regions with clustered variations (e.g. LRC/KIR on chr19 and MHC, the HLA loci, on chr6)
• Besides different haplotype variants of genes, they also contain genes not in the primary assembly (20 protein-coding, ~40 predicted protein-coding, pseudogenes, lincRNAs)
• Mind the ALTernative approaches of NCBI and Ensembl: NCBI uses primary chromosomes plus ALT loci, while Ensembl builds a completely new ALT chromosome (so including identical sequence)

Usage scenarios
• I: use primary reference (toplevel chrs)
• II: use primary reference + mapping decoys (Un + EBV)
  • Improves mapping accuracy
  • Only feed primary reference to variant caller
• III: use primary reference + ALT loci + mapping decoys (Un + EBV)
  • Improves mapping accuracy (?)
  • A: Only feed primary reference to variant caller
  • B: Run variant caller on all loci…
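
The "only feed primary reference to variant caller" option in scenarios II and III-A comes down to sorting contig names into classes. Below is a minimal Python sketch of that step. It assumes UCSC-style contig names (for example `chr19_KI270938v1_alt`, `chrUn_...`, `..._random`, `chrEBV`, `..._decoy`); the function names `classify_contig` and `primary_contigs` are hypothetical and not part of any tool mentioned in the slides. Note that the slide groups Un and EBV together as mapping decoys, whereas the sketch keeps them as separate classes.

```python
import re

# Assumption: UCSC-style names. Adjust the patterns if your FASTA is named differently.
_PRIMARY = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$")


def classify_contig(name: str) -> str:
    """Return a coarse class for a contig name (hypothetical helper)."""
    # Decoys are checked first because some decoy names also start with "chrUn_".
    if name == "chrEBV" or name.endswith("_decoy"):
        return "decoy"
    if name.endswith("_alt"):
        return "alt"
    if name.startswith("chrUn_"):
        return "unplaced"
    if name.endswith("_random"):
        return "unlocalized"
    if _PRIMARY.match(name):
        return "primary"
    return "unknown"


def primary_contigs(names):
    """Keep only the contigs one would hand to a variant caller in scenario II / III-A."""
    return [n for n in names if classify_contig(n) == "primary"]
```

The resulting list could then be written out as an interval or region list for whichever variant caller is used; how that is passed on depends on the caller and is not shown here.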

Adding the mapping decoys

Class          GRCh38_full_plus_analysisset    GRCh38_full_analysisset
               Total bp                        Total bp
Primary        3,088,286,401                   3,088,286,401
Unlocalized    6,978,808                       6,978,808
Unplaced       4,485,509                       4,485,509
ALT            109,535,387                     109,535,387
Decoy          5,964,345                       171,823
Total          3,215,250,450                   3,209,457,928

[Graphs based on 11 Xten WGS samples]
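
As a quick sanity check on the table, the per-class sizes can be summed and compared with the stated totals. The sketch below uses only the numbers from the slide; the dictionary layout and the helper name `total_bp` are illustrative choices.

```python
# Per-class sizes in bp, copied from the table above.
ANALYSIS_SETS = {
    "GRCh38_full_plus_analysisset": {
        "Primary": 3_088_286_401,
        "Unlocalized": 6_978_808,
        "Unplaced": 4_485_509,
        "ALT": 109_535_387,
        "Decoy": 5_964_345,
    },
    "GRCh38_full_analysisset": {
        "Primary": 3_088_286_401,
        "Unlocalized": 6_978_808,
        "Unplaced": 4_485_509,
        "ALT": 109_535_387,
        "Decoy": 171_823,
    },
}


def total_bp(classes: dict) -> int:
    """Sum the sizes of all sequence classes in one analysis set."""
    return sum(classes.values())


if __name__ == "__main__":
    for name, classes in ANALYSIS_SETS.items():
        print(name, total_bp(classes))
    plus = ANALYSIS_SETS["GRCh38_full_plus_analysisset"]["Decoy"]
    full = ANALYSIS_SETS["GRCh38_full_analysisset"]["Decoy"]
    print("extra decoy sequence in the 'plus' set:", plus - full)
```

The two sets differ only in the decoy class: the "plus" set carries 5,792,522 bp more decoy sequence than the set that contains just the 171,823 bp EBV decoy.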

GRCh37.p13: improved alignments outside of fix patch regions (Jason Harris)
[Figure: regions outside of fix patches, hs37d5 vs. GRCh37.p13]
Personalis, Inc. | Confidential and Proprietary

Heng Li: BWA approach to ALT mapping
• ALTs supported in > v0.7.11 through an additional ID-list file, $ref.alt
• Advised to use the NCBI NGS analysis sets (3 flavors), with slightly modified sequences to facilitate mapping (hard-masked PAR and centromeric regions)
1. The original mapQ of a non-ALT hit is computed across non-ALT hits only. The reported mapQ of an ALT hit is always computed across all hits.
2. An ALT hit is only reported if its score is better than all overlapping non-ALT hits. A reported ALT hit is flagged with 0x800 (supplementary) unless there are no non-ALT hits.
3. The mapQ of a non-ALT hit is reduced to zero if its score is less than 80% (controlled by option -g) of the score of an overlapping ALT hit. In this case, the original mapQ is moved to the om tag.
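
Rules 2 and 3 above can be illustrated with a small toy model in Python. This is a sketch for reading the rules, not BWA's implementation. It assumes that all hits passed in belong to one read and overlap each other, and that the mapQ values were already computed as described in rule 1. The `Hit` class, `apply_alt_rules` and the `drop_ratio` parameter are hypothetical names.

```python
from dataclasses import dataclass, field

SUPPLEMENTARY = 0x800  # SAM flag bit for supplementary alignments


@dataclass
class Hit:
    contig: str
    score: int
    mapq: int
    is_alt: bool
    flag: int = 0
    tags: dict = field(default_factory=dict)


def apply_alt_rules(hits, drop_ratio=0.8):
    """Toy model of rules 2 and 3; modifies the hits in place and returns those reported.

    Assumption: every hit overlaps every other hit on the read.
    """
    non_alt = [h for h in hits if not h.is_alt]
    alt = [h for h in hits if h.is_alt]
    best_non_alt = max((h.score for h in non_alt), default=None)
    best_alt = max((h.score for h in alt), default=None)

    reported = []
    for h in non_alt:
        # Rule 3: a much weaker non-ALT hit gets mapQ 0; the old value moves to "om".
        if best_alt is not None and h.score < drop_ratio * best_alt:
            h.tags["om"] = h.mapq
            h.mapq = 0
        reported.append(h)

    for h in alt:
        if best_non_alt is None:
            # Rule 2, exception: no non-ALT hits, so report without the 0x800 flag.
            reported.append(h)
        elif h.score > best_non_alt:
            # Rule 2: report only if better than all overlapping non-ALT hits.
            h.flag |= SUPPLEMENTARY
            reported.append(h)
    return reported
```

For example, a read scoring 70 on chr19 and 100 on chr19_KI270938v1_alt would keep both hits: the chr19 hit with mapQ 0 and its original mapQ in `om`, and the ALT hit flagged as supplementary.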

Heng Li: BWA approach

Variant calling on ALTs?
• By adding the ALT loci in mapping and calling we gain better haplotype-aware mappings and calls, but this is not clearly reflected in the VCF
• Adding ‘haplotyping’ to the VCF format (A. Quinlan, Virginia, GRC WS 2014)

Variant annotation on hg20 / ALTs
• Ensembl VEP
• SnpEff
• dbNSFP in next release (~May)

Nomenclature
• chr19_KI270938v1_alt
• CHR_HSCHR19KIR_G248_BA2_HAP_CTG3_1
• GenBank: KI270886.1
• RefSeq: NT_187640.1
• hg38 / GRCh38, not hg20 please…
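
The `chr19_KI270938v1_alt` style of name follows a simple pattern: chromosome, then the GenBank accession with the version dot replaced by "v", then an `_alt` suffix. A minimal Python sketch of that pattern follows; the function name `ucsc_alt_name` is hypothetical, and the accession in the usage note is the one embedded in the contig name above. It does not claim any mapping between the separate GenBank and RefSeq identifiers listed on the slide.

```python
def ucsc_alt_name(chromosome: str, genbank_accession: str) -> str:
    """Build a chrN_<accession>v<version>_alt style name (illustrative helper)."""
    # "KI270938.1" -> accession "KI270938", version "1"
    accession, version = genbank_accession.split(".")
    return f"chr{chromosome}_{accession}v{version}_alt"
```

Calling `ucsc_alt_name("19", "KI270938.1")` gives `chr19_KI270938v1_alt`.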

"Everything is in a state of flux, including the status quo." -Robert Byrne-
• Even 1.5 years after the release, many things are uncertain about the use of the full build.
• GATK is remarkably silent
• Ewan Birney and Richard Durbin agreed on March 24th to rebuild a new reference/analysis set with a more standardized set of chromosomes, ALTs and decoys (pers. comm.).
• Heng Li: “The current BWA-MEM method is just a start. [...] We may make changes. It is also possible that we might make breakthrough on the representation of multiple genomes, in which case, we can even get rid of ALT contigs for good.”