ANALYSIS OF B CELLAIRRSEQ DATA Jason Anthony Vander
ANALYSIS OF B CELLAIRR-SEQ DATA Jason Anthony Vander Heiden Kleinstein & O’Connor Labs Yale University
Topics § Raw read assembly and quality control § V(D)J reference alignment and novel allele detection § Clonal assignment and diversity § Analysis of mutations and selection 2
Antibody structure and V(D)J recombination § Antibodies are composed of two heavy and light chains. § Variable region is antigen contacting. § Constant region defines the isotype and effector function of the antibody: Ig. M, Ig. D, Ig. A, Ig. G or Ig. E. § Built by recombination of V-D-J (heavy chain) or V-J (light chain). § V(D)J junction region identifies a clonal lineage. Junction 3
Affinity maturation and somatic hypermutation 4
V(D)J m. RNA sequencing § Typically bulk lysis of isolated B cells. § The variable, V(D)J region, is selectively amplified along with a small segment of the constant region for isotype identification. § Traditionally Roche 454 or Illumina Mi. Seq. Sample collection, B cell isolation, and RNA extraction V(D)J amplification and library preparation Sequencing 5
Unique molecular identifiers § Individual m. RNA molecules are labeled with nucleotide barcodes (UMIs), which are then used for downstream error correction. Vander Heiden, Yaari et al. Bioinformatics. 2014. 6
The Immcantation analysis framework Raw sequences § Two commandline tools sets. § Three R packages. § Modular design. Annotated repertoire Biological interpretation http: //immcantation. readthedocs. io 7
Why use Immcantation? § Covers all stages of analysis § Widely used and extensively tested. § Includes many analytical methods unavailable elsewhere. § Common data standard simplifies workflows. § Modular design. May be used in whole or in part. § Under active development. § Fully open source license (CC BY-SA 4. 0). § Extensive documentation: http: //immcantation. com 8
Immcantation data standard and interoperability § All tools in the framework use a standardized scheme for sequencespecific annotations in sequences files (FASTA/Q) and a standardized tab -separated file format. § Tools are provided to exchange between FASTA/Q and the TSV format, as well as a Python API for parsing the FASTA/Q annotation scheme. § Details of the full standard are available online. FASTA/Q >SEQ 1|PRIMER=IGHV 6, IGM|BARCODE=DAY 7|DUPCOUNT=8 NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACCTGTGCCAGCAGACCCTCTCACCTGTGCCCCTCGCAGACCCTGCGT >SEQ 2|PRIMER=IGHV 1, IGG|BARCODE=DAY 3|DUPCOUNT=64 TCTCCGGGGACAGTGTTTCTACCAAAAGACCCTCTCACTGACCCTCTCACGCAGACCCTCTCACCTGTGCCCCTCGCAGACCAGCACT Tab-separated SEQUENCE_ID SEQUENCE_INPUT … PRIMER BARCODE DUPCOUNT SEQ 1 NNNNCCACG. . . CCTGTGCCA … IGHV 6, IGM DAY 7 8 SEQ 2 TCTCCGGGG. . . CCCTCTCAC … IGHV 1, IGG DAY 3 64 http: //changeo. readthedocs. io/en/latest/standard. html 9
Documentation Portal: http: //immcantation. readthedocs. io § p. RESTO: http: //presto. readthedocs. io § Change-O: http: //changeo. readthedocs. io § Alakazam: http: //alakazam. readthedocs. io § SHaza. M: http: //shazam. readthedocs. io § TIg. GER: http: //tigger. readthedocs. io 10
Docker container Available on Docker Hub: javh/immcantation: release # Pull and run image docker pull javh/immcantation: release docker run –v ~/Workspace/examples: /data: z -it javh/immcantation: release bash Contents # List installed versions report § Immcantation framework § R § Ig. BLAST # Pipelines help presto-abseq -h changeo-igblast -h shazam-threshold –h changeo-clone -h § MUSCLE § BLAST § vsearch § PHYLIP § Various helper scripts § Set of predefined pipelines 11
Example data § One healthy donor (HD 13) from: § Vander Heiden et al. Dysregulation of B Cell Repertoire Formation in Myasthenia Gravis Patients Revealed through Deep Sequencing. J Immunol. 2017; 198: 1460– 1473. § Archive: § § Raw data in FASTQ: /input. C-primer and template switch sequences in FASTA: Example scripts: /scripts Completed output from all scripts: /complete /input # Reduce files to 100 K reads head –n 400000 HD 13 M_R 1. fastq > read-1. fastq head –n 400000 HD 13 M_R 2. fastq > read-2. fastq 12
RAW READ PROCESSING AND QUALITY CONTROL 13
Raw read processing and quality control Goals § Mitigate the impact of sequencing and PCR error. § Remove off-target amplicons § Read assembly and UMI consensus generation. § Add read specific annotations of interest to downstream analysis. Key Steps § Quality score filtering. § Masking and annotation of primer sequences. Extract UMI barcodes. § Build UMI consensus sequences. § Paired-end assembly. § Removal of duplicate sequences. 14
p. RESTO: Raw read processing # Remove reads with mean Q < 20 Filter. Seq. py quality -s read-1. fastq -q 20 Filter. Seq. py quality -s read-2. fastq -q 20 # Remove primer region and annotate reads with UMI barcode Mask. Primers. py score -s read-1_quality-pass. fastq -p cprimers. fasta --start 0 --mode cut --outname read-1 Mask. Primers. py score -s read-2_quality-pass. fastq -p ts. fasta --start 17 --barcode --mode cut --outname read-2 # Copy UMI annotation from read 2 to read 1 Pair. Seq. py -1 read-1_primers-pass. fastq -2 read-2_primers-pass. fastq --coord illumina --2 f BARCODE http: //presto. readthedocs. io 15
p. RESTO: Raw read processing # Generate UMI consensus sequences Build. Consensus. py -s read-1_primers-pass_pair-pass. fastq --prcons 0. 6 --maxerror 0. 1 --outname read-1 Build. Consensus. py -s read-2_primers-pass_pair-pass. fastq --maxerror 0. 1 --outname read-2 # Assemble paired-end consensus sequences Pair. Seq. py -1 read-1_consensus-pass. fastq -2 read-2_consensus-pass. fastq --coord presto Assemble. Pairs. py sequential -1 read-2_consensus-pass_pair-pass. fastq -2 read-1_consensus-pass_pair-pass. fastq --coord presto --rc tail -r ighv. fasta --aligner blastn --1 f CONSCOUNT PRCONS --outname reads # Remove duplicates Collapse. Seq. py -s reads_assemble-pass. fastq -n 10 --inner --uf PRCONS --cf CONSCOUNT --act sum --outname reads # Filter to sequences with at least 2 representative reads Split. Seq. py group -s reads_collapse-unique. fastq -f CONSCOUNT --num 2 --fasta # Final output file reads_collapse-unique_atleast-2. fasta http: //presto. readthedocs. io 16
QC Concerns § Error rate from alignment of Phi. X reads (unindexed) against reference genome (SAV files). § Quality scores in the 10 -25 range. § Primer identification. § Reads per UMI. § Assembly rates and overlaps. 17
V(D)J ANNOTATION AND GENOTYPING 18
V(D)J segment annotation Goals § Obtain reads annotated with V(D)J alleles and CDR 3 sequences. Key Steps § V(D)J reference alignment. § Data standardization. § Genotyping and novel allele detection. http: //changeo. readthedocs. io 19
Ig. BLAST + Change-O: V(D)J reference alignment # Import Ig. BLAST and IMGT germline databases fetch_igblastdb. sh -o /data/igblast fetch_imgtdb. sh -o /data/imgt 2 igblast. sh -i /data/imgt -o /data/igblast # Run Ig. BLAST run_igblast. sh -s reads_unique_atleast-2. fasta -d /data/igblast # Convert Ig. BLAST output to the Immcantation standard Make. Db. py igblast -i reads_unique_atleast-2. fmt 7 -s reads_unique_atleast-2. fasta -r /data/imgt/human/vdj --regions --scores --outname data # Filter to heavy chains sequences Parse. Db. py select -d data_db-pass. tab -f V_CALL -u IGHV --regex --outname igh # Final output file igh_parse-select. tab https: //bitbucket. org/kleinstein/immcantation/src/tip/scripts 20
TIg. GER: SNP detection and genotyping # Load data into R library(alakazam) library(tigger) db <- read. Changeo. Db("igh_parse-select. tab") ighv <- read. Ig. Fasta("/data/imgt/human/vdj/imgt_human_IGHV. fasta") # Identify polymorphisms and genotype nv <- find. Novel. Alleles(db, germline_db=ighv) gt <- infer. Genotype(db, germline_db=ighv, novel_df=nv) http: //tigger. readthedocs. io Gadala-Maria et al. PNAS. 2015. 21
TIg. GER: SNP detection and genotyping # Save genotype > gtseq <- genotype. Fasta(gt, ighv, nv) > write. Fasta(gtseq, "v_genotype. fasta") # Plot genotype > ggsave("genotype. pdf", plot. Genotype(gt, silent=T)) # Modify allele calls and write db > db <- cbind(db, reassign. Alleles(db, gtseq)) > write. Change. Db("igh_genotyped. tab") http: //tigger. readthedocs. io Gadala-Maria et al. PNAS. 2015. 22
CLONAL ASSIGNMENT AND DIVERSITY 23
Clonal diversity analysis Goals § Group sequences into clonally related lineages. § Analyze clone size distributions. Key Steps § Determine clonal clustering threshold. § Assign clonal groups. § Reconstruct germline sequences. § Calculate diversity metrics. 24
Clonal assignment using CDR 3 as a fingerprint Gupta et al. J Immunol. 2017. Nouri et al. In preparation. 25
SHaza. M: Tuning of the clonal assignment threshold # Load data db <- data. frame(read. Changeo. Db("igh_genotyped. tab")) # Determine threshold of distance to nearest neighbors db <- dist. To. Nearest(db, model="ham", normalize="len", v. Call. Column="V_CALL_GENOTYPED") x <- find. Threshold(db$DIST_NEAREST, method="gmm") # Plot distance-to-nearest distributions with model fit ggsave("threshold. pdf", plot(x, silent=T)) http: //shazam. readthedocs. io 26
Change-O: Clonal assignment and germline reconstruction # Assign clonal groups Define. Clones. py bygroup -d igh_genotyped. tab --model ham --norm len --dist 0. 14 --outname igh # Add full length germline sequences Create. Germlines. py -d igh_clone-pass. tab -r /data/imgt/human/vdj/*IGH[DJ]. fasta v_genotype. fasta -g dmask --cloned --vf V_CALL_GENOTYPED --outname igh # Final output file igh_germ-pass. tab http: //changeo. readthedocs. io 27
Alakazam: Clonal abundance and diversity # Load and subset data db <- read. Changeo. Db("igh_germ-pass. tab") db <- subset(db, PRCONS %in% c("Ig. M", "Ig. G", "Ig. A")) # Calculate and plot rank-abundance curve a <- estimate. Abundance(db, group="PRCONS") ggsave("abundance. pdf", plot. Abundance(a, silent=T)) # Generate Hill diversity curve d <- rarefy. Diversity(db, group="PRCONS") ggsave(“diversity. pdf", plot. Diversity. Curve(d, silent=T)) http: //alakazam. readthedocs. io Vander Heiden et al. J Immunol. 2017. Di Niro et al. Immunity. 2015. Stern, Yaari, Vander Heiden et al. Sci Trans Med. 2014. 28
Alakazam: Physicochemical properties and gene usage # Calculate CDR 3 amino acid properties p <- amino. Acid. Properties(db, seq="JUNCTION", nt=T, trim=T, label="CDR 3") http: //alakazam. readthedocs. io Vander Heiden et al. J Immunol. 2017. # V family usage by isotype and clone v <- count. Genes(db, gene="V_CALL_GENOTYPED", groups="PRCONS", clone="CLONE", mode="family") Di Niro et al. Immunity. 2015. Stern, Yaari, Vander Heiden et al. Sci Trans Med. 2014. 29
Alakazam: Lineage reconstruction and topology analysis # Build tree from a single clone x <- make. Changeo. Clone(subset(db, CLONE == 2000), text_fields="PRCONS") g <- build. Phylip. Lineage(x, dnapars="/usr/local/bin/dnapars") ggsave("tree. pdf", plot(g, layout=layout_as_tree, vertex. label=V(g)$PRCONS)) Stern, Yaari, Vander Heiden et al. Sci Transl Med. 2014. Gupta, Vander Heiden et al. Bioinformatics. 2015 30
MUTATIONS, SHM TARGETING, AND SELECTION 31
SHaza. M: Mutational load # Load data db <- data. frame(read. Changeo. Db("igh_germ-pass. tab")) db <- subset(db, PRCONS %in% c("Ig. M", "Ig. G", "Ig. A")) # Calculate total mutation frequency x <- observed. Mutations(db, region. Definition=NULL, frequency=T) http: //shazam. readthedocs. io Yaari et al. Front Immunol. 2013. Vander Heiden et al. J Immunol. 2017. Di Niro et al. Immunity. 2015. 32
SHaza. M: Models of SHM targeting biases # Build and plot SHM targeting model m <- create. Targeting. Model(db) pdf("mutability. pdf") plot. Mutability(m, nucleotides="A") plot. Mutability(m, nucleotides="C") plot. Mutability(m, nucleotides="G") plot. Mutability(m, nucleotides="T") dev. off() http: //shazam. readthedocs. io Yaari et al. Front Immunol. 2013. Vander Heiden et al. J Immunol. 2017. Di Niro et al. Immunity. 2015. 33
SHaza. M: Quantification of selection pressure # Calculate clonal consensus and selection by clone z <- collapse. Clones(db) b <- calc. Baseline(z, region. Definition=IMGT_V) # Combine selection scores for isotypes g <- group. Baseline(b, group. By="PRCONS") # Plot densities ggsave("selection. pdf", plot. Baseline. Density(g, "PRCONS", sigma. Limits=c(-1, 1), silent=T)) http: //shazam. readthedocs. io Yaari et al. Front Immunol. 2013. 34
ACKNOWLEDGEMENTS Kleinstein Lab Yale School of Medicine Funding § Steven Kleinstein § Kevin O’Connor § Gruber Science Fellowship § Namita Gupta § Joel Stern § NLM Grant T 15 LM 07056 § Mohamed Uduman § David Hafler § NIH Grant R 01 AI 104739 § Daniel Gadala-Maria University of Pittsburg § Susanna Marquez § Roberto Di Niro § US-Israel BSF Grant 2009046 § Julian Zhou § Mark Shlomchik § Nima Nouri Juno Therapeutics, Inc. Bar-Ilan § Gur Yaari § Francois Vigneault § Sonia Timberlake § David Koppstein 35
- Slides: 35