NGS data analysis in R Biostrings and Shortread

NGS data analysis in R Biostrings and Shortread Stacy Xu BD

NGS analysis n Sequencing analysis n Functionally n n n String manipulations NGS formats (sequences, intervals) Statistical model testing Graphical data representation Knowledgably n n n Large amount of raw data sets Large amount of annotations Database connections

NGS related bioconductor packages n String and interval packages n Biostrings (Herve Pages) n n Genomic. Ranges (P. Aboyoun) n n HT short-read sequences girafe (J. Toedling) n n Wrap of samtools, bcftools, tabix Short. Read (Martin Morgan) n n Genomic intervals representation Rsamtools (Martin Morgan) n n Biological string objects & Matching algorithms Genomic intervals and read alignments Annotations n Genomic. Features (M. Carlson) n n BSgenomes (Herve Pages) n n Transcript centric annotations from UCSC & Bio. Mart Biostrings-based genome annotations rtracklayer (Michael Lawrence) n Genome browsers and their annotation tracks

NGS work flow n n Biological sample/library preparation Sequencing process Sequence alignment Data interpretation n n Input sequencing data n Fasta (sequence) & fastq (sequence + qual) files n BAM & SAM files (reads with header, alignments and references) Analysis n QA, alignment, coverage, identification, etc n Data representation n Plotting coverage, quality, etc

Bio. Strings --Genomic data retrieval n n Load from BSgenome n library(BSgenome) n available. genomes() Download related files from NCBI n. fna files (whole genomic sequence) n. rnt files (rna positions) n. faa files (protein sequences in fasta format) n. ffn files (protein coding portions) n. frn files (rna coding portions) n. gbk files (genome, genbank file format ) n. gff files (genome features)

Biostrings --Create objects Containers n XString – DNA, RNA, AA n XString. Set – multiple sequences n XString. Views n Create from fasta file n n Create from scratch n Load from packages

Biostrings --Basic functions n String manipulations n Base manipulations

Bio. Strings --Pattern matching methods n (v)match. PDict n Match one or more patterns with one or more strings – not with indels, allow mismatches n (v)match. Pattern n Match one pattern with one or more strings – with indels, allow mismatches n pairwise. Alignment n Align two sequences – with indels n match. PWM n Position specific matrix matching for motif matching n match. Probe. Pair n Primer pair matching – not allow mismatches

Bio. Strings -- Pattern matching examples

Bio. Strings --Pattern matching examples n Primer pair matching

Bio. Strings --Pattern matching examples n Motif matching

Short. Read --Load sequencing data n library(Short. Read) n fastq = read. Fastq(fastq. File) n seq. ID = id(fastq) n seqs = sread(fastq) n qual. Seq = quality(fastq) n total. Reads = length(fastq) # [1] 7202568

Short. Read --Bam header n bam = scan. Bam(bam. Loc)[[1]] n names(bam) n n # # [1] "qname" [9] "mrnm" "flag" "mpos" "rname" "isize" "strand" "pos" "seq" "qual” "qwidth" "mapq" "cigar" n scan. Bam. Header(bam. Loc) n # # # # # $`C: Mi. Seq_Ecoli_DH 10 B_110721_PF. bam`$targets Ecoli. DH 10 B. fa 4686137 $`C: Mi. Seq_Ecoli_DH 10 B_110721_PF. bam`$text$`@HD` [1] "VN: 1. 3" "SO: coordinate“ $`C: Mi. Seq_Ecoli_DH 10 B_110721_PF. bam`$text$`@PG` [1] "ID: Illumina. Secondary. Analysis. Sorted. To. Bam. Converter“ $`C: Mi. Seq_Ecoli_DH 10 B_110721_PF. bam`$text$`@SQ` [1] "SN: Ecoli. DH 10 B. fa" "LN: 4686137“ [3] "M 5: 28 d 8562 f 2 f 99 c 047 d 792346835 b 20031“ $`C: Mi. Seq_Ecoli_DH 10 B_110721_PF. bam`$text$`@RG` [1] "ID: _5_1" "PL: ILLUMINA" "SM: DH 10 B_Sample 1"

Short. Read --Retrieve information from bam files n cseq = as. character(bam$seq) n cig = bam$cigar n head(cig, 2) n # [1] "150 M" n qual = bam$qual n head(qual, 2) n # A Phred. Quality instance of length 6 # width seq # [1] 150 @@CDFFFFHHHHHJJJJIJJJJJIJIJFI. . . DDD>CDCCDCDDEDDDDDCDCD@CCDCDCCCDD # [2] 150 A? 34(@: C>: 4 CCC@CA 9&)&0((34: 4(4(33+. . . BFA 3 C, IHHFFA<GIF@GAEFBFHDDDADD? ? @ n qname = bam$qname n head(qname, 2) n # [1] "_5: 1: 1: 23848: 21362" "_5: 1: 9: 8728: 9854" n rname = as. character(bam$rname) n head(rname, 2) n # [1] Ecoli. DH 10 B. fa

Short. Read --BAM QC n aln = read. Aligned(bam. Loc, type="BAM")

Short. Read --Filter fastq reads n filter 1 <- n. Filter(threshold=3) # keep only reads with fewer than 3 Ns n filter 2 <- polyn. Filter(threshold=20, nuc=c("A", "C", "T", "G")) # remove reads with 20 or more of the same letter n filter <- compose(filter 1, filter 2) # Combine filters into one n filtered. Reads <- fastq[filter(seqs)] # apply filter to sequences, and use this to remove "bad" reads n write. Fastq(filtered. Reads, output. File)

Summary n R contains the basic facilities that is needed for NGS analysis n Fast string manipulation functions are enabled in R n For large NGS experiments, other software with faster speed would be preferred n R is great tool for statistical summaries

References n Patrick Aboyoun, Sequence Alignment of Short Read n n Data using Biostrings, Nov 2009 Martin, Morgan etc, High-throughput sequence analysis with R and Bioconductor, Aug, 2011 Bioconductor at http: //bioconductor. org Part of the R code was derived from Perry Haaland Frances Tong’s work at BD, Technologies The part of PWM matching and bam QC comes from http: //manuals. bioinformatics. ucr. edu/home/htseq#TOC-Biostrings