NGS data analysis in R Biostrings and Shortread
NGS data analysis in R Biostrings and Shortread Stacy Xu BD
NGS analysis n Sequencing analysis n Functionally n n n String manipulations NGS formats (sequences, intervals) Statistical model testing Graphical data representation Knowledgably n n n Large amount of raw data sets Large amount of annotations Database connections
NGS related bioconductor packages n String and interval packages n Biostrings (Herve Pages) n n Genomic. Ranges (P. Aboyoun) n n HT short-read sequences girafe (J. Toedling) n n Wrap of samtools, bcftools, tabix Short. Read (Martin Morgan) n n Genomic intervals representation Rsamtools (Martin Morgan) n n Biological string objects & Matching algorithms Genomic intervals and read alignments Annotations n Genomic. Features (M. Carlson) n n BSgenomes (Herve Pages) n n Transcript centric annotations from UCSC & Bio. Mart Biostrings-based genome annotations rtracklayer (Michael Lawrence) n Genome browsers and their annotation tracks
NGS work flow n n Biological sample/library preparation Sequencing process Sequence alignment Data interpretation n n Input sequencing data n Fasta (sequence) & fastq (sequence + qual) files n BAM & SAM files (reads with header, alignments and references) Analysis n QA, alignment, coverage, identification, etc n Data representation n Plotting coverage, quality, etc
Bio. Strings --Genomic data retrieval n n Load from BSgenome n library(BSgenome) n available. genomes() Download related files from NCBI n. fna files (whole genomic sequence) n. rnt files (rna positions) n. faa files (protein sequences in fasta format) n. ffn files (protein coding portions) n. frn files (rna coding portions) n. gbk files (genome, genbank file format ) n. gff files (genome features)
Biostrings --Create objects Containers n XString – DNA, RNA, AA n XString. Set – multiple sequences n XString. Views n Create from fasta file n n Create from scratch n Load from packages
Biostrings --Basic functions n String manipulations n Base manipulations
Bio. Strings --Pattern matching methods n (v)match. PDict n Match one or more patterns with one or more strings – not with indels, allow mismatches n (v)match. Pattern n Match one pattern with one or more strings – with indels, allow mismatches n pairwise. Alignment n Align two sequences – with indels n match. PWM n Position specific matrix matching for motif matching n match. Probe. Pair n Primer pair matching – not allow mismatches
Bio. Strings -- Pattern matching examples
Bio. Strings -- Pattern matching examples
Bio. Strings --Pattern matching examples n Primer pair matching
Bio. Strings --Pattern matching examples n Motif matching
Short. Read --Load sequencing data n library(Short. Read) n fastq = read. Fastq(fastq. File) n seq. ID = id(fastq) n seqs = sread(fastq) n qual. Seq = quality(fastq) n total. Reads = length(fastq) # [1] 7202568
Short. Read --Bam header n bam = scan. Bam(bam. Loc)[[1]] n names(bam) n n # # [1] "qname" [9] "mrnm" "flag" "mpos" "rname" "isize" "strand" "pos" "seq" "qual” "qwidth" "mapq" "cigar" n scan. Bam. Header(bam. Loc) n # # # # # $`C: Mi. Seq_Ecoli_DH 10 B_110721_PF. bam`$targets Ecoli. DH 10 B. fa 4686137 $`C: Mi. Seq_Ecoli_DH 10 B_110721_PF. bam`$text$`@HD` [1] "VN: 1. 3" "SO: coordinate“ $`C: Mi. Seq_Ecoli_DH 10 B_110721_PF. bam`$text$`@PG` [1] "ID: Illumina. Secondary. Analysis. Sorted. To. Bam. Converter“ $`C: Mi. Seq_Ecoli_DH 10 B_110721_PF. bam`$text$`@SQ` [1] "SN: Ecoli. DH 10 B. fa" "LN: 4686137“ [3] "M 5: 28 d 8562 f 2 f 99 c 047 d 792346835 b 20031“ $`C: Mi. Seq_Ecoli_DH 10 B_110721_PF. bam`$text$`@RG` [1] "ID: _5_1" "PL: ILLUMINA" "SM: DH 10 B_Sample 1"
Short. Read --Retrieve information from bam files n cseq = as. character(bam$seq) n cig = bam$cigar n head(cig, 2) n # [1] "150 M" n qual = bam$qual n head(qual, 2) n # A Phred. Quality instance of length 6 # width seq # [1] 150 @@CDFFFFHHHHHJJJJIJJJJJIJIJFI. . . DDD>CDCCDCDDEDDDDDCDCD@CCDCDCCCDD # [2] 150 A? 34(@: C>: 4 CCC@CA 9&)&0((34: 4(4(33+. . . BFA 3 C, IHHFFA<GIF@GAEFBFHDDDADD? ? @ n qname = bam$qname n head(qname, 2) n # [1] "_5: 1: 1: 23848: 21362" "_5: 1: 9: 8728: 9854" n rname = as. character(bam$rname) n head(rname, 2) n # [1] Ecoli. DH 10 B. fa
Short. Read --BAM QC n aln = read. Aligned(bam. Loc, type="BAM")
Short. Read --Filter fastq reads n filter 1 <- n. Filter(threshold=3) # keep only reads with fewer than 3 Ns n filter 2 <- polyn. Filter(threshold=20, nuc=c("A", "C", "T", "G")) # remove reads with 20 or more of the same letter n filter <- compose(filter 1, filter 2) # Combine filters into one n filtered. Reads <- fastq[filter(seqs)] # apply filter to sequences, and use this to remove "bad" reads n write. Fastq(filtered. Reads, output. File)
Summary n R contains the basic facilities that is needed for NGS analysis n Fast string manipulation functions are enabled in R n For large NGS experiments, other software with faster speed would be preferred n R is great tool for statistical summaries
References n Patrick Aboyoun, Sequence Alignment of Short Read n n Data using Biostrings, Nov 2009 Martin, Morgan etc, High-throughput sequence analysis with R and Bioconductor, Aug, 2011 Bioconductor at http: //bioconductor. org Part of the R code was derived from Perry Haaland Frances Tong’s work at BD, Technologies The part of PWM matching and bam QC comes from http: //manuals. bioinformatics. ucr. edu/home/htseq#TOC-Biostrings
- Slides: 19