DNA sequence analysis IT Carlow Bioinformatics October 2006
A, T/U, C, G
• Simple code, lots of sequence
• Sequence analysis
  – Computer intensive
    • BLAST homology searching
    • Gene/exon prediction
    • Multiple sequence alignment
    • Alignments in general
  – “Trivial”
Trivial
• Could be done by hand
  – Computers
    • Quicker
    • More reliable
• Examples
  – Translate DNA
  – Restriction sites
  – Synonymous codon usage
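As an illustration of how "trivial" the synonymous codon usage example is, the following is a minimal sketch that counts in-frame codons and reports each codon's share among the codons for the same amino acid. The function names and the example sequence are made up for illustration; the codon table is the standard code written in the usual T, C, A, G order.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(cds):
    """Count in-frame codons; skip a trailing partial codon and ambiguous bases."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def synonymous_fractions(cds):
    """Fraction of each codon among all codons seen for the same amino acid."""
    counts = codon_counts(cds)
    per_amino = defaultdict(int)
    for codon, n in counts.items():
        per_amino[CODON_TABLE[codon]] += n
    return {codon: n / per_amino[CODON_TABLE[codon]] for codon, n in counts.items()}


example_cds = "CTGCTGCTGCTAGAAGAG"  # hypothetical coding sequence, not real data
fractions = synonymous_fractions(example_cds)
for codon in sorted(fractions):
    print(codon, CODON_TABLE[codon], round(fractions[codon], 2))
```

Dedicated programs such as codonw report more elaborate statistics; this sketch only shows the counting step.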
Sequence formats
• Fasta format
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
• Phylip format
  4 131
IXI_234    TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_235    TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_236    TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT
IXI_237    TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT
• CLUSTAL W(1.4) multiple sequence alignment
IXI_234    TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235    TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236    TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237    TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT
• http://thr.cit.nih.gov/molbio/readseq/
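Reading the Fasta format takes only a few lines of code. The sketch below assumes the simple convention shown on the slide: a header line starting with ">" followed by one or more sequence lines. The function name parse_fasta is made up for illustration, and only the first two sequence lines of the slide's example are used.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without newlines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
"""
records = parse_fasta(example)
print(records[0][0])
print(len(records[0][1]), "residues")
```

For converting between formats, a tool such as readseq is the practical choice; the parser above is only meant to show how little structure the format has.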
DNA sequence analysis
• Look for EMBOSS
  – A suite of programs with the same look and feel
  – http://bioweb.pasteur.fr/seqanal/dna/intro-uk.html
Translation
• DNA is anti-parallel.
  – One strand 5’-3’ matches the complementary strand 3’-5’
  – Translation and transcription always run 5’-3’
• Six possible translations, three on each strand
• ATGCCCGCATTTGAATAA  Frameshift errors
• ATGCCCGCATTTGAATAA  Frameshift mutations
• Stop codons underlined
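The six possible translations can be sketched as follows, using the slide's example sequence. This is a minimal illustration under the standard genetic code; the function names and the frame labels (+1 to +3, -1 to -3) are conventions chosen here, not taken from the slides.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first base; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate three frames on each strand."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGCCCGCATTTGAATAA")
for name, protein in frames.items():
    print(name, protein)
```

In this example frame +1 ends at the TAA stop, while frame +3 runs into an internal TGA stop, which is the kind of signal a frameshift produces.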
Genetic code
The “Universal” Genetic Code.
Phe UUU   Ser UCU   Tyr UAU   Cys UGU
    UUC       UCC       UAC       UGC
Leu UUA       UCA   ter UAA   ter UGA
    UUG       UCG   ter UAG   Trp UGG
Leu CUU   Pro CCU   His CAU   Arg CGU
    CUC       CCC       CAC       CGC
    CUA       CCA   Gln CAA       CGA
    CUG       CCG       CAG       CGG
Ile AUU   Thr ACU   Asn AAU   Ser AGU
    AUC       ACC       AAC       AGC
    AUA       ACA   Lys AAA   Arg AGA
Met AUG       ACG       AAG       AGG
Val GUU   Ala GCU   Asp GAU   Gly GGU
    GUC       GCC       GAC       GGC
    GUA       GCA   Glu GAA       GGA
    GUG       GCG       GAG       GGG
Exceptions to the code
• #1: Yeast Mitochondrial Code: CUN=T AUA=M UGA=W
• #2: Mitochondrial Code of Vertebrates: AGR=* AUA=M UGA=W
• #3: Mitochondrial Code of Filamentous fungi: UGA=W
• #4: Mitochondrial Code of Insects and platyhelminths: AUA=M UGA=W AGR=S
• #5: Nuclear Code of Candida cylindracea: CUG=S (*)
• #6: Nuclear Code of Ciliata: UAR=Q
• #7: Nuclear Code of Euplotes: UGA=C
• #8: Mitochondrial Code of Echinoderms: UGA=W AGR=S AAA=N
• #9: Mitochondrial Code of Ascidaceae: UGA=W AGR=G AUA=M
• #10: Mitochondrial Code of Platyhelminthes: UGA=W AGR=S UAA=Y AAA=N
• #11: Nuclear Code of Blepharisma: UAG=Q (*)
(*) see Nature 341:164
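One way to handle these exceptions in code is to treat each variant as a small set of overrides on the standard table. The sketch below does this for exceptions #1 and #2 from the list; the names make_table, VERTEBRATE_MITO and YEAST_MITO are made up for illustration, and the test sequences are artificial.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Overrides taken from the list above (written with T instead of U).
VERTEBRATE_MITO = {"AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"}
YEAST_MITO = {"CTT": "T", "CTC": "T", "CTA": "T", "CTG": "T", "ATA": "M", "TGA": "W"}


def make_table(overrides):
    """Copy the standard table and apply the codon reassignments."""
    table = dict(STANDARD)
    table.update(overrides)
    return table


def translate(seq, table=STANDARD):
    """Translate complete codons; accepts DNA or RNA letters."""
    seq = seq.upper().replace("U", "T")
    return "".join(table.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


artificial = "ATGATATGAAGA"  # ATG ATA TGA AGA, chosen to hit the reassigned codons
print(translate(artificial))
print(translate(artificial, make_table(VERTEBRATE_MITO)))
```

The same sequence reads as Met-Ile-stop-Arg under the standard code but as Met-Met-Trp-stop with the vertebrate mitochondrial code, so choosing the wrong table changes both the protein and its apparent length.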
Start codons
• ATG is the “universal” start codon
  – … but 10% of E. coli genes start with GTG
  – 1% start with TTG
• Bioinformaticians only make predictions
• Molecular biologists verify
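The effect of allowing alternative start codons can be seen with a simple open reading frame scan. This is a minimal sketch that looks only at the forward strand and reports the first start codon in each frame up to the next in-frame stop; the function name, the min_codons parameter and the test sequence are assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, starts=("ATG",), min_codons=2):
    """Return (start, end) positions of forward-strand ORFs, end exclusive."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i  # first start codon since the previous stop
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


artificial = "GTGAAACCCTAGATGGGGTTTTGA"  # made-up sequence with a GTG and an ATG start
atg_only = find_orfs(artificial)
with_alternatives = find_orfs(artificial, starts=("ATG", "GTG", "TTG"))
print(atg_only)
print(with_alternatives)
```

With ATG alone the first reading frame is missed entirely, which is why such output should be read as a prediction to be verified at the bench.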
Restriction sites
• Essential for the construction of plasmids
• A key tool for molecular biology
• Hundreds available commercially
  – Need to decide which to order
  – Costs range from $3.80 to $500 per 1000 units
• Recognition sites (’ marks the cut position)
  BamHI   5'G’GATCC   3'CCTAG’G
  EcoRI   5'G’AATTC   3'CTTAA’G
  AluI    5'AG’CT     3'TC’GA
• http://tools.neb.com/NEBcutter2/index.php
• Usually need an enzyme that cuts once
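Finding enzymes that cut once is a plain string search. The sketch below uses the three enzymes from the slide, with the cut offset read from the 5' strand notation (for example G’GATCC cuts after the first base). It assumes a linear sequence and relies on all three sites being palindromic, so only one strand needs searching; the function names and the test sequence are made up for illustration.

```python
# Recognition site and cut offset on the 5'-3' strand, as shown on the slide.
ENZYMES = {
    "BamHI": ("GGATCC", 1),
    "EcoRI": ("GAATTC", 1),
    "AluI": ("AGCT", 2),
}


def cut_positions(seq, site, offset):
    """Return every cut position for one site, allowing overlapping matches."""
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = seq.find(site, start + 1)
    return cuts


def digest_map(seq):
    """Map each enzyme name to its list of cut positions."""
    return {name: cut_positions(seq, site, off) for name, (site, off) in ENZYMES.items()}


def single_cutters(seq):
    """Enzymes that cut the sequence exactly once."""
    return [name for name, cuts in digest_map(seq).items() if len(cuts) == 1]


artificial = "AAGGATCCTTAGCTGGAGCTCC"  # made-up sequence
print(digest_map(artificial))
print(single_cutters(artificial))
```

A tool such as NEBcutter does the same search against the full commercial enzyme list and also handles circular sequences and non-palindromic sites.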
Promoter Prediction
• To find the start of the transcript (97% of the human genome is not coding)
• False positive rate too high
  – Predicted: 1 per kb; reasonable: 1 per 30 kb
• RNA pol II transcribes DNA to RNA
  – Needs general transcription factors (GTFs)
• Also specific (species, tissue, developmental stage) TFs
• TF binding sites are short and “fuzzy”
• 7% of vertebrate genes encode TFs
Promoters 2
NF-AT 4 matrix (3 known sites) and consensus:
A           0 0 3 3 3 0 0 1
C           1 2 0 0 0 0 0 2
G           0 0 0 0 0 1 1 0
T           2 1 0 0 0 2 2 0
Consensus   T C A A A T T C
Predicts five sites in 3 kb of human IL-11:
bp 007    TTAAAGGC
bp 248    ACAAATTC
bp 1959   GAGTTTGA
bp 2154   TCAAAGGA
bp 2181   GACTTTTA
Ask if a TF site relevant to your cell type is present.
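A count matrix like this one can be turned into a site scanner in a few lines. The sketch below scores each 8-base window by summing the matrix counts for the observed bases and reports windows at or above a threshold on either strand. The simple sum-of-counts score, the threshold and the function names are assumptions for illustration; the slides do not say which scoring scheme produced the five predicted sites.

```python
# Count matrix from the slide: rows are bases, columns are site positions 1-8.
COUNTS = {
    "A": [0, 0, 3, 3, 3, 0, 0, 1],
    "C": [1, 2, 0, 0, 0, 0, 0, 2],
    "G": [0, 0, 0, 0, 0, 1, 1, 0],
    "T": [2, 1, 0, 0, 0, 2, 2, 0],
}
WIDTH = 8
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]


def consensus():
    """Most frequent base at each position."""
    return "".join(max("ACGT", key=lambda b: COUNTS[b][i]) for i in range(WIDTH))


def score(window):
    """Sum of matrix counts for the bases in one window; unknown bases score 0."""
    return sum(COUNTS[b][i] for i, b in enumerate(window.upper()) if b in COUNTS)


def scan(seq, threshold):
    """Return (start, strand, score) for windows scoring at least the threshold."""
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for j in range(len(s) - WIDTH + 1):
            value = score(s[j:j + WIDTH])
            if value >= threshold:
                # report reverse-strand hits in forward-strand coordinates
                start = j if strand == "+" else len(s) - j - WIDTH
                hits.append((start, strand, value))
    return sorted(hits)


print(consensus(), score(consensus()))
print(scan("GGGTCAAATTCGGG", threshold=17))
```

Because the matrix is built from only three known sites, lowering the threshold quickly admits many windows, which is one source of the high false positive rate in promoter prediction.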
Primer design
• You will be asked to design primers for sequencing, PCR, etc.
• Manual pages cover this
• Computationally trivial, so there is a wide choice of websites
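To give a sense of why primer checks are computationally trivial, here is a minimal sketch of two common ones: GC content and a rough melting temperature. The melting temperature uses the Wallace rule of thumb, Tm = 2(A+T) + 4(G+C), which is an assumption brought in for illustration and is not taken from the slides; real primer design tools use more detailed models. The function names are made up.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2 per A/T plus 4 per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


candidate = "ATGCCCGCATTTGAATAA"  # the translation example sequence, reused as a stand-in primer
print(round(gc_fraction(candidate), 2), wallace_tm(candidate))
```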
Trivial but time-consuming
• Genome trawls for repeats
  – LINEs
  – SINEs
  – Microsatellites
  – Masking genomic sequence prior to gene finding
• Codon usage
  – Codons, codonw, gcua
Not trivial
• Nucleic acid (NA) secondary structure
  – EMBOSS einverted for short palindromes
  – http://bioweb.pasteur.fr/seqanal/interfaces/einverted.html
• Huge database of 16S rRNA structures
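The simplest case of secondary structure, a perfect inverted repeat that could form a hairpin stem, can be found by brute force. The sketch below is not the einverted algorithm, which also allows mismatches and gaps; it only looks for an arm of fixed length followed, after a short loop, by its exact reverse complement. The function name, arm length and loop limit are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]


def inverted_repeats(seq, arm=5, max_loop=6):
    """Return (left_start, right_start, left_arm, loop_length) for perfect stems."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * arm + 1):
        left = seq[i:i + arm]
        target = reverse_complement(left)  # what the right arm must read, 5' to 3'
        for loop in range(max_loop + 1):
            j = i + arm + loop
            if seq[j:j + arm] == target:
                hits.append((i, j, left, loop))
    return hits


artificial = "TTGGCACAAAAGTGCCTT"  # made-up hairpin: GGCAC, loop AAAA, GTGCC
print(inverted_repeats(artificial))
```

Predicting which of many possible pairings a real molecule adopts is the hard part, and that is what makes secondary structure non-trivial.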
Secondary Structure
• DNA (and RNA) can form base-pairs.
• Not all of these are with complementary strands.
[Figure panels: “Bioinformatic view = a cartoon”; “Closer to reality”]
16S rRNA
[Figure panels: Gram -ve; Gram +ve]
• Evolutionary consequences?
  – Coordinated/dependent mutational change
RDP
• Ribosomal Database Project-II Release 9 Notes
• RDP Release 9.42 (Release 9, update 42) consists of 262,030 aligned and annotated 16S rRNA sequences, along with five online analysis tools. Update 42 was released on Sep 14, 2006.