Bioinformatics: an overview
Ming-Jing Hwang (黃明經)
Institute of Biomedical Sciences, Academia Sinica
http://gln.ibms.sinica.edu.tw/

The human genome project Year 2001

Promises
- “More will happen in biology in the next 10 years than in the past 50” (Craig Venter, Celera Genomics).
- “We should be able to uncover the major hereditary contributions to common illnesses like diabetes and mental illness, probably in the next three to five years” (Francis Collins, head of HGP).

Genetics & Genomics From DNA to population Source: gsk

What makes us human?
- The difference between you and a chimp is ~1.24%
- The difference between you and Maggie is ~0.1%

Hunting for disease genes Source: gsk

Genes and Diseases Penetrance: the likelihood that a person carrying a particular mutant gene will have an altered phenotype Source: gsk
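As a rough illustration of the penetrance definition above, the sketch below estimates penetrance as the fraction of mutation carriers who show the altered phenotype. The function name and the counts are hypothetical and are not taken from the slides.

```python
def penetrance(carriers_affected: int, carriers_total: int) -> float:
    """Estimate penetrance as affected carriers divided by all carriers."""
    if carriers_total <= 0:
        raise ValueError("carriers_total must be positive")
    if not 0 <= carriers_affected <= carriers_total:
        raise ValueError("carriers_affected must lie between 0 and carriers_total")
    return carriers_affected / carriers_total


# Hypothetical cohort: 100 carriers of a mutant gene, 80 with the altered phenotype.
estimate = penetrance(80, 100)
print(f"Estimated penetrance: {estimate:.0%}")
```

A value below 100% means that some carriers of the mutant gene do not show the altered phenotype.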

Phenotype and genotype
- Many different genotypes can have the same phenotype
- Many genotypes do not change the phenotype
- One phenotype could be due to many different genotypes (statistical genetics)

The common variant – common disease (CV-CD) hypothesis It is believed that most polygenic contributions to disease susceptibility will arise from variants that are relatively common in the susceptible population.

Genetic variations
- SNPs constitute 90% of human genetic variation
- Other forms of variation include insertions, deletions, and differences in the copy number of tandem repeats or large genomic segments, etc.

Three phases of human genome sequencing
- The genome map (draft in 2001, “finished” in 2003)
- The SNP map (TSC, 2001)
- The haplotype map (HapMap, 2005)

Source: gsk

Pharmacogenomics: 8/2 (Tue) 10:00 pm on PTS (CH 13)

(Nature, 2004) (PNAS, 2005)

Common SNPs Kruglyak & Nickerson, 2001

dbSNP summary (NCBI build 124), July 2005

Haplotype structure of the human genome Goldstein, 2001

Rationale of HapMap
In a given population, 55 percent of people may have one version of a haplotype, 30 percent may have another, 8 percent may have a third, and the rest may have a variety of less common haplotypes. The International HapMap Project is identifying these common haplotypes in four populations from different parts of the world. It is also identifying "tag" SNPs that uniquely identify these haplotypes. By testing an individual's tag SNPs (a process known as genotyping), researchers will be able to identify the collection of haplotypes in a person's DNA. The number of tag SNPs that contain most of the information about the patterns of genetic variation is estimated to be about 300,000 to 600,000, which is far fewer than the 10 million common SNPs.
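The tag SNP idea can be sketched with toy data. The haplotype names and sequences below are hypothetical (they are not HapMap data), and the brute-force search is only practical for a handful of sites. The sketch also assumes phased haplotypes, whereas real genotyping yields unphased diploid genotypes.

```python
from itertools import combinations

# Hypothetical common haplotypes over six SNP positions (toy data).
HAPLOTYPES = {
    "hapA": "ACGTAC",
    "hapB": "ACGTGC",
    "hapC": "GCATAC",
    "hapD": "GTATAC",
}


def find_tag_snps(haplotypes):
    """Return the smallest set of positions whose alleles distinguish every haplotype."""
    seqs = list(haplotypes.values())
    n_sites = len(seqs[0])
    for size in range(1, n_sites + 1):
        for positions in combinations(range(n_sites), size):
            signatures = {tuple(s[p] for p in positions) for s in seqs}
            if len(signatures) == len(seqs):  # every haplotype has a unique signature
                return positions
    return tuple(range(n_sites))


def identify_haplotype(tag_alleles, positions, haplotypes):
    """Look up the haplotype whose alleles at the tag positions match the observation."""
    for name, seq in haplotypes.items():
        if tuple(seq[p] for p in positions) == tuple(tag_alleles):
            return name
    return None


tags = find_tag_snps(HAPLOTYPES)
print("Tag SNP positions:", tags)
print("Alleles G, T, A at the tag positions identify:", identify_haplotype("GTA", tags, HAPLOTYPES))
```

In this toy set three of the six positions are enough to tell the four haplotypes apart, which is the same economy the HapMap rationale describes at genome scale.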

Beyond genome

Chemical genomics

Bioinformatics has many subdisciplines
- Genome Informatics (DNA sequence)
- Transcriptome Informatics (expression)
- Proteome Informatics (ID, post-transl. mod.)
- Protein Informatics (protein struct./funct.)
- Evolutionary Informatics
- Biomedical Informatics (human disease)
…

Briefings in Bioinformatics (Mar 2005)
- The many faces of sequence alignment (Batzoglou)
- Bioinformatics analysis of alternative splicing (Lee & Wang)
- Putting microarrays in a context: Integrated analysis of diverse biological data (Troyanskaya)
- Bioinformatics approaches and resources for single nucleotide polymorphism functional analysis (Mooney)
- A survey of current work in biomedical text mining (Cohen & Hersh)
- Current efforts in the analysis of RNAi and RNAi target genes (Bengert & Dandekar)

Sequence alignment
- The problem is still not solved. Sequence alignment methodology and tool development continue to grow. How can that be, after nearly forty years of research and literally hundreds of available tools?
- Why should alignment remain an open problem?
  - It is not a single problem but rather a collection of many quite diverse questions that all have in common the search for sequence similarity.
  - Biological sequence databases are expanding exponentially, faster than Moore's law.
Batzoglou, 2005

Sequence alignment challenges
- Sensitivity and specificity
- Speed
- Evaluation
- Low similarity
- Rearrangements
- Orthology detection
- Multiple (genome) alignments

Evolution of functionally important regions over time (Miller et al., 2005)

Evolutionary Informatics/ Comparative genomics (Ureta-Vidal, Ettwiller, Birney, 2003)

Schema of genome alignment (2003)

Genome alignment: recent reviews
- An Applications-Focused Review of Comparative Genomics Tools: Capabilities, Limitations and Future Challenges (Chain et al., Briefings in Bioinformatics, 2003)
- The many faces of sequence alignment (Batzoglou, Briefings in Bioinformatics, 2005)
- Comparative genomics (Miller et al., Annu. Rev. Genomics Hum. Genet., 2004)

RNAi: post-transcriptional gene regulation
- Computational identification of miRNAs
- Computational prediction of miRNA targets
- miRNA data resources

Transcriptomics: tools for understanding the body plan

Microarray & Integrated analysis Troyanskaya, 2005

Proteomics Initial goal: identification of all proteins expressed by a cell or tissue

From 1D to 3D: The Holy Grail of Structural Bioinformatics
MADWVTGKVTKVQNWTDALFSLTVHAPVL PFTAGQFTKLGLEIDGERVQRAYSYVNSPD NPDLEFYLVTVPDGKLSPRLAALKPGDEV QVVSEAAGFFVLDEVPHCETLWMLATGTAI GPYLSILRLGKDLDRFKNLVLVHAARYAAD LSYLPLMQELEKRYEGKLRIQTVVSRETAA GSLTGRIPALIESGELESTIGLPMNKETSHV MLCGNPQMVRDTQQLLKETRQMTKHLRR RPGHMTAEHYW…

Structural Bioinformatics: Sequence/Structure Relationship
[Figure: percent identity axis (0 to 100). Labels: all possible sequences of amino acids; protein sequences observed in nature; protein structures observed in nature; twilight zone; midnight zone]
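Percent identity, the axis of this slide, can be computed from a pairwise alignment as in the minimal sketch below. Several definitions exist; this one divides matches by the number of columns where neither sequence has a gap. The zone cut-offs (20% and 35%) are assumed rough conventions, not values read from the figure, and the example alignment is made up.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def identity_zone(pid: float, midnight: float = 20.0, twilight: float = 35.0) -> str:
    """Classify a percent identity using assumed, approximate zone boundaries."""
    if pid < midnight:
        return "midnight zone"
    if pid < twilight:
        return "twilight zone"
    return "above the twilight zone"


# Hypothetical alignment of two short fragments ("-" marks a gap).
pid = percent_identity("MADW-VTG", "MGDWKVSG")
print(f"{pid:.1f}% identity: {identity_zone(pid)}")
```

Lower identity generally makes structural inference from sequence alone less reliable, which is why the twilight and midnight zones are singled out.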

Structure Prediction Methods
[Figure: homology modeling, fold recognition, and ab initio methods placed along a % sequence identity axis (0 to 100)]

CASP Experiments

Some CASP4 successes (Baker’s group)

3D to 1D? (Science, 2003)

A computer-designed protein (93 aa) whose crystal structure matches the design model to within 1.2 Å (RMSD)

Sequence/Structure Gap Sequence Structure

Structural Genomics: solving fold representatives Baker & Sali, 2001

Structural Genomics: overview
- When: 1997, by Barry Honig, Wayne Hendrickson, and colleagues in a DOE Advanced Photon Source (APS) proposal
- Goals: 10,000 structures (100-200 structures/center/yr), each representing a protein family, in 5 (10) years
- Enabling factors: genome sequences, technology advancement (synchrotron & MAD, etc.)
- Cost: reducing the current ~US$200,000/structure to $10,000/structure (est. US$1.5-5 billion)
- Players: academia & industry

Flowchart of an SG project (Burley et al., 1999)

PSI phase I (pilot) centers
- Berkeley Structural Genomics Center focused on two bacterial species with extremely small genomes to study proteins essential for independent life. Principal investigator: Sung-Hou Kim, Lawrence Berkeley National Laboratory
- Center for Eukaryotic Structural Genomics, based in Wisconsin, focused on protein production, characterization, and structure determination from Arabidopsis thaliana, a plant that is frequently used in laboratory research and that has many genes in common with humans and animals. Principal investigator: John Markley, University of Wisconsin, Madison
- Joint Center for Structural Genomics, based in California, focused on novel structures from thermophilic microorganisms and on human proteins thought to be involved in cell signaling. Principal investigator: Ian Wilson, The Scripps Research Institute
- Midwest Center for Structural Genomics, based in Illinois, selected bacterial targets related to disease and proteins from all three kingdoms of life. The emphasis was on previously unknown folds and on proteins from disease-causing organisms. Principal investigator: Andrzej Joachimiak, Argonne National Laboratory
- New York Structural Genomics Research Consortium solved protein structures for disease-related proteins from eukaryotes and bacteria. Principal investigator: Stephen K. Burley, Structural GenomiX, Inc.
- Northeast Structural Genomics Consortium, based in New Jersey, focused on target proteins from various model organisms, including the fruit fly, yeast, and roundworm. It used both X-ray crystallography and NMR spectroscopy. Principal investigator: Gaetano Montelione, Rutgers University
- The Southeast Collaboratory for Structural Genomics, based in Georgia, determined structures from the prokaryotic model organism Pyrococcus furiosus and the eukaryotic model organism C. elegans, as well as some human proteins. Principal investigator: Bi-Cheng Wang, University of Georgia
- Structural Genomics of Pathogenic Protozoa Consortium, based in Washington, solved protein structures from organisms known as protozoans, many species of which cause deadly diseases such as sleeping sickness, malaria, and Chagas' disease. Principal investigator: Wim G. J. Hol, University of Washington
- TB Structural Genomics Consortium, based in New Mexico, analyzed protein structures from Mycobacterium tuberculosis. Principal investigator: Thomas Terwilliger, Los Alamos National Laboratory

PSI Pilot Phase Facts at a Glance
- Goal: To develop new approaches and tools needed to streamline and automate the steps of protein structure determination, and to incorporate those methods into high-throughput pipelines that use DNA sequence information to generate three-dimensional protein structure models
- Project period: September 2000 to June 2005
- Funding: $270 million (funded largely by the National Institute of General Medical Sciences, with additional support from the National Institute of Allergy and Infectious Diseases)
- Number of centers: 9 (6 survived to phase II)
- Solved protein structures: more than 1,100
- Unique structures solved (structures sharing less than 30 percent of their sequence with other known proteins): more than 700

PDB content growth (May 2005)

Many bottlenecks remain: target tracking by PDB (Sep 2002)

Current (phase II) PSI centers

Hybrid approach for solving macromolecular complex structures

Protein network: an integrated approach (Aloy et al., 2004)

Bioinformatics and Drug Design (Scientific American, 2000)

Yeast protein interaction network Nat Rev Genet. 2004

Network parameters
- Degree (connectivity): k
- Degree distribution: P(k), the probability that a selected node has exactly k links
- Scale-free network: the degree distribution approximates a power law, P(k) ~ k^(-γ) (γ: degree exponent)
[Figure: P(k) versus k, and log P(k) versus log k]
Barabasi & Oltvai, Nat Rev Genet. 2004
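The network parameters above can be computed from an edge list as in the following sketch. The function names and the toy star network are illustrative assumptions. The exponent is estimated by a least-squares line through (log k, log P(k)), which is a crude estimator used here only to mirror the log-log plot; maximum-likelihood fitting is generally preferred for real data.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {k: P(k)} for an undirected network given as (node, node) pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_nodes = len(degree)
    counts = Counter(degree.values())  # number of nodes having each degree k
    return {k: c / n_nodes for k, c in sorted(counts.items())}


def fit_degree_exponent(pk):
    """Estimate gamma in P(k) ~ k^(-gamma) from the slope of log P(k) versus log k."""
    points = [(math.log(k), math.log(p)) for k, p in pk.items() if k > 0 and p > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees to fit a slope")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx  # gamma is the negative of the fitted slope


# Toy star network: one hub linked to three peripheral nodes.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
print(degree_distribution(star))
```

For the star network, three of the four nodes have degree 1 and the hub has degree 3, so P(1) = 0.75 and P(3) = 0.25.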

Network models (Barabasi & Oltvai, Nat Rev Genet. 2004)

Scale-free networks
P(k) ~ k^(-γ) (γ_in: in-degree exponent; γ_out: out-degree exponent)
Albert & Barabasi, Reviews of Modern Physics, 2002

Challenges in network biology
- Network databases
- Information integration
- Organization characteristics and principles
- Design rules
- Evolution mechanisms
- Validation

Neuroinformatics: neuroscience & bioinformatics
The human brain project (UC Davis)
http://nir.cs.ucdavis.edu/index.jsp
http://ncmir.ucsd.edu/NCDB/

Bioinformatics Journals
- Bioinformatics
- Nucleic Acids Research
- BMC Bioinformatics
- Briefings in Bioinformatics
- Proteins
- J. Mol. Biol.
- …
- PNAS
- PLoS Computational Biology
- Genome Research
- …

Scope of bioinformatics
- Genome analysis
- Sequence analysis
- Phylogenetics
- Structural bioinformatics
- Gene expression
- Genetic and population analysis
- Systems biology
- Data and text mining
- Databases and ontologies

Sample articles of a recent issue
- Exon–domain correlation and its corollaries
- Functional annotation from predicted protein interaction networks
- HYPROSP II: a knowledge-based hybrid method for protein secondary structure prediction based on local prediction confidence
- Comparative interactomics analysis of protein family interaction networks using PSIMAP (protein structural interactome map)
- Semi-supervised protein classification using cluster kernels
- A new progressive-iterative algorithm for multiple structure alignment
- Practical FDR-based sample size calculations in microarray experiments
- Mining genetic epidemiology data with Bayesian networks I: Bayesian networks and example application (plasma apoE levels)
- Inferring protein–protein interactions through high-throughput interaction data from diverse organisms
- A latent variable model for chemogenomic profiling

NAR Database issue (Jan. 2005)

Category                                                 #
 1. Nucleotide Sequence Databases                       53
 2. RNA Sequence Databases                              34
 3. Protein Sequence Databases                         105
 4. Structure Databases                                 64
 5. Genomic Databases (non-human)                      134
 6. Metabolic Enzyme & Pathways; Signals Pathways       36
 7. Human & Other Vertebrate Genomes                    64
 8. Human Genes & Diseases                              69
 9. Microarray Data & Other Gene Expression Databases   42
10. Proteomics Resources                                 7
11. Other Molecular Biology Databases                   17
12. Organelle Databases                                 18
13. Plant Databases                                     48
14. Immunological Databases                             20
Total                                                  711

http://nar.oupjournals.org/cgi/content/full/33/suppl_1/D5/TBL1

NAR Web Server Issue (July 2005)

Year of publication    #
2004                   129 (137)
2005                   166
Total                  295

- Computer Related (2): Bio-* Programming Tools (1); Statistics (1)
- DNA (57): Annotations (9); Gene Prediction (4); Mapping and Assembly (1); Phylogeny Reconstruction (4); Sequence Feature Detection (16); Sequence Polymorphisms (8); Sequence Retrieval and Submission (3); Tools For the Bench (12)
- Education (1): Directories and Portals (1)
- Expression (48): cDNA, EST, SAGE (8); Gene Regulation (22); Microarrays (16); Splicing (2)
- Human Genome (13): Annotations (3); Health and Disease (3); Other Resources (2); Sequence Polymorphisms (5)
- Model Organisms (9): Microbes (4); Mouse and Rat (2); Plants (1); Yeast (2)
- Other Molecules (2): Carbohydrates (2)
- Protein (131): 2-D Structure Prediction (10); 3-D Structure Prediction, Comparison (34); 3-D Structure Retrieval, Viewing (6); Biochemical Features (8); Domains and Motifs (25); Function (10); Interactions, Pathways, Enzymes (13); Localization and Targeting (7); Phylogeny Reconstruction (5); Proteomics (2); Sequence Features (6); Sequence Retrieval (5)
- RNA (15): Functional RNAs (5); Motifs (3); Sequence Retrieval (2); Structure Prediction, Visualization, and Design (5)
- Sequence Comparison (29): Alignment Editing and Visualization (2); Analysis of Aligned Sequences (12); Comparative Genomics (7); Multiple Sequence Alignments (2); Pairwise Sequence Alignments (2); Similarity Searching (4)
- Literature (5): Search Tools (3); Text Mining (2)

http://bioinformatics.ubc.ca/resources/links_directory/narweb2005/

NAR database issue (Jan. 2005)

Category                                                 #   URL not available/   Not           Recommended
                                                             not working          recommended
 1. Nucleotide Sequence Databases                       53    3                    41             9
 2. RNA Sequence Databases                              34    4                    23             7
 3. Protein Sequence Databases                         104    4                    66            34
 4. Structure Databases                                 64    7                     9            48
 5. Genomic Databases (non-human)                      134   11                   105            17
 6. Metabolic Enzyme & Pathways; Signals Pathways       36    3                    20            13
 7. Human & Other Vertebrate Genomes                    62    2                    42            18
 8. Human Genes & Diseases                              69    4                    54            11
 9. Microarray Data & Other Gene Expression Databases   42    4                    31             7
10. Proteomics Resources                                 7    0                     5             2
11. Other Molecular Biology Databases                   17    1                    10             6
12. Organelle Databases                                 18    1                    13             4
13. Plant Databases                                     48   11                    36             1
14. Immunological Databases                             20    0                    17             3
Total                                                  708   55                   474           179

138 can be downloaded

An example of our curation

Two keys in bioinformatics research
- Solve a significant biological question
- Develop a must-use application tool