
Academic year 2015-2016. BIOINFORMATICS 2 course for the Master's degree (CLM) in Evolutionary Biology, School of Science, Università di Padova. Lecturers: Prof. Giorgio Valle, Prof. Stefania Bortoluzzi

• Introduction to bioinformatics • Biosequences, genes, genome and transcriptome • High-throughput methods • Primary and secondary databases • Structural data

DEFINITIONS OF BIOINFORMATICS • The application of computer technology to organize and analyze biological data. • Analysis of proteins, genes, and genomes using computer algorithms. • NIH: “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, analyze, or visualize such data”

BIOINFORMATICS AND RELATED FIELDS: Medical Informatics, Evolutionary Biology, Computational Biology, Bioinformatics, Genomics, Pharmacogenomics, Proteomics

BIOINFORMATICS • Computational Biology: the study and application of computing methods for classical biology

BIOINFORMATICS • Genomics: analysis and comparison of the entire genome of a single species or of multiple species

BIOINFORMATICS • Proteomics: study of how the genome is expressed in proteins, and of how these proteins function and interact

BIOINFORMATICS • Pharmacogenomics: the application of genomic methods to identify drug targets, for example, searching entire genomes for potential drug receptors

BIOINFORMATICS • Evolutionary Biology: study of the evolutionary processes that produced the diversity of life, the descent of species, and the origin of new species

BIOINFORMATICS • Medical Informatics: the study and application of computing methods to improve communication, understanding, and management of medical data

BIOINFORMATICS: MAIN AREAS? • Development and implementation of software to store, distribute and process different types of information • Development of new algorithms and statistical methods to integrate heterogeneous data and to search for and study different kinds of relationships and interactions in large datasets • Analysis and interpretation of data of various kinds, such as biosequences, structures and interactions.

BIOINFORMATICS: AIMS? • Increase the understanding of biology at the functional level • Model the functioning of cells and organisms • Provide information useful for improving quality of life (diseases, tumours) • Databases (design, handling, ...) • Expression data analysis • Gene and genome mapping • Gene function prediction • Structure prediction • Identification of disease risk factors • Target identification • Study of regulatory and metabolic networks • Drug design • Gene therapy • …

BIOINFORMATICS: WHICH DATA? • Sequence data • Structural information • Expression data • Molecular interaction data • Mutation data • Phenotypic data • Imaging data

BIOINFORMATICS: 3 PERSPECTIVES • Cell • Organism • Tree of life

I: CELL

I: CELL. CENTRAL DOGMA OF MOLECULAR BIOLOGY: DNA → (transcription) → RNA → (translation) → protein. CENTRAL DOGMA OF BIOINFORMATICS AND GENOMICS: genome → transcriptome → proteome.


I: CELL. The role of bioinformatics • These billions of sequences present both challenges and opportunities for studying many different biological problems, such as: • Cell states in relation to cell cycle, differentiation, disease, etc. • Regulation of processes • …

II: ORGANISM • Time: development • Space and state: body region, physiology, pathology, pharmacology

II: ORGANISM. The role of bioinformatics

III: TREE OF LIFE. Darwin (1837); Haeckel (1866). “The green and budding twigs may represent existing species; and those produced during former years may represent the long succession of extinct species.” “Nothing in biology makes sense except in the light of evolution” (Dobzhansky, 1973)

III: TREE OF LIFE • Living organisms that exist today evolved from ancestral organisms and are linked by evolutionary relationships • Studying the genetic heritage (genome) of species makes it possible to reconstruct their past history and their relationships with other species → phylogenetic trees. Figure: TREE OF LIFE WITH ENDOSYMBIOSIS

III: TREE OF LIFE. The role of bioinformatics • Historically, molecular evolution was the first field whose needs led to the birth of bioinformatics • Studying evolutionary processes (from macro- to micro-evolution) • Reconstructing past history • Understanding evolutionary pressures in relation to functional information, and vice versa

DATABASES AND DATA RETRIEVAL Biosequences and Gene-related info

Molecular alphabet. NUCLEIC ACIDS AND PROTEINS ARE LINEAR POLYMERS: BIOSEQUENCES • DNA and RNA are linear polymers of nucleotides, specialized in storing, transmitting and using genetic information • Nucleic acids, like proteins, can adopt specific shapes in 3D space and carry out diverse activities (e.g. catalysis). THE CENTRAL DOGMA OF BIOLOGY. Many other non-coding RNAs
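
Because biosequences are linear polymers, they can be handled as plain strings. The following is a minimal Python sketch of the DNA → RNA → protein flow using the standard genetic code; the function names and the toy sequence are illustrative assumptions, not part of the course material.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


toy_cds = "ATGAAGTTTTAA"  # hypothetical toy coding sequence
print(transcribe(toy_cds))
print(translate(toy_cds))
```

The sketch ignores reading-frame detection, ambiguous bases and alternative genetic codes, all of which a real tool would have to handle.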

THE BIG DATA ERA • Molecular biology is in the era of "big data" • High-throughput experimental methods make it possible to study a great many processes on a genomic scale, or by using comparative genomics • The wide availability of experimental data and knowledge calls for quantitative approaches, based on computer science and statistics, to the study of biological phenomena

Deep sequencing: evolution of sequencing technologies • 1st generation: standard Sanger; since the 1970s; read length 1000 bp; throughput 300 kb/run • 2nd generation: Next Generation Sequencing (NGS), Roche/454, Illumina/Solexa, ABI/SOLiD; since 2004; read length 35-1000 bp; throughput 700-50 Mb/day • 3rd generation: Ion Torrent (Life Technologies), Oxford Nanopore; since 2011; read length 100 bp; throughput 12 Gb/day

Aims of sequencing analysis • Genome: de novo sequencing; re-sequencing (amplicon, exome, whole genome) • Transcriptome: coding; non-coding (long; short, e.g. miRNA and other). Advantages: • Large amount of data produced • Identification of the causes of unknown or poorly understood diseases • Application of the acquired knowledge (pharmacogenomics). Limitations: • Need for high-capacity servers • Construction of complex bioinformatics infrastructures • Need for integrated databases • Different biological questions.

RNA-seq for reverse engineering of the genome state: Cells/biosamples → Library preparation → Sequencing → Computational analysis for reverse engineering

Genome • <3% of the human genome codes for proteins • Recent evidence from genomic tiling arrays and transcriptome sequencing has shown that >70% of the genome is pervasively transcribed into: • coding RNA (mRNA) • a great many transcriptional products, small and long RNAs, with very low coding potential. Most transcribed eukaryotic DNA is non-coding • The production phase of ENCODE reported that >80% of the genome is biologically active and functional (with a regulatory role for most sequences)

The non-coding transcriptome • Non-coding RNAs known for a long time: § rRNA and tRNA in translation § snRNA and snoRNA in RNA processing § ribozymes • The "RNA world" hypothesis • RNA molecules can simultaneously carry sequence information and possess structural plasticity • RNAs can both interact with DNA and other RNAs through complementary base pairing and provide binding sites for proteins

DNA → Transcription → RNA → Processing → mRNA → Translation → proteins • <3% of the genome is important since transcribed/coding • abundant “junk DNA” • ncRNA

DNA • >70% transcribed in “dark matter” → Transcription (alternative TSSs) → RNA transcripts/precursors → Processing (splicing, polyadenylation, editing, trans-splicing, nuclear export, silencing, turn-over, sequestration) → mRNA → Translation → proteins • Diverse functional roles for ncRNAs uncovered • ncRNA: miRNAs, piRNAs, snRNAs, lncRNAs, snoRNAs, circRNAs, tRFs, ?

Primary databases: databases consisting of experimentally derived data, such as nucleotide sequences and three-dimensional structures. Secondary databases: databases storing data derived from the analysis, treatment or integration of primary data, such as secondary structures, hydrophobicity plots and domains.
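
To illustrate how secondary data can be derived from a primary record, the sketch below computes a hydrophobicity profile from a protein sequence with a sliding window over the Kyte-Doolittle hydropathy scale. The choice of scale, the window size of 9 and the function name are assumptions made for illustration; the test sequence is the first 30 residues of the GenPept record shown later in these slides.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every window of `window` residues."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


primary_sequence = "MNFNQFENQNFFNGNPSDTFKDLGKQVFNY"  # primary data
profile = hydropathy_profile(primary_sequence)          # derived (secondary) data
for position, value in enumerate(profile, start=1):
    print(position, round(value, 2))
```

Plotting `profile` against the window position gives the hydrophobicity plot mentioned above; positive stretches indicate hydrophobic segments.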

PRIMARY DATABASES: NUCLEOTIDE SEQUENCE DATABASES. Collections of individual records, each containing a stretch of DNA or RNA with annotations. Each record is also called an ENTRY and has a code that identifies it uniquely (ACCESSION NUMBER). Primary nucleotide sequence databases: • EMBL nucleotide database, now maintained by the EBI (1980) § EMBL = European Molecular Biology Laboratory (Heidelberg) § EBI = European Bioinformatics Institute (Hinxton, UK) • GenBank = NIH database maintained by the NCBI (1982) § NIH = National Institutes of Health (US agency) § NCBI = National Center for Biotechnology Information, Bethesda, Maryland • DDBJ = Japanese DNA database (1986) § DDBJ = DNA Data Bank of Japan

NUCLEOTIDE SEQUENCE DATABASES: GenBank. DIRECT SUBMISSION: most sequences end up in one of the three databases because the author (the laboratory where the sequence was obtained) submits them directly. The sequence is then entered, and the corresponding record remains the property of that database alone, the only one entitled to modify it. The database that receives the sequence then forwards it to the other two. ANNOTATION: there are also "annotators" who take sequences from scientific journals and transfer them into the database. Redundancy problem.

A bit of history … DATA EXCHANGE: in 1988 the groups responsible for the three nucleotide sequence databases organized themselves into the International Collaboration of DNA Sequence Databases, to adopt a common format and exchange sequences daily. 2002: WGS (Whole Genome Shotgun) established.

NUCLEOTIDE SEQUENCE DATABASES: GenBank. GENBANK AND WGS STATISTICS • Dec 1982: GenBank 680338 bases, 606 sequences • … • Release 209, Aug 2015: GenBank 199823644287 bases, 187066846 sequences; WGS 1163275601001 bases, 302955543 sequences
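
As a rough worked example of what these figures imply, the sketch below takes the two GenBank base counts quoted on the slide (Dec 1982 and Aug 2015) and estimates the fold increase and an average doubling time. It assumes steady exponential growth between the two releases and uses the first day of each month as the release date; both are simplifying assumptions.

```python
import math
from datetime import date

# Base counts from the slide; the day of the month is an assumption.
first_date, first_bases = date(1982, 12, 1), 680_338
latest_date, latest_bases = date(2015, 8, 1), 199_823_644_287

fold_increase = latest_bases / first_bases
doublings = math.log2(fold_increase)
elapsed_years = (latest_date - first_date).days / 365.25
doubling_time_years = elapsed_years / doublings

print(f"fold increase: {fold_increase:,.0f}")
print(f"number of doublings: {doublings:.1f}")
print(f"average doubling time: {doubling_time_years * 12:.1f} months")
```

The result is only an average over more than three decades; actual growth was not uniform and the WGS division is not included.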

PRIMARY DATABASES: PROTEIN SEQUENCE DATABASES. SWISS-PROT: database of annotated, minimally redundant and cross-referenced protein sequences. It contains: • SWISS-PROT • TrEMBL, a supplement to SWISS-PROT consisting of computer-annotated sequences, obtained by translating all the coding sequences present in EMBL. TrEMBL contains two sections: – SP-TrEMBL, sequences to be incorporated into SWISS-PROT, with an accession number (AC) – REM-TrEMBL, the remaining sequences (immunoglobulins, synthetic proteins, ...), without an AC. Now part of the Universal Protein Resource (UniProt).

LOCUS       AIL58882                 140 aa            linear   BCT 29-AUG-2014
DEFINITION  crystallin [Staphylococcus aureus].
ACCESSION   AIL58882
VERSION     AIL58882.1  GI:675303284
DBLINK      BioProject: PRJNA240091
DBSOURCE    accession CP007499.1
KEYWORDS    .
SOURCE      Staphylococcus aureus
  ORGANISM  Staphylococcus aureus
            Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcus.
REFERENCE   1  (residues 1 to 140)
  AUTHORS   Benson,M.A., Ohneck,E.A., Ryan,C., Alonzo,F. III, Smith,H.,
            Narechania,A., Kolokotronis,S.O., Satola,S.W., Uhlemann,A.C.,
            Sebra,R., Deikus,G., Shopsin,B., Planet,P.J. and Torres,V.J.
  TITLE     Evolution of hypervirulence by a MRSA clone through acquisition of
            a transposable element
  JOURNAL   Mol. Microbiol. 93 (4), 664-681 (2014)
   PUBMED   24962815
REFERENCE   2  (residues 1 to 140)
  AUTHORS   Planet,P.J., Narechania,A., Shopsin,B. and Torres,V.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-MAR-2014) Pediatrics, Columbia University, 650 West
            168th St, New York, NY 10032, USA
COMMENT     Annotation was added by the NCBI Prokaryotic Genome Annotation
            Pipeline (released 2013). Information about the Pipeline can be
            found here: http://www.ncbi.nlm.nih.gov/genome/annotation_prok/
            ##Genome-Annotation-Data-START##
            Annotation Provider          :: NCBI
            Annotation Date              :: 03/20/2014 14:06:33
            Annotation Pipeline          :: NCBI Prokaryotic Genome Annotation Pipeline
            Annotation Method            :: Best-placed reference protein set; GeneMarkS+
            Annotation Software revision :: 2.4 (rev. 429283)
            Features Annotated           :: Gene; CDS; rRNA; tRNA; ncRNA; repeat_region
            Genes                        :: 2,836
            CDS                          :: 2,729
            Pseudo Genes                 :: 29
            rRNAs                        :: 19 ( 5S, 16S, 23S )
            tRNAs                        :: 59
            ncRNA                        :: 0
            Frameshifted Genes           :: 23
            ##Genome-Annotation-Data-END##
FEATURES             Location/Qualifiers
     source          1..140
                     /organism="Staphylococcus aureus"
                     /strain="2395 USA500"
                     /db_xref="taxon:1280"
     Protein         1..140
                     /product="crystallin"
     Region          1..137
                     /region_name="IbpA"
                     /note="Molecular chaperone (small heat shock protein)
                     [Posttranslational modification, protein turnover,
                     chaperones]; COG0071"
                     /db_xref="CDD:223149"
     Region          36..124
                     /region_name="alpha-crystallin-Hsps_p23-like"
                     /note="alpha-crystallin domain (ACD) found in
                     alpha-crystallin-type small heat shock proteins, and a
                     similar domain found in p23 (a cochaperone for Hsp90) and
                     in other p23-like proteins; cl00175"
                     /db_xref="CDD:260235"
     CDS             1..140
                     /locus_tag="CH51_12820"
                     /coded_by="CP007499.1:2592248..2592670"
                     /inference="EXISTENCE: similar to AA
                     sequence:RefSeq:WP_001010521.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /transl_table=11
ORIGIN
        1 mnfnqfenqn ffngnpsdtf kdlgkqvfny fstpsfvtni yetdelyyle aelagvnked
       61 isidfnnntl tiqatrsaky kseqlilder nfeslmrqfd feavdkqhit asfengllti
      121 tlpkikpsne ttsstsipis
//
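
A flat-file record like the one above is line-oriented, with a keyword at the start of each field, so a few fields can be pulled out with very little code. The sketch below is a minimal, hand-written parser that reads only LOCUS, DEFINITION, ACCESSION and the ORIGIN sequence from an excerpt of this record; the function name is hypothetical, and multi-line DEFINITION fields and the FEATURES table are deliberately ignored.

```python
def parse_genpept(text):
    """Extract a few fields from a GenBank/GenPept flat-file record (simplified)."""
    record = {"locus": None, "definition": None, "accession": None, "sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number followed by blocks of 10 residues.
            record["sequence"] += "".join(line.split()[1:])
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    return record


EXCERPT = """\
LOCUS       AIL58882                 140 aa            linear   BCT 29-AUG-2014
DEFINITION  crystallin [Staphylococcus aureus].
ACCESSION   AIL58882
VERSION     AIL58882.1  GI:675303284
ORIGIN
        1 mnfnqfenqn ffngnpsdtf kdlgkqvfny fstpsfvtni yetdelyyle aelagvnked
       61 isidfnnntl tiqatrsaky kseqlilder nfeslmrqfd feavdkqhit asfengllti
      121 tlpkikpsne ttsstsipis
//
"""

record = parse_genpept(EXCERPT)
print(record["accession"], len(record["sequence"]), "aa")
```

For real work an established parser, for example the GenBank reader in Biopython's `Bio.SeqIO`, would normally be preferred over a hand-written one.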

PRIMARY DATABASES: PDB • Database of 3D structures of proteins and nucleic acids • Data obtained experimentally and submitted directly by researchers • Founded in 1971

Myoglobin structure: ribbon representation vs. atom positions

PDB files • The most common format for storage and exchange of atomic coordinates for biological molecules is the PDB file format • The PDB file format is a text (ASCII) format, with an extensive header that can be read and interpreted either by programs or by people • Next slide: PDB file format

Annotated fields of a PDB file: compound name, organism, author, references, resolution, sequence, residue 1, residue 2, atom number, residue type, residue number, x, y, z


Occupancy: the fraction of unit cells that contain the atom in this particular location, usually 1.00, i.e. all of them (can be used to represent alternative conformations of side chains). Temperature factor: an indication of uncertainty in this atom's position due to disorder or thermal vibrations (can be used by graphics programs to represent the relative mobility of different parts of a protein).

Occupancy. Alternative conformations: a myoglobin amino acid with two conformations.

Temperature factor. Electron density depends on vibrations. Figure: atoms colored by temperature factor.

PDB file example
HEADER    SYNTHETIC PROTEIN MODEL 02-JUL-90 1AL1 2
COMPND    ALPHA - 1 (AMPHIPHILIC ALPHA HELIX) 1AL1 3
SOURCE    SYNTHETIC 1AL1 4
AUTHOR    C.P.HILL,D.H.ANDERSON,L.WESSON,W.F.DE*GRADO,D.EISENBERG 1AL1 5
REVDAT   2 15-JAN-95 1AL1A 1 HET 1AL1A 1
REVDAT   1 15-OCT-91 1AL1 0 1AL1 6
JRNL      AUTH   C.P.HILL,D.H.ANDERSON,L.WESSON,W.F.DE*GRADO, 1AL1 7
JRNL      AUTH 2 D.EISENBERG 1AL1 8
JRNL      TITL   CRYSTAL STRUCTURE OF ALPHA=1=: IMPLICATIONS FOR 1AL1 9
JRNL      TITL 2 PROTEIN DESIGN 1AL1 10
JRNL      REF    SCIENCE V. 249 543 1990 1AL1 11
JRNL      REFN   ASTM SCIEAS US ISSN 0036-8075 038 1AL1 12
REMARK   1 1AL1 13
REMARK   1 REFERENCE 1 1AL1 14
REMARK   1  AUTH   D.EISENBERG,W.WILCOX,S.M.ESHITA,P.M.PRYCIAK,S.P.HO 1AL1 15
REMARK   1  TITL   THE DESIGN, SYNTHESIS, AND CRYSTALLIZATION OF AN 1AL1 16
REMARK   1  TITL 2 ALPHA-*HELICAL PEPTIDE 1AL1 17
REMARK   1  REF    PROTEINS.STRUCT.,FUNCT., V. 1 16 1986 1AL1 18
REMARK   1  REF  2 GENET. 1AL1 19
REMARK   1  REFN   ASTM PSFGEY US ISSN 0887-3585 867 1AL1 20
REMARK   2 1AL1 21
REMARK   2 RESOLUTION. 2.7 ANGSTROMS. 1AL1 22
REMARK   3 1AL1 23
REMARK   3 REFINEMENT. BY THE RESTRAINED LEAST SQUARES PROCEDURE OF J. 1AL1 24
REMARK   3 KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R 1AL1 25
REMARK   3 VALUE IS 0.255 FOR ALL DATA. THE R VALUE IS 0.211 FOR ALL 1AL1 26
REMARK   3 REFLECTIONS IN THE RESOLUTION RANGE 10.0 TO 2.7 ANGSTROMS 1AL1 27
REMARK   3 WITH FOBS.GT.2*SIGMA(FOBS). THE RMS DEVIATION FROM 1AL1 28
REMARK   3 IDEALITY OF THE BOND LENGTHS IS 0.013 ANGSTROMS. THE RMS 1AL1 29
REMARK   3 DEVIATION FROM IDEALITY OF THE BOND ANGLE DISTANCES IS 1AL1 30

PDB file example (continued)
SEQRES   1 13 ACE GLU LEU LYS LEU GLU LEU LYS GLY 1AL1 39
HET    SO4 13 5 SULFATE ION 1AL1A 5
FORMUL   2 SO4 S1 1AL1 41
HELIX    1 HL1 ACE 0 LEU 10 1 1AL1 42
CRYST1   62.350 90.00 I 41 3 2 48 1AL1 43
ORIGX1   1.000000 0.000000 0.00000 1AL1 44
ORIGX2   0.000000 1.000000 0.000000 0.00000 1AL1 45
ORIGX3   0.000000 1.000000 0.00000 1AL1 46
SCALE1   0.016038 0.000000 0.00000 1AL1 47
SCALE2   0.000000 0.016038 0.000000 0.00000 1AL1 48
SCALE3   0.000000 0.016038 0.00000 1AL1 49
ATOM      1  C   ACE  0   31.227  38.585  11.521  1.00 25.00 1AL1 50
ATOM      2  O   ACE  0   30.433  37.878  10.859  1.00 25.00 1AL1 51
ATOM      3  CH3 ACE  0   30.894  39.978  11.951  1.00 25.00 1AL1 52
ATOM      4  N   GLU  1   32.153  37.943  12.252  1.00 25.00 1AL1 53
ATOM      5  CA  GLU  1   32.594  36.639  11.811  1.00 25.00 1AL1 54
ATOM      6  C   GLU  1   32.002  35.428  12.514  1.00 25.00 1AL1 55
ATOM      7  O   GLU  1   32.521  34.279  12.454  1.00 25.00 1AL1 56
ATOM      8  CB  GLU  1   34.093  36.609  11.812  1.00 25.00 1AL1 57
…
ATOM    102  OXT GLY 12   20.888  27.022   1.650  1.00 25.00 1AL1 144
TER     103      GLY 12 1AL1 145
HETATM  104  S   SO4 13   31.477  38.950  15.821  0.50 25.00 1AL1 146
HETATM  105  O1  SO4 13   31.243  38.502  17.238  0.50 25.00 1AL1 147
HETATM  106  O2  SO4 13   30.616  40.133  15.527  0.50 25.00 1AL1 148
HETATM  107  O3  SO4 13   31.158  37.816  14.905  0.50 25.00 1AL1 149
HETATM  108  O4  SO4 13   32.916  39.343  15.640  0.50 25.00 1AL1 150
CONECT  104 105 106 107 108 1AL1 151
CONECT  105 104 1AL1 152
CONECT  106 104 1AL1 153
CONECT  107 104 1AL1 154
CONECT  108 104 1AL1 155
MASTER  29 0 1 1 0 0 0 6 100 1 5 1 1AL1A 6
END 1AL1 157
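
ATOM and HETATM records are fixed-column, so they are normally read by character position rather than by splitting on whitespace. The sketch below shows one way to do this in Python and to flag atoms with partial occupancy. Since the exact spacing of the slide's example cannot be relied on, the two sample lines are rebuilt in the standard column layout from values shown in the example; `format_atom_line` and `parse_atom_line` are hypothetical helper names, and chain ID, alternate-location and insertion-code columns are left blank.

```python
def format_atom_line(record, serial, name, res_name, res_seq, x, y, z, occupancy, b_factor):
    """Build an ATOM/HETATM line in the standard PDB column layout (simplified)."""
    return (
        f"{record:<6}{serial:>5}  {name:<3} {res_name:>3}  {res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}"
    )


def parse_atom_line(line):
    """Read the main fields of an ATOM/HETATM line by column position."""
    return {
        "record": line[0:6].strip(),        # columns 1-6
        "serial": int(line[6:11]),          # columns 7-11
        "name": line[12:16].strip(),        # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20
        "res_seq": int(line[22:26]),        # columns 23-26
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
        "occupancy": float(line[54:60]),    # columns 55-60
        "b_factor": float(line[60:66]),     # columns 61-66
    }


# Values taken from the example above (atom 5 and the sulfate sulfur, atom 104).
sample_lines = [
    format_atom_line("ATOM", 5, "CA", "GLU", 1, 32.594, 36.639, 11.811, 1.00, 25.00),
    format_atom_line("HETATM", 104, "S", "SO4", 13, 31.477, 38.950, 15.821, 0.50, 25.00),
]
atoms = [parse_atom_line(line) for line in sample_lines]
partial = [atom for atom in atoms if atom["occupancy"] < 1.0]
for atom in partial:
    print(atom["serial"], atom["res_name"], atom["name"], atom["occupancy"])
```

In this example only the sulfate ion has an occupancy below 1.00, which matches the 0.50 values in the HETATM records.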

Subunits view Interactive view

SECONDARY DATABASES

SECONDARY DATABASES: UniProt (Universal Protein Resource). The largest catalogue of information on proteins. It contains information on the sequence and function of proteins and is built by combining the information contained in Swiss-Prot, TrEMBL and PIR.

UniProt http://www.uniprot.org/uniprot/ UniProt Knowledgebase, two parts: • Manually annotated records, with information from the literature (UniProtKB/Swiss-Prot) • Records resulting from computational analysis, awaiting full annotation (UniProtKB/TrEMBL).

NCBI GENE: unified interface for searching information on sequences and genetic loci. It presents information on official nomenclature, accession numbers, phenotypes, MIM numbers, UniGene clusters, homology, map positions and links to numerous other websites.


NCBI GENE. RefSeq: Reference Sequence collection of genomic DNA, transcripts, and proteins. Distinguishing features: • non-redundancy • explicitly linked nucleotide and protein sequences • updates to reflect current knowledge of sequence data and biology • data validation and format consistency • accessions with '_' character • ongoing curation by NCBI staff and collaborators, with reviewed records indicated
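
The underscore in the accession makes RefSeq records easy to recognize programmatically. The sketch below is a minimal classifier based on the two-letter prefix; the prefix table is a partial list of commonly seen RefSeq prefixes, the short descriptions are paraphrases, and the function name is hypothetical. The test accessions are the ones that appear in the GenPept record shown earlier.

```python
import re

# Partial table of RefSeq accession prefixes (other prefixes exist).
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA (model)",
    "XR": "predicted non-coding RNA (model)",
    "XP": "predicted protein (model)",
    "WP": "non-redundant protein",
}

# Two uppercase letters, an underscore, digits, and an optional ".version" suffix.
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_accession(accession):
    """Return prefix, molecule type and version for a RefSeq-style accession, else None."""
    match = REFSEQ_PATTERN.match(accession)
    if match is None:
        return None
    prefix, _number, version = match.groups()
    return {
        "prefix": prefix,
        "molecule": REFSEQ_PREFIXES.get(prefix, "other RefSeq type"),
        "version": int(version) if version else None,
    }


for accession in ["WP_001010521.1", "AIL58882.1", "CP007499.1"]:
    print(accession, describe_accession(accession))
```

Only the first accession is recognized as RefSeq; the other two are ordinary GenBank accessions without an underscore.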

SECONDARY DATABASES: NCBI information retrieval system • Developed at the NCBI (National Center for Biotechnology Information, USA) to provide access to molecular biology data and bibliographic citations. • Exploits the concept of "neighbouring": the ability to link different objects from different databases, regardless of whether they are directly cross-referenced. • Typically provides access to databases of nucleotide sequences, protein sequences, chromosome and genome maps, 3D structures and literature (PubMed).

PubMed

Bookshelf

Secondary structure databases • PFAM, CATH, SCOP • They organize structures according to hierarchical, evolutionary and structural-similarity criteria • Secondary databases derived from PDB

Pfam • Proteins contain conserved regions • Based on the conserved regions, proteins are classified into families • Domains can be considered as building blocks of proteins. • Some domains can be found in many proteins with different functions, while others are only found in proteins with a certain function. • The presence of a particular domain can be indicative of the function of the protein.

Pfam • The Pfam database is a large collection of protein domain families. • Each family is represented by multiple sequence alignments and hidden Markov models (HMMs). • HMMs are probabilistic models that describe the evolution and conservation of protein families. • Provides links to external databases such as PDB, SCOP and CATH.

Procedure for building the alignments: curated seed alignments of representative members of the family → HMM → full alignments containing all members of the family
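
The seed → model → full workflow can be illustrated with a deliberately simplified stand-in for a profile HMM: a position-specific scoring matrix built from a gapless seed alignment, with no insert or delete states. This is not how Pfam itself builds its models; the sequences, pseudocount, uniform background and threshold below are all hypothetical values chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seed_alignment, pseudocount=1.0):
    """Per-column log-odds scores (log2) against a uniform background."""
    length = len(seed_alignment[0])
    assert all(len(seq) == length for seq in seed_alignment), "seed must be gapless and aligned"
    background = 1.0 / len(AMINO_ACIDS)
    total = len(seed_alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(length):
        counts = Counter(seq[column] for seq in seed_alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the column scores of a segment with the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def collect_members(profile, candidates, threshold):
    """Keep candidates scoring at or above the threshold (analogous to a 'full' set)."""
    return [c for c in candidates if score(profile, c) >= threshold]


seed = ["GELKL", "GELRL", "GDLKL"]          # hypothetical curated seed alignment
model = build_profile(seed)                  # plays the role of the HMM
database = ["GELKL", "GDLRL", "WWWWW"]       # hypothetical sequence database
full = collect_members(model, database, threshold=5.0)
print(full)
```

Note that "GDLRL" is accepted although it is not in the seed: the model generalizes from the column-wise conservation pattern, which is the point of searching with a profile rather than with single sequences.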

Pfam classification • Family: a collection of related protein regions • Domain: a structural unit • Repeat: a short unit which is unstable in isolation but forms a stable structure when multiple copies are present • Motif: a short unit found outside globular domains • Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile HMM.

Two sections: Pfam-A and Pfam-B • Pfam-A: seed and full alignments • Pfam-B: "incomplete" families

• Pfam-B families are un-annotated and of lower quality as they are generated automatically from the non-redundant clusters of the latest ADDA (Automatic Domain Decomposition Algorithm) database release. • Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

Pfam: HMM logo and seed alignment

CATH: Protein Structure Classification Database at UCL • Classification of proteins based on domain structures • Each protein is chopped into individual domains, which are assigned to homologous superfamilies. • Hierarchical domain classification of PDB entries.

CATH hierarchy • Class – derived from secondary structure content; assigned automatically • Architecture – describes the gross orientation of secondary structures, independent of connectivity (based on known structures) • Topology – clusters structures according to their topological connections and numbers of secondary structures • Homologous superfamily – this level groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous

• Class, C-level: mainly-alpha, mainly-beta and alpha-beta (including alternating alpha/beta structures and alpha+beta structures), plus a fourth class with low secondary structure content. • Architecture, A-level: overall shape; ignores the connectivity between the secondary structures. Assigned manually, using the literature for well-known architectures (e.g. the beta-propeller or alpha four-helix bundle) as reference. • Topology (fold family), T-level: structures are grouped into fold families depending on both the overall shape and the connectivity of the secondary structures. This is done using the structure comparison algorithm SSAP.

CATH: large domain of human serine hydroxymethyltransferase
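
CATH identifiers are written as four dot-separated numbers, one per level (C.A.T.H), which maps naturally onto a small data structure. The sketch below parses such a code and reports the deepest level two domains share; the helper names are hypothetical and the codes in the usage lines are used purely as examples of the format.

```python
from typing import NamedTuple, Optional

# The four classes described above, keyed by the first number of a CATH code.
CATH_CLASSES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha-beta",
    4: "low secondary structure content",
}

LEVELS = ["class", "architecture", "topology", "homologous superfamily"]


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Split a 'C.A.T.H' string into its four numeric levels."""
    parts = [int(part) for part in code.split(".")]
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {len(parts)}: {code!r}")
    return CathCode(*parts)


def deepest_shared_level(code_a: str, code_b: str) -> Optional[str]:
    """Name the deepest hierarchy level at which two domains are still grouped together."""
    shared = None
    for level, a, b in zip(LEVELS, parse_cath_code(code_a), parse_cath_code(code_b)):
        if a != b:
            break
        shared = level
    return shared


domain = parse_cath_code("3.40.50.720")
print(CATH_CLASSES[domain.class_])
print(deepest_shared_level("3.40.50.720", "3.40.50.300"))
print(deepest_shared_level("3.40.50.720", "1.10.8.10"))
```

Two domains that share a topology but not a homologous superfamily have the same fold without evidence of common ancestry, which is exactly the distinction the T and H levels are meant to capture.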

SCOP Structural Classification of Proteins • Description of structural and evolutionary relationships between all the proteins with known structures • Uses the PDB entries • Search using keywords or PDB identifiers

Hierarchy in SCOP • Whereas the four major levels of CATH are class, architecture, topology and homologous superfamily, SCOP uses: • Class (all α, all β, α/β, α + β) • Fold • Superfamily • Family • Species • SCOP is mainly based on expert knowledge, while CATH relies more on automation

What about genomic databases? They will be covered in the part of the course on genomics.

Homework: read three abstracts and one article chosen from the latest NAR Database issue http://www.oxfordjournals.org/our_journals/nar/database/c/