High performance computing for the annotation and mining of whole genomes. G. Paolella, Napoli, 18/12/2007

DG-CST: 1022 genes related to genetically transmitted disease

CST: identification and characterization, in other species, of nucleotide sequences conserved between human and mouse (CSTs). Figure labels: H. sapiens, M. musculus. CSTs identified in disease-associated genes: 64,495. Analysis to be carried out with BLAST against other genomes (rat, dog, monkey, chicken, etc.).

KinWeb: 500 genes coding for human protein kinases

KinWeb DB. Figure with panels (a) to (e).

Pipeline units

High throughput sequencing. Diagram labels: Scaffolds; Contigs; Assemble; …; Annotation. Annotated features: geneCluster A, geneA, tRNA, prom, oprA, oprB.

Sequencing: at CEINGE, a Nonomuraea genome sequencing project using 454 FLX is in progress.

Annotation

Annotation Steps
• Identification of genes and other genetic elements
• Protein functional annotation
• Cellular process annotation
• Identification of ORFs, tRNAs, rRNAs
• Scanning for signals, such as promoters and microRNAs
• Identification of operons and gene clusters
• Comparison with known genomes/proteins
• Identification of orthologs and paralogs
• Characterization of protein domains
• Reconstruction of complete metabolic pathways
• …

SLS in bacteria: E. coli K12. Figure labels: ERIC; Stem Loop Structure (SLS); Protein and coding genes; Forward strand; Reverse strand; Rib (BIME family).

Identification of SLSs in bacterial genomes. Plot labels: real, shuffled.

Clustering by sequence similarity: BLAST all vs all of the SLSs, then Markov clustering (MCL) on the BLAST e-values.
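
As a rough illustration of how all-vs-all BLAST output can feed Markov clustering, the sketch below converts tabular hits into three-column edges (label, label, weight), the kind of input MCL can read. It assumes the standard 12-column BLAST tabular layout with the e-value in column 11; the cutoff, the weight transform (negative log10 of the e-value, capped) and the sample hits are illustrative choices, not values from the talk.

```python
import math

def blast_to_abc(lines, max_evalue=1e-5, cap=200.0):
    """Turn BLAST tabular hits into (query, subject, weight) edges.

    Assumes the standard 12-column tabular layout, where column 11
    (index 10) holds the e-value. Self-hits and weak hits are dropped.
    """
    edges = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject or evalue > max_evalue:
            continue
        # An e-value of 0 would give an infinite weight, so cap it.
        weight = cap if evalue == 0.0 else min(cap, -math.log10(evalue))
        key = (query, subject)
        # Keep only the best-scoring hit per ordered pair.
        edges[key] = max(weight, edges.get(key, 0.0))
    return [(q, s, w) for (q, s), w in sorted(edges.items())]

# Hypothetical hits between three SLSs (tab-separated, 12 columns).
sample = [
    "sls1\tsls2\t95.0\t60\t3\t0\t1\t60\t1\t60\t1e-20\t110",
    "sls1\tsls1\t100.0\t60\t0\t0\t1\t60\t1\t60\t0.0\t120",
    "sls2\tsls3\t80.0\t40\t8\t0\t1\t40\t5\t44\t0.5\t30",
]
for q, s, w in blast_to_abc(sample):
    print(f"{q}\t{s}\t{w:.1f}")
```

Written to a file, edges in this form could then be handed to the mcl program through its --abc option. Only the sls1/sls2 pair survives here, since the self-hit and the weak hit are filtered out.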

RESULTS: folding probability. Plot labels: clustered SLSs, random sequences. p = probability that the Minimum Free Energy (MFE) of a given sequence is no different from the distribution of MFEs computed with random sequences (RANDFOLD).
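
The idea behind this p-value can be sketched as a Monte Carlo test: shuffle the sequence many times, score each shuffled copy, and count how often a shuffled copy folds at least as stably as the original. The sketch below assumes the estimate p = R / (N + 1), with R the number of shuffled sequences whose energy is less than or equal to the original one and N the number of shuffles. The function toy_mfe is a hypothetical stand-in score, not a real folding energy model, and the hairpin sequence is made up; a real analysis would call an RNA folding program instead.

```python
import random

def toy_mfe(seq):
    """Hypothetical stand-in for a folding energy (NOT a real model).

    Scores complementary pairs between the two halves of the sequence,
    as if it folded into a single hairpin; more pairs, lower energy.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    half = len(seq) // 2
    left, right = seq[:half], seq[::-1][:half]
    return -sum(1.0 for a, b in zip(left, right) if (a, b) in pairs)

def shuffle_pvalue(seq, energy=toy_mfe, n_shuffles=999, seed=0):
    """Estimate p = R / (N + 1) from mononucleotide shuffles."""
    rng = random.Random(seed)
    observed = energy(seq)
    letters = list(seq)
    r = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if energy("".join(letters)) <= observed:
            r += 1
    return r / (n_shuffles + 1)

# Made-up sequence: a 10-base stem, a UUUU loop, and the matching strand.
hairpin = "GGGAUCCACGUUUUCGUGGAUCCC"
print(toy_mfe(hairpin), shuffle_pvalue(hairpin))
```

A low value suggests the original sequence is more stable than its shuffled versions, which is the property used to separate structured candidates from random sequences.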

RESULTS: grouping clusters. 13739 => 98 clusters => manual refinement => 92 families

Identification of all family members by HMM. Diagram labels: Sequence alignment; HMM build; New alignment; HMM; Final elements; Align & extend matches; Genome search.

An example of the elongation process: Myg-1 (M. genitalium)

Examples of complex structures: Efa-1 (E. faecalis), Pae-1 (P. aeruginosa)

Bacterial SLSs: ERIC (Escherichia coli), Pae-1 (Pseudomonas aeruginosa)

Secondary structure prediction analysis of families (RNAz test)
• Predicted to be structured (57): Novel (37), Known (20)
• Predicted not to be structured (35): Novel (30), Known (5)
• Contain known motifs: (12) and (14), as labeled in the chart
• Chart axis: % of genic families (0 to 100)

Cluster: 4 x 14 x 2 = 112 processors at 2.8 GHz; 4 x 14 x 2 = 112 GB RAM; 2 GB/s per card, 4 GB/s aggregate. Each node: dual-processor 2.8 GHz, 2 GB RAM, 160 GB HD.

Processing time

The procedure requires high performance computing. Diagram labels: Characterization; Identification; Blast + MCL; Randfold; Blast + MCL; RNAz; Pcma; SCoPE GRID computing; HMMbuild; HMMcalibrate; HMMsearch; Pcma; Infernal; n.

Medicine site. HD attached to the system:
• 1 Cluster Element (CE)
• 5 Worker nodes (WN), dual-processor (expandable up to 40)
• 1 Storage Element (SE) with 50 Gb
• 1 User Interface (UI)

BLAST
• Executable submitted from a local program repository
• Genomic data libraries stored on the SE
• Example: BLAST of the 65597 CSTs against the dog, chicken, monkey and rat genomes
• Number of jobs submitted: 67
• Input sequence group: 1000 sequences
• Total execution time for the 67 jobs: 4 hours
• Average time per job: 18 minutes (2 spent downloading the dataset)
CPU time
• Search for 1 sequence in the mouse genome => 5 sec
• 64,495 sequences => 3.75 days
• 10 genomes => 37.5 days
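
The CPU-time figures on this slide follow from simple multiplication, and the sketch below reproduces them. The 5 seconds per sequence, the sequence and genome counts, and the job statistics are taken from the slide; the helper name is made up for illustration, and the concurrency figure is only a back-of-the-envelope ratio.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def cpu_days(n_sequences, seconds_per_sequence, n_genomes=1):
    """Serial CPU time, in days, for searching every sequence in every genome."""
    return n_sequences * seconds_per_sequence * n_genomes / SECONDS_PER_DAY

one_genome = cpu_days(64495, 5)        # figures taken from the slide
ten_genomes = cpu_days(64495, 5, 10)
print(f"1 genome:   {one_genome:.2f} days")
print(f"10 genomes: {ten_genomes:.1f} days")

# Job statistics from the slide: 67 jobs, 18 minutes each on average,
# 4 hours in total. The ratio hints at how many jobs ran concurrently.
job_hours = 67 * 18 / 60
print(f"summed job time: {job_hours:.1f} h, "
      f"average concurrency: {job_hours / 4:.1f}")
```

The results, about 3.7 and 37 days, agree with the rounded 3.75 and 37.5 days quoted on the slide. The summed job time of roughly 20 hours against 4 hours of wall time is consistent with about five jobs running at once.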

Secondary structure search: identification and characterization, in bacterial genomes, of families of repeated sequences that share a conserved secondary structure. Analysis to be carried out on over 300 bacterial genomes.
Example:
Search for one family in one genome =====> 6 hours
Search for 50 families in one genome =====> 12.5 days
Search for 50 families in 300 genomes =====> 10 years
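
To see why this workload calls for a cluster, the sketch below scales the 6 hours per family per genome up to the full analysis, then divides by a processor count. The 112 processors come from the cluster slide; treating the searches as perfectly parallel, with no scheduling or data transfer overhead, is an optimistic assumption made here for illustration, and the function names are hypothetical.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365

def serial_hours(hours_per_search, n_families, n_genomes):
    """Total hours if every family/genome search runs one after another."""
    return hours_per_search * n_families * n_genomes

def ideal_parallel_days(total_hours, n_procs):
    """Wall-clock days under an idealized, perfectly even split of the work."""
    return total_hours / n_procs / HOURS_PER_DAY

one_genome = serial_hours(6, 50, 1)
all_genomes = serial_hours(6, 50, 300)
print(one_genome / HOURS_PER_DAY, "days for 50 families in 1 genome")
print(round(all_genomes / HOURS_PER_DAY / DAYS_PER_YEAR, 1), "years for 300 genomes")
print(round(ideal_parallel_days(all_genomes, 112), 1), "days on 112 processors")
```

Under these assumptions the serial estimate of about ten years drops to roughly a month, which is an upper bound on the benefit rather than a measured result.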

RandFold: identification of potentially structured sequences in the human transcriptome. Analysis to be carried out with RANDFOLD on genes fragmented, with 50-base windows, into 150-base sequences.
Example:
Genes: 408, about 14 million bases
150-base sequences generated: 291,589
Analysis of 1 sequence =====> 45 sec
Analysis of 291,589 sequences =====> 152 days
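
One way to read the fragmentation step is as a sliding window: 150-base sequences taken every 50 bases along each gene. The sketch below implements that reading, which is an assumption about the slide's wording, and then checks the quoted run time at 45 seconds per sequence. The function name and the toy gene are illustrative.

```python
def sliding_windows(sequence, size=150, step=50):
    """Yield (start, fragment) for every full-length window.

    Assumes windows of `size` bases whose start advances by `step`;
    a trailing piece shorter than `size` is skipped.
    """
    for start in range(0, len(sequence) - size + 1, step):
        yield start, sequence[start:start + size]

toy_gene = "ACGU" * 100                      # made-up 400-base sequence
fragments = list(sliding_windows(toy_gene))
print([start for start, _ in fragments])     # window start positions

# Run time for the number of fragments quoted on the slide.
total_days = 291589 * 45 / (24 * 60 * 60)
print(f"{total_days:.0f} days")
```

With a 50-base step, about 14 million bases would give on the order of 280,000 windows, the same order of magnitude as the 291,589 sequences reported.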

What about more interactive uses?

CAPRI

Request distribution on the cluster. Diagram labels: Cluster Manager (sql); Cluster Status Display server; StatusDB; Broker; Cluster activity (sql); getNode: http request for an available node; updates from node agents; Access servers, each with a scheduler; web requests; web display of results; rsh launch on the node; Cluster; Private network; DBs.
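
The broker's role in this diagram can be sketched as a small registry: node agents report their status, and a getNode-style request returns an available node. Everything below (class and method names, the load metric, the least-loaded policy, the staleness timeout) is hypothetical and only illustrates the pattern. In the diagram, status is kept in a database (StatusDB) and nodes are requested over http, neither of which is modeled here.

```python
import time

class Broker:
    """Toy broker: tracks node status and hands out the least loaded node."""

    def __init__(self, stale_after=60.0):
        self.stale_after = stale_after   # seconds before a report expires
        self.status = {}                 # node name -> (load, report time)

    def update(self, node, load, now=None):
        """Record a status report sent by a node agent."""
        self.status[node] = (load, time.time() if now is None else now)

    def get_node(self, now=None):
        """Return the least loaded node with a recent report, or None."""
        now = time.time() if now is None else now
        fresh = [(load, node) for node, (load, seen) in self.status.items()
                 if now - seen <= self.stale_after]
        return min(fresh)[1] if fresh else None

broker = Broker()
broker.update("node01", load=0.9, now=100.0)
broker.update("node02", load=0.2, now=100.0)
broker.update("node03", load=0.1, now=10.0)   # report too old to trust
print(broker.get_node(now=120.0))
```

Ignoring stale reports is one simple way to avoid dispatching work to a node whose agent has stopped responding; an access server would then launch the job on the node returned.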

Hierarchical node organization. Diagram labels: Broker; DB; node; virtual node; Grid.

Integrated image storage and processing environment
Digital images are produced by a variety of microscope devices.
PROJECT
* Project title
* Experiment name
* Author, group leader, etc.
* Cell line
* Culture conditions
* Fixation and inclusion methods, stainings, etc.
* Objective
* Focus position
* Stage position x/y
* Exposure time
* Resolution, etc.
Managing a large number of images requires the use of databases (DB). Processing of the acquired images is often necessary to enhance the visibility of cell features that would otherwise be hidden.

The image processing system: IPROC

Data input: text description
File format: version number 1; features tab-delimited
Name: filename
Depth: size 16 bit
wdim: size 4, where files
cdim: size 3, where files
pdim: size n, where files
tdim: size n, unit min, scale 10, where files
ldim: size n, unit µm, scale 0.4, where layers
Diagram labels: well 1 to well 4; Channel 1 to Channel 3; Position n; Time 1, Time 2, … Time n; l1 … ln.

Image processing. Screenshot labels: info panel for each frame; image processing menus; IPROC hide/show control command; acquisition parameters; buttons to slide the acquisition.

Image processing modules. Diagram labels: ImageMagick package (iProcStep ImageMagick); PHP package (iProcStep PHP); PERL package (iProcStep Perl); command line packages (commandLine program, with adapters for image in and image out).

IPROC architecture. Diagram labels: page (iPage); area (iPane); Gateway; image procsteps; HPC on Cluster nodes; data + images.

Parallel processing. Diagram labels: IPROC; CLUSTER; Access Server; Cluster Nodes.

The group: Angelo Boccia, Gianluca Busiello, Mauro Petrillo, Concita Cantarella, Luca Cozzuto, Leandra Sepe, Vittorio Lucignano, Marisa Passaro