Genomic Variation Associated with Malignancy
  • Genomic information (molecular entity types): Gene, Variation
  • Phenomic information (phenotypic entity types): Malignancy Types, Differentiation Status, Clinical Stage, Histology, Site, Developmental State, Heredity Status

Flow Chart for Manual Annotation Process
  Figure elements: Biomedical Literature, Machine-learning Algorithm, Auto-Annotated Texts, Annotators (Experts), Entity Definitions, Manually Annotated Texts, Annotation Ambiguity

Defining biomedical entities
  • Data gathering
      ◦ "A point mutation was found at codon 12 (G→A)." is annotated as a Variation
  • Data classification
      ◦ "point mutation": Variation.Type
      ◦ "codon 12": Variation.Location
      ◦ "G": Variation.Initial.State
      ◦ "A": Variation.Altered.State
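
One way to hold sub-typed annotations like these is as labeled character spans over the sentence. The sketch below is illustrative only: the `Span` class and the `annotate` helper are hypothetical names, and the substitution is written in ASCII as `G->A`, which assumes a G-to-A change at codon 12.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


def annotate(text, phrase, label, search_from=0):
    """Label the first occurrence of `phrase` at or after `search_from`."""
    start = text.index(phrase, search_from)
    return Span(start, start + len(phrase), label)


sentence = "A point mutation was found at codon 12 (G->A)."
paren = sentence.index("(")   # the initial state sits inside the parentheses
arrow = sentence.index("->")  # the altered state follows the arrow

spans = [
    annotate(sentence, "point mutation", "Variation.Type"),
    annotate(sentence, "codon 12", "Variation.Location"),
    annotate(sentence, "G", "Variation.Initial.State", search_from=paren),
    annotate(sentence, "A", "Variation.Altered.State", search_from=arrow),
]

for span in spans:
    print(span.label, repr(sentence[span.start:span.end]))
```

Storing offsets rather than strings keeps the two single-letter states distinct from the leading "A" of the sentence.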

Defining biomedical entities
  • Conceptual boundaries
      ◦ Sub-classification of entities
      ◦ Levels of specificity

Levels of specificity (general to specific)
  • Gene entity: Gene; Protein kinase (superfamily); MAPK (gene family); MAPK10
  • Malignancy type entity: Cancer/Tumor; Carcinoma; Lung carcinoma; Squamous cell lung carcinoma

Defining biomedical entities
  • Conceptual boundaries
      ◦ Conceptual overlaps between entities
          - Symptom: subjective or objective evidence of disease.
          - Disease: a specific pathological process with a characteristic set of symptoms.
          - Example: arrhythmia vs. long QT syndrome
      ◦ Domain-specific clarification
          - Gene entity clarification: regulation elements -- promoters (e.g., TATA box)

Defining biomedical entities
  • Syntactical boundaries
      ◦ Text boundary issues (The K-ras gene)
      ◦ Pronoun co-reference (this gene, it, they)
      ◦ Structural overlap -- entity within entity
          - Same entity type: MAP kinase
          - Different entity type: Squamous cell lung carcinoma
      ◦ Discontinuous mentions (N- and K-ras)

Semantic ambiguity challenges
  • Ambiguity within an entity type
      ◦ CAT: catalase or glycine-N-acyltransferase (GLYAT)
  • Ambiguity between entity types
      ◦ CAT: Gene entity or Organism
  • Gene entity ambiguity
      ◦ 3% of human genes share aliases
      ◦ Huge ambiguity of genes between species (mouse and human)
      ◦ Gene.general, Gene.gene/RNA, Gene.protein

Gene, RNA, Protein
Variation: Type, Location, Initial State, Altered State
Malignancy: Type, Site, Histology, Clinical Stage, Differentiation Status, Heredity Status, Developmental State
Physical Measurement, Cellular Process, Expressional Status, Environmental Factor, Clinical Treatment, Clinical Outcome, Research System, Research Methodology, Drug Effect

http://www.ldc.upenn.edu/mamandel/itre/annotators/onco/definitions.html

Manual Annotation Corpus Release
  • Jena University Language & Information Engineering Lab: http://www.julielab.de
  • K Bretonnel Cohen and Lawrence Hunter, BMC Bioinformatics. 2006; 7(Suppl 3): S5.

Summary -- Entity Definition
  • Developed an iterative process for biomedical entity definition;
  • Defined genomic and phenotypic entities with distinct conceptual and syntactical boundaries in genomic variation of malignancy;
  • Constructed a manually annotated corpus of 1442 oncology-focused articles.

Named Entity Extractors
  Example: "Mycn is amplified in neuroblastoma."
  • Gene: Mycn
  • Variation type: amplified
  • Malignancy type: neuroblastoma

Automated Extractor Development
  • Training and testing data
      ◦ 1442 cancer-focused MEDLINE abstracts
      ◦ 70% for training, 30% for testing
  • Machine-learning algorithm
      ◦ Conditional Random Fields (CRFs)
      ◦ Sets of features
          - Orthographic features (capitalization, punctuation, digit/number/alphanumeric/symbol);
          - Character n-grams (N = 2, 3, 4);
          - Prefix/suffix (*oma);
          - Offset conjunction (3 consecutive word tokens);
          - Domain-specific lexicon (NCI neoplasm list).
  • Example sentence with token-level MType labels: "Lung cancer is the … of carcinoma deaths worldwide."
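
The feature sets listed above can be sketched as a per-token feature function. This is a minimal illustration, not the extractor's actual feature code: the function names, the feature names, and the way the 3-token window copies neighbor features are all assumptions, and the resulting dictionaries would still need to be passed to a CRF implementation.

```python
import re


def char_ngrams(token, sizes=(2, 3, 4)):
    """All character n-grams of the given sizes, shortest size first."""
    return [token[i:i + n] for n in sizes for i in range(len(token) - n + 1)]


def token_features(token, lexicon):
    """Orthographic, n-gram, suffix and lexicon features for one token."""
    lower = token.lower()
    feats = {
        "is_capitalized": token[:1].isupper(),
        "all_caps": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "is_number": token.isdigit(),
        "has_punct": bool(re.search(r"[^\w\s]", token)),
        "suffix_oma": lower.endswith("oma"),   # the *oma suffix
        "in_lexicon": lower in lexicon,        # e.g. a neoplasm term list
    }
    for gram in char_ngrams(lower):
        feats["ngram=" + gram] = True
    return feats


def sentence_features(tokens, lexicon):
    """Add the previous and next token's non-n-gram features to each token,
    giving every position a view of 3 consecutive tokens (an assumed reading
    of the conjunction feature)."""
    base = [token_features(t, lexicon) for t in tokens]
    out = []
    for i, feats in enumerate(base):
        merged = dict(feats)
        for offset in (-1, 1):
            j = i + offset
            if 0 <= j < len(tokens):
                for name, value in base[j].items():
                    if not name.startswith("ngram="):
                        merged[f"{offset:+d}:{name}"] = value
        out.append(merged)
    return out


tokens = ["Mycn", "is", "amplified", "in", "neuroblastoma", "."]
features = sentence_features(tokens, lexicon={"neuroblastoma"})
```

Here the token "in" carries a `+1:suffix_oma` feature, so the model can learn that a word preceding an "-oma" word is unlikely to be part of the mention itself.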

Extractor Performance
  • Precision: (true positives)/(true positives + false positives)
  • Recall: (true positives)/(true positives + false negatives)
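
As a small worked example of the two definitions, the sketch below scores predicted mentions against gold annotations by exact match on (start, end, label). The exact-match criterion and the toy spans are assumptions made for illustration; the slides do not state how matches were counted.

```python
def precision(tp, fp):
    """true positives / (true positives + false positives)"""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """true positives / (true positives + false negatives)"""
    return tp / (tp + fn) if (tp + fn) else 0.0


def score(gold, predicted):
    """Count exact-match mentions; both arguments are sets of (start, end, label)."""
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    return precision(tp, fp), recall(tp, fn)


# Hypothetical annotations for "Mycn is amplified in neuroblastoma."
gold = {(0, 4, "Gene"), (8, 17, "VariationType"), (21, 34, "MalignancyType")}
predicted = {(0, 4, "Gene"), (21, 34, "MalignancyType"), (5, 7, "Gene")}

p, r = score(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```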

CRF-based Extractor vs. Pattern Matcher
  • The testing corpus
      ◦ 39 manually annotated MEDLINE abstracts selected
      ◦ 202 malignancy type mentions identified
  • The pattern matching system
      ◦ 5,555 malignancy types extracted from NCI neoplasm ontology
      ◦ Case-insensitive exact string matching applied
      ◦ 85 malignancy type mentions (42.1%) recognized correctly
  • The malignancy type extractor
      ◦ 190 malignancy type mentions (94.1%) recognized correctly
      ◦ Included all the baseline-identified mentions
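
A case-insensitive exact string matcher of the kind used as the baseline can be sketched with a single regular expression. The function names and the three-term list are hypothetical stand-ins for the NCI-derived vocabulary, and the word-boundary handling is an assumption about what "exact" means here.

```python
import re


def build_matcher(terms):
    """Compile one case-insensitive pattern; longer terms are tried first
    so that a multi-word term wins over a term it contains."""
    ordered = sorted({t.lower() for t in terms}, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_mentions(text, matcher):
    """Return (start, end, matched_text) for every exact term match."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]


matcher = build_matcher(["Neuroblastoma", "Renal cell carcinoma", "Carcinoma"])
text = "NB and renal cell carcinomas differ from Neuroblastoma."
mentions = find_mentions(text, matcher)
print(mentions)

# The reported percentages are fractions of the 202 gold mentions.
baseline_pct = round(100 * 85 / 202, 1)
extractor_pct = round(100 * 190 / 202, 1)
```

In this toy sentence only "Neuroblastoma" is found: the acronym "NB" and the plural "carcinomas" are missed, which is the kind of gap a learned extractor is meant to close.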

The Types of Mentions NOT Identified by Pattern Matching
  Mention Types                     Mention Examples              NCI List
  Acronyms                          NB                            Neuroblastoma
  Lexical variants (plural forms)   Renal cell carcinomas         Renal cell carcinoma
  Polymorphic expressions           Lung cancer (tumor/tumour)    Lung neoplasm
  Higher levels of specificity      Solid tumor                   <More specific tumor>
  Tumor names with modifiers        Translocation carcinoma       Carcinoma

Normalization
  • Synonymous mentions mapped to one unique identifier:
      abdominal neoplasm; abdomen neoplasm; Abdominal tumour; Abdominal neoplasm NOS; Abdominal tumor; Abdominal Neoplasms; Abdominal Neoplasm, Abdominal Neoplasms, Abdominal Neoplasm of abdomen; Tumour of abdomen; Tumor of abdomen; ABDOMEN TUMOR
  • UMLS Metathesaurus Concept Unique Identifier (CUI): C0000735
  • 19,397 CUIs with 92,414 synonyms

Normalization – Computational Procedures
  • Rule-based algorithm
      ◦ Applied to both entity mentions and vocabulary terms (UMLS Metathesaurus)
          - Case insensitivity (carcinoma/Carcinoma)
          - Space/punctuation removal (lung-cancer/lungcancer)
          - Stemming (neuroblastoma/neuroblastomas)
      ◦ Applied to mentions only
          - First/last character removal (additional space/punctuation)
          - First/last word removal (translocation lung carcinoma)
  • Evaluate the accuracy and the priority of the rules
      ◦ 1,000 randomly selected entity mentions
      ◦ Choose the best-performing rule combination and sequence
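
The rules above can be sketched as a small lookup procedure. This is a simplified illustration under several assumptions: the function names are hypothetical, stemming is reduced to dropping a trailing "s", the rule order is arbitrary rather than the evaluated best sequence, and `CUI_LUNG_CARCINOMA` is a placeholder rather than a real identifier.

```python
import re


def canonical(term):
    """Rules applied to both mentions and vocabulary terms."""
    key = term.lower()                      # case insensitivity
    key = re.sub(r"[^a-z0-9]", "", key)     # space/punctuation removal
    if key.endswith("s") and len(key) > 3:  # crude stand-in for stemming
        key = key[:-1]
    return key


def mention_variants(mention):
    """Rules applied to mentions only, tried in this (assumed) order."""
    stripped = mention.strip(" \t.,;:()[]-")  # first/last character removal
    yield stripped
    words = stripped.split()
    if len(words) > 1:
        yield " ".join(words[1:])   # first word removed
        yield " ".join(words[:-1])  # last word removed


def normalize(mention, vocabulary):
    """Return the identifier of the first vocabulary term a variant matches."""
    index = {canonical(term): cui for term, cui in vocabulary.items()}
    for variant in mention_variants(mention):
        cui = index.get(canonical(variant))
        if cui is not None:
            return cui
    return None


vocabulary = {
    "Abdominal Neoplasms": "C0000735",
    "Tumor of abdomen": "C0000735",
    "Lung carcinoma": "CUI_LUNG_CARCINOMA",  # placeholder identifier
}

print(normalize("abdominal neoplasm", vocabulary))
print(normalize("translocation lung carcinoma", vocabulary))
```

Trying the unmodified mention before the word-removal variants keeps the more specific match when one exists.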

MEDLINE Data Processing
  • Tagging MEDLINE pre-2006 abstracts
      ◦ 15,433,668 MEDLINE abstracts
      ◦ 9,153,340 redundant and 580,002 distinct malignancy type mentions
      ◦ ~60% of extracted mentions matched to UMLS CUIs
      ◦ 1,642 CPU-hours (2.44 days on a 28-CPU cluster)
  • Infrastructure construction (PostgreSQL database)
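
The wall-clock figure follows from the CPU-hours if the work is spread evenly over all 28 CPUs, which is an assumption of perfect parallel efficiency:

```python
cpu_hours = 1642
cpus = 28

wall_hours = cpu_hours / cpus  # hours of elapsed time with all CPUs busy
wall_days = wall_hours / 24

print(round(wall_hours, 1), "hours =", round(wall_days, 2), "days")
```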

Gene-Malignancy-Evidence Matrix
  • 21,493,687 normalized gene symbols (16,875 unique)
  • 5,398,954 normalized malignancy types (4,166 CUIs)
  • 3,100,773 distinct Gene-Malignancy-Evidence relations
  Matrix columns (examples shown on the slide):
      Gene: A1BG, ABCC1, B3GAT1, ERVK6, NFKB1, VIM, ……
      Malignancy: Adenocarcinoma, Lung Carcinoma, Breast Neoplasm, Stage IV Melanoma of the Skin, Colon Carcinoma, Gastrointestinal Stromal Tumor, ……
      Evidence: 1634938, 2292657, 3566173, 11156254, 11159731, 11172691, 6870377, 9129046, 9701020, 9056412, 9620301, 9640365, 12842827, 12901803, 12934082, 12375611, 12657940, 12673425, ……
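
A Gene-Malignancy-Evidence relation can be pictured as a (gene, malignancy) pair with the set of abstracts that support it. The sketch below assumes that a relation is recorded whenever a normalized gene and a normalized malignancy type occur in the same abstract; the function names, identifiers and abstract numbers are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def build_relations(abstracts):
    """abstracts maps an abstract id to (set of genes, set of malignancy ids).
    Returns a mapping (gene, malignancy) -> set of supporting abstract ids."""
    evidence = defaultdict(set)
    for abstract_id, (genes, malignancies) in abstracts.items():
        for gene, malignancy in product(genes, malignancies):
            evidence[(gene, malignancy)].add(abstract_id)
    return evidence


def rank_by_frequency(evidence):
    """Most frequently supported relations first; ties broken alphabetically."""
    return sorted(evidence.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Toy input with made-up abstract ids and a placeholder malignancy id.
abstracts = {
    101: ({"MYCN"}, {"C_NEUROBLASTOMA"}),
    102: ({"MYCN", "NTRK1"}, {"C_NEUROBLASTOMA"}),
    103: ({"NTRK1"}, set()),  # no malignancy mention, so no relation
}

relations = build_relations(abstracts)
ranked = rank_by_frequency(relations)
print(ranked[0])
```

In a relational database the same structure would be a three-column table with one row per (gene, malignancy, abstract) triple.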

Ranked by Frequency

Summary -- Extractor Development and Application
  • Developed well-performing automated entity extractors across genomic and phenotypic domains;
  • Constructed a rule-based computational procedure for normalization;
  • Applied the extractors and normalizers to all MEDLINE abstracts;
  • Imported the extracted information into a relational database.

Text Mining Applications -- Hypothesizing NB Candidate Genes
  Two distinct subtypes of neuroblastoma
                          NB Subtype A                      NB Subtype B
  Developmental State     Younger age                       Older age
  Biology                 Differentiation                   Proliferation
  Clinical Stage          Lower stage                       Higher stage
  Clinical Outcome        Favorable                         Unfavorable
  Trk Expression          High-level expression of NTRK1    High-level expression of NTRK2

Text Mining Applications -- Hypothesizing NB Candidate Genes
  • Two distinct subtypes of neuroblastoma
      ◦ Distinct clinical behaviors (favorable vs. unfavorable)
      ◦ NGF/NTRK1 (TrkA) vs. BDNF/NTRK2 (TrkB) signaling pathways
      ◦ Determine the early response genes differentiating the two pathways
      ◦ More precise prognosis and clinical intervention
  Comparison table (columns: Trk Signaling, Angiogenesis, Differentiation, Drug Resistance, Tumorigenicity)
      NB Subtype A: NTRK1/NGF, Inhibits, Yes, Inhibits
      NB Subtype B: NTRK2/BDNF, Promotes, No, Promotes

Text Mining Applications -- Hypothesizing NB Candidate Genes
  Figure labels: NTRK1, NTRK2, SH-SY5Y, NGF, BDNF
  • RNA extraction at 0, 1.5 hrs, 4 hrs and 12 hrs
  • Affymetrix U133A Expression Array (RMAexpress normalization, SAM test)
  • 751 differentially expressed genes

Text Mining Applications -- Hypothesizing NB Candidate Genes
  Microarray Expression Data Analysis
  • Gene Set 1: NTRK1, NTRK2 (468 genes)
  • Gene Set 2: NTRK2, NTRK1 (283 genes)
  • Symbols shown: NALP1, RALY, CDC2L6, RASGRP2, KCNK3, RPS6KA1, SEC61A2, VGF, CACNA1C, TBX3, THRA, B4GALT5, NRXN2, GNB5, RAI2, FRS3

Text Mining Applications -- Hypothesizing NB Candidate Genes
  • Differentially represented genes in biomedical literature
      ◦ NTRK1 vs. NTRK2 pathway: differentially associated genes/proteins based on literature
      ◦ Preferential association: a gene co-occurs with one receptor at least 5 times as often as with the other
      ◦ Assumption: co-occurrence frequency reflects functional correlation
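
The preferential-association rule can be sketched as a comparison of two co-occurrence counts per gene. The function name, the gene names and the counts are hypothetical, and treating a gene that never co-occurs with one receptor as preferentially associated with the other is an assumption the slides do not spell out.

```python
def preferential_sets(cooc_a, cooc_b, fold=5):
    """Split genes by which receptor they co-occur with at least `fold`
    times as often. Inputs map gene -> co-occurrence count."""
    set_a, set_b = set(), set()
    for gene in set(cooc_a) | set(cooc_b):
        a = cooc_a.get(gene, 0)
        b = cooc_b.get(gene, 0)
        if a > 0 and a >= fold * b:
            set_a.add(gene)
        elif b > 0 and b >= fold * a:
            set_b.add(gene)
    return set_a, set_b


# Made-up counts for illustration only.
cooc_ntrk1 = {"GENE_A": 10, "GENE_B": 2, "GENE_C": 3}
cooc_ntrk2 = {"GENE_A": 2, "GENE_B": 12, "GENE_C": 3, "GENE_D": 1}

lit_set_1, lit_set_2 = preferential_sets(cooc_ntrk1, cooc_ntrk2)
print(sorted(lit_set_1), sorted(lit_set_2))
```

GENE_C co-occurs equally often with both receptors, so it lands in neither set.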

Text Mining Applications -- Hypothesizing NB Candidate Genes
  NTRK1/NTRK2 Preferentially Associated Genes in Literature
  • Lit. Set 1: NTRK1 Associated Genes (514)
  • Lit. Set 2: NTRK2 Associated Genes (157)

Text Mining Applications -- Hypothesizing NB Candidate Genes
  Microarray Expression Data Analysis and NTRK1/NTRK2 Associated Genes in Literature
  • NTRK1 Associated Genes (514) and Gene Set 1: NTRK1, NTRK2 (468): 18 genes in common
  • NTRK2 Associated Genes (157) and Gene Set 2: NTRK2, NTRK1 (283): 4 genes in common

Functional Pathway Analysis
  Determine gene enrichment score for six selected functional pathways (Ingenuity Pathway Analysis Tool Kit):
  • CD -- Cell Death;
  • CGP -- Cell Growth and Proliferation;
  • CCSI -- Cell-to-Cell Signaling and Interaction;
  • CM -- Cell Morphology;
  • NSDF -- Nervous System Development and Function;
  • CAO -- Cellular Assembly and Organization.

Hypergeometric Test P-values

Hypergeometric Test between Array and Overlap Groups
  Multiple-test corrected P-values (Bonferroni step-down)
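
For readers unfamiliar with the two statistics named here, the sketch below computes an upper-tail hypergeometric p-value and applies the Bonferroni step-down (Holm) correction. The function names and all numbers are illustrative assumptions; the slides do not give the group sizes or the exact test set-up.

```python
from math import comb


def hypergeom_pvalue(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which belong to the pathway."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def holm_step_down(pvalues):
    """Bonferroni step-down (Holm) adjusted p-values, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


# Toy case: 10 genes, 5 in the pathway, 2 selected, at least 1 in the pathway.
p = hypergeom_pvalue(1, 10, 5, 2)
adjusted = holm_step_down([0.01, 0.04, 0.03])
print(round(p, 4), adjusted)
```

With six pathways tested, the smallest raw p-value would be multiplied by six, the next by five, and so on.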

RT-PCR Experimental Validation
  11 out of 22 genes selected for RT-PCR validation:
  Symbol     Description
  CAMK4      calcium/calmodulin-dependent protein kinase IV
  VSNL1      visinin-like 1
  TBC1D8     TBC1 domain family, member 8 (with GRAM domain)
  RPS6KA1    ribosomal protein S6 kinase, 90 kDa, polypeptide 1
  EFNB3      ephrin-B3
  B3GAT1     beta-1,3-glucuronyltransferase 1 (glucuronosyltransferase P)
  GNAS       GNAS complex locus
  NEFH       neurofilament, heavy polypeptide 200 kDa
  INA        internexin neuronal intermediate filament protein, alpha
  NEFL       neurofilament, light polypeptide 68 kDa
  TYRO3      TYRO3 protein tyrosine kinase

RT-PCR Experimental Validation
  Figure time points: 0 hr, 1.5 hr, 4 hr, 12 hr

EFNB3 Discussion
  • EFNB3 (ephrin-B3) belongs to a family of ligands that bind to Eph family receptor tyrosine kinases
  • Implicated in axon guidance and vertebrate nervous system development
  • Exhibited growth-suppressive activity against NB cells in vitro
  • Preferentially and significantly associated with low tumor stage and favorable clinical outcomes in neuroblastoma primary tumors

RT-PCR Experimental Validation
  Figure time points: 0 hr, 1.5 hr, 4 hr, 12 hr

TYRO3 Discussion
  • Transmembrane receptor tyrosine kinase activated by GAS6, which has been shown to promote human fetal oligodendrocyte survival without proliferation
  • GAS6 may also contribute to cell adhesion and immune responses
  • Further study of GAS6/TYRO3 signaling is needed

Summary -- NB Application
  • Prioritized array-determined differentially expressed genes by integrating text mining results
  • The literature-based method enriched for functionally relevant genes, as shown by pathway analysis
  • RT-PCR experiments further validated the inferential power of text mining

Conclusion
  • Created a process for iteratively and precisely defining biomedical semantic types directly from literature
  • Developed automated entity extractors across genomic and phenotypic domains in malignancy with satisfactory accuracy rates
  • Applied this computational entity recognition and normalization process to all MEDLINE abstracts
  • Integrated text mining results with neuroblastoma experimental data to hypothesize candidate genes differentiating neuroblastoma subtypes

Future Directions
  • Increasing dimensions of the information matrix
  • Context-based normalization algorithm
  • Relation extraction with deeper semantic parsing

Acknowledgement
  • Penn BioIE Team: Dr. Mark Liberman, Dr. Mark Mandel, Dr. Ryan McDonald, Dr. Fernando Pereira, Annotator team
  • White Lab: Steve Carroll, Hawren Fang, Kevin Murphy
  • Brodeur Lab: Dr. Garrett Brodeur, Ms. Ruth Ho, Dr. Jane Minturn
  • CHOP NAP Core: Dr. Eric Rappaport
  • CHOP Bioinformatics Core: Dr. Xiaowu Gai, Dr. Jim Zhang