Statistical Approaches to Analysis of Traditional Chinese Medicine

Statistical Approaches to Analysis of Traditional Chinese Medicine Patient Records ChengXiang (“Cheng”) Zhai Department of Computer Science Affiliated with Carl R. Woese Institute for Genomic Biology Department of Statistics School of Information Sciences University of Illinois, Urbana-Champaign http://czhai.cs.illinois.edu, czhai@illinois.edu Distinguished Lecture in Causal Discovery, Univ. of Pittsburgh, May 18, 2017 1

Outline • Overview of TIMAN Group • Traditional Chinese Medicine (TCM) data analysis • KnowEnG Center and opportunities for collaboration 2

My Research Group: TIMAN (Text Information Management & Analysis) http://timan.cs.uiuc.edu Current: 11 Ph.D. students, 5 MS students, 2 undergraduates, 1 visiting scholar Alumni: 27 Ph.D. students, 40 MS students, 20 undergraduates Academia + Industry Funding: 3

Research Roadmap of TIMAN [diagram: Information Retrieval and Text Mining turn Big Raw Data (Biomedical Literature, Medical Records, Health Forums, …) into Small Relevant Data and then Decision Support for biomedical researchers, physicians, patients, …] We emphasize – Development of general and robust computational methods – Optimization of human-computer collaboration + computers help humans find patterns in data + humans train computers to make them intelligent 4

Our Work in Medical & Health Informatics How can we find similar medical cases in medical literature, in online forums, …? EHR (Patient Records) Medical Case Retrieval Similar Medical Cases Medical Knowledge Discovery How can we analyze EHR together with other knowledge bases to discover medical knowledge (e.g., effectiveness of drugs, ADRs) from EHR? Improved Health Care Medical Knowledge 5

Today’s Talk: Analysis of Traditional Chinese Medicine Patient Records • Disease Profiling (Wang et al. IEEE BIBM 2016) Probabilistic generative model • Patient Subcategorization (Huang et al. ACM BCB 2016) Improve clustering by exploiting domain knowledge • Survival Analysis (Huang et al. 2017, submission under review) Integration of EMRs and biomedical information network 6

Traditional Chinese medicine (TCM) • More than 2,500 years of Chinese medical practice • Complementary and alternative medicine • Knowledge is not well-documented – A significant challenge to inherit/use TCM knowledge – An interesting opportunity for data mining! Tu Youyou (屠呦呦): 2015 Nobel Prize in Physiology or Medicine (jointly with William C. Campbell and Satoshi Ōmura) for discovering artemisinin (青蒿素) 7

Vision of TCM Data Mining • TCM patient records = “experimental results” of herbs on patients – Potentially discover effective herbs for treating particular groups of patients – Effective chemical ingredients in effective herbs – Combination with western medicine + genomics + biomedical knowledge – Discover new medical knowledge & provide a scientific foundation for TCM • TCM philosophy: individualized experiments – Large space of empirical hypotheses explored 8

Collaboration with Beijing TCM Data Center • Data Sets: > 300,000 clinical cases – Collected EMRs since 2007 from six hospitals – Data stored in a clinical data warehouse • Key Collaborators: – Dr. Runshun Zhang, Dr. Jie Liu: Guang’anmen Hospital, Chinese Academy of Chinese Medical Sciences – Prof. Xuezhong Zhou, Department of Computer Science, Beijing Jiaotong University • Ph.D. students at UIUC: Sheng Wang, Edward Huang 9

Outline • Disease Profiling (Wang et al. IEEE BIBM 2016) Probabilistic generative model • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review) S. Wang, E. Huang, R. Zhang, X. Zhang, B. Liu, X. Zhou, C. Zhai, A Conditional Probabilistic Model for Joint Analysis of Symptoms, Diseases, and Herbs in Traditional Chinese Medicine Patient Records, IEEE BIBM 2016. 10

Problem definition • Input: TCM patient records (with symptoms, diseases, and herbs) • Output: disease profile: – Typical symptom groups associated with disease – Typical herb groups associated with disease – Correlations between symptom groups and herb groups 11

An example analysis of patient record 12

Challenge: mixed vocabulary • Each record may contain multiple diseases – Patient may have anemia, hypertension and chronic gastritis at the same time • The corresponding symptoms come from different diseases 13

Our solution • A conditional probabilistic topic model – Explicitly model each disease as one topic – Each disease has its own distribution of symptoms and herbs • Difference from topic models (e.g., PLSA/LDA) – Diseases as priors on topics – A topic with two coordinated subtopics (one for herbs and one for symptoms) • Difference from previous work on TCM analysis – Modeling asymmetric causal relations between diseases and {symptoms, herbs} 14

The Conditional Symptom-Herb Model • Observed patient record: Pr(s, h|t) = Σ_d Pr(d|t) Pr(s|d) Pr(h|d), summing over all diseases d of patient t – Pr(d|t): likelihood of patient t having disease d – Pr(s|d): typical symptoms of disease d – Pr(h|d): typical herbs of disease d • Optimization: find the Pr(s|d), Pr(h|d) and Pr(d|t) that maximize the probability of all the observed {Pr(s, h|t)} (EM algorithm) 15
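
The EM optimization on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the factorization Pr(s, h|t) = Σ_d Pr(d|t) Pr(s|d) Pr(h|d) with d restricted to the diseases listed in the record, and it treats every (symptom, herb) pair in a record as one observation. The function name, the record layout, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def fit_symptom_herb_model(records, n_iter=50):
    """EM sketch for Pr(s,h|t) = sum_d Pr(d|t) Pr(s|d) Pr(h|d).

    Assumes every record lists at least one disease, symptom, and herb.
    """
    symptoms = sorted({s for r in records for s in r["symptoms"]})
    herbs = sorted({h for r in records for h in r["herbs"]})
    diseases = sorted({d for r in records for d in r["diseases"]})

    # Uniform start; restricting d to each record's own diseases breaks the symmetry.
    p_s = {d: {s: 1.0 / len(symptoms) for s in symptoms} for d in diseases}
    p_h = {d: {h: 1.0 / len(herbs) for h in herbs} for d in diseases}
    p_d = [{d: 1.0 / len(r["diseases"]) for d in r["diseases"]} for r in records]

    for _ in range(n_iter):
        s_count = {d: defaultdict(float) for d in diseases}
        h_count = {d: defaultdict(float) for d in diseases}
        d_count = [defaultdict(float) for _ in records]
        for t, r in enumerate(records):
            for s, h in product(r["symptoms"], r["herbs"]):
                # E-step: posterior over the record's diseases for this (s, h) pair.
                weights = {d: p_d[t][d] * p_s[d][s] * p_h[d][h] for d in r["diseases"]}
                total = sum(weights.values())
                for d, w in weights.items():
                    q = w / total
                    s_count[d][s] += q
                    h_count[d][h] += q
                    d_count[t][d] += q
        # M-step: renormalize the expected counts.
        for d in diseases:
            s_total = sum(s_count[d].values())
            h_total = sum(h_count[d].values())
            p_s[d] = {s: s_count[d][s] / s_total for s in symptoms}
            p_h[d] = {h: h_count[d][h] / h_total for h in herbs}
        for t in range(len(records)):
            d_total = sum(d_count[t].values())
            p_d[t] = {d: c / d_total for d, c in d_count[t].items()}
    return p_s, p_h, p_d


# Hypothetical toy records; the third patient has both diseases ("mixed vocabulary").
toy_records = [
    {"diseases": ["gastritis"], "symptoms": ["abdominal pain"], "herbs": ["herb_A"]},
    {"diseases": ["hypertension"], "symptoms": ["headache"], "herbs": ["herb_B"]},
    {"diseases": ["gastritis", "hypertension"],
     "symptoms": ["abdominal pain", "headache"], "herbs": ["herb_A", "herb_B"]},
]
p_s, p_h, p_d = fit_symptom_herb_model(toy_records)
print(p_s["gastritis"])
```

On the toy data the single-disease records anchor each disease's profile, so the mixed record's symptoms and herbs are pulled toward the disease they co-occur with elsewhere.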

Evaluation: Dataset • 10,907 TCM patient records in digestive system treatment • 3,000 symptoms, 97 diseases and 652 herbs • Most frequently occurring disease: chronic gastritis • Most frequently occurring symptoms: abdominal pain and chills • Ground truth: 27,285 manually curated herb-symptom relationships 16

Output of the model 17

“Typical Symptoms” of 3 Diseases 18

“Typical Herbs” Prescribed for 3 Diseases 19

Algorithm-Recommended Herbs vs. Physician-Prescribed Herbs 20

Predict symptom-herb annotations 21

Herb-symptom relationships • Top 10 herb-symptom relationships identified by our method but not by frequent pattern mining 22

Summary of Disease Profiling • A new probabilistic model to analyze traditional Chinese medicine patient records – discovers meaningful TCM knowledge and outperforms previous work – can be used to develop a practically useful clinical decision-making system • Future Work – Build an application system (e.g., recommending herbs) – Analyze effectiveness of herbs 23

Outline • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) Improve clustering by exploiting domain knowledge • Survival Analysis (Huang et al. 2017, submission under review) E. Huang, S. Wang, R. Zhang, B. Liu, X. Zhou, C. Zhai, PaReCat: Patient Record Subcategorization for Precision Traditional Chinese Medicine, ACM BCB 2016. 24

Subcategorizations 25

Approach § Use both herb and symptom information for clustering § Challenge: data sparseness § Solution: Leverage domain knowledge (TCM dictionary); using an embedding approach 26

PaReCat Pipeline 27

Data Description § 2,276 medical records, manually annotated by doctors § 3 level-1 labels § 51 level-2 labels § 274 level-3 labels 28

Clustering § Agglomerative clustering § Cosine similarity as affinity measure § Clustered with level 2 labels as ground truth § Number of clusters = number of level 2 labels § Two feature types § Clustering using only symptoms § Clustering with both symptoms and herbs § Didn’t do clustering using only herbs (symptoms are assumed to be always available) 29
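
A small sketch of agglomerative clustering with cosine similarity as the affinity, in plain Python. The slide does not state the linkage criterion, so average linkage is an assumption here, and the binary symptom/herb vectors are hypothetical toy data rather than anything from the PaReCat data set.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; all-zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns one cluster label per vector."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [cosine_distance(vectors[i], vectors[j])
                         for i in clusters[a] for j in clusters[b]]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical records: columns are [symptom_1, symptom_2, herb_1, herb_2].
toy_vectors = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
print(agglomerative(toy_vectors, n_clusters=2))
```

Following the slide, the number of clusters would be set to the number of level 2 labels; the "symptoms only" variant simply drops the herb columns before clustering.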

Evaluation § Adjusted Rand Index (ARI) § Maximum score of 1.0 § Counts overlaps in contingency table § Adjusted for chance
ARI by clustering method and feature type:
Method          Symptoms   Symptoms + Herbs
k-means         0.0174     0.0770
Spectral        0.0653     0.0843
Agglomerative   0.1613     0.2717
PaReCat         0.1672     0.2754
30
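
For reference, the Adjusted Rand Index can be computed directly from the contingency table of ground-truth labels versus cluster labels. The sketch below uses the standard pair-counting formula; the function name and the toy labels are illustrative only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """ARI from pair counts in the contingency table (needs at least two items)."""
    n = len(truth)
    contingency = Counter(zip(truth, predicted))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (maximum - expected)


# Cluster ids are arbitrary, so a relabelled but identical partition scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because the index subtracts the agreement expected by chance, a clustering that is unrelated to the ground truth scores near 0 and can even go negative.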

Sample Clustering Results 31

Similar symptoms treated differently 32

Different symptoms treated similarly 33

Summary of Subcategorization • First study of subcategorization of TCM records • Improve clustering by using TCM dictionary • Experiment results show that using the TCM dictionary is beneficial • Future work: comparative analysis of similar subcategories & effectiveness of herbs 34

Outline • Disease Profiling (Wang et al. IEEE BIBM 2016) • Patient Subcategorization (Huang et al. ACM BCB 2016) • Survival Analysis (Huang et al. 2017, submission under review) Integration of EMRs and biomedical information network E. W. Huang, S. Wang, B. Li, R. Zhang, B. Liu, R. Zhang, J. Liu, X. Zhou, H. Lin, C. Zhai. HEMnet: Integration of Electronic Medical Records with Molecular Interaction Networks and Domain Knowledge for Survival Analysis, under review 35

Challenge in Analysis of EMR: Missing Data • Many data mining methods assume availability of all features – Assumption usually doesn’t hold for EMR – Example: doctors do not perform all medical tests on all patients • Mean imputation – Most common method to fill missing values – Introduces noise rather than reducing it 36

Challenge in Analysis of EMR: Semantic mismatch • Features that mean the same thing, but occupy different spaces in the vocabulary – Example: “hypertension” and “high blood pressure” • A similar problem in text mining can be solved with word2vec or other word embedding techniques, but such techniques (e.g., med2vec) are insufficient for EMR 37

Solution: External Knowledge • Idea: use external knowledge (expert curated, literature mined, etc.) to fill in missing context information – Molecular interaction networks, such as protein-protein interaction (PPI) networks (protein-protein edges) – Known drug targets (drug-protein edges) – Known drug-symptom relationships (drug-symptom edges) • But how can we incorporate all of this external information into EMRs? 38

HEMnet • Create a heterogeneous information network: HEterogeneous Electronic Medical network (HEMnet) • Use co-occurrence information to build network – Example: create edge between node d and node s if, for a patient, d is a drug that is prescribed to treat some symptom s. • With this co-occurrence network, we can add in the PPI network and other domain knowledge to create the HEMnet 39
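
The construction described on this slide can be illustrated with a typed adjacency structure. This is only a sketch under simplifying assumptions: a drug-symptom edge is added whenever a drug and a symptom appear in the same record (the slide's condition is that the drug was prescribed to treat the symptom), and the record layout, edge-type names, and example entities are hypothetical.

```python
from collections import defaultdict


def build_network(patient_records, external_edges):
    """Return an undirected typed graph: node -> set of (neighbor, edge_type)."""
    graph = defaultdict(set)

    def add_edge(u, v, edge_type):
        graph[u].add((v, edge_type))
        graph[v].add((u, edge_type))

    # Step 1: co-occurrence edges from the medical records.
    for record in patient_records:
        for drug in record["drugs"]:
            for symptom in record["symptoms"]:
                add_edge(("drug", drug), ("symptom", symptom), "co-occurrence")
    # Step 2: external knowledge, e.g. PPI and drug-target edges.
    for u, v, edge_type in external_edges:
        add_edge(u, v, edge_type)
    return graph


toy_records = [{"drugs": ["drug_X"], "symptoms": ["cough"]}]
toy_external = [
    (("drug", "drug_X"), ("protein", "P1"), "drug-protein"),
    (("protein", "P1"), ("protein", "P2"), "protein-protein"),
]
network = build_network(toy_records, toy_external)
print(sorted(network[("drug", "drug_X")]))
```

Keeping the node type in the node key and the edge type on each edge is what makes the network heterogeneous: downstream embedding methods can then treat paths of different types differently.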

Data Description • Lung cancer data set – 43 patients with squamous-cell lung carcinoma – 90 patients with non-squamous-cell lung carcinoma • 449 unique features (excluding proteins) • 133 patient records with high degree of sparsity – Each record missing an average of 407 out of 449 features – Patient record matrix: 133 x 449 matrix • Each record also contains the patient’s survival information – Number of days until death – If no death event, number of days until hospital discharge • HEMnet – 11,911 nodes (most of these are proteins) – 379,715 edges – 23 edge types 40

Using HEMnet to enrich EMRs • With the HEMnet, we now wish to fill in the missing “contexts” of the medical records • Use network embedding technique, ProSNet, which has been shown to be useful in protein function prediction – word2vec: in sentence, a word’s neighbors should predict the word’s context – ProSNet: in graph, a node’s neighbors should predict the node’s context • Get a low-dimensional vector for each node • Get a pairwise cosine similarity matrix of nodes • Multiply matrix into patient record matrix 41
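
The last three bullets amount to a matrix product: patient records (patients x features) times a feature-by-feature cosine similarity matrix built from the node vectors. The sketch below shows that step only; the embedding vectors are hand-written hypothetical values, not ProSNet output, and the function names are illustrative.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def enrich(records, embeddings):
    """records: patients x features; embeddings: one vector per feature, same order."""
    n = len(embeddings)
    # Pairwise cosine similarity between feature (node) vectors.
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    # Matrix product: each observed feature spreads weight to similar features.
    return [[sum(row[k] * sim[k][j] for k in range(n)) for j in range(n)] for row in records]


features = ["hypertension", "high blood pressure", "cough"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # hypothetical node vectors
toy_records = [[1, 0, 0]]  # one patient with only "hypertension" recorded
print(enrich(toy_records, toy_embeddings))
```

In the toy case the patient gains a value close to 1 for "high blood pressure" and nothing for "cough", which is how the multiplication addresses both missing values and semantic mismatch at once.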

ProSNet Network Embedding [Wang et al. 17] [diagram: heterogeneous network (entities E1–E5) as input → ProSNet dimensionality reduction → node/entity vector space (x, y, z) as output] • General: applicable to any heterogeneous network – e.g., gene, drug, disease network • Efficient: – 60K nodes and 10M edges: less than 30 minutes on a 12-core CPU 42

Path-based node similarity score [equation figure; annotated terms:] – Score of a path of type M that starts from node u and ends at node v – Global bias for path type M – Weights on different dimensions of Xv – Weights on different dimensions of Xu – Xu, Xv: compact node representations 43

Objective function • 44

Fast online optimization • Challenge: – The heterogeneous network may have millions of nodes and edges – Methods based on random walk with restart can only handle networks with ~10K nodes • Solution: Online learning – Randomly sample a path in the network every time – Only optimize nodes on this path – Sampling according to node degree – Inherently parallelized • A network with 60K nodes and 10M edges: – Less than 30 minutes on a 12-core CPU 45

Experiments • Compare performance of HEMnet-enriched patient record matrix with the baseline patient record matrix • Clustering experiment – Hospitals are interested in whether a patient will survive – With patient survival functions as ground truth, we want to separate patients into two groups – Thus, the best clustering is one in which the two clusters have the most different survival functions • Survival functions can be estimated with Kaplan-Meier curves, then compared with log-rank test • Low p-value means the two clusters have significantly different survival rates • So, the lower the p-value, the more successful the method was in separating relatively healthy patients from those with short survival times 46
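
The evaluation step (Kaplan-Meier estimation followed by a two-group log-rank test) can be written out in plain Python. This is a textbook sketch, not the code used in the study; the p-value uses the chi-square approximation with one degree of freedom, and the survival times below are hypothetical.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time; events[i] is 1 for death, 0 for censoring."""
    survival, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def logrank_p_value(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; p-value from a chi-square distribution with 1 df."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    observed_minus_expected, variance = 0.0, 0.0
    for t in sorted({t for t, e in zip(all_times, all_events) if e}):
        n_a = sum(1 for x in times_a if x >= t)        # at risk in group A
        n = n_a + sum(1 for x in times_b if x >= t)    # at risk overall
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d = d_a + sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = observed_minus_expected ** 2 / variance
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square, 1 df


# Hypothetical clusters: days until death (event=1) or discharge (event=0).
short_lived = ([30, 45, 60, 80, 100], [1, 1, 1, 1, 1])
long_lived = ([300, 320, 350, 400, 420], [1, 0, 1, 1, 0])
print(kaplan_meier(*short_lived))
print(logrank_p_value(*short_lived, *long_lived))
```

Patients discharged without a death event enter as censored observations (event = 0): they count toward the at-risk set up to their discharge day but never contribute a death.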

Squamous-Cell Lung Carcinoma Results With HEMnet, p-value = 0.001020 Baseline, p-value = 0.01267 47

Non-Squamous-Cell Lung Carcinoma Results With HEMnet, p-value = 0.003652 Baseline, p-value = 0.02333 48

Summary of Survival Analysis • HEMnet-enriched patient records can improve the performance of clustering over the baseline • For the baseline, neither cancer subtype was significantly separated at the 0.01 level, while both were with the HEMnet-enriched feature matrix • Therefore, the external information (PPI network and domain knowledge) is helpful in solving the issues associated with missing data and semantic mismatch 49

Summary: Analysis of Traditional Chinese Medicine Patient Records • Disease Profiling (Wang et al. IEEE BIBM 2016) Probabilistic generative model • Patient Subcategorization (Huang et al. ACM BCB 2016) Improve clustering by exploiting domain knowledge • Survival Analysis (Huang et al. 2017, submission under review) Integration of EMRs and biomedical information network 50

Outlook: TCM Data Mining • TCM patient records = “experimental results” of herbs on patients (we are just here) – Potentially discover effective herbs for treating particular groups of patients – Effective chemical ingredients in effective herbs – Combination with western medicine + genomics + biomedical knowledge – Discover new medical knowledge & provide a scientific foundation for TCM • Towards semi-automatic (even automatic) discovery of findings similar to artemisinin (青蒿素) 51

KnowEnG Center • KnowEnG = Knowledge Engine for Genomics – Cloud-based system for knowledge-guided analysis of genomics data. – User uploads a spreadsheet data set to KnowEnG and configures an analysis task using a Web interface. – After the analysis task is done, the user can view the results in an intuitive and user-friendly way. • Key benefit: – Making intelligent use of prior knowledge in the public domain. – Prior knowledge represented as a massive heterogeneous network called the Knowledge Network (including nearly 100 externally curated databases). 52

Heterogeneous Knowledge Network See https://hub.knoweng.org/app/site/knownet.html for more information 53

Pipeline 1: Sample Clustering Knowledge Network increases robustness of clustering 54

Sample Pipeline 2: Gene Prioritization • User uploads a spreadsheet of gene-level transcriptomic (or other omics) profiles of a collection of biological samples annotated with a numeric phenotype (e.g., drug response, patient survival, etc.) or categorical phenotype (e.g., cancer subtype, metastatic status) • This pipeline scores each gene by the correlation between its “omic” value (e.g., expression) and the phenotype, and reports the top phenotype-related genes. • Knowledge Network increases robustness of prioritization 55
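
The scoring step described here can be sketched as a correlation ranking. This is an illustration only, not KnowEnG code: Pearson correlation and ranking by absolute value are assumptions, the Knowledge Network component is omitted, and the gene names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def prioritize_genes(expression, phenotype, top_k=10):
    """Rank genes by |correlation| between expression and a numeric phenotype."""
    scores = {gene: abs(pearson(values, phenotype)) for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical spreadsheet: one row per gene, one column per sample.
toy_expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 1.0, 2.0, 1.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.5],
}
toy_drug_response = [0.1, 0.2, 0.3, 0.4]
print(prioritize_genes(toy_expression, toy_drug_response, top_k=2))
```

For a categorical phenotype such as cancer subtype, the correlation would be replaced by a group-comparison statistic; that variant is not shown here.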

Sample Pipeline 3: Gene Set Characterization • Given a set of genes, the pipeline tests the gene set for enrichment against a large compendium of annotations pre-loaded into KnowEnG (e.g., pathway, GO term) • Knowledge Network provides richer paths for association analysis 56
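
One common way to test a gene set for enrichment against an annotation is a one-sided hypergeometric test. The slide does not say which test KnowEnG uses, so the choice of test, the function name, and the gene identifiers below are assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(gene_set, annotated, universe_size):
    """P(overlap >= observed) if the gene set were drawn at random from the universe."""
    k = len(set(gene_set) & set(annotated))  # observed overlap
    n = len(set(gene_set))                   # size of the query gene set
    K = len(set(annotated))                  # genes carrying the annotation
    N = universe_size                        # all genes considered
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


# Hypothetical example: 3 query genes, all inside a 4-gene pathway, 100 genes overall.
query = ["g1", "g2", "g3"]
pathway = ["g1", "g2", "g3", "g4"]
print(enrichment_p_value(query, pathway, universe_size=100))
```

In practice the test is repeated for every annotation in the compendium, so the resulting p-values would need a multiple-testing correction before being reported.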

Forthcoming Pipeline 1: Literature Mining • Interactive literature mining with gene sets and term sets • Flexible selection of a set of literature articles for analysis • Given a set of terms and a set of candidate genes, rank genes for a particular term based on their associations in the literature • Potentially enriched with the Knowledge Network 57

Forthcoming Pipeline 2: Signature Analysis transcriptomic profiles Expression signature related to a phenotype 58

Forthcoming Pipeline 3: Phenotype Predictive Modeling 59

More information about KnowEnG can be found at https://hub.knoweng.org/features and https://knoweng.org 60

A few other projects • Medical case retrieval (JAMIA 2010) • Exploit clinical notes to improve prediction of onset of disease (ACM KDD 2012) • Discovery of adverse drug reactions (ADRs) from online health forums (ACM BCB 2014) • Text annotations, gene function prediction, …. 61

Extraction of Symptom Graphs from EHR (Patient Records) • Multi-Level Symptom Graphs • Predict the future onset of a disease (e.g., Congestive Heart Failure) for a patient • Discovery of symptom profiles of diseases • Discovered symptoms improve prediction accuracy by 10% (Work published in ACM KDD 2012) 62

Discovery of Adverse Drug Reactions from Forums Green: Disease symptoms Blue: Side effect symptoms Red: Drug Drug: Cefalexin ADR: panic attack, faint, … Sheng Wang et al. 2014. SideEffectPTM: an unsupervised topic model to mine adverse drug reactions from health forums. In ACM BCB 2014. 63

Sample ADRs Discovered
Drug (Freq) | Drug Use | Symptoms in Descending Order
Zoloft (84) | antidepressant | weigh gain, weight, depression, side effects, mgs, gain weight, anxiety, nausea, head, brain, pregnancy, pregnant, headaches, depressed, tired
Ativan (33) | anxiety disorders | Ativan, sleep, Seroquel, doc prescribed seroqual, raising blood sugar levels, anti-psychotic drug, diabetic, constipation, diabetes, 10 mg, benzo, addicted
Topamax (20) | anticonvulsant | Topmax, liver, side effects, migraines, headaches, weight, Topamax, pdoc, neurologist, supplement, sleep, fatigue, seizures, liver problems, kidney stones
Ephedrine (2) | stimulant | dizziness, stomach, Benadryl, dizzy, tired, lethargic, tapering, tremors, panic attach, head
Unreported to FDA
64

Thank You! Questions/Comments? Looking forward to opportunities to collaborate! 65

Computer-Aided Research (CAR) Public data/Info/ knowledge … … Public data/Info/ knowledge Network 1. Multi-level integration of data/info/knowledge 2. Multimode info access 5. Collaborative research 3. Research task support Personal data/info/ knowledge 4. Personalized CAR Personal data/info/ knowledge 66

Five Levels of Integration • Level 1: “Syntactic” integration of multiple sources – Scalable, robust, but minimum support for discovery • Level 2: Semantic integration (ontology) – Scalable, less robust, better support for discovery • Level 3: Synthesis of knowledge (entities, relations) – Less scalable, not robust, support for interactive discovery • Level 4: Synthesis of knowledge + Inference rules – Only applicable to a limited domain, but potentially support automatic discovery • Level 5: Specialized discovery model – Automatic hypothesis testing, but limited to a special discovery/prediction task 67

Multi-level support is needed because… • Knowledge extraction is far from 100% accurate (NLP is difficult) • Interpretation of knowledge is inherently context-sensitive and low-level support is needed for context and provenance • Automation-scalability tradeoff will not disappear (soon) • … 68

Automation-Scalability Tradeoff Automation of discovery Goal Specialized statistical prediction models Logic-based Inference systems “Beyond ontology” ER graph integration analysis engine Ontology-based semantic integration “Ontology-Free” integration Federated search engines Scalability/Generality 69

Interactive ER Graph Analysis • The extracted entities and relations form a weighted graph • Need to develop techniques to mine the graph for knowledge – Store graphs – Index graphs – Mining algorithms (neighbor finding, path finding, entity comparison, outlier detection, frequent subgraphs, …) – Mining language 70

Example of Interactive Graph Mining [graph figure: Behavior nodes B1–B4 and Gene nodes A1, A1’, A2, A3, A4, A4’, A5, linked by isa, co-occur (fly, mos, bee), orth (mos), and reg edges]
1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1’, A2, A3}
3. Y = Y + {A5, A6} → {A1, A1’, A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4’}
X = PathBetween({A4, A4’}, B4, {co-occur, reg, isa})
71
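
A neighbor-finding operator of the kind used in these queries can be sketched over a list of typed edges. The slide does not define the operator's semantics, so the signature below is a guess, and the toy graph is hypothetical (the original figure's edges cannot be recovered from the text).

```python
def neighbor_of(nodes, node_type, relations, edges, node_types):
    """Nodes of `node_type` linked to any node in `nodes` by an edge whose relation is in `relations`."""
    nodes = set(nodes)
    found = set()
    for u, relation, v in edges:
        if relation not in relations:
            continue
        for src, dst in ((u, v), (v, u)):  # edges are treated as undirected
            if src in nodes and dst not in nodes and node_types[dst] == node_type:
                found.add(dst)
    return found


# Hypothetical toy graph: (node, relation, node) triples plus a type for each node.
toy_edges = [
    ("B4", "isa", "B1"),
    ("B4", "co-occur", "B2"),
    ("B1", "co-occur", "A1"),
    ("B2", "co-occur", "A2"),
    ("A1", "orth", "A1'"),
    ("A1", "reg", "A4"),
]
toy_types = {"B1": "Behavior", "B2": "Behavior", "B4": "Behavior",
             "A1": "Gene", "A1'": "Gene", "A2": "Gene", "A4": "Gene"}

x = neighbor_of({"B4"}, "Behavior", {"co-occur", "isa"}, toy_edges, toy_types)
y = neighbor_of(x, "Gene", {"co-occur", "orth"}, toy_edges, toy_types)
z = neighbor_of(y, "Gene", {"reg"}, toy_edges, toy_types)
print(sorted(x), sorted(y), sorted(z))
```

Because each call returns a plain set, the analyst can edit intermediate results by hand between steps (as in step 3 of the slide) before feeding them into the next query.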

Inference-Based Discovery • Encode all kinds of knowledge in the same knowledge representation language • Perform logic inferences • Example:
Regulate(GeneA, GeneB, ContextC). [Text mining]
SeqSimilar(GeneA, GeneA’) [Sequence mining]
Regulate(X, Y, C) ← Regulate(Z, Y, C) & SeqSimilar(X, Z) [Human knowledge]
⇒ Regulate(GeneA’, GeneB, ContextC)
ADD: InPathway(GeneB, P1)
InPathway(X, P) ← Regulate(X, Y, C) & InPathway(Y, P) [Human knowledge]
⇒ InvolvedInPathway(GeneA’, P1)
72
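
The example on this slide can be reproduced with a tiny forward-chaining loop. This is a sketch with the two rules hard-coded, not a general inference engine; it assumes SeqSimilar is symmetric, reads the rules as "head if body", and uses InPathway for the derived pathway fact.

```python
def forward_chain(facts):
    """Apply the two rules until no new fact appears. Facts are tuples: (predicate, *args)."""
    facts = set(facts)
    while True:
        new = set()
        regulate = [f for f in facts if f[0] == "Regulate"]
        for _, z, y, c in regulate:
            # Rule 1: Regulate(X, Y, C) if Regulate(Z, Y, C) and SeqSimilar(X, Z).
            for f in facts:
                if f[0] == "SeqSimilar" and z in f[1:]:
                    x = f[2] if f[1] == z else f[1]  # SeqSimilar assumed symmetric
                    new.add(("Regulate", x, y, c))
        for _, x, y, _c in regulate:
            # Rule 2: InPathway(X, P) if Regulate(X, Y, C) and InPathway(Y, P).
            for f in facts:
                if f[0] == "InPathway" and f[1] == y:
                    new.add(("InPathway", x, f[2]))
        if new <= facts:
            return facts
        facts |= new


known = {
    ("Regulate", "GeneA", "GeneB", "ContextC"),  # from text mining
    ("SeqSimilar", "GeneA", "GeneA'"),           # from sequence mining
    ("InPathway", "GeneB", "P1"),                # added knowledge
}
for fact in sorted(forward_chain(known) - known):
    print(fact)
```

The loop stops at a fixed point, so facts derived in one pass (such as the new Regulate fact) can feed the other rule in the next pass.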

Integration of Expert Knowledge • How can we combine expert knowledge with knowledge extracted from literature? • Possible strategies: – Interactive mining (human knowledge is used to guide the next step of mining) – Inference-based integration – Trainable programs (focused miner, targeting a certain kind of knowledge) 73

A Possible System Architecture User Interface/ Workflow Manager Inference Engine User Modeling & Personalization Special Search & Navigation Information Retrieval Data/Info + Ontology Analysis Engine NLP Machine Learning Information Extraction ER Graph Mining Entities Relations Hypothesis Knowledge Base Expert Knowledge … NCBI Genome Databases 74