Beyond Genes, Proteins, and Abstracts: A Framework to Capture Scientific Claims
Catherine Blake
School of Information and Library Science
University of North Carolina at Chapel Hill
http://www.ils.unc.edu/~cablake
cablake@email.unc.edu

Motivation
• Relentless increase in electronically available text
  – Life Sciences
    • The NLM added the 17 millionth entry to PubMed in April 2007
    • 5,200 journals indexed
    • 12,000 new articles each week!
  – Chemistry: more than 110,000 articles in 1 year alone
• Consequences:
  – Hundreds of thousands of relevant articles
  – Implicit connections between literature go unnoticed
• Shift from Retrieval to Synthesis

Entity Extraction
• Newspaper genre
  – People, places, and organizations
  – Message Understanding Conference (MUC)
• Biomedical genre
  – Genes and proteins
  – Diseases and treatments
  – Chemical compounds
  – Challenges: BioCreative, GENIA, JNLPBA

Relationship Extraction
• Newspaper genre
  – Person moving from one company to another
• Biomedical genre
  – Genes and proteins, e.g. binds, inhibits
  – ARBITER (Rindflesch, Rajan, & Hunter, 2000)
  – GeneWays (Rzhetsky et al., 2004)
  – RelEx (Fundel, Kuffner, & Zimmer, 2007)

Causal Relationships
• Newspaper genre
  – Causal relationships (Khoo, Chan, & Niu, 1998)
• Biomedical genre
  – Causes and treats (Price & Delcambre, 2005)
  – Causal knowledge (Khoo, Chan, & Niu, 2000)
• Universal Grammar
  – Causatives (Comrie, 1974, 1981)

Claim Definition
• “To assert in the face of possible contradiction”
• Example sentence reporting a claim
  – “This study showed that Tamoxifen reduces the breast cancer risk”
• Example Claim Framework
  – Tamoxifen (agent)
  – reduces (change)
  – [breast cancer risk] (object)

Goals
• Create a Framework that reflects how claims are made in biomedical literature
• The Framework should
  – generalize beyond biomedicine
  – differentiate between different levels of confidence in the claim
  – consider claims made in the full text
• Populate the Framework automatically

The Claim Framework
• Information facets
  – concepts
  – change
  – basis of the claim
• Each information facet may have
  – modifiers
  – directionality
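
One way to picture the facets listed above is as a small record type. The sketch below is only an illustration, not the representation used in the study: the class names `Facet` and `Claim`, the field names, and the `"explicit"` category label on the example are all assumptions. The example values come from the Tamoxifen sentence on the Claim Definition slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Facet:
    """One information facet: its text plus an optional modifier and direction."""
    text: str
    modifier: Optional[str] = None
    direction: Optional[str] = None


@dataclass
class Claim:
    """A claim built from up to four facets (hypothetical layout)."""
    category: str                   # e.g. "explicit", "implicit", "correlation"
    agent: Optional[Facet] = None   # concept A
    obj: Optional[Facet] = None     # concept B
    change: Optional[Facet] = None
    basis: Optional[Facet] = None   # basis of the claim

    def facets(self) -> dict:
        """Return only the facets that are populated, keyed by field name."""
        names = ("agent", "obj", "change", "basis")
        return {n: getattr(self, n) for n in names if getattr(self, n) is not None}


# Tamoxifen example; the category label is assumed, the slide does not give one.
tamoxifen = Claim(
    category="explicit",
    agent=Facet("Tamoxifen"),
    change=Facet("reduces"),
    obj=Facet("breast cancer risk"),
)

print(sorted(tamoxifen.facets()))
```

Keeping every facet optional leaves room for categories that omit a facet, such as an observation with no agent.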

The Claim Framework
• Categories: 1. Explicit Claim, 2. Implicit Claim, 3. Correlation, 4. Comparison, 5. Observation
• Columns: Concept A (Agent), Concept B (Object), Nature of change, Claim Basis
• Cell values (row and column alignment not recoverable): Required, Optional, Agent, Object, Optional, Required, N/A, Required, Required, Optional

Explicit Claims
“Indeed, glycine prevented Wy-14643-stimulated superoxide production by Kupffer cells.”
• Claim 1
  – glycine (agent)
  – prevented (change)
  – [Wy-14643-stimulated superoxide production] (object)
• Claim 2
  – [Kupffer cells] (agent)
  – produces (change)

Implicit Claims
“In liver the number of peroxisomes increases from about 500-600/cell to >5000/cell after exposure to peroxisome proliferators.”
• Claim 1
  – [Peroxisome proliferators] (agent)
  – increases (change direction)
  – peroxisomes (object)
  – [in the liver] (agent modifier)

Correlations
“A weak but statistically significant correlation was observed between the plasma nm23-H1 level and the WBC count (Figure 1, n=102, r=0.437, P<0.0001)”
  – [plasma nm23-H1 level] (agent)
  – [WBC count] (object)
  – correlation (change)
  – [statistically significant] (change modifier)

Comparisons
“The plasma concentration of nm23-H1 was higher in patients with AML than in normal controls (P = .0001)”
• Claim 1
  – [plasma concentration of nm23-H1] (basis of claim)
  – [patients with AML] (agent)
  – higher (change direction)
  – [normal controls] (object)

Observations
“However, the plasma nm21-H1 protein level was increased in SML-M3 patients (P = .0002)”
• Claim 1
  – [nm21-H1 protein level] (object)
  – increased (change direction)
  – [SML-M3 patients] (object modifier)

Working Hypothesis 1
The Claim Framework reflects how a scientist communicates her findings
  – Full text documents randomly selected from biomedical literature will report findings using constructs within the Claim Framework
  – Human annotators will agree on facets within the Claim Framework
  – The Claim Framework will generalize to a variety of scientific literatures

Working Hypothesis 2
Facets within the Claim Framework can be populated automatically
  – The system will detect all claims identified by the human annotators (i.e., recall)
  – The system will only identify claims that were identified by the human annotators (i.e., precision)
  – The system design will generalize to new literatures by avoiding domain-specific constructs
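
The recall and precision criteria above can be made concrete with a short calculation. This is a minimal sketch, assuming a system claim counts as correct only when it exactly matches an annotated claim; the function name `precision_recall` and the toy claim tuples are hypothetical and are not data from the study.

```python
def precision_recall(system_claims, gold_claims):
    """Score system output against human annotations using exact set matching."""
    system, gold = set(system_claims), set(gold_claims)
    true_positives = len(system & gold)
    # Precision: share of system claims that the annotators also identified.
    precision = true_positives / len(system) if system else 0.0
    # Recall: share of annotated claims that the system detected.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy (agent, change, object) triples, invented for illustration only.
gold = {
    ("a1", "increases", "o1"),
    ("a2", "reduces", "o2"),
    ("a3", "prevents", "o3"),
    ("a4", "inhibits", "o4"),
}
system = {
    ("a1", "increases", "o1"),
    ("a2", "reduces", "o2"),
    ("a5", "binds", "o5"),
}

precision, recall = precision_recall(system, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is the strictest choice; a real evaluation might instead score each facet separately or allow partial overlap of the annotated spans.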

Validating the Claim Framework
• Draft Claim Framework given to two annotators
• Pilot Study: Identify every claim
  – Include claims that don’t conform to the framework
  – Don’t consider how this will be automated

Validating the Claim Framework
• Main study
  – 25 articles
• Verification
  – Random set of sentences annotated twice
  – Feedback provided daily

Results
• All documents
  – Total number of sentences: 5535
  – Sentences with >=1 claim: 1250 (22.6%)
  – Total number of claims: 3228
  – Average claims per sentence: 2.51
  – Claims that did not fit in the Framework: 31
• Per document
  – Average number of sentences: 191
  – Average number of sentences with >=1 claim: 43

Distribution of Claim Categories
Category      Total   (%)     Pilot   (%)     Main    (%)
Explicit      2489    77.11   332     83.42   2157    76.63
Implicit        87     2.70     3      0.75     84     2.98
Observation    298     9.23    24      6.03    274     9.73
Correlation    174     5.39    12      3.02    162     5.75
Comparison     165     5.11    27      6.85    138     4.9
Total         3228   100      398    100      2830   100

Annotations (All Documents)
Annotation          Total   (%)     Words   (Avg)
Agent               2894    89.65    5221   1.80
Agent Direction      285     8.83     291   1.02
Agent Modifier      1246    38.60    4448   3.57
Object              3197    99.04    6849   2.14
Object Direction     271     8.40     283   1.04
Object Modifier     1561    48.36    5383   3.44
Change              1897    58.77    1953   1.03
Change Direction    1337    41.42    1358   1.02
Change Modifier     1147    35.53    1618   1.41
Claim Basis          165     5.11     394   2.39
Claim Basis Dir.      42     1.30      43   1.02
Claim Basis Mod.      86     2.66     266   3.09
Total               3228            28107   8.70

Inter Annotator Agreement
Information Facet    Kappa   Agreement
Agent                0.71    substantial
Object               0.77    substantial
Change               0.57    moderate
Change+ChangeDir     0.88    almost perfect
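
For readers unfamiliar with kappa, the sketch below shows how such a score can be computed and mapped to a verbal label. The slide does not say which kappa statistic was used, so Cohen's kappa for two annotators is an assumption here; the verbal bands follow the commonly used Landis and Koch scale, and the label sequences are invented toy data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch verbal agreement band."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy data: does each of eight tokens belong to the agent facet?
annotator_1 = ["y", "y", "y", "y", "n", "n", "n", "n"]
annotator_2 = ["y", "y", "y", "n", "n", "n", "n", "y"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2), landis_koch(kappa))

# Kappa values reported on the slide, mapped to their verbal bands.
for facet, value in [("Agent", 0.71), ("Object", 0.77), ("Change", 0.57), ("Change+ChangeDir", 0.88)]:
    print(facet, value, landis_koch(value))
```

Under this scale the reported values fall into the same bands the slide lists: substantial for Agent and Object, moderate for Change, and almost perfect for Change+ChangeDir.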

Location of Claims
Section        Sentences    Total        % section   % claim
               with claim   sentences
Abstract          98          309         31.72        7.84
Introduction     357          979         36.47       28.56
Method             6         1121          0.54        0.48
Result           293         1829         16.02       23.44
Discussion       539         1406         38.34       43.12
Total                                                100.0

Findings thus far
• 99% of the claims made in these articles could be captured in the Claim Framework
• 22% of sentences report at least 1 claim
• 77% of the claims identified were explicit
• 8% of claims are made in the abstract
• Inter-annotator agreement was substantial for agents and objects
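
The headline percentages above can be recomputed from the counts reported on the Results, Distribution of Claim Categories, and Location of Claims slides. The snippet below is a simple arithmetic check; the variable names are illustrative, and reading the 8% abstract figure as 98 of the 1,250 claim-bearing sentences is an interpretation of the "% claim" column rather than something the slides state outright.

```python
# Counts taken from the slides.
total_sentences = 5535
sentences_with_claim = 1250
total_claims = 3228
claims_outside_framework = 31
explicit_claims = 2489
abstract_sentences_with_claim = 98


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


coverage = pct(total_claims - claims_outside_framework, total_claims)
claim_sentence_rate = pct(sentences_with_claim, total_sentences)
explicit_share = pct(explicit_claims, total_claims)
abstract_share = pct(abstract_sentences_with_claim, sentences_with_claim)

print(f"claims captured by the Framework: {coverage:.2f}%")
print(f"sentences with at least 1 claim:  {claim_sentence_rate:.1f}%")
print(f"explicit claims:                  {explicit_share:.2f}%")
print(f"claim sentences in the abstract:  {abstract_share:.2f}%")
```

Each value rounds to the figure quoted in the bullet list, so the summary is consistent with the underlying counts.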

Acknowledgements
• This project was supported in part by
  – Renaissance Computing Institute (RENCI) Faculty Fellowship Program
  – NSF Center for Environmentally Responsible Solvents and Processes (CERSP CHE 9876674)
• This project used resources provided by
  – the OSG, which is supported by the NSF & the U.S. Department of Energy's Office of Science
• The speaker thanks
  – Nassib Nassar and Mats Rynge (RENCI)
  – Amol Bapat and Ryan Jones (SILS)

Questions and Comments Welcome
Catherine Blake
cablake@email.unc.edu
http://www.ils.unc.edu/~cablake
School of Information and Library Science
University of North Carolina at Chapel Hill