Challenges of Developing a Global Alerting System
American Chemical Society National Meeting - Chicago
Symposium Honoring Gary Wiggins
March 25th, 2007
Leah Sandvoss, Information Scientist
Honors
- Chemical Informatics program initiation
- Mentorship
- Independent study on Chemical Information

Outline
- Definitions
- Overview of Global Alerting System
- Business Analysis – pre-project
- Information Retrieval
- Metadata
- Users
- Project limitations
- Lessons Learned
Definitions
- Alerts / Selective Dissemination of Information (SDI) / Current Awareness – a stored search strategy that is run periodically against a database to return any newly added results to the end-user
- Information Retrieval – the systematic storage and recovery of data, as from a file or database
- Knowledge Management – a range of practices used by organizations to identify, create, represent, and distribute knowledge for reuse, awareness, and learning across the organization
- Business Analysis – the act of gathering business issues and needs and translating them into a form that can be given to the appropriate people to form solutions
- Metadata – data about data
  - “Zip Code” is the metadata for the piece of data “92121”
  - “Abstract” is the metadata for the actual abstract text
- Unstructured data – data whose structure is not readily machine readable
- Controlled vocabulary – a carefully selected list of words and phrases used to tag units of information to make them more retrievable by a search
  - cancer vs neoplasm
Sources: Chicago Manual Style (CMS): information retrieval. Dictionary.com Unabridged (v 1.1). Random House, Inc. http://dictionary.reference.com/browse/information http://dictionary.reference.com/browse/information retrieval (accessed: February 21, 2007). “Consulting Skills for Business Analysis” course by Watermark Learning
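The alert/SDI definition can be illustrated with a minimal Python sketch: a stored query is re-run against a growing record set and only records not delivered before are returned. The `Alert` class, the keyword-in-title matching, and the toy records are all assumptions made for illustration; real vendor systems use their own query syntax and update mechanisms.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """A stored search strategy plus the record IDs it has already delivered."""
    name: str
    query: str  # hypothetical: a single keyword, not real vendor query syntax
    seen_ids: set = field(default_factory=set)

    def run(self, database):
        """Return records that match the query and were not delivered before."""
        hits = [r for r in database if self.query.lower() in r["title"].lower()]
        new_hits = [r for r in hits if r["id"] not in self.seen_ids]
        self.seen_ids.update(r["id"] for r in new_hits)
        return new_hits


# Toy "database" that grows between runs (illustrative records only).
database = [{"id": 1, "title": "Headache treatment review"}]
alert = Alert(name="headache-sdi", query="headache")

first_run = alert.run(database)  # first run delivers every current match
database.append({"id": 2, "title": "Cluster headache syndrome"})
database.append({"id": 3, "title": "Benzene toxicity"})
second_run = alert.run(database)  # later run delivers only newly added matches

print([r["id"] for r in first_run])
print([r["id"] for r in second_run])
```

Tracking delivered IDs is only one way to detect "newly added" results; filtering on a database update date would be another.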
Knowledge Management
- Each company in the healthcare and pharmaceutical sector spent an average of US$274,000 per annum on knowledge management over the past three years (as of 2001)
Source: Dyer, G. and McDonough, B. (2001) Vertical targets for knowledge management vendors. International Data Corporation. Document No. 25535
State of Biomedical Literature Mining
Source: Jensen, L. J., Saric, J., Bork, P. Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics, vol 7, February 2006
Current Awareness System
- Common platform to deliver many types of information, providing a common process for inserting the information
- Compiles results from multiple information retrieval systems
- Allows for the collection, review, analysis, and summarization of information types
- Export capabilities
- Uses controlled vocabularies
- Provides structured, actionable information
[Diagram labels: News, Patents, Books, Portal, Literature, Integration layer, Internal, TOCs, Key Op Leaders]
Business Analysis
Business Analysis
- A team was formed in 2002 to look at key information products available to the end-user, termed “value-added products”
- The first focus was on the products providing alerts/SDIs
  - Used an online alert survey from an existing internal system to identify user needs (approximately 230 total respondents)
  - Held discussions among team members about workflow, based on their experience with customer needs
  - Conducted a per-month cost comparison of various alerting services
- Several products provided overlapping information, resulting in duplication of effort among the information scientists
- Recommendations were made for a future system and workflow for managing alerts

Business Analysis
- In late 2004, a project team was formed to develop a tool to manage alerts as well as search results
- The first goal was to provide a repository to manage content. The tool would:
  - Allow information scientists to contribute, manage, and disseminate content
  - Replace existing current awareness products
  - Provide automation where possible to save the information scientists time
- The second goal was to develop a tool to display content to the end-user in one interface with a common format
- An environmental scan was performed, but the decision was made to develop the product in-house
  - In-house development would facilitate incorporating internal content
- The focus of this talk is on the repository development
Information Retrieval
Information Retrieval – Licensing Issues
- Needed to determine what was “in scope” for existing contracts
- Copyright restriction questions
  - Are vendor-produced abstracts subject to copyright restrictions?
  - Does it violate copyright to redistribute results?
  - Can the full text of an article be used and classified?
  - Can a screen scrape be performed on an HTML page?
  - Can complete citation(s) be stored? If not, what fields can be stored (i.e., a unique identifier so that the user can get back to the complete citation)?
  - For stored content, is there an expiration date?
- Results options
  - Format: XML, plain text, HTML, etc.
  - Transfer type: sFTP, e-mail, HTML view
Unstructured or Semi-structured Data – BRS/Tagged
UI 92158846
TI Cluster headache syndrome. Ways to abort or ward off attacks. [Review]
AU Marks DR. Rapoport AM
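A tagged record like the one on this slide could be turned into structured fields with a small parser. The Python sketch below is illustrative only: it assumes each field starts on its own line with a two-letter uppercase tag followed by whitespace, and that indented lines continue the previous field. The line breaks in `RECORD` and the function name `parse_tagged_record` are assumptions, not the parser used in the project.

```python
import re

# Record text from the slide; the placement of line breaks is an assumption.
RECORD = """UI 92158846
TI Cluster headache syndrome. Ways to abort or ward off attacks. [Review]
AU Marks DR. Rapoport AM"""

# A field line: two uppercase letters at the start of the line, then the value.
TAG_LINE = re.compile(r"^([A-Z]{2})\s+(.*)$")


def parse_tagged_record(text):
    """Map each two-letter tag to its value, joining continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current and line.strip():
            # Assumed convention: a line without a leading tag continues the field.
            fields[current] += " " + line.strip()
    return fields


record = parse_tagged_record(RECORD)
print(record["UI"])
print(record["AU"])
```

A continuation line that happens to begin with two capital letters and a space would be misread as a new field, so a production parser would need the vendor's actual tag list.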
Unstructured or Semi-structured Data – HTML
Unstructured or Semi-structured Data – TOC
Information Retrieval – Process
[Diagram labels: Supported Commercial Databases; Supported News Sources; Other Sources; Information Scientists; Rules applied for strategy setup and delivery; E-mail inbox; Parsers applied; Repository]
Metadata
Source System Metadata
- Source system: a defined set of metadata to which the vendor tags would map
- To define core fields, looked at the range of fields provided by all the databases of interest from the different vendors
- Used Dublin Core fields where applicable
Core fields: Abstract; Author; Classification; Date Granted; Device Manufacturer; Device Trade Name; Diseases; Edition Subset; External Reference ID; Gene Info; Keywords; Language; Literature Title; Literature Type; Location; Methods & Equipment; Miscellaneous; Molecular Sequence; Molecular Source Number; Open URL issn; Open URL issue; Organism Info; Patent Assignee; Patent Class; Patent Country; Patent Number; Personal Name as subject; Publish Date; Publisher; Sequence Data; Space Flight Mission; Subject Heading; Title; TOC Categories
Metadata Mapping
- Minimum set of fields from a database standpoint
- Exclude fields not used for search or retrieval (ex: Item URL, Locally Held, Local Messages, Record Owner, Update Code, Notes, Order Number, Price, Abbreviated Source, Reprint Address, etc.)
- Manual process by subject matter experts (information scientists)

Database Name | Database Tag Name | Target Metadata Field Name
Biosis Previews | Concept Code | Classification
Derwent World Patents Index | Title Index Terms and Additional Words | Keywords
Derwent World Patents Index | Derwent Accession Number | External Reference ID
SciSearch | Cited Work | Cited Reference
CAB Abstracts | Organism Descriptors | Organism Info
Medline | Country of Publication | Location
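The mapping rows on this slide can be expressed as a lookup table keyed by database and vendor tag. The Python sketch below is an illustration under assumptions: the mapping pairs and excluded fields come from the slide, while the function name `map_record`, the sample record values, and the "Grant Number" tag are hypothetical. Unmapped tags are reported rather than dropped, which reflects the point that each new vendor file needs a manual mapping pass.

```python
# (database name, vendor tag name) -> target metadata field, from the slide's table.
FIELD_MAP = {
    ("Biosis Previews", "Concept Code"): "Classification",
    ("Derwent World Patents Index", "Title Index Terms and Additional Words"): "Keywords",
    ("Derwent World Patents Index", "Derwent Accession Number"): "External Reference ID",
    ("SciSearch", "Cited Work"): "Cited Reference",
    ("CAB Abstracts", "Organism Descriptors"): "Organism Info",
    ("Medline", "Country of Publication"): "Location",
}

# Vendor fields the slide lists as excluded (not used for search or retrieval).
EXCLUDED_TAGS = {
    "Item URL", "Locally Held", "Local Messages", "Record Owner", "Update Code",
    "Notes", "Order Number", "Price", "Abbreviated Source", "Reprint Address",
}


def map_record(database, vendor_record):
    """Return (mapped fields, vendor tags that still need a manual mapping)."""
    mapped, unmapped = {}, []
    for tag, value in vendor_record.items():
        if tag in EXCLUDED_TAGS:
            continue  # deliberately not carried into the repository
        target = FIELD_MAP.get((database, tag))
        if target is None:
            unmapped.append(tag)
        else:
            mapped[target] = value
    return mapped, unmapped


# Hypothetical vendor record; the values and the "Grant Number" tag are made up.
mapped, unmapped = map_record(
    "Medline",
    {"Country of Publication": "United States", "Price": "n/a", "Grant Number": "X"},
)
print(mapped)
print(unmapped)
```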
Metadata – Content Objects
- Content objects were defined to differentiate content types on the backend
  - Each contained unique metadata as well as overlapping metadata
- Choices for end-user interface
[Figure: content objects]
Metadata – Controlled Vocabulary
- Controlled terms enhance search and retrieval capability
  - Terms are selected by the user (information scientist) for tagging content items
  - Use preferred term, then list of synonyms
  - Standard terminology lists as pick lists (ex: therapeutic area, disease)
- Authoritative sources were used to determine appropriate values
  - Internal vocabularies
  - National Library of Medicine Medical Subject Headings (MeSH)
  - Medical Dictionary for Regulatory Activities (MedDRA)
[Figure labels, in extraction order: Authoritative Classifications; MeSH Metathesaurus; Cyclohexatriene; MedDRA; Benzene; Internal; Benzene; Repository]
Figure source: DATAFUSION, Inc, copyright 1999
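The "preferred term, then list of synonyms" idea can be sketched in Python as a reverse lookup that normalizes any known term to its preferred form before tagging or searching. The vocabulary entries reuse terms that appear in this deck (cancer/neoplasm, cyclohexatriene/benzene), but which term is treated as preferred is an assumption here, not a statement about MeSH or MedDRA, and the function names are hypothetical.

```python
# Illustrative vocabulary: preferred term -> synonyms.
# Which term is "preferred" is an assumption, not taken from MeSH or MedDRA.
VOCABULARY = {
    "neoplasm": ["cancer"],
    "benzene": ["cyclohexatriene"],
}

# Reverse index: every known term (lowercased) -> its preferred term.
LOOKUP = {}
for preferred, synonyms in VOCABULARY.items():
    LOOKUP[preferred.lower()] = preferred
    for synonym in synonyms:
        LOOKUP[synonym.lower()] = preferred


def normalize(term):
    """Return the preferred term, or None if the term is not in the vocabulary."""
    return LOOKUP.get(term.strip().lower())


def tag_item(item, terms):
    """Attach preferred-term tags to a content item, ignoring unknown terms."""
    tags = sorted({normalize(t) for t in terms} - {None})
    return {**item, "tags": tags}


def search(items, term):
    """Find items tagged with the preferred form of the search term."""
    preferred = normalize(term)
    return [i for i in items if preferred is not None and preferred in i["tags"]]


items = [
    tag_item({"id": 1}, ["Cancer"]),
    tag_item({"id": 2}, ["Cyclohexatriene", "benzene"]),
]
print([i["id"] for i in search(items, "neoplasm")])
print([i["id"] for i in search(items, "Benzene")])
```

Because both tagging and searching go through `normalize`, a search on either the synonym or the preferred term retrieves the same items.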
Users
Users
- Information Scientists
  - Set up alert strategies in vendor databases as well as in the source system repository
  - Involved in interactive sessions with the tool to discuss content needs and find bugs in the system
- End-users
  - Used the portal, which displayed content
  - Involved early on in the initial requirements gathering, then engaged by the information scientists to test the tool
Project Limitations
Project Limitations – Source System
- For every new vendor file/database that needed to be added to the system, a manual mapping from the vendor database fields to the target metadata had to be performed
- Repository interface was cumbersome
  - Setting up a strategy was quite time-consuming, as there was no auto-population of data
  - Opening new windows within the system was quite slow
- A new version of the source system arrived mid-project
- An approver role was required to allow an alert strategy to be set up
- The system did not provide robust Boolean searching at the time
- Only had one expert on the source system

Project Limitations – Organizational
- Key reasons why projects fail:
  - Inattentiveness to organizational change
  - Sponsorship is lost or changes
  - Lack of budget/resources
Other Factors
- Project team leaders and members changed several times throughout the life of the project
- Other applications identified to integrate into the solution were also “new” or in development
- IT resources not well supported
NO-GO decision was made near production

Lessons Learned
- For a multi-year project:
  - Manage change
    - Knowledge transfer
    - Sustain momentum
  - Sustain business sponsorship
  - Plan the budget carefully
  - Involve influencing parties (vendors/publishers) early
- Current awareness system:
  - Portal concept well-supported by end-users
    - Flexibility on their part to manage alerts
    - Integrated several different content types
  - Common workflow supported by information scientists

Summary
- Knowledge Management is a continuous challenge
- A need still exists for a global current awareness system
Follow-up plans
- Currently evaluating commercially available products
- Internal efforts to filter, consolidate, and analyze content for customers
Acknowledgements
Ajit Acharya; Amy Tellez-Karsten; Andrew Horgan; Angela Liu; Angelika Wendler-Awasthi; Ann Young; Barb Miller; Barbara Breen; Beverly Kucharski; Bill Gillick; Bob Berger; Bryon Tilley; Cara Evans; Chandra Aitha; Chris Duhl, West Pole; Christina Carr; Christina Keil; Christine Ng; Claire Hogikyan; Clare Challenger; Cleazoe Malek; Dan Cooney, West Pole; David Walsh; Ed Pelic; Elaine Logan; Emory Emrich; Fradwin Marmol; Francis Di Bella; Getu Diro; Hennie Oswald; Ian Parsons; Iradj Reza; Jan Carr; Janet Smith; Jill Maddox; Julie Grannis; Karen Erani; Karl Royer; Kathy Cornish; Kathy VanLeeuwen; Ken Drake; Kevin Ogborne; Kim Johnson; Kirsten Kliwinski; Leah Sandvoss; Maheshkar Porandla; Mark Mitchell; Mary Skousen; Michele Wang; Michele Wolfe; Murali Nandula; Nathaniel Dunford; Nicola Cooper; Pam Kubiak; Pat Burke; Penny Miller; Peter Dresslar, Metamatics; Pragati Mithal; Raj Dandamudi; Ravneesh Sachdev; Rich Steel; Richard Nicholas; Rob Exposito; Rob Purdue; Robert Linde; Shuntai Wang; Simona Hendl; Srilekha Komma, Keane, Inc; Susan Suchetta; Suzan Quick, West Pole; Thomas Knowles; Veronica Trimble; Vishal Kumar
Thanks
Backup Slides – Requirements from VAP team
- Developed requirements. Key documents included:
  - A proposed “alert service model”, which included questions regarding alert gadget data entry, working with clients, and ROI metrics
  - A list of roles and responsibilities of stakeholders involved in the “Global alerting process”, including IM Colleagues and Pfizer Colleagues
  - A detailed description of the requirements for a future alerting system, including requirements for processing and managing alerts, archiving/distribution/retention issues, and delivery and service to clients
  - A list of the various types of alerts currently used within IM and the locations at which they are run
  - A process model that describes how an end-user might look for and subscribe to alerts, together with the process an IM colleague would use when setting up alerts
  - A “client need” summary, provided from the IM perspective
Backup – Dublin Core Metadata Elements
- Contributor
- Coverage
- Creator
- Date
- Description
- Format
- Resource Identifier
- Language
- Publisher
- Relation
- Rights Management
- Source
- Subject and Keywords
- Title
- Resource Type