PubMed Central and the NLM Journal Archiving Vocabulary
NATIONAL LIBRARY OF MEDICINE
What is PubMed Central?
• Digital archive of life sciences journals
  • Includes health policy, bioinformatics, and other fields
• Participation is voluntary and limited to journals that:
  • are covered by a major abstracting/indexing service, or
  • have 3 editorial board members with current grants from major nonprofit funding agencies
• Journals deposit an authoritative electronic copy that must meet PMC data quality standards
• Deposits are permanent
• Copyright is retained by the publisher or author
Access to PMC Content
• Free access to full-text articles and supporting data
  • Not necessarily open access
• A journal may delay free access to its content
  • Research articles are generally free in a year or less
• Full-text searching in PMC
• Citations for all articles included in PubMed
• Fully integrated with other Entrez databases – sequence data, taxonomy, books, etc.
Why???
Why Free?
• The more eyes the better
• Readers provide another level of quality control
Why XML?
• Preserves the structure of an article
• Lends itself to intelligent processing
• Human readable – not dependent on technology
• Portable
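As a rough illustration of what "preserves structure" and "lends itself to intelligent processing" can mean in practice, the sketch below pulls an outline out of a small XML article fragment using Python's standard library. The fragment and its element names are assumptions made for this example; they are not taken from the slides and may not match any released PMC or NLM DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical article fragment. Element names are illustrative only and
# are not guaranteed to match any released PMC or NLM DTD.
ARTICLE_XML = """
<article>
  <front>
    <article-title>Sample Article</article-title>
  </front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
    <sec><title>Methods</title><p>Second paragraph.</p><p>Third paragraph.</p></sec>
  </body>
</article>
""".strip()


def outline(xml_text):
    """Return (article title, [(section title, paragraph count), ...])."""
    root = ET.fromstring(xml_text)
    title = root.findtext("front/article-title")
    sections = [
        (sec.findtext("title"), len(sec.findall("p")))
        for sec in root.findall("body/sec")
    ]
    return title, sections


title, sections = outline(ARTICLE_XML)
print(title)
for name, count in sections:
    print(f"  {name}: {count} paragraph(s)")
```

Because the sections and paragraphs are explicit elements rather than visual formatting, the same few lines would work on any article tagged the same way.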
PMC Workflow
PubMed Central DTD History
pmc-1.dtd
• DTD currently in production (but not for long).
• Derived from keton.dtd and BMC article.dtd.
• Designed to be a simple DTD for online display and archive.
• Written with samples from PNAS, MBC, and BMC.
Why a new DTD?
• Elements/attributes had to be added to accommodate new journals.
• DTD would become cumbersome quickly if we had to keep making changes for each new title.
• Original “simplicity” of design would lead to confusing data structures as the DTD expanded.
• Moved away from standard XML practices to accommodate source SGML.
• Needed an independent review.
The Reviewers
Mulberry Technologies, Inc.
• An electronic publishing consultancy specializing in SGML- and XML-based systems.
• Has been active in SGML since 1984 and in XML since 1996.
• Has extensive experience in the development and maintenance of SGML and XML applications for STM publishers.
The Task
• Review the pmc-1.dtd for XML best practices, applicability to archive and online retrieval use, and completeness in application to STM journals.
• Create an updated version of the DTD.
• Document the new DTD.
The Results
pmc-2.dtd
Mulberry’s Suggestions
• Create two DTDs:
  • one for archiving, to allow us to convert data from multiple sources to our DTD.
  • a subset for authoring, to allow us to retain some control when publishers create articles to the DTD.
• Use proven solutions like XLink and the XHTML table standard.
• Use data models to simplify the DTD.
Harvard E-Journal Archiving Project
• The Mellon Foundation funded the Harvard Library to study the feasibility of using one DTD for archiving journal articles.
• Harvard commissioned Inera, Inc. for the E-Journal Archive DTD Feasibility Study.
• Conclusion: yes, it is feasible, but the right DTD does not exist.
• A meeting was held in April 2002 to discuss the changes needed to the PMC 2 DTD to expand its range to include almost any journal.
• Attendees included PMC, Mulberry Technologies, Inc. (consultant to PMC), The Mellon Foundation, The Harvard Library, and Inera (consultant to Harvard-Mellon).
Conclusions
1. PMC and Harvard-Mellon had different ideas about what the DTD should do.
   • Harvard was interested in an Interchange DTD, which would allow publishers to submit in multiple formats, all of which would be valid.
   • PMC was interested in an Archive DTD, which would be open enough to allow conversion of multiple sources into a single format.
2. If the PMC 2 DTD were modularized and some pieces were added (like the OASIS table model), many DTDs could be built using the same elements, giving both flexibility and consistency.
Status
• The “NLM Archiving and Interchange DTD Suite” has been created and released. Mulberry and Inera analyzed hundreds of journals across subjects to ensure that the DTD Suite was powerful enough to tag them.
• The “NLM Journal Archiving DTD” and the “Journal Publishing DTD” have been created from the DTD Suite. The Archiving DTD and the Suite were circulated through Mulberry’s and Inera’s contacts in the electronic publishing world for comments and suggestions. Suggestions that made the DTD more usable were incorporated.
Archiving / Publishing DTDs
• PLoS is using the DTD for its journals
• TechBooks is using the Journal Publishing DTD to send PMC content for J. Athletic Training and is using the DTD for internal journal production
• HighWire Press will use the DTDs for its content
• Atypon
• JSTOR will use the DTD for its E-Journal Archive
• CSIRO (Australia's Commonwealth Scientific & Industrial Research Organisation) will tag its journals with the new DTD
• Several other small journals are trying to use the DTD to submit content to PMC
JSTOR
The Scholarly Journal Archive
JSTOR’s Electronic-Archiving Initiative
• Archiving full journal issues
• Use Archiving DTD for article material
• Publishers supply sample data for analysis and development:
  • Association for Computing Machinery
  • American Economic Association
  • American Mathematical Society
  • American Political Science Association
  • Blackwell Publishing, Ltd.
  • The Ecological Society of America
  • John Wiley & Sons
  • National Academy of Sciences
  • The Royal Society
  • The University of Chicago Press
HighWire Press
Library of the Sciences and Medicine
• Currently using their own proprietary DTD
• Will be moving to the Archiving DTD
• Journals in:
  • Biological Sciences
  • Physical Sciences
  • Medical Sciences
  • Social Sciences
CSIRO
Commonwealth Scientific and Industrial Research Organisation
• Australia’s largest scientific research agency
• Independent science and technology publisher
• Journals, online journals, books, magazines and CD-ROMs
• Using Inera’s eXtyles to both clean up and convert from Microsoft Word
Centers for Medicare and Medicaid Services
• United States Department of Health and Human Services, Centers for Medicare and Medicaid Services, Office of Strategic Planning
• Publishing DTD
• Initial product is Health Care Financing Review
  • 2004 CMS Statistics guide and other publications to follow
• FrameMaker application
Other Publishers
• Public Library of Science (PLoS Biology & PLoS Medicine)
• National Athletic Trainers' Association (Journal of Athletic Training)
• St. James Publishing (Journal of Burns & Surgical Wound Care)
• Amphibian and Reptile Conservation
• Journal of Medical Internet Research
Conversion Vendors
• Tested DTD within a week of release
• Tested in advance of clients
• Have converted for publishing clients
Notable vendors (that we know about):
• TechBooks (Fairfax, VA): submitted XML in the Publishing DTD to PubMed Central within 2 weeks of the DTD release.
  • using for & 30 journals
• Data Conversion Laboratory (Fresh Meadows, NY):
  • has agreed to convert content to the Archiving DTD for individual Open Access articles submitted to PubMed Central by authors.
  • converted CMS publications (and others)
Other Service Providers
Atypon Systems
• hosting, software, and operations provider
• using for
  • Annual Reviews – 31 journals
  • Lawrence Erlbaum Associates – 81 journals
  • University of California Press – 33 journals
Impressions, Inc.
• composition and publishing for print and online books and journals
• used both the DTD and a schema version with Word 2003
Who Owns the Tagset? The DTDs?
• Not “Open Source”
• DTDs and Tagset are in the public domain
• NLM retains control over changes and additions to the Tagset and DTDs
• But: anyone may use them, or create a new DTD from them, without permission from NLM
NLM Requests
1. If you create a DTD from the DTD Suite
2. And intend it to stay compatible with the Suite
3. Then please include the following comment in modules: “Created from, and fully compatible with, the Archiving and Interchange DTD Suite.”

1. If you alter one or more modules of the Suite
2. Then please rename your version and all its modules to avoid any confusion with the original Suite
3. And please include the following statement as a comment in all your DTD modules: “Based in part on, but not fully compatible with, the Archiving and Interchange DTD Suite.”
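The two requests above amount to a simple rule for which notice goes into a derived DTD module. A minimal sketch of that rule follows; the notice wording is quoted from the slide, while the function name and its boolean flag are hypothetical and exist only for this illustration.

```python
# The two notice strings are quoted from the NLM request above; the
# function name and its flag are hypothetical, for illustration only.
COMPATIBLE_NOTICE = (
    "Created from, and fully compatible with, "
    "the Archiving and Interchange DTD Suite."
)
ALTERED_NOTICE = (
    "Based in part on, but not fully compatible with, "
    "the Archiving and Interchange DTD Suite."
)


def module_notice(suite_modules_altered):
    """Return the requested notice wrapped as an XML/DTD comment."""
    text = ALTERED_NOTICE if suite_modules_altered else COMPATIBLE_NOTICE
    return f"<!-- {text} -->"


print(module_notice(False))
print(module_notice(True))
```

The sketch only covers the comment text. Renaming an altered version and its modules, which the second request also asks for, is left to the person maintaining the derived DTD.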
What’s Next?: Working Group
To keep the DTD relevant to the publishing and archiving communities, we have created the XML Interchange Structure Working Group. This group advises NLM on recommended changes in and/or additions to the tagset.
The Working Group met for the first time on August 18, 2003. The recommendations from this meeting led to version 1.1 of the DTDs, released on November 1, 2003.
What’s Next?: Other DTDs
Because the DTD is built as a set of DTD modules, other document types can be created (relatively) easily using the same content models. We are building a Books DTD and planning an Online Documentation DTD.
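One way to picture the modular approach is as a shared pool of element declarations from which each document type selects what it needs. The toy sketch below assumes made-up module and element names and a much simpler mechanism than a real DTD suite, which would typically use parameter entities and separate module files.

```python
# Hypothetical module and element names, used only to illustrate the idea
# of assembling several document types from one shared pool of modules.
MODULES = {
    "common": {"p": "(#PCDATA)", "title": "(#PCDATA)"},
    "section": {"sec": "(title, p+)"},
    "article": {"article": "(title, sec+)"},
    "book": {"book": "(title, chapter+)", "chapter": "(title, sec+)"},
}


def build_dtd(module_names):
    """Concatenate the element declarations of the chosen modules."""
    lines = []
    for name in module_names:
        for element, model in MODULES[name].items():
            lines.append(f"<!ELEMENT {element} {model}>")
    return "\n".join(lines)


article_dtd = build_dtd(["common", "section", "article"])
book_dtd = build_dtd(["common", "section", "book"])

# Both document types reuse the identical declarations for p, title, and sec.
shared = set(article_dtd.splitlines()) & set(book_dtd.splitlines())
print(sorted(shared))
```

Since both document types draw `sec`, `title`, and `p` from the same modules, content tagged for one can be processed by tools written for the other, at least for those shared elements.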
What’s Next? PMC
• Complete redesign of software – built around the NLM Archiving and Interchange DTD
• Portable PMC – toolset with basic functions to
  • build SQL database from PMC archival files
  • create standard TOC and article displays
• Citation linking based on automated parsing of reference citations from scanned OCR text
• Japanese (NIG/DDBJ) developing journal archiving system
• Wellcome Trust / JISC adding £1.75 million to digitize journals they enlist
  • Journals will be regular PMC participants
  • Titles include Annals of Surgery, Journal of Anatomy, Journal of Physiology, all of which go back to the late 1800s
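The two Portable PMC functions named above (build a SQL database from archival files, create a standard TOC) can be sketched roughly as follows. This is not the Portable PMC design: the XML element names, the table schema, and the in-memory SQLite database are all assumptions chosen to keep the example self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical archival records; element names and the table schema are
# assumptions made for this sketch, not the Portable PMC design.
ARCHIVE_FILES = [
    "<article><issue>2</issue><fpage>15</fpage><title>Second Paper</title></article>",
    "<article><issue>2</issue><fpage>1</fpage><title>First Paper</title></article>",
]


def load_articles(conn, xml_docs):
    """Parse each XML record and store one row per article."""
    conn.execute("CREATE TABLE article (issue TEXT, fpage INTEGER, title TEXT)")
    for doc in xml_docs:
        root = ET.fromstring(doc)
        conn.execute(
            "INSERT INTO article VALUES (?, ?, ?)",
            (root.findtext("issue"), int(root.findtext("fpage")), root.findtext("title")),
        )
    conn.commit()


def table_of_contents(conn, issue):
    """Return TOC lines for one issue, ordered by first page."""
    rows = conn.execute(
        "SELECT fpage, title FROM article WHERE issue = ? ORDER BY fpage", (issue,)
    )
    return [f"{fpage:>4}  {title}" for fpage, title in rows]


conn = sqlite3.connect(":memory:")
load_articles(conn, ARCHIVE_FILES)
toc = table_of_contents(conn, "2")
print("\n".join(toc))
```

Keeping the archival XML as the source of truth and treating the database as something that can be rebuilt from it is one plausible reading of "build SQL database from PMC archival files".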
Intermission
The PMC Back Issue Scanning Project or Digitization
Back Issue Digitization
• Create a complete digital archive of PMC journals
• Bring the collection to today’s “if not online, it doesn’t exist” user
• Cover-to-cover digital copy of everything up to where the journal began producing electronic copy
• Publisher gets a free, unencumbered digital copy
• First complete archive, Bulletin of the Medical Library Association (1911), released in November 2003
Digitization Details
• PDF file for each article with true reproduction of grayscale and color images
• Citation / abstract XML record (if not already in PubMed)
• Mechanically improved (5-pass) OCR text for:
  • Searching across the collection and in individual PDFs
  • Potential automated reference linking
• TIFF files for scanned page images and each grayscale and color figure
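To give a feel for what automated reference linking from OCR text might involve, here is a small sketch that extracts journal, year, volume, and first page from a reference line with a regular expression. The citation layout, the pattern, and the sample line are all invented for illustration; the slides do not describe the actual parsing method, and real OCR output is noisier and reference styles vary far more.

```python
import re

# Hypothetical citation layout: "Authors. Journal. Year; Volume:Pages".
# Real OCR text is noisier and real reference styles vary widely.
CITATION = re.compile(
    r"(?P<journal>[A-Z][A-Za-z ]+)\.\s+"
    r"(?P<year>(?:18|19|20)\d{2});\s*"
    r"(?P<volume>\d+):\s*(?P<first_page>\d+)"
)


def parse_reference(ocr_line):
    """Return journal, year, volume, and first page, or None if no match."""
    match = CITATION.search(ocr_line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["journal"] = fields["journal"].strip()
    return fields


# Made-up reference line, not a real citation.
sample = "12. Doe J, Roe K. J Example. 1950; 12:345-350."
print(parse_reference(sample))
```

Once a reference is reduced to fields like these, it can be matched against a citation database to produce a link. Lines that do not fit the pattern simply return `None` and would need a fallback.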
TOC for Digitized Issue
Digitized Article Summary Page
Page Browse, HiFi Image, PDF
OCR Text for J Virol Article
Stumping
What the world needs now:
• XML-based authoring and editing products designed for scientific articles
• Straightforward, universal standard for defining access rights, similar to copyright indication
• Other operational, free archives that can form a collaborative archiving network
Links
PubMed Central – http://www.pubmedcentral.gov
NLM DTDs and documentation – http://dtd.nlm.nih.gov
jeffbeck@nih.gov