Reducing Costs and Expanding XML Submissions with PDF to JATS Conversion
by Keishi KATOH (加藤圭志), DIGITAL COMMUNICATIONS Co., Ltd.

Agenda
- About J-STAGE
- Service overview
- Positioning of the bibliographic XML creation tool
- Tool workflow
- Conversion from PDF to JATS XML
- Demonstration of the tool
- Conversion results analysis and future improvements
JATS-Con 2012. Copyright © 2012 DIGITAL COMMUNICATIONS

Brief introduction to J-STAGE and the bibliographic XML creation tool

About J-STAGE
- J-STAGE = "Japan Science and Technology Information Aggregator, Electronic"
- One of the major e-journal publishing platforms of Japan, provided by the Japan Science and Technology Agency (JST)
- 1,684 titles, 2.4 M articles (Oct 2012)
- www.jstage.jst.go.jp
- J-STAGE 3, the new platform, was launched in May 2012
  - With JATS XML submission (full text / bibliographic info)

Service positioning of J-STAGE
(Figure: service positioning of J-STAGE. Copyright © 2012 Japan Science and Technology Agency. The brand names and product names are registered trademarks of their respective companies.)

Bibliographic XML creation tool in J-STAGE
(Diagram labels: academic society; article PDF; JATS bib XML; bibliographic XML creation tool, marked "Here"; J-STAGE registration system; J-STAGE public system; users access from the internet.)

Reasons for the tool
- Is XML easy?
  - The XML spec is simple
  - The JATS tag suite is easily understood
    - Domain-specific, lightweight tag set
    - Easy structures and attributes
  - Easily created from the author's data!!
- Difficulty for authors to create papers in XML format
  - Many different tools are used for writing papers and in the production process from writing to publishing
  - Printing company's capabilities to work with XML
  - Higher skills are required to use XML

Why from PDF?
- Various tools and formats in publication
  - For writing: Word, TeX…
  - For printing:
    - DTP tools: InDesign, FrameMaker
    - Automated publishing systems: 3B2/APP, AH Formatter
  - For distributing: PDF, HTML, XML…
- Almost all academic societies have PDFs

Conversion workflow

Workflow with two phases
- Phase 1: Template pattern creation
- Phase 2: Registration of PDF and conversion to XML
(Diagram: Phase 1, template pattern creation: sample article PDF → automatic analysis → template pattern. Phase 2, XML conversion: article PDFs + template pattern → XML conversion → JATS XML.)
Details are shown in the demonstration.

Sources & Outputs
- Source: PDF
  - Version 1.3 to 1.5
  - Fonts are embedded; not a rasterized or scanned PDF
  - No security permission flag set
- Output: valid JATS XML
  - Compliant with J-STAGE's XML submission guidelines
  - Bibliographic elements
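
The source constraints above lend themselves to a quick pre-check before upload. The following is a minimal Python sketch, not part of the J-STAGE tool: it only reads the `%PDF-x.y` file header and compares it with the 1.3 to 1.5 range given on the slide. The function names are hypothetical. It does not inspect font embedding or security flags, and a PDF may declare a different version in its document catalog, which this sketch ignores.

```python
import re

# Versions the slide lists as accepted sources (PDF 1.3 to 1.5).
SUPPORTED_VERSIONS = {(1, 3), (1, 4), (1, 5)}


def pdf_header_version(data: bytes):
    """Return (major, minor) from a '%PDF-x.y' header, or None if absent."""
    # The header is expected near the start of the file; scan the first 1 KiB.
    match = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported_source(data: bytes) -> bool:
    """True when the header version is within the range named on the slide."""
    return pdf_header_version(data) in SUPPORTED_VERSIONS
```

In practice the bytes would come from `open(path, "rb").read(1024)`; a file that fails this check would need to be re-exported before conversion.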

Demonstration

Demo contents
- Create new template
  - Select sample PDF for template
  - Set page margin
- Setting of template pattern
  - Select the 'block'
  - Assign 'pseudo-JATS' elements to blocks
  - About Japanese-English contents
- PDF conversion using the template pattern
  - Converting process
  - XML editing
- (Empty template)

Practice in 30 sec
- 山: mountain
- 木: tree
- 鳥: bird
- 魚: fish
- 亀: tortoise

Create a new template
- Go to the "Create new template" function
- Select a sample PDF and submit
- Set the page margin

Analyzing PDF
(Screenshot labels: header / footer region; content flow order to the next page.)

Template settings
- Select a 'block' for extracting information
- Assign a pseudo-JATS item to the block

Selecting block
- Block type
  - Paragraphs with heading
  - Paragraphs only
- Selecting methods
  - Font name, size, bold/italic
  - Text pattern
  - Page range, region on the page
- A block continues until the block of another selection setting begins
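
The selection rules above can be sketched in Python. This is an illustration under stated assumptions, not the tool's implementation: `TextLine`, `BlockSelector`, and `split_blocks` are hypothetical names, the input is assumed to be text lines already extracted from the PDF with font information, and region-on-page and italic matching are left out for brevity. A block starts at the first line that matches a selector and continues until a line matches a different selector.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TextLine:
    # Hypothetical shape of one extracted PDF text line.
    text: str
    font_name: str
    font_size: float
    bold: bool
    page: int


@dataclass
class BlockSelector:
    # Each criterion is optional; None means "do not test this property".
    font_name: Optional[str] = None
    min_size: Optional[float] = None
    bold: Optional[bool] = None
    text_pattern: Optional[str] = None
    page_range: Optional[Tuple[int, int]] = None

    def matches(self, line: TextLine) -> bool:
        if self.font_name is not None and line.font_name != self.font_name:
            return False
        if self.min_size is not None and line.font_size < self.min_size:
            return False
        if self.bold is not None and line.bold != self.bold:
            return False
        if self.text_pattern is not None and not re.search(self.text_pattern, line.text):
            return False
        if self.page_range is not None:
            first, last = self.page_range
            if not first <= line.page <= last:
                return False
        return True


def split_blocks(lines: List[TextLine],
                 selectors: Dict[str, BlockSelector]) -> List[Tuple[str, List[str]]]:
    """Group lines into named blocks; the first matching selector wins."""
    blocks: List[Tuple[str, List[str]]] = []
    for line in lines:
        name = next((n for n, s in selectors.items() if s.matches(line)), None)
        if name is not None and (not blocks or blocks[-1][0] != name):
            blocks.append((name, [line.text]))  # a different selector starts a new block
        elif blocks:
            blocks[-1][1].append(line.text)     # otherwise the current block continues
    return blocks
```

Lines that appear before any selector matches are dropped in this sketch, which roughly corresponds to text outside every configured block.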

Assign a pseudo-JATS item
- A pseudo-JATS item denotes an item that is not a single JATS XML element
  - trans-title and title
  - kwd-group and kwd
- Items for English and Japanese

Configure pseudo-JATS item
- Content region
  - Whole block
  - Select by condition
    - With heading
    - With inline heading
- Pseudo-JATS specific setting
  - Dividing keywords
  - contrib-author to institution
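
As an illustration of the "with inline heading" and "dividing keywords" settings, the following Python sketch turns one extracted keyword block into a JATS `kwd-group` with one `kwd` per keyword. It is an assumption-laden sketch, not the tool's logic: the function name `build_kwd_group`, the heading pattern, and the delimiter set are all hypothetical defaults chosen for the example.

```python
import re
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_kwd_group(raw: str, lang: str,
                    heading_pattern: str = r"^\s*(Key\s?words?|キーワード)\s*[:：]\s*",
                    delimiters: str = r"[,;、，；]") -> ET.Element:
    """Strip an inline heading, divide the rest, and build kwd-group/kwd."""
    # Remove an inline heading such as "Keywords:" (assumed pattern).
    body = re.sub(heading_pattern, "", raw, flags=re.IGNORECASE)
    group = ET.Element("kwd-group", {XML_LANG: lang})
    for part in re.split(delimiters, body):
        keyword = part.strip().rstrip(".")
        if keyword:  # skip empty pieces left by trailing delimiters
            ET.SubElement(group, "kwd").text = keyword
    return group


group = build_kwd_group("Keywords: PDF, JATS; XML conversion.", "en")
print(ET.tostring(group, encoding="unicode"))
```

A Japanese block would use the same function with `lang="ja"`, which is one way the separate English and Japanese items could be kept apart.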

Preview of conversion
- Preview with the design of the J-STAGE public system
- Some XML structure information

Workflow with two phases (again)
- Phase 1: Template pattern creation
- Phase 2: Registration of PDF and conversion to XML

Convert and edit articles
- Upload PDFs and select the template
- Wait a few seconds
- Check and edit the extracted data
- Get XML!!

Conversion results
- Conversion accuracy with 10 journals, about 10 articles each
- Recognition failures in references and keywords

Automatic recognition rate:

Journal  Avg   Min  Max   Number of articles
EL       91%   58%  100%  10
JO       97%   89%  100%  10
JE       98%   95%  99%   10
CL       93%   86%  100%  10
TR       90%   50%  100%  10
JI       91%   83%  96%   8
NI       91%   83%  100%  10
BU       93%   75%  98%   8
AD       100%  97%  100%  7
PJ       98%   90%  100%  9

Language per journal (J = Japanese, E = English; as extracted, one value missing): J/E, J/E, E, E, J/E, J, J/E, E, E
Errata / essays are excluded from the evaluation.

Future improvements
- Improvement of the PDF analyzer engine
  - Recognition of text blocks
  - Columns and sequence of text flow
- Reconstruction algorithms with text content
  - Dehyphenation and space insertion
  - JATS context recognizing ability
- Template setting pattern
  - Additional bibliographic elements
- For full text into JATS XML
  - Extract images, vector graphics
  - Equations
*Details are undecided at this time.
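
To make the dehyphenation item concrete, here is a deliberately naive Python sketch of the general technique. The slides state that details are undecided, so this is not the planned algorithm. The function name is hypothetical, and the lowercase-only rule is an assumption that will wrongly join genuine compounds broken at a line end (for example "light-" / "weight").

```python
import re


def dehyphenate(text: str) -> str:
    """Rejoin words split by a line-end hyphen and flatten line breaks."""
    # Join "conver-\nsion" into "conversion" when both sides are lowercase letters.
    joined = re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)
    # Any other line-end hyphen is kept, without inserting a space after it.
    joined = re.sub(r"-\n", "-", joined)
    # Remaining line breaks inside the block become single spaces.
    return re.sub(r"\s*\n\s*", " ", joined).strip()
```

A dictionary lookup on the joined word would be one way to reduce false joins, at the cost of language-specific resources.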

Conclusion
- The bibliographic XML creation tool is provided.
  - Easy settings, easy editing
  - But it needs more improvements
- Utilization trend of the bibliographic XML creation tool
  - Access analysis shows that some societies use the tool at their publication interval (monthly / bi-monthly)
  - 790 articles from 33 journals were registered in 4 months

Contacts
- J-STAGE services: Japan Science and Technology Agency, contact@jstage.jst.go.jp, www.jstage.jst.go.jp
- Technical questions: DIGITAL COMMUNICATIONS Co., Ltd., dc-eigyou@sgml-xml.jp, www.sgml-xml.jp
- International sales: Antenna House, Inc., info@antennahouse.com, +1 302-427-2456