MESUR: a brief overview
Johan Bollen and Herbert Van de Sompel
Digital Library Research & Prototyping Team
Los Alamos National Laboratory - Research Library
jbollen@lanl.gov
Acknowledgements: Marko A. Rodriguez (LANL), Lyudmila L. Balakireva (LANL), Wenzhong Zhao (LANL), Aric Hagberg (LANL)
Research supported by the Andrew W. Mellon Foundation.
Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX

Usage data has arrived.
The value of usage data/statistics is undeniable:
• Business intelligence
• Monitoring of scholarly trends
• Enhanced end-user services
• Scholarly assessment
Routinely collected at very large scale.
Where are usage-based metrics?
• Sampling problems: usage data is largely community-defined
• Semantics: what do metrics mean?
• Validation: does a metric really represent scholarly status?
This presentation: an overview of the MESUR project at LANL

Challenges to usage-based metrics.
The development of usage-based metrics has lagged. Here's why:
• Multiple communities
• Multiple collections (artifacts)
• Data: usage data is limited to particular subcommunities and collections of artifacts.
• Metrics: various metrics have been studied. Do results differ because of the sample, the collection or the metric definition? Or do the metrics capture different aspects of scholarly status?

Divergence: CSU usage impact factor vs. the 2003 IF
Johan Bollen. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. Journal of the American Society for Information Science and Technology, 59(1), 2008

Convergence: increasing the sample

LANL
Rank  Usage PR  IF (2003)  Title (abbv.)
1     60.196    7.035      PHYS REV LETT
2     37.568    2.950      J CHEM PHYS
3     34.618    1.179      J NUCL MATER
4     31.132    2.202      PHYS REV E
5     30.441    2.171      J APPL PHYS

CSU
Rank  Usage PR  IF (2003)  Title (abbv.)
1     78.565    21.455     JAMA-J AM MED ASSOC
2     71.414    29.781     SCIENCE
3     60.373    30.979     NATURE
4     40.828    3.779      J AM ACAD CHILD PSY
5     39.708    7.157      AM J PSYCHIAT

MSR
Rank  Usage PR  IF (2005)  Title (abbv.)
1     15.830    30.927     SCIENCE
2     15.167    29.273     NATURE
3     12.798    10.231     PNAS
4     10.131    0.402      LECT NOTES COMP SCI
5     8.409     5.854      J BIOL CHEM

Convergence!
• Open research questions:
  o Is this guaranteed?
  o To what? A common baseline?
• What we do know:
  o Institutional perspective can be contrasted to baseline.
  o As aggregation increases in size, so does value.
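
One way to put a number on how closely a usage ranking follows the impact factor is a rank correlation. The sketch below computes Spearman's rho for the three top-5 lists on this slide. The choice of Spearman's rho and the helper names `ranks` and `spearman` are assumptions made for illustration; the slide does not say which statistic the project uses, and five journals per sample is far too few to support conclusions.

```python
def ranks(values):
    """Rank values in descending order (1 = largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length, tie-free lists."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# (usage PageRank, impact factor) columns copied from the three top-5 lists.
top5 = {
    "LANL": ([60.196, 37.568, 34.618, 31.132, 30.441],
             [7.035, 2.950, 1.179, 2.202, 2.171]),
    "CSU": ([78.565, 71.414, 60.373, 40.828, 39.708],
            [21.455, 29.781, 30.979, 3.779, 7.157]),
    "MSR": ([15.830, 15.167, 12.798, 10.131, 8.409],
            [30.927, 29.273, 10.231, 0.402, 5.854]),
}

rho = {name: spearman(usage, impact) for name, (usage, impact) in top5.items()}
for name, value in rho.items():
    print(f"{name}: rho = {value:.2f}")
```

On these fifteen rows the sketch gives 0.70 for LANL, 0.50 for CSU and 0.90 for MSR. Note that each list is already restricted to the five journals with the highest usage PageRank, so these values describe the slide's excerpt only, not the full samples.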

The MESUR project.
• Johan Bollen (LANL): Principal investigator.
• Herbert Van de Sompel (LANL): Architectural consultant.
• Aric Hagberg (LANL): Mathematical and statistical consultant.
• Marko Rodriguez (LANL): Ph.D. student (Computer Science, UCSC).
• Lyudmila Balakireva (LANL): Database management and development.
• Wenzhong Zhao (LANL): Data processing, normalization and ingestion.
“The Andrew W. Mellon Foundation has awarded a grant to Los Alamos National Laboratory (LANL) in support of a two-year project that will investigate metrics derived from the network-based usage of scholarly information. The Digital Library Research & Prototyping Team of the LANL Research Library will carry out the project. The project's major objective is enriching the toolkit used for the assessment of the impact of scholarly communication items, and hence of scholars, with metrics that derive from usage data.”

Project data flow and work plan.
1. Negotiation
2. Aggregation
3. Ingestion
4a. Reference data set
4b. Metrics survey

Project timeline.
(Timeline figure with the current stage marked "We are here!")

We have COUNTER/SUSHI. How about the aggregation of item-level usage data?
If there is value in aggregating COUNTER and other reports, there is considerable value in aggregating item-level usage data.

A tapestry of usage data providers
Main players:
• Individual institutions
• Aggregators
• Publishers
Each represents a different, and possibly overlapping, sample of the scholarly community.
Institutions:
• Institutional communities
• Many collections
Aggregators:
• Many communities
• Many collections
Publishers:
• Many communities
• Publisher collection

The issue of anonymization
Privacy and anonymization concerns play out on multiple levels that a standard needs to address:
1. Institutions: where was usage data recorded?
2. Providers: who provided usage data?
3. Users: who is the user?
   - Goes beyond “naming” and masking identity: simple statistics can reveal identity
   - User identity can be inferred from activity patterns (AOL search data)
   - Law enforcement issues
MESUR:
1. Session ID: preserve sequence without any references to individual users
2. Negotiated filtering of usage data
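
The session-ID idea can be sketched as follows: drop every direct user identifier and keep only an opaque, random token per session, so that the order of accesses within a session survives. This is an illustrative sketch, not MESUR's actual procedure; the record keys `user`, `session`, `timestamp` and `item` and the function name `anonymize` are hypothetical.

```python
import secrets


def anonymize(events):
    """Replace user identifiers with opaque per-session tokens.

    `events` is an iterable of dicts with the hypothetical keys
    'user', 'session', 'timestamp' and 'item'.
    """
    session_ids = {}  # (user, session) -> random token; discard after use
    cleaned = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = (event["user"], event["session"])
        if key not in session_ids:
            # Random, not derived from the user, so it cannot be reversed.
            session_ids[key] = secrets.token_hex(8)
        cleaned.append({
            "session_id": session_ids[key],
            "timestamp": event["timestamp"],
            "item": event["item"],
        })
    return cleaned


raw = [
    {"user": "alice", "session": 1, "timestamp": 10, "item": "doc-A"},
    {"user": "alice", "session": 1, "timestamp": 12, "item": "doc-B"},
    {"user": "alice", "session": 2, "timestamp": 50, "item": "doc-C"},
    {"user": "bob", "session": 1, "timestamp": 11, "item": "doc-A"},
]
anonymous = anonymize(raw)
```

Because each session gets an unrelated token, two sessions of the same user cannot be linked through the ID. As the slide warns, the remaining timestamps and item sequences can still reveal identity, which is where negotiated filtering comes in; the sketch does not address that.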

Usage data collected
Data providers:
• Publishers: 8
• Aggregators: 3
• Institutions: 4
Data collected:
• Data: > 1 B usage events and 1 B citations
  o At this point, 247,083,481 usage events loaded
  o Another +1,000,000 on the way
• Documents: > 50 M items
• Journals: 326,000
  o Includes newspapers, magazines
  o Professional magazines
  o Obscure material
• Community: > 100 M users and authors combined
Data granularity allows reconstruction of user access sequences:
• Accumulation of user access sequences = usage network
• Nodes = journals or items, links = often co-accessed

Journal usage graphs
MESUR graph created:
• 200 M usage events
• Usage restricted to 2006
• Journals restricted to the 7,600 journals in the 2004 JCR
• Pair-wise sequences
  o Within a session, only consecutive pairs
  o Raw frequency weights
Network analysis now ongoing:
• Network properties
• Clustering
(Diagram: journal 1 linked to journal 2)
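
The pair-wise construction described on this slide can be sketched as follows: group events by session, order them in time, and count each consecutive journal pair. The tuple layout `(session_id, timestamp, journal)` and the function name `usage_graph` are hypothetical. Treating edges as directed and skipping consecutive accesses to the same journal are assumptions of this sketch, not details stated on the slide.

```python
from collections import Counter, defaultdict


def usage_graph(events):
    """Build raw-frequency edge weights from consecutive accesses.

    `events` is an iterable of (session_id, timestamp, journal) tuples.
    Returns a Counter mapping (source_journal, target_journal) -> count.
    """
    sessions = defaultdict(list)
    for session_id, timestamp, journal in events:
        sessions[session_id].append((timestamp, journal))

    weights = Counter()
    for accesses in sessions.values():
        accesses.sort(key=lambda access: access[0])  # order by time
        journals = [journal for _, journal in accesses]
        # Only consecutive pairs within a session contribute an edge.
        for source, target in zip(journals, journals[1:]):
            if source != target:  # assumption: ignore self-loops
                weights[(source, target)] += 1
    return weights


events = [
    ("s1", 1, "PHYS REV LETT"),
    ("s1", 2, "J CHEM PHYS"),
    ("s1", 3, "PHYS REV E"),
    ("s2", 1, "PHYS REV LETT"),
    ("s2", 2, "J CHEM PHYS"),
    ("s3", 1, "NATURE"),
]
graph = usage_graph(events)
```

In this toy input the pair PHYS REV LETT to J CHEM PHYS occurs in two sessions and gets weight 2, while the single-event session contributes no edge.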

Journal usage graphs
(Figure: journal usage graph. Cluster labels: Physics, Material science, Bioinform., Psychology, Education, Structural engineering, Chemistry, ?, Energy, Medicine, Environmental sciences, Microbiology, Biotech, Agriculture plants, Dermatology)

From same data: article usage graphs

Usage graphs connect this domain to 50 years of network science
• social network analysis
• small world graphs
• network science
• graph theory
• social modeling
Good reads:
• Barabasi (2003). Linked.
• Wasserman (1994). Social network analysis.
Heer (2005) - Large-Scale Online Social Network Visualization

Structural metrics calculated from usage graph
Classes of metrics:
• Degree
  o In-degree
  o Out-degree
• Shortest path
  o Closeness
  o Betweenness
  o Newman
• Random walk
  o PageRank
  o Eigenvector
• Distribution
  o In-degree entropy
  o Out-degree entropy
  o Bucket entropy
Each can be defined to take edge weights into account, e.g. by means of a weighted shortest-path definition.
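
To make three of these classes concrete, the sketch below computes weighted in- and out-degree, an in-degree entropy, and a weighted PageRank on a toy edge list. It is illustrative only. The slides do not define the metrics formally, so reading "in-degree entropy" as the Shannon entropy of a node's incoming weight distribution is an assumption, as are the damping factor 0.85, the fixed iteration count and all function names.

```python
import math
from collections import defaultdict


def weighted_degrees(edges):
    """Return (in_degree, out_degree) dicts summing edge weights per node."""
    in_deg, out_deg = defaultdict(float), defaultdict(float)
    for (source, target), weight in edges.items():
        out_deg[source] += weight
        in_deg[target] += weight
    return dict(in_deg), dict(out_deg)


def in_degree_entropy(edges, node):
    """Shannon entropy (bits) of the weights on edges pointing at `node`."""
    weights = [w for (_, target), w in edges.items() if target == node]
    total = sum(weights)
    if not total:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights)


def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration; ranks sum to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    _, out_deg = weighted_degrees(edges)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[n] for n in nodes if n not in out_deg)
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for (source, target), weight in edges.items():
            new_rank[target] += damping * rank[source] * weight / out_deg[source]
        rank = new_rank
    return rank


# Toy directed, weighted graph: (source, target) -> weight.
edges = {("A", "B"): 2, ("C", "B"): 2, ("B", "A"): 1}
in_deg, out_deg = weighted_degrees(edges)
entropy_b = in_degree_entropy(edges, "B")
ranks = pagerank(edges)
```

In the toy graph, node B receives equal weight from two sources, so its in-degree entropy is 1 bit, and it also ends up with the highest PageRank. A node fed by many sources in balanced proportions scores high on entropy even if its raw in-degree is modest, which is why the distribution class captures something the degree class does not.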

Set of metrics calculated on MESUR data set
Citation-based metrics: JCR 2004
• CITE-BE
• CITE-ID
• CITE-IE
• CITE-IF
• CITE-OD
• CITE-OE
• CITE-PG
• CITE-UBW-UN
• CITE-UCL-UN
• CITE-UNM-UN
• CITE-UPG
• CITE-UPR
• CITE-WBW-UN
• CITE-WCL-UN
• CITE-WID
• CITE-WNM-UN
• CITE-WOD
• CITE-WPR
Usage-based metrics: MESUR 2006
• USES-BE
• USES-ID
• USES-IE
• USES-OD
• USES-OE
• USES-PG
• USES-UBW-UN
• USES-UCL-UN
• USES-UNM-UN
• USES-UPG
• USES-UPR
• USES-WBW-UN
• USES-WCL-UN
• USES-WID
• USES-WNM-UN
• USES-WOD
• USES-WPR
Usage graph creation: Wenzhong Zhao
Metrics: Marko Rodriguez and Aric Hagberg

Hierarchical analysis
(Figure labels: citation, usage)

Metrics relationship

MESUR
Usage data:
• Creation of the single largest reference data set of usage, citation and bibliographic data
• +1,000,000 usage events to be loaded in the next month
• Usage data obtained from multiple publishers, aggregators and institutions
• Infrastructure for a continued research program in this domain
• Results will guide scholarly evaluation and may help produce standards for usage data representation
Usage graphs:
• Natural result of sufficiently detailed usage data
• Reduced distortion compared to raw usage: structure matters, not raw hits
• Several options on how to create them: MESUR investigates
Metrics:
• Each can represent different facets of scholarly impact
• Hybrid metrics based on triple store functionality
• Note the increasing convergence of usage metrics to citation metrics as the sample increases.
Reference data set will provide years of exciting research:
• Let me know what you think.

Some relevant publications.
• Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage. In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007.
• Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (cs.DL/0610154)
• Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL 2006), pages 298-307, June 2006.
• Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.
• Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006. (arxiv.org: cs.DL/0601030)
• Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6): 1419-1440, 2005.