Combining Full-text Analysis & Bibliometric Indicators: a pilot study

Patrick Glenisson (1), Wolfgang Glänzel (1, 2), Olle Persson (3)
1. Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium)
2. Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary)
3. Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)

Introduction
• Goal: mapping of scientific processes
  • Map of scientific papers
  • Characterization of emerging clusters
  • Extraction of new search keys
• Using bibliometric as well as lexical indicators of ‘relatedness’
• Full-text analysis

Overview
• Data sources and questions asked
• Text mining ingredients
• Text-based relational analysis of documents
• Contrasts with bibliometric analysis
• Term extraction from full-text
• Conclusion

Data source
• 19 full-text papers from Scientometrics, Vol 30, Issue 3 (2004)
• Special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China)
• Validation setup
  • Manual assignment of the papers to classes

Data source
Section code, section name, and papers:
• I. Advances in Scientometrics: Havemann et al. (2004); Moed and Garfield (2004); Small (2004); Yue and Wilson (2004)
• II. Policy relevant issues: Negishi et al. (2004); Shelton and Holdrige (2004); Markusova et al. (2004); Wu et al. (2004)
• III. Bibliometric approaches to collaboration in science: Beaver (2004); Kretschmer (2004); Persson et al. (2004); Yoshikane and Kageura (2004)
• IV. Advances in Informetrics and Webometrics: Lamirel et al. (2004); Qiu and Chen (2004); Tang and Thelwall (2004); Vaughan and Wu (2004)
• V. Mathematical models in Informetrics and Scientometrics: Egghe (2004); Glänzel (2004); Shan et al. (2004)

Research questions
• Comparison of text-based mapping vs. expert classification
• Extracted keywords
• Comparison with bibliometric mapping

Methodology
• Given a set of documents,
• compute a representation, called index: <1 0 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0>
• to retrieve, summarize, classify or cluster them

Methodology
• Document processing
  • Remove punctuation & grammatical structure (‘bag of words’)
  • Define a vocabulary
    • Identify multi-word terms, e.g., tumor suppressor (phrases)
    • Eliminate low-content words, e.g., and, thus, ... (stopwords)
    • Map words with the same meaning (synonyms)
    • Strip plurals, conjugations, ... (stemming)
  • Define weighting scheme and/or transformations (tf-idf, svd, ...)
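The processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the stopword list, the phrase list and the crude suffix-stripping `stem` function are all assumptions made for the example, and synonym mapping is left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "thus", "to"}

# Hypothetical phrase list: multi-word terms to keep as single tokens.
PHRASES = {("tumor", "suppressor")}


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Turn raw text into a bag of words (term -> count)."""
    # Lowercase and drop punctuation; word order only matters for phrases.
    tokens = re.findall(r"[a-z]+", text.lower())
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            merged.append("_".join(pair))  # keep the phrase as one term
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    # Drop stopwords, then stem single words (phrases are left untouched).
    terms = [t if "_" in t else stem(t) for t in merged if t not in STOPWORDS]
    return Counter(terms)
```

For instance, `preprocess("Clusters and clustering: the tumor suppressor clustered.")` counts `cluster` three times and `tumor_suppressor` once.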

Methodology
• Compute index of textual resources over the vocabulary
  [Figure: documents as vectors in a space spanned by terms T1, T2, T3]
• Similarity between documents: Salton’s cosine
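The formula for Salton’s cosine did not survive on this slide. For reference, the standard definition is given here, assuming each document d_i is a vector of term weights w_ik over a vocabulary of m terms (the symbols are chosen for this note, not taken from the slides):

```latex
\[
\cos(d_i, d_j) \;=\;
\frac{\sum_{k=1}^{m} w_{ik}\, w_{jk}}
     {\sqrt{\sum_{k=1}^{m} w_{ik}^{2}}\;\sqrt{\sum_{k=1}^{m} w_{jk}^{2}}}
\]
```

With non-negative weights the value lies between 0 (no shared terms) and 1 (identical direction). Document length cancels out in the ratio.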

Results – Term statistics
• 19 papers
• 3610 retained terms (including ~400 bigrams)
• Distance matrix (19 x 19)
  • Apply MDS
  • Apply clustering
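A distance matrix of this kind can be sketched as follows. The slides do not say how similarity was turned into distance, so the choice of distance = 1 − cosine is an assumption here. Documents are represented as hypothetical {term: weight} dictionaries.

```python
import math


def cosine(u, v):
    """Salton's cosine between two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm_u * norm_v)


def distance_matrix(docs):
    """Symmetric n x n matrix of (1 - cosine) distances; assumed conversion."""
    n = len(docs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - cosine(docs[i], docs[j])
            dist[i][j] = dist[j][i] = d
    return dist
```

For the 19 papers this yields the 19 x 19 matrix that MDS and clustering take as input.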

Results – MDS
[Figure: MDS map of the papers; labelled groups: Policy, Mathematical approaches, Webometrics]

Results – Clustering
• Hierarchical clustering, Ward method
• Optimal parameters? ‘Stability-based method’; cut-off k=4
• Quantified correspondence with expert assignments? ‘Rand index’
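As a rough illustration of hierarchical clustering with Ward's criterion, the sketch below merges clusters using the Lance-Williams update on squared distances and stops at k clusters. It is a plain-Python toy under these assumptions, not the implementation used in the study, and it does not cover the stability-based choice of k.

```python
def ward_clusters(dist, k):
    """Agglomerative clustering with Ward's criterion, cut at k >= 1 clusters.

    dist is a symmetric matrix of pairwise distances; the result is a sorted
    list of k sorted lists of item indices.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # Ward's Lance-Williams update operates on squared distances.
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > k:
        # Merge the pair of clusters with the smallest Ward distance.
        (a, b), d_ab = min(d2.items(), key=lambda item: item[1])
        na, nb = len(members[a]), len(members[b])
        merged = members.pop(a) + members.pop(b)
        new_d2 = {}
        for c, items in members.items():
            nc = len(items)
            d_ac = d2[(min(a, c), max(a, c))]
            d_bc = d2[(min(b, c), max(b, c))]
            total = na + nb + nc
            new_d2[(c, next_id)] = (
                (na + nc) * d_ac + (nb + nc) * d_bc - nc * d_ab
            ) / total
        # Drop distances involving the two merged clusters, add the new ones.
        d2 = {p: v for p, v in d2.items() if a not in p and b not in p}
        d2.update(new_d2)
        members[next_id] = merged
        next_id += 1
    return sorted(sorted(m) for m in members.values())
```

On five points on a line at 0, 1, 10, 11 and 50, cutting at k=3 gives the groups {0, 1}, {2, 3} and {4} (by index).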

Results – Peer evaluation
Clusters (rows) vs. expert classes (columns):

Cluster    I   II  III   IV    V
1          3    4    1    0    0    (Policy)
2          0    0    0    3    0    (Webometrics)
3          0    0    1    0    3    (Mathematical approaches)
4          1    0    2    1    0

• Rand index = 0.778
• p-value (w.r.t. permuted data) < 10^-3; significant
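The Rand index can be recomputed from a cluster-by-class contingency table, as sketched below. The counts are a reading of the slide's table, with the class III column and two cluster-2 cells filled in from the class sizes in the Data source table, so they should be treated as an assumption. The permutation test is likewise one plausible scheme (shuffling class labels), not necessarily the one used in the study.

```python
import random
from math import comb


def rand_index(table):
    """Rand index from a contingency table (rows: clusters, columns: classes)."""
    n = sum(sum(row) for row in table)
    total_pairs = comb(n, 2)
    same_both = sum(comb(cell, 2) for row in table for cell in row)
    same_cluster = sum(comb(sum(row), 2) for row in table)
    same_class = sum(comb(sum(col), 2) for col in zip(*table))
    # Agreements = pairs together in both + pairs separated in both.
    agreements = total_pairs + 2 * same_both - same_cluster - same_class
    return agreements / total_pairs


def permutation_p_value(table, n_perm=2000, seed=0):
    """Share of label shuffles with a Rand index at least as high as observed."""
    rng = random.Random(seed)
    observed = rand_index(table)
    clusters, classes = [], []
    for r, row in enumerate(table):
        for c, count in enumerate(row):
            clusters += [r] * count
            classes += [c] * count
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(classes)
        permuted = [[0] * len(table[0]) for _ in table]
        for r, c in zip(clusters, classes):
            permuted[r][c] += 1
        if rand_index(permuted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Assumed reading of the table (classes I..V in columns, clusters 1..4 in rows).
TABLE = [
    [3, 4, 1, 0, 0],
    [0, 0, 0, 3, 0],
    [0, 0, 1, 0, 3],
    [1, 0, 2, 1, 0],
]
print(round(rand_index(TABLE), 3))  # 0.778
```

With these counts, 133 of the 171 paper pairs are treated consistently by clusters and classes, which gives the reported 0.778.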

Results – Reference age
[Figure: histograms per paper]

Results – Reference age
[Figure: histograms aggregated by expert class]

Results – Ref Age vs. % Serial
[Figure: scatter plot of expert classes, mean reference age vs. percentage of serials]

Results – Term extraction
• Calculation of seminal keywords for each article
• Using TF-IDF weighting scheme
• Normalized to norm 1 to account for document length
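A minimal sketch of this kind of keyword ranking follows. It assumes raw term frequency times ln(N/df) as the TF-IDF variant and L2 normalization as "norm 1"; the slides do not specify either, and the function names are made up for the example.

```python
import math
from collections import Counter


def tfidf_unit(docs):
    """docs: list of token lists -> list of {term: weight} with unit L2 norm."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: count * math.log(n / df[t]) for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


def top_terms(vec, k=10):
    """The k highest-weighted terms of one document, ties broken alphabetically."""
    return sorted(vec.items(), key=lambda item: (-item[1], item[0]))[:k]
```

Terms that occur in every document get weight 0 under this variant, so only terms that set a paper apart from the others rise to the top of its list.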

Author(s): Persson et al.
Inflationary bibliometric values: the role of scientific collaboration and the need for relative indicators in evaluative studies
  co_author                0.417794
  collabor*                0.287652
  domest*                  0.208460
  self_citat*              0.185298
  explan*                  0.170916
  Growth                   0.154099
  reference_list           0.151925
  intern*_collabor*        0.151925
  reference_behaviour      0.151468
  inflationari             0.151468

Author(s): Glänzel
Towards a model for diachronous and synchronous citation analyses
  diachronous_prospect     0.492265
  synchronous              0.377403
  synchronous_retrospect   0.360994
  age                      0.250921
  diachronous_prospect     0.238375
  technic*_reliabl*        0.180497
  citat*_process           0.150553
  life_time                0.147679
  impact_measur*           0.125460
  random_select*           0.114862

Author(s): Moed and Garfield
In basic science the percentage of 'authoritative' references decreases as bibliographies become shorter
  research_field           0.358836
  authorit*_docum*         0.281942
  authorit*                0.241017

Author(s): Shelton and Holdrige
The US-EU race for leadership of science and technology: Qualitative and quantitative indicators
  EU                       0.638957
  WTEC                     0.346503
  panel                    0.224208

Results – Full-text vs Abstract
• Is a full-text analysis warranted
  • for term extraction?
  • for mapping purposes?

Results – Full-text vs Abstract
• Abstract-based mapping shows less structure
• Less overlap with expert classes: Rand index = 0.6257, p-value = 0.464; not significant
• Full-text is an interesting source for additional keywords and improved mapping

Conclusion
• Keyword approach may be naïve
• But applied in a systematic framework, in combination with the ‘right’ algorithms, it provides interesting clues
• Complementary to bibliometric approaches
• Weak indications of benefits from using full-text articles
• Future: extension of this pilot to larger samples
