Network Analysis Class Data cleaningname disambiguation User Guide

Network Analysis Class Data cleaning-name disambiguation User Guide Office of Portfolio Analysis Division of Program Coordination, Planning, and Strategic Initiatives National Institutes of Health Network Analysis Class/Data cleaning-Name disambiguation 1

Data cleaning-name disambiguation has three steps New! Step 1: Automated assistance using the i. Search-Publications cooccurrence feature Step 2: Manual check of merged names Step 3: Manual check of unmerged names All three steps involved in name disambiguation are needed to create a valid and meaningful co-author network Network Analysis Class/Data cleaning-Name disambiguation 2

Data cleaning-name disambiguation Step 1: Automated assistance using the i. Search-Publications co-occurrence feature Network Analysis Class/Data cleaning-Name disambiguation 3

Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence Ste feature Open i. Search Publications module https: //itools. od. nih. gov/dashboard/ • If you already know your list of PMIDs, just search for them in i. Search 1 2 4 3 Field to search 1 - Copy and paste the list of PMIDs to the search box 2 - Select “OR” 3 - Select the “PMID” as the field to search 4 - Hit the search button (Magnifying glass icon) 4 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence Ste feature • If you don’t know your portfolio of publications, proceed to do the search for publications as usual • Once you have your publications in i. Search, go to the export button and select “Co-occurrence Graph” 5 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence Ste • feature A pop up window (below) will open 1 Select the field to use to generate the Co-occurrence Graph. For coauthor networks it is the authors field. 2 It is the minimum number of times values have to co-occur to appear it in the network. For co-author networks it is the minimum number of times authors have to publish together to generate an edge. 3 4 give a title to the file (optional) It is the number of values for a field in a document before the values are truncated to the first N and the last value. For example if authors is selected and maximum values is set to 50 then only the first 50 authors and the last author are used in the co-occurrence graph generation. 6 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence Ste feature 1 Select the field to use to generate the Co-occurrence Graph For co-author networks select the authors field 2 Minimum Edge Weight It is the minimum number of times values have to co-occur to appear it in the network. For co-author networks it is the minimum number of times authors have to publish together to generate an edge. The default number is 2. Even if we move the slider to 1 (which we recommend), this setting implies that isolated nodes are not going to show in the edges file that this feature produces. Arranges need to be made to add isolates in case they exist and the edges’ file is used. 7 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence Ste feature 3 Maximum values per document It is the number of values for a field in a document before the values are truncated to the first N and the last value. For example if authors is selected and maximum values is set to 50 then only the first 50 authors and the last author are used in the co-occurrence graph generation. • To search for the maximum number of author in your portfolio there are two options: Option 1: Download the data and check for the max number of authors using OPA Training Excel Tip 11 Export the publications to an excel file (including the authors field) and follow the excel tip 11 on OPA website https: //dpcpsi. nih. gov/eo/intranet/opa/training/excel_tips 8 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence Ste feature Maximum values per document Option 2: Create a “Portfolio” in i. Search a) Create a portfolio for the publications In this example this publication portfolio will have 7 PMIDs b) Name the portfolio and click “Create” 9 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Ste Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence feature Maximum values per document Option 2: Create a “Portfolio” in i. Search c) When asked if you want to create a portfolio without adding curators, simply select “Yes” d) Then open the portfolio clicking on the “Portfolios” icon and select the portfolio 10 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Ste Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence feature Maximum values per document Option 2: Create a “Portfolio” in i. Search c) Once the portfolio is opened, use the following search query to identify how many of your publications have for example more than 1, but less than 5 authors: author. Count: [1 TO 5] None of the 7 publications of our example have less than 5 authors 11

Ste Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence feature Maximum values per document Option 2: Create a “Portfolio” in i. Search c) Repeating the query for more than 1 but less than 10 authors: author. Count: [1 TO 10] allows us to see that all our publications have 10 or less authors 12

Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence Ste feature 3 Maximun values per document Once we know what is the maximum number of authors in our publications, we can set the maximum value per document for example to 11 (this way all our pubs will be included with their total amount of authors in the co-occurrence graph) 13 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence Ste feature 4 Then name the file (if desired) and click on “Generate” 14 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Ste Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence feature There are three output files under i. Search notifications: • Node file Note that the extension of the files is. TSV. To open them in Excel, download the • Edge files and then look for them in the “Downloads” folder of your computer • Docs file i. Search notifications icon Output files 15 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Ste Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence feature To open the. TSV files in Excel, open a new Excel spreadsheet and click “open”. When prompted with the “Text Import Wizard” window, we should select the Unicode UTF-8 option (pic below). Output files: • Nodes file This file has information about each node. The file has the following columns: • Best Name: Name for the node • Merged Names: Other names for this node • PMIDs: The Pubmed ids this node occurred on • Grants: The grants linked to the publications this node occurred on • Clinical Trials; patents; journals; earliest pub year; latest pub year this node occurred on 16 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Ste Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence feature Output files: • Edges file This file shows the relationship between the entities. The file has the following columns: • Source and target nodes column: In the example of co-author networks, the source and target nodes are interchangeable since co-author networks are undirected. • Weight: The number of times they co-occur is the weight column • PMIDs: the PMIDs where the source and target nodes co-occur in are listed in this column These two authors only published together one time, in PMID 21642420 therefore their edge’s weight is 1 These two authors only published together three times, in PMIDs 20096814, 20974641, and 21642420 therefore their edge’s weight is 3 17 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Ste Data cleaning-name disambiguation: Step 1 1 p Automated assistance using the i. Search-Publications co-occurrence feature Output files: • Docs file This file has information about each node. The file has the following columns: • PMID • Original names (as they appear in the publications) • Clustered names, which are the already disambiguated names Author “Lui, Julian” had two different denominations in the input file (original names column) and only one in the output file (clustered names column) 18 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Data cleaning-name disambiguation Step 2: Manual check of merged names Network Analysis Class 19

p e t S Data cleaning-name disambiguation: Step 2 2 Manual check of merged names Output files: • The nodes file has a column called “merged names” • The Merged Names column shows you which names have been merged. If any name was merged but shouldn’t have, it has to be manually corrected in the Docs file’ “clustered names” column and the nodes and edges files provided by i. Search can no longer be used (they will not include any manual corrections). Instructions on how to create the nodes and edges files can be found at https: //dpcpsi. nih. gov/sites/default/files/Data%20 formatting-Step%20 by%20 step. pdf 20 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Data cleaning-name disambiguation Step 3: Manual check of unmerged names. The following slides will show you the step by step description of Part 1 Network Analysis Class 21

3 p Ste view r e Ov Data cleaning-name disambiguation STEP 3 OVERVIEW Docs file, Clustered Authors Column Part 1: List of Sci 2 unique names from the Clustered Names Column Part 2: Excel pivot table showing all Last name First initial variations Ofotokun, Ighovwerha Ofotokun, Ighovwhera Typo? Park, D J Park, Daniel S Same author? 22 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

3 Data cleaning-name disambiguation: p Ste rt 1 Manual check of unmerged names a P • Create a list of unique names using the “Clustered names” column Step 3 1. Load the i. Search output Docs file into Sci 2 Data Manager Panel Drag and drop the i. Search output “Docs file” (. CSV format is required) into Sci 2 Data Manager panel 23 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

3 Data cleaning-name disambiguation: p Ste rt 1 Manual check of unmerged names a P • Create a list of unique names using the “Clustered names” column Step 3 2. Select the “Standard csv format” option 24 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

3 Data cleaning-name disambiguation: p Ste rt 1 Manual check of unmerged names a P • Create a list of unique names using the “Clustered names” column Step 3 3. Now you should be able to see the loaded file in the “Data Manager” panel 25 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

3 Data cleaning-name disambiguation: p Ste rt 1 Manual check of unmerged names a P • Create a list of unique names using the “Clustered names” column Step 3 4. Go to Data Preparation Extract Co-occurrence network Select Extract Co-Occurrence Network from the dropdown box 26 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

3 Data cleaning-name disambiguation: p Ste rt 1 Manual check of unmerged names a P • Create a list of unique names using the “Clustered names” column Step 3 5. In the “Extract network from table” window, select the column containing the names cleaned by i. Search, and the delimiter present within names • Column Name: Clustered Names • Text delimiter: semicolon (“; ”) The delimiter within these names is a semicolon (“; ”) 27 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

3 Data cleaning-name disambiguation: p Ste rt 1 Manual check of unmerged names a P • Create a list of unique names using the “Clustered names” column Step 3 6. Two new files will now appear in the Data Manager panel below the loaded CSV file Two new files: • Extracted network on column clustered • Merge table based on clustered names 28 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

3 Data cleaning-name disambiguation: p Ste rt 1 Manual check of unmerged names a P • Create a list of unique names using the “Clustered names” column Step 3 7. Select the Merge table based on Clustered names file, and right click on it Right click with mouse on the Merge table file 29 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

3 Data cleaning-name disambiguation: p Ste rt 1 Manual check of unmerged names a P • Create a list of unique names using the “Clustered names” column Step 3 8. Select Open with From the dropdown box select “View with” 30 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

3 Data cleaning-name disambiguation: p Ste rt 1 Manual check of unmerged names a P • Create a list of unique names using the “Clustered names” column Step 3 9. Select the Microsoft Excel Comma Separated Values File from the dropdown menu From the dropdown box select open with Excel 31 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

3 Data cleaning-name disambiguation: p Ste rt 1 Manual check of unmerged names a P • Create a list of unique names using the “Clustered names” column Step 3 10. An excel spreadsheet will open with 3 columns. Column A contains the list of unique names of authors cleaned by i. Search Delete columns B and C since they are not needed for Step 3 of name disambiguation (these columns contain Sci 2 internal values) 32 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Data cleaning-name disambiguation Step 3: Manual check of unmerged names. The following slides will show you the step by step description of Part 2 Network Analysis Class 33

3 p Ste rt 2 Pa • Data cleaning-name disambiguation Step 3 OVERVIEW Create a Excel pivot table showing all Last name First initial variations Output file, Clustered Names Column Part 1: List of unique names from the Clustered Names Column Part 2: Excel pivot table showing all Last name First initial variations Ofotokun, Ighovwerha Ofotokun, Ighovwhera Park, D J Park, Daniel S Typo? Same author? 34 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

S 3 Data p te t 2 r a P cleaning-name disambiguation: Step 3 Manual check of unmerged names • Create a Excel pivot table showing all Last name First initial variations 1. Copy and paste column A into column B Select column A Copy & Paste in column B 35 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

S 3 Data p te t 2 r a P cleaning-name disambiguation: Step 3 Manual check of unmerged names • Create a Excel pivot table showing all Last name First initial variations Steps to separate last and first names 2. Select column B and go to Data Text to Column Select Column B 36 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

S 3 Data p te t 2 r a P cleaning-name disambiguation: Step 3 Manual check of unmerged names Part 2: Create a Excel pivot table showing all Last name First initial variations Steps to separate last and first names 2 a: Under Original data type, choose delimited 2 b: Under delimiters, choose Comma (last names and first names are separated by commas) 37 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

S 3 Data p te t 2 r a P cleaning-name disambiguation: Step 3 Manual check of unmerged names Part 2: Create a Excel pivot table showing all Last name First initial variations Now we have last name and first and second names in different columns (rename column B to Last name) Last name column First and second name column 38 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

S 3 Data p te t 2 r a P cleaning-name disambiguation: Step 3 Manual check of unmerged names Part 2: Create a Excel pivot table showing all Last name First initial variations To separate first and second names 1 - Select column C and go to Data Text to Column Label column Last name column Select Column C 39 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

S 3 Data p te t 2 r a P cleaning-name disambiguation: Step 3 Manual check of unmerged names Part 2: Create a Excel pivot table showing all Last name First initial variations To separate first and second names 2 a- Under Original data type, choose delimited 2 b- Under delimiters, choose Space (first and second names are separated by spaces) 40 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

S 3 Data p te t 2 r a P cleaning-name disambiguation: Step 3 Manual check of unmerged names Part 2: Create a Excel pivot table showing all Last name First initial variations Now we have first name and second names in different columns 3 - Delete column C (empty) D e l e t e 4 - Label Column C as First Name, and delete column D (we don’t need the second names/initials) D e l e t e 41

Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Part 2: Create a Excel pivot table showing all Last name First initial variations To separate create a column with first name’s first initial 5 a- Use the following formula in cell D 2 =LEFT(C 2, 1) 5 b- Then copy this formula on all column D cells 5 c- Finally, select entire column D and copy and paste the content as values 5 d- Label column as First initial 42 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Part 2: Create a Excel pivot table showing all Last name First initial variations To create a column with last name first initial Leave an empty space here 6 a- Use the following formula in cell E 2 =B 2& “, ”&D 2 (please note that there is an empty space after the comma) 6 b- Then copy this formula on all column E cells 6 c- Finally, select entire column E and copy and paste the content as values 6 d- Label column E as Last name first initial 43 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

S 3 Data p te t 2 r a P cleaning-name disambiguation: Step 3 Manual check of unmerged names Part 2: Create a Excel pivot table showing all Last name First initial variations 7 - Insert a pivot table 44 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network

S 3 Data p te t 2 r a P cleaning-name disambiguation: Step 3 Manual check of unmerged names Part 2: Create a Excel pivot table showing all Last name First initial variations 8 - Set the Pivot table Fields as show below Rows: - Last name First initial - label Columns: - Count of label 45

S 3 Data p te 2 t r Pa cleaning-name disambiguation: Step 3 Manual check of unmerged names Part 2: Create a Excel pivot table showing all Last name First initial variations 9 - Sort the Count of label subtotals from largest to smallest In this example there are no “Last Name First Initial” with more than one “label”. But in the following slide we show other examples where the number of labels per “Last Name First Initial” is more than 1. 46

S 3 Data p te t 2 r a P cleaning-name disambiguation: Step 3 Manual check of unmerged names Part 2: Create a Excel pivot table showing all Last name First initial variations Typo or two different authors? Technau, Karl Gunter Technau, Karl Gunther Ofotokun, Ighovwerha Ofotokun, Ighovwhera Same author? Park, D J Park, Daniel S Kumburu, Happiness Kumburu, Happy Examples of names that should have been merged but were not. If you find a name that should have been merged but wasn't: - Go to the i. Search Docs file and correct the name in the Clustered Names column (using the “Find” and “Replace” feature) Third author? Chen, Hong Bin Chen, Huichao ? Now we are ready to format our data in Sci 2! 47 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network