Introduction to LODRefine
  • Google Refine is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.
  • On October 2nd, 2012, Google handed Google Refine over to the community, and the tool is now known as OpenRefine. Project development, documentation, and promotion are now fully supported by volunteers. Find out more about the history of OpenRefine and how you can help the community.

Introduction to LODRefine (cont.)
  • LODRefine is based on the latest version of OpenRefine and adds new LOD-friendly extensions to the package.
  • Here, we use LODRefine as the data-cleaning tool.

Download and Installation
  • Download here: http://sourceforge.net/projects/lodrefine/files/?source=navbar
  • Download the version that matches your operating system and unzip it. Then run the .exe file. Open Chrome; LODRefine opens in a browser tab. You can now start using LODRefine.

Create Project
  • To use LODRefine on some data, you must first import it into Refine. The import creates a new Refine project for your data file.
  • OpenRefine understands a variety of data file formats. The formats currently supported (in version 2.0) include: TSV, CSV, or values separated by a custom separator you specify; Excel (.xls, .xlsx); XML; RDF as XML; JSON; Google Spreadsheets; and RDF N3 triples.
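To make the "custom separator" option concrete, here is a minimal Python sketch of reading separated values into rows and columns. It is only an illustration of the idea, not how LODRefine itself imports files; the sample text, the "|" separator, and the helper name load_rows are assumptions made up for this example.

```python
import csv
import io

# Hypothetical sample data; only the column names "entity_name" and "value"
# appear in the slides. The "|" separator is an assumption for illustration.
RAW_TEXT = "entity_name|value\nperson|Mina Althea Orton\ncity|Boston\n"


def load_rows(text, separator=","):
    """Parse separated values into a list of dicts, one dict per row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    return list(reader)


rows = load_rows(RAW_TEXT, separator="|")
for row in rows:
    print(row["entity_name"], "->", row["value"])
```

The first line of the input is treated as the header row, which mirrors how a spreadsheet-like view labels its columns.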

Create Project (cont.)
  • Once you have imported your data, you will see it in Refine. Data is shown in rows and columns, as in a spreadsheet.

Create Project (cont.)
  • Click the “create new project” button at the top right to create the project.

Cleaning wrong values (person names)
  • Load the data and select “Facet → Text facet” on the column “entity_name”. LODRefine lists the distinct values in that column together with the number of rows for each. This dataset contains many types of entities: anniversary, city, company, etc.
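What a text facet computes can be sketched in a few lines of Python: count the rows for each distinct value in a column, then keep only the rows matching a selected value. This is a conceptual sketch, not LODRefine code; the sample rows and the helper name text_facet are hypothetical, and only the column names come from the slides.

```python
from collections import Counter

# Hypothetical rows; the real dataset is much larger.
rows = [
    {"entity_name": "person", "value": "Mina Althea Orton"},
    {"entity_name": "city", "value": "Boston"},
    {"entity_name": "person", "value": "Abstract"},
    {"entity_name": "company", "value": "Acme"},
]


def text_facet(rows, column):
    """Count how many rows carry each distinct value in the given column."""
    return Counter(row[column] for row in rows)


facet = text_facet(rows, "entity_name")
for value, count in facet.most_common():
    print(f"{value}: {count}")

# Clicking a facet value corresponds to keeping only the matching rows.
person_rows = [row for row in rows if row["entity_name"] == "person"]
print(len(person_rows), "matching rows")
```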

Cleaning wrong values (cont.)
  • Take the “person” type as an example. To clean the data of type “person”, select the value “person” in the “entity_name” facet. The view narrows to the matching rows (1556 matching rows).

Cleaning wrong values (cont.)
  • Select “Facet → Text facet” on the column “value”.

Cleaning wrong values (cont.)
  • Check the person name values one by one. Some are obviously not names of people, such as “Abstract” and “record”. Select them and flag those rows using the flag icon in the “All” column.

Cleaning wrong values (cont.)
  • Return to the full view of 8849 rows and choose “All → Facet → Facet by flag”.

Cleaning wrong values (cont.)
  • Select the flagged rows (“true”) and review the matching rows.

Cleaning wrong values (cont.)
  • Choose “All → Edit rows → Remove all matching rows”. 6 rows are deleted, leaving 8843 rows. For the “person” type, 1550 rows are left.
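The flag-then-remove workflow can be sketched in Python as follows. It is a minimal illustration under assumptions: the sample rows, the "flagged" field, and the helper names are hypothetical, and the set of bad values stands in for the manual judgment made in the facet.

```python
# Hypothetical sample; the project in the slides has 8849 rows.
rows = [
    {"entity_name": "person", "value": "Mina Althea Orton", "flagged": False},
    {"entity_name": "person", "value": "Abstract", "flagged": False},
    {"entity_name": "person", "value": "record", "flagged": False},
    {"entity_name": "city", "value": "Boston", "flagged": False},
]

# Values a human reviewer judged not to be person names.
NOT_PERSON_NAMES = {"Abstract", "record"}


def flag_rows(rows, bad_values):
    """Flag every 'person' row whose value was judged wrong."""
    for row in rows:
        if row["entity_name"] == "person" and row["value"] in bad_values:
            row["flagged"] = True


def remove_flagged(rows):
    """Return the rows that are not flagged (remove all matching rows)."""
    return [row for row in rows if not row["flagged"]]


flag_rows(rows, NOT_PERSON_NAMES)
cleaned = remove_flagged(rows)
print(len(rows) - len(cleaned), "rows removed,", len(cleaned), "rows left")
```

Flagging first and deleting second keeps the destructive step separate, so the flagged rows can be reviewed before they are removed.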

Cleaning wrong values (cont.)
  • Check the name values again. “Charles A correspondence”, “Gertrude Correspondence”, and similar wrong names turn up.
  • To correct one, hover over the cell and click the “edit” button that appears; a popup window opens. Enter the correct value, then click the “apply” button.
  • If many cells hold the same wrong value, choose the “apply to all identical cells” button instead.
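The effect of "apply to all identical cells" can be sketched as a replace over every cell that holds exactly the same value. The helper name and the corrected value "Gertrude" are hypothetical placeholders; the real correction depends on checking the original dataset.

```python
def apply_to_identical_cells(rows, column, old_value, new_value):
    """Replace every cell in `column` equal to `old_value`; return the count."""
    changed = 0
    for row in rows:
        if row[column] == old_value:
            row[column] = new_value
            changed += 1
    return changed


# Hypothetical rows; "Gertrude" is a placeholder correction, not a verified name.
rows = [
    {"value": "Gertrude Correspondence"},
    {"value": "Mina Althea Orton"},
    {"value": "Gertrude Correspondence"},
]

changed = apply_to_identical_cells(rows, "value", "Gertrude Correspondence", "Gertrude")
print(changed, "cells updated")
```

Only exact matches are changed, so a variant such as a different capitalization would need its own edit.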

Cluster and merge values (person names)
  • Besides wrong values, some person names refer to the same person but are written slightly differently. For these names, use the cluster function. (Many rows need human judgment.)
  • Choose “Edit cells → Cluster and edit” on the column “value”.

Cluster and merge values (person names)
  • A new dialog opens. Select “key collision” as the method and “cologne-phonetic” as the keying function.
  • The dialog lists person names that are likely variants of one another. For example, “Mina Althea Orton ” (with a trailing space) and “Mina Althea Orton” are the same name.
  • Select the cluster and click “merge selected & re-cluster” to merge the variants.
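The key collision idea can be sketched in Python: compute a key for every value and group the values whose keys collide. Note that this sketch swaps in a simplified fingerprint-style key (trim, lowercase, strip accents and punctuation, sort the unique tokens) rather than the cologne-phonetic function named on the slide, and it is not OpenRefine's own implementation. The variant “Orton, Mina Althea” and the function names are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified fingerprint key: normalize, then sort the unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop combining marks so accented letters match their plain forms.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def key_collision_clusters(values, keyer=fingerprint):
    """Group distinct values that share a key; keep groups of two or more."""
    buckets = defaultdict(set)
    for value in values:
        buckets[keyer(value)].add(value)
    return [sorted(group) for group in buckets.values() if len(group) > 1]


names = [
    "Mina Althea Orton ",   # trailing space, as in the slide
    "Mina Althea Orton",
    "Orton, Mina Althea",   # hypothetical variant
    "Charles A",
]
clusters = key_collision_clusters(names)
for cluster in clusters:
    print(cluster)
```

A phonetic keying function follows the same pattern; only the keyer changes, which is why the methods differ in how strict or lax their matches are.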

Cluster and merge values (person names)
  • LODRefine ships with a selected number of clustering methods and algorithms that have proven effective and fast enough to use in a wide variety of situations. They are ordered from strict to lax and should be used in that order. For details on the different algorithms, see https://github.com/OpenRefine/wiki/Clustering-In-Depth
  • Try other methods and you will find more names that have problems. Try the “key collision” method with the “metaphone” keying function.

Cluster and merge values (person names)
  • Try the “nearest neighbor” method with the “levenshtein” distance function.
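As a rough illustration of nearest-neighbor matching, the Python sketch below computes the Levenshtein (edit) distance between every pair of distinct values and reports the pairs within a chosen radius. It is a brute-force sketch, not the tool's actual implementation; the misspelling “Mina Althea Ortin”, the radius of 1, and the function names are assumptions for this example.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def near_pairs(values, radius=1):
    """Return all pairs of distinct values whose distance is within radius."""
    unique = sorted(set(values))
    return [
        (a, b)
        for i, a in enumerate(unique)
        for b in unique[i + 1:]
        if levenshtein(a, b) <= radius
    ]


names = ["Mina Althea Orton", "Mina Althea Ortin", "Charles A"]  # "Ortin" is hypothetical
pairs = near_pairs(names, radius=1)
for a, b in pairs:
    print(repr(a), "~", repr(b))
```

Comparing all pairs grows quadratically with the number of distinct values, so this approach is slower than key collision and every suggested pair still needs human review.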

Conclusions
  • Some cleaning work needs human judgment or a check against the original dataset, especially when clustering.
  • The process described above can be applied to cleaning other columns.
  • LODRefine provides other cleaning functions, such as transforming cells, splitting columns, and adding columns.