LattesMiner: a Multilingual DSL for Information Extraction

LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform
11th Workshop on Domain-Specific Modeling
Alexandre Donizeti Alves, Horacio Hideki Yanasse, Nei Yoshihiro Soma
October 24, 2011

Introduction: Lattes Platform is an information system deployed by CNPq (National Council for Scientific and Technological Development) to manage information on science, technology and innovation related to researchers and institutions in Brazil. This platform is undoubtedly the major source of information available on Brazilian researchers.

Introduction: Lattes Platform http://lattes.cnpq.br

Introduction: The Lattes CV system, a curricular information system, is the main component of the platform. Currently, the Lattes CV system stores around 2,000 curricula of researchers, lecturers, students and professionals from diverse areas of knowledge.

Introduction: Lattes CV system. Jorge Almeida Guimaraes. http://buscatextual.cnpq.br/buscatextual

Introduction: Lattes curriculum (English)

Introduction: Lattes curriculum (Portuguese)

Introduction: In recent years, many studies have used data extracted from the Lattes Platform on researchers from different areas of knowledge. A common problem reported in these studies is that the curricula and the extracted information had to be obtained manually.

Introduction: Therefore, this system has great potential for extracting high-quality information.

LattesMiner is an internal multilingual DSL for automatic information extraction from Lattes curricula. It is composed of a set of classes written in Java that allow developers to implement their own applications with a high level of abstraction and expressive power.

LattesMiner
Data Discovery is used to find the ID number of the researchers. Usually, only the name of the researcher is available.
Data Acquisition is responsible for downloading the Lattes curricula of the researchers from the Lattes CV system on the Web.
Data Extraction is the main component of LattesMiner. It is responsible for extracting data from the HTML files. The technique of information extraction based on regular expressions was used.
Data Structure: the extracted data can be stored in XML files or in any database using the Data Structure component.
Data Analysis is responsible for the analysis of the extracted data and also for the analysis of the relationships identified.
Data Visualization is responsible for the identification and visualization of the academic social networks. These networks are identified by checking the relationships between researchers.
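
The regular-expression technique named for the Data Extraction component can be illustrated with a minimal sketch. The HTML fragment, the pattern and the class name RegexExtractionSketch are assumptions made for illustration only; they do not reflect the real layout of a Lattes curriculum or the patterns used inside LattesMiner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not part of LattesMiner.
class RegexExtractionSketch {
    // Assumed markup: each list item holds a title, a comma, a four-digit year and a period.
    private static final Pattern ITEM =
            Pattern.compile("<li>(.*?),\\s*(\\d{4})\\.</li>");

    // Returns one {title, year} pair per matching list item.
    static List<String[]> extract(String html) {
        List<String[]> items = new ArrayList<>();
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            items.add(new String[] { m.group(1).trim(), m.group(2) });
        }
        return items;
    }

    public static void main(String[] args) {
        // Made-up input, used only to exercise the pattern.
        String html = "<ul><li>Title A, 2009.</li><li>Title B, 2010.</li></ul>";
        for (String[] item : extract(html)) {
            System.out.println(item[1] + ": " + item[0]);
        }
    }
}
```

Patterns of this kind are tied to the page markup, which is why it helps to keep them behind a stable interface.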

LattesMiner: The LattesMiner class is composed of instances of the classes Biodata and Board, in addition to many others not presented here.
[Class diagram: packages lattes.miner, lattes.miner.br, lattes.miner.en, lattes.miner.ie and lattes.miner.dao; classes LattesMiner, Perfil, Banca, Biodata, Board, BiodataIE, BoardIE, BiodataDao and BoardDao]

LattesMiner was created as a fluent interface, which provides a compact yet easy-to-read representation of the domain problem. Fluent interfaces are implemented using method chaining. LattesMiner makes use of static factory methods and static imports.
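
A minimal sketch of how method chaining and a static factory method combine into a fluent interface is shown here. The class FluentSketch, its fields and the describe() method are hypothetical and written only for this illustration; the method names load, biodata and address are borrowed from the listings in these slides, but their real implementation is not shown in the source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the LattesMiner source.
class FluentSketch {
    private final String id;
    private final List<String> requested = new ArrayList<>();

    private FluentSketch(String id) {
        this.id = id;
    }

    // Static factory method: the entry point of a chain.
    static FluentSketch load(String id) {
        return new FluentSketch(id);
    }

    // Each step records what was asked for and returns 'this',
    // so the next call can be chained onto it.
    FluentSketch biodata() {
        requested.add("biodata");
        return this;
    }

    FluentSketch address() {
        requested.add("address");
        return this;
    }

    // Terminal step (assumed for the sketch): ends the chain with a result.
    String describe() {
        return id + " -> " + String.join(", ", requested);
    }

    public static void main(String[] args) {
        // "0000" is a placeholder, not a real Lattes ID.
        System.out.println(load("0000").biodata().address().describe());
    }
}
```

With a static import of the factory method, client code in another class can call load(...) without a class qualifier, which is what keeps the chains short.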

Case Study: The following examples consider researchers in the Computer Science area who hold a CNPq Research Productivity Scholarship. The list contains the names of all the researchers; however, their corresponding ID numbers are not provided. http://plsql1.cnpq.br/divulg/RESULTADO_PQ_102003.curso

Listing 1: Java application code

    import java.util.*;
    import lattes.util.Util;
    import static lattes.miner.LattesMiner.*;

    public class Listing1 {
        public static void main(String[] args) {
            List<String> list = new ArrayList<String>();
            // Search for the ID of each researcher by name
            for (String name : Util.getList("names.txt"))
                list.add( search(name) );
            // Write the IDs found to a file
            Util.setList(list, "ids.txt");
        }
    }

Listing 2: Code fragment used to download the Lattes curricula of the researchers.

    dir("cvs");
    for (String id : Util.getList("ids.txt"))
        download(id).save();

Listing 3: This listing shows how to extract data from the Lattes curricula of the researchers.

    props("mysql");
    for (String id : Util.getList("ids.txt")) {
        load(id).biodata().address();
        publications( JOURNAL ).save();
    }

Listing 4: Code fragment illustrating how LattesMiner is used to extract information in different languages.

    for (String id : Util.getList("ids.txt")) {
        // Portuguese
        for (Banca b : carregar(id).bancas().getBancas() ) {
            if ( b.ano() == 2010 )
                System.out.println( b.aluno() );
        }
        // English
        for (Board b : load(id).boards().getBoards() ) {
            if ( b.year() == 2010 )
                System.out.println( b.student() );
        }
    }
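
One possible way to offer the same operations under two vocabularies is to let the class of one language delegate to the class of the other. The sketch below is an assumption about how such a design could look, not a description of how LattesMiner implements it; the classes Board and Banca here are simplified stand-ins that reuse the names from Listing 4, and the sample data is made up.

```java
// Hypothetical sketch: one design for a multilingual API, not the LattesMiner source.
class MultilingualSketch {
    public static void main(String[] args) {
        Board board = new Board(2010, "Student A"); // made-up sample data
        Banca banca = new Banca(board);

        // Both vocabularies expose the same underlying values.
        System.out.println(board.year() + " " + board.student());
        System.out.println(banca.ano() + " " + banca.aluno());
    }
}

// English vocabulary: holds the data.
class Board {
    private final int year;
    private final String student;

    Board(int year, String student) {
        this.year = year;
        this.student = student;
    }

    int year() { return year; }
    String student() { return student; }
}

// Portuguese vocabulary: delegates every call to the English class.
class Banca {
    private final Board board;

    Banca(Board board) {
        this.board = board;
    }

    int ano() { return board.year(); }
    String aluno() { return board.student(); }
}
```

Delegation keeps the extraction logic in one place, so adding a language means adding only thin wrapper classes.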

Results: SUCUPIRA is a system for identification and visualization of academic social networks. This slide shows the geographical distribution of the five researchers who have published the most articles in scientific journals.

Results: This is a graph of contacts of the five researchers who have published the most articles in scientific journals. The graph depicts an academic social network of the five researchers. Nodes are labeled with the name of the researcher. The color of the edges represents the number of relationships among researchers.
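
The number of relationships behind each edge can be computed by counting, for every pair of researchers, how many items they share. The sketch below assumes that a relationship is a shared item such as a jointly listed publication; the slides do not define it, and the class NetworkSketch, the "A|B" key format and the sample names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: not part of LattesMiner or SUCUPIRA.
class NetworkSketch {
    // Maps an undirected pair "A|B" (names in sorted order) to the number of shared items.
    static Map<String, Integer> edgeWeights(List<List<String>> participantsPerItem) {
        Map<String, Integer> weights = new TreeMap<>();
        for (List<String> participants : participantsPerItem) {
            // Sort and de-duplicate so that A-B and B-A count as the same edge.
            List<String> sorted = new ArrayList<>(new TreeSet<>(participants));
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    weights.merge(sorted.get(i) + "|" + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up data: each inner list names the researchers attached to one item.
        List<List<String>> items = Arrays.asList(
                Arrays.asList("A", "B"),
                Arrays.asList("B", "A", "C"));
        System.out.println(edgeWeights(items));
    }
}
```

A visualization layer can then map each weight to an edge color.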

Conclusions: Currently, the Lattes curricula are available in HTML format. LattesMiner, however, does not depend on the data format, because it allows users to program their own applications at a high level of abstraction. If the data format is eventually modified, the DSL interface remains the same.

Conclusions: An advantage of LattesMiner is that it searches by the name of the researcher. LattesMiner is multilingual. Another advantage is that the extracted data can be stored in a structured format (XML or database), allowing these data to be easily used by other applications.

Future work: The next step, which is already being implemented in the LattesMiner DSL, is a statistical analysis of the data.

ACKNOWLEDGMENTS