Finding Topic-centric Identified Experts based on Full Text Analysis

Hanmin Jung, Information Service Research Lab, KISTI, Korea

Mikyoung Lee, Information Service Research Lab, KISTI, Korea

In-Su Kang, Information Service Research Lab, KISTI, Korea

Seung-Woo Lee, Information Service Research Lab, KISTI, Korea

Won-Kyung Sung, Information Service Research Lab, KISTI, Korea


This paper presents a method for finding topic-centric experts from open access metadata and full text documents. Topic-centric information, including experts, is served on OntoFrame, a Semantic Web-based academic research information service supporting R&D activities. The URI scheme-based OntoFrame provides three entity pages: topic, person, and event. 'Persons by Topic' in the topic page lists topic-centric identified experts. A SPARQL query retrieves them from the RDF triple store through backward chaining.

We gathered CiteSeer open access metadata and full text documents for about 110,000 papers. Using about 160,000 topics, OntoFrame now serves topic-centric identified experts and relevant information acquired by full text analysis.

Introduction

Finding experts is useful when seeking consultants, collaborators, and speakers. It also provides a source of information to supplement or complement academic sources including metadata [7], and has thus received increased attention in recent years. However, identity resolution has not been considered significantly, even though this research topic mainly deals with persons. Many studies concentrate only on string-based person names [1] [2] [5] [6]. The Semantic Web can be a competent solution for managing identified experts through its underlying URI scheme. Another consideration is guaranteeing the reliability of the results. Deep analysis based on full text documents is needed because documents that are topically classified with high precision ensure finding the right persons for each topic. On the basis of these considerations, we propose an expert-finding method based on identity resolution and full text analysis, and further extract topic-centric information such as 'Topic Trends' and 'Institutions by Topic'. Chapter 2 reviews several previous studies. Chapter 3 explains how to acquire topic-centric information based on a Semantic Web framework.

Related Studies

The sources for finding experts are various: documents, programs, e-mails, databases, citations, communities and so on. [1] proposed finding expertise information from e-mails with four simple binary association methods. [5] investigated the expertise of users and experts by combining information retrieval techniques. However, such e-mails and communities are insufficient for extracting the right experts for a specific topic because they give clues only about relationships and context. [2] introduced an expert-finding study based on full text documents related to persons and on a set of terms in them. It extracts similar experts by measuring similarity between term vectors. However, it cannot indicate which topics are related to the experts; it only provides a bundle of persons as the result. ExpertFinder [6] recommends persons with many documents for a given topic. A keyword phrase is used to retrieve relevant documents, but the results are unsatisfactory because reasonable candidates are not listed within the top three or four candidates in most cases. Its slow response time and incorrect relationships between persons and documents are also problems. Another interesting study [8] introduced three innovative points: document authority in terms of PageRank, a co-occurrence model, and multiple levels of association between experts and query terms. It finds variants of experts' names for identity recognition, but fails to uniquely identify different persons with the same name.

Acquiring Topic-Centric Information

OntoFrame: an Academic Research Information Service

OntoFrame is a Semantic Web-based service that provides academic research information for supporting R&D activities [3]. Its two main components are the URI server and OntoReasoner (an inference engine). OntoReasoner interacts with user interfaces by receiving SPARQL queries and returning XML results. We use SPARQL rather than less flexible SQL because queries can be constructed with knowledge of the ontology schema alone. OntoReasoner also expands knowledge through forward-chaining inference. The URI server has several functions: ontology schema parsing and loading, DB schema creation, ontology instance loading, and RDF triple generation, as shown in figure 1. When a new instance is inserted into the server, the triple generator makes triples for the instance. The triples are then stored in the RDF triple store and are later referred to by OntoReasoner. OntoFrame differs from other academic research information services such as CiteSeer (http://citeseer.ist.psu.edu/) and Google Scholar (http://scholar.google.com/) in that it provides information acquired by inference beyond metadata. 'Persons by Topic', 'Topic Trends', and 'Social Network' are representative information served by OntoFrame.
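Forward-chaining inference of the kind OntoReasoner performs can be illustrated with a minimal sketch. It derives the 'createdByPerson' property, which this paper describes as a shortcut that bypasses 'CreatorInfo'. The property names hasCreatorInfo and hasCreator, the rule itself, and the short identifiers are assumptions made for illustration; they are not taken from the OntoFrame ontology.

```python
def forward_chain_created_by_person(triples):
    """Add derived (paper, createdByPerson, person) triples.

    Assumed rule (hypothetical property names):
        paper --hasCreatorInfo--> info  and  info --hasCreator--> person
        =>  paper --createdByPerson--> person
    """
    # Index the second hop: CreatorInfo node -> persons.
    persons_of_info = {}
    for subject, predicate, obj in triples:
        if predicate == "hasCreator":
            persons_of_info.setdefault(subject, set()).add(obj)

    # Join the first hop against the index to derive the shortcut property.
    derived = set()
    for subject, predicate, obj in triples:
        if predicate == "hasCreatorInfo":
            for person in persons_of_info.get(obj, ()):
                derived.add((subject, "createdByPerson", person))

    return set(triples) | derived


# Toy instance data with invented identifiers.
triples = {
    ("ART_1", "hasCreatorInfo", "CRE_1"),
    ("CRE_1", "hasCreator", "PER_1"),
    ("ART_1", "hasCreatorInfo", "CRE_2"),
    ("CRE_2", "hasCreator", "PER_2"),
}
expanded = forward_chain_created_by_person(triples)
```

Materializing the derived triples at load time means that a later query can follow a single edge from a paper to its authors instead of two.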

Data Gathering and Refining

The Open Archives Initiative (OAI, http://www.openarchives.org/) develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. CiteSeer (http://citeseer.ist.psu.edu/oai.html) also supports OAI, and thus allows downloading its open access metadata, which includes title, authors, publication year and so on. Identity resolution is an obligatory task for transforming string-based data into semantic data [4]. Various forms of institution names in the metadata, e.g. "U. Kassel" and "University of Kassel", are mapped to a set of normalized institution names¹. We also distinguish different persons with the same name. A few metadata fields are available for distinguishing authors, such as affiliation, e-mail, and co-authors. Affiliations and e-mails can determine whether two authors with the same name are different persons. However, the affiliation and e-mail fields are often not obligatory, as in CiteSeer metadata. Co-authorship information plays an important role in resolving identity problems because the co-author field is usually filled in, and many authors maintain their co-authorship relations regardless of affiliation changes. We consider two authors with the same name to be the identical person when they share an identical co-author; otherwise they remain different persons. A 'sameAs' relation would compensate for the limited coverage of this co-authorship-based method: when two authors are later connected with a 'sameAs' relation, all of their information, including papers and topics, is merged.
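The co-authorship rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in OntoFrame: author mentions are compared by exact name string, co-authors are also matched by name string, merging is treated as transitive (via union-find), and the paper identifiers and names are invented.

```python
def resolve_identities(papers):
    """Group same-name author mentions that share at least one co-author name.

    papers: list of (paper_id, [author names]).
    Returns a dict mapping each mention (paper_id, name) to a representative
    mention; mentions with the same representative are treated as one person.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # (author name, co-author name) -> first mention seen with that pair.
    first_mention = {}
    for paper_id, authors in papers:
        for name in authors:
            mention = (paper_id, name)
            parent.setdefault(mention, mention)
            for co_author in authors:
                if co_author == name:
                    continue
                key = (name, co_author)
                if key in first_mention:
                    union(first_mention[key], mention)
                else:
                    first_mention[key] = mention

    return {mention: find(mention) for mention in parent}


# Invented example: "A. Kim" appears on three papers.
papers = [
    ("P1", ["A. Kim", "B. Lee"]),
    ("P2", ["A. Kim", "B. Lee", "C. Choi"]),
    ("P3", ["A. Kim", "D. Park"]),
]
identity = resolve_identities(papers)
```

Here the "A. Kim" mentions on P1 and P2 are merged because both list "B. Lee" as a co-author, while the "A. Kim" on P3 stays a separate person until a 'sameAs' relation connects them.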

After identity resolution, we assign a URI to each entity; for example, the paper "A Bayesian Multiple Models Combination Method for Time Series Prediction" is assigned 'http://www.kisti.re.kr/isrl/ResearchRefOntology#ART_00000000000000458673', the topic "markov model" is assigned 'http://www.kisti.re.kr/isrl/ResearchRefOntology#TOP_00000000000000046687', and the person "V. Petridis" is assigned 'http://www.kisti.re.kr/isrl/ResearchRefOntology#PER_00000000000000128292'.
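The URI pattern in these examples (a namespace, a three-letter entity prefix, and a zero-padded serial number) can be reproduced with a short sketch. The function name make_uri is hypothetical, and the padding width is read off the example URI above rather than taken from any OntoFrame specification.

```python
BASE = "http://www.kisti.re.kr/isrl/ResearchRefOntology#"

# Local name copied from the paper example; used only to infer the padding width.
EXAMPLE_LOCAL_NAME = "ART_00000000000000458673"
ID_WIDTH = len(EXAMPLE_LOCAL_NAME.split("_", 1)[1])


def make_uri(prefix, serial):
    """Build an entity URI from a type prefix (e.g. ART, TOP, PER) and a serial number."""
    return f"{BASE}{prefix}_{serial:0{ID_WIDTH}d}"


paper_uri = make_uri("ART", 458673)
topic_uri = make_uri("TOP", 46687)
person_uri = make_uri("PER", 128292)
```

Because every entity receives its own URI, two persons who share a name string remain distinct resources in the triple store.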

Topic Extraction

Fig. 2. Workflow of Topic Extraction based on Full Text Documents

Extracting topics from papers is the most basic task in acquiring topic-centric experts. As full text documents as well as metadata of CiteSeer are available, we use the documents. Extracted topics are assigned to each paper. The extraction proceeds in the following stages, as shown in figure 2. First, the indexer extracts index terms from a given document. Second, the terms are matched with topic keywords in the topic index DB². Third, successfully matched terms are ranked by the following algorithms, and then we select the top-n (currently, five) topics for the input document.
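The three stages can be sketched in a few lines. This is only an illustration under stated assumptions: the indexer is replaced by a naive tokenizer that emits single words and two-word phrases, the topic index DB is a small in-memory set, and matched terms are ranked by raw term frequency with alphabetical tie-breaking. The actual indexer and ranking formula of the system are not reproduced here.

```python
import re
from collections import Counter


def extract_topics(text, topic_keywords, top_n=5):
    """Return up to top_n topic keywords found in text, most frequent first."""
    # Stage 1 (stand-in indexer): lowercase words plus adjacent two-word phrases.
    words = re.findall(r"[a-z]+", text.lower())
    phrases = [" ".join(pair) for pair in zip(words, words[1:])]
    term_frequency = Counter(words + phrases)

    # Stage 2: keep only index terms that occur in the topic keyword set.
    matched = {t: f for t, f in term_frequency.items() if t in topic_keywords}

    # Stage 3: rank by descending term frequency (ties broken alphabetically).
    ranked = sorted(matched.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


# Invented keyword set and text for illustration.
topic_keywords = {"markov model", "time series", "prediction"}
text = (
    "A Markov model is fitted to each time series. "
    "The Markov model then drives prediction."
)
topics = extract_topics(text, topic_keywords)
```

In this toy input, "markov model" occurs twice and ranks first, while the two remaining keywords each occur once.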

Finding Experts

Many factors can be considered for finding experts: the number of papers, the impact factor of sources, the degree of citations, hub persons in a social network and so on. Currently, we take into account only the number of papers, for several reasons. A large portion of the source field in CiteSeer open access metadata is empty. Citation information may also be incomplete compared with the CiteSeer service page. We also do not consider the social network because abundant co-authorship with other persons does not always guarantee specialty in a topic.

Acquiring topic-centric experts on OntoFrame requires querying the DBMS-based RDF triple store. 'Persons by Topic' is retrieved directly from the database through a SPARQL query and automatic SPARQL-to-SQL conversion. The query searches for papers (?accomplishment) whose topic area is topicTerm, and then retrieves the authors (?person) of those papers. Figure 3 shows the backward chaining flow starting from topicTerm. 'createdByPerson' is one of the derived properties induced by user-defined inference rules. It shortens the backward path for finding 'Persons by Topic' by going directly to 'Person' without passing through 'CreatorInfo' (the dotted line in figure 3). After retrieving persons, OntoReasoner performs post-processing to rank them in descending order of the number of their papers.
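The retrieval and ranking steps can be imitated over an in-memory list of triples. This is a rough sketch, not the SPARQL query or the SPARQL-to-SQL conversion used by OntoFrame. The property name hasTopic and all identifiers are invented, 'createdByPerson' is the derived property named above, and counting only papers on the queried topic is an assumption about how "the number of their papers" is measured.

```python
from collections import Counter


def persons_by_topic(triples, topic):
    """Return (person, paper count) pairs for a topic, most papers first."""
    # Step 1: papers (?accomplishment) whose topic area is the given topic.
    papers = {s for s, p, o in triples if p == "hasTopic" and o == topic}

    # Step 2: authors (?person) reached directly through createdByPerson.
    counts = Counter(
        o for s, p, o in triples if p == "createdByPerson" and s in papers
    )

    # Step 3: post-processing, descending paper count (ties by identifier).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy triple store with invented identifiers.
triples = [
    ("ART_1", "hasTopic", "TOP_markov_model"),
    ("ART_2", "hasTopic", "TOP_markov_model"),
    ("ART_3", "hasTopic", "TOP_other"),
    ("ART_1", "createdByPerson", "PER_A"),
    ("ART_2", "createdByPerson", "PER_A"),
    ("ART_2", "createdByPerson", "PER_B"),
    ("ART_3", "createdByPerson", "PER_B"),
]
experts = persons_by_topic(triples, "TOP_markov_model")
```

PER_A ranks first with two papers on the topic. PER_B's paper on another topic is not counted under this assumption.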

Conclusions

We gathered 114,337 papers (2000-2006) from CiteSeer open access metadata. They include 161,853 persons and 17,093 institutions. 160,568 topic keywords³ were extracted from titles and abstracts. The average time for extracting up to five topics from a paper is about 1.6 seconds. Three seconds are enough to generate an entity page including 'Persons by Topic' on OntoFrame⁴.

Fig. 1. OntoFrame Architecture

(1) Index term list: the kth document has m index terms, and each entry indicates the ith index term in the document.

(3) TF (Term Frequency) of index term: the term frequency of index term t in the document.

Topic keyword and topic are the same in this study. Successfully matched index terms are also a subset of topic keywords because the terms are always members of the topic keywords in the topic index DB.

(4) TF of the index term matched with topic keyword: the term frequency of the index term t found in the topic keyword DB.

Fig. 3. Backward Chaining Path for Finding 'Persons by Topic' (Experts for a Topic)

Fig. 4. Example of Topic Page for 'markov model' ('Persons by Topic' shows ranked experts.)

This paper showed a method for finding topic-centric identified experts from CiteSeer open access metadata and full text documents. Topic extraction based on full text analysis enables the construction of topically-classified papers, and inference propagates the topics to persons and institutions. A SPARQL query retrieves URI-based 'Persons by Topic' from the RDF triple store. Our future work includes introducing usability test to …

2nd International ExpertFinder Workshop (FEWS2007)

¹ Currently, about 14,000.

³ Simple and compound nouns were extracted automatically and filtered manually by human dictionary constructors.

⁴ The whole system will appear in the Poster/Demo Track of ISWC2007.

References

[1] K. Balog and M. de Rijke. Finding Experts and Their Details in E-mail Corpora. In Proceedings of the 15th International Conference on World Wide Web, 2006.

[2] K. Balog and M. de Rijke. Finding Similar Experts. In Proceedings of the 30th Annual International ACM SIGIR Conference, 2007.

[3] H. Jung, M. Lee, W. Sung, and D. Park. Semantic Web-Based Services for Supporting Voluntary Collaboration among Researchers Using an Information Dissemination Platform. Data Science Journal, 6, 1, 2007.

[4] H. Jung and W. Sung. Construction of Semantic Web-based Knowledge Using Text Processing. In Proceedings of the 4th International Conference on Information Technology: New Generations, 2007.

[5] X. Liu, W. Croft, and M. Koll. Finding Experts in Community-Based Question-Answering Services. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, 2005.

[6] D. Mattox, M. Maybury, and D. Morey. Enterprise Expert and Knowledge Discovery. In Proceedings of the 8th International Conference on Human-Computer Interaction, 1999.

[7] D. Yimam. Expert Finding Systems for Organizations: Domain Analysis and the DEMOIR Approach. In Beyond Knowledge Management: Sharing Expertise, MIT Press, 2000.

[8] J. Zhu, D. Song, S. Rüger, M. Eisenstadt, and E. Motta. The Open University at TREC 2006 Enterprise Track Expert Search Task. In Proceedings of the 15th Text REtrieval Conference, 2006.