<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Finding Topic-centric Identified Experts based on Full Text Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hanmin Jung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikyoung Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>In-Su Kang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seung-Woo Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Won-Kyung Sung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Service Research Lab.</institution>
          ,
          <addr-line>KISTI</addr-line>
          ,
          <country country="KR">Korea</country>
        </aff>
      </contrib-group>
      <fpage>56</fpage>
      <lpage>63</lpage>
      <abstract>
        <p>This paper presents a method for finding topic-centric experts from open access metadata and full text documents. Topic-centric information, including experts, is served on OntoFrame, a Semantic Web-based academic research information service that supports R&amp;D activities. Built on a URI scheme, OntoFrame provides three entity pages: topic, person, and event. 'Persons by Topic' on the topic page lists the identified experts for a topic; a SPARQL query retrieves them from an RDF triple store through backward chaining. We gathered CiteSeer open access metadata and full text documents for about 110,000 papers. Using about 160,000 topics, OntoFrame now serves topic-centric identified experts and related information acquired by full text analysis.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Finding experts is useful when seeking consultants, collaborators, or
speakers. It also provides a source of information that supplements or complements
academic sources, including metadata [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and has therefore received increasing attention in recent
years. However, identity resolution has received little attention even though
this research topic mainly deals with persons. Many studies rely only on
string-based person names [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The Semantic Web is a suitable
solution for managing identified experts because of its underlying URI scheme. Another
consideration is the reliability of the results. Deep analysis based
on full text documents is needed because documents that are topically classified with high
precision make it possible to find the right persons for each topic. On the basis of these
considerations, we propose an expert-finding method based on identity resolution and full text
analysis, and further extract topic-centric information such as ‘Topic Trends’ and
‘Institutions by Topic’. Section 2 reviews previous studies. Section 3
explains how to acquire topic-centric information based on a Semantic Web framework.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Studies</title>
      <p>
        The sources for finding experts are various: documents, programs, e-mails, databases,
citations, communities, and so on. Finding expertise information in e-mails with
four simple binary association methods was proposed by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] investigated the
expertise of users and experts by combining information retrieval techniques.
However, e-mails and communities are insufficient for extracting the right experts for a
specific topic because they give clues only about relationships and context.
An expert-finding study based on the full text documents associated with persons, and on the
set of terms in them, was introduced in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It extracts similar experts by measuring
the similarity between term vectors. However, it cannot indicate which topics the
experts are related to; it only returns a group of persons. ExpertFinder
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] recommends persons who have many documents on a given topic. A keyword phrase
is used to retrieve relevant documents, but the results are unsatisfactory because in most
cases reasonable candidates do not appear among the top three or four.
Slow response time and incorrect relationships between persons and documents are
further problems. Another interesting study, performed by [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], introduced three
innovations: document authority in terms of PageRank, a co-occurrence model,
and multiple levels of association between experts and query terms. It finds variants
of experts’ names for identity recognition, but fails to uniquely identify different persons with
the same name.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Acquiring Topic-Centric Information</title>
      <sec id="sec-3-1">
        <title>3.1 OntoFrame: an Academic Research Information Service</title>
        <p>
          OntoFrame is a Semantic Web-based service that provides academic research
information to support R&amp;D activities [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Its two main components are the URI server
and OntoReasoner (an inference engine). The latter interacts with user interfaces by
receiving SPARQL queries and returning XML results. We adopt SPARQL
rather than the less flexible SQL because queries can be constructed with knowledge
of the ontology schema alone. OntoReasoner also expands knowledge through
forward-chaining inference. The URI server has several functions: ontology schema parsing
and loading, DB schema creation, ontology instance loading, and RDF triple
generation, as shown in figure 1. When a new instance is inserted into the server, the triple
generator creates triples for the instance. The triples are then stored in the RDF triple store,
where OntoReasoner refers to them later.
        </p>
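        <p>The triple generation step can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the OntoFrame implementation: the namespace is taken from the example URIs in Section 3.2, while the function make_triples, the class name Accomplishment, and the property titleOfAccomplishment are hypothetical names introduced only for this example.</p>
        <preformat>
```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Namespace as it appears in the example URIs of Section 3.2.
ISRL = "http://www.kisti.re.kr/isrl/ResearchRefOntology#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def make_triples(instance_uri: str, class_name: str,
                 fields: Dict[str, List[str]]) -> List[Triple]:
    """Turn one ontology instance into (subject, predicate, object) triples.

    class_name and the keys of fields are hypothetical schema names.
    """
    # One typing triple, then one triple per field value.
    triples = [(instance_uri, RDF_TYPE, ISRL + class_name)]
    for prop, values in sorted(fields.items()):
        for value in values:
            triples.append((instance_uri, ISRL + prop, value))
    return triples


paper = ISRL + "ART_00000000000000458673"
triples = make_triples(
    paper,
    "Accomplishment",  # hypothetical class name
    {"titleOfAccomplishment": [  # hypothetical property name
        "A Bayesian Multiple Models Combination Method for Time Series Prediction"
    ]},
)
for triple in triples:
    print(triple)
```
        </preformat>
        <p>Keeping the generator as a pure function from an instance to a list of triples makes it straightforward to store the result in any triple store afterwards.</p>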
        <p>OntoFrame differs from other academic research information services such as
CiteSeer (http://citeseer.ist.psu.edu/) and Google Scholar (http://scholar.google.com/)
in that it provides information acquired by inference beyond metadata. ‘Persons by
Topic’, ‘Topic Trends’, and ‘Social Network’ are representative kinds of information served
by OntoFrame.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Data Gathering and Refining</title>
        <p>The Open Archives Initiative (OAI, http://www.openarchives.org/) develops and
promotes interoperability standards that aim to facilitate the efficient dissemination of
content. CiteSeer (http://citeseer.ist.psu.edu/oai.html) also supports OAI, and thus
allows its open access metadata, which includes title, authors,
publication year, and so on, to be downloaded.</p>
        <p>
          Identity resolution is an obligatory task for transforming string-based data into semantic
data [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Various forms of institution names in the metadata, e.g. “U. Kassel” and
“University of Kassel”, are mapped to a set of normalized institution names (currently
about 14,000). We also distinguish different persons with the same name. A few metadata
fields are available for distinguishing authors, such as affiliation, e-mail, and co-authors.
Affiliations and e-mails make it possible to determine whether two authors with the same
name are different persons. However, the affiliation and e-mail fields are not obligatory
in many cases, including CiteSeer metadata. Co-authorship information plays an
important role in resolving identity problems because the co-author field is usually filled in,
and many authors maintain their co-authorship relations regardless of
affiliation changes. We consider two authors with the same name to be the same
person when they share at least one co-author; otherwise they remain different
persons. A ‘sameAs’ relation compensates for the limited coverage of this
co-authorship-based method: when two authors are later connected with ‘sameAs’, all of
their information, including papers and topics, is merged.
        </p>
        <p>
          After identity resolution, we assign a URI to each entity; for example, the paper “A
Bayesian Multiple Models Combination Method for Time Series Prediction” is assigned
‘http://www.kisti.re.kr/isrl/ResearchRefOntology#ART_00000000000000458673’, the topic
“markov model” is assigned
‘http://www.kisti.re.kr/isrl/ResearchRefOntology#TOP_00000000000000046687’, and the person
“V. Petridis” is assigned
‘http://www.kisti.re.kr/isrl/ResearchRefOntology#PER_00000000000000128292’.
        </p>
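        <p>The co-authorship rule can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the record layout, the function resolve_identities, and the sample names are assumptions made for the example, and merging is applied transitively through a union-find structure, which the text does not state explicitly.</p>
        <preformat>
```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (paper_id, author_name, co_author_names); the sample data below are made up.
Record = Tuple[str, str, Set[str]]


def resolve_identities(records: List[Record]) -> Dict[Tuple[str, str], int]:
    """Map each (paper_id, author_name) mention to a person cluster id.

    Mentions with the same name are merged when they share at least one
    co-author; otherwise they stay separate persons.
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Only mentions with an identical name string are compared.
    by_name: Dict[str, List[int]] = defaultdict(list)
    for idx, (_, name, _) in enumerate(records):
        by_name[name].append(idx)

    for indices in by_name.values():
        for a in range(len(indices)):
            for b in range(a + 1, len(indices)):
                i, j = indices[a], indices[b]
                if not records[i][2].isdisjoint(records[j][2]):
                    union(i, j)

    return {(rec[0], rec[1]): find(i) for i, rec in enumerate(records)}


records = [
    ("p1", "J. Kim", {"H. Jung"}),
    ("p2", "J. Kim", {"H. Jung", "M. Lee"}),
    ("p3", "J. Kim", {"S. Park"}),
]
clusters = resolve_identities(records)
print(clusters)
```
        </preformat>
        <p>In this made-up example the mentions in p1 and p2 fall into one cluster because they share a co-author, while the mention in p3 stays separate until a ‘sameAs’ relation connects it.</p>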
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Topic Extraction</title>
        <p>Extracting topics from papers is the most basic task in acquiring topic-centric experts.
Since full text documents as well as metadata are available from CiteSeer, we use the
full text. Extracted topics are assigned to each paper. The extraction proceeds in the
following stages, as shown in figure 2. First, the indexer extracts index terms from a given
document. Second, the terms are matched against topic keywords in the topic index DB<sup>2</sup>.
Third, the successfully matched terms are ranked as defined below, and
we select the top-n (currently, five) topics for the input document.</p>
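        <p>(1) Index term list: the kth document <inline-formula><tex-math>D_k = \{t_{k1}, \ldots, t_{km}\}</tex-math></inline-formula> has m index terms, where <inline-formula><tex-math>t_{ki}</tex-math></inline-formula> indicates the ith index term in the document.</p>
        <p>(2) Topic keyword list: the topic keyword list <inline-formula><tex-math>S = \{s_1, \ldots, s_p\}</tex-math></inline-formula> has p keywords.</p>
        <p>(3) TF (term frequency) of index term: <inline-formula><tex-math>tf_{D_k}(t)</tex-math></inline-formula> is the term frequency of index term t in document <inline-formula><tex-math>D_k</tex-math></inline-formula>.</p>
        <p>(4) TF of the index term matched with topic keyword: <inline-formula><tex-math>tf_{D_k}^{S}(t)</tex-math></inline-formula> is the term frequency of the index term t found in topic keyword DB. The frequency originates from <inline-formula><tex-math>tf_{D_k}(t)</tex-math></inline-formula>.</p>
        <p>Topic weighting formula for the top-5 ranked topics:
          <disp-formula><tex-math>r(t) = \frac{tf_{D_k}^{S}(t)}{\sum_{t' \in \mathrm{Top5}} tf_{D_k}^{S}(t')}</tex-math></disp-formula>
        </p>
        <p><sup>2</sup> Topic keyword and topic are the same in this study. Successfully matched index terms are
also a subset of topic keywords because the terms are always a member of topic keywords in
topic index DB.</p>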
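        <p>The ranking stage can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the system's indexer: the function rank_topics, the alphabetical tie-breaking, and the sample terms are introduced here for illustration, and the weight is the matched term frequency normalised over the selected top-n topics, which is how we read the weighting formula.</p>
        <preformat>
```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def rank_topics(index_terms: Iterable[str], topic_keywords: Set[str],
                top_n: int = 5) -> List[Tuple[str, float]]:
    """Return the top-n matched topics with weights that sum to 1."""
    # Term frequency restricted to index terms found in the topic keyword list.
    tf = Counter(t for t in index_terms if t in topic_keywords)
    # Descending frequency; alphabetical tie-breaking is an assumption.
    top = sorted(tf.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    total = sum(freq for _, freq in top)
    return [(term, freq / total) for term, freq in top]


# Made-up index terms of one document and a made-up topic keyword list.
index_terms = [
    "markov model", "time series", "markov model", "prediction",
    "bayesian network", "markov model", "time series", "method",
]
topic_keywords = {"markov model", "time series", "prediction", "bayesian network"}

ranked = rank_topics(index_terms, topic_keywords)
for term, weight in ranked:
    print(term, round(weight, 3))
```
        </preformat>
        <p>Here “method” is dropped because it is not a topic keyword, and the remaining four terms share the weight in proportion to their frequencies.</p>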
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Finding Experts</title>
        <p>Many factors can be considered in finding experts: the number of papers, the impact
factor of sources, the number of citations, hub persons in a social network, and so on.
Currently, we take into account only the number of papers, for several reasons. A
large portion of the source field in CiteSeer open access metadata is empty.
Citation information may also be incomplete compared with the CiteSeer service
page. We also do not consider the social network because extensive co-authorship with
other persons does not always guarantee expertise on a topic.</p>
        <p>Acquiring topic-centric experts on OntoFrame requires querying the DBMS-based RDF
triple store. ‘Persons by Topic’ is retrieved directly from the database through
the following SPARQL query and automatic SPARQL-to-SQL conversion. The
query searches for papers (?accomplishment) whose topic area is topicTerm, and then
retrieves the authors (?person) of those papers. Figure 3 shows the backward chaining flow
starting from topicTerm.</p>
        <preformat>SELECT ?person ?perRep ?perEngName ?perKorName ?institution
       ?instEngName ?instKorName
WHERE
{
  ?topicArea isrl:hasTopicTermOfAccomplishment topicTerm .
  ?accomplishment isrl:hasTopicAreaOfAccomplishment ?topicArea .
  ?accomplishment isrl:createdByPerson ?person .
  OPTIONAL { ?perRep isrl:standForSameAsGroupOf ?person . }
  OPTIONAL { ?person isrl:engNameOfPerson ?perEngName . }
  OPTIONAL { ?person isrl:korNameOfPerson ?perKorName . }
  ?person isrl:hasInstitutionOfPerson ?institution .
  OPTIONAL { ?institution isrl:engNameOfInstitution ?instEngName . }
  OPTIONAL { ?institution isrl:korNameOfInstitution ?instKorName . }
}
ORDER BY ?person</preformat>
        <p>‘createdByPerson’ is a derived property induced by user-defined inference
rules. It shortens the backward path to ‘Persons by Topic’ by going
directly to ‘Person’ without passing through ‘CreatorInfo’ (the
dotted line in figure 3). After retrieving the persons, OntoReasoner performs
post-processing that ranks them in descending order of their number of papers.</p>
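        <p>The retrieval path and the ranking step can be mimicked on a toy in-memory triple set. The Python sketch below is only an illustration under assumptions: the property names hasCreatorInfo and hasCreator for the path through ‘CreatorInfo’ are hypothetical, the data are made up, and persons are ranked by the number of their papers on the queried topic, which is one possible reading of the post-processing.</p>
        <preformat>
```python
from collections import Counter
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical property names for the longer path through CreatorInfo.
HAS_CREATOR_INFO = "hasCreatorInfo"
HAS_CREATOR = "hasCreator"


def derive_created_by_person(triples: Set[Triple]) -> Set[Triple]:
    """Apply one rule: accomplishment -> CreatorInfo -> person
    yields a direct createdByPerson triple."""
    info_to_person = {(s, o) for s, p, o in triples if p == HAS_CREATOR}
    derived = set()
    for acc, p, info in triples:
        if p != HAS_CREATOR_INFO:
            continue
        for info2, person in info_to_person:
            if info2 == info:
                derived.add((acc, "createdByPerson", person))
    return triples | derived


def persons_by_topic(triples: Set[Triple], topic_term: str) -> List[Tuple[str, int]]:
    """Follow topic term -> topic area -> accomplishment -> person,
    then rank persons by their number of papers on the topic."""
    areas = {s for s, p, o in triples
             if p == "hasTopicTermOfAccomplishment" and o == topic_term}
    papers = {s for s, p, o in triples
              if p == "hasTopicAreaOfAccomplishment" and o in areas}
    counts = Counter(o for s, p, o in triples
                     if p == "createdByPerson" and s in papers)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


store = {
    ("area1", "hasTopicTermOfAccomplishment", "markov model"),
    ("paper1", "hasTopicAreaOfAccomplishment", "area1"),
    ("paper2", "hasTopicAreaOfAccomplishment", "area1"),
    ("paper1", HAS_CREATOR_INFO, "info1"),
    ("info1", HAS_CREATOR, "personA"),
    ("paper2", HAS_CREATOR_INFO, "info2"),
    ("info2", HAS_CREATOR, "personA"),
    ("paper2", HAS_CREATOR_INFO, "info3"),
    ("info3", HAS_CREATOR, "personB"),
}
ranked_persons = persons_by_topic(derive_created_by_person(store), "markov model")
print(ranked_persons)
```
        </preformat>
        <p>Materialising the derived triples first means the query itself needs only three joins, which mirrors the shortened backward path described for ‘createdByPerson’.</p>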
      </sec>
      <sec id="sec-3-5">
        <title>3.5 Topic-centric Information</title>
        <p>OntoFrame provides several entity-centric pages, such as topic, person, and event.
Each entity page consists of a stack of information related to a specific entity. For
example, the topic page serves ‘Search Results’, ‘Topic Trends’, ‘Also Try’, ‘Persons by
Topic’, ‘Institutions by Topic’, ‘Papers by Topic’, and ‘Researcher Group (Social
Network)’ as shown in figure 4. ‘Topic Trends’ shows relevant topics by year; we
regard topics as relevant to each other when they are extracted from the same paper.
‘Institutions by Topic’, which lists the dominant institutions, is obtained in a similar way
to ‘Persons by Topic’. ‘Papers by Topic’
shows the papers classified semantically into a topic.</p>
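        <p>‘Topic Trends’ as described above can be approximated by counting, per year, the topics that were extracted from the same papers as the page's topic. The following Python sketch assumes a simple list of (year, topics) pairs; the function topic_trends, the cut-off top_n, and the sample data are hypothetical and serve only as an illustration.</p>
        <preformat>
```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def topic_trends(papers: Iterable[Tuple[int, Set[str]]], topic: str,
                 top_n: int = 3) -> Dict[int, List[Tuple[str, int]]]:
    """For each year, count the topics that co-occur with `topic` in a paper."""
    per_year: Dict[int, Counter] = defaultdict(Counter)
    for year, topics in papers:
        if topic in topics:
            per_year[year].update(t for t in topics if t != topic)
    # Most frequent co-occurring topics first; ties broken alphabetically.
    return {
        year: sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        for year, counts in sorted(per_year.items())
    }


# Made-up papers: publication year and the topics assigned to each paper.
papers = [
    (2005, {"markov model", "time series"}),
    (2005, {"markov model", "time series", "prediction"}),
    (2006, {"markov model", "speech recognition"}),
    (2006, {"ontology", "semantic web"}),
]
trends = topic_trends(papers, "markov model")
print(trends)
```
        </preformat>
        <p>The same grouping, applied to institutions instead of years, would give a rough analogue of ‘Institutions by Topic’.</p>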
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusions</title>
      <p>We gathered 114,337 papers (2000 to 2006) from CiteSeer open access metadata.
They include 161,853 persons and 17,093 institutions. From titles and abstracts,
160,568 topic keywords were extracted; simple and compound nouns were extracted
automatically and then filtered manually by human dictionary constructors. Extracting
up to five topics from a paper takes about 1.6 seconds on average. An entity page
including ‘Persons by Topic’ can be generated on OntoFrame within three seconds.
The whole system will appear in the Poster/Demo Track of ISWC2007.</p>
      <p>This paper showed a method for finding topic-centric identified experts from CiteSeer
open access metadata and full text documents. Topic extraction based on full text
analysis makes it possible to build topically classified papers, and inference propagates
the topics to persons and institutions. A SPARQL query retrieves URI-based ‘Persons by
Topic’ from the RDF triple store. Our future work includes usability tests that
evaluate the performance of topic extraction and expert finding in comparison
with Google Scholar and CiteSeer.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>de Rijke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Finding Experts and Their Details in E-mail Corpora</article-title>
          .
          <source>In Proceedings of the 15th International Conference on World Wide Web</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>de Rijke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Finding Similar Experts</article-title>
          .
          <source>In Proceedings of the 30th Annual International ACM SIGIR Conference</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sung</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Semantic Web-Based Services for Supporting Voluntary Collaboration among Researchers Using an Information Dissemination Platform</article-title>
          .
          <source>Data Science Journal</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ) (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sung</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Construction of Semantic Web-based Knowledge Using Text Processing</article-title>
          .
          <source>In Proceedings of the 4th International Conference on Information Technology : New Generations</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Koll</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Finding Experts in Community-Based Question-Answering Services</article-title>
          .
          <source>In Proceedings of the 14th ACM International Conference on Information and Knowledge Management</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mattox</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maybury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Morey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Enterprise Expert and Knowledge Discovery</article-title>
          .
          <source>In Proceedings of the 8th International Conference on Human-Computer Interaction</source>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Yimam</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Expert Finding Systems for Organizations: Domain Analysis and the DEMOIR Approach</article-title>
          .
          <source>In Beyond Knowledge Management: Sharing Expertise</source>
          . MIT Press (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rüger</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eisenstadt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>The Open University at TREC 2006 Enterprise Track Expert Search Task</article-title>
          .
          <source>In Proceedings of the 15th Text REtrieval Conference</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>