ArnetMiner: An Expertise Oriented Search System
                   for Web Community

       Jie Tang, Jing Zhang, Duo Zhang, Limin Yao, Chunlin Zhu, and Juanzi Li
                Department of Computer and Technology, Tsinghua University
                {tangjie, zhangjing, zhangduo, ylm, ljz}@keg.cs.tsinghua.edu.cn


       Abstract. Expertise Oriented Search aims at providing comprehensive analysis
       and mining for people from distributed sources. In this paper, we give an
       overview of the expertise oriented search system (ArnetMiner). The system
       addresses several key research issues in extraction and mining of a researcher
       social network. The system is in operation on the internet for more than one
       year and receives accesses from about 1,500 users per month. Feedbacks from
       users and system logs indicate that users consider the system can really help
       people to find and share information in the web community.


1. Introduction

Web-based communities have become one of the most important online applications
[3] [5]. Web community targets at providing user-centered services to facilitate
finding and sharing information. Previous information search and mining methods is
not sufficient in this new scenario, due to lacks of semantics and lacks of effective
and efficient approaches to deal with the new mining issues.
   In this paper, we present a novel expertise oriented search system for web
community, which is available at http://www.arnetminer.org [7]. Our objective in this
system is to provide services for searching and mining the semantic-based web
community. Specifically, we currently focus on academic researcher community and
aim at answering four questions: 1) how to automatically extract the researcher profile
from the existing unstructured Web, 2) how to integrate the information (i.e.,
researchers’ profiles and publications) from different sources, 3) how to provide
useful search services based on the constructed web community, and 4) how to mine
the web community so as to provide more powerful services to the users.
   In ArnetMiner, we define the researcher profile ontology and perform researcher
profiling automatically using a unified approach. We integrate publications from the
existing bibliography datasets. In the integration, we propose a constraints-based
probabilistic model to deal with the problem of name disambiguation. We provide
three types of search services. Moreover, we provide several mining services, such as
expert finding, people association finding, and hot-topic finding.
   The system advances four points: 1) proposal of a unified approach to researcher
profiling, 2) proposal of a constraint-based probabilistic model to name
disambiguation, 3) proposal of a score-and-propagate approach to expert finding, and
4) proposal of an efficient approach to association search.
2. System Overview

Figure 1 shows the architecture of
the system. The system mainly
consists of five main components:
                                              5. Mining                      4. Search
1. Extraction:      it   automatically              Expert finding
   extracts the researcher profile               Association Finding              Person search
   from the Web by the following                  Hot-topic Finding             Publication search
   steps: 1) first collect and identify           Sub-topic Finding
                                                                               Conference search
                                                Survey paper finding
   relevant pages (e.g. one’s
   homepages or introducing pages)                3. Storage and Access
   from the Web, 2) use a unified                                 Access Interface
   approach to extract the profiling
   information from the identified                  Indexing
   pages, and 3) collect publications               Storage            RNKB
   from existing digital libraries.                                                   Ontology
2. Integration: it integrates the
   extracted researchers’ profiles                      2. Integration
   and the crawled publications. It                             Name disambiguation
   employs the researcher name as
   the identifier. A constraint-based         1. Extraction
   probabilistic model has been                   Profile extraction
                                                                              Publication collection
   proposed to deal with the name               Document Collection
   ambiguity problem in the
   integration. The integrated data is
   stored into a researcher network                               Web
                                                                                        Papers
                                                                                                DBLP
   knowledge base (RNKB).
3. -torage and Access: it provides            Figure 1. Architecture of ArnetMiner
   storage and index for the
   extracted/integrated data in the RNKB. Specifically, for storage it employs Jena
   [2]; for index, it employs the inverted-file indexing method [9].
4. -earch: it provides three types of search services: person search, publication
   search, and conference search. Given the name of a person, person search returns
   his/her profile information, authored publications, and relationships with the other
   researchers. Given a keyword, publication search returns the relevant publications.
   And conference search intends to find related conferences for a given keyword.
5. 3ining: it provides five mining services: expert finding, people association finding,
   hot-topic finding, sub-topic finding, and survey paper finding. Given a topic,
   expert finding returns a list of persons who are ‘experts’ on the topic. Given a
   keyword, hot-topic and sub-topic finding returns the hottest research topics that
   researchers interested in and sub topics in that field. And given any two persons,
   people association finding returns possible associations between them. Survey
   paper finding is aimed at finding survey papers for a given topic, which is helpful
   for the researcher to gain a quick insight into a research topic.
   For several features in the system, e.g., researcher profile extraction, name
disambiguation, expert finding, and association search, we propose new approaches
trying to overcome the drawbacks that exist in the conventional methods. For some
other features, e.g., storage, knowledge access, and searching, we utilize the state-of-
the-art methods. This is because, these issues have been intensively investigated
previously and the conventional methods can result in good performances. We also
provides easy access interface (web services) for developing advanced applications.
   Please note that this is a product of an ongoing project. Visitors should expect the
system to change. We are extracting more researcher profiles and publications and are
also developing more practical search services based on feedbacks from users.


3. Extraction of the Researcher Community

We define the researcher profile ontology (Figure 2), by extending FOAF [1]. In the
ontology, two concepts, 24 properties and two object relations are defined.
                 Research_Interest                    Fax                            Title
               Affiliation                              Phone
                                                                                              Publication_venue
                Postion                                     Address
                                                                                                           Start_page
      Person Photo                                           Email

      Homepage                                              authored_by                                     End_page
                                                                               Publication
      Name
                                                                      author                              Date
                                         Researcher
     Phddate                                                                                         Publisher
                                                                 Bsdate
         Phduniv                                                                                  Download_URL
                                                               Bsuniv
                             Msdate
               Phdmajor                                       Bsmajor
                                     Msuniv                                            Property
                                                                                                             Relation
                                              Msmajor                                  Concept

                        Figure 2. The researcher profile ontology
   We randomly selected 1K researchers and studied how to extract profiles of the
researchers from the Web. We found that it is non-trivial to perform the extraction.
Specifically, we observed that 85.62% of the researchers are faculties of universities
and 14.38% are researchers from company. For researchers from the same company,
they might have similar template-based homepages. However, different companies
have different templates. For researchers from universities, the layout and the content
of the homepages vary largely depending on the authors. We have also found that
71.88% of the 1K Web pages are researchers’ homepages and the rest are introducing
pages. Characteristics of the two types of pages significantly differ from each other.
   Statistical study also unveils that (strong) dependencies exist between profile
properties. For example, there are 3,842 cases (12.98%) in our data that extraction of
a property needs use the extraction results of the other properties. An ideal method
should consider annotating all the properties together.
   We propose a unified approach to researcher profiling [8]. The approach consists
of three steps: relevant page identification, preprocessing, and tagging. In relevant
page identification, given a researcher name, we first get a list of web pages by a
search engine (we used Google API) and then identify the homepage/introducing page
using a classifier. The performance of the classifier is 92.39% in terms of F1-measure.
In preprocessing, (A) we separate the text into tokens and (B) we assign possible tags
to each token. The tokens form the basic units and the pages form the sequences of
units in the tagging problem. In tagging, given a sequence of units, we determine the
most likely corresponding sequence of tags by using a trained tagging model. (The
type of the tags corresponds to the property defined in Figure 2.) In this paper, as the
tagging model, we make use of Conditional Random Fields (CRFs) [4].
   We conducted experiments to evaluate the performance of the unified approach.
On the randomly chosen 1K researchers’ pages, our approach can reach 83.37% (in
terms of F1-measure) on average. We compared our method with several state-of-the-
art methods, i.e., rule learning based method (Amilcare) and classification based
method (SVM-based method). Our approach outperforms the two baseline methods.


4. Integration of Heterogeneous Data

We integrate the publication data from existing online data source. We chose DBLP
bibliography (dblp.uni-trier.de/), which is one of the best formatted and organized
bibliography datasets. DBLP covers approximately 800,000 papers from major
Computer Science publication venues. In DBLP, authors are identified by their
names. For integrating the researcher profiles and the publications data, we use
researcher names and the author names as the identifier. The method inevitably has
the ambiguity problem (different researchers have the same name).
   The task of name disambiguation can be defined as follow: Given a person name a,
we denote all publications containing the author named a as P={p1, p2, …, pn}. For
each publication pi, it has attributes: title, conference, year, abstract, authors, and
references. Suppose there existing ; actual researchers {y1, y2, …, y;} having the
name a, our task is to assign these n publications to their real researcher yi.
   Our method is based on a unified probabilistic model using Hidden Markov
Random Fields (HMRF) [8]. This model incorporates constraints and a
parameterized-distance measure. The disambiguation problem is cast as assigning a
tag to each paper with each tag representing an actual researcher yi. Specifically, we
define the a-posteriori probability as the objective function. We aims at finding the
maximum of the objective function. We incorporate different types of constraints into
the objective function, where constraints are considered as a form of supervision or
background knowledge. If one paper’s assignment violates a constraint, it will be
penalized in some sense, which in turn affects the disambiguation result.
   For evaluating the proposed disambiguation method, we created two test sets from
the data collected in ArnetMiner. We applied our method to the two datasets and
obtained 75% in terms of F1-measure. We compared our method with a baseline
method using unsupervised clustering algorithm. The baseline is similar to that
proposed by [6] except that [6] also use a search engine to help the disambiguation.
Our method outperforms the baseline method by 8.0% in terms of F1-measure.


5. Storage and Access

ArnetMiner represents the data based on RDF/OWL and stores the extracted data in
MySQL database using Jena, version 1.5 [2]. To query the data, we use SPARQL. We
extracted about half million researcher profiles, integrated more than 0.8 million
publications, and extracted about 2.4 million co-author relationships between
researchers with 5.38 relationships for each on average. We stored the data as RDF
triples. In total, there are more than 10M N3 triples stored in the database.
   For searching for instances with one property containing some keyword such as
“Professor”, the naive SPARQL based method would not be efficient (sometimes
even need use dozen of minutes). For efficiently performing this kind of search, we
create an inverted-file index. Using the inverted-file index, we can efficiently search
for the URI of the instances/properties that contain the keyword. Then we employ
SPARQL to query the specified URI. In this way, the index-based method uses only
0.14 second to conduct an average search.


6. Search

In ArnetMiner, we provide three types of searches: person search, publication search,
and conference search.
1. Person search. The user inputs a person name, and the system returns the profile of
   the person. We perform person search in the constructed researcher network. If a
   person can be found, the profile of the person stored in the local knowledge base
   will be displayed. The system also supports searching with constraints, for
   example, the user can input a query like “Jie Tang, aff:Tsinghua” to searches for
   the person “Jie Tang” and with its “affiliation” containing “Tsinghua”.
2. Publication search. The user inputs keywords, and the system returns publications
   with the most relevant publications on the top. We employ the conventional
   information retrieval model to do the publication search. Moreover, the system
   tries to find the download link of each publication from the web.
3. Conference search. The user inputs keywords (e.g. “ISWC 2006”), and the system
   returns the detailed information of the conference.


7. Mining

Currently, ArnetMiner provides five mining services: expert finding, people
association finding, hot-topic finding, sub-topic finding, and survey paper finding.


7.1 Expert Finding

The goal of expert finding is to identify persons with some given expertise from the
community: “Who are the experts on topic X in the researcher community?”.
   We propose a new approach for finding experts in a web community in which we
take into consideration of both person profile and relationships between persons. The
approach consists of two stages, Candidate Scoring and Expert Propagation.
   In Candidate Scoring, we use the person profile information to calculate an initial
expert score for each person. The basic idea here is that if a person has (co)authored
many documents on a topic or if the person’s name co-occurs many times with the
topic, then it is likely that he/she is a candidate expert on the topic.
   In Expert Propagation, we make use of relationships between persons to improve
the accuracy of expert finding. The basic idea here is that if a person knows many
experts on a topic or if the person’s name co-occurs many times with an expert, then it
is more likely that he/she is an expert on the topic.
   Our intuition stems from our observations on how humans find an expert in the real
world, namely by a) reading person profile information, and b) asking known experts
to make a recommendation. Our approach is an implementation of the two observ-
ations by combining the person profile and the relationships in the Web community.
   We conducted experiments to evaluate the method for expert finding. We assume
that a real ‘expert’ is often active in the committees of the top conferences and
organizations in his/her related research topics. We collected topics and answers
(http://keg.cs.tsinghua.edu.cn/project/PSN/dataset.html). Experimental results show that
our method outperforms the baseline method using only researcher profiles and the
method using PageRank. See [10] for details.


7.2 People Association Finding

Given a web community, the people association is defined as a sequence of
relationships {eri1, er12, …, erl=} satisfying erm(m+1)!E for m=1, 2, …, l-1, where vi and
v= represents the source person and the target person, respectively.
    Given a large-scale web community, to find all possible associations between two
persons is obviously an NP-hard problem. In ArnetMiner, we concentrate ourselves
on finding the most ‘goodness’ associations. We call the association with the smallest
score (the small the best) as the shortest association and our goal is to find the near@
shortest associations, whose scores are within a factor of (1+!) of the score of the
shortest association for some user-defined !>0. Our method consists of two stages.
    1. Shortest association finding. It aims at finding shortest associations from all
persons v!A\v= in the community to the target person v= (including the shortest
association from vi to v= with score Lmin). We employed a heap-based Dijkstra
algorithm to find the shortest associations between two persons.
    2. Near-shortest associations finding. Based on the found shortest association score
Lmin>0 and a pre-defined parameter !, the algorithm requires enumeration of all
associations that are less than (1+!)Lmin by a depth-first search. We constrain the
length of an association to be less than a pre-defined threshold.
    To evaluate the effectiveness of our proposed approach, we created 9 test sets.
Experimental results show that our approach achieves high performance in all of the
association search tasks. In terms of the average time, our approach can find
associations in less than 3 seconds in most of the search tasks.


7.3 Hot-topic and Sub-topic Finding

Finding the hottest research topics and the sub topics in a research field is a very
important issue. For sub-topic finding, a clustering algorithm is utilized to group the
papers that contain the keyword inputted by the user. A threshold is used to determine
the number of clusters. Then each cluster is viewed as a sub-topic.
    For hot-topic fining, we employ a language model based methods. It uses two
steps: n-best Part-Of-Speech (POS) tagging and term (base-noun phrase)
identification given the n-best POS-sequences. In the first step, it finds the n-best POS
sequences for a sentence in the paper or paper title by estimating a language model
from the training data. In the second step, it again uses a trained language model to
estimate the best term sequence. For each term, a probability is assigned, representing
its popularity. We view the terms with the highest probabilities as the hot topics.


7.4 Survey Paper Finding

A survey paper objectively surveys a body of previously published research on a
topic, integrating information from several published papers. Researchers often start
investigating a new research issue by first studying the survey papers of that field. We
employed a classification based method to find the survey papers. Specifically, given
a keyword, we use the state-of-the-art retrieval method to find a set of relevant papers
and view the papers as candidates. Then we utilize a classification model to identify
whether a paper is a survey paper or not. As the classification model, we employ
Support Vector Machines (SVM). Features were defined in the classification model.


8. Experiences

Here, we share some thoughts about the strengths and the weaknesses of system.
   Strengths
   From experimental results, we see that ArnetMiner can achieve high performance
in most of the key issues addressed, including profiling, integration, expert finding,
and association finding. Some concluding remarks are as follows:
   1) Automatic extraction of the researcher profile from the Web is feasible and the
profile properties are usually inter-independent. By making use of the dependencies
between the properties, the accuracy of the profile extraction can be improved.
   2) Integration of data from different sources is necessary for web community.
Name disambiguation is the key issue in the integration. Our approach based on
HMRF model can obtain better results than the baseline method.
   3) Efficient storing and access is very important for a Semantic Web application.
Using the index-based method, the system can provide high efficiency in search.
   4) Expert finding is an important issue in the academic community. The score-and-
propagate approach can effectively combine the researcher profile and the
relationships between researchers, and thus obtain high performance.
   5) People association search is another important issue for searching web
community. The proposed approach can efficiently find associations between people.
   Weaknesses/Future works
   1) Extraction of more types of relationship. In ArnetMiner, we use only the co-
authorship as the relationship. In the future, we will extract other relationships, e.g.,
the relationship of co-organization and co-project etc.
   2) FOAF file integration. In the current system, we use the Google search API to
locate the pages and extract the profile from the identified page for a researcher.
FOAF files are also important sources to get person description. We can further
integrate FOAF files on the Web to get more information about the person.
   As future work, we also plan to investigate more mining issues to empower the
system, for example expertise publication finding and rising ‘star’ finding on a topic.
   We received feedbacks from about one hundred users. Most of the feedbacks are
positive. For example, some suggest that the expert finding approach is useful and it
can be enhanced by adding several new features (e.g. reviewers finding for a paper).
Some other feedbacks also ask for improvements of the system. For example, 5% of
the feedbacks complain mistakes made in the profile extraction and 6.8% point out
the integration mistakes (assigning publications to a wrong researcher). In addition,
5.5% of the feedbacks mention that the found research interests are not accurate and
the method should be improved, which is also our current research issue.


9. Conclusion

In this paper, we have presented an expertise oriented search system, called
ArnetMiner, for web community. We introduced the architecture and the main
features of the system. We have described in detail the several issues that we are
focusing on and proposed our approaches to them. We have carried out experiments
for evaluating each of the proposed approaches. We also simply analyzed the
strengths and weakness of the system.


References

[1] D. Brickley and L. Miller. FOAF vocabulary specification, namespace document,
   September 2, 2004. http://xmlns.com/foaf/0.1/.
[2] J.J. Carroll, J. Dickinson, C. Dollin, R. Reynolds, A. Seaborne, and K. Wilkinson. Jena:
   implementing the Semantic Web recommendations. In Proc. of WWW’2004, pp.74-83.
[3] J. Golbeck. Web-based social networks: a survey and future directions. Technique Report.
[4] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models
   for segmenting and labeling sequence data. In Proc. of ICML’2001, pp.282-289.
[5] P. Mika. Flink: Semantic Web technology for the extraction and analysis of social networks.
   Web Semantics: Science, Services and Agents on the World Wide Web, 2005, v(3):211-223.
[6] Y.F. Tan, M. Kan, and D. Lee. Search engine driven author disambiguation. In Proc. of
   JCDL’2006, Chapel Hill, NC, USA, June 2006, pp. 314-315.
[7] J. Tang, M. Hong, J. Zhang, B. Liang, L. Yao, and J. Li. ArnetMiner: toward building and
   mining social networks. (Demo) In Proc. of SIGKDD’2007.
[8] J. Tang, D. Zhang, and L. Yao. Social network extraction of academic researchers. In Proc.
   of ICDM’2007, to appear.
[9] C.J. van Rijsbergen. Information retrieval. But-terworths, London, 1979.
[10] J. Zhang, J. Tang, and J. Li. Expert finding in a social networks. In Proc. of Database
   Systems for Advanced Applications (DASFAA’2007).