PubRec: Recommending Publications Based On Publicly Available Meta-Data

Anas Alzoghbi, Victor Anthony Arrascue Ayala, Peter M. Fischer, and Georg Lausen
Department of Computer Science, University of Freiburg
Georges-Köhler-Allee 051, 79110 Freiburg, Germany
{alzoghba,arrascue,peter.fischer,lausen}@informatik.uni-freiburg.de

Abstract. In recent years we can observe a steady growth of scientific publications in increasingly diverse scientific fields. Current digital libraries and retrieval systems make searching these publications easy, but determining which of them are relevant for a specific person remains a challenge. This becomes even harder if we constrain ourselves to publicly available meta-data, as complete information (in particular the full text) is rarely accessible due to licensing issues. In this paper we propose to model the researcher profile as a multivariate linear regression problem leveraging meta-data such as abstracts and titles in order to achieve effective publication recommendation. We also evaluate the proposed approach and show its effectiveness compared with competing approaches.

Keywords: Recommender System, Scientific Paper Recommendation, Content-based Filtering, Multivariate Linear Regression, User Modelling

1 Introduction

Modern research is remarkably boosted by contemporary research-supporting tools. Thanks to digital libraries, researchers support their work by accessing a large part of the complete human knowledge with little effort. However, the sheer amount of rapidly published scientific publications overwhelms researchers with a large number of potentially relevant pieces of information. Recommender systems have been introduced as an effective tool for pointing researchers to important publications [5, 9, 10]. An approach that has gained a lot of interest [2] extracts the interests of a user from the text of his/her publication list.
In order to do so in an effective manner, full access to the textual content of research papers is needed. Yet, digital libraries typically provide only meta-data for publications, including the publication date, title, keyword list and abstract. Although the availability of such information alleviates the problem, the usefulness of such a limited amount of information for paper recommendation is still unclear.

Copyright © 2015 by the paper's authors. Copying permitted only for private and academic purposes. In: R. Bergmann, S. Görg, G. Müller (Eds.): Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9. October 2015, published at http://ceur-ws.org

In this work we explore an approach to effectively perform paper recommendation utilizing such limited information. We present an adaptive factor to measure the extent of the active researcher's interest in each of her/his previous publications; we apply a learning algorithm to fit a user model which in turn can be used to calculate the possible interest in a potential paper. Our contributions can be summarized as follows:

– An effective approach for modeling researchers' interests that does not require access to the full text of the publication, but only freely available meta-data.
– An adaptive anti-aging factor that defines, for each researcher and publication, a personalized interest extent, so that older contributions have less impact.
– Preliminary results of comparing our approach against two state-of-the-art recommendation techniques that consider the full textual content.

The rest of this paper is organized as follows. In Section 2 we review work related to our approach. Section 3 presents the problem definition and outlines the proposed approach. Section 4 describes the profile-building model employing the anti-aging factor. In Section 5 we explain the conducted experiments and discuss the results.
Finally, we conclude the paper in Section 6.

2 Related Work

Research paper recommendation has been a hot topic for more than a decade. Several works have addressed this problem, proposing ideas from different recommendation directions [2]. Publication titles and abstracts were employed in [10] to build a user model using collaborative topic regression, combining ideas from Collaborative Filtering and content analysis, but results were of varying quality. Nascimento et al. [5] use titles and abstracts as well. Users provide a representative paper that fits their interests, out of which keywords are extracted from the title and abstract. These keywords are then used to retrieve similar papers from digital libraries. We believe this is a limited approach, as keywords from one publication are not enough to capture user interests. Sugiyama and Kan [8, 9] employ a simplified variation of the Rocchio algorithm [7] to build a user profile utilizing all terms which appear in the full text of the user's authored publications, while also incorporating terms from the citing and referenced papers. However, this approach suffers from the poor quality of the terms used and from the dependency on tools to extract text from PDF files, which have well-known limitations. Above all, the authors assumed the availability of the full text of the publications, which is rarely the case. In this work we optimize the use of the publicly available meta-data rather than relying on the full text of the publication. Moreover, we build a researcher interest model that can depict different affinity models of researchers.

3 PubRec Model

We propose a content-based research publication recommender (PubRec) that models both the active user (the researcher) and the candidate publications in terms of domain-related keywords. This section introduces the basic concepts of PubRec along with the formal problem definition.

3.1 Research Publication Profile

Digital libraries like ACM, IEEE, Springer, etc.
publish meta-data about research publications publicly. Out of this meta-data, we are interested in the title, abstract, keyword list and publication year. The first three can be effectively exploited to build a profile for each publication p as a keyword vector V_p, which represents p in terms of domain-related keywords:

    V_p = ⟨w_{p,k_1}, w_{p,k_2}, ..., w_{p,k_n}⟩,

where k_i is a domain-related keyword from the set of all keywords K, and w_{p,k_i} ∈ [0, 1] is the weight of k_i in p.

All keywords from the keyword list are added to V_p with the maximum weight value of 1 by virtue of their source. As they are assigned to publications explicitly by the authors, we consider them the most precise domain description of the underlying publication. This list, however, usually contains no more than about 10 keywords, which is a small number for modeling a publication; thus, we aim to extend it. Titles and abstracts hold a great essence of the ideas presented in publications. Therefore, we treat them as the second source of keywords, and for each publication we apply keyword extraction to the concatenation of its title and abstract, with weights corresponding to the TF-IDF weighting scheme.

3.2 Researcher Profile

Given a researcher r with a set of her/his publications, we construct a researcher profile V_r = ⟨s_{r,k_1}, s_{r,k_2}, ..., s_{r,k_n}⟩ such that k_i ∈ K is a domain-related keyword, and s_{r,k_i} is the importance of k_i to r. Our proposed profile construction method ensures that r's Interest Extent (IE) in a publication p is obtained by computing the dot product between the researcher's vector and the publication's vector:

    IE(V_r, V_p) = V_r · V_p                                    (1)

3.3 Problem Definition

Our problem can be formally defined as: Given a researcher r along with the corresponding set of publications P_r and a candidate set of publications P_cand, find k publications from P_cand with the maximum IE.
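As an illustration of the profile construction in Section 3.1, the following sketch builds the weight map of a single publication: author-assigned keywords receive the maximum weight 1, and terms extracted from title and abstract receive a normalized TF-IDF weight. The function name and its parameters (a precomputed document-frequency map, the corpus size, and a domain vocabulary) are assumptions for illustration, not names from the paper.

```python
import math
from collections import Counter

def publication_vector(author_keywords, extracted_terms, doc_freq, n_docs, vocab):
    """Sketch of a publication profile: author keywords get weight 1.0;
    terms extracted from title+abstract get a TF-IDF weight normalized
    into [0, 1]. Only domain-related terms (those in `vocab`) are kept."""
    tf = Counter(extracted_terms)
    weights = {}
    for term, count in tf.items():
        if term not in vocab:  # post-filter: keep only domain-related keywords
            continue
        idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
        weights[term] = count * idf
    # normalize TF-IDF weights into [0, 1]
    max_w = max(weights.values(), default=1.0)
    weights = {t: w / max_w for t, w in weights.items()}
    # author-assigned keywords override with the maximum weight
    for kw in author_keywords:
        weights[kw] = 1.0
    return weights
```

The sparse dictionary representation stands in for the keyword vector V_p; absent keywords implicitly have weight 0.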
The presented approach can be summarized in the following steps:

– First, we build the researcher profile V_r from previous publications by modeling the problem as a multivariate linear regression problem (Section 4).
– Each candidate publication p ∈ P_cand is modeled as a keyword vector V_p.
– We use Formula 1 to calculate IE(V_r, V_p) for each candidate publication p ∈ P_cand.
– Candidate publications are ordered by their Interest Extents, and the top k are recommended to r.

4 Modeling Researcher Interest

We utilize researchers' publications to draw conclusions about their interests. A key aspect of PubRec consists in considering the different interest researchers have in their publications. After all, this interest might vary from paper to paper depending on several factors, and the importance of these factors varies among researchers. We believe that publication age is an important factor in this regard, since a five-year-old publication, for example, might not reflect the author's current interest as much as a publication of the current year. Based on that, we introduce a scoring function for estimating the affinity of a researcher r towards one of her publications p ∈ P_r by engaging the publication's age, which is expressed by the number of years elapsed since the publication's date and represented by σ in the following function:

    IE_{r,p} = e^{−σ²/λ}                                        (2)

Here, λ is the researcher-specific anti-aging factor. In Figure 1, the curve of IE is plotted for three different values of λ: 4, 20 and 50. There we can see how λ regulates the steepness of this curve. As the value of λ increases, the curve becomes less steep and results in higher IE values for older publications. For example, consider researcher r_0: the Interest Extent of r_0 for p_0, a 3-year-old publication, can be modeled in three different ways upon three different values of λ: IE_{r_0,p_0} = 0.1 for λ = 4, IE_{r_0,p_0} = 0.63 for λ = 20 and IE_{r_0,p_0} = 0.83 for λ = 50.
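The scoring function of Formula 2 and the top-k recommendation steps above can be sketched as follows. This is a minimal illustration; the dictionary-based vector representation and function names are assumptions, not the paper's implementation.

```python
import math

def interest_extent(age_years, lam):
    """Formula 2: IE decays with publication age sigma (in years);
    a larger anti-aging factor lambda means slower decay."""
    return math.exp(-(age_years ** 2) / lam)

def recommend_top_k(researcher_vec, candidates, k):
    """Rank candidate publications by the dot product of Formula 1
    and return the IDs of the top k. Vectors are sparse dicts."""
    def dot(pub_vec):
        return sum(researcher_vec.get(t, 0.0) * w for t, w in pub_vec.items())
    ranked = sorted(candidates.items(), key=lambda kv: dot(kv[1]), reverse=True)
    return [pid for pid, _ in ranked[:k]]
```

With λ = 4, 20 and 50, `interest_extent(3, λ)` reproduces the three example values for the 3-year-old publication p_0 (≈ 0.1, 0.63 and 0.83).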
This behavior helps in modeling different types of researchers based on their affinity model: researchers who tend to stick to the same research topics for a longer time are modeled using larger λ values compared to researchers who tend to change their topics of interest more rapidly. Choosing the best λ for each researcher is done empirically in this work; further investigation of the correlation between researcher characteristics and the optimal λ value is left for future work.

[Figure: IE plotted against publication age σ (0 to 10 years) for λ = 4, λ = 20 and λ = 50, with the three IE values of the 3-year-old publication p_0 marked on the curves.]
Fig. 1: Impact of the anti-aging factor λ on the Interest Extent IE of researcher r_0

4.1 Learning Researcher Profile

The second contribution of this work is to model the problem of measuring the importance of domain-related keywords for a researcher r as a multivariate linear regression problem, as follows: Given the set of r's publications P_r, for each publication p_i ∈ P_r we build the underlying publication profile as described in Section 3.1: V_{p_i} = ⟨w_{p_i,k_1}, w_{p_i,k_2}, ..., w_{p_i,k_n}⟩. Furthermore, the Interest Extent IE_{r,p_i} is calculated using Formula 2, as shown in Figure 2. Let the set of keyword weights of the paper p_i, w_{p_i,k_1}, w_{p_i,k_2}, ..., w_{p_i,k_n}, be the set of predictors related to the response variable IE_{r,p_i}; then the multivariate linear regression model [6] for p_i is defined as:

    IE_{r,p_i} = θ · V_{p_i} = θ_0 + θ_1 w_{p_i,k_1} + ... + θ_n w_{p_i,k_n}

           k_1       k_2       k_3      ...  k_n       IE
    p_1 →  w_{1,1}   w_{1,2}   w_{1,3}  ...  w_{1,n}   IE_{r,p_1}
    p_2 →  w_{2,1}   w_{2,2}   w_{2,3}  ...  w_{2,n}   IE_{r,p_2}
    ...
    p_m →  w_{m,1}   w_{m,2}   w_{m,3}  ...  w_{m,n}   IE_{r,p_m}
    θ   →  θ_1       θ_2       θ_3      ...  θ_n

Fig. 2: Publication keyword vectors and Interest Extents for one researcher

Here θ is the regression coefficient vector and θ_0, θ_1, ..., θ_n are the regression coefficients. Each coefficient value θ_j, j ∈ {1, ..., n}, defines the relation between the researcher r and the keyword k_j, or in other words the importance of k_j for r. Consequently, the user profile is modeled by means of θ.
That is, in order to find the user profile V_r, we solve the regression problem above and find the vector θ. This is done by minimizing the cost function:

    J(θ) = (1/2m) Σ_{i=1}^{m} (θ · V_{p_i} − IE_{r,p_i})²

This is a well-known optimization problem, and several algorithms exist to solve it, such as gradient descent or the normal equation [1]. We use an algorithm known for its efficiency, namely the L-BFGS algorithm [4].

5 Experiments

We conducted experiments to validate our approach and compared it against state-of-the-art approaches. In the following we describe the dataset along with the evaluation metrics used. Finally, we show and discuss the results.

5.1 Dataset

To evaluate the presented approach, we used the Scholarly Publication Recommendation dataset¹. It covers information about 50 anonymous researchers, enclosing their publication sets, in addition to a set of publications of interest for each researcher. The interest lists are subsets of a larger collection of 100,531 publications, called the candidate publications, which is also provided.

¹ https://www.comp.nus.edu.sg/~sugiyama/dataset2.html

To the best of our knowledge, this is the only available dataset which provides the interest lists for such a number of researchers. However, we had to resolve a major obstacle before we could use the dataset: publications in the dataset are named by unique IDs without titles or author names, hence they cannot be identified, and no meta-data was provided. In order to make the dataset usable for our evaluation, we needed to identify the publications so as to retrieve their meta-data.
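A compact sketch of the profile-learning step: the paper fits θ with L-BFGS, but the same least-squares objective J(θ) also admits a closed-form solution (the normal-equation route mentioned above), which this illustration uses via NumPy's least-squares solver. The function name and the dense matrix layout are assumptions for illustration.

```python
import numpy as np

def learn_researcher_profile(pub_vectors, interest_extents):
    """Fit the regression coefficients theta of Section 4.1.
    pub_vectors: (m, n) matrix of keyword weights, one row per publication p_i.
    interest_extents: length-m vector of targets IE_{r,p_i} from Formula 2.
    Minimizes the same least-squares cost J(theta) as in the paper, but in
    closed form via lstsq rather than with L-BFGS."""
    m = pub_vectors.shape[0]
    X = np.hstack([np.ones((m, 1)), pub_vectors])  # prepend column for theta_0
    theta, *_ = np.linalg.lstsq(X, interest_extents, rcond=None)
    return theta  # theta[0] is the intercept; theta[1:] plays the role of V_r
```

For the vocabulary sizes arising from full digital-library corpora, an iterative solver such as L-BFGS (as used in the paper) scales better than the closed-form route, which is why the authors' choice is reasonable.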
This was achieved by the following steps: (a) requesting and obtaining the original PDF files from the dataset authors; (b) extracting the publications' titles from the PDF files and using them to find publication identities within the DBLP² register; and finally (c) using the electronic edition pointer (ee) from the DBLP publication attributes, retrieving the needed information from the corresponding publisher's web site³. The result is a rich dataset that contains meta-data for 69,762 candidate publications and, more importantly, the full publication and interest sets for 49 researchers. Lastly, we applied keyword extraction to all publications in this dataset.

Keyword extraction and weighting. We use Topia's Term Extractor⁴ because of its efficiency and usability. It is a tool that uses Part-Of-Speech (POS) tagging and statistical analysis to determine the terms and their strength in a given text. We extended this tool in order to extract keywords of higher quality and make the best use of the limited available resources. Our extensions to Topia are: (a) we apply post-filtering on the resulting terms by choosing only those terms which appear in a white list of computer science terms; (b) the weights of the extracted terms are calculated based on the normalized TF-IDF weighting scheme.

5.2 Evaluation Metrics

We report the quality of our method with two important and widely adopted metrics for evaluating ranking algorithms in information retrieval. In the following, r is a researcher from the set of researchers R.

Mean Reciprocal Rank (MRR). MRR measures the method's quality by checking the position of the first correct answer in the ranked result. For each researcher r, let p_r be the position of the first interesting publication in the recommended list; then MRR is calculated as:

    MRR = (1/|R|) Σ_{r∈R} (1/p_r)

Normalized Discounted Cumulative Gain (nDCG) [3]. DCG@k indicates how good the top k results of the ranked list are.
Typically, in recommender systems DCG is measured for k ∈ {5, 10}, as users don't usually check recommended items beyond the 10th position. The DCG for a researcher r is calculated as:

    DCG_r@k = Σ_{i=1}^{k} (2^{rel(i)} − 1) / log₂(1 + i)

where rel(i) indicates the relevance of the item at position i: rel(i) = 1 if the i-th item is relevant and rel(i) = 0 otherwise. nDCG is the normalized score, which takes values between 0 and 1; it is calculated as nDCG@k = DCG@k / IDCG@k, where IDCG@k is the DCG@k score of the ideal ranking, in which the top k items are relevant. In our case we report the average nDCG@k over all researchers for k ∈ {5, 10}.

² http://dblp.uni-trier.de/
³ We received the ACM publications' meta-data from ACM as XML.
⁴ http://pypi.python.org/pypi/topia.termextract

5.3 Experimental Results

Using the previously described dataset and evaluation metrics, we conducted quality evaluations for our method with the following setup: given a set of candidate publications and a set of researchers with their publication sets, the system should correctly predict the interesting publications for each researcher. The results are shown in the first row of Table 1. It shows that PubRec achieves a high MRR score of 0.717. Looking deeper into the details of this metric by examining results for individual researchers gives more insights: for 29 out of 49 researchers the first relevant publication appeared at the first position of the recommended list, and at the second position for 7 researchers. We compared our approach with two state-of-the-art publication recommender systems [5, 9]. The work presented in [9] models each publication p using terms from p, from publications referenced by p and from publications that cite p. Additionally, the authors extended the set of citing publications by predicting potential citing publications.
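The two evaluation metrics of Section 5.2 can be sketched as follows, assuming binary relevance as in the paper. Note that the base of the logarithm cancels in the nDCG ratio, so log₂ is used here without loss of generality.

```python
import math

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """MRR: average over researchers of 1/p_r, where p_r is the 1-based
    position of the first relevant publication in r's recommended list."""
    total = 0.0
    for r, ranked in ranked_lists.items():
        for pos, pid in enumerate(ranked, start=1):
            if pid in relevant_sets[r]:
                total += 1.0 / pos
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, relevant, k):
    """nDCG@k with binary relevance: DCG@k divided by the DCG@k of the
    ideal ranking, whose top min(k, |relevant|) positions are all relevant."""
    def dcg(rels):
        return sum((2 ** rel - 1) / math.log2(1 + i)
                   for i, rel in enumerate(rels, start=1))
    gains = [1 if pid in relevant else 0 for pid in ranked[:k]]
    idcg = dcg([1] * min(k, len(relevant)))
    return dcg(gains) / idcg if idcg > 0 else 0.0
```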
As our key contribution lies in utilizing only publicly available data, we implemented their core method⁵ (denoted Sugiyama (core) below) for modeling scientific publications, considering only the terms which appear in the underlying publication. We compared PubRec against this implementation in two different setups: (a) using all terms that appear in the full text of the publication⁶; (b) using our domain-related keywords. The results are shown in the second and third rows of Table 1, respectively. In both setups PubRec outperforms the reimplemented method in all three measured metrics. Furthermore, comparing our results with the results of Sugiyama and Kan as reported in [9], where they assume the availability of the full text of the citing and referenced publications (5th row in Table 1) in addition to the potentially citing publications (4th row in Table 1), we find that our approach, with such limited available information, is competitive and exhibits a reasonable trade-off between data availability and recommendation quality. The last row in the table shows the scores of [5]⁷, where publications are modeled using N-grams extracted from titles and abstracts. Each user identifies a representative publication, and the recommendation process turns into finding publications similar to the representative one by means of the cosine similarity.

6 Conclusion

We have proposed a novel approach for recommending scientific publications. By exploiting only publicly available meta-data from digital libraries, the quality of the predictions is superior to that of state-of-the-art approaches relying on the full text of the paper alone, and competitive with approaches that additionally exploit citing and referenced publications. The focus is primarily on user profiling, where a strategy to determine the trend of a user's interests in her own publications over time is integrated into a multivariate linear regression problem.
⁵ This method applies a light-weight variation of the Rocchio algorithm [7].
⁶ The dataset provided by the authors of [9] contains all terms (not only domain-related terms) that appear in the full text of the publication.
⁷ Values are taken from [9].

                                        MRR    nDCG@5  nDCG@10
    PubRec                              0.717  0.445   0.382
    Sugiyama (core) on their dataset    0.550  0.395   0.358
    Sugiyama (core) on PubRec dataset   0.577  0.345   0.285
    Sugiyama and Kan [9]                0.793  0.579   0.577
    Sugiyama and Kan [8]                0.751  0.525   0.479
    Nascimento et al. [5]               0.438  0.336   0.308

Table 1: Recommendation accuracy comparison with other methods

The efficacy of our approach is demonstrated by experiments on the Scholarly Paper Recommendation dataset. As future work, we plan to investigate the relationship between the anti-aging factor λ and researcher characteristics. Furthermore, we are interested in investigating the effects of enriching our modeling with meta-data from citing and referenced publications.

Acknowledgments. Work by Anas Alzoghbi was partially supported by the German Federal Ministry of Economics and Technology (BMWi) (KF2067905BZ). We thank Kazunari Sugiyama for his efforts in providing us with the complete Scholarly dataset. We also thank ACM for providing meta-data for our dataset.

References

1. Alpaydin, E.: Introduction to Machine Learning. Adaptive Computation and Machine Learning Series, MIT Press (2014)
2. Beel, J., Langer, S., Genzmehr, M., Gipp, B., Breitinger, C., Nürnberger, A.: Research paper recommender system evaluation: A quantitative literature survey. In: Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation. RepSys '13 (2013)
3. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (Oct 2002)
4. Liu, D., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Mathematical Programming 45(1-3), 503–528 (1989)
5.
Nascimento, C., Laender, A.H., da Silva, A.S., Gonçalves, M.A.: A source independent framework for research paper recommendation. In: Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries. JCDL '11, ACM (2011)
6. Rencher, A., Christensen, W.: Methods of Multivariate Analysis. Wiley Series in Probability and Statistics, Wiley (2012)
7. Rocchio, J.J.: Relevance feedback in information retrieval (1971)
8. Sugiyama, K., Kan, M.Y.: Scholarly paper recommendation via user's recent research interests. In: Proceedings of the 10th Annual Joint Conference on Digital Libraries. JCDL '10, ACM (2010)
9. Sugiyama, K., Kan, M.Y.: Exploiting potential citation papers in scholarly paper recommendation. In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM (2013)
10. Wang, C., Blei, D.M.: Collaborative topic modeling for recommending scientific articles. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '11 (2011)