Are Cited References Meaningful? Measuring Semantic Relatedness in Citation Analysis

Hassan Alam, Aman Kumar, Tina Werner, Manan Vyas
BCL Technologies
{hassana, amank, twerner, mvyas}@bcltechnologies.com

Abstract. In this proof-of-concept study we use the standard cosine similarity measure to calculate the semantic similarity between two pieces of text: the citing document and the cited text. Three subject matter experts then evaluate the citing and the cited text for the pairs selected by the cosine score and give their judgment on the semantic similarity between the two pieces of text.

Keywords: Bibliometrics, citation analysis and network analysis for IR

1 Introduction

Researchers and scientists in both academia and industry present and publish their research work on a variety of platforms. Because of career pressure and other factors, they are encouraged to publish and present more and more. The large and rapidly increasing amount of scientific literature, online and in print (books, journals), has triggered intensified research into understanding the effectiveness and quality of this work.

When reading a research paper we often glance at the bibliography or the reference list for additional information. Authors cite references to acknowledge the sources they consulted while preparing the paper; ideally, they are expected to report a source even when they do not quote from it directly. Readers can use the reference list to check the accuracy of the published material, which establishes credibility for the author. As readers, however, we may not have time for further consultation because of the sheer volume of the cited material.

In this study we build a proof-of-concept system that looks only at the parts of the cited material that are relevant to evaluating the claims made by the original author in a given paragraph. The system analyzes the text around the citing sentences in the original article and tries to find the corresponding material in the referenced articles, to check whether the cited text is semantically related to the citing text. Compared to a human reader, this tool considerably cuts down the time spent reading and analyzing the cited material.

Our goal in this study is to evaluate the relationship between citing and cited documents by examining measures of cosine similarity between the citing sentences and the text of the cited scientific articles. Since both the citing and the cited documents discuss the same topics, we assume that the concepts that are relevant to one another will be more similar than those that are not. If effective, this will allow identification of the material in the cited article that is relevant to the citing text. Once we establish that this similarity metric gives satisfactory results for this specific task, we will implement other semantic similarity measures, such as Latent Semantic Indexing, and evaluate the results.
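As a minimal, self-contained illustration of this idea (our own sketch, not the system described in Section 3; the two sentences and the use of scikit-learn are assumptions for demonstration only), the cosine similarity between a citing sentence and a candidate sentence from a cited article can be computed as follows:

```python
# Minimal sketch of the core idea, assuming scikit-learn is available.
# The two sentences are invented examples, not data from this study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

citing = ("Prior work has shown that stemming improves term matching "
          "between related documents.")
cited = ("We demonstrate that reducing words to their stems increases "
         "term overlap between documents on the same topic.")

# Vectorize both sentences with tf-idf, then compare the two vectors.
vectors = TfidfVectorizer(stop_words="english").fit_transform([citing, cited])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"cosine similarity: {score:.3f}")
```

A high score suggests the citing sentence and the cited sentence discuss the same concept; the system described below implements this comparison directly rather than through a library.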
2 Related Work

The author of [1] explores reasons why citing and cited works may be related. The analysis indicates that factors such as the source of the cited document, the citing work, the frequency with which a work is cited, and the type of citing article predict closer relatedness between citing and cited works.

The authors of [4], [6], and references therein discuss several measures of similarity and relatedness, such as the Pearson correlation, and conclude that the cosine index performs best. In this preliminary work, we first use a standard similarity index, the cosine similarity score, to establish similarity between two pieces of text, and then we use manual judgment to understand what types of citing and cited texts closely match each other semantically.

This paper is organized as follows. In Section 3 we describe the methodology we adopt for the empirical analysis of semantic relatedness. In Section 4 we describe the experimental setup for this task, followed by a discussion of the evaluations, and conclusions.

3 Methodology

In this study we want to investigate the degree to which automated methods can reflect, match, or even predict human judgments, and to understand the semantic relationship between the original article and the cited text. The automated system calculates the cosine similarity between all sentence pairs, which is then compared with the Subject Matter Experts' (SMEs') relevancy judgments. The idea is to test whether the semantic similarity of two sentences can ascertain the relevance relationship between the citing and the cited text.

Data Preprocessing
For data preprocessing, we used the stop word list in [7] to remove stop words before further processing. For stemming, we used a Java implementation of the Porter stemming algorithm [8]. The motivation for stemming is that without it, the tf-idf counts yield false results.

Term frequency as such is not sufficient for our goal of predicting an article's relevancy or establishing similarity between two pieces of text. Using the inverse document frequency lowers the weight of common terms. The tf-idf produces a weight for each term, balancing how often the term appears in an individual document against how many documents use the term. In this model, a common, frequent term is weighted lightly and an unfamiliar or rare term is weighted heavily, which identifies discriminative terms. Mathematically, the tf-idf weight is calculated using the standard formula:

w(i, j) = tf(i, j) * log(N / df(i))

where i is the term, j is the document, N is the total number of documents, tf(i, j) is the frequency of term i in document j, and df(i) is the number of documents containing term i.

Normalization
Term frequencies can be influenced by differences in article length: a frequent term in a long article will skew the results, and in a short article a term repeated a few times may likewise be misleading. To mitigate the effect of article length on term frequencies, we normalize the term weights for each article. The normalization of the term weights is expressed mathematically as:

norm(D) = sqrt( Σ_j w(j)^2 )

where the sum runs over the weights w(j) of the terms j in document D.

Cosine Similarity Score
To compute the similarity of a pair of compared items, the cosine similarity gives a numerical value that describes how close the two items are to each other. A set of cosine similarity scores creates a natural ordering of comparisons, in which the highest values are the most similar and the lowest values the least similar.

Fig. 1. Computing the similarity score

The cosine similarity score computes a value, adjusted for article length, that depicts the similarity of each sentence pair based on the weights of shared terms. Mathematically, it is represented as follows:

Cosine(D1, D2) = Σ_j ( w_D1(j) * w_D2(j) ) / ( norm(D1) * norm(D2) )
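The following is a minimal sketch of these three computations, assuming the input sentences have already been tokenized, stop-word-filtered, and stemmed as described above (the function names are ours, not the paper's):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """w(i, j) = tf(i, j) * log(N / df(i)) for every term i in document j.
    docs is a list of token lists (stop words removed, stemmed)."""
    n_docs = len(docs)
    df = Counter()  # df(i): number of documents containing term i
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf(i, j): raw count of term i in document j
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights

def norm(w):
    """norm(D) = sqrt(sum of squared term weights): the length normalizer."""
    return math.sqrt(sum(v * v for v in w.values()))

def cosine(w1, w2):
    """Cosine(D1, D2): sum of products of shared-term weights divided by
    the product of the two norms; 0.0 when either vector is empty."""
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    denom = norm(w1) * norm(w2)
    return dot / denom if denom else 0.0
```

With these helpers, the pipeline reduces to computing the weights once per document set and scoring every citing/cited sentence pair, which is the algorithm summarized next.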
To summarize, the algorithm we implemented for this proof-of-concept system is as follows.

Step 1: Term frequencies and inverse document frequencies are calculated for each individual stemmed term.
Step 2: The term frequencies are combined into a tf-idf score.
Step 3: The tf-idf score is normalized to account for the varying lengths of sentences.
Step 4: The normalized weights are used to calculate the cosine similarity between each citing sentence and every sentence in the cited article.
Step 5: The similarity scores are compared with manual assessments of whether the paired sentences from the citing and cited articles cite or support one another.

3.1 Data

We wrote a tool to extract data from NLM/NCBI [9]. The NLM index includes the full title of each journal as well as each journal's accepted abbreviations, making it possible to disambiguate and group the varied forms of each journal title under the same identifier. Each article is indexed and has a unique identifying number, the PubMed ID (PMID). The NLM offers a Batch Citation Matcher at www.ncbi.nlm.nih.gov/entrez/getids.cgi, which provides the PMID for each known citation. Here is a snapshot of the NCBI Batch Citation Matcher.

Fig. 2. NCBI Batch Citation Matcher (https://www.ncbi.nlm.nih.gov/pubmed/batchcitmatch)

Using this interface at NCBI, we can submit extracted citations in batches of fifty thousand to one hundred thousand at a time and load the responses from the NLM back into our database. This allows us to link each article to its PMID using the title, date, journal, and other fields of the citation. The assumption is that the full text of each article includes the list of citations at the end of the article, together with tags within the text that link each citation to the citing sentence. A citation in the text of an article is marked with a number, and the corresponding number in the reference section contains the full details of the citation.

4 Experiment

We extracted 50 journal articles from PubMed. For each citation in an article, the tool retrieved the corresponding paper. The tool then extracted the two sentences before and the two sentences after the citation in the original document and matched the words in those sentences against the target document using the cosine similarity metric. This process generated a cosine similarity index for each citation in the original document. Once we had the cosine similarity measurements, we selected the pairs (citation context in the original document and the relevant parts of the cited document) that scored higher than 0.90. Three subject matter experts (SMEs, clinical experts in this case) then manually evaluated the citing sentences and the cited documents and decided which of the correlated documents matched best. The human experts based their judgment mainly on semantic matching of the sentences in the two documents, not merely on matched strings.
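A sketch of this extraction and pairing step, assuming sentence splitting and citation-marker detection have already been done and that score_fn wraps a cosine computation such as the one sketched in Section 3 (all names here are our own, hypothetical ones):

```python
def citation_contexts(sentences, citation_indices, window=2):
    """Return the citing context for each citation: the sentence containing
    the citation marker plus two sentences before and two after."""
    contexts = []
    for i in citation_indices:
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        contexts.append(" ".join(sentences[lo:hi]))
    return contexts

def high_scoring_pairs(contexts, cited_sentences, score_fn, threshold=0.90):
    """Score every (citing context, cited sentence) pair with score_fn and
    keep the pairs above the cosine-similarity threshold, highest first."""
    pairs = []
    for ctx in contexts:
        for sent in cited_sentences:
            score = score_fn(ctx, sent)
            if score > threshold:
                pairs.append((score, ctx, sent))
    pairs.sort(reverse=True)
    return pairs
```

The pairs surviving the 0.90 threshold are the ones handed to the SMEs for manual judgment.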
4.1 Results

For the manual evaluations, the three SMEs examined 100 matched pairs that scored higher than 0.90 on cosine similarity. The SMEs rated their assessments on a scale of 1-100, with 100 being the best match. For example, SME-1 found that, out of the 100 paired texts, 62 discussed the same concept. In this preliminary study we did not analyze the disagreements among the SMEs. Table 1 summarizes the evaluation.

Table 1. SMEs' evaluations of 100 paired texts

SME      Semantic relatedness (> 0.90 cosine score)
SME-1    62
SME-2    67
SME-3    64

5 Conclusions

In this proof-of-concept study we analyzed the textual similarity between the citation text in original research papers from PubMed and the corresponding text in the cited documents, to understand how closely the citing text reflects the cited paper. We first used the cosine similarity measure to produce a paired list of citation text and cited text, and then examined 100 such pairs with a cosine similarity score above 0.90. The system recorded an average accuracy of 64.33% based on the evaluations of the three SMEs. For future work, we plan to extend the similarity metrics using the WordNet synset hierarchy, distributional similarity, and the Latent Semantic Analysis (LSA) index.

References

1. Bonzi, S.: Characteristics of a literature as predictors of relatedness between cited and citing works. Journal of the American Society for Information Science 33(4), 208-216 (1982)
2. Boyack, K.W., Small, H., Klavans, R.: Improving the accuracy of co-citation clustering using full text. In: Proceedings of the 17th International Conference on Science and Technology Indicators (2012)
3. Corley, C., Mihalcea, R.: Measuring the semantic similarity of texts. In: Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pp. 13-18 (2005)
4. Klavans, R., Boyack, K.W.: Identifying a better measure of relatedness for mapping science. Journal of the American Society for Information Science and Technology 57(2), 251-263 (2006)
5. Madylova, A., Oguducu, S.G.: A taxonomy based semantic similarity of documents using the cosine measure. In: Proceedings of the International Symposium on Computer and Information Sciences, pp. 129-134 (2009)
6. van Eck, N.J., Waltman, L.: Appropriate similarity measures for author co-citation analysis. Journal of the American Society for Information Science and Technology 59(10), 1653-1661 (2008)
7. https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.stopwords/
8. http://www.nltk.org/howto/stem.html
9. https://www.ncbi.nlm.nih.gov/pubmed/batchcitmatch