=Paper=
{{Paper
|id=Vol-1610/paper19
|storemode=property
|title=NEAL: A Neurally Enhanced Approach to Linking Citation and Reference
|pdfUrl=https://ceur-ws.org/Vol-1610/paper19.pdf
|volume=Vol-1610
|authors=Tadashi Nomoto
|dblpUrl=https://dblp.org/rec/conf/jcdl/Nomoto16
}}
==NEAL: A Neurally Enhanced Approach to Linking Citation and Reference==
Tadashi Nomoto
National Institute of Japanese Literature / The Graduate University for Advanced Studies (SOKENDAI)
nomoto@acm.org

Abstract. As a way to tackle Task 1A in CL-SciSumm 2016, we introduce a composite model consisting of TFIDF and a Neural Network (NN), the latter being an adaptation of the embedding model originally proposed for the Q/A domain [2, 7]. We discuss an experiment using development data, its results, and some remaining issues.

1 Introduction

This paper provides an overview of our efforts to tackle Task 1A at CL-SciSumm 2016, whose stated goal is to locate the part of a reference paper (RP) most relevant to a given citation made by a citing paper (CP). To give an idea of what the task is about, consider Figure 1, in which you have a sentence that reads:

    On the other hand, researchers from the visualization community have designed a number of topic visualization techniques [9,16,17,18] ...

Fig. 1. An example of a citation in a scientific publication [3].

Your job is to find the passages in the relevant literature (what the authors cite as 9, 16, 17, and 18) that are most pertinent to the sentence in question. (Below and throughout the paper, we refer to a passage in a referred-to paper linked with a citation as a citation target, or simply a target.)

As a way to solve the task, we work with a hybrid of two models: one based on TFIDF and another on a single-layer Neural Network (NN). Formally, the present approach looks like the following:

$$\sigma(d, r) = \lambda\, h(d, r) + (1 - \lambda)\, t(d, r) \tag{1}$$

where h represents a neural network and t a TFIDF-based model; d is a citation instance and r a sentence in the RP. Here t is the cosine of TFIDF weight vectors:

$$t(d, r) = \frac{\mathbf{d} \cdot \mathbf{r}}{\|\mathbf{d}\|\, \|\mathbf{r}\|},$$

where **d** and **r** are vectors of TFIDF weights representing d and r, respectively.

For a given citation instance d, we rank every sentence r in the RP in accordance with σ (while dismissing those with two or fewer words) and select the two highest-ranked sentences as targets for d. We then remove redundancies in the output with an MMR-like measure: we take a candidate sentence off the output if its similarity with those preceding it exceeds a certain threshold (γ). Thus, the number of output sentences is further cut down to one in case they are found to contain redundancies. In the final run, we set γ to 0.24 and λ to 0.1. We call the current setup 'a neurally enhanced approach to linking citation and reference,' or NEAL for short.

Adding the TFIDF component to the NN in σ is meant to compensate for the latter's inability to handle exact word matches effectively, owing to the low dimensionality of the hidden layers into which word features are mapped [9]. One significant consequence of using an NN is that it relieves us of the drudgery of contriving every feature one needs to train a classifier on: the NN learns by itself whatever features it finds necessary to satisfy an objective function.

In what follows, we discuss the NN portion of σ, which is basically an adaptation of neural embedding models [1, 2, 7, 8] to the current task. We built the TFIDF part from statistics collected from the final test data that CL-SciSumm 2016 released.
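To make the pipeline concrete, the following is a minimal Python sketch of the ranking-and-pruning step, assuming scikit-learn's TfidfVectorizer for t(·,·). Here `neural_score`, standing in for h of Section 2, is a hypothetical callable, and cosine similarity over the same TFIDF space is assumed for the redundancy check; this is an illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_targets(citation, rp_sentences, neural_score, lam=0.1, gamma=0.24):
    """Rank RP sentences by sigma = lam*h + (1-lam)*t, then prune near-duplicates."""
    # Dismiss sentences with two or fewer words, as the paper does.
    candidates = [s for s in rp_sentences if len(s.split()) > 2]

    # t(d, r): cosine similarity over TFIDF weight vectors.
    vec = TfidfVectorizer().fit(candidates + [citation])
    d = vec.transform([citation])
    R = vec.transform(candidates)
    t = cosine_similarity(d, R).ravel()

    # h(d, r): the neural component of Section 2, a callable here.
    h = np.array([neural_score(citation, s) for s in candidates])

    sigma = lam * h + (1 - lam) * t
    ranked = [candidates[i] for i in np.argsort(-sigma)[:2]]

    # MMR-like pruning: drop a sentence whose similarity with those
    # preceding it in the output exceeds the threshold gamma.
    kept = []
    for s in ranked:
        sims = cosine_similarity(vec.transform([s]), vec.transform(kept)) if kept else [[0.0]]
        if np.max(sims) <= gamma:
            kept.append(s)
    return kept
```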
2 Predicting Similarity with Neural Network

The job of the NN is to provide a scoring function h that favors a true target over a false one: that is, to build a function that ensures that h(d, r+) > h(d, r−), where r+ denotes a true target (a sentence humans judged to be a target) and r− a false target (i.e., a sentence not selected as a target). We define h by:

$$h(d, r) = G(d)^\top F(r), \tag{2}$$

where G(d) denotes a vector derived from d and F(r) a vector derived from r, through word embedding. In order for d's similarity with its true target r+ to always be higher than that with a false target r− [2, 7], we require the following constraint to hold for G(d) and F(r):

$$\forall i, j \quad G(d_i)^\top F(r_j^+) > 0.1 + G(d_i)^\top F(r_j^-),$$

which is tantamount to:

$$\text{minimize:} \quad \left[\, 0.1 - G(d_i)^\top F(r_j^+) + G(d_i)^\top F(r_j^-) \,\right]_+,$$

where [x]+ is the positive part of x and '0.1' represents a margin we have taken from [2].

Figure 2 gives a general picture of how we arrive at G(d)⊤F(r). We start at the bottom, with inputs that represent d and r. We initially translate every word in a citation instance and a target into word indices ranging from 0 to 15,456, which are assembled into a binary vector, where the presence or absence of a word is marked with 1 or 0 at the index assigned to it: thus, having a 1 at the i-th unit means that the relevant input sentence contains the word indexed with i. We denote the binary vector so derived for d by ψ(d) and that for r by φ(r).

Fig. 2. Predicting Citation/Target Similarity Through Embedding

We project ψ(d) and φ(r) into two hidden layers, V and W, through word embedding. V is an N_v × K matrix (∈ R^{N_v × K}) and W an N_e × K matrix (∈ R^{N_e × K}), with N_v, N_e, and K indicating the lengths of ψ(d), φ(r), and a hidden layer, respectively. (In the test run, we set K = 30 and N_v = N_e = 15,457.) Now we let

$$G(d) = V^\top \psi(d), \qquad F(r) = W^\top \phi(r).$$

To determine the values in V and W, we launch an iterative training process. Suppose that we have training data consisting of triples, D = {(d_i, r_i^+, S_i^-)}_{i=1…m}, with d_i representing a citing instance, r_i^+ a true target, and S_i^- = {r_{i1}^-, …, r_{in}^-} indicating a set of false targets for d_i. For each (d_i, r_i^+, S_i^-) ∈ D, we do the following:

1. For each r_i^- ∈ S_i^-, perform stochastic gradient descent (SGD) to minimize: [0.1 − G(d_i)⊤F(r_i^+) + G(d_i)⊤F(r_i^-)]+
2. Ensure that the columns of W and V are all normalized.

We developed training data from the 'Development-Set-Apr8' dataset (henceforth, DSA) [5], which produced 4,608 training instances. We trained the model over 10 epochs, meaning that it went through 46,080 training instances. We performed SGD using an optimization algorithm known as AdaGrad-RDA, a regularized version of AdaGrad [4]. (We note that [6] developed a variant of LSTM to address essentially the same problem as discussed here, which could serve as a possible replacement for the embedding model the present approach employs.)
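For concreteness, here is a minimal numpy sketch of this training loop under stated assumptions: plain SGD with a fixed learning rate stands in for AdaGrad-RDA, the initialization scale and learning rate are arbitrary, and the data layout (binary bag-of-words vectors ψ(d), φ(r+), and a list of φ(r−)) is our own framing rather than the paper's code.

```python
import numpy as np

def train_embeddings(data, n_vocab=15457, k=30, margin=0.1, lr=0.1, epochs=10, seed=0):
    """Margin-based training of the embedding matrices V and W.

    `data` is a list of (psi_d, phi_pos, negatives) triples, each vector a
    binary bag-of-words of length n_vocab; `negatives` is a list of phi_neg
    vectors. Plain SGD replaces the AdaGrad-RDA optimizer of the paper.
    """
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(n_vocab, k))  # scale is an assumed choice
    W = rng.normal(scale=0.1, size=(n_vocab, k))

    for _ in range(epochs):
        for psi_d, phi_pos, negatives in data:
            g = V.T @ psi_d                      # G(d) = V^T psi(d)
            for phi_neg in negatives:
                f_pos, f_neg = W.T @ phi_pos, W.T @ phi_neg
                loss = margin - g @ f_pos + g @ f_neg
                if loss > 0:                     # hinge active: take a gradient step
                    V += lr * np.outer(psi_d, f_pos - f_neg)
                    W += lr * (np.outer(phi_pos, g) - np.outer(phi_neg, g))
                    g = V.T @ psi_d              # refresh G(d) after updating V
            # Step 2: keep the columns of V and W normalized.
            V /= np.maximum(np.linalg.norm(V, axis=0), 1e-12)
            W /= np.maximum(np.linalg.norm(W, axis=0), 1e-12)
    return V, W
```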
3 Evaluation

Prior to the actual run, we conducted an experiment using DSA to see how well NEAL works. The dataset comes with ten topic clusters, each of which consists of one reference paper and a number of papers that contain citations to that paper. Following a leave-one-out cross-validation scheme, we split DSA into two blocks, one containing nine topic (RP) clusters and the other containing one. We used the former for training NEAL (or h, to be precise) and tested it on the remaining block. Performance was measured by ROUGE-LCS, which produces the normalized length of the longest, possibly discontiguous, string of words shared by predicted and true targets. Figure 3 illustrates a citation and a corresponding target (made available by CL-SciSumm 2016 as part of the gold standard data). The area shaded in green represents a citation in the CP and the one in yellow a target in the RP. A citation and a target can span an arbitrary number of sentences.

Fig. 3. Citation and Target

Table 1. Dry-Run Test Set (DSA)

  RP        #CP   |D|    |E|    |T|
  C02-1025  18    4,142  8,284  23
  C08-1098  22    3,824  7,648  29
  C10-1045  13    3,621  7,242  33
  D10-1083  11    4,329  8,658  18
  E09-2008  10    4,526  9,052   8
  N04-1038  20    4,190  8,340  24
  P06-2124  12    4,367  8,734  18
  W04-0213  13    4,327  8,654  18
  W08-2222   9    4,519  9,038   9
  W95-0104  25    3,413  6,826  39

Some statistics on DSA are shown in Table 1. RP refers to a reference paper, #CP to the count of relevant citing papers, and |D| to the number of instances used for training. |E| indicates how many instances are processed over the entire span of epochs, and |T| the number of citation-target pairs we used to test NEAL. Term and document frequencies (to be used for t(·,·) in σ) were collected from DSA. K, N_v, and N_e, the parameters that define the shape of the NN, were set to 30, 15,457, and 15,457 (we also used these settings for the final run).

Fig. 4. Plot of Performance vs. λ for DSA: one panel per topic cluster, with average ROUGE-LCS on the vertical axis and λ on the horizontal axis. The title of each strip (e.g. C02-1025) represents a designator for a given reference paper, which has 9 to 25 citing papers (Table 1).

The test proceeded as follows. For a given pair c and t of citation and target, we rank each sentence r in the RP according to σ(c, r) and select the top one or two candidates as a possible target (call it g). We then determine ROUGE-LCS for g and its true target t, averaging scores over the entire set of citation-target pairs that belong to a particular topic cluster (a sketch of the metric follows below). Our computation of ROUGE-LCS, however, did not include tokens with fewer than 5 characters or more than 9 characters, as they were often found to be garbled and unintelligible. We also chose not to use stemming or to filter out stop words.

Figure 4 shows the per-cluster performance of NEAL. The horizontal axis denotes the value of λ and the vertical axis ROUGE-LCS scores. That λ affects the overall performance is clearly seen. Note that NEAL reduces to TFIDF at λ = 0 and turns into a full-fledged NN at λ = 1. Thus, if NEAL's performance peaks at λ > 0, NN-enabled NEAL performs better than TFIDF, or else is just as good as the latter. We observe in Figure 4 a general tendency for the performance to climb highest somewhere between 0 and 1, suggesting the superiority of NN over TFIDF, although there are notable exceptions at E09-2008 and P06-2124, where the score peaks at λ = 0.
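As a rough illustration of the metric (not the official ROUGE toolkit), the sketch below computes an LCS-based score with the 5-to-9-character token filter described above. The paper does not say whether precision, recall, or an F-measure serves as the 'normalized length,' so the standard ROUGE-L F-score is assumed here.

```python
def lcs_length(a, b):
    """Length of the longest common (possibly discontiguous) subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_lcs(predicted, reference, beta=1.0):
    """ROUGE-L style F-score over tokens of 5 to 9 characters, per the paper's filter."""
    keep = lambda text: [w for w in text.split() if 5 <= len(w) <= 9]
    p, r = keep(predicted), keep(reference)
    if not p or not r:
        return 0.0
    lcs = lcs_length(p, r)
    prec, rec = lcs / len(p), lcs / len(r)
    if prec + rec == 0:
        return 0.0
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)
```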
Table 2. Peak Performance (PP) and λ

  RP        PP      λ
  C02-1025  0.2457  0.6
  C08-1098  0.2040  0.2
  C10-1045  0.2507  0.6
  D10-1083  0.1329  0.6
  E09-2008  0.1904  0.0
  N04-1038  0.2710  0.5
  P06-2124  0.1845  0.0
  W04-0213  0.2494  0.2
  W08-2222  0.2109  0.3
  W95-0104  0.2977  0.2

Table 2 lists the values of λ at which we had peak performance for each of the topic clusters. 8 out of 10 clusters had their peak performance at λ > 0, demonstrating that enabling the NN generally leads to a gain in performance.

4 Final Remarks

We have presented what we call a 'neurally enhanced approach to linking citation and reference,' or NEAL, describing in some detail what machinery is involved and what we found in an experiment with the development data. The results appear to suggest a moderate impact of the neural network (NN) on the overall performance. But NEAL's performance against TFIDF is far from impressive. We suspect that its somewhat lackluster performance may have been caused by our inability to clearly demarcate true and false targets: some words appear in both true and false targets, which could easily derail the classifier.

Moreover, one could argue that the results of our experiment with DSA substantiate a concern that [9] expressed about NN's handling of word matches: at λ = 1, when the NN was decoupled from TFIDF completely, its performance plummeted.

As a way out, [9] suggests that we use the following instead of Equation (2):

$$h(d, r) = G(d)^\top F'(r, d)$$

F' is an F conditioned on d, in which we turn off all the words in φ(r) that are not found in ψ(d). What makes the idea interesting is that it points to the possibility of embedding t into h by slightly modifying the way we build φ(r) and ψ(d). While it is not clear at the moment how this would play out, we believe the idea is definitely worth a try, and something we would like to explore in future work.
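To make the masking idea concrete, here is a minimal numpy sketch assuming the binary vectors and embedding matrices of Section 2; it illustrates the proposal attributed to [9], not an implementation used in the reported runs.

```python
import numpy as np

def h_conditioned(psi_d, phi_r, V, W):
    """h(d, r) = G(d)^T F'(r, d): score r with words absent from d masked out.

    psi_d and phi_r are binary bag-of-words vectors; V and W are the
    trained embedding matrices of Section 2.
    """
    phi_masked = phi_r * psi_d          # turn off words of r not found in d
    g = V.T @ psi_d                     # G(d)
    f_prime = W.T @ phi_masked          # F'(r, d)
    return float(g @ f_prime)
```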
References

1. Bordes, A., Usunier, N., Weston, J., Yakhnenko, O.: Translating Embeddings for Modeling Multi-relational Data. In: Advances in Neural Information Processing Systems 26 (NIPS 2013), pp. 1–9 (2013)
2. Bordes, A., Weston, J., Usunier, N.: Open Question Answering with Weakly Supervised Embedding Models. In: ECML PKDD 2014 (2014)
3. Cui, W., Liu, S., Tan, L., Shi, C., Song, Y., Gao, Z., Tong, X., Qu, H.: TextFlow: Towards better understanding of evolving topics in text. IEEE Transactions on Visualization and Computer Graphics 17(12), 2412–2421 (2011)
4. Duchi, J., Hazan, E., Singer, Y.: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12, 2121–2159 (2011)
5. Jaidka, K., Chandrasekaran, M.K., Rustagi, S., Kan, M.Y.: Overview of the 2nd Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2016). In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016), Newark, New Jersey, USA (2016)
6. Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., Song, X., Ward, R.: Deep Sentence Embedding Using the Long Short-Term Memory Networks (2015), http://arxiv.org/abs/1502.06922
7. Weston, J., Bengio, S., Usunier, N.: Large scale image annotation: Learning to rank with joint word-image embeddings. Machine Learning 81, 21–35 (2010)
8. Weston, J., Bordes, A., Yakhnenko, O., Usunier, N.: Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction. In: Empirical Methods in Natural Language Processing, pp. 1366–1371 (2013), http://aclweb.org/anthology/D/D13/D13-1136.pdf
9. Weston, J., Chopra, S., Bordes, A.: Memory Networks. In: International Conference on Learning Representations, pp. 1–14 (2015), http://arxiv.org/abs/1410.3916