LaSTUS-TALN+INCO @ CL-SciSumm 2019

Luis Chiruzzo1 [0000-0002-1697-4614], Ahmed AbuRa'ed2 [0000-0002-6241-7755], Alex Bravo2, and Horacio Saggion2 [0000-0003-0016-7807]

1 Universidad de la República, Facultad de Ingeniería, INCO, Montevideo, Uruguay
luischir@fing.edu.uy
2 Universitat Pompeu Fabra, DTIC, LaSTUS-TALN, C/Tanger 122, Barcelona (08018), Spain
first.last@upf.edu

Abstract. In this paper we present several systems developed to participate in the 4th Computational Linguistics Scientific Document Summarization Shared Task, which addresses the problem of summarizing a scientific paper using information from its citation network (i.e., the papers that cite the given paper). Given a cluster of scientific documents where one is a reference paper (RP) and the remaining documents are papers citing the reference, two tasks are proposed: (i) to identify which sentences in the reference paper are being cited and why they are cited, and (ii) to produce a citation-based summary of the reference paper using the information in the cluster. Our systems are based on both supervised (LSTM and convolutional neural networks) and unsupervised techniques using word embedding representations and features computed from the linguistic and semantic analysis of the documents.

Keywords: Citation-based Summarization · Scientific Document Analysis · Convolutional Neural Networks · Text-similarity Measures.

1 Introduction

Although scientific summarization has always been an important research topic in the area of natural language processing (NLP) [13, 19, 24, 25], in recent years new summarization approaches have emerged which take advantage of the citations that a scientific article has received in order to extract and summarize its main contributions [20, 21, 1].

The interest in the area has motivated the development of a series of evaluation exercises in scientific summarization in the Computational Linguistics (CL) domain, known as the Computational Linguistics Scientific Document Summarization Shared Task, which started in 2014 as a pilot [9] and is now a well-developed challenge in its fourth year [7, 8].

In this challenge, given a cluster of n documents where one is a reference paper (RP) and the n − 1 remaining documents are citing papers (CPs), i.e., papers containing citations to the reference paper, participants have to develop automatic procedures to address the following tasks:

– Task 1A: For each citance in the citing papers (i.e., text spans containing a citation), identify the cited spans of text in the reference paper that most accurately reflect the citance.
– Task 1B: For each cited text span, identify which discourse facet it belongs to, among: Aim, Hypothesis, Implication, Results, or Method.
– Task 2: Finally, an optional task consists of generating a structured summary of the reference paper with up to 250 words from the cited text spans.
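To make this setting concrete, the sketch below shows one possible way to represent a cluster and the Task 1 decisions. It is an illustrative Python sketch only, not challenge or system code, and all class and field names are our own:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Citance:
    """A text span in a citing paper (CP) that contains a citation to the reference paper."""
    cp_id: str
    text: str

@dataclass
class Cluster:
    """A CL-SciSumm cluster: one reference paper (RP) plus the citances collected from its CPs."""
    rp_id: str
    rp_sentences: List[str]
    citances: List[Citance]

@dataclass
class Task1Prediction:
    """Task 1A: RP sentence indices cited by each citance; Task 1B: a discourse facet per citance."""
    cited_sentences: Dict[int, List[int]] = field(default_factory=dict)
    facets: Dict[int, str] = field(default_factory=dict)  # e.g. "Method", "Results", "Aim"
```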
In this paper we report the systems developed at LaSTUS-TALN+INCO to participate in CL-SciSumm 2019 [6]. We include a supervised system based on recurrent neural networks and an unsupervised system based on sentence similarity for Task 1A, one supervised approach for Task 1B, and one supervised approach for Task 2.

Except for the recurrent neural network method, the systems for Tasks 1A and 1B follow approaches similar to the ones reported in [4] and [2], which achieved good performance in previous editions of the task. The approach for Task 2 follows the method described in [2] which, according to the official results [10][14], was the winning approach in CL-SciSumm 2018.

2 Task 1

We tried a supervised and an unsupervised approach for Task 1A. We split the CL-SciSumm 2018 corpus of documents into 75% for training and 25% for development evaluation. We also used the 978 documents from ScisummNet 2019, automatically annotated following [18], to pre-train our neural network models.

2.1 Supervised approach

Our supervised approach consists of a neural network architecture that determines which sentences from the reference document are the most likely candidates for being referenced by a given citation.

Network architecture. The neural networks have the following structure:

– Input layer - Two sentences: the citation text and a sentence from the reference document.
– Embeddings layer - We tried two collections of embeddings: 300-dimensional Google News vectors (https://code.google.com/archive/p/word2vec/) and 300-dimensional BabelNet vectors [5][16].
– LSTM layers - One, two or three stacked bidirectional LSTM layers.
– Dense layer - One fully connected layer.
– Output layer - One unit indicating the probability that the sentence from the reference document corresponds to the citation.

We carried out different experiments using word embeddings or BabelNet synset embeddings; the tokens in the input layer were words or synsets depending on the experiment. We stacked up to three LSTM layers followed by a dense layer, with layer sizes of 150, 300, or 450. In all of our experiments we aimed to optimize against our development set, which contains 25% of the CL-SciSumm 2018 training set.

Pre-training and Training. We separated the training of the models in two stages: pre-training and training. The 978 clusters of documents from the Yale corpus were used to pre-train the LSTM models. During pre-training, we trained the models using 70% of the Yale corpus, optimizing against the remaining 30% with early stopping.

After this pre-training phase was over, we trained the resulting model using our CL-SciSumm 2018 training partition. We found that, in general, pre-training with the Yale corpus and then training with the CL-SciSumm partition achieved better results than training with CL-SciSumm alone, even though the Yale data was automatically annotated. For the training stage, we used early stopping, optimizing against 20% of our training corpus.

2.2 Unsupervised approach

As in previous editions [2][4][3], we used an unsupervised approach that compares all the sentences in a reference document with a citation and returns the most similar ones according to a given metric. In this case, we transformed all sentences and citations into BabelNet synsets and took the centroid of the synset embeddings as a way of creating a sentence embedding. Then we used cosine similarity to find the most suitable candidate sentences.
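The following is a minimal sketch of this centroid-and-cosine ranking, assuming sentences have already been mapped to synset identifiers and that a lookup table of 300-dimensional synset vectors is available; the function and variable names are ours, not taken from our implementation:

```python
import numpy as np

def sentence_centroid(synsets, synset_vectors, dim=300):
    """Average the vectors of the synsets appearing in a sentence (zero vector if none are covered)."""
    vecs = [synset_vectors[s] for s in synsets if s in synset_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    """Cosine similarity, guarding against zero vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

def rank_candidates(citance_synsets, rp_sentence_synsets, synset_vectors, top_n=5):
    """Return the indices of the top_n reference-paper sentences most similar to the citance."""
    citance_vec = sentence_centroid(citance_synsets, synset_vectors)
    scores = [cosine(citance_vec, sentence_centroid(sent, synset_vectors))
              for sent in rp_sentence_synsets]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n]
```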
2.3 Voting System

We submitted a voting system which considers sentences picked by two or more of the previously mentioned systems for Task 1.

2.4 Development results

Table 1 shows the results over the development corpus for the different experiments we tried. In general, the neural networks performed worse over the development corpus than the simpler unsupervised method. The networks trained using Google News vectors achieved better results than the ones trained using BabelNet vectors. Notice that the number of returned sentences (top n) that yielded the best development results was different in each case.

Table 1. Results for Task 1A over the development set.

Model            Layers  Size  Top n  Precision  Recall   F-1
BabelNet Cosine  -       -     5      6.08%      21.33%   9.46%
Google News      1       150   2      6.52%      8.86%    7.51%
Google News      2       150   3      5.80%      11.81%   7.78%
Google News      3       150   3      4.50%      12.24%   6.58%
Google News      1       300   2      5.59%      7.59%    6.44%
Google News      2       300   2      6.21%      8.44%    7.16%
BabelNet         1       150   9      2.67%      16.89%   4.61%
BabelNet         3       150   4      3.48%      9.78%    5.13%
BabelNet         1       300   20     1.39%      19.56%   2.60%
BabelNet         2       300   6      2.42%      10.22%   3.92%
BabelNet         3       300   10     1.96%      13.78%   3.43%

3 Task 2

In this section, we describe our extractive text summarization approach based on convolutional neural networks, which extends our previous work on trainable summarization [23, 4]. The network generates a summary by selecting the most relevant sentences from the RP using linguistic and semantic features from the RP and CPs. The aim of our CNN is to learn the relation between a sentence and a scoring value indicating its relevance.

3.1 Context Features

In order to extract the linguistic information from both sources (RP and CPs), we developed a feature extraction method to characterize each sentence in the RP and its relation with the corresponding CPs. We extracted a set of numeric features, some of which are based on comparing a sentence to its (document or cluster) context:

– Sentence Abstract Similarity Scores: the similarity of a sentence vector to the author abstract vectors (three features).
– Sentence Centroid Similarity Scores: the similarity of a sentence vector to the article centroid (three features).
– First Sentence Similarity Scores: the similarity of a sentence vector to the vector of the first sentence, that is, the title of the RP (three features).
– Position Score: a score representing the position of the sentence in the article. Sentences at the beginning of the article have high scores and sentences at the end of the article have low scores.
– Position in Section Score: a score representing the position of the sentence's section within the article. Sentences in the first section get higher scores, sentences in the last section get lower scores.
– Position in a Specific Section Score: a score representing the position of the sentence within its section. Sentences at the beginning of the section get higher scores and sentences at the end of the section get lower scores.
– TextRank Normalized Scores: a sentence vector is used to obtain a normalized score with the TextRank algorithm [15] (three features).
– Term Frequency Score: we sum up the tf*idf values of all words in the sentence. The obtained value is then normalized using the set of scores from the whole article.
– Citation Marker Score: the ratio of the number of citation markers in the sentence to the total number of citation markers in the article.
– Rhetorical Class Probability Scores: the probability of a sentence being in one of five possible rhetorical categories, calculated by the Dr. Inventor framework [22].
– Citing Paper Maximum Similarity Scores: each RP sentence vector is compared to each citation vector in each CP to get the maximum possible cosine similarity (three features).
– Citing Paper Minimum Similarity Scores: each RP sentence vector is compared to each citation vector in each CP to get the minimum possible cosine similarity (three features).
– Citing Paper Average Similarity Scores: each RP sentence vector is compared to each citation vector and the average cosine value is obtained (three features).
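As an illustration of the simpler position- and count-based features above, the following sketch shows how they could be computed. It is not our extraction code; the citation-marker pattern and all names are ours:

```python
import re

# Hypothetical pattern for citation markers such as "[12]" or "(Author et al. 2008)".
CITATION_MARKER = re.compile(r"\[\d+\]|\([A-Z][A-Za-z-]+(?: et al\.)?,? (?:19|20)\d{2}\)")

def position_score(sentence_index, num_sentences):
    """Sentences near the beginning of the article score close to 1, near the end close to 0."""
    return 1.0 - sentence_index / max(num_sentences - 1, 1)

def citation_marker_score(sentence, article_sentences):
    """Ratio of citation markers in the sentence to citation markers in the whole article."""
    total = sum(len(CITATION_MARKER.findall(s)) for s in article_sentences)
    return len(CITATION_MARKER.findall(sentence)) / total if total else 0.0

def term_frequency_score(sentence_tokens, idf):
    """Sum of tf*idf over the sentence tokens (the article-level normalization is omitted here)."""
    counts = {}
    for tok in sentence_tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return sum(c * idf.get(tok, 0.0) for tok, c in counts.items())
```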
3.2 Scoring Values

As mentioned above, our CNN learns the relation between features and a score, that is, a regression task. We devised various scoring functions to represent the likelihood of a sentence belonging to a summary (abstract, community or human). The nomenclature followed to symbolize a scoring function is SC_Sum, where SC is the specific scoring function (indicated below) and Sum is any summary type: abstract (Abs), community (Com) or human (Hum). The scoring functions are defined below:

– Cosine Distance: we calculated the maximum cosine similarity between each sentence vector in the RP and each vector in the gold standard summaries. This method produced three scoring functions (SUMMA (SU_Sum), ACL (ACL_Sum), and Google (Go_Sum)) for each summary type.
– ROUGE-2 Similarity: we also calculated similarities based on the overlap of bigrams between sentences in the RP and the gold standard summaries. In this regard, each sentence in the RP is compared with each gold standard summary using ROUGE-2 [12]. The precision value from this comparison is taken as the scoring function and is symbolized as R2_Sum.
– Scoring Functions Average: Moreover, we computed the average of all scoring functions (SUMMA, ACL, Google and ROUGE-2) for each summary type. In addition, we also calculated a simplified average using only the vectors not based on word frequencies (ACL, Google and ROUGE-2). These scoring functions are indicated as Av_Sum and SAv_Sum, respectively.

Finally, these computations produced eighteen different functions to learn: SUMMA (SU), ACL (ACL) and Google (Go) vectors, ROUGE-2 (R2), Average (Av) and Simplified Average (SAv), each combined with the abstract (Abs), community (Com) and human (Hum) summaries.

3.3 Convolution Model

Regarding the neural network hyperparameters, the CNN was defined with the Adadelta updater [26] and the gradients were computed using back-propagation, as in Kim [11] and Nguyen and Grishman [17]. We also used the sigmoid activation function, a dropout rate of 0.5, and an L2 constraint of 3. For the convolutions, we applied 3 filter window sizes (3, 4 and 5) to the context features and 4 filter window sizes (2, 3, 4 and 5) to the word embeddings. For each window size, 150 convolution filters were applied. Finally, for learning the regression task we used Mean Squared Error (MSE) as the loss function.
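To make the architecture more concrete, below is a minimal Keras sketch of such a CNN regressor. It is not our exact implementation: for simplicity, the context features are concatenated before the output layer rather than convolved, the input shapes and feature count are placeholders, and the ReLU activation inside the convolutions is our choice. The L2 constraint of 3 is rendered as a max-norm kernel constraint, in the spirit of Kim [11]:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

def build_cnn_regressor(seq_len=100, emb_dim=300, n_context_features=31,
                        window_sizes=(2, 3, 4, 5), n_filters=150):
    """CNN mapping one RP sentence (word embeddings + context features) to a relevance score."""
    words = layers.Input(shape=(seq_len, emb_dim), name="word_embeddings")
    feats = layers.Input(shape=(n_context_features,), name="context_features")

    # Parallel convolutions with different window sizes, each followed by max pooling.
    pooled = []
    for w in window_sizes:
        conv = layers.Conv1D(filters=n_filters, kernel_size=w, activation="relu",
                             kernel_constraint=tf.keras.constraints.MaxNorm(3))(words)
        pooled.append(layers.GlobalMaxPooling1D()(conv))

    merged = layers.Concatenate()(pooled + [feats])
    merged = layers.Dropout(0.5)(merged)
    score = layers.Dense(1, activation="sigmoid", name="relevance_score")(merged)

    model = models.Model(inputs=[words, feats], outputs=score)
    model.compile(optimizer=optimizers.Adadelta(), loss="mse")  # MSE for the regression task
    return model
```

The model is trained with one of the scoring functions described above as the regression target, one model per target function.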
4 Challenge Submissions

For Task 1, we sent the following four submissions:

– run1: LSTM trained with BabelNet vectors with three layers of size 150.
– run2: BabelNet centroids cosine similarity.
– run3: LSTM trained with Google News vectors with two layers of size 150.
– run4: Voting scheme based on [2].

For Task 2, the submissions we sent are the following:

– Similarity with the abstract from all similarity scores except SUMMA.
– Similarity with the abstract from all scores.
– ROUGE-based score similarity with the abstract.
– ACL cosine similarity based score with the abstract.

Finally, based on [2], we presented the results of a classifier that addresses Task 1B, i.e., identifying the discourse facet for each identified cited sentence.

5 Results

The performance of our systems for Task 1 over the test set is shown in Table 2. We can see that the LSTM approaches underperformed compared to their results over the development corpus; one possible cause is that the systems overfit the training and development data. Out of the methods we tried, the system that performs best for Task 1 is still the voting scheme based on [2].

Table 2. Test results for Task 1.

Run                      Task 1A Sentence Overlap (F1)  Task 1A ROUGE-SU4 (F1)  Task 1B (F1)
run4 Voting scheme       0.070                          0.025                   0.122
run2 BabelNet centroids  0.066                          0.026                   0.277
run3 Google News LSTM    0.031                          0.021                   0.078
run1 BabelNet LSTM       0.020                          0.015                   0.070

The performance of our systems for Task 2 over the test set is shown in Table 3.

Table 3. Test results for Task 2.

Run   Abstract R-2  Abstract R-SU4  Community R-2  Community R-SU4  Human R-2  Human R-SU4
run1  0.329         0.172           0.149          0.090            0.241      0.171
run2  0.316         0.167           0.169          0.101            0.245      0.169
run3  0.311         0.156           0.153          0.093            0.252      0.170
run4  0.246         0.147           0.131          0.084            0.170      0.141

6 Conclusion

We have described the systems developed to participate in Tasks 1A, 1B and 2 of the CL-SciSumm 2019 summarization challenge. For Task 1A – which aimed at identifying cited sentences – we implemented supervised and unsupervised methods. Our supervised systems are based on LSTM neural networks, while the unsupervised techniques take advantage of BabelNet synset embedding representations. We also included a system that uses a voting scheme based on several supervised and unsupervised approaches with different system configurations. Regarding Task 2 – summarization proper – we developed a neural network based on convolutions to learn a specific scoring function. The CNN model was fed with a combination of word embeddings and sentence relevance and citation features extracted from each document cluster (RP and CPs).

Acknowledgments

This work is (partly) supported by the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502).

References

1. Abu-Jbara, A., Ezra, J., Radev, D.R.: Purpose and polarity of citation: Towards NLP-based bibliometrics. In: HLT-NAACL. pp. 596–606 (2013)
2. AbuRa'ed, A., Bravo, A., Chiruzzo, L., Saggion, H.: LaSTUS/TALN+INCO @ CL-SciSumm 2018: Using regression and convolutions for cross-document semantic linking and summarization of scholarly literature. In: Proceedings of the 3rd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2018). Ann Arbor, Michigan (July 2018)
3. AbuRa'ed, A., Chiruzzo, L., Saggion, H.: What sentence are you referring to and why? Identifying cited sentences in scientific literature. In: RANLP 2017, International Conference Recent Advances in Natural Language Processing, Sep 2-8, Varna, Bulgaria. pp. 9–17. ACL (2017)
4. AbuRa'ed, A., Chiruzzo, L., Saggion, H., Accuosto, P., Bravo, À.: LaSTUS/TALN @ CLSciSumm-17: Cross-document sentence matching and scientific text summarization systems. In: Proceedings of the Computational Linguistics Scientific Summarization Shared Task (CL-SciSumm 2017), organized as a part of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017) and co-located with the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Tokyo, Japan, August 11, 2017. pp. 55–66 (2017)
5. Camacho-Collados, J., Pilehvar, M.T., Navigli, R.: NASARI: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence 240, 36–64 (2016)
6. Chandrasekaran, M., Radev, D., Freitag, D., Kan, M.Y.: Overview and Results: CL-SciSumm Shared Task 2019. In: Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) @ SIGIR 2019 (2019)
7. Jaidka, K., Chandrasekaran, M.K., Jain, D., Kan, M.Y.: The CL-SciSumm shared task 2017: Results and key insights. In: Proceedings of the Computational Linguistics Scientific Summarization Shared Task (CL-SciSumm 2017), organized as a part of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017) (2017)
8. Jaidka, K., Chandrasekaran, M.K., Rustagi, S., Kan, M.Y.: Insights from CL-SciSumm 2016: The faceted scientific document summarization shared task. International Journal on Digital Libraries pp. 1–9 (2017)
9. Jaidka, K., Chandrasekaran, M.K., Elizalde, B.F., Jha, R., Jones, C., Kan, M.Y., Khanna, A., Molla-Aliod, D., Radev, D.R., Ronzano, F., Saggion, H.: The computational linguistics summarization pilot task. In: Proceedings of TAC 2014 (2014)
10. Jaidka, K., Yasunaga, M., Chandrasekaran, M.K., Radev, D., Kan, M.Y.: The CL-SciSumm Shared Task 2018: Results and key insights. In: Proceedings of the 3rd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2018), co-located with the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018) (July 2018)
11. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
12. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. vol. 8. Barcelona, Spain (2004)
13. Luhn, H.P.: The automatic creation of literature abstracts. IBM J. Res. Dev. 2(2), 159–165 (Apr 1958)
14. Ma, S., Zhang, H., Xu, J., Zhang, C.: NJUST @ CLSciSumm-18. In: BIRNDL @ SIGIR. pp. 114–129 (2018)
15. Mihalcea, R., Tarau, P.: TextRank: Bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (2004)
16. Navigli, R., Ponzetto, S.P.: BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell. 193, 217–250 (Dec 2012)
17. Nguyen, T.H., Grishman, R.: Relation extraction: Perspective from convolutional neural networks. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. pp. 39–48 (2015)
18. Nomoto, T.: Resolving citation links with neural networks. Frontiers in Research Metrics and Analytics 3, 31 (2018)
19. Paice, C.D., Jones, P.A.: The identification of important concepts in highly structured technical papers. In: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 69–78. SIGIR '93, ACM, New York, NY, USA (1993)
20. Qazvinian, V., Radev, D.R.: Scientific paper summarization using citation summary networks. In: Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1. pp. 689–696. COLING '08, Association for Computational Linguistics, Stroudsburg, PA, USA (2008)
21. Qazvinian, V., Radev, D.R.: Identifying non-explicit citing sentences for citation-based summarization. In: ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden. pp. 555–564 (2010)
22. Ronzano, F., Saggion, H.: Dr. Inventor Framework: Extracting structured information from scientific publications. In: International Conference on Discovery Science. pp. 209–220. Springer (2015)
23. Saggion, H., AbuRa'ed, A., Ronzano, F.: Trainable citation-enhanced summarization of scientific articles. In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL), co-located with the Joint Conference on Digital Libraries 2016 (JCDL 2016), Newark, NJ, USA, June 23, 2016. pp. 175–186 (2016)
24. Saggion, H., Lapalme, G.: Concept identification and presentation in the context of technical text summarization. In: Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization. pp. 1–10. Association for Computational Linguistics, Stroudsburg, PA, USA (2000)
25. Saggion, H., Lapalme, G.: Generating indicative-informative summaries with SumUM. Comput. Linguist. 28(4), 497–526 (Dec 2002)
26. Zeiler, M.D.: Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)