NJUST @ CLSciSumm-17

Shutian Ma1, Jin Xu1, Jie Wang1, Chengzhi Zhang1,2,*
1 Department of Information Management, Nanjing University of Science and Technology, Nanjing, China, 210094
2 Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University), Fuzhou, China, 350108
mashutian0608@hotmail.com, 1292050078@qq.com, 1342234559@qq.com, zhangcz@njust.edu.cn

Abstract. This paper introduces the NJUST system submitted to the CL-SciSumm 2017 Shared Task at the BIRNDL 2017 Workshop. The training corpus contains the 10 training articles, 10 development articles and 10 test articles of CL-SciSumm 2016. The articles were created by randomly sampling documents from the ACL Anthology corpus and selecting their citing papers. For Task 1A, we use several measures to compute sentence similarity, train four classifiers on different feature sets, and obtain the final results through a voting system. For Task 1B, we mainly apply rule-based methods built on high-frequency words. For Task 2, we generate a summary of at most 250 words from the cited text spans identified in the reference paper, using maximal marginal relevance.

1 Introduction

Scientific papers are often assessed through the citances in their citing papers, which reveal the extent to which a reference paper has been used by other researchers. So far, most work has focused on citation analysis, ranging from simple citation counts [1, 2, 3] to complex natural language processing of citation content [4]. However, citances alone do not provide context from the reference paper, for example the type of information cited or where it appears in the reference paper. To understand the different perspectives on a reference paper, it is therefore important to generate a summary from all of its cited text spans [5, 6, 7, 8].

The CL-SciSumm 2017 Shared Task2 targets automated summarization of scientific contributions in the computational linguistics research domain, which can help readers gain a gist of the state of the art on a topic. CL-SciSumm 2017 is divided into two tasks. In the first task, we identify the text spans in the reference paper that most accurately reflect each citance and determine which facet of the paper they belong to. The second task is to generate a summary of the reference paper from the identified cited text spans. In this paper, we describe the methods we applied to CL-SciSumm 2017. For Task 1A, we trained four classifiers and integrated their results through a voting system. For Task 1B, rule-based methods are applied to each identified text span to determine which facet it belongs to. For Task 2, we generate a summary using maximal marginal relevance.

* Corresponding Author
2 Available at: http://wing.comp.nus.edu.sg/~cl-scisumm2017/

2 Related Work

This year's CL-SciSumm 2017 takes place at the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017)3 and is a follow-up to the CL-SciSumm 2016 shared task4 [9]. Originally, the CL Summarization Pilot Task was conducted as part of the BiomedSumm Track at the Text Analysis Conference 2014 (TAC 2014)5 [10]. The task has been investigated by many previous systems [11, 12, 13, 14, 15, 16]. For Task 1A, most teams identified the linkage between a citation in the citing paper and the corresponding cited text spans in the reference paper by computing sentence similarities.
The CIST system applied two kinds of features, one drawn from lexicons and the other from sentence similarities [11]. Aggarwal and Sharma made use of word subsequence overlap [13]; in their study, bi-grams were matched between the generated bags of words to find the matching statements. PolyU [12] utilized TF-IDF cosine similarity, the position of the sentence chunk and some lexical rules. SVM and its variants were chosen as the classifier by many teams [11, 12, 15]. New models have also been proposed by combining new algorithms: Klampfl, Rexha and Kern proposed TextSentenceRank, inspired by graph-based ranking algorithms, for extracting candidate text spans [16], and Nomoto introduced a composite model consisting of TF-IDF and a neural network [14].

As for Task 1B, since the instances of the Implication and Hypothesis facets are very limited, some teams trained classification models only on the data of the other three facets [12]. Machine learning models such as decision trees [12], random forest classifiers [16] and SVMs [11] were applied for classification. Lexical rules are mainly applied to section headers or citance content [12, 13, 16]: researchers build a word list of semantically similar words for each facet and then check whether the section titles of the reference sentences or the cited sentences contain any of these facet words.

Few teams took part in Task 2, the generation of summaries. The CIST system calculated sentence scores from five features: an hLDA-level distribution feature, a sentence-length feature, a sentence-position feature, the cited text span and an RST feature. They also used the discourse facet to extract the best N sentences from all sentences or from each cluster [11]. PolyU [12] converted Task 2 into a query-focused multi-document summarization problem and used improved manifold ranking, modifying the prior score distribution to reflect the importance of citances.

3 Available at: http://wing.comp.nus.edu.sg/~birndl-sigir2017/
4 Available at: http://wing.comp.nus.edu.sg/cl-scisumm2016/
5 Available at: http://www.nist.gov/tac/2014

3 Methodology

3.1 Task Description

There are two tasks in CL-SciSumm 2017; the framework is shown in Figure 1. The training dataset contains 30 topics of documents. A topic consists of a Reference Paper (RP) and Citing Papers (CPs) that all contain citations to the RP. In each CP, the text spans (citances) that pertain to a particular citation to the RP have been identified. In Task 1A, for each citance, we need to identify the spans of text (cited text spans) in the RP that most accurately reflect the citance. In Task 1B, for each cited text span, we need to identify which facet of the paper it belongs to, out of five predefined facets: Aim, Method, Results, Implication and Hypothesis. In Task 2, we need to generate a structured summary of the RP from its cited text spans.

Fig. 1. Framework of Task 1A, Task 1B and Task 2 (Task 1A: identify the cited text spans in the RP; Task 1B: identify the facet of each cited text span; Task 2: generate a summary based on the cited text spans)

For evaluation, Task 1 is scored by the overlap of text spans, measured in number of sentences, between the system output and the gold standard created by human annotators. Task 2 is scored using the ROUGE family of metrics.

3.2 Task 1A

In this task, we are asked to identify the reference sentences referred to by a given citance.
We approach this problem from the perspective of finding the sentences in the RP that are most similar to the citance, and treat it as a classification task. To get better performance, we apply several classifiers and combine their results through a voting system. To train the models, three kinds of features are computed; short descriptions are given in Table 1.

Table 1. Three Kinds of Features Utilized in Task 1A

Feature Type               | Feature            | Feature Definition
Similarity-based features  | LDA similarity     | Cosine value between two sentence vectors trained by LDA
Similarity-based features  | Jaccard similarity | Size of the intersection divided by the size of the union of the words in the two sentences
Similarity-based features  | IDF similarity     | Sum of the IDF values of the words shared by the two sentences
Similarity-based features  | TF-IDF similarity  | Cosine value between the TF-IDF vectors of the two sentences
Similarity-based features  | Doc2Vec similarity | Cosine value between two sentence vectors trained by Doc2Vec
Rule-based features        | Bigram             | Bi-gram matching value: 1 if any bi-gram is matched, 0 otherwise
Position-based features    | Sid                | Sentence position in the full text
Position-based features    | Ssid               | Sentence position in the corresponding section
Position-based features    | Sentence Position  | Sentence position divided by the number of sentences
Position-based features    | Section Position   | Position of the corresponding section of the sentence chunk, divided by the number of sections
Position-based features    | Inner Position     | Sentence position in the section, divided by the number of sentences in the section

Based on the annotation files, we label matched sentence pairs with 1 and unmatched sentence pairs with 0.
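To make the similarity-based features in Table 1 concrete, the sketch below computes the Jaccard, IDF and TF-IDF similarities between a tokenized citance and a candidate RP sentence. This is a minimal illustration rather than our exact implementation; the whitespace tokenization and the toy idf dictionary are assumptions made only for the example.

```python
import math
from collections import Counter

def jaccard_sim(tokens_a, tokens_b):
    """Intersection over union of the word sets of two sentences."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def idf_sim(tokens_a, tokens_b, idf):
    """Sum of IDF values of the words shared by the two sentences."""
    return sum(idf.get(w, 0.0) for w in set(tokens_a) & set(tokens_b))

def tfidf_cosine(tokens_a, tokens_b, idf):
    """Cosine similarity between the TF-IDF vectors of two sentences."""
    va = {w: tf * idf.get(w, 0.0) for w, tf in Counter(tokens_a).items()}
    vb = {w: tf * idf.get(w, 0.0) for w, tf in Counter(tokens_b).items()}
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(x * x for x in va.values()))
            * math.sqrt(sum(x * x for x in vb.values())))
    return dot / norm if norm else 0.0

# Hypothetical usage: in practice the idf values would be estimated from the corpus.
citance = "we use maximal marginal relevance to rank sentences".split()
candidate = "sentences are ranked by maximal marginal relevance".split()
idf = {"maximal": 2.3, "marginal": 2.3, "relevance": 1.7, "sentences": 0.9}
print(jaccard_sim(citance, candidate), idf_sim(citance, candidate, idf))
```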
When training the classifiers, we first tried six different models: SVM (kernel=linear), SVM (kernel=rbf), SVM (kernel=sigmoid), decision tree, logistic regression and nearest neighbor. Different features were investigated on all datasets of CL-SciSumm 2017. According to the 10-fold cross-validation results, we removed SVM (kernel=sigmoid) and nearest neighbor and chose different features for the remaining classifiers. The average F1 values of all features for Task 1A on the training dataset are shown in Figure 2. To find good features, we trained the classifiers in 8 runs with different class ratios of the 0 and 1 labels; from Figure 2 (a) to Figure 2 (h), the class ratio of 0 to 1 is 1, 1.5, 2, 2.5, 3, 5, 7.5 and 10.

Fig. 2. Average F1 of All Features for Task 1A with Different Proportions of 0/1 Sample Size (panels (a)-(h) correspond to class ratios of 0 to 1 of 1, 1.5, 2, 2.5, 3, 5, 7.5 and 10; each panel compares SVM (RBF), SVM (linear), decision tree and logistic regression over the features sid, ssid, sent_position, sec_position, inner_position, lda_sim, jaccard_sim, tf_idf_sim, idf_sim, bigram and d2v_sim)

Based on these results, we find that the similarity-based features perform better than the others. We therefore keep all similarity-based features and the rule-based feature, and choose some of the position-based features as the final features. Moreover, we assign a different weight to each classifier when the results are integrated by the voting system. The parameter settings are shown in Table 2.

Table 2. Parameter Setting of Different Classifiers

Classifier          | Training features                                                                                                                      | Voting weight
SVM (kernel=linear) | LDA similarity, Jaccard similarity, TF-IDF similarity, IDF similarity, Doc2Vec similarity, Bigram, Ssid                                | 0.25
SVM (kernel=rbf)    | LDA similarity, Jaccard similarity, TF-IDF similarity, IDF similarity, Doc2Vec similarity, Bigram, sentence position, section position, inner position | 0.4
Decision Tree       | TF-IDF similarity, IDF similarity, Doc2Vec similarity, Bigram, Ssid, sentence position                                                 | 0.15
Logistic Regression | TF-IDF similarity, IDF similarity, Doc2Vec similarity, Bigram, Ssid, sentence position                                                 | 0.2

Due to the large gap between the numbers of 1 and 0 labels, we trained the classifiers in 5 runs with different proportions of 1 and 0 labels and also set a penalty factor. Furthermore, we set different thresholds for the voting system. The 0/1 label proportions and voting thresholds of the 5 runs are shown in Table 3. Finally, according to the requirements of Task 1A, we tune the obtained results: for each citance, if the identified text spans contain more than 5 sentences, we rank the sentences by Jaccard similarity in descending order and keep the top 5 as the final result; if no text span is identified, we rank the sentences by Jaccard similarity in descending order and take the top 1 sentence as the final result.

Table 3. Detailed Information of Running Settings

Running Settings | 0/1 sample size | Penalty Factor | Threshold
Run1             | 5.5             | 5.5            | 0.8
Run2             | 4.5             | 4.5            | 0.8
Run3             | 6.5             | 6.5            | 0.8
Run4             | 5.5             | 5.5            | 0.7
Run5             | 5.5             | 5.5            | 0.6
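The following sketch illustrates how the weighted voting could combine the four classifiers of Table 2 under a threshold from Table 3. It is a simplified illustration, assuming each classifier has already produced a 0/1 prediction for a citance-sentence pair; the classifier names and the example predictions are placeholders, not our released code.

```python
# Hedged sketch of the voting system; weights follow Table 2.
CLASSIFIER_WEIGHTS = {
    "svm_linear": 0.25,
    "svm_rbf": 0.40,
    "decision_tree": 0.15,
    "logistic_regression": 0.20,
}

def vote(predictions, threshold=0.8):
    """predictions maps classifier name -> 0/1 label for one citance-sentence pair.

    The pair is kept as a cited text span candidate if the weighted vote
    reaches the run-specific threshold (0.8, 0.7 or 0.6 in Table 3).
    """
    score = sum(CLASSIFIER_WEIGHTS[name] * label for name, label in predictions.items())
    return 1 if score >= threshold else 0

# Example: only the two SVMs predict a match -> weighted score 0.65 < 0.8 -> rejected (prints 0).
print(vote({"svm_linear": 1, "svm_rbf": 1, "decision_tree": 0, "logistic_regression": 0}))
```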
3.3 Task 1B

In this task, for each cited text span, we need to identify which facet of the paper it belongs to. We construct three dictionaries over the five facets: a Manual Dictionary, POS Dictionary-I and POS Dictionary-II. The first is built manually and the other two are built from part-of-speech tagging results. The facet identification strategy of Task 1B is shown in Figure 3.

For the manual dictionary, we looked through every identified text span of the five facets in all the annotation files of the datasets and built the dictionaries by judging each word manually within its sentence context. Two graduate students took part in this work.

For the POS dictionaries, we first ran the Stanford POS Tagger6 on the section titles and sentence contents in all the labeled annotation files. We keep the adjectives and verbs and list them by frequency for each of the five facets, separately for section titles and sentence contents. After removing words whose frequency is less than 2, the remaining words form the automatic dictionaries of section titles and sentence contents; this is POS Dictionary-I. Since many of these words are related to the Method facet, we built POS Dictionary-II by removing the method dictionary of section titles and sentence contents.

Based on these dictionaries of the five facets, if the section title or sentence content contains any word of a facet's dictionary, the sentence is classified as that facet. Because the manual dictionary is more accurate than the POS dictionaries, all facets identified with the manual dictionary are kept, which means one sentence can have more than one facet. When using the POS dictionaries, the facets are judged in the order Hypothesis, Aim, Implication, Method and Result, and a later identified facet overrides the former one. Finally, each sentence has five identified results: if one facet appears more than three times among them, we classify the sentence as that facet; otherwise, if the results contain more than three different facets, we classify the sentence as Method.

Fig. 3. Facet Identification Strategy of Task 1B (the citance sentence content and section title are matched against the Manual Dictionary, POS Dictionary-I and POS Dictionary-II for the facets Hypothesis, Aim, Implication, Method and Result; the five identified results are then combined into the final identified result)

6 Available at: https://nlp.stanford.edu/software/tagger.html
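As a rough illustration of the dictionary matching and the facet judging order described above, the sketch below assigns a facet to a sentence with a single POS-style dictionary; the facet word lists here are invented examples, not the dictionaries actually built from the annotation files.

```python
# Hedged sketch of POS-dictionary matching; word lists are illustrative placeholders.
POS_DICTIONARY = {
    "Hypothesis": {"assume", "hypothesize", "expect"},
    "Aim": {"aim", "goal", "investigate"},
    "Implication": {"suggest", "imply", "conclude"},
    "Method": {"propose", "train", "apply"},
    "Result": {"outperform", "improve", "achieve"},
}
JUDGING_ORDER = ["Hypothesis", "Aim", "Implication", "Method", "Result"]

def match_facet(tokens, dictionary=POS_DICTIONARY):
    """Return the facet whose dictionary words appear in the text; a facet
    identified later in the judging order overrides an earlier one (Figure 3)."""
    identified = None
    words = set(tokens)
    for facet in JUDGING_ORDER:
        if words & dictionary[facet]:
            identified = facet
    return identified

print(match_facet("we propose a model to improve parsing accuracy".split()))
# -> 'Result' (the earlier Method match is overridden)
```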
3.4 Task 2

Summary generation is divided into two main steps: first, sentences are grouped into clusters by bisecting K-means [17]; second, maximal marginal relevance (MMR) [18] is used to extract sentences from each cluster and combine them into a summary.

We first use the vector space model to represent the documents and then apply non-negative matrix factorization to reduce the representation to 50 dimensions. We then apply bisecting K-means, which builds on basic K-means and can be divided into four steps: 1. pick a cluster to split; 2. find 2 sub-clusters using the basic K-means algorithm; 3. repeat step 2, the bisecting step, a fixed number of times and take the split that produces the clustering with the highest overall similarity (for each cluster, its similarity is the average pairwise document similarity); 4. repeat steps 1, 2 and 3 until the desired number of clusters is reached. After obtaining the clusters, we list all clusters in descending order of cluster size. Then, the sentences within each cluster are listed in descending order of their MMR score.

The basic idea of MMR is straightforward [19]: given a set of items D, we want to recommend a subset S_k ⊂ D (where |S_k| = k and k ≪ |D|) relevant to a given query q. MMR builds S_k by selecting s_j^* given S_{j-1} = {s_1^*, ..., s_{j-1}^*} (where S_j = S_{j-1} ∪ {s_j^*}) according to the following criterion:

$s_j^* = \arg\max_{s_j \in D \setminus S_{j-1}} \left[ \lambda\, \mathrm{Sim}_1(s_j, q) - (1 - \lambda) \max_{s_i \in S_{j-1}} \mathrm{Sim}_2(s_j, s_i) \right]$   (1)

where Sim_1(·,·) measures the relevance between an item and the query, Sim_2(·,·) measures the similarity between two items, and the manually tuned λ ∈ [0, 1] trades off relevance against redundancy. For s_1^*, the second term disappears. Finally, in each pass we take the first two sentences from each cluster and add them to the summary until the length of the summary would exceed 250 words.
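A minimal sketch of the greedy selection in Eq. (1) is given below. The similarity functions are left abstract; in the usage example, Jaccard similarity stands in for both Sim_1 and Sim_2, and the value of λ is illustrative rather than the one used in our system.

```python
def mmr_select(candidates, query, sim1, sim2, k, lam=0.7):
    """Greedy MMR selection following Eq. (1): at each step pick the sentence
    that is relevant to the query but not redundant with those already selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(s):
            redundancy = max((sim2(s, t) for t in selected), default=0.0)
            return lam * sim1(s, query) - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Illustrative usage with Jaccard similarity standing in for Sim_1 and Sim_2.
jac = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(a.split()) | set(b.split()))
cluster = ["the model improves parsing accuracy",
           "parsing accuracy improves with the model",
           "we describe the annotation guidelines"]
print(mmr_select(cluster, "model accuracy", jac, jac, k=2))
```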
4 Experiments

4.1 Task 1A

For corpus preprocessing, we remove stop words and stem words to their base forms with the Porter stemmer7. We then use the Doc2Vec model in Gensim8 and a Python package9 for the LDA model to represent documents. All classifiers are implemented with the scikit-learn10 Python package. The source code of our system will be released on GitHub: https://github.com/KingChristenson/NJUST-CL.

For the classification experiments, we split the training dataset into two parts: the 10 training articles and 10 development articles of CL-SciSumm 2016 are used as the training set, and the 10 test articles of CL-SciSumm 2016 are used as the test set. We submitted five runs; the precision, recall and F1 values obtained on the test set are shown in Table 4.

7 Available at: http://tartarus.org/~martin/PorterStemmer/
8 Available at: http://radimrehurek.com/gensim/index.html
9 Available at: https://pypi.python.org/pypi/lda
10 Available at: http://scikit-learn.org/stable/index.html

Table 4. Task 1A Results of Training Dataset

Running Settings | P       | R       | F1
Run1             | 0.08804 | 0.09774 | 0.09264
Run2             | 0.08571 | 0.12782 | 0.10262
Run3             | 0.09016 | 0.08271 | 0.08627
Run4             | 0.08532 | 0.12531 | 0.10152
Run5             | 0.09470 | 0.12531 | 0.10787

We also plot Figure 4 (a) and Figure 4 (b) to see how the evaluation results change when the class ratio of 0 to 1 and the threshold of the voting system increase.

Fig. 4. Evaluation when Increasing the Class Ratio of 0 to 1 and the Threshold for the Voting System ((a) precision, recall and F1 as the class ratio of 0 to 1 increases; (b) precision, recall and F1 as the threshold increases; blue denotes precision, red denotes recall and green denotes F1)

From Figure 4 (a), we find that as the 0/1 sample size increases, the precision increases slowly but, according to the F1 value, the overall performance on Task 1A gets worse. The same happens when we increase the threshold. It is therefore important to choose proper parameters in such classification tasks, such as the 0/1 sample size and the threshold of the voting system.

4.2 Task 1B

We tried all the results from Task 1A and obtained the best performance through the voting system over the 5 runs. Table 5 shows our Task 1B results on the training data for the different facets.

Table 5. Task 1B Results of Training Dataset

Facet                | Precision | Recall  | F1
Aim Citation         | 0.16162   | 0.44444 | 0.23704
Implication Citation | 0.50000   | 0.23256 | 0.31746
Hypothesis Citation  | 0.50000   | 0.50000 | 0.50000
Method Citation      | 0.74026   | 0.91566 | 0.81867
Result Citation      | 0.39286   | 0.48889 | 0.43564

From Table 5, we find that the identification of Method citations performs best, since Method is also the most common facet among all citations. The Result, Aim and Implication facets perform poorly; the limited quality of the built dictionaries may explain these results. More features should be considered for this task, such as the sentence position or the section title position.

5 Conclusion and Future Work

This paper describes our participating system, NJUST, at CL-SciSumm 2017. Our system adds semantic information, such as document vectors and LDA topic distributions, to improve citance linkage and summarization performance. When choosing features, we find that TF-IDF similarity and IDF similarity perform better than the similarities based on Doc2Vec and LDA. To improve classification performance, several classifiers are trained with different features and the final results are obtained by a voting system. For Task 2, we use maximal marginal relevance to rank sentences for summary generation. According to the evaluation [20], we achieved the best performance in Task 1A and performed well in Task 1B, while our strategy for Task 2 did not work well, and more work can be done on all the tasks.

In future work, we need to find better ways to measure sentence similarity and apply machine learning models to Task 1B. For summarization, we will try to combine each sentence with its identified facet information to organize the sentence order. Furthermore, more features can be added to the sentence score used for ranking, such as sentence length and sentence position.

Acknowledgements

This work is supported by Major Projects of the National Social Science Fund (No. 16ZAD224), the Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (No. MJUKF201704) and the Qing Lan Project.

References

1. Garfield E, Merton R K. Citation indexing: Its theory and application in science, technology, and humanities [M]. New York: Wiley, 1979.
2. Meho L I, Yang K. Impact of data sources on citation counts and rankings of LIS faculty: Web of Science versus Scopus and Google Scholar [J]. Journal of the Association for Information Science and Technology, 2007, 58(13): 2105-2125.
3. Bornmann L, Daniel H D. What do citation counts measure? A review of studies on citing behavior [J]. Journal of Documentation, 2008, 64(1): 45-80.
4. Zhang G, Ding Y, Milojević S. Citation content analysis (CCA): A framework for syntactic and semantic analysis of citation content [J]. Journal of the Association for Information Science and Technology, 2013, 64(7): 1490-1503.
5. Jaidka K, Khoo C S G, Na J C, et al. Deconstructing human literature reviews - a framework for multi-document summarization [C]//ENLG. 2013: 125-135.
6. Nenkova A, McKeown K. Automatic summarization [J]. Foundations and Trends® in Information Retrieval, 2011, 5(2-3): 103-233.
7. Teufel S, Moens M. Summarizing scientific articles: experiments with relevance and rhetorical status [J]. Computational Linguistics, 2002, 28(4): 409-445.
8. Jones K S. Automatic summarising: The state of the art [J]. Information Processing & Management, 2007, 43(6): 1449-1481.
9. Jaidka K, Chandrasekaran M K, Rustagi S, et al. Overview of the CL-SciSumm 2016 Shared Task [C]//BIRNDL@JCDL. 2016: 93-102.
10. Jaidka K, Chandrasekaran M K, Elizalde B F, et al. The computational linguistics summarization pilot task [C]//Proceedings of the Text Analysis Conference, Gaithersburg, USA. 2014.
11. Li L, Mao L, Zhang Y, et al. CIST System for CL-SciSumm 2016 Shared Task [C]//BIRNDL@JCDL. 2016: 156-167.
12. Cao Z, Li W, Wu D. PolyU at CL-SciSumm 2016 [C]//BIRNDL@JCDL. 2016: 132-138.
13. Aggarwal P, Sharma R. Lexical and syntactic cues to identify reference scope of citance [C]//BIRNDL@JCDL. 2016: 103-112.
14. Nomoto T. NEAL: A neurally enhanced approach to linking citation and reference [C]//BIRNDL@JCDL. 2016: 168-174.
15. Moraes L, Baki S, Verma R M, et al. University of Houston at CL-SciSumm 2016: SVMs with tree kernels and sentence similarity [C]//BIRNDL@JCDL. 2016: 113-121.
16. Klampfl S, Rexha A, Kern R. Identifying referenced text in scientific publications by summarisation and classification techniques [C]//BIRNDL@JCDL. 2016: 122-131.
17. Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques [C]//KDD Workshop on Text Mining. 2000, 400(1): 525-526.
18. Carbonell J, Goldstein J. The use of MMR, diversity-based reranking for reordering documents and producing summaries [C]//Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1998: 335-336.
19. Guo S, Sanner S. Probabilistic latent maximal marginal relevance [C]//Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2010: 833-834.
20. Jaidka K, Chandrasekaran M K, Jain D, Kan M Y. Overview of the CL-SciSumm 2017 Shared Task [C]//Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017), Tokyo, Japan. CEUR, 2017.