NJUST @ CLSciSumm-18

Shutian Ma1, Heng Zhang1, Jin Xu1, Chengzhi Zhang1,2,*
1 Department of Information Management, Nanjing University of Science and Technology, Nanjing, China, 210094
2 Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University), Fuzhou, China, 350108
mashutian0608@hotmail.com, 525696532@qq.com, xujin@njust.edu.cn, zhangcz@njust.edu.cn

Abstract. This paper introduces the NJUST system submitted to the CL-SciSumm 2018 Shared Task at the BIRNDL 2018 Workshop. The training corpus contains 40 articles, created by randomly sampling documents from the ACL Anthology corpus and selecting their citing papers. CL-SciSumm 2018 comprises three tasks. Task 1A is to identify cited text spans in the reference paper; we train multiple classifiers and combine their results through a voting system, and we also submit results generated by single classifiers. Task 1B is to identify the facet of each cited text span; besides rule-based methods using a manually built dictionary and a POS dictionary, we apply supervised topic modeling and gradient boosted decision trees. For Task 2, after grouping the identified sentences according to their similarity to abstract sentences, we rank them with several features and generate a summary of at most 250 words by selecting the top-ranked ones.

Keywords: Cited Text Span Identification, Multi-classifiers, Voting System, Automatic Summarization, Scientific Summarization.

1 Introduction

Nowadays, the growing number of publications makes it hard for researchers to keep up with progress in their fields. To give readers a quick overview of papers, scientific summarization has attracted increasing attention. Since citation sentences (citances) usually provide useful information about the referenced papers, earlier research focused on citation-based summarization, aggregating all citances that cite one particular paper [3]. However, citation texts do not reveal enough detail, and the viewpoints of the citing authors can differ from one another depending on their citing purposes [4]. Recently, a number of shared tasks, such as the TAC 2014 Biomedical Summarization Track1 and the Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 20162, CL-SciSumm 20173 and CL-SciSumm 20184), have been proposed to build summaries from cited text spans, which differs from traditional citance-based methods. Since these summaries are built from the reference paper itself, they are expected to provide more reliable context information than citances. In this paper, we describe our system submitted to CL-SciSumm 2018. As shown in Figure 1, the shared task has two main parts: Task 1A identifies cited text spans in the reference paper, Task 1B identifies the facet of each cited text span, and Task 2 finally generates a summary based on the cited text spans.

* Corresponding Author.
1 Available at: https://tac.nist.gov/2014/BiomedSumm/index.html
2 Available at: http://wing.comp.nus.edu.sg/cl-scisumm2016/
3 Available at: http://wing.comp.nus.edu.sg/~cl-scisumm2017/
4 Available at: http://wing.comp.nus.edu.sg/~cl-scisumm2018/

Fig. 1. Framework of the CL-SciSumm Shared Task (Task 1A: identify the cited text span in the RP; Task 1B: identify the facet of the cited text span; Task 2: summary generation based on the cited text spans)

Below is the detailed definition of the tasks.

Given: A topic consisting of a Reference Paper (RP) and Citing Papers (CPs) that all contain citations to the RP. In each CP, the citances that pertain to a particular citation to the RP have been identified.

Task 1A: For each citance, identify the cited text span in the RP that most accurately reflects the citance.
These are of the granularity of a sentence fragment, a full sentence, or several consecutive sentences (no more than 5).

Task 1B: For each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets.

Task 2: Finally, generate a structured summary of the RP from the cited text spans of the RP. The length of the summary should not exceed 250 words.

In our previous work for CL-SciSumm 2017 [5], multiple classifiers were integrated through a weighted voting system to identify cited text spans. Building on that work, we optimized Task 1A in terms of feature selection, class-imbalanced data processing, voting weight allocation and parameter tuning [6]. For CL-SciSumm 2018 we follow a similar multi-classifier strategy for Task 1A, but add new data preprocessing steps, new features for the classifiers and new classifiers. For Task 1B, besides using the built dictionaries, we identify facets with supervised topic modeling and a classifier, and the final results combine these strategies. For Task 2, we first separate sentences according to their similarity to the abstract and then rank them over several features to select important ones for summary generation.

The rest of the paper is organized as follows. Section 2 provides a brief review of related work. Section 3 elaborates on our system this year. Experimental data and evaluation results on the training data are given in Section 4. Conclusions and directions for future research are outlined in Section 5.

2 Related Work

With millions of publications coming out every year [7], automatic scientific summarization has attracted attention because of people's demand for quick overviews. The Computational Linguistics Scientific Document Summarization Shared Task is the first annual medium-scale shared task on scientific summarization in which the summary is generated from identified cited text. This year, CL-SciSumm 2018 took place at the Joint BIRNDL 2018 Workshop5 with the same goal of exploring automated summarization of scientific contributions in the computational linguistics domain. Here, we review related work on the different tasks based on the systems submitted to CL-SciSumm 2016 and CL-SciSumm 2017 [8].

5 Available at: http://wing.comp.nus.edu.sg/~birndl-sigir2018/

Looking at the related work for Task 1A, most teams solved it by characterizing the linkage between a citance in the citing paper and its corresponding cited text spans in the reference paper [9]. Features are mostly built from character-based and semantic similarities. For example, in CL-SciSumm 2016, the CIST system applied lexical similarity and sentence similarity [10]. Aggarwal and Sharma [11] made use of subsequence overlap. PolyU utilized TF-IDF cosine similarity, the position of the sentence chunk and some lexical rules [12]. Other relevant features applied in CL-SciSumm 2017 are the longest common subsequence [13], character-level TF-IDF scores [14] and a modified Jaccard distance [15]. Deep learning methods for semantic measurement between sentences, such as a pairwise neural network ranking model [13] and popular word embedding models like Word2Vec and Doc2Vec [5], were also used. To find the most similar sentence pairs, SVM and its variants were chosen as the classifier by many teams [10, 12, 16].
Besides applying one single model [17, 18], nearly half of the teams applied weighted voting algorithms to integrate results [5, 13, 15].

As for Task 1B, the proportions of the different discourse facet types are very imbalanced, so most proposed methods are rule-based, relying on human-labeled dictionaries or heuristics. Aggarwal and Sharma [11] identified the facet based on the location of the cited text span; for example, a cited text span lying in the introduction section or at the beginning of the abstract indicates an aim citation. The CIST system took advantage of frequent words and combined them with subtitles to make judgments [10, 14]. Different classifiers have also been applied, such as a random forest classifier [19], SVM [14], SMO [20] and convolutional neural networks [17]. Besides position and similarity features, new features have been proposed, such as the Dr. Inventor sentence-related features and scientific gazetteer features in [15].

For Task 2, there are basically two main steps. The first is to cluster the identified text spans into groups. The second is to rank them with different features that reflect sentence importance. The CIST system calculated sentence scores over five features [10]. To control summary redundancy, they used determinantal point processes to enhance diversity [14]. Abura'ed et al. [15] proposed a modified version of their 2016 summarization system with additional features related to the reference paper and the citing papers.

3 Methodology

As mentioned in the introduction, the shared task consists of Task 1 (1A and 1B) and Task 2. The dataset comprises 40 annotated sets of reference papers and their citing papers drawn from open-access research papers in the computational linguistics domain. A topic consists of a Reference Paper (RP) and Citing Papers (CPs) that all contain citations to the RP. In each CP, the text spans (citances) that pertain to a particular citation to the RP have been identified.

3.1 Task 1A

We solve Task 1A by finding the sentences in the RP that are most similar to the citance. There are two main steps in our system: selecting suitable features for the classifiers, and integrating the final results through a weighted voting system. The details are given below.

Citation Text Preprocess. Since the training data is labeled by humans and may contain errors, we use two rules to expand the labeled citation text in advance, which enriches its semantic information. First, if the sentence immediately following the labeled citation text contains the same author name as the citation text (see the example from Paper [1]), we add this sentence to the citation text. Second, if the following sentence contains demonstrative pronouns (see the example from Paper [2]), we also add it to the citation text. We apply this preprocessing directly to the training and testing data. For the training data, 4,244 sentences are added to the original citation texts. A small sketch of these two rules is given after the examples in Fig. 2.

Paper [1]: Like others, we have assumed lexical semantic classes of verbs as defined in Levin (1993) (hereafter Levin), which have served as a gold standard in computational linguistics research (Dorr and Jones, 1996; Kipper et al., 2000; Merlo and Stevenson, 2001; Schulte im Walde and Brew, 2002). Levin's classes form a hierarchy of verb groupings with shared meaning and syntax.

Paper [2]: The system described in this paper is similar to the MENE system of (Borthwick, 1999). It uses a maximum entropy framework and classifies each word given its features.

Fig. 2. Examples when Utilizing Rules to Expand Labeled Citation Text
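The two expansion rules can be implemented with simple string checks. The sketch below is illustrative only: the function names, the pronoun list and the assumption that the cited author surnames are available as plain strings are ours, not part of the shared-task data format or of our exact implementation.

```python
# Illustrative sketch of the citation-text expansion rules (Section 3.1).
# Assumptions: the citance and its following sentence are plain strings, and the
# author surnames of the cited reference are known (e.g. parsed from the
# citation marker). Function and variable names are hypothetical.

DEMONSTRATIVE_PRONOUNS = {"this", "these", "that", "those", "it"}

def expand_citance(citance: str, next_sentence: str, author_names: list) -> str:
    """Return the citance, extended with the next sentence if either rule fires."""
    next_lower = next_sentence.lower()

    # Rule 1: the following sentence repeats an author name of the citation.
    mentions_author = any(name.lower() in next_lower for name in author_names)

    # Rule 2: the following sentence starts with a demonstrative pronoun,
    # suggesting it continues the description of the cited work.
    tokens = next_lower.split()
    has_demonstrative = bool(tokens) and tokens[0] in DEMONSTRATIVE_PRONOUNS

    if mentions_author or has_demonstrative:
        return citance + " " + next_sentence
    return citance

# Example (Paper [2] from Fig. 2): the second sentence starts with "It",
# so it is appended to the labeled citation text.
citance = "The system described in this paper is similar to the MENE system of (Borthwick, 1999)."
following = "It uses a maximum entropy framework and classifies each word given its features."
print(expand_citance(citance, following, ["Borthwick"]))
```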
Feature Selection. As in our CL-SciSumm 2017 system, we apply three kinds of features to capture the linkage between sentences in scientific papers: similarity-based, rule-based and position-based features. These features measure the linkage between citations and cited text. In our previous work [5], the bi-gram feature did not work well. To make it effective, we count the frequency of bi-grams in the training data and build a dictionary containing all bi-grams whose frequency is over 500; when the same bi-gram occurs in both the citation sentence and the reference sentence, we filter it against this dictionary. For sentence similarity features, we add WordNet similarity and Word2Vec similarity, computed as the average of the similarities of word pairs drawn from the two sentences. Table 1 gives short descriptions of the features used in this task, and a small code sketch of some of the similarity-based features follows the table.

Table 1. Three Kinds of Features Applied in Four Classifiers

Similarity-based features:
- LDA similarity: cosine value between two sentence vectors trained by LDA (topic number 20, 2000 iterations).
- Jaccard similarity: size of the intersection of the word sets of the two sentences divided by the size of their union.
- IDF similarity: sum of the IDF values of the words shared by the two sentences.
- TF-IDF similarity: cosine value between two sentence vectors represented by TF-IDF (sentence vectors are not normalized).
- Doc2Vec similarity: cosine value between two sentence vectors trained by Doc2Vec (vector dimension 200).
- WordNet similarity: average of word-pair similarities calculated via WordNet.
- Word2Vec similarity: average of word-pair similarities calculated via Word2Vec (vector dimension 300).

Rule-based features:
- Filtered bigram: bi-gram matching value after filtering; 1 if any bi-gram is shared between the two sentences, otherwise 0.

Position-based features:
- Sid: sentence position in the full text.
- Ssid: sentence position in the corresponding section.
- Sentence position: sentence position divided by the number of sentences.
- Section position: position of the corresponding section of the sentence chunk, divided by the number of sections.
- Inner position: sentence position in the section, divided by the number of sentences in the section.
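As an illustration of the similarity-based features in Table 1, the following minimal sketch computes the Jaccard, IDF and TF-IDF cosine similarities for a tokenized sentence pair. It assumes pre-tokenized, stop-word-filtered and stemmed input and a precomputed IDF dictionary; it is not the exact implementation used in our system.

```python
# Minimal sketch of three similarity-based features from Table 1.
import math
from collections import Counter

def jaccard_similarity(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def idf_similarity(tokens_a, tokens_b, idf):
    # Sum of the IDF values of the words shared by the two sentences.
    return sum(idf.get(w, 0.0) for w in set(tokens_a) & set(tokens_b))

def tfidf_cosine(tokens_a, tokens_b, idf):
    # Cosine between unnormalized TF-IDF vectors of the two sentences.
    vec_a = {w: tf * idf.get(w, 0.0) for w, tf in Counter(tokens_a).items()}
    vec_b = {w: tf * idf.get(w, 0.0) for w, tf in Counter(tokens_b).items()}
    dot = sum(vec_a[w] * vec_b.get(w, 0.0) for w in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy usage with a hypothetical IDF dictionary.
idf = {"verb": 2.1, "class": 1.7, "lexical": 2.5, "semantic": 2.0}
citance = ["lexical", "semantic", "class", "verb"]
candidate = ["verb", "class", "hierarchy"]
print(jaccard_similarity(citance, candidate),
      idf_similarity(citance, candidate, idf),
      tfidf_cosine(citance, candidate, idf))
```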
To select relevant features for model construction, we first tested each feature with four classifiers: Decision Tree (DT), Logistic Regression (LR) and SVM (with linear and RBF kernels). We select negative and positive samples at different class ratios (1, 2, 3, 4, 5 and 6) to investigate the stability of performance on different training sets. Figure 3 displays the average F1 values of the different feature-classifier combinations.

Fig. 3. Average F1 of All Features with Different Proportions of Negative and Positive Samples (panels (a)-(f) correspond to #Negative/#Positive ratios of 1 to 6; each panel compares DT, LG, SVM (RBF) and SVM (Linear))

To pick out the best feature combinations, we conduct subset selection by iteratively evaluating candidate subsets of a selected feature set. Based on Figure 3, for each classifier we choose the features that are most robust across the different class ratios and perform well as the fixed features; less robust features form the selected feature set. We set the class ratio of negative to positive samples to 5.5. Tables 2 to 5 show the fixed and selected feature sets for each classifier and their precision, recall and F1.

Table 2. Fixed and Selected Feature Sets for SVM (Linear) and their Precision, Recall, F1
Fixed features: tfidf_sim, idf_sim
- (fixed only): P = 0.2231, R = 0.0216, F1 = 0.0391
- bigram: P = 0.5356, R = 0.0647, F1 = 0.1140
- lda_sim: P = 0.3196, R = 0.0256, F1 = 0.0460
- bigram, lda_sim: P = 0.5480, R = 0.1095, F1 = 0.1810

Table 3. Fixed and Selected Feature Sets for SVM (RBF) and their Precision, Recall, F1
Fixed features: tfidf_sim, idf_sim, jaccard_sim
- (fixed only): P = 0.5720, R = 0.1774, F1 = 0.2679
- ssid: P = 0.3063, R = 0.1622, F1 = 0.2091
- lda_sim: P = 0.6221, R = 0.1510, F1 = 0.2411
- jaccard_sim: P = 0.5924, R = 0.1550, F1 = 0.2438
- ssid, lda_sim: P = 0.3450, R = 0.1806, F1 = 0.2332
- ssid, jaccard_sim: P = 0.3224, R = 0.1462, F1 = 0.2002
- lda_sim, jaccard_sim: P = 0.5579, R = 0.1358, F1 = 0.2166
- ssid, lda_sim, jaccard_sim: P = 0.3594, R = 0.1822, F1 = 0.2373

Table 4. Fixed and Selected Feature Sets for LR and their Precision, Recall, F1
Fixed features: tfidf_sim, idf_sim, jaccard_sim
- (fixed only): P = 0.6230, R = 0.1805, F1 = 0.2787
- sec_position: P = 0.6357, R = 0.1885, F1 = 0.2869
- lda_sim: P = 0.6225, R = 0.1845, F1 = 0.2827
- sec_position, lda_sim: P = 0.6375, R = 0.2036, F1 = 0.3060

Table 5. Fixed and Selected Feature Sets for DT and their Precision, Recall, F1
Fixed features: tfidf_sim, idf_sim, jaccard_sim, sent_position, sid
- (fixed only): P = 0.4131, R = 0.4098, F1 = 0.4102
- inner_position: P = 0.3840, R = 0.3843, F1 = 0.3877
- lda_sim: P = 0.3976, R = 0.3779, F1 = 0.3880
- d2v_sim: P = 0.3512, R = 0.3659, F1 = 0.3616
- w2v_sim: P = 0.3863, R = 0.3923, F1 = 0.3776
- inner_position, lda_sim: P = 0.3974, R = 0.3746, F1 = 0.3866
- inner_position, d2v_sim: P = 0.4004, R = 0.3730, F1 = 0.3811
- inner_position, w2v_sim: P = 0.4170, R = 0.4139, F1 = 0.4152
- lda_sim, d2v_sim: P = 0.3646, R = 0.3819, F1 = 0.3632
- lda_sim, w2v_sim: P = 0.3843, R = 0.3802, F1 = 0.3826
- d2v_sim, w2v_sim: P = 0.3574, R = 0.3635, F1 = 0.3518
- inner_position, lda_sim, d2v_sim: P = 0.3792, R = 0.3786, F1 = 0.3752
- inner_position, lda_sim, w2v_sim: P = 0.4060, R = 0.4122, F1 = 0.3963
- inner_position, d2v_sim, w2v_sim: P = 0.3709, R = 0.3707, F1 = 0.3706
- lda_sim, d2v_sim, w2v_sim: P = 0.3818, R = 0.4066, F1 = 0.3815
- inner_position, lda_sim, d2v_sim, w2v_sim: P = 0.3858, R = 0.3794, F1 = 0.3730

As we can see, Decision Tree and Logistic Regression perform better than SVM (Linear and RBF). Therefore, when integrating the classifiers, we construct two voting systems: a 4-classifier system containing all classifiers and a 3-classifier system from which SVM (Linear) is removed.
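The subset selection behind Tables 2 to 5 can be sketched as follows, assuming the per-sentence-pair features are held in a pandas DataFrame and scikit-learn classifiers are used. The helper function, the cross-validation setup and the commented usage are our own illustrative choices, not the exact procedure of the system.

```python
# Sketch of the fixed-plus-subset feature evaluation behind Tables 2-5.
# Assumptions: X is a pandas DataFrame of per-sentence-pair features, y holds
# the binary labels, and the negative/positive ratio of 5.5 has already been
# produced by sampling.
from itertools import combinations

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support

def evaluate_subsets(X: pd.DataFrame, y, fixed, candidates, clf=None):
    """Evaluate the fixed features plus every subset of the candidate features."""
    clf = clf or DecisionTreeClassifier()
    results = {}
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            cols = list(fixed) + list(subset)
            pred = cross_val_predict(clf, X[cols], y, cv=5)
            p, rec, f1, _ = precision_recall_fscore_support(y, pred, average="binary")
            results[subset or ("fixed only",)] = (p, rec, f1)
    return results

# Hypothetical usage with the DT feature sets from Table 5:
# fixed = ["tfidf_sim", "idf_sim", "jaccard_sim", "sent_position", "sid"]
# candidates = ["inner_position", "lda_sim", "d2v_sim", "w2v_sim"]
# scores = evaluate_subsets(features_df, labels, fixed, candidates)
```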
Parameter Setting. In this system, the voting weights of the multi-classifiers and the running settings are the important parameters to adjust. Based on Tables 2 to 5, we compute the average precision, recall and F1 of each classifier and use these average values as the voting weights. Since SVM (Linear) behaves worst among the four classifiers, we build another voting system based only on the other three. The voting weights of the 4-classifier and 3-classifier systems are shown in Table 6 and Table 7.

Table 6. Voting Weights of the Precision-, Recall- and F1-Oriented 4-Classifier Systems
- Precision-oriented: SVM (Linear) 0.2160, SVM (RBF) 0.2443, DT 0.2051, LG 0.3346
- Recall-oriented: SVM (Linear) 0.0699, SVM (RBF) 0.2039, DT 0.4870, LG 0.2392
- F1-oriented: SVM (Linear) 0.0954, SVM (RBF) 0.2320, DT 0.3829, LG 0.2897

Table 7. Voting Weights of the Precision-, Recall- and F1-Oriented 3-Classifier Systems
- Precision-oriented: SVM (RBF) 0.3116, DT 0.2617, LG 0.4268
- Recall-oriented: SVM (RBF) 0.2192, DT 0.5236, LG 0.2572
- F1-oriented: SVM (RBF) 0.2565, DT 0.4233, LG 0.3202
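A minimal sketch of how the weighted vote can be combined is given below. It assumes each classifier outputs a probability-like score per candidate RP sentence for a given citance (for example via predict_proba in scikit-learn) and that the weights come from one column of Table 6 or Table 7. The threshold value, the fallback rule and the helper names are illustrative, not the exact implementation.

```python
# Sketch of the weighted voting over classifier scores (Tables 6 and 7).
# `scores` maps each classifier name to one score per candidate RP sentence;
# `weights` is one column of Table 6/7 (here the F1-oriented 3-classifier weights).

def weighted_vote(scores: dict, weights: dict, threshold: float = 0.61):
    """Return indices of sentences whose weighted score reaches the threshold."""
    n = len(next(iter(scores.values())))
    combined = [
        sum(weights[clf] * scores[clf][i] for clf in weights)
        for i in range(n)
    ]
    selected = [i for i, s in enumerate(combined) if s >= threshold]
    # Fall back to the best-scoring sentence if nothing passes the threshold.
    return selected or [max(range(n), key=lambda i: combined[i])]

weights = {"SVM_RBF": 0.2565, "DT": 0.4233, "LG": 0.3202}
scores = {"SVM_RBF": [0.2, 0.7, 0.4], "DT": [0.1, 0.9, 0.3], "LG": [0.3, 0.8, 0.2]}
print(weighted_vote(scores, weights))  # -> [1]
```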
New Classifier. Besides the classifiers applied before, we also utilize a new one, XGBOOST, an efficient and scalable implementation of the gradient boosting framework [21]. We use it as a single classifier, without integrating it into the voting system. When testing on the training data, we select negative and positive samples at ratios of 2, 3, 4 and 5. Figure 4 shows the average F1.

Fig. 4. Average F1 of All Features with XGBOOST (features: sid, ssid, sent_position, sec_position, inner_position, lda_sim, jaccard_sim, tf_idf_sim, idf_sim, bigram, d2v_sim, w2v_sim; #Negative/#Positive ratios of 2 to 5)

Accordingly, we choose the fixed features (bigram, IDF similarity and WordNet similarity) and the selected feature set (LDA similarity and Doc2Vec similarity) for XGBOOST and test again on the training data with the negative/positive ratio and penalty factor set to 5.5, 6, 6.5 and 7. The F1 values are shown in Table 8.

Table 8. Fixed and Selected Feature Sets for XGBOOST and their F1 (columns: negative/positive ratio and penalty factor of 5.5, 6, 6.5 and 7)
Fixed features: bigram, idf_sim, wordnet_sim
- (fixed only): 0.5746, 0.5647, 0.4931, 0.4647
- lda_sim: 0.6231, 0.5562, 0.5309, 0.4974
- d2v_sim: 0.5868, 0.4846, 0.5212, 0.4252
- d2v_sim, lda_sim: 0.7316, 0.5588, 0.5740, 0.5123

3.2 Task 1B

In this task, for each cited text span, we need to identify which facet of the paper it belongs to. Our system for Task 1B has three components:

- Dictionary: We construct two kinds of dictionaries for the five facets, a manual dictionary and a POS dictionary. The former is built manually; the latter is built from part-of-speech tagging results, keeping words tagged VB or JJ. In detail, the method POS dictionary keeps words with frequency over 5, and the POS dictionaries of the other facets keep words with frequency over 2.
- Supervised topic model: Since the proposal of latent semantic indexing, latent topic modeling has become very popular for topic discovery in document collections, for example Latent Dirichlet Allocation (LDA) [22]. A supervised topic model (LLDA) [23] followed, which can overcome limitations of the traditional models. It assumes that topic labels (keywords) are available and characterizes each topic by a multinomial distribution over all vocabulary words.
- XGBOOST: Tree boosting is a highly effective and widely used machine learning method, and we apply XGBOOST [24] for approximate tree learning. The model is trained with 15 features: five are the numbers of words matched against the manual dictionary, five are the numbers of words matched against the POS dictionary, and the remaining five are the position-based features described in Section 3.1.

Based on these three components, we use five different strategies:

Manual Dictionary. Based on the five facet dictionaries, if the section title or the sentence content contains any of the words in the corresponding dictionary, the sentence is directly classified as that facet. Since the manual dictionary is more accurate than the POS dictionary, we only apply this strategy with the manual dictionary. When making judgments, the first identified facet should contain more than Count_M1 (= 1) matched word from its dictionary and the second identified facet more than Count_M2 (= 2) matched words. To find the best order of judging facets, we run experiments over all permutations. Of the 120 resulting orders, Table 9 shows the top 20 by F1, and a small sketch of this judgment procedure is given after the table.

Table 9. Top 20 Average F1 Generated via Different Judging Orders Using the Manual Dictionary
- implication->method->result->aim->hypothesis: 0.7179
- implication->method->result->hypothesis->aim: 0.7179
- implication->method->hypothesis->result->aim: 0.7179
- implication->method->aim->result->hypothesis: 0.7162
- implication->method->aim->hypothesis->result: 0.7162
- implication->method->hypothesis->aim->result: 0.7162
- method->result->implication->aim->hypothesis: 0.7159
- method->result->implication->hypothesis->aim: 0.7159
- method->result->aim->implication->hypothesis: 0.7159
- method->result->aim->hypothesis->implication: 0.7159
- method->result->hypothesis->implication->aim: 0.7159
- method->result->hypothesis->aim->implication: 0.7159
- method->hypothesis->result->implication->aim: 0.7159
- method->hypothesis->result->aim->implication: 0.7159
- implication->hypothesis->method->result->aim: 0.7146
- hypothesis->implication->method->result->aim: 0.7146
- method->implication->result->aim->hypothesis: 0.7146
- method->implication->result->hypothesis->aim: 0.7146
- method->implication->hypothesis->result->aim: 0.7146
- method->hypothesis->implication->result->aim: 0.7146
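The sketch below illustrates one judging order with the two count thresholds. The facet word lists are invented placeholders (the real dictionaries were built manually), the default facet when nothing matches is our own assumption, and the reading of Count_M1 and Count_M2 as minimum match counts is our interpretation of the description above.

```python
# Illustrative sketch of the ordered, dictionary-based facet judgment
# (manual dictionary strategy, Section 3.2). Word lists are placeholders.

FACET_DICTIONARIES = {
    "implication": {"suggest", "indicate", "future"},
    "method":      {"algorithm", "model", "approach", "train"},
    "result":      {"accuracy", "improve", "outperform"},
    "aim":         {"goal", "propose", "aim"},
    "hypothesis":  {"assume", "hypothesis"},
}
JUDGING_ORDER = ["implication", "method", "result", "aim", "hypothesis"]

def identify_facets(sentence: str, count_m1: int = 1, count_m2: int = 2):
    words = set(sentence.lower().split())
    facets = []
    for facet in JUDGING_ORDER:
        matched = len(words & FACET_DICTIONARIES[facet])
        needed = count_m1 if not facets else count_m2   # first vs. second facet threshold
        if matched >= needed:
            facets.append(facet)
        if len(facets) == 2:          # at most two facets per cited text span (assumption)
            break
    return facets or ["method"]       # default to the majority facet (assumption)

print(identify_facets("we propose a model and train it on the corpus"))  # -> ['method']
```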
LLDA. For the training data, we assume that each identified facet is a topic label and that each citation sentence is a mixture of the expert-assigned topics, which can be learned. We first train an LLDA model on the training data with five dimensions and then apply the trained model to predict on the testing data, which has no labels yet. After representing each sentence as a probability distribution over the five facets, we take the most probable facet as its identified facet. Since some sentences may have more than one facet, we set a probability threshold (P_LLDA2 = 0.2 or 0.195) for the second most probable facet. Regarding the LLDA parameters, we adjust beta, where a low beta value places more weight on having each topic composed of only a few dominant words. Table 10 shows different beta settings and their corresponding F1.

Table 10. Average F1 under Different Beta Settings
- beta = 0.1: 0.3576
- beta = 0.2: 0.5005
- beta = 0.5: 0.6939
- beta = 0.7: 0.7230
- beta = 1.2: 0.7278
- beta = 1.5: 0.7241
- beta = 2: 0.7228
- beta = 5: 0.7228

XGBOOST. Here, we use XGBOOST for classification in this task. For feature selection, the position-based features from Section 3.1 form the selected feature set, whose candidate subsets are evaluated. The performance of the different selected feature sets is given in Table 11.

Table 11. Selected Feature Sets for XGBOOST and their F1
- sid, sid_position: 0.7114
- sid, inner_position: 0.7102
- sid: 0.7077
- sid, sid_position, inner_position, section_position: 0.7077
- sid_position, inner_position: 0.7065
- sid, sid_position, inner_position: 0.7065
- sid_position: 0.7054
- ssid, sid_position, inner_position: 0.7053
- inner_position, section_position: 0.7052
- sid, ssid, sid_position: 0.7052
- sid, ssid, inner_position: 0.7052
- sid, inner_position, section_position: 0.7052
- sid_position, inner_position, section_position: 0.7052
- sid, sid_position, section_position: 0.7050
- sid, section_position: 0.7039
- sid, ssid, sid_position, section_position: 0.7039
- sid_position, section_position: 0.7029
- sid, ssid, inner_position, section_position: 0.7027
- sid, ssid, sid_position, inner_position, section_position: 0.7014
- ssid, sid_position: 0.7004
- ssid, sid_position, section_position: 0.7004
- sid, ssid, sid_position, inner_position: 0.7003
- inner_position: 0.7002
- ssid: 0.6992
- section_position: 0.6992
- sid, ssid: 0.6990
- sid, ssid, section_position: 0.6990
- ssid, inner_position, section_position: 0.6990
- ssid, sid_position, inner_position, section_position: 0.6990
- ssid, section_position: 0.6979
- ssid, inner_position: 0.6978

Manual Dictionary + LLDA. Different from the plain LLDA strategy, we use the testing data labeled by the manual dictionary as the testing data for LLDA prediction. Here, we also set the probability threshold for the second possible facet (P_LLDA2 = 0.18) and the thresholds for the matched word counts of the first and second identified facet in the different judging orders (Count_M1 = 1 and Count_M2 = 2). To find the best order of judging facets, we again run experiments over all permutations; Table 12 shows the top 20 by F1.

Table 12. Top 20 Average F1 Generated via Different Judging Orders (Manual Dictionary + LLDA)
- implication->method->result->aim->hypothesis: 0.7191
- implication->method->result->hypothesis->aim: 0.7191
- implication->method->hypothesis->result->aim: 0.7191
- implication->method->aim->result->hypothesis: 0.7178
- implication->method->aim->hypothesis->result: 0.7178
- implication->method->hypothesis->aim->result: 0.7178
- method->result->implication->aim->hypothesis: 0.7165
- method->result->implication->hypothesis->aim: 0.7165
- method->result->aim->implication->hypothesis: 0.7165
- method->result->aim->hypothesis->implication: 0.7165
- method->result->hypothesis->implication->aim: 0.7165
- method->result->hypothesis->aim->implication: 0.7165
- method->hypothesis->result->implication->aim: 0.7165
- method->hypothesis->result->aim->implication: 0.7165
- implication->hypothesis->method->result->aim: 0.7157
- hypothesis->implication->method->result->aim: 0.7157
- method->aim->result->implication->hypothesis: 0.7152
- method->aim->result->hypothesis->implication: 0.7152
- method->aim->hypothesis->result->implication: 0.7152
- method->hypothesis->aim->result->implication: 0.7152
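For the LLDA-based strategies, the facet decision reduces to reading the per-facet probability distribution that the trained model assigns to a sentence: the most probable facet is always kept, and a second facet is added when its probability reaches the threshold P_LLDA2. A minimal sketch with a made-up distribution follows; training the labeled topic model itself is not shown.

```python
# Sketch of facet selection from an LLDA topic distribution (Section 3.2).
# `distribution` maps each facet label to the probability assigned by the
# trained LLDA model; the numbers below are made up for illustration.

def select_facets(distribution: dict, second_threshold: float = 0.2):
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    facets = [ranked[0][0]]                      # most probable facet is always kept
    if len(ranked) > 1 and ranked[1][1] >= second_threshold:
        facets.append(ranked[1][0])              # second facet only above P_LLDA2
    return facets

distribution = {"method": 0.46, "result": 0.22, "aim": 0.14,
                "implication": 0.10, "hypothesis": 0.08}
print(select_facets(distribution, second_threshold=0.2))  # -> ['method', 'result']
```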
POS Dictionary + LLDA. Similar to the previous method, we use the testing data labeled by the POS dictionary as the testing data for LLDA prediction. We set the same three parameters in this strategy, with P_LLDA2 = 0.18, Count_P1 = 3 and Count_P2 = 8. The top 20 F1 values over the different judging orders are given in Table 13.

Table 13. Top 20 Average F1 Generated via Different Judging Orders (POS Dictionary + LLDA)
- method->implication->result->aim->hypothesis: 0.7511
- method->implication->result->hypothesis->aim: 0.7511
- method->implication->hypothesis->result->aim: 0.7511
- method->hypothesis->implication->result->aim: 0.7511
- method->implication->aim->result->hypothesis: 0.7498
- method->implication->aim->hypothesis->result: 0.7498
- method->implication->hypothesis->aim->result: 0.7498
- method->aim->implication->result->hypothesis: 0.7498
- method->aim->implication->hypothesis->result: 0.7498
- method->aim->hypothesis->implication->result: 0.7498
- method->result->implication->aim->hypothesis: 0.7498
- method->result->implication->hypothesis->aim: 0.7498
- method->result->aim->implication->hypothesis: 0.7498
- method->result->aim->hypothesis->implication: 0.7498
- method->result->hypothesis->implication->aim: 0.7498
- method->result->hypothesis->aim->implication: 0.7498
- method->hypothesis->result->implication->aim: 0.7498
- method->hypothesis->result->aim->implication: 0.7498
- method->hypothesis->implication->aim->result: 0.7498
- method->hypothesis->aim->implication->result: 0.7498

3.3 Task 2

Summary generation is divided into two main steps. The first is to group sentences into clusters based on their similarity to different parts of the abstract. The second is to use several features to extract sentences from each cluster and combine them into a summary.

Normally, the abstract is a complete but concise description of the work, and its different parts, such as motivation, problem statement, approach, results and conclusions, may be merged or spread over a set of sentences. We therefore organize the abstract sentences of the reference paper in advance and group the identified cited text spans by their similarity to the different parts of the abstract. We assume that the abstract contains motivation, approach and conclusion. To split it into these three groups, we apply a rule-based method based on writing style. When people write summaries such as abstracts, they often start with fixed phrases such as "this paper", "in this paper" or "we"; if the first sentence does not contain these phrases, it is most often about the motivation of the paper, while the last sentence is usually about results or conclusions. We therefore first split the abstract sentences into groups following these rules. Each identified text span is then assigned to a group based on its similarity to the grouped abstract sentences, using the linear sum of the Jaccard, IDF and TF-IDF similarities. After this, we rank the sentences within each group using a weighted combination of those three similarities, the sentence length and the sentence position:

Score_i = 2.5 * S_Jaccard + 2.5 * S_IDF + 2.5 * S_TFIDF + 1.25 * S_Length + 1.25 * S_Position   (1)

Finally, in each round we take the first remaining sentence from each cluster and add it to the summary, stopping before the length of the summary exceeds 250 words.
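The ranking and assembly step can be sketched as below. It assumes the identified sentences have already been grouped by their similarity to the motivation, approach and conclusion parts of the abstract and that the five feature scores in Eq. (1) have been normalized; the helper names and the round-robin loop are our own illustrative choices.

```python
# Sketch of the Task 2 ranking (Eq. 1) and round-robin summary assembly.
# `groups` is a list of sentence groups (one per abstract part); each sentence
# is a dict holding its text and normalized feature scores.

WEIGHTS = {"jaccard": 2.5, "idf": 2.5, "tfidf": 2.5, "length": 1.25, "position": 1.25}

def score(sentence: dict) -> float:
    # Eq. (1): weighted sum of the five normalized feature scores.
    return sum(WEIGHTS[name] * sentence[name] for name in WEIGHTS)

def build_summary(groups, max_words: int = 250) -> str:
    ranked_groups = [sorted(g, key=score, reverse=True) for g in groups if g]
    summary, n_words, rank = [], 0, 0
    # Take the best remaining sentence from each group in turn until adding
    # another sentence would exceed the word limit.
    while any(rank < len(g) for g in ranked_groups):
        for g in ranked_groups:
            if rank < len(g):
                words = len(g[rank]["text"].split())
                if n_words + words > max_words:
                    return " ".join(summary)
                summary.append(g[rank]["text"])
                n_words += words
        rank += 1
    return " ".join(summary)

groups = [
    [{"text": "We propose a verb clustering method.", "jaccard": 0.4, "idf": 0.5,
      "tfidf": 0.6, "length": 0.3, "position": 0.2}],
    [{"text": "Results improve over the baseline.", "jaccard": 0.2, "idf": 0.3,
      "tfidf": 0.4, "length": 0.5, "position": 0.9}],
]
print(build_summary(groups))
```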
4 Experiments

4.1 Data and Tools

For corpus preprocessing, we remove stop words and stem words to their base forms with the Porter stemmer algorithm6. We then apply the Word2Vec and Doc2Vec models in Gensim7 and the Python LDA package8 to represent documents. All classifiers were run via the Scikit-learn Python package9, and XGBOOST was obtained from a Python extension package website10. The source code of our system will be made available at: https://github.com/michellemashutian/NJUST-at-CLSciSumm/tree/master/NJUST-2018.

6 Available at: http://snowball.tartarus.org/algorithms/porter/stemmer.html
7 Available at: https://radimrehurek.com/gensim/
8 Available at: https://pypi.org/project/lda/
9 Available at: http://scikit-learn.org/stable/index.html
10 Available at: https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost

4.2 Submission Results

Task 1A. After applying the best feature combinations to the 4-classifier and 3-classifier systems and testing different parameters, we obtain the average F1 shown in Figure 5. The negative/positive sample proportion and penalty factor are tested at 5.5, 6, 6.5 and 7, with the voting threshold ranging from 0.6 to 0.8 in steps of 0.01.

Fig. 5. Average F1 when using the Best Feature Combinations on the 4-Classifier and 3-Classifier Voting Systems (panels (a)-(f): precision-, recall- and F1-oriented weights for the 3-classifier and 4-classifier systems)

According to Figure 5, we pick the top 10 performing multi-classifier configurations; their parameters are given in Table 14. Besides the voting systems, we also submit another 10 runs obtained via single classifiers; their parameters and features are given in Table 15.

Table 14. Parameter Settings for Task 1A Submissions Using the Voting Systems (columns: voting weights, #Neg/#Pos and penalty factor, thresholds)
3-classifier system:
- Precision-oriented, 5.5, 0.68
- Recall-oriented, 5.5, 0.65
- F1-oriented, 5.5, 0.61/0.63
4-classifier system:
- Precision-oriented, 5.5, 0.63
- Recall-oriented, 5.5, 0.63
- F1-oriented, 5.5/6.5, 0.6/0.61

Table 15. Parameter Settings for Task 1A Submissions Using Single Classifiers (#Neg/#Pos and penalty factor 5.5 for all runs)
- DT: tf_idf_sim, idf_sim, jaccard_sim, sent_position, sid, inner_position, w2v_sim
- DT: tf_idf_sim, idf_sim, jaccard_sim, sent_position, sid
- LG: tf_idf_sim, idf_sim, jaccard_sim, sec_position, lda_sim
- LG: tf_idf_sim, idf_sim, jaccard_sim, sec_position
- SVM(RBF): tf_idf_sim, idf_sim, sid, jaccard_sim
- SVM(RBF): tf_idf_sim, idf_sim, sid
- XGBOOST: bigram, idf_sim, wordnet_sim, d2v_sim, lda_sim
- XGBOOST: bigram, idf_sim, wordnet_sim, d2v_sim
- XGBOOST: bigram, idf_sim, wordnet_sim, lda_sim
- XGBOOST: bigram, idf_sim, wordnet_sim

Task 1B. For the dictionary-based strategies, based on the performance of the different judgment orders (Table 9, Table 12 and Table 13), we select specific orders according to their F1 results; when two orders generate the same facet identification on the testing data, we move to the next order with lower F1. For the LLDA strategy, we pick the top 4 results with their corresponding beta settings to run on the test data. For the XGBOOST strategy, we also select the top 4 results with the corresponding feature selections to run on the test data. Table 16 shows the overall parameter settings of our Task 1B submissions.
Table 16. Parameter Settings for Task 1B Submissions Using the Five Strategies

LLDA:
- beta = 1.2, P_LLDA2 = 0.2
- beta = 1.2, P_LLDA2 = 0.195
- beta = 1.5, P_LLDA2 = 0.2
- beta = 1.5, P_LLDA2 = 0.195

Manual Dictionary:
- implication->hypothesis->method->result->aim, Count_M1 = 1 and Count_M2 = 2
- implication->method->result->aim->hypothesis, Count_M1 = 1 and Count_M2 = 2
- method->result->implication->aim->hypothesis, Count_M1 = 1 and Count_M2 = 2

Manual Dictionary + LLDA:
- beta = 1.2, P_LLDA2 = 0.18, implication->method->result->aim->hypothesis, Count_M1 = 1 and Count_M2 = 2
- beta = 1.2, P_LLDA2 = 0.18, method->result->implication->aim->hypothesis, Count_M1 = 1 and Count_M2 = 2
- beta = 1.2, P_LLDA2 = 0.18, implication->method->result->aim->hypothesis, Count_P1 = 1 and Count_P2 = 3

POS Dictionary + LLDA:
- beta = 1.2, P_LLDA2 = 0.18, method->implication->aim->result->hypothesis, Count_P1 = 3 and Count_P2 = 8
- beta = 1.2, P_LLDA2 = 0.18, method->implication->result->aim->hypothesis, Count_P1 = 3 and Count_P2 = 8
- beta = 1.2, P_LLDA2 = 0.18, method->result->implication->aim->hypothesis, Count_P1 = 3 and Count_P2 = 8

XGBOOST:
- sid, sid_position
- sid, inner_position
- sid
- sid, sid_position, inner_position, section_position

5 Conclusion

This paper describes our participant system, NJUST, for CL-SciSumm 2018. Compared with our previous system, we have added semantic information, such as WordNet and Word2Vec similarities, to improve citance linkage and summarization performance, and we have optimized the bigram feature. Feature choices and parameter settings were determined through systematic comparative experiments. New methods are proposed for facet identification and automatic summarization: in Task 1B, rule-based methods are combined with supervised topic modeling and XGBOOST, and in Task 2, we take advantage of the abstract structure. In future work, more can be done on all three tasks. For Task 1A and Task 1B, we can try new classifiers and compare their performance. For Task 2, we need to find more features for calculating the sentence ranking score, and we can also make use of the Task 1B results to generate a more reasonable summary.

Acknowledgements

This work is supported by the Major Projects of the National Social Science Fund (No. 17ZDA291), the Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (No. MJUKF201704) and the Qing Lan Project.

References

1. Stevenson, S. and Joanis, E. Semi-supervised verb class discovery using noisy features. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4. Association for Computational Linguistics, 2003.
2. Chieu, H.L. and Ng, H.T. Named entity recognition: a maximum entropy approach using global information. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, 2002.
3. Qazvinian, V., et al. Generating extractive summaries of scientific paradigms. Journal of Artificial Intelligence Research, 2013, 46, pp. 165-201.
4. de Waard, A. and Maat, H.P. Epistemic modality and knowledge attribution in scientific discourse: a taxonomy of types and overview of features. In Proceedings of the Workshop on Detecting Structure in Scholarly Discourse, pp. 47-55. Association for Computational Linguistics, Jeju, Republic of Korea, 2012.
5. Ma, S., et al. NJUST @ CLSciSumm-17. In Proceedings of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017), Tokyo, Japan, 2017.
6. Ma, S., Xu, J. and Zhang, C. Automatic identification of cited text spans: a multi-classifier approach over imbalanced dataset. Scientometrics, 2018.
7. Ware, M. and Mabe, M. The STM report: An overview of scientific and scholarly journal publishing. 2015.
8. Jaidka, K., et al. Insights from CL-SciSumm 2016: the faceted scientific document summarization Shared Task. International Journal on Digital Libraries, 2017, pp. 1-9.
9. Jaidka, K., et al. The CL-SciSumm Shared Task 2017: results and key insights. In Proceedings of the Computational Linguistics Scientific Summarization Shared Task (CL-SciSumm 2017), organized as a part of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017), 2017.
10. Li, L., et al. CIST System for CL-SciSumm 2016 Shared Task. In BIRNDL@JCDL, 2016.
11. Aggarwal, P. and Sharma, R. Lexical and Syntactic cues to identify Reference Scope of Citance. In BIRNDL@JCDL, 2016.
12. Cao, Z., Li, W. and Wu, D. PolyU at CL-SciSumm 2016. In BIRNDL@JCDL, 2016.
13. Prasad, A. WING-NUS at CL-SciSumm 2017: Learning from Syntactic and Semantic Similarity for Citation Contextualization. In Proceedings of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017), Tokyo, Japan, 2017.
14. Li, L., et al. CIST@CLSciSumm-17: Multiple Features Based Citation Linkage, Classification and Summarization. In Proceedings of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017), Tokyo, Japan, 2017.
15. Abura'ed, A., et al. LaSTUS/TALN@CLSciSumm-17: cross-document sentence matching and scientific text summarization systems. 2017.
16. Moraes, L., et al. University of Houston at CL-SciSumm 2016: SVMs with tree kernels and Sentence Similarity. In Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL), 2016.
17. Lauscher, A., Glavaš, G. and Eckert, K. University of Mannheim@CLSciSumm-17: Citation-Based Summarization of Scientific Articles Using Semantic Textual Similarity. 2017.
18. Zhang, D. and Li, S. PKU@CLSciSumm-17: Citation Contextualization. In Proceedings of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017), Tokyo, Japan, 2017.
19. Klampfl, S., Rexha, A. and Kern, R. Identifying referenced text in scientific publications by summarisation and classification techniques. In Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL), 2016.
20. Saggion, H. and Ronzano, F. Trainable citation-enhanced summarization of scientific articles. In Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL), 2016.
21. Friedman, J.H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 2001, pp. 1189-1232.
22. Blei, D.M., Ng, A.Y. and Jordan, M.I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, 3(Jan), pp. 993-1022.
23. McAuliffe, J.D. and Blei, D.M. Supervised topic models. In Advances in Neural Information Processing Systems, 2008.
24. Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.