CIST@CLSciSumm-19: Automatic Scientific Paper Summarization with Citances and Facets

Lei Li, Yingqi Zhu, Yang Xie, Zuying Huang, Wei Liu, Xingyuan Li, and Yinan Liu
Beijing University of Posts and Telecommunications (BUPT)
No.10 Xitucheng Road, Haidian District, Beijing, P.R.China
{leili,zhuyq,zoehuang,thinkwee,lynbngt}@bupt.edu.cn
xieyangsp@163.com
kuukisann@gmail.com

Abstract. Building on its former version, CIST@CLSciSumm-18, our CIST@CLSciSumm-19 system participates in the shared Task 1A (citation linkage), Task 1B (facet classification) and Task 2 (summarization) in CLSciSumm-19@SIGIR2019. We aim to improve its methods for all the shared tasks. We build a new feature, Word2vec H, as the input of a CNN model that calculates sentence similarity for citation linkage. We adopt CNN and RNN variants for facet classification. And in order to improve the performance of summarization, we develop more semantic representations for sentences based on neural network language models to construct new kernel matrices for Determinantal Point Processes (DPPs).

Keywords: Citation Linkage · Facet Classification · Summarization · Word2vec H · Neural Network Language Model · Determinantal Point Processes (DPPs)

1 Introduction

As scientific papers, publications in computational linguistics are characterized by professional knowledge, rigorous writing and strong logic. Reading such articles is very rewarding, but manual reading takes a lot of time, so we need to study how to extract good summaries to reduce the workload of readers. The main goal of CLSciSumm-19 [1] is to explore automatic summarization methods based on the characteristics of papers in the field of computational linguistics, and to provide a comprehensive and readable summary for each paper. We tried to solve the three tasks contained in CLSciSumm-19: Task 1A, Task 1B and Task 2.

The dataset we use consists of papers in the field of computational linguistics provided by the organizers, arranged into topics. Each topic consists of a Reference Paper (RP) and Citing Papers (CPs) that all contain citations to the RP. In each CP, the text spans (i.e., citances) that pertain to a particular citation to the RP have been identified. Task 1A: For each citance, identify the spans of text (cited text spans, CTS) in the RP that most accurately reflect the citance. These are of the granularity of a sentence fragment, a full sentence, or several consecutive sentences (no more than 5). Task 1B: For each cited text span, identify which facet of the paper it belongs to, from a predefined set of facets. Task 2 (optional bonus task): Finally, generate a structured summary of the RP from the cited text spans of the RP. The length of the summary should not exceed 250 words.

In this paper, based on our previous work, we add the Word2vec H feature to the Task 1A method and use a CNN to obtain the content linkage result. For Task 1B, we use improved CNN and RNN structures for classification. For Task 2, we develop more semantic representations for sentences based on neural network language models to construct new kernel matrices for Determinantal Point Processes (DPPs).

2 Related work

Task 1A is a content linkage task, and the common approach is to calculate similarity. This includes not only surface measures such as Cosine similarity and Jaccard similarity, but also semantic similarity methods such as BM25 and VSM [2].
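To make the two surface measures concrete, the following is a minimal Python sketch of Cosine similarity over term-frequency vectors and Jaccard similarity over token sets; the function names and the whitespace tokenization in the usage line are our own illustrative choices, not part of any cited system.

    import math
    from collections import Counter

    def cosine_sim(a_tokens, b_tokens):
        # Cosine similarity between the term-frequency vectors of two sentences
        a, b = Counter(a_tokens), Counter(b_tokens)
        dot = sum(a[t] * b[t] for t in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def jaccard_sim(a_tokens, b_tokens):
        # Jaccard similarity between the token sets of two sentences
        a, b = set(a_tokens), set(b_tokens)
        return len(a & b) / len(a | b) if a | b else 0.0

    print(cosine_sim("the cited text span".split(), "the citing text".split()))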
In addition, various characteristics of the words are also very important, such as word position, part of speech and frequency. These word-level characteristics of the two sentences are added to the sentence-pair similarity calculation, so that the similarity of the two sentences can also be judged at the word level [3]. With the continuous expansion of the corpus and the increasing number of features, machine learning methods have begun to emerge for this task. Researchers first tried basic classifiers, such as an SVM with a radial basis function kernel, Decision Trees and Logistic Regression, to identify the reference span [4]. Since different classifiers can learn different text features, integrating them can reveal more of those features, so researchers also use ensemble models such as Random Forest [3]. Besides, in order to explore the meaning of the sentence more deeply, deep neural networks have also been applied, such as CNNs [5] [6] [9] and Siamese deep learning networks [8].

For Task 1B, both rule-based methods [8] [10] and classification methods [7] can be used, and both focus on exploring good text features. Rule-based methods, such as building a dictionary for each discourse facet [2], are less adaptive. Most studies combine category features with classification algorithms to improve classification accuracy: [2] use a multi-feature Random Forest classifier, while others use a supervised topic model with XGBOOST [4], or an SVM with tf-idf and naive Bayes features [6].

Task 2 is a summarization task. [11] focus on exploring the sampling process: they use WMD sentence similarity to construct a new kernel matrix for Determinantal Point Processes (DPPs). [4] divide all sentences into three categories (motivations, methods, and conclusions), and then extract sentences from each cluster based on rules and several features to form a summary. [9] generate a summary by selecting the most relevant sentences from the RP using linguistic and semantic features from the RP and CPs. [10] build a summary generation system using the OpenNMT tool.

3 Method

In our approach, we first obtain the CTS through feature extraction and content linkage in the Citation Linkage step, i.e., the sentences in the RP (RT) related to the citance sentences in the CPs (CT). Then we judge the facet of each CTS by feature extraction and classification in the Facet Classification step. Finally, a summary of the article is obtained through pre-processing, feature selection, sentence sampling and post-processing in the Summary Generation step. The framework of our system is shown in Fig. 1.

Fig. 1. Framework of our system

3.1 Task 1A

The Citation Linkage task consists of two stages: feature extraction and content linkage. In feature extraction, we keep some of the well-performing methods of the past, continuing to use word-cos, Word Vector, sentence similarity (IDF similarity and Jaccard similarity), context similarities and WordNet. Besides, we add a CNN (Convolutional Neural Network) method and LDA-Jaccard. As noted in [11], the LDA vectors of sentences are sparse, that is, the distribution of sentences over topics is sparse, and LDA vectors pay more attention to whether two sentences belong to the same topic. So we use the idea of Jaccard similarity and express the relatedness of a sentence-pair as the ratio of the intersection and union of the topic sets of the two sentences, namely LDA-Jaccard.
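A minimal sketch of LDA-Jaccard, assuming each sentence already has an LDA topic-distribution vector; the 0.01 activation threshold that decides when a topic counts as present in a sentence is a hypothetical choice, since the exact sparsification is not specified here.

    def lda_jaccard(theta_ct, theta_rt, threshold=0.01):
        # Treat each sparse LDA vector as the set of topics it activates,
        # then score relatedness as |intersection| / |union| of the topic sets.
        topics_ct = {k for k, p in enumerate(theta_ct) if p > threshold}
        topics_rt = {k for k, p in enumerate(theta_rt) if p > threshold}
        union = topics_ct | topics_rt
        return len(topics_ct & topics_rt) / len(union) if union else 0.0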
This paper uses the Word2vec H feature as the input of the CNN. It is based on word embeddings: it maps the CT and RT information into a dense feature space, and adds sentence similarity to better guide neural network training. Specifically, CT is represented as an n x d matrix CT_Matrix = [wv_1, ..., wv_i, ..., wv_n], where n is the number of words in CT, d is the word embedding size, and wv_i is the word vector of the i-th word in CT. Firstly, we decompose CT_Matrix by SVD to obtain three matrices, U, S, and V. We take the top min(n,d) values on the diagonal of S as the weight set I_1 = {i_1, i_2, ..., i_min(n,d)}, and take the top min(n,d) rows of V to form CT_V. RT_V and the weight set I_2 of RT are obtained in the same way. Then the cosine similarity is calculated for each pair of rows of CT_V and RT_V to obtain Word2vec V: wv_{i,j} = cosine(l^1_i, l^2_j), where l^1_i and l^2_j are row vectors of CT_V and RT_V respectively. The building process is shown in Fig. 2.

Fig. 2. Word2vec V building process

Fig. 3. Word2vec H and Structure of CNN for Task 1A

Finally, we use I_1 and I_2 to assign weights val_{i,j} = i^1_i * i^2_j to the rows and columns of Word2vec V to get Word2vec H, as shown in Fig. 3(a).

In content linkage, we use multi-feature fusion methods and a binary classification method based on the CNN. The multi-feature fusion methods include voting1.1, voting2.0, Jaccard-Focused-new, and Jaccard-Cascade. We use the Word2vec H feature composed from CT and RT as the input of the CNN, and the output is the related or unrelated category of the CT-RT pair. The structure of the CNN is shown in Fig. 3(b).
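The following NumPy sketch follows the Word2vec H construction as described above (SVD of each sentence's word-vector matrix, row-wise cosine similarities, then singular-value weighting); it is our reading of the method, not the authors' exact code.

    import numpy as np

    def svd_repr(word_vecs):
        # word_vecs: (n, d) matrix of word embeddings for one sentence.
        # With full_matrices=False, S holds the top min(n, d) singular values
        # and the rows of Vt are the corresponding right singular vectors.
        _, S, Vt = np.linalg.svd(word_vecs, full_matrices=False)
        return Vt, S

    def word2vec_h(ct_vecs, rt_vecs):
        ct_v, i1 = svd_repr(ct_vecs)   # CT_V and weight set I_1
        rt_v, i2 = svd_repr(rt_vecs)   # RT_V and weight set I_2
        # Word2vec V: cosine similarity between every row pair of CT_V and RT_V
        nc = ct_v / np.linalg.norm(ct_v, axis=1, keepdims=True)
        nr = rt_v / np.linalg.norm(rt_v, axis=1, keepdims=True)
        w2v_v = nc @ nr.T
        # Word2vec H: weight entry (i, j) by the singular values i1_i * i2_j
        return np.outer(i1, i2) * w2v_v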
3.2 Task 1B

Facet Classification: Our system uses rule-based methods and machine learning methods for Task 1B. Rule-based methods construct rules based on features extracted from the CTS, RP and CPs. According to last year's results, we only use the Subtitle and High Frequency Word Combining Rule (SubHFW) this time. As for machine learning methods, we apply Random Forest (RF), a Voting Classifier consisting of 3 Gradient Boosting (GB) classifiers, and a Convolutional Neural Network (CNN) to assign each CTS one or more facets. RF and GB take Location of Paragraph, Document Position Ratio, Paragraph Position Ratio and Number of Citations or References as input features, while the CNN takes the matrix of word embeddings of the CTS as input. Finally, we combine all the results from the rule-based and machine learning methods to obtain a fusion result, which we call the Fusion method.

3.3 Task 2

For Task 2, we present an original Quality-Diversity model for extractive automatic summarization based on the DPP sampling algorithm [12]. In general, a document can be represented as a ground set of items: each sentence is a minimal item, and the extractive summary can be regarded as a subset of the ground set with high quality and low redundancy. Figure 4 shows the framework of our system. The main process for summary generation consists of pre-processing, feature selection, sentence sampling and post-processing.

Fig. 4. System Framework for Task 2

Pre-processing: First, we correct some XML-coding errors manually. Later, we make some preparations such as document merging, sentence filtering and input file generation for hierarchical Latent Dirichlet Allocation (hLDA). We merge the content of the RP and the citations into one document for the CTS feature described below. Besides, all documents are converted to lowercase. Then we filter the corpus to remove equations, figures and tables, and generate the input file for the hLDA model, which contains word indices and their corresponding frequencies.

Feature Selection: For document representation, we build the matrix L from both a partial (Statistical Feature Method) and a holistic (Neural Network Language Model) perspective to ensure better sentence sampling for summaries. First, we build the matrix L through L_{ij} = q_i * S_{ij} * q_j. Concretely, we adopt Sentence Length (SL), Sentence Position (SP), Title Similarity (TS), CTS, and Hierarchical Topic Model (HTM) as quality features, following the work of Li et al. [13], and Jaccard similarity for diversity. We aim to find the best linear combination of the designed qualities in order to capture more distinctive characteristics of high-quality summaries. Furthermore, we also construct the matrix L through L_{ij} = B_i^T B_j, where the vectors B represent sentences obtained directly from Sent2Vec and LSA; we call this framework the Neural Network Language Model.

Sentence Sampling: We use DPPs to select sentences. DPPs are elegant probabilistic models of global, negative correlations, originally used in quantum physics and in the study of reflected Brownian motions. In our method, we only consider discrete DPPs and follow the definition of Kulesza et al. [12]. Using DPPs enhances the diversity of the summary: given the L matrix constructed over the document sentences, the DPP-based sampling method [13] automatically chooses diverse, high-quality sentences as candidate summary sentences.

Post-processing: Given the candidate summary sentences, we truncate the output summary with the sentences ranking highest in quality, limit the summary to 250 words, and remove extra white space.
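To illustrate the quality-diversity decomposition, the sketch below builds L_{ij} = q_i S_{ij} q_j and then selects sentences greedily by log-determinant gain, a common MAP-style stand-in for the actual DPP sampling procedure of [13]; the quality and similarity inputs are placeholders for the features described above.

    import numpy as np

    def build_L(qualities, similarity):
        # Quality-diversity kernel: L_ij = q_i * S_ij * q_j
        q = np.asarray(qualities, dtype=float)
        return q[:, None] * similarity * q[None, :]

    def greedy_dpp_select(L, k):
        # Greedily add the sentence that most increases log det(L_Y),
        # trading individual quality against redundancy with chosen sentences.
        selected = []
        for _ in range(k):
            best, best_val = None, -np.inf
            for i in range(L.shape[0]):
                if i in selected:
                    continue
                idx = selected + [i]
                sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
                if sign > 0 and logdet > best_val:
                    best, best_val = i, logdet
            if best is None:
                break
            selected.append(best)
        return selected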
4 Implementation and Experiments

4.1 Task 1A

In our previous work, we obtained many features. In Table 2, "Features number" indicates the number of features each method contains. The four methods in [11] perform differently on the test data and the training data, and the more well-performing features are used, the more stable the performance on the testing set is. Therefore, we removed the features with poor performance on the training set, keeping the features with good performance for the fusion methods. We adjusted the parameters of the four fusion methods in [11]; the four new fusion methods are Voting-1.2, Voting-2.1, Jaccard-Focused-1.1 and Jaccard-Cascade-1.1, abbreviated below as V-1.2, V-2.1, J-F-1.1 and J-C-1.1 (W H-C denotes Word2vec H-CNN). Since LDA can discover topic information and the LDA vector is sparse, lexicon (LDA) and LDA-cos are removed and LDA-Jaccard is added. Since the lexicon (co-occurrence) only includes words selected from the training set, it becomes ineffective when the testing set differs greatly from the training set. In the experiments, we chose 600 dimensions for the LDA vector and 200 dimensions for the word vector. Table 1 shows the parameter settings of our methods, where W and P are Weight and Proportion respectively, and JS means 10-fold Jaccard similarity.

Table 1. Parameter settings of methods in Task 1A. Each feature row lists its (W, P) pairs for the fusion methods (V-1.2, V-2.1, J-F-1.1, J-C-1.1) that use it.

Idf similarity:              (1, 12), (0.5, 5), (0.6, 16), (0.5, 16)
Idf context similarity:      (0.8, 3), (0.5, 15), (0.4, 10)
Jaccard similarity:          (1, 5), (0.5, 6), (JS, 7)
Jaccard context similarity:  (0.5, 8), (0.7, 16), (0.6, 16)
Word vector:                 (1, 8), (0.5, 7), (0.5, 26)
word-cos:                    (1, 10), (0.7, 7), (0.5, 26), (0.5, 10)
LDA-Jaccard:                 (1, 12), (0.4, 7)
lin:                         (0.5, 5)
jcn:                         (0.6, 11)

In addition, with the increase in training data, we began to try to solve Task 1A with a CNN. We build the Word2vec H feature for each sentence-pair, which reduces the dimensionality of the input and adds the cosine similarity to it.

Table 2. Performance of Methods in Task 1A in 2018

Method  F1-train  F1-test  (F1-train)-(F1-test)  Features number
V-1.1   0.147     0.113    0.034                 4
V-2.0   0.128     0.122    0.006                 7
J-F     0.132     0.114    0.018                 8
J-C     0.116     0.09     0.026                 4

According to Table 2, we predicted that V-2.1 and J-F-1.1 would be more stable on the testing set. W H-C uses the data in "Training-Set-2019", and its result is the worst due to problems such as the data imbalance of the training set and the complex structure of the CNN.

Table 3. Performance of Methods in Task 1A in 2019

Method   F1-train  F1-test  (F1-train)-(F1-test)  Features number
V-1.2    0.097     0.106    0.007                 5
V-2.1    0.105     0.104    0.001                 8
J-F-1.1  0.105     0.103    0.002                 7
J-C-1.1  0.099     0.087    0.026                 4

From Table 2 and Table 3 [14], we can draw three conclusions. First, V-1.2 uses fewer features than V-2.1 and J-F-1.1 but obtains a similar result, and it uses about the same number of features as J-C-1.1 but obtains a better result; this shows that the features used in V-1.2 play a leading role. Second, the 2019 runs verify our prediction that the more features are used, the more stable the performance on the test set is: the performance of V-2.1 across the testing and training sets is very stable, as is that of J-F-1.1. Third, after removing the co-occurrence dictionary, the (F1-train)-(F1-test) differences are smaller, which indicates that the co-occurrence dictionary has limitations and should be removed.

4.2 Task 1B

In this section, we introduce our methods for Task 1B in detail.

Rule-based Methods: Subtitle Rule: we use the subtitles of the CTS and citance to determine which facet they belong to; if a subtitle matches one of the five predefined facets, we categorize the CTS and citance accordingly. High Frequency Word Rule: we use the high-frequency words of each class to classify the CTS and citance; we first remove common words and then set a threshold for each facet. Subtitle and High Frequency Word Combining Rule: we first apply the Subtitle Rule to obtain the facet; if it does not give an explicit answer, we then use the High Frequency Word Rule.

Machine Learning Methods: Firstly, we extract features from the CTS and citance, consisting of Location of Paragraph, Document Position Ratio, Paragraph Position Ratio and Number of Citations or References, and concatenate these features into an 8-dimension vector. Then we train RF and GB on these features.
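As an illustration of how the 8-dimension vector and the Voting Classifier might be assembled with scikit-learn; the dictionary field names are hypothetical placeholders for however sentence positions and citation counts are actually stored.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier

    def position_features(sent):
        # sent: dict of positional statistics for a sentence; keys are hypothetical
        return [
            sent["paragraph_id"],                    # location of paragraph
            sent["sent_idx"] / sent["doc_len"],      # document position ratio
            sent["idx_in_para"] / sent["para_len"],  # paragraph position ratio
            sent["n_citations"],                     # citations/references count
        ]

    def make_vector(cts, citance):
        # 4 features from the CTS plus 4 from the citance -> 8 dimensions
        return np.array(position_features(cts) + position_features(citance))

    voting = VotingClassifier(
        estimators=[(f"gb{i}", GradientBoostingClassifier(random_state=i))
                    for i in range(3)],
        voting="soft",
    )
    # voting.fit(X_train, y_facets) once vectors and facet labels are assembled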
As for the CNN, the content of the CTS is transformed into a matrix where the i-th row corresponds to the word embedding of the i-th word and the j-th column represents the j-th dimension of the embedding. Then we stack a convolutional layer with multiple kernel sizes followed by a max-pooling layer. The architecture of the CNN is shown in Fig. 5.

Fig. 5. Architecture of CNN for Task 1B

Table 4. Results in 2019

Method   Train-set (F1 Score)  Test-set (F1 Score)
RF       0.3281                -
SubHFW   0.3556                0.389
Voting   0.3611                0.341
CNN      0.2841                0.342

Results on Train-Set-2019 are shown in Table 4. We find that the Voting and SubHFW methods perform better. The CNN performs worse than we expected, since the training dataset is too small for a neural network to learn from; moreover, the dataset is imbalanced, with the method facet having many more samples than the other facets.

The results on Test-Set-2019 show that the SubHFW method outperforms the other methods and ranks first among all submitted methods, which indicates that subtitle and high-frequency-word features are crucial for determining the facet of each CTS. Moreover, the textCNN method performs worse than we expected due to its demand for a larger dataset.

4.3 Task 2

The results below use Manual ROUGE values to evaluate our system summaries. For the evaluation phase, CL-SciSumm 2018 provided three kinds of gold-standard summaries: the collection of citation sentences (the community summary), faceted summaries of the traditional self-summary (the abstract), and summaries written by well-trained annotators (the human summary).
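For reference, ROUGE-N is n-gram overlap between a candidate and a gold summary; the sketch below computes a ROUGE-2 F-score, while the official figures reported here come from the ROUGE toolkit with manual evaluation.

    from collections import Counter

    def rouge_n(candidate, reference, n=2):
        # candidate, reference: lists of tokens; returns the ROUGE-N F1 score
        def ngrams(tokens):
            return Counter(tuple(tokens[i:i + n])
                           for i in range(len(tokens) - n + 1))
        cand, ref = ngrams(candidate), ngrams(reference)
        if not cand or not ref:
            return 0.0
        overlap = sum((cand & ref).values())
        if not overlap:
            return 0.0
        recall = overlap / sum(ref.values())
        precision = overlap / sum(cand.values())
        return 2 * precision * recall / (precision + recall)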
Taking the community summary as an example, we first test each feature SP (ϕ0), SL (ϕ1), TS (ϕ2), HTM (ϕ3) and CTS (ϕ4) described in Section 3.3 on the statistical feature model independently to figure out its individual contribution. As the CTS feature (ϕ4) is specially designed, we do not present its individual performance, but instead record the binary combination of CTS with every other basic feature.

Table 5. Binary Combination Test on Quality

Run ID     ϕ0  ϕ1  ϕ2  ϕ3  ϕ4  ROUGE1   ROUGE2
quality-0  1   0   0   0   1   0.43652  0.23824
quality-1  0   1   0   0   1   0.42104  0.19574
quality-2  0   0   1   0   1   0.52800  0.37682
quality-3  0   0   0   1   1   0.41193  0.18440

From Table 5, the best binary combination comes from the TS (ϕ2) and CTS (ϕ4) features. One possible explanation is that the community summary itself already includes the citation sentences, and since the title contains the essence of a paper, sentences selected by this ranking rule are very likely to overlap with the gold summaries.

Table 6. Statistical Feature Model Performance on Community Summary

Run ID         ϕ0  ϕ1  ϕ2  ϕ3  ϕ4  ROUGE1   ROUGE2
Statistical-0  0   0   1   0   1   0.52552  0.37333
Statistical-1  0   0   0   1   1   0.41283  0.18438
Statistical-2  1   0   2   0   2   0.44104  0.24653
Statistical-3  1   1   2   0   2   0.42219  0.22141

Analogously, we conduct experiments on the other two kinds of gold summaries, where the weights of the parameters differ slightly. Tables 6 and 7 present the results on the community summary and human summary separately; the best binary combination follows the same tendency in both. This may be interpreted as follows: no matter whether the sentences are citations or the summaries are written by annotators, both are produced from the perspective of readers. Community summaries consist of citation sentences, and those sentences are themselves extracted from the original documents, so it is no wonder that the ROUGE scores are far higher than for the other kinds of summaries. The human summary, however, is based on the comprehension of readers. In this case, we run extra experiments on the human summaries beyond the parameter settings used for the community summaries. The best new combination, as Table 7 shows, differs slightly from a mere copy of the community-summary settings: for the human summary, the more parameters are involved, the higher the ROUGE F-score. Unfortunately, for the community summary, any additional attribute in a further exploration of binary combinations performs adversely. There are a thousand Hamlets in a thousand people's eyes.

Table 7. Statistical Feature Model Performance on Human Summary

Run ID         ϕ0  ϕ1  ϕ2  ϕ3  ϕ4  ROUGE1   ROUGE2
Statistical-0  0   0   1   0   1   0.41900  0.19167
Statistical-1  1   0   0   0   1   0.39866  0.15321
Statistical-2  2   0   3   0   3   0.43504  0.25430
Statistical-3  2   1   3   0   3   0.42219  0.22141

Table 8. Statistical Feature Model Performance on Self-summary

Run ID         ϕ0  ϕ1  ϕ2  ϕ3  ϕ4  ROUGE1   ROUGE2
Statistical-0  1   0   0   0   0   0.39630  0.19019
Statistical-1  0   1   0   0   0   0.35020  0.11507
Statistical-2  0   0   1   0   0   0.38434  0.17296
Statistical-3  0   0   0   1   0   0.30237  0.08561
Statistical-4  2   0   3   0   3   0.42688  0.37234

As for the self-summary (the abstract), Table 8 presents the opposite picture. Every binary combination with the CTS feature (ϕ4) is unsatisfactory, so we present the individual contribution of each of the other statistical and topic features. We also try the best parameter settings for the community summary and human summary on the abstract summary. Perhaps, although we have tried our best to follow the writers, there always exists a narrow gap between the readers' comprehension and the writers' original intention. This part of the experiment follows a simple but practical principle: even when we cannot fully understand the latent semantics the writers want to express, we can still exploit statistical features that help to extract important sentences. A summarizer developed through this approach is not limited to a familiar language and does not require additional linguistic knowledge or complex linguistic processing.

Furthermore, when extracting sentences with the Neural Network Language Model (using Sent2Vec/LSA representations for sentences), we choose the best quality combination for the community summary, human summary and abstract summary. Table 9 shows the Neural Network Language Model performance. Besides, Table 10 shows the best results of several of our runs in BIRNDL 2019. Among all the systems in the competition, our system won first place for the human summary and second place for the abstract and community summaries.

Table 9. Neural Network Language Model Performance

Run ID              ϕ0  ϕ1  ϕ2  ϕ3  ϕ4  ROUGE1   ROUGE2
community-Sent2Vec  0   0   1   0   1   0.52254  0.31893
human-Sent2Vec      2   0   3   0   3   0.41716  0.18210
abstract-Sent2Vec   2   0   3   0   3   0.42114  0.21823
community-LSA-1     0   0   1   0   1   0.56971  0.44228
community-LSA-2     0   0   1   0   1   0.59240  0.46528
human-LSA-1         2   0   3   0   3   0.37717  0.17825
human-LSA-2         2   0   3   0   3   0.38568  0.17424
abstract-LSA-1      2   0   3   0   3   0.39944  0.20009
abstract-LSA-2      2   0   3   0   3   0.40051  0.18617
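A minimal sketch of the LSA variant of the Neural Network Language Model kernel, using TF-IDF plus truncated SVD as the sentence representation B and forming L_{ij} = B_i^T B_j; the 100-dimension setting is an illustrative assumption, not the system's actual configuration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    def lsa_kernel(sentences, dim=100):
        # sentences: list of raw sentence strings from the document
        tfidf = TfidfVectorizer().fit_transform(sentences)
        n_comp = min(dim, tfidf.shape[1] - 1)   # TruncatedSVD needs < n_features
        B = TruncatedSVD(n_components=n_comp).fit_transform(tfidf)
        return B @ B.T                          # L_ij = B_i^T B_j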
Table 10. Best Results on BIRNDL 2019

Run ID  Abstract R2  Abstract SU4  Community R2  Community SU4  Human R2  Human SU4
run3    0.389        0.210         0.122         0.063          0.278     0.200
run19   0.386        0.227         0.121         0.063          0.257     0.189
run15   0.381        0.211         0.119         0.062          0.267     0.191

5 Conclusion and Future Work

This year, we have added neural networks to the methods for all three tasks, hoping to use the larger training corpus to bring out the advantage of neural networks, namely deeply mining the meaning of the text. Rule-based and statistics-based methods have achieved good performance, so we try to combine them with neural networks. In future work, for Task 1A we expect to adjust feature weights automatically through neural networks and to combine multiple features better. For Task 1B, more study is needed to reduce the impact of imbalanced data on neural networks; besides, more crucial features remain to be found, since the performance of the machine learning methods is the best so far. In Task 2, we expect the neural network language models to contribute more meaningful semantic representations for sentences than statistical features.

Acknowledgements

This work was supported in part by the Beijing Municipal Commission of Science and Technology under Grant Z181100001018035; National Social Science Foundation of China under Grant 16ZDA055; National Natural Science Foundation of China under Grant 91546121; and the Engineering Research Center of Information Networks, Ministry of Education.

References

1. CL-SciSumm 2019 Homepage, http://wing.comp.nus.edu.sg/~cl-scisumm2019/
2. Wang P, Li S, Wang T, et al. NUDT@CLSciSumm-18[C]//BIRNDL@SIGIR. 2018: 102-113.
3. Davoodi E, Madan K, Gu J. CLSciSumm Shared Task: On the Contribution of Similarity Measure and Natural Language Processing Features for Citing Problem[C]//BIRNDL@SIGIR. 2018: 96-101.
4. Ma S, Zhang H, Xu J, et al. NJUST@CLSciSumm-18[C]//BIRNDL@SIGIR. 2018: 114-129.
5. Kim Y. Convolutional Neural Networks for Sentence Classification[J]. arXiv preprint arXiv:1408.5882, 2014.
6. Agrawal K, Mittal A. IIIT-H@CLScisumm-18[C]//BIRNDL@SIGIR. 2018: 130-133.
7. Baruah G, Kolla M. Klick Labs at CL-SciSumm 2018[C]//BIRNDL@SIGIR. 2018: 134-141.
8. Karimi S, Moraes L F T, Das A, et al. University of Houston@CL-SciSumm 2017: Positional Language Models, Structural Correspondence Learning and Textual Entailment[C]//BIRNDL@SIGIR (2). 2017: 73-85.
9. Aburaed A, Bravo A, Chiruzzo L, et al. LaSTUS/TALN+INCO@CL-SciSumm 2018: Using Regression and Convolutions for Cross-document Semantic Linking and Summarization of Scholarly Literature[C]//BIRNDL@SIGIR. Ann Arbor, Michigan, 2018.
10. Debnath D, Achom A, Pakray P. NLP-NITMZ@CLScisumm-18[C]//BIRNDL@SIGIR. 2018: 164-171.
11. Li L, Chi J, Chen M, et al. CIST@CLSciSumm-18: Methods for Computational Linguistics Scientific Citation Linkage, Facet Classification and Summarization[C]//BIRNDL@SIGIR. 2018: 84-95.
12. Kulesza A, Taskar B. Determinantal Point Processes for Machine Learning[J]. Foundations and Trends in Machine Learning, 2012, 5(2-3): 123-286. http://dx.doi.org/10.1561/2200000044
13. Li L, Zhang Y, Chi J, et al. UIDS: A Multilingual Document Summarization Framework Based on Summary Diversity and Hierarchical Topics[M]//Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, 2017: 343-354.
14. Chandrasekaran M K, Yasunaga M, Radev D, Freitag D, Kan M-Y. Overview and Results: CL-SciSumm Shared Task 2019[C]//Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) @ SIGIR 2019, Paris, France.