<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CIST@CLSciSumm-19: Automatic Scientific Paper Summarization with Citances and Facets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lei Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yingqi Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Xie</string-name>
          <email>xieyangsp@163.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zuying Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wei Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xingyuan Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yinan Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Beijing University of Posts and Telecommunications (BUPT)</institution>
          <addr-line>No. 10 Xitucheng Road, Haidian District, Beijing</addr-line>
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Starting from its former version, CIST@CLSciSumm-18, our CIST@CLSciSumm-19 system participates in the shared Task 1A (citation linkage), Task 1B (facet classification) and Task 2 (summarization) of CLSciSumm-19@SIGIR2019. We mainly try to improve its methods for all the shared tasks. We build a new Word2vec H feature for the CNN model to calculate sentence similarity for citation linkage. We plan to adopt CNN and RNN variants for facet classification. And in order to improve summarization performance, we develop more semantic representations for sentences based on neural network language models to construct the new kernel matrix used in Determinantal Point Processes (DPPs).</p>
      </abstract>
      <kwd-group>
        <kwd>Citation Linkage</kwd>
        <kwd>Facet Classification</kwd>
        <kwd>Summarization</kwd>
        <kwd>Word2vec H</kwd>
        <kwd>Neural Network Language Model</kwd>
        <kwd>Determinantal Point Processes (DPPs)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        As scientific papers, computational linguistics articles have characteristics such
as professional knowledge, rigorous writing and strong logic. Reading such
articles is very meaningful, but manual reading takes a lot of time, so we need to
study how to extract good article summaries to reduce the workload of readers.
The main work of CLSciSumm-19 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is to explore automatic summarization methods
based on the characteristics of papers in the field of computational linguistics,
and to provide a comprehensive and readable summary for each paper.
      </p>
      <p>We tried to solve the three tasks contained in CLSciSumm-19: Task 1A, Task
1B and Task 2. The data set we use consists of papers in the field of computational
linguistics provided by the organizers, grouped into topics. Each
topic consists of a Reference Paper (RP) and Citing Papers (CPs) that all
contain citations to the RP. In each CP, the text spans (i.e., citances) that
pertain to a particular citation to the RP have been identified. Task 1A: For
each citance, identify the spans of text (cited text spans, CTS) in the RP that
most accurately reflect the citance. These are of the granularity of a sentence
fragment, a full sentence, or several consecutive sentences (no more than 5).
Task 1B: For each cited text span, identify what facet of the paper it belongs to,
from a predefined set of facets. Task 2 (optional bonus task): Finally, generate a
structured summary of the RP from the cited text spans of the RP. The length
of the summary should not exceed 250 words.</p>
      <p>In this paper, based on previous work, we add the Word2vec H feature to
the Task 1A method and use CNN to obtain the content linkage result. For
Task 1B, we use improved CNN and RNN structures for classification. For
Task 2, we develop more semantic representations for sentences based on neural
network language models to construct the new kernel matrix used in Determinantal
Point Processes (DPPs).</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Task 1A is a content linkage task, and the common method is to calculate
similarity, which includes not only the Cosine similarity, the Jaccard similarity,
and so on, but also some semantic similarity calculation methods, such as BM25
and VSM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In addition, the various characteristics of words are also very
important, such as the position of a word, its part of speech and frequency, etc.
When the characteristics of the words in the two sentences are added to the similarity
calculation for a sentence pair, the similarity of the two sentences can be
judged at the word level [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. With the continuous expansion of the corpus and
the increasing number of features, machine learning methods have begun to
emerge for the task. First, researchers tried basic classifiers, such as SVM
using a radial basis function kernel, Decision Tree and Logistic Regression, to
identify the reference span [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Various classifiers can learn different text features,
and integrating them together can reveal more text features, so researchers use
ensemble models such as the Random Forest [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Besides, in order to explore
the meaning of the sentence more deeply, deep neural networks are also applied, such
as CNN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Siamese Deep Learning Networks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. For Task 1B, both the
rule-based method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and the classification method [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] can be used, both of
which focus on exploring good text features. Rule-based methods, such as
building a dictionary for each discourse facet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], are less adaptive. Most studies
combine the features of categories with classification algorithms to improve
classification accuracy. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] use a multi-feature random forest classifier.
Others use a supervised topic model, XGBOOST [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and SVM with
tf-idf and naive Bayes features [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Task 2 is a summarization task. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] focus on exploring the sampling process. They
use WMD sentence similarity to construct the new kernel matrix used in
Determinantal Point Processes (DPPs). [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] divide all sentences into three categories
(motivations, methods, and conclusions), and then extract sentences from each
cluster based on rules and several features to form a summary. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] generate a
summary by selecting the most relevant sentences from the RP using
linguistic and semantic features from the RP and CPs. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] built a summary generation
system using the OpenNMT tool.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>
        In our approach, we first obtain the CTS, i.e. the sentences in the RP (RT)
related to the sentences in the CPs (CT), through feature extraction and a content
linkage method in the Citation Linkage step. Then we judge the facet of the CTS by feature
extraction and classification methods in the Facet Classification step. Finally, a
summary of the article is obtained through pre-processing, feature selection, sentence
sampling, and post-processing in the summary generation step. The framework of our
system is shown in Fig. 1.
The Citation Linkage task consists of two stages: feature extraction and content
linkage. In feature extraction, we have kept some of the well-performing
methods of the past, continuing to use word-cos, Word Vector, sentence similarity
(IDF similarity and Jaccard similarity), context similarities, and WordNet. Besides,
we add a CNN (Convolutional Neural Network) method and LDA-Jaccard. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
the LDA vectors of sentences are sparse, that is, the distribution of sentences
over topics is sparse, and the LDA vectors pay more attention to whether two
sentences belong to the same topic. So we use Jaccard's idea to express the
relatedness of a sentence pair by the ratio of the intersection and union of the topics of the
two sentences, namely LDA-Jaccard.
      </p>
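A minimal sketch of the LDA-Jaccard idea described above; the topic threshold `eps` and the dense vector layout are our assumptions for illustration, since the paper only states that the ratio of the topic intersection to the topic union is used:

```python
import numpy as np

def lda_jaccard(theta_a, theta_b, eps=1e-3):
    """Jaccard similarity over the topic supports of two sentences.

    theta_a, theta_b: LDA topic-distribution vectors of equal length.
    eps: threshold below which a topic is treated as absent (an assumption).
    """
    topics_a = set(np.flatnonzero(np.asarray(theta_a) > eps))
    topics_b = set(np.flatnonzero(np.asarray(theta_b) > eps))
    union = topics_a | topics_b
    if not union:
        return 0.0
    # Ratio of shared topics to all topics touched by either sentence.
    return len(topics_a & topics_b) / len(union)
```

Because the LDA vectors are sparse, this score only asks which topics two sentences share, not how their full distributions compare.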
      <p>This paper uses the Word2vec H feature as the input of the CNN. It is based on
word embeddings, maps CT and RT information into a dense feature space, and
adds sentence similarity to better guide neural network training. Specifically, CT
is represented as an n x d matrix CT_Matrix = [wv_1; ...; wv_i; ...; wv_n], where
n is the number of words in CT, d is the word embedding size, and wv_i is the
word vector of the i-th word in CT. First, we decompose CT_Matrix by SVD
to obtain three matrices U, S, and V. We take the top min(n, d) values on the diagonal
of S as the weight set I_1 = {i_1, i_2, ..., i_min(n,d)}, and take the top min(n, d) rows
of V to form CT_V. RT_V and I_2 of RT are obtained in the same way.
Then the cosine similarity is calculated between each row of CT_V and each row of RT_V to
obtain Word2vec V. The calculation process is shown in Fig. 2.</p>
      <p>wv_{i,j} = cosine(l^1_i, l^2_j), where l^1_i and l^2_j are row vectors of CT_V and RT_V
and cosine similarity is used.</p>
      <p>Finally, we use I_1 and I_2 to assign weights to the rows and columns of Word2vec V
to get Word2vec H: entry (i, j) is scaled by the weight val_{i,j} = i^1_i * i^2_j, as shown in Fig. 3(a).</p>
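The construction above can be sketched with NumPy as follows; the exact way Word2vec V is weighted by I_1 and I_2 is our reading of the description, so treat this as an illustrative assumption rather than the authors' implementation:

```python
import numpy as np

def word2vec_h(ct_matrix, rt_matrix):
    """Sketch of the Word2vec H feature for a CT/RT sentence pair.

    ct_matrix: (n, d) matrix of word embeddings for CT.
    rt_matrix: (m, d) matrix of word embeddings for RT.
    """
    def reduce(mat):
        # SVD: keep the top min(n, d) singular values as the weight set
        # and the top min(n, d) rows of V.
        _, s, vt = np.linalg.svd(mat, full_matrices=False)
        k = min(mat.shape)
        return s[:k], vt[:k]

    i1, ct_v = reduce(ct_matrix)
    i2, rt_v = reduce(rt_matrix)
    # Word2vec V: cosine similarity between every row pair of CT_V and RT_V.
    a = ct_v / np.linalg.norm(ct_v, axis=1, keepdims=True)
    b = rt_v / np.linalg.norm(rt_v, axis=1, keepdims=True)
    w2v_v = a @ b.T
    # Word2vec H: scale entry (i, j) by the weight val_ij = i1[i] * i2[j].
    return w2v_v * np.outer(i1, i2)
```

The result is a fixed small matrix (min(n, d) by min(m, d)) regardless of sentence lengths, which is what makes it usable as a CNN input.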
      <p>In content linkage, this paper uses multi-feature fusion methods and
a binary classification method based on CNN. The multi-feature fusion methods include
voting1.1, voting2.0, Jaccard-Focused-new, and Jaccard-Cascade. We use the
Word2vec H feature computed from CT and RT as the input of the CNN, and the
output is the related or unrelated category that the CT-RT pair belongs to. The
structure of the CNN is shown in Fig. 3(b).</p>
      <sec id="sec-3-1">
        <title>Task 1B</title>
        <p>Facet Classification: Our system uses rule-based methods and machine
learning methods for Task 1B. Rule-based methods construct rules based on features
extracted from the CTS, RP and CPs. According to last year's results, we
only use the Subtitle and High Frequency Word Combining Rule (SubHFW) this
time. As for machine learning methods, we apply Random Forest (RF), a
Voting Classifier consisting of 3 Gradient Boosting (GB) models and a Convolutional Neural
Network (CNN) to assign each CTS single or multiple facets. RF and GB take
Location of Paragraph, Document Position Ratio, Paragraph Position Ratio and
Number of Citations or References as input features, while the CNN takes the
matrix of word embeddings of the CTS as input. Finally, we combine all the results from
the rule-based methods and machine learning methods to obtain a fusion result,
which is called the Fusion method.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Task 2</title>
        <p>
          For Task 2, we would like to present an original Quality-Diversity model for
extractive automatic summarization based on the DPP sampling algorithm [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
In general, a document can be represented as a ground set of items. Each
sentence is a minimum item, and the extractive summary can be regarded as a
subset of the ground set with high quality and low redundancy. Figure 4 shows
the framework of our system. The main process for summary generation consists
of pre-processing, feature selection, sentence sampling and post-processing.
Pre-processing First, we need to correct some XML coding errors manually.
Later, we make some preparations such as document merging, sentence
filtering and input file generation for hierarchical Latent Dirichlet Allocation
(hLDA). We merge the content of the RP and the citations into one document for
the CTS feature described below. Besides, all documents are converted to lowercase
letters. Then we filter the corpus to remove equations, figures and tables,
and generate the input file for the hLDA model, which contains word indexes and their
corresponding frequencies.
        </p>
        <p>
          Feature Selection When it comes to document representation, we try to build
the matrix L from both partial (Statistical Feature Method) and holistic (Neural
Network Language Model) perspectives to ensure better sentence sampling for
summaries. First, we build the matrix L through L_ij = q_i * S_ij * q_j. Concretely, we adopt
Sentence Length (SL), Sentence Position (SP), Title Similarity (TS), CTS, and
Hierarchical Topic Model (HTM) as features according to the work of Li L [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for
quality, and Jaccard similarity for diversity. We look forward to finding
the best linear combination of the designed qualities in order to capture more salient
characteristics for a high-quality summary. Furthermore, we construct the matrix L
through L_ij = B_i^T B_j, with the vectors B representing sentences taken directly from Sent2Vec
and LSA, and call this framework the Neural Network Language Model.
Sentence Sampling We use DPPs to select sentences. DPPs are elegant
probabilistic models of global, negative correlations, mostly used in quantum
physics to study reflected Brownian motions. In our method, we only
consider discrete DPPs and follow the definition of Kulesza A et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We can
enhance the diversity of the summary by using DPPs. In this way, given the L matrix
constructed on document sentences, the sampling method based on DPPs [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
can automatically choose diverse sentences of high quality as candidate
summary sentences.
        </p>
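A minimal sketch of the quality-diversity kernel and subset selection. We substitute a greedy MAP approximation for the exact DPP sampler, which the paper takes from Kulesza et al.; the function names and the approximation choice are ours:

```python
import numpy as np

def build_kernel(S, q):
    """L_ij = q_i * S_ij * q_j: q holds per-sentence quality scores,
    S holds pairwise similarities; diversity comes from det(L_Y)."""
    q = np.asarray(q, dtype=float)
    return q[:, None] * np.asarray(S, dtype=float) * q[None, :]

def greedy_dpp_select(L, k):
    """Greedily add the sentence that most increases det(L_Y) --
    a common MAP approximation to DPP sampling, not the exact sampler."""
    n = L.shape[0]
    chosen = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            d = np.linalg.det(L[np.ix_(idx, idx)])
            if d > best_det:
                best, best_det = i, d
        if best is None:
            break
        chosen.append(best)
    return chosen
```

Two near-duplicate high-quality sentences repel each other here: once one is selected, adding the other barely increases the determinant, so a dissimilar sentence is preferred.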
        <p>Post-processing Since we already have the candidate summary sentences,
we can truncate the output summary with the sentences ranking highest in quality, limit
the summary to 250 words, and remove some white space in post-processing.</p>
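The post-processing step can be sketched as follows; the (quality, text) input format is an assumption for illustration:

```python
def truncate_summary(ranked_sentences, limit=250):
    """Keep sentences in descending quality order until the word budget
    is exhausted; collapse extra white space along the way.

    ranked_sentences: list of (quality, text) pairs, in any order.
    limit: word budget for the summary (250 in CLSciSumm).
    """
    out, used = [], 0
    for _, text in sorted(ranked_sentences, key=lambda p: -p[0]):
        words = text.split()
        if used + len(words) > limit:
            continue  # skip sentences that would break the word limit
        out.append(" ".join(words))  # normalize internal white space
        used += len(words)
    return out
```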
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Implementation and Experiments</title>
      <p>
        In our previous work, we obtained a lot of features. As shown in Table 2,
"Features number" indicates the number of features a method contains. The four
methods in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] have different effects on the test data and the training data,
and the more features with good performance are used, the more stable the
performance on the testing set is. Therefore, we
removed the features with poor performance on the training set, retaining the
features with good performance for the fusion methods, and adjusted the parameters of
the four fusion methods in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The four new fusion methods are voting1.2,
voting2.1, Jaccard-Focus-1.1, and Jaccard-Cascade-1.1. Since LDA can discover
topic information and the LDA vector is sparse, lexicon (LDA) and LDA-cos
are removed and LDA-Jaccard is added. Since the lexicon (co-occurrence) only
includes words selected from the training set, when the difference between the
testing set and the training set is great, the lexicon (co-occurrence) is ineffective.
In the experiments, we chose 600 dimensions for the LDA vector and 200 dimensions
for the word vector. Table 1 shows the parameter settings of our methods.
      </p>
      <p>In addition, with the increasing training data, we begin to try to solve Task 1A
with CNN. In this paper, we build the Word2vec H feature for the sentence
pair, so that we can reduce the dimensionality of the input and add the
cosine similarity to it. We use V-1.2, V-2.1, J-F-1.1, J-C-1.1, and W H-C to
represent Voting-1.2, Voting-2.1, Jaccard-Focused-1.1, Jaccard-Cascade-1.1 and
Word2vec H-CNN respectively. In Table 1, W and P are Weight and Proportion
respectively. JS means 10-fold Jaccard Similarity.</p>
      <p>
        According to Table 2, we predict that V-1.2 and J-F-1.1 will be more stable
on the testing set. W H-C uses the data in "Training-Set-2019", and its
effect is the worst due to some problems, such as data imbalance in the training set
and the complex structure of the CNN.
From Table 2 and Table 3 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we can draw three conclusions:
      </p>
      <p>The number of features used in V-1.2 is smaller than in V-2.1 and J-F-1.1, but
the result of V-1.2 is similar to those of V-2.1 and J-F-1.1. The number of features used
in V-1.2 is about the same as in J-C-1.1, and the result of V-1.2 is better than
that of J-C-1.1. This shows that the features used in V-1.2 play a leading role.</p>
      <p>The results of the runs in 2019 verify our prediction, that is, the more
features that are used, the more stable the performance on the test set is. So the
performance of V-2.1 on the testing set and the training set is very stable, as
is that of J-F-1.1.</p>
      <p>After removing the co-occurrence dictionary, the (F-train) - (F-test) results are
smaller, which indicates that the co-occurrence dictionary has limitations and should
be removed.</p>
      <sec id="sec-4-1">
        <title>Task 1B</title>
        <p>In this section, we introduce our methods applied to Task 1B in detail.
Rule-based Methods: Subtitle Rule: We use the subtitles of the CTS and citance to
determine which facet they belong to. If the subtitles contain one of the five predefined classes,
we categorize the CTS and citance into the corresponding facet. High Frequency Word
Rule: We use the high frequency words of each class to classify the CTS and citance. We
first remove common words, and then set a threshold for each facet. Subtitle and
High Frequency Word Combining Rule: We first apply the Subtitle Rule to obtain
the facet. If it does not give an explicit answer, we then use the High Frequency Word
Rule to obtain the facet.</p>
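A sketch of the combining rule described above. The facet names, word lexicon and thresholds below are placeholders; the paper's actual facet inventory and word lists are not reproduced here:

```python
FACETS = ["Aim", "Hypothesis", "Implication", "Method", "Result"]  # placeholders

def subtitle_rule(subtitle):
    """Return a facet if the subtitle names one of the predefined classes."""
    low = subtitle.lower()
    for facet in FACETS:
        if facet.lower() in low:
            return facet
    return None

def hfw_rule(text, hfw_lexicon, thresholds):
    """High Frequency Word rule: count facet-specific words and keep the
    facet whose count clears its threshold with the highest score."""
    tokens = text.lower().split()
    best, best_score = None, 0
    for facet, words in hfw_lexicon.items():
        score = sum(tokens.count(w) for w in words)
        if score >= thresholds.get(facet, 1) and score > best_score:
            best, best_score = facet, score
    return best

def subhfw(subtitle, text, hfw_lexicon, thresholds):
    """Subtitle rule first; fall back to the HFW rule when it is silent."""
    return subtitle_rule(subtitle) or hfw_rule(text, hfw_lexicon, thresholds)
```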
        <p>Machine Learning Methods: First, we extract features from the CTS and
citance consisting of Location of Paragraph, Document Position Ratio, Paragraph
Position Ratio and Number of Citations or References, and concatenate these
features into an 8-dimensional vector. Then we train RF and GB on these
features. As for the CNN, the content of the CTS is transformed into a matrix whose i-th
row corresponds to the word embedding of the i-th word and whose j-th column represents
the j-th dimension of the embedding. Then, we stack a convolutional layer
with multiple kernel sizes followed by a max-pooling layer. The architecture of
the CNN is shown in Fig. 5.</p>
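A NumPy sketch of the convolution-plus-max-pooling step on the CTS embedding matrix; the filter shapes and the ReLU choice are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def textcnn_features(emb, kernels, biases):
    """Forward pass of a minimal textCNN feature extractor.

    emb: (n_words, d) word-embedding matrix of a CTS.
    kernels: list of (h, d, f) filters -- window size h over words, f filters.
    biases: list of (f,) bias vectors, one per kernel size.
    """
    feats = []
    for W, b in zip(kernels, biases):
        h, d, f = W.shape
        n = emb.shape[0] - h + 1
        conv = np.empty((n, f))
        for i in range(n):
            # Convolve the window of h word vectors with every filter.
            conv[i] = np.tensordot(emb[i:i + h], W, axes=([0, 1], [0, 1])) + b
        conv = np.maximum(conv, 0.0)    # ReLU
        feats.append(conv.max(axis=0))  # max-over-time pooling
    return np.concatenate(feats)
```

Max-over-time pooling makes the output length depend only on the number of filters, so CTS of different lengths all map to a fixed-size feature vector for the classifier.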
        <p>Results on Train-Set-2019 are shown in Table 4. We find that the Voting and
SubHFW methods have better performance. CNN performs worse than we
expected, since the training data set is too small for a neural network to learn from,
and the dataset is imbalanced, with the Method facet having more samples than the other
facets.</p>
        <p>As for Task 1B, the results on Test-Set-2019 show that the SubHFW method
outperforms the other methods and ranks first among all methods, which indicates
that subtitle and high frequency word features are crucial to determining
the facet of each CTS. Moreover, the textCNN method performs worse than we
expected due to its demand for a larger dataset.
The results below use Manual ROUGE values to evaluate our system
summaries. During the evaluation phase, CL-SciSumm 2018 provided three kinds
of gold standard for comparison: the collection of citation sentences (the community
summary), faceted summaries of the traditional self-summary (the abstract), and
summaries written by well-trained annotators (the human summary).</p>
        <p>Taking the community summary for instance, we first test each feature SP ('0), SL
('1), TS ('2), HTM ('3), and CTS ('4) described in subsection 3.3 on the statistical
feature model independently to figure out its individual contribution. As the
CTS feature ('4) is specially designed, we do not present its individual
performance, but record and observe its binary combination with every other
basic feature.</p>
        <p>From Table 5, the best binary combination comes from the TS ('2) and CTS
('4) features. One possible explanation is that the community summary itself
already includes these citation sentences. With the title containing the essence
of a paper, sentences selected by this ranking rule will definitely guarantee
overlap with the gold summaries.</p>
        <p>Analogously, we conduct experiments on the other two kinds of gold
summaries, where the weights of the parameters are slightly different. Table 6
and Table 7 present the results for the community summary and human
summary separately: the best binary combination follows the same tendency.
The phenomenon of the same best combination may be interpreted as follows: no
matter whether the sentences are cited or the summaries are written
by annotators, both are from the perspective of readers. Community
summaries consist of citation sentences, and the sentences themselves are
extracted from the original documents, so it is no wonder the ROUGE
evaluation is far higher than for the other kinds of summaries. However, the human summary
is based on the comprehension of readers. In this case we perform extra experiments on
human summaries beyond the same parameter setting as community summaries.
The best new combination, as Table 7 shows, is a little different from the
previous mere copies of community summaries. When it comes to the human
summary, the more parameters are involved, the higher the ROUGE F-score reached.
Unfortunately, for the community summary, when we attempt a further exploration
beyond the binary combination, any additional attribute performs adversely. There are
a thousand Hamlets in a thousand people's eyes.</p>
        <p>As for the self-summary (the abstract), the results presented in Table 8 are
the opposite. Every binary combination with the CTS ('4) feature is not that satisfactory,
so we present the individual contribution of each of the other statistical and topic features.
Also, we try the best parameter settings for the community summary and human
summary on the abstract summary. Perhaps, although we have tried our best
to follow the writers, there always exists a narrow gap between our readers'
comprehension and the writers' original intention. This part of the experiment follows
a simple but practical principle: under the condition that we cannot fully
understand the latent semantics the writers want to express, we can still make use
of some statistical features which help to extract important sentences. If a
summarizer is developed through this approach, it is not limited to a familiar
language and does not require any additional linguistic knowledge or complex
linguistic processing.</p>
        <p>Furthermore, when extracting sentences with the Neural Network Language
Model (using Sent2Vec/LSA representations for sentences), we choose the best
quality combination for the community summary, human summary and abstract
summary. Table 9 shows the Neural Network Language Model performance.
Besides, Table 10 shows the best results of several runs in BIRNDL 2019.
Among all the systems in the competition, our system won first place for the
human summary and second place for the abstract and community summaries.
This year, we have added neural networks to the methods for all three tasks. We hope
to make use of a large training corpus to exploit the advantage of neural networks,
that is, deeply mining the meaning of the text. Rule-based and statistics-based
methods have achieved good performance, so we try to combine them with neural
networks. In future work, for Task 1A we expect to automatically adjust the
weights of features through neural networks and combine multiple features better.
For Task 1B, more study should be done to reduce the impact of imbalanced data
on neural networks. Besides, more crucial features are expected to be found, since
the performance of the machine learning methods is the best so far. In Task 2, we
expect the neural network language models to contribute more
meaningful semantic representations for sentences compared with statistical features.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work was supported in part by the Beijing Municipal Commission of
Science and Technology under Grant Z181100001018035; the National Social Science
Foundation of China under Grant 16ZDA055; the National Natural Science
Foundation of China under Grant 91546121; and the Engineering Research Center of Information
Networks, Ministry of Education.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. CL-SciSumm 2019 Homepage, http://wing.comp.nus.edu.sg/cl-scisumm2019/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Wang</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>T</given-names>
          </string-name>
          , et al.
          <source>NUDT@ CLSciSumm-18 In: Proceedings of the 3rd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for</source>
          Digital Libraries[C]//BIRNDL@ SIGIR.
          <year>2018</year>
          :
          <fpage>102</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Davoodi</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madan</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gu J. CLSciSumm Shared</surname>
          </string-name>
          <article-title>Task: On the Contribution of Similarity measure and Natural Language Processing Features for Citing Problem</article-title>
          [C]//BIRNDL@ SIGIR.
          <year>2018</year>
          :
          <fpage>96</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ma</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            <given-names>J</given-names>
          </string-name>
          , et al. NJUST@ CLSciSumm-18[C]//BIRNDL@ SIGIR.
          <year>2018</year>
          :
          <fpage>114</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kim</surname>
            <given-names>Y.</given-names>
          </string-name>
          <article-title>Convolutional neural networks for sentence classification[J]</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Agrawal</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mittal</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <string-name>
            <surname>IIIT-H@</surname>
          </string-name>
          CLScisumm-18
          <source>In: Proceedings of the 3rd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for</source>
          Digital Libraries[C]//BIRNDL@ SIGIR.
          <year>2018</year>
          :
          <fpage>130</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Baruah</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolla M. Klick Labs at</surname>
          </string-name>
          CL-SciSumm
          <year>2018</year>
          [C]//BIRNDL@ SIGIR.
          <year>2018</year>
          :
          <fpage>134</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Karimi</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moraes L F T</surname>
            , Das
            <given-names>A</given-names>
          </string-name>
          , et al. University of Houston@ CL-SciSumm
          <year>2017</year>
          :
          <article-title>Positional language Models, Structural Correspondence Learning</article-title>
          and Textual Entailment[C]//BIRNDL@ SIGIR (2).
          <year>2017</year>
          :
          <fpage>73</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Aburaed</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bravo</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiruzzo</surname>
            <given-names>L</given-names>
          </string-name>
          , et al.
          <article-title>LaSTUS/TALN+INCO@CL-SciSumm 2018: Using Regression and Convolutions for Cross-document Semantic Linking and Summarization of Scholarly Literature</article-title>
          [C]//
          <source>Proceedings of the 3rd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2018)</source>
          . Ann Arbor, Michigan, July
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Debnath</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Achom</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pakray</surname>
            <given-names>P.</given-names>
          </string-name>
          <article-title>NLP-NITMZ@CLScisumm-18</article-title>
          [C]//
          <source>Proceedings of the 3rd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL@SIGIR)</source>
          .
          <year>2018</year>
          :
          <fpage>164</fpage>
          -
          <lpage>171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Li</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            <given-names>M</given-names>
          </string-name>
          , et al.
          <article-title>CIST@CLSciSumm-18: Methods for Computational Linguistics Scientific Citation Linkage, Facet Classification and Summarization</article-title>
          [C]//BIRNDL@SIGIR.
          <year>2018</year>
          :
          <fpage>84</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kulesza</surname>
            <given-names>Alex</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taskar</surname>
            <given-names>Ben</given-names>
          </string-name>
          (
          <year>2012</year>
          ),
          <article-title>Determinantal Point Processes for Machine Learning</article-title>
          ,
          <source>Foundations and Trends in Machine Learning</source>
          : Vol.
          <volume>5</volume>
          : No.
          <issue>2-3</issue>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>286</lpage>
          . http://dx.doi.org/10.1561/2200000044.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Li</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
            <given-names>J</given-names>
          </string-name>
          , et al.
          <article-title>UIDS: A Multilingual Document Summarization Framework Based on Summary Diversity and Hierarchical Topics</article-title>
          [M]//
          <source>Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data</source>
          . Springer,
          <year>2017</year>
          :
          <fpage>343</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Chandrasekaran</surname>
            <given-names>M.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yasunaga</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radev</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitag</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kan</surname>
            <given-names>M.-Y.</given-names>
          </string-name>
          .
          <article-title>"Overview and Results: CL-SciSumm Shared Task 2019"</article-title>
          ,
          <source>In Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) @ SIGIR</source>
          ,
          <year>2019</year>
          , Paris, France.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>