=Paper=
{{Paper
|id=Vol-2132/paper8
|storemode=property
|title=CIST@CLSciSumm-18: Methods for Computational Linguistics Scientific Citation Linkage, Facet Classification and Summarization
|pdfUrl=https://ceur-ws.org/Vol-2132/paper8.pdf
|volume=Vol-2132
|authors=Lei Li,Junqi Chi,Moye Chen,Zuying Huang,Yingqi Zhu,Xiangling Fu
|dblpUrl=https://dblp.org/rec/conf/sigir/LiCCHZF18
}}
==CIST@CLSciSumm-18: Methods for Computational Linguistics Scientific Citation Linkage, Facet Classification and Summarization==
Lei Li, Junqi Chi, Moye Chen, Zuying Huang, Yingqi Zhu, and Xiangling Fu

Beijing University of Posts and Telecommunications (BUPT), No.10 Xitucheng Road, Haidian District, Beijing, P.R.China

{leili,cjq,myc,zoehuang,zhuyq,fuxiangling}@bupt.edu.cn

===Abstract===
Our system contributes to the shared Task 1A (citation linkage), Task 1B (facet classification) and Task 2 (summarization) of CLSciSumm-18@SIGIR2018. It builds on our former system, CIST@CLSciSumm-17 [7], and tries to improve the methods for all the shared tasks. We adopt Word Mover's Distance (WMD) and improve the LDA model to calculate sentence similarity for citation linkage. We try more methods for facet classification. To improve summarization performance, we also use WMD sentence similarity to construct a new kernel matrix for Determinantal Point Processes (DPPs).

Keywords: WMD · LDA · DPPs · Random Forest

===1 Introduction===
With the development of science and network technology, more and more scientific literature appears, especially in the Computational Linguistics (CL) domain. Researchers survey the literature on a specific topic to obtain inspiration and novel approaches. However, it is time-consuming for humans to analyze all the related content. The goal of CLSciSumm-18 [1] is to explore summarization of scientific research in the CL domain, support research in automatic scientific document summarization and provide evaluation resources to push the current state of the art [2].

CLSciSumm-18 contains Task 1A, Task 1B and Task 2. Each topic in the training and test datasets consists of a Reference Paper (RP) and several Citing Papers (CPs) with citations to the RP. Task 1A is to identify the spans of text (cited text spans, CTS) in the RP for each citance, given the RP and CPs.
Each CTS may be a sentence fragment, a full sentence, or several consecutive sentences (no more than 5). Task 1B requires identifying, for each CTS, which facet it belongs to from a predefined set of facets (Aim Citation, Method Citation, Implication Citation, Results Citation and Hypothesis Citation). In Task 2 we generate a structured summary of the RP, of which there are two types: the faceted summary of the traditional self-summary and the community summary (the collection of citation sentences, 'citances').

In this paper we introduce our methods, strategies and experiments for Task 1A, Task 1B and Task 2, based on our former system CIST@CLSciSumm-17 [7]. For Task 1A, we apply new sentence similarities computed from WMD and from an improved LDA (Latent Dirichlet Allocation) model with better topic features. In Task 1B, we use more classification methods to obtain the facet of each CTS. In Task 2, we use WMD sentence similarity to construct the kernel matrix, aiming to improve the quality of Determinantal Point Processes (DPPs) sampling on the basis of our former work on summarization [3].

===2 Related Work===
Methods for information extraction and content linkage have attracted growing interest from researchers, especially in the last two years. The methods and results of CLSciSumm-2016 and CLSciSumm-2017 are described in [4] [5]. The methods used for Task 1A are closely tied to similarity calculation. For example, Ma S et al. [6] combine similarity-based features (LDA/Jaccard/IDF/TF-IDF/Doc2Vec similarity) with rule-based features to obtain citation linkage. Li L et al. [7] also propose many similarity methods. Zhang D et al. [8] use Search-based Similarity Scoring and a Supervised Method. Cosine similarity is used in [9]. Aburaed et al.
[10] use a voting system to obtain the best result from a Word Embeddings Distance system, a Modified Jaccard system and a BabelNet Embeddings Distance system. Methods based on measuring semantic textual similarity are used in [11]. Other methods have also been applied to citation linkage. Task 1A is transformed into a query problem in [12], whose system applies different ranking models and query generation strategies. Karimi et al. [13] use structural correspondence learning, positional language models and textual entailment.

We treat Task 1B as a classification problem, so many classification methods are applicable. They fall mainly into two groups: rule-based methods and supervised machine learning methods [6] [7] [13] [11]. Some other methods are also used in Task 1B. For example, Felber et al. [12] transform the span of text into a query, and then conduct a majority vote on the top five retrieved results to determine the discourse facet. Prasad et al. [14] use a classification and ranking method.

As for summary generation in Task 2, some teams submitted results in BIRNDL 2017. Ma S et al. [6] divide the process into two main steps: they group sentences into clusters by bisecting K-means, and then use maximal marginal relevance (MMR) to extract sentences from each cluster and combine them into a summary. Aburaed et al. [10] score sentences using multiple features with different weights, and then build the summary according to the scores. Li L et al. [7] use a linear combination of multiple features to compute sentence quality, and sample sentences based on Jaccard similarity and sentence quality. We try a new similarity method to construct a new kernel matrix for DPPs, aiming at better summaries.

===3 Methods===
The framework of our system is shown in Fig. 1.
We first obtain the CTS in the RP for each citance in the CPs, then use features extracted from the CTS to determine its facet, and finally use the CTS and its facet to generate a summary (no more than 250 words).

[Figure 1 omitted. Box labels: Task 1 (Citation Linkage, Facet Classification) and Task 2; RP and CPs, Feature Extraction, Content Linkage Methods, CTS, Classification Methods, Facet, Preprocessing, Feature Selection, Sentence Sampling, Postprocessing.]

Fig. 1. Framework of Our System.

====3.1 Word Mover's Distance====
Word Mover's Distance (WMD) is a method for calculating the distance between two sentences or texts based on word vectors and the Earth Mover's Distance (EMD). WMD measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document [15]. We apply WMD to measure the similarity of two sentences and of two texts in our system.

[Figure 2 omitted. It shows D and D' as their nBOW weights (d_1, ..., d_N and d'_1, ..., d'_M), each paired with a word vector (w_i1, ..., w_i,dim).]

Fig. 2. Representation of two documents D and D'

In Fig. 2, N and M are the numbers of words in the two text documents D and D', w is a word vector, and dim is the word vector dimension. <math>d</math> and <math>d'</math> are the normalized bag-of-words (nBOW) vectors of D and D'.

After removing stop words, we first represent D and D' as two nBOW vectors <math>d</math> and <math>d'</math>. We then obtain the word vector w of each word in D and D'. This gives the representation of D and D' shown in Fig. 2. WMD incorporates the semantic similarity between individual word pairs (e.g. President and Obama) through the Euclidean distance between the two words in the word2vec embedding space. The distance between word i and word j is <math>c(i, j) = \lVert w_i - w_j \rVert</math>, where word i and word j are from D and D' respectively. Given <math>d</math>, <math>d'</math> and <math>c(i, j)</math>, we use the EMD algorithm to obtain the minimum cumulative transport cost, which is the WMD.
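The ingredients described above (nBOW weights, the word-to-word cost c(i, j), and a transport step) can be illustrated with a small sketch. Solving the full EMD problem needs a linear-programming or min-cost-flow solver, so the sketch swaps in the relaxed lower bound from Kusner et al. [15], in which every word sends all of its mass to its nearest word in the other document; it is not the exact EMD solution described above. The function names and the toy two-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def nbow(tokens):
    """Normalized bag-of-words: word -> relative frequency (weights sum to 1)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cost(u, v):
    """c(i, j): Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(tokens_a, tokens_b, vectors):
    """Relaxed WMD lower bound (assumption: used here instead of exact EMD).

    Each word moves all of its nBOW mass to the closest word of the other
    document; the larger of the two one-sided costs is returned.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both documents need at least one token")
    d_a, d_b = nbow(tokens_a), nbow(tokens_b)

    def one_sided(source, target):
        return sum(
            mass * min(cost(vectors[w], vectors[v]) for v in target)
            for w, mass in source.items()
        )

    return max(one_sided(d_a, d_b), one_sided(d_b, d_a))


# Hypothetical 2-d embeddings, for illustration only.
toy_vectors = {
    "obama": (0.0, 0.0),
    "president": (0.0, 1.0),
    "speaks": (5.0, 0.0),
    "greets": (5.0, 1.0),
}
distance = relaxed_wmd(["obama", "speaks"], ["president", "greets"], toy_vectors)
```

In this toy case each word has a neighbour at distance 1 in the other document, so the relaxed bound coincides with the exact WMD; in general it can only be smaller than or equal to it.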
====3.2 Task 1====
Citation Linkage (Task 1A): The main steps are extracting features from the RP and CPs, and using Content Linkage Methods to obtain the CTS for each citance.

Feature Extraction: We extract the following features from the RP and CPs: Lexicons (high-frequency lexicon, LDA lexicon and co-occurrence lexicon), Sentence similarities (WMD similarity, IDF similarity and Jaccard similarity), Context similarities, Word vector, WordNet (jcn, lin, lch, res, wup and path similarity) and CNN (Convolutional Neural Network) similarity. We calculate the WordNet similarity between the words of the two sentences to obtain a matrix. Then we repeatedly select the maximum value in the matrix and remove its row and column, until the matrix is empty. Finally we add up the maximum values selected in each iteration and divide the sum by <math>\sqrt{length_1 \cdot length_2}</math> to obtain the similarity between the sentences. Word vector similarity is computed in the same way as WordNet similarity. The CNN takes word vectors as input and outputs the probability of content linking; this output probability represents the similarity of the input sentences [7].

Most features were used in our former work [7], except for the Lexicon obtained by the LDA model and the WMD applied to calculate Sentence similarities and Context similarities. In our previous work, we used the LDA model only to train on the RP and CPs to obtain the LDA lexicon of 20 latent topics for the files in each topic. We improve the LDA model to obtain better topic features. According to the LDA model, we denote a sentence S as an n-dimensional vector (LDA vector), <math>S = (x_1, \ldots, x_i, \ldots, x_n)</math>, where <math>x_i</math> is the probability that S belongs to the i-th topic. Every citance and CTS can be represented as such an n-dimensional vector, so we can calculate their cosine similarity. We denote the cosine similarity of LDA vectors as LDA-cos.
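As a minimal sketch of LDA-cos, the function below computes the cosine similarity of two topic-probability vectors. It assumes the per-sentence topic distributions have already been inferred by some LDA implementation; the function name and the three-topic toy vectors are hypothetical.

```python
import math


def lda_cos(topics_a, topics_b):
    """Cosine similarity between two LDA topic-probability vectors."""
    if len(topics_a) != len(topics_b):
        raise ValueError("both vectors must cover the same number of topics")
    dot = sum(x * y for x, y in zip(topics_a, topics_b))
    norm_a = math.sqrt(sum(x * x for x in topics_a))
    norm_b = math.sqrt(sum(y * y for y in topics_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no topic information
    return dot / (norm_a * norm_b)


# Hypothetical topic distributions for a citance and two candidate sentences.
citance = [0.7, 0.2, 0.1]
same_topics = lda_cos(citance, [0.7, 0.2, 0.1])
disjoint_topics = lda_cos([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because topic probabilities are non-negative, the value always lies between 0 (no shared topic mass) and 1 (identical topic direction).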
The larger the cosine similarity is, the more similar the citance and CTS are. Compared with the old LDA method, the new one not only considers the number of shared words belonging to the same topic in the citance and CTS, but also preserves the cohesion of their topic distributions. Besides, we use WMD to calculate the similarity of two texts to enrich the similarity features.

Content Linkage Methods: We use two methods, the Voting Method and the WMD Method. In the Voting Method, the final results are obtained by voting over all runs (the results given by the features described in Feature Extraction). In the WMD Method, the results come from the similarity calculated by WMD (WMD similarity): we first represent sentences as word vectors, and then calculate the WMD similarity between citance and CTS using these vectors. WMD is the distance required to transform one sentence into another, so the smaller the WMD is, the more similar the two sentences are.

Facet Classification (Task 1B): Our system mainly uses rule-based methods and machine learning methods based on multiple features for Task 1B. The rule-based methods are the Subtitle Rule (Sub), the High Frequency Word Rule (HFW) and the Subtitle and High Frequency Word Combining Rule (SubHFW). They construct rules from features obtained from the CTS, RP and CPs. As machine learning methods, we apply SVM, Decision Trees (DT) and K-Nearest Neighbor (KNN) to obtain the facet. We also train Random Forest (RF), Gradient Boosting (GB) and Voting methods, which are based on the idea of Ensemble Learning. The features used in the machine learning methods are Location of Paragraph, Document Position Ratio, Paragraph Position Ratio and Number of Citations or References. Finally we combine all the results to obtain a fusion result, which we call the Fusion method.
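The greedy matching used for the WordNet and word-vector sentence similarities in Feature Extraction can be sketched as follows. This is an illustration under stated assumptions, not our exact implementation: `word_sim` is a hypothetical callback standing in for a WordNet measure or a word-vector similarity, and the normalization by the square root of the product of the two sentence lengths follows our reading of the description above.

```python
import math


def greedy_match_similarity(words_a, words_b, word_sim):
    """Sum greedily matched word-pair similarities, then length-normalize."""
    if not words_a or not words_b:
        return 0.0
    # Similarity matrix: rows are words of sentence A, columns words of B.
    sim = [[word_sim(a, b) for b in words_b] for a in words_a]
    rows = list(range(len(words_a)))
    cols = list(range(len(words_b)))
    total = 0.0
    while rows and cols:
        # Pick the largest remaining entry, then drop its row and column.
        i, j = max(
            ((r, c) for r in rows for c in cols),
            key=lambda rc: sim[rc[0]][rc[1]],
        )
        total += sim[i][j]
        rows.remove(i)
        cols.remove(j)
    return total / math.sqrt(len(words_a) * len(words_b))


def exact_match(a, b):
    """Toy word similarity (assumption): 1 for identical words, else 0."""
    return 1.0 if a == b else 0.0


score = greedy_match_similarity(
    ["citation", "linkage", "task"], ["citation", "task"], exact_match
)
```

With the toy measure, two of the words match, so the score is 2 divided by the square root of 6; identical sentences score 1.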
====3.3 Task 2====
The main process for summary generation consists of Pre-processing, Feature Selection, Sentence Sampling and Post-processing.

Pre-processing: We first correct some XML-coding errors. We also make some preparations such as document merging, sentence filtering and input file generation for hierarchical Latent Dirichlet Allocation (hLDA). We merge the content of the RP and the citations into one document. We do not extract sentences from the abstract of the RP unless they are selected in Task 1A. All documents are converted to lowercase. We filter the corpus to remove equations, figures, tables and so on. Then we generate the input file for hLDA, which contains word indices and their corresponding frequencies.

[Figure 3 omitted. Box labels: Merging, Filtering, hLDA input file; SL, SP, CTS, TS, HTM; DPPs, Kernel Matrix; Word Length, Remove White Space.]

Fig. 3. Process of Task 2

Feature Selection: We choose Sentence Length (SL), Sentence Position (SP), CTS, Title Similarity (TS) and Hierarchical Topic Model (HTM) as features in our system, following the work of Li L [3]. We use these features to calculate sentence quality. We then use WMD similarity as the sentence similarity and combine it with sentence quality to construct the kernel matrix of the DPPs.

Sentence Sampling: We use DPPs to select sentences. DPPs are elegant probabilistic models of global, negative correlations, mostly used in quantum physics to study reflected Brownian motions. In our method, we only consider discrete DPPs and follow the definition of Kulesza A et al. [16]. Using DPPs enhances the diversity of the summary. Furthermore, we also use Jaccard similarity to construct the kernel matrix, as a baseline against which to compare DPPs based on WMD similarity.

Post-processing: We truncate the output summary to 250 words and remove some white space.
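The way sentence quality and sentence similarity combine into a DPP kernel can be sketched as below, using the quality-diversity form L[i][j] = q[i] * S[i][j] * q[j] from Kulesza and Taskar [16]. Two assumptions are worth stating loudly: the similarity matrix S is taken as given (how a WMD distance is turned into a similarity is not covered here), and the sketch replaces DPP sampling with a greedy, determinant-maximizing selection as a deterministic stand-in. All names and toy numbers are hypothetical.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def build_kernel(quality, similarity):
    """Quality-diversity kernel: L[i][j] = q[i] * S[i][j] * q[j]."""
    n = len(quality)
    return [
        [quality[i] * similarity[i][j] * quality[j] for j in range(n)]
        for i in range(n)
    ]


def greedy_select(kernel, lengths, max_words=250):
    """Greedy stand-in for DPP sampling (assumption, not the sampler itself).

    Repeatedly adds the sentence that maximizes det(L_Y) while the summary
    stays within the word budget.
    """
    selected, used = [], 0
    while True:
        best, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected or used + lengths[i] > max_words:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:
            return selected
        selected.append(best)
        used += lengths[best]


# Toy data: sentences 0 and 1 are near-duplicates, sentence 2 is different.
quality = [1.0, 0.9, 0.8]
similarity = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
chosen = greedy_select(build_kernel(quality, similarity), [100, 100, 100])
```

In the toy run the lower-quality but dissimilar sentence 2 is preferred over the near-duplicate sentence 1, which is the diversity effect the kernel is meant to produce.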
===4 Implementation and Experiments===
We implement our system and use the official scripts to evaluate it on the training data with 10-fold cross-validation in Task 1. Training-Set-2018 and Test-Set-2018, provided by the organizers, are the training data and test data respectively.

====4.1 Task 1A====
In our previous work, for syntactic information, we have three lexicons, two sentence similarities and two context similarities. All of them can measure sentence similarity [7]. For semantic information, we use word vector [7], WordNet and CNN. In this paper, we combine two feature representations (LDA vector and word vector) with two similarity calculation methods (EMD similarity and cosine similarity), and obtain two new methods: LDA-cos and WMD. We train the word embeddings on a corpus crawled from The Guardian (https://www.theguardian.com). The size of the corpus is 835 MB. In our experiments, we choose 600 dimensions for the LDA vector and 300 dimensions for the word vector. The Task 1A methods are unsupervised. We experimented with different numbers of sentences in the result and chose for our runs the number that performed best.

Besides, we also improve two feature fusion methods from Li L et al. [7]: Voting-1.0 and Jaccard-Focused. In addition to some parameter changes, we add and delete some features of these methods. Based on Voting-1.0 we obtain Voting-1.1, which replaces Jaccard context similarity with LDA-cos similarity. Based on Jaccard-Focused we obtain Jaccard-Focused-new, which adds jcn similarity and LDA-cos similarity. Table 1 shows the parameter settings of our methods.

{| class="wikitable"
|+ Table 1. Parameter settings of Methods in Task 1A
! rowspan="2" | Feature
! colspan="2" | V-1.1
! colspan="2" | V-1.0
! colspan="2" | V-2.0
! colspan="2" | J-F-new
! colspan="2" | J-F
! colspan="2" | J-C
|-
! W !! P !! W !! P !! W !! P !! W !! P !! W !! P !! W !! P
|-
| IDF similarity || 1 || 7 || 1 || 8 || 1 || 7 || 0.6 || 16 || 0.7 || 15 || 1.5 || 16
|-
| IDF context similarity || - || - || - || - || 0.5 || 4 || 0.5 || 15 || 0.5 || 15 || 1 || 18
|-
| Jaccard similarity || 1 || 7 || 1 || 12 || 1 || 3 || JS || 7 || JS || 7 || - || -
|-
| Jaccard context similarity || - || - || 1 || 12 || 0.5 || 8 || 0.7 || 16 || 0.7 || 15 || 1.5 || 15
|-
| word vector || 1 || 6 || 1 || 10 || 0.5 || 8 || 0.5 || 26 || 0.5 || 25 || - || -
|-
| Lexicon 2 (LDA) || - || - || - || - || 0.3 || 2 || - || - || - || - || - || -
|-
| Lexicon 3 (co-occurrence) || - || - || - || - || 0.4 || 2 || 0.2 || 23 || 0.2 || 25 || 0.5 || 15
|-
| jcn similarity || - || - || - || - || - || - || 0.6 || 11 || - || - || - || -
|-
| LDA-cos similarity || 1 || 8 || - || - || - || - || 0.5 || 26 || - || - || - || -
|-
| WMD similarity || - || - || - || - || - || - || - || - || - || - || - || -
|}

In Table 1, W and P are Weight and Proportion respectively. V-1.1, V-1.0, V-2.0, J-F-new, J-F and J-C are the Voting-1.1, Voting-1.0, Voting-2.0, Jaccard-Focused-new, Jaccard-Focused and Jaccard-Cascade methods respectively. JS means 10 times the Jaccard similarity. Because WMD similarity performs very poorly on the training data, it is not adopted in our feature fusion methods.

{| class="wikitable"
|+ Table 2. Performance of Methods in Task 1A
! Method !! Precision !! Recall !! F
|-
| Voting-1.1 || 0.102 || 0.265 || 0.147
|-
| Voting-1.0 || 0.067 || 0.217 || 0.102
|-
| Voting-2.0 || 0.0838 || 0.271 || 0.128
|-
| Jaccard-Focused-new || 0.091 || 0.237 || 0.132
|-
| Jaccard-Focused || 0.081 || 0.263 || 0.124
|-
| Jaccard-Cascade || 0.076 || 0.247 || 0.116
|}

From Table 2, we find that Voting-1.1 performs better than Voting-1.0 (F of 0.147 versus 0.102), which shows the validity of LDA-cos similarity. Jaccard-Focused-new also performs better than Jaccard-Focused (F of 0.132 versus 0.124).

====4.2 Task 1B====
Here, we mainly apply rule-based methods and machine learning methods.

Rule-based Methods:

Subtitle Rule: We use the subtitles of the CTS and citance to determine the facet. If the subtitles contain words from the five predefined classes, we assign the CTS and citance to the corresponding facet.

High Frequency Word Rule: We apply high frequency words obtained from the five classes to classify the CTS and citance.
We first remove the common words, and then set a threshold for each facet.

Subtitle and High Frequency Word Combining Rule: We first apply the Subtitle Rule to obtain the facet. If the subtitles fail, we use High Frequency Words to obtain the final facet.

Machine Learning Methods: First we extract features from the CTS and citance. The features are Location of Paragraph, Document Position Ratio, Paragraph Position Ratio and Number of Citations or References, computed for both the CTS and the citance and put together in an 8-dimensional vector. Second we train SVM, DT, KNN, RF, GB and Voting models with Training-Set-2016 and Training-Set-2017.

{| class="wikitable"
|+ Table 3. Performance of Methods in Task 1B
! Method !! F Score !! Method !! F Score !! Method !! F Score
|-
| Sub || 0.716 || GB || 0.548 || SVM || 0.473
|-
| HFW || 0.542 || KNN || 0.525 || Voting || 0.603
|-
| SubHFW || 0.716 || DT || 0.462 || RF || 0.647
|}

From Table 3, we find that the Sub, SubHFW, RF and Voting methods perform better in our experiments. Because the Sub method depends heavily on subtitles, its results are uncertain. In our submitted runs, we use the RF, SubHFW, Voting and Fusion methods as our final methods for Task 1B. Because some Citance XML files are missing from the officially released Test-Set-2018, we cannot extract the features of the corresponding CTS. In this situation, we set a fixed initial value as the features for Task 1B in the submitted Test-Set-2018 runs.

====4.3 Task 2====
In this part, our system provides a sampling method based on DPPs [7] to extract sentences when constructing a brief summary of no more than 250 words. Determinantal point processes (DPPs) are elegant probabilistic models of repulsion that originate in quantum physics and random matrix theory. The essential characteristic of a DPP is that its binary variables are negatively correlated. As a result, a sampled subset tends to contain diverse items, which has motivated a number of techniques that work with diverse sets, especially in the information retrieval community.
A summary generated by an automatic system must follow analogous principles: coverage of information, information significance, control of redundancy and cohesion of the text. Thus, we bring the two together and build informative summaries through a sampling method based on DPPs that selects diverse sentences from documents. It takes into account not only the ranking of the sentences by quality, but also the correlation between these sentences. This approach is fully described in [7] and proved competitive according to the result feedback from CLSciSumm-17.

As Task 2 requires a structured summary generated from the CTSs identified in Task 1A, we treat the CTS as one crucial feature, described in Section 3.3, to help select sentences. The SP, SL, TS and HTM features are also included. We try two specific metrics to measure cohesion quantitatively: Jaccard calculates the proportion of shared words, while WMD reflects the transition cost from one sentence to another. In our comparative experiments, we look for the best linear combination of quality features to capture the characteristics of a high-quality summary, and explore the relationship between sentences by comparing different metrics for redundancy.

The results below use Manual ROUGE values to evaluate our summaries. For evaluation, CLSciSumm-18 provides three kinds of gold summary: the collection of citation sentences (the community summary), faceted summaries of the traditional self-summary (the abstract), and ones written by well-trained annotators (the human summary). Taking the community summary as an example, we first test the SP (ϕ0), SL (ϕ1), TS (ϕ2), HTM (ϕ3) and CTS (ϕ4) features independently to determine the contribution of each.
As the CTS feature (ϕ4) is specifically designed, we do not present its individual performance, but record and observe its binary combination with every other basic feature.

{| class="wikitable"
|+ Table 4. Binary Quality Combination Test
! Run ID !! ϕ0 !! ϕ1 !! ϕ2 !! ϕ3 !! ϕ4 !! ROUGE1 !! ROUGE2
|-
| W-D-0 || 1 || 0 || 0 || 0 || 1 || 0.43652 || 0.23824
|-
| W-D-1 || 0 || 1 || 0 || 0 || 1 || 0.42104 || 0.19574
|-
| W-D-2 || 0 || 0 || 1 || 0 || 1 || 0.52800 || 0.37682
|-
| W-D-3 || 0 || 0 || 0 || 1 || 1 || 0.41193 || 0.18440
|}

From Table 4, the best binary combination for the WMD metric comes from the TS (ϕ2) and CTS (ϕ4) features. One possible explanation is that the community summary itself already includes these citation sentences. Since the title contains the essence of a paper, sentences selected by this ranking rule are likely to overlap with the gold summaries.

Similarly, we conduct experiments on the other two kinds of gold summaries, where the parameter weights are slightly different. Tables 5-7 present the weights and results for the three gold summaries; the best binary combinations follow the same tendency. Run IDs beginning with W-D use the WMD metric and those beginning with J-D use the Jaccard metric.

{| class="wikitable"
|+ Table 5. Performance On Community Summary
! Run ID !! ϕ0 !! ϕ1 !! ϕ2 !! ϕ3 !! ϕ4 !! ROUGE1 !! ROUGE2
|-
| W-D-0 || 0 || 0 || 1 || 0 || 1 || 0.52800 || 0.37682
|-
| W-D-1 || 0 || 0 || 0 || 1 || 1 || 0.41193 || 0.18440
|-
| W-D-2 || 1 || 0 || 2 || 0 || 2 || 0.44158 || 0.25259
|-
| W-D-3 || 1 || 1 || 2 || 0 || 2 || 0.43992 || 0.24908
|-
| J-D-0 || 0 || 0 || 1 || 0 || 1 || 0.52552 || 0.37333
|-
| J-D-1 || 0 || 0 || 0 || 1 || 1 || 0.41283 || 0.18438
|-
| J-D-2 || 1 || 0 || 2 || 0 || 2 || 0.44104 || 0.24653
|-
| J-D-3 || 1 || 1 || 2 || 0 || 2 || 0.42219 || 0.22141
|}

{| class="wikitable"
|+ Table 6. Performance On Self-summary
! Run ID !! ϕ0 !! ϕ1 !! ϕ2 !! ϕ3 !! ϕ4 !! ROUGE1 !! ROUGE2
|-
| W-D-0 || 1 || 0 || 0 || 0 || 0 || 0.39625 || 0.18644
|-
| W-D-1 || 0 || 1 || 0 || 0 || 0 || 0.36460 || 0.14273
|-
| W-D-2 || 0 || 0 || 1 || 0 || 0 || 0.38662 || 0.17437
|-
| W-D-3 || 0 || 0 || 0 || 1 || 0 || 0.30555 || 0.07241
|-
| J-D-0 || 1 || 0 || 0 || 0 || 0 || 0.39630 || 0.19019
|-
| J-D-1 || 0 || 1 || 0 || 0 || 0 || 0.35020 || 0.11507
|-
| J-D-2 || 0 || 0 || 1 || 0 || 0 || 0.38434 || 0.17296
|-
| J-D-3 || 0 || 0 || 0 || 1 || 0 || 0.30237 || 0.08561
|}

{| class="wikitable"
|+ Table 7. Performance On Human Summary
! Run ID !! ϕ0 !! ϕ1 !! ϕ2 !! ϕ3 !! ϕ4 !! ROUGE1 !! ROUGE2
|-
| W-D-0 || 0 || 0 || 1 || 0 || 1 || 0.41884 || 0.19276
|-
| W-D-1 || 0 || 0 || 0 || 1 || 1 || 0.34337 || 0.09636
|-
| W-D-2 || 1 || 0 || 2 || 0 || 2 || 0.40278 || 0.16916
|-
| W-D-3 || 1 || 1 || 2 || 0 || 2 || 0.41598 || 0.18429
|-
| J-D-0 || 0 || 0 || 1 || 0 || 1 || 0.41900 || 0.19167
|-
| J-D-1 || 1 || 0 || 0 || 0 || 1 || 0.39866 || 0.15321
|-
| J-D-2 || 2 || 0 || 3 || 0 || 3 || 0.43504 || 0.25430
|-
| J-D-3 || 2 || 1 || 3 || 0 || 3 || 0.42219 || 0.22141
|}

However, for the human summary, the more parameters are involved, the higher the ROUGE F-score becomes. Unfortunately, for the community summary, adding any further feature to the best binary combination hurts performance. The shared best combination may be explained by the fact that both citing sentences and annotator-written summaries reflect the perspective of readers. There are a thousand Hamlets in a thousand people's eyes. As for the self-summary (the abstract), no binary combination with the CTS (ϕ4) feature is satisfactory, so we present the individual contribution of each of the other statistical or topic features. Perhaps, although we have tried our best to follow the writers, there always exists a gap between readers' comprehension and writers' original intention.

In general, although the two diversity metrics are roughly evenly matched on this dataset, the best result in Table 5 (the first row) comes from the WMD metric, so we believe the newly proposed method still has potential to be explored.

===5 Conclusion and Future Work===
In this paper, we propose some new methods to improve the performance of Task 1 and Task 2 based on our former work, especially in similarity calculation. We apply the WMD method and LDA-cos to calculate similarity and generate summaries.
In the future, we will continue to improve these methods and incorporate new methods based on the official results of CLSciSumm-18.

===Acknowledgements===
This work was supported by National Social Science Foundation of China [grant number 16ZDA055]; National Natural Science Foundation of China [grant numbers 91546121, 71231002]; EU FP7 IRSES MobileCloud Project [grant number 612212]; the 111 Project of China [grant number B08004]; Engineering Research Center of Information Networks, Ministry of Education; Beijing BUPT Information Networks Industry Institute Company Limited; the project of Beijing Institute of Science and Technology Information; the project of CapInfo Company Limited.

===References===
1. CL-SciSumm 2018 Homepage, http://wing.comp.nus.edu.sg/birndl-sigir2018/.

2. Chandrasekaran M K, Jaidka K, Mayr P. Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017)[C]//Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2017: 1421-1422.

3. Li L, Zhang Y, Chi J, et al. UIDS: A Multilingual Document Summarization Framework Based on Summary Diversity and Hierarchical Topics[M]//Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2017: 343-354.

4. Jaidka K, Chandrasekaran M K, Rustagi S, et al. Insights from CL-SciSumm 2016: the faceted scientific document summarization Shared Task[J]. International Journal on Digital Libraries, 2017: 1-9.

5. Jaidka K, Chandrasekaran M K, Jain D, et al. The CL-SciSumm shared task 2017: results and key insights[C]//Proceedings of the Computational Linguistics Scientific Summarization Shared Task (CL-SciSumm 2017), organized as a part of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017). 2017.

6. Ma S, Xu J, Wang J, et al. NJUST@ CLSciSumm-17[C]//Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL2017). Tokyo, Japan (August 2017).

7. Li L, Zhang Y, Mao L, et al. CIST@ CLSciSumm-17: Multiple Features Based Citation Linkage, Classification and Summarization[C]//Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL2017). Tokyo, Japan (August 2017).

8. Zhang D, Li S. PKU@ CLSciSumm-17: Citation Contextualization[C]//Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL2017). Tokyo, Japan (August 2017).

9. Pramanick A, et al. SciSumm 2017: Employing Word Vectors for Identifying, Classifying and Summarizing Scientific Documents.

10. Aburaed A, et al. LaSTUS/TALN@ CLSciSumm-17: cross-document sentence matching and scientific text summarization systems. 2017.

11. Lauscher A, Glavaš G, Eckert K. University of Mannheim@ CLSciSumm-17: Citation-Based Summarization of Scientific Articles Using Semantic Textual Similarity. 2017.

12. Felber T, Kern R. Graz University of Technology at CL-SciSumm 2017: Query Generation Strategies.

13. Karimi S, et al. University of Houston@ CL-SciSumm 2017: Positional language Models, Structural Correspondence Learning and Textual Entailment.

14. Prasad A. WING-NUS at CL-SciSumm 2017: Learning from syntactic and semantic similarity for citation contextualization[C]//Proc. of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL2017). Tokyo, Japan (August 2017).

15. Kusner M, Sun Y, Kolkin N, et al. From word embeddings to document distances[C]//International Conference on Machine Learning. 2015: 957-966.

16. Kulesza A, Taskar B. Determinantal point processes for machine learning[J]. Foundations and Trends in Machine Learning, 2012, 5(2-3): 123-286.